avatars.processors.ExpectedMeanProcessor
- class avatars.processors.ExpectedMeanProcessor(target_variables: List[str], *, groupby_variables: List[str] | None = None, same_std: bool = False)
Processor that forces values to have a mean similar to that of the original data.
Means and standard deviations are computed per group (as defined by `groupby_variables`), and the processor ensures that, within each group, the transformed data has a mean and standard deviation similar to those of the original data. Use this processor with care: it only targets enhancement of unimodal utility, which may come at the expense of multi-modal utility and privacy.
- Parameters:
target_variables – variables to transform
- Keyword Arguments:
groupby_variables – variables used to split the values into groups, each treated as a separate distribution. Default: None.
same_std – set to True to force the transformed variables to have the same standard deviation as the reference data. Default: False.
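The effect of `same_std` can be illustrated with a small standalone sketch. This is not the library's implementation: the helper `match_reference` is hypothetical, and the rescaling formula used when `same_std` is true is an assumption about how a standard deviation could be matched. It uses only the standard library.

```python
from statistics import mean, stdev


def match_reference(values, reference, same_std=False):
    """Hypothetical helper: move `values` so their mean equals the mean of
    `reference`; optionally rescale so the standard deviations match too."""
    ref_mean = mean(reference)
    cur_mean = mean(values)
    if not same_std:
        # Plain shift: the spread of `values` is left untouched.
        return [v + (ref_mean - cur_mean) for v in values]
    cur_std = stdev(values)  # sample standard deviation (an assumption)
    if cur_std == 0:
        # Constant input cannot be rescaled; only the mean can be matched.
        return [ref_mean for _ in values]
    scale = stdev(reference) / cur_std
    # Centre, rescale, then re-centre on the reference mean.
    return [(v - cur_mean) * scale + ref_mean for v in values]


reference = [1, 4, 4, 1]
synthetic = [3, 3, 8, 8]

shifted = match_reference(synthetic, reference)
rescaled = match_reference(synthetic, reference, same_std=True)
print(mean(shifted), mean(rescaled))
```

With the plain shift, the mean matches but the spread of the synthetic values is preserved; with rescaling, both the mean and the standard deviation follow the reference.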
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from avatars.processors import ExpectedMeanProcessor
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6], [4, 5, 6], [1, 2, 3])),
...                   columns=['one', 'two', 'three'])
>>> df = df.astype('int')
>>> processor = ExpectedMeanProcessor(target_variables=['one'])
>>> processed = processor.preprocess(df)
The processor forces your synthetic dataset to have the same mean as the original.
>>> avatar = pd.DataFrame(np.array(([3, 2, 3], [3, 5, 6], [8, 5, 6], [8, 2, 3])),
...                       columns=['one', 'two', 'three'])
>>> avatar.one.mean()
5.5
>>> avatar = processor.postprocess(df, avatar)
>>> avatar.one.mean()
2.5
You can also force the mean by category using `groupby_variables`:
>>> df = pd.DataFrame(
...     {
...         "variable_1": [11, 24, 23.5, 12],
...         "variable_2": ["red", "blue", "blue", "red"],
...     }
... )
>>> df
   variable_1 variable_2
0        11.0        red
1        24.0       blue
2        23.5       blue
3        12.0        red
>>> df.groupby("variable_2").mean()
            variable_1
variable_2
blue             23.75
red              11.50
>>> processor = ExpectedMeanProcessor(
...     target_variables=['variable_1'],
...     groupby_variables=['variable_2'],
... )
>>> processor.preprocess(df)
   variable_1 variable_2
0        11.0        red
1        24.0       blue
2        23.5       blue
3        12.0        red
>>> avatar = pd.DataFrame(
...     {
...         "variable_1": [12, 13.5, 23.5, 22],
...         "variable_2": ["red", "red", "blue", "blue"],
...     }
... )
>>> avatar
   variable_1 variable_2
0        12.0        red
1        13.5        red
2        23.5       blue
3        22.0       blue
>>> avatar.groupby("variable_2").mean()
            variable_1
variable_2
blue             22.75
red              12.75
>>> avatar = processor.postprocess(df, avatar)
>>> avatar
   variable_1 variable_2
0       10.75        red
1       12.25        red
2       24.50       blue
3       23.00       blue
>>> avatar.groupby("variable_2").mean()
            variable_1
variable_2
blue             23.75
red              11.50
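The numbers in the grouped example are consistent with a per-group shift: each value moves by the difference between the reference group mean and the current group mean. The standalone sketch below reproduces that arithmetic with the standard library only. The helpers `group_means` and `shift_to_group_means` are hypothetical names for illustration and are not part of `avatars`; the library's actual implementation may differ.

```python
from collections import defaultdict
from statistics import mean


def group_means(values, groups):
    """Hypothetical helper: mean of `values` for each label in `groups`."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: mean(vals) for group, vals in buckets.items()}


def shift_to_group_means(values, groups, ref_means):
    """Hypothetical helper: shift each value by the gap between its group's
    reference mean and its group's current mean."""
    current = group_means(values, groups)
    return [v + (ref_means[g] - current[g]) for v, g in zip(values, groups)]


# Reference data from the example above (red: 11, 12; blue: 24, 23.5).
ref_means = group_means([11, 24, 23.5, 12], ["red", "blue", "blue", "red"])

# Synthetic data from the example above.
avatar_values = [12, 13.5, 23.5, 22]
avatar_groups = ["red", "red", "blue", "blue"]

shifted = shift_to_group_means(avatar_values, avatar_groups, ref_means)
print(shifted)  # [10.75, 12.25, 24.5, 23.0]
```

The red group moves down by 1.25 (12.75 to 11.50) and the blue group moves up by 1.00 (22.75 to 23.75), which matches the post-processed values shown in the example.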
- preprocess(df: DataFrame) → DataFrame
Compute the reference means and standard deviations.
- Parameters:
df – reference dataframe
- Returns:
the unaltered reference dataframe.
- Return type:
DataFrame
- postprocess(source: DataFrame, dest: DataFrame) → DataFrame
Force the data to have the reference mean.
- Parameters:
source – not used
dest – dataframe to transform
- Returns:
a dataframe with the transformed target columns
- Return type:
DataFrame