avatars.processors.RelativeDifferenceProcessor

class avatars.processors.RelativeDifferenceProcessor(target: str, references: List[str], scaling_unit: int | None = None, target_rename: str | None = None, drop_original_target: bool | None = False)

Express a numeric variable as a difference relative to the sum of other variables.

Avatarization preserves relationships and correlations between variables, but it does not guarantee that exact mathematical relations (for example, one variable always being greater than or equal to another) are retained. Apply the RelativeDifferenceProcessor to retain such a relation between variables.

Parameters:
  • target – variable to transform

  • references – the reference variables; the target is expressed relative to their sum

Keyword Arguments:
  • scaling_unit – factor by which the difference is divided, to handle a change of unit. E.g. if scaling_unit=1000, a difference in meters will be expressed in kilometers.

  • target_rename – name of the target column after preprocessing.

  • drop_original_target – whether to drop the original target column. Can only be set to True if target_rename is specified.
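
The arithmetic implied by these parameters can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper `relative_difference` and the column names are hypothetical, and the sketch assumes the processor computes (target - sum of references) / scaling_unit, as the descriptions above suggest.

```python
import math


def relative_difference(target, references, scaling_unit=1):
    """Hypothetical helper: (target - sum(references)) / scaling_unit, row by row.

    Assumption: this mirrors the parameter descriptions above; it is not
    the library's actual code.
    """
    result = []
    for row_index, value in enumerate(target):
        reference_sum = sum(column[row_index] for column in references)
        # NaN propagates through the subtraction and the division.
        result.append((value - reference_sum) / scaling_unit)
    return result


# Distances in meters; the difference is expressed in kilometers.
total_m = [5000.0, 12000.0, math.nan]
leg_1_m = [1000.0, 4000.0, 2000.0]
leg_2_m = [2000.0, 3000.0, 1000.0]

remaining_km = relative_difference(total_m, [leg_1_m, leg_2_m], scaling_unit=1000)
print(remaining_km)  # [2.0, 5.0, nan]
```

With two references, each row subtracts the sum of both reference values before scaling, and a missing target stays missing.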

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from avatars.processors import RelativeDifferenceProcessor
>>> df = pd.DataFrame(
...    {
...        "variable_1": [100, 150, 120, 100],
...        "variable_2": [110, 180, 130, np.nan]
...        }
...    )
>>> processor = RelativeDifferenceProcessor(target="variable_2", references=["variable_1"])
>>> df = processor.preprocess(df)
>>> df
   variable_1  variable_2
0         100        10.0
1         150        30.0
2         120        10.0
3         100         NaN

This preprocessing step expresses one variable as a difference from another. It can be useful when a relation such as variable_2 >= variable_1 holds between the variables. After avatarization, postprocess converts the differences back to absolute values.

>>> avatar = pd.DataFrame(
...    {
...        "variable_1": [110, 105, 115, 107],
...        "variable_2": [12, np.nan, 23, 15],
...        }
...    )
>>> avatar
   variable_1  variable_2
0         110        12.0
1         105         NaN
2         115        23.0
3         107        15.0
>>> avatar = processor.postprocess(df, avatar)
>>> avatar
   variable_1  variable_2
0         110       122.0
1         105         NaN
2         115       138.0
3         107       122.0

This processor can also be useful when you have a relation between three variables. Suppose you have three variables:

  • age_at_t0

  • age_at_t1

  • age_at_t2

The relation age_at_t0 < age_at_t1 < age_at_t2 holds for all individuals.

>>> df = pd.DataFrame(
...    {
...        "age_at_t0": [20, 40, 34, 56],
...        "age_at_t1": [23, 46, 37, 57],
...        "age_at_t2": [29, 54, 39, 64],
...        }
...    )
>>> df
   age_at_t0  age_at_t1  age_at_t2
0         20         23         29
1         40         46         54
2         34         37         39
3         56         57         64
>>> processor_1 = RelativeDifferenceProcessor(target="age_at_t2", references=["age_at_t1"])
>>> processor_2 = RelativeDifferenceProcessor(target="age_at_t1", references=["age_at_t0"])

Note

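Be careful about the order in which the processors are applied: processor_1 must run before processor_2 during preprocessing, and the postprocessing steps run in the reverse order.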
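
To see why the order matters, the sketch below redoes the arithmetic by hand on plain Python lists. It does not call the library, and the helper `subtract` is a hypothetical stand-in for the preprocessing step. The difference for age_at_t2 needs the absolute values of age_at_t1, so it has to be computed before age_at_t1 is itself replaced by a difference.

```python
age_at_t0 = [20, 40, 34, 56]
age_at_t1 = [23, 46, 37, 57]
age_at_t2 = [29, 54, 39, 64]


def subtract(target, reference):
    """Row-wise target - reference (illustrative stand-in for preprocess)."""
    return [t - r for t, r in zip(target, reference)]


# Correct order: age_at_t2 first, while age_at_t1 still holds absolute ages.
t2_diff = subtract(age_at_t2, age_at_t1)  # [6, 8, 2, 7]
t1_diff = subtract(age_at_t1, age_at_t0)  # [3, 6, 3, 1]

# Wrong order: age_at_t1 is already a difference when age_at_t2 is processed,
# so the result no longer means "years elapsed between t1 and t2".
t2_wrong = subtract(age_at_t2, t1_diff)  # [26, 48, 36, 63]
```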

>>> processed = processor_1.preprocess(df)
>>> processor_2.preprocess(processed)
   age_at_t0  age_at_t1  age_at_t2
0         20        3.0        6.0
1         40        6.0        8.0
2         34        3.0        2.0
3         56        1.0        7.0
>>> avatar = pd.DataFrame(
...    {
...        "age_at_t0": [22, 38, 34, 56],
...        "age_at_t1": [4.0, 5.0, 1.0, 5.0],
...        "age_at_t2": [5.0, 3.0, 7.0, 6.0],
...        }
...    )
>>> avatar
   age_at_t0  age_at_t1  age_at_t2
0         22        4.0        5.0
1         38        5.0        3.0
2         34        1.0        7.0
3         56        5.0        6.0
>>> post_avatar = processor_2.postprocess(df, avatar)
>>> processor_1.postprocess(df, post_avatar)
   age_at_t0  age_at_t1  age_at_t2
0         22       26.0       31.0
1         38       43.0       46.0
2         34       35.0       42.0
3         56       61.0       67.0

preprocess(df: DataFrame) → DataFrame

Transform a numeric variable into a difference relative to the sum of other variables.

Parameters:
  • df – dataframe to transform

Returns:

a dataframe in which the target column has been transformed

postprocess(source: DataFrame, dest: DataFrame) → DataFrame

Transform a difference relative to the sum of variables into an absolute numeric value.

Parameters:
  • source – not used

  • dest – dataframe to transform

Returns:

a dataframe in which the target column has been transformed
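
The inverse step can be sketched the same way in plain Python. The helper `restore_absolute` is hypothetical, and the multiplication by scaling_unit is an assumption (the inverse of the division described for preprocessing); this is not the library's code. The data is taken from the first example above.

```python
import math


def restore_absolute(difference, references, scaling_unit=1):
    """Hypothetical inverse: difference * scaling_unit + sum(references)."""
    restored = []
    for row_index, value in enumerate(difference):
        reference_sum = sum(column[row_index] for column in references)
        # NaN differences stay NaN after the addition.
        restored.append(value * scaling_unit + reference_sum)
    return restored


variable_1 = [110, 105, 115, 107]
variable_2_diff = [12.0, math.nan, 23.0, 15.0]

variable_2 = restore_absolute(variable_2_diff, [variable_1])
print(variable_2)  # [122.0, nan, 138.0, 122.0]
```

Because the restored value is the reference plus the difference, any non-negative difference yields variable_2 >= variable_1, whatever values the avatarization produced.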