avatars.processors.InterRecordCumulatedDifferenceProcessor

class avatars.processors.InterRecordCumulatedDifferenceProcessor(*, id_variable: str, target_variable: str, new_first_variable_name: str, new_difference_variable_name: str, keep_record_order: bool = False)

Processor to express the value of a variable as the difference from the previous value.

This processor can only be used on data where there are several records for each individual. With this transformation, a variable whose value is cumulative is expressed as:

  • a variable containing its first value

  • a variable containing the difference from the previous record
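
The transformation described above can be approximated in plain pandas. This is a minimal sketch, not the library's implementation: it assumes that records are ordered by the target value within each individual, which is consistent with the output shown in the Examples section, and the variable names out, ordered and grouped are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2],
    "value": [1025, 20042, 1000, 1130, 20000, 20040],
})

# Assumption: within each individual, a cumulative variable is ordered by its value.
ordered = df.sort_values(["id", "value"], kind="stable")
grouped = ordered.groupby("id")["value"]

out = df[["id"]].copy()
# Column assignment aligns on the original index, so the row order of df is kept.
out["first_value"] = grouped.transform("first")
# The first record of each individual has no previous record: its difference is 0.
out["value_difference"] = grouped.diff().fillna(0.0)
```

In this sketch the first record of each individual carries a difference of 0, so the first value plus the differences is enough to rebuild the original series.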

This processor is not suitable for data where the target or the id variable contain missing values.
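
Because of this restriction, it can help to check the two columns before calling preprocess. The helper below is a hypothetical sketch in plain pandas; assert_no_missing is not part of the avatars package.

```python
import pandas as pd


def assert_no_missing(df: pd.DataFrame, columns: list) -> None:
    """Raise ValueError if any of the given columns contains a missing value.

    Hypothetical helper, not part of the avatars package.
    """
    counts = df[columns].isna().sum()
    offending = {name: int(n) for name, n in counts.items() if n > 0}
    if offending:
        raise ValueError(f"missing values found: {offending}")


df = pd.DataFrame({"id": [1, 2, 1], "value": [1025, 20042, None]})

try:
    assert_no_missing(df, ["id", "value"])
    message = ""
except ValueError as error:
    # The "value" column has one missing entry, so the check fails here.
    message = str(error)
```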

Keyword Arguments:
  • id_variable – variable indicating which individual each row belongs to

  • target_variable – variable to transform

  • new_first_variable_name – name of the variable to be created to contain the first value of the target variable

  • new_difference_variable_name – name of the variable to be created to contain the difference value

  • keep_record_order – if set to True, postprocess decodes values respecting the record order of the source dataframe. This can only be set to True if the indices are the same between the source and dest dataframes passed as arguments to postprocess.
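
The index condition on keep_record_order can be checked before building the processor. The sketch below assumes that "the same indices" means identical index objects in the same order, which is an interpretation rather than something this page states; indices_match is a hypothetical helper.

```python
import pandas as pd


def indices_match(source: pd.DataFrame, dest: pd.DataFrame) -> bool:
    """Return True when both dataframes have identical indices (hypothetical helper)."""
    return source.index.equals(dest.index)


source = pd.DataFrame({"id": [1, 2, 1]})
same_shape = pd.DataFrame({"id": [1, 2, 1]})
extra_record = pd.DataFrame({"id": [1, 2, 1, 2]})

# Only request keep_record_order=True when the indices line up.
keep_for_same_shape = indices_match(source, same_shape)
keep_for_extra_record = indices_match(source, extra_record)
```

When the check fails, as it does for a dest dataframe with an extra record, the default keep_record_order=False is the applicable setting.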

Examples

>>> import pandas as pd
>>> from avatars.processors import InterRecordCumulatedDifferenceProcessor
>>> df = pd.DataFrame({
...    "id": [1, 2, 1, 1, 2, 2],
...    "value": [1025, 20042, 1000, 1130, 20000, 20040],
... })
>>> processor = InterRecordCumulatedDifferenceProcessor(
...    id_variable='id',
...    target_variable='value',
...    new_first_variable_name='first_value',
...    new_difference_variable_name='value_difference',
...    keep_record_order=True
... )
>>> processor.preprocess(df)
   id  first_value  value_difference
0   1         1000              25.0
1   2        20000               2.0
2   1         1000               0.0
3   1         1000             105.0
4   2        20000               0.0
5   2        20000              40.0

The postprocess method transforms preprocessed data back into its original format:

>>> preprocessed_df = pd.DataFrame({
...    "id": [1, 2, 1, 1, 2, 2],
...    "first_value": [1000, 20000, 1000, 1000, 20000, 20000],
...    "value_difference": [25, 2, 0, 105, 0, 40],
... })
>>> processor.postprocess(df, preprocessed_df)
   id  value
0   1   1025
1   2  20042
2   1   1000
3   1   1130
4   2  20000
5   2  20040

The postprocess method can also be used on data where the number of records per individual differs from the original. In such cases, the processor should be instantiated with keep_record_order left at its default value, False. In the example below, there is an extra record with id 2.

>>> processor = InterRecordCumulatedDifferenceProcessor(
...    id_variable='id',
...    target_variable='value',
...    new_first_variable_name='first_value',
...    new_difference_variable_name='value_difference',
...    keep_record_order=False
... )
>>> preprocessed_df = pd.DataFrame({
...    "id": [1, 2, 1, 1, 2, 2, 2],
...    "first_value": [1000, 20000, 1000, 1000, 20000, 20000, 20000],
...    "value_difference": [25, 2, 0, 105, 0, 40, 8],
... })
>>> processor.postprocess(df, preprocessed_df)
   id  value
0   1   1025
1   2  20002
2   1   1025
3   1   1130
4   2  20002
5   2  20042
6   2  20050
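
The output above is consistent with rebuilding each value as the first value plus the running sum of differences, taken in row order within each individual. The following is a plain-pandas sketch of that reading, not the library's implementation; decoded and cumulated are illustrative names.

```python
import pandas as pd

preprocessed_df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2, 2],
    "first_value": [1000, 20000, 1000, 1000, 20000, 20000, 20000],
    "value_difference": [25, 2, 0, 105, 0, 40, 8],
})

# Running sum of the differences per individual, following the row order.
cumulated = preprocessed_df.groupby("id")["value_difference"].cumsum()

decoded = pd.DataFrame({
    "id": preprocessed_df["id"],
    "value": preprocessed_df["first_value"] + cumulated,
})
```

Under this reading, the extra record for id 2 simply extends the running sum, which is why the number of records per individual does not need to match the source.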

preprocess(df: DataFrame) → DataFrame

postprocess(source: DataFrame, dest: DataFrame) → DataFrame