avatars.processors.InterRecordBoundedRangeDifferenceProcessor

class avatars.processors.InterRecordBoundedRangeDifferenceProcessor(*, id_variable: str, target_start_variable: str, target_end_variable: str, new_first_variable_name: str, new_range_variable: str, new_difference_variable_name: str, sort_by_variable: str | None = None, should_round_output: bool = True)

Processor to express two related bounded variables relative to previous records.

This processor can only be used on data that contains several records for each individual. It applies to variables such as var_a and var_b whose values are cumulative over successive records t, in the following way:

var_a_t <= var_b_t <= var_a_(t+1) <= var_b_(t+1) <= var_a_(t+2) …

After the transformation, these variables are expressed as:

  • a variable containing the first value of var_a

  • a variable containing the difference from the previous record

  • a variable containing the range between the start and end variables

Difference and range variables are expressed as a proportion of the possible change between the value and the relevant bound (upper or lower). For example, for a variable whose values range from 10 (lower bound) to 100 (upper bound), if the previous record's value is 60 and the new value is 30, the proportion is calculated as (30 - 60) / (60 - 10) = -0.6.
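
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the helper name `proportion_of_possible_change` is hypothetical, and the handling of increases (dividing by the distance to the upper bound) is an assumption mirroring the documented decrease case.

```python
def proportion_of_possible_change(previous, new, lower, upper):
    """Express the move from `previous` to `new` as a share of the room left.

    Hypothetical helper for illustration; not part of the avatars API.
    """
    if not lower <= previous <= upper or not lower <= new <= upper:
        raise ValueError("values must lie within [lower, upper]")
    if new < previous:
        # Decrease: divide by the distance down to the lower bound.
        return (new - previous) / (previous - lower)
    if new > previous:
        # Increase (assumed symmetric): divide by the distance up to the upper bound.
        return (new - previous) / (upper - previous)
    return 0.0


# Example from the text: bounds 10 and 100, previous value 60, new value 30.
print(proportion_of_possible_change(60, 30, lower=10, upper=100))  # -0.6
```

Under this convention the result always lies between -1 (the value drops to the lower bound) and 1 (the value rises to the upper bound).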

This processor is not suitable for data where any of the variables passed as args contain missing values.

Keyword Arguments:
  • id_variable – variable indicating which individual each row belongs to

  • target_start_variable – variable representing the start of the range to transform

  • target_end_variable – variable representing the end of the range to transform

  • sort_by_variable – variable used to sort records for each id

  • new_first_variable_name – name of the variable to be created to contain the first value of the target start variable

  • new_range_variable – name of the variable to be created to contain the range value

  • new_difference_variable_name – name of the variable to be created to contain the difference value

  • should_round_output – set to True to force post-processed values to be integers.

Examples

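>>> import pandas as pd
>>> from avatars.processors import InterRecordBoundedRangeDifferenceProcessor
>>> df = pd.DataFrame(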
...    {
...       'quantity_start': [30, 100, 80, 70, 40, 70],
...       'quantity_end': [10, 80, 70, 60, 30, 5],
...       'b': [4, 3, 0, 0, 2, 4],
...       'id': [1,1,1,2,2,2]
...    }
... )
>>> processor = InterRecordBoundedRangeDifferenceProcessor(
...    id_variable='id',
...    target_start_variable='quantity_start',
...    target_end_variable='quantity_end',
...    new_first_variable_name='quantity_s_first_val',
...    new_difference_variable_name="quantity_diff_to_bound",
...    new_range_variable="quantity_range",
...    should_round_output=True
... )
>>> preprocessed_df = processor.preprocess(df)
>>> print(preprocessed_df)
   b  id  quantity_range  quantity_s_first_val  quantity_diff_to_bound
0  4   1       -0.800000                    30                0.000000
1  3   1       -0.210526                    30                1.000000
2  0   1       -0.133333                    30                0.000000
3  0   2       -0.153846                    70                0.000000
4  2   2       -0.285714                    70               -0.363636
5  4   2       -1.000000                    70                0.571429

The postprocess method transforms preprocessed data back into its original format.

>>> processor.postprocess(df, preprocessed_df)
   quantity_start  quantity_end  b  id
0              30            10  4   1
1             100            80  3   1
2              80            70  0   1
3              70            60  0   2
4              40            30  2   2
5              70             5  4   2
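
To give a feel for what reversing a proportion involves, here is a minimal sketch of the inverse arithmetic. It is not the library's code: the helper name `value_from_proportion` is hypothetical, the sign convention for increases is an assumption, and the bounds 5 and 100 are simply the smallest and largest values in the example data, assumed here for illustration.

```python
def value_from_proportion(previous, proportion, lower, upper, should_round=True):
    """Recover a value from the previous value and a proportion of possible change.

    Hypothetical helper for illustration; not part of the avatars API.
    """
    if proportion < 0:
        # Negative proportion: move towards the lower bound.
        value = previous + proportion * (previous - lower)
    else:
        # Positive proportion (assumed convention): move towards the upper bound.
        value = previous + proportion * (upper - previous)
    # Mirrors the idea of should_round_output: return an integer when requested.
    return round(value) if should_round else value


# Row 4 of the example (id 2): the previous record ended at 60.
start = value_from_proportion(60, -0.363636, lower=5, upper=100)
end = value_from_proportion(start, -0.285714, lower=5, upper=100)
print(start, end)  # 40 30
```

Rounding absorbs the small error introduced by the truncated proportions printed in the example output.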

preprocess(df: DataFrame) → DataFrame

postprocess(source: DataFrame, dest: DataFrame) → DataFrame