avatars.processors.InterRecordBoundedCumulatedDifferenceProcessor
- class avatars.processors.InterRecordBoundedCumulatedDifferenceProcessor(id_variable: str, target_variable: str, new_first_variable_name: str, new_difference_variable_name: str, should_round_output: bool = False)
Processor to express the value of a variable as the difference from the previous value.
This processor can be used only on data that contains several records for each individual. The transformation replaces a variable whose value is cumulative with two variables:
- a variable containing its first value
- a variable containing the difference from the previous record
The difference variable is expressed as the proportion of the possible change between the previous value and the bound (the upper bound for an increase, the lower bound for a decrease). For example, for a variable whose values range from 10 (lower bound) to 100 (upper bound), if the previous record's value is 60 and the new value is 30, the proportion is calculated as (30 - 60) / (60 - 10) = -0.6. This ensures that the bounds are respected during the pre-processing and post-processing of the data.
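The calculation described above can be sketched for a single pair of records as follows. This is a minimal illustration, not the library's implementation: the function name `bounded_proportion` and its parameters are hypothetical, and returning 0 when the previous value already sits on the bound is an assumption.

```python
def bounded_proportion(previous: float, current: float,
                       lower: float, upper: float) -> float:
    """Express the change from `previous` to `current` as a share of the
    room left between `previous` and the relevant bound.

    Hypothetical helper for illustration only.
    """
    change = current - previous
    if change < 0:
        # A decrease is measured against the distance to the lower bound.
        room = previous - lower
    else:
        # An increase is measured against the distance to the upper bound.
        room = upper - previous
    # Assumption: with no room left, report no change instead of dividing by zero.
    return change / room if room != 0 else 0.0


# Worked example from the text: bounds 10 and 100, previous 60, new 30.
print(bounded_proportion(60, 30, lower=10, upper=100))
```

Because a decrease can never exceed the distance to the lower bound and an increase can never exceed the distance to the upper bound, the result stays within [-1, 1] whenever both values lie inside the bounds.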
This processor is not suitable for data where the target variable or the id variable contains missing values.
- Keyword Arguments:
id_variable – variable indicating which individual each row belongs to
target_variable – variable to transform
new_first_variable_name – name of the variable to be created to contain the first value of the target variable
new_difference_variable_name – name of the variable to be created to contain the difference value
should_round_output – set to True to force post-processed values to be integers.
Examples
>>> import pandas as pd
>>> from avatars.processors import InterRecordBoundedCumulatedDifferenceProcessor
>>> df = pd.DataFrame({
...     "id": [1, 2, 1, 1, 2, 2],
...     "value": [1025, 20042, 1000, 1130, 20000, 20040],
... })
>>> processor = InterRecordBoundedCumulatedDifferenceProcessor(
...     id_variable='id',
...     target_variable='value',
...     new_first_variable_name='first_value',
...     new_difference_variable_name='difference_to_bound',
...     should_round_output=True
... )
>>> processor.preprocess(df)
   id  first_value  difference_to_bound
0   1         1025             0.000000
1   2        20042             0.000000
2   1         1025            -1.000000
3   1         1025             0.006827
4   2        20042            -0.002206
5   2        20042             0.952381
The postprocess method transforms preprocessed data back into its original format:
>>> preprocessed_df = pd.DataFrame({
...     "id": [1, 2, 1, 1, 2, 2],
...     "first_value": [1025, 20042, 1025, 1025, 20042, 20042],
...     "difference_to_bound": [0, 0, -1, 0.006827, -0.002206, 0.952381],
... })
>>> processor.postprocess(df, preprocessed_df)
   id  value
0   1   1025
1   2  20042
2   1   1000
3   1   1130
4   2  20000
5   2  20040
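To make the reverse step concrete, the sketch below rebuilds one value from the previous value and a stored proportion. It is an illustration under stated assumptions, not the library's code: `apply_bounded_proportion` is a hypothetical name, and the bounds 1000 and 20042 are assumed to be the minimum and maximum of the `value` column, which is consistent with the numbers shown in the examples above.

```python
def apply_bounded_proportion(previous: float, proportion: float,
                             lower: float, upper: float,
                             should_round: bool = False):
    """Rebuild a value from the previous value and a bounded proportion.

    Hypothetical helper for illustration only.
    """
    if proportion < 0:
        # Negative proportions consume part of the distance to the lower bound.
        room = previous - lower
    else:
        # Positive proportions consume part of the distance to the upper bound.
        room = upper - previous
    value = previous + proportion * room
    # Rounding mirrors the intent of should_round_output=True.
    return round(value) if should_round else value


# Individual 1 from the example: first value 1025, then proportions -1 and 0.006827.
# Assumed bounds: 1000 (column minimum) and 20042 (column maximum).
second = apply_bounded_proportion(1025, -1, 1000, 20042, should_round=True)
third = apply_bounded_proportion(second, 0.006827, 1000, 20042, should_round=True)
print(second, third)  # 1000 1130
```

Each reconstructed value feeds the next step, so the records of an individual have to be processed in order. Since a proportion in [-1, 1] scales the remaining room, the rebuilt value cannot leave the interval between the two bounds.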
- preprocess(df: DataFrame) → DataFrame
- postprocess(source: DataFrame, dest: DataFrame) → DataFrame