avatars.processors.InterRecordRangeDifferenceProcessor
- class avatars.processors.InterRecordRangeDifferenceProcessor(*, id_variable: str, target_start_variable: str, target_end_variable: str, sort_by_variable: str, new_first_variable: str, new_range_variable: str, new_difference_variable: str, keep_record_order: bool = False)
Processor to express the values of two related variables relative to previous records.
This processor can be used only on data with several records per individual. It applies to two variables, such as var_a and var_b, whose values are cumulative over successive records t, that is:
var_a[t] <= var_b[t] <= var_a[t+1] <= var_b[t+1] <= var_a[t+2] …
These variables are then expressed as:
- a variable containing the first value of var_a
- a variable containing the difference from the previous record (i.e. var_a[t] - var_b[t-1])
- a variable containing the range between the start and end variables (i.e. var_b[t] - var_a[t])
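The arithmetic described above can be sketched in plain pandas. This only illustrates the formulas and is not the library's implementation. The column names (`id`, `start`, `end`) and output names are illustrative, and the difference for the first record of each individual is assumed to be 0.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 1, 1, 2, 2],
        "start": [7, 14, 6, 12, 10, 23],
        "end": [8, 18, 7, 15, 12, 24],
    }
)

# Order the records of each individual by their start value.
ordered = df.sort_values(["id", "start"])
grouped = ordered.groupby("id")

# End value of the previous record (NaN for the first record of each id).
prev_end = grouped["end"].shift(1)
# First start value of each individual, repeated on every one of its rows.
first_value = grouped["start"].transform("first")

encoded = pd.DataFrame(
    {
        "id": ordered["id"],
        # var_b[t] - var_a[t]
        "range_value": ordered["end"] - ordered["start"],
        "first_value": first_value,
        # var_a[t] - var_b[t-1]; assumed to be 0 for the first record.
        "value_difference": ordered["start"] - prev_end.fillna(first_value),
    }
).sort_index()  # restore the original row order
```

Restoring the original index at the end keeps each encoded row next to the record it came from, which makes the result easy to compare with the input.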
Use this processor for a quantity that changes within each event (between a start value and an end value) and that also changes from one event to the next.
This processor is not suitable for data where any of the variables passed as args contain missing values.
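Because of this restriction, it can help to check the relevant columns before building the processor. The helper below is a hypothetical convenience written with plain pandas; it is not part of the avatars API.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the names in `columns` that contain at least one missing value."""
    return [name for name in columns if df[name].isna().any()]


# Illustrative data: the second individual has no start value.
df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "start": [6, 7, None],
        "end": [7, 8, 12],
    }
)

bad = columns_with_missing(df, ["id", "start", "end"])
if bad:
    print(f"Columns with missing values: {bad}")
```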
- Keyword Arguments:
id_variable – variable indicating which individual each row belongs to
target_start_variable – variable representing the start of the range to transform
target_end_variable – variable representing the end of the range to transform
sort_by_variable – variable used to sort records for each id
new_first_variable – name of the variable to be created to contain the first value of target_start_variable for each individual
new_range_variable – name of the variable to be created to contain the range value
new_difference_variable – name of the variable to be created to contain the difference value
keep_record_order – If set to True, the postprocess will decode values respecting the record order given by id_variable and sort_by_variable from the source dataframe. This can only be set to True if the indices are the same between the source and dest dataframes passed as arguments to postprocess.
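The index condition on keep_record_order can be checked before calling postprocess. The sketch below assumes that "the same indices" means identical index labels in the same order, which is what pandas `Index.equals` tests; the helper name is hypothetical and the library may apply a different rule.

```python
import pandas as pd


def indices_match(source: pd.DataFrame, dest: pd.DataFrame) -> bool:
    """Assumed precondition for keep_record_order=True: identical indices."""
    return source.index.equals(dest.index)


source = pd.DataFrame({"id": [1, 2, 1]})
same_shape = pd.DataFrame({"id": [1, 2, 1]})
extra_row = pd.DataFrame({"id": [1, 2, 1, 2]})

ok_same = indices_match(source, same_shape)
ok_extra = indices_match(source, extra_row)
```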
Examples
>>> df = pd.DataFrame(
...     {
...         "id": [1, 2, 1, 1, 2, 2],
...         "start": [7, 14, 6, 12, 10, 23],
...         "end": [8, 18, 7, 15, 12, 24],
...     }
... )
>>> processor = InterRecordRangeDifferenceProcessor(
...     id_variable="id",
...     target_start_variable="start",
...     target_end_variable="end",
...     sort_by_variable="start",
...     new_first_variable="first_value",
...     new_range_variable="range_value",
...     new_difference_variable="value_difference",
...     keep_record_order=True,
... )
>>> processor.preprocess(df)
   id  range_value  first_value  value_difference
0   1            1            6               0.0
1   2            4           10               2.0
2   1            1            6               0.0
3   1            3            6               4.0
4   2            2           10               0.0
5   2            1           10               5.0
The postprocess allows you to transform some preprocessed data back into its original format:
>>> preprocessed_df = pd.DataFrame(
...     {
...         "id": [1, 2, 1, 1, 2, 2],
...         "range_value": [1, 4, 1, 3, 2, 1],
...         "first_value": [6, 10, 6, 6, 10, 10],
...         "value_difference": [0, 2, 0, 4, 0, 5],
...     }
... )
>>> processor.postprocess(df, preprocessed_df)
   id  start   end
0   1    7.0   8.0
1   2   14.0  18.0
2   1    6.0   7.0
3   1   12.0  15.0
4   2   10.0  12.0
5   2   23.0  24.0
The postprocess can also be used on data where the number of records per individual differs from the original. In such cases, the processor should be instantiated with the keep_record_order argument set to its default value, False. In the example below, there is an extra record with the id 2.
>>> processor = InterRecordRangeDifferenceProcessor(
...     id_variable="id",
...     target_start_variable="start",
...     target_end_variable="end",
...     sort_by_variable="start",
...     new_first_variable="first_value",
...     new_range_variable="range_value",
...     new_difference_variable="value_difference",
...     keep_record_order=False,
... )
>>> preprocessed_df = pd.DataFrame(
...     {
...         "id": [1, 2, 1, 1, 2, 2, 2],
...         "range_value": [1, 4, 1, 3, 2, 1, 2],
...         "first_value": [6, 10, 6, 6, 10, 10, 10],
...         "value_difference": [0, 2, 0, 4, 0, 5, 1],
...     }
... )
>>> processor.postprocess(df, preprocessed_df)
   id  start   end
0   1    6.0   7.0
1   2   12.0  16.0
2   1    7.0   8.0
3   1   12.0  15.0
4   2   16.0  18.0
5   2   23.0  24.0
6   2   25.0  27.0
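The values returned in this second example are consistent with a cumulative reconstruction applied to each individual's records in their order of appearance: each start is the previous end plus the difference, and each end is the start plus the range. The sketch below reproduces that arithmetic with plain pandas. It is an assumption inferred from the documented output, not the library's implementation, and it returns integers where the processor returns floats.

```python
import pandas as pd

dest = pd.DataFrame(
    {
        "id": [1, 2, 1, 1, 2, 2, 2],
        "range_value": [1, 4, 1, 3, 2, 1, 2],
        "first_value": [6, 10, 6, 6, 10, 10, 10],
        "value_difference": [0, 2, 0, 4, 0, 5, 1],
    }
)

# Each record moves the running end value forward by (difference + range).
step = dest["value_difference"] + dest["range_value"]

# Accumulate the steps per individual, starting from that individual's first value.
end = dest["first_value"] + step.groupby(dest["id"]).cumsum()
start = end - dest["range_value"]

decoded = pd.DataFrame({"id": dest["id"], "start": start, "end": end})
```

Writing the end value as a grouped cumulative sum avoids an explicit loop over records while still applying the recurrence one individual at a time.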
- preprocess(df: DataFrame) → DataFrame
- postprocess(source: DataFrame, dest: DataFrame) → DataFrame