avatars.processors.ToCategoricalProcessor

class avatars.processors.ToCategoricalProcessor(to_categorical_threshold: int, *, keep_continuous: bool = False, continuous_suffix: str = '__cont', category: str = 'other')

Processor to model selected numeric variables as categorical variables.

Parameters:

to_categorical_threshold – maximum number of distinct values for a continuous variable to be treated as categorical.

Keyword Arguments:
  • keep_continuous – if True, the original continuous variable is kept as an additional column whose name is suffixed with continuous_suffix.

  • continuous_suffix – suffix for the continuous variable created during preprocess.

  • category – if keep_continuous=True, name of the new category. Needed for some specific avatarization cases that use the group_modalities processor.
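
To make the role of to_categorical_threshold concrete, the selection rule can be approximated in plain pandas. This is only a sketch, not the processor's actual implementation: the names threshold, distinct_counts and selected are hypothetical, and the inclusive comparison (at most threshold distinct values) is an assumption inferred from the examples in this page, where a variable with 2 distinct values is converted when the threshold is 2.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "variable_1": [1, 7, 7, 1],
        "variable_2": [1, 2, 7, 1],
    }
)

# Hypothetical stand-in for the to_categorical_threshold parameter.
threshold = 2

# Count the distinct values of each numeric column.
distinct_counts = df.select_dtypes(include="number").nunique()

# Assumption: a column qualifies when it has at most `threshold` distinct values.
selected = distinct_counts[distinct_counts <= threshold].index.tolist()
print(selected)
```

Here variable_1 has 2 distinct values and variable_2 has 3, so only variable_1 is selected under the assumed rule.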

Examples

With keep_continuous=False, the processor only converts the variable to object dtype. This ensures that all of its values are kept during the avatarization.

>>> import pandas as pd
>>> from avatars.processors import ToCategoricalProcessor
>>> df = pd.DataFrame(
...    {
...        "variable_1": [1, 7, 7, 1],
...        "variable_2": [1, 2, 7, 1]
...        }
...    )
>>> processor = ToCategoricalProcessor(to_categorical_threshold=2)
>>> processor.preprocess(df).dtypes
variable_1    object
variable_2     int64
dtype: object
>>> avatar = pd.DataFrame(
...    {
...        "variable_1": [2, 1, 4, 1],
...        "variable_2": [2, 1, 4, 1]
...        }
...    )
>>> avatar["variable_1"] = avatar["variable_1"].astype('object')
>>> avatar.dtypes
variable_1    object
variable_2     int64
dtype: object
>>> processor.postprocess(df, avatar).dtypes
variable_1    int64
variable_2    int64
dtype: object

With keep_continuous=True, the variable is duplicated and the copy is kept as continuous, so the numeric values remain available for other uses.

>>> df = pd.DataFrame(
...    {
...        "variable_1": [1, 7, 7, 1],
...        "variable_2": [1, 2, 7, 1]
...        }
...    )
>>> processor = ToCategoricalProcessor(to_categorical_threshold=2, keep_continuous=True)
>>> processor.preprocess(df).dtypes
variable_1          object
variable_2           int64
variable_1__cont     int64
dtype: object
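
The column layout produced with keep_continuous=True can be imitated in plain pandas, which may help when reasoning about downstream steps. This is a sketch under assumptions rather than the library's code: the variable names out and continuous_suffix are illustrative, and the column to convert is hard-coded instead of being selected by a threshold.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "variable_1": [1, 7, 7, 1],
        "variable_2": [1, 2, 7, 1],
    }
)

# Illustrative value, matching the documented default suffix.
continuous_suffix = "__cont"

out = df.copy()
# Keep a numeric copy of the column under the suffixed name.
out["variable_1" + continuous_suffix] = out["variable_1"]
# Convert the original column to object dtype so it is handled as categorical.
out["variable_1"] = out["variable_1"].astype("object")

print(out.dtypes)
```

The suffixed copy keeps the original numeric dtype, while the original column becomes object.
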

preprocess(df: DataFrame) → DataFrame

Transform numeric variables into categorical variables.

Parameters:

df – dataframe to transform

Returns:

transformed dataframe

Return type:

DataFrame

postprocess(source: DataFrame, dest: DataFrame) → DataFrame

Transform converted categorical variables back to numeric.

Parameters:
  • source – reference dataframe

  • dest – dataframe to transform

Returns:

transformed dataframe

Return type:

DataFrame
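
The effect of postprocess on dtypes, as shown in the class examples, resembles casting each column of dest back to the dtype it has in source. The following plain-pandas sketch illustrates that idea only; it is an assumption about the observable behavior, not the processor's implementation, and restored is a hypothetical name.

```python
import pandas as pd

source = pd.DataFrame(
    {
        "variable_1": [1, 7, 7, 1],
        "variable_2": [1, 2, 7, 1],
    }
)
dest = pd.DataFrame(
    {
        "variable_1": [2, 1, 4, 1],
        "variable_2": [2, 1, 4, 1],
    }
)
# Simulate a column that was handled as categorical during avatarization.
dest["variable_1"] = dest["variable_1"].astype("object")

restored = dest.copy()
for column in restored.columns:
    # Cast back only the columns whose dtype differs from the reference.
    if restored[column].dtype != source[column].dtype:
        restored[column] = restored[column].astype(source[column].dtype)

print(restored.dtypes)
```

The values of dest are unchanged; only the dtype of variable_1 is aligned with the reference dataframe.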