Runner
The Runner orchestrates an avatarization workflow: data upload, parameter
configuration, job submission, status polling, and result retrieval.
It encapsulates a configuration object (avatar_yaml.Config) that mirrors the
YAML structure used by batch operations.
Responsibilities
- Collect and upload source (and optional avatar) tables (add_table)
- Advise or customize parameters (advise_parameters / set_parameters)
- Create links between tables (only needed for multitable) (add_link)
- Launch jobs individually or end-to-end (run and specialized methods)
- Download results (get_all_results and specialized methods)
Design notes
The Runner is stateful: it remembers tables, generated parameters, created jobs, and
downloaded results. You usually create one per anonymization project (“set”) via
manager.create_runner(set_name=...). Internally it delegates network calls to the
shared authenticated ApiClient.
Minimal flow
runner = manager.create_runner("demo")
runner.add_table("wbcd", "fixtures/wbcd.csv")
runner.advise_parameters()  # optional: fill parameters with the server recommendation
runner.set_parameters("wbcd", k=15)  # overwrites any existing parameters for this table
runner.run()  # runs the full pipeline
runner.get_all_results()  # downloads all results
Detailed reference
- class avatars.runner.Runner(api_client: ApiClient, display_name: str, seed: int | None = None, max_distribution_plots: int | None = None, pia_data_recipient: DataRecipient = DataRecipient.UNKNOWN, pia_data_type: DataType = DataType.UNKNOWN, pia_data_subject: DataSubject = DataSubject.UNKNOWN, pia_sensitivity_level: SensitivityLevel = SensitivityLevel.UNDEFINED, report_language: ReportLanguage = ReportLanguage.EN)
Bases: object
- add_annotations(annotations: dict[str, str]) -> None
Add metadata annotations to the config.
- Parameters:
annotations – A dictionary of annotations to add to the metadata.
- add_table(table_name: str, data: str | DataFrame, primary_key: str | None = None, foreign_keys: list | None = None, time_series_time: str | None = None, types: dict[str, ColumnType] = {}, individual_level: bool | None = None, avatar_data: str | DataFrame | None = None)
Add a table to the config and upload the data to the server.
- Parameters:
table_name – The name of the table.
data – The data to add to the table. Can be a path to a file or a pandas DataFrame.
primary_key – The primary key of the table.
foreign_keys – Foreign keys of the table.
time_series_time – name of the time column in the table (time series case).
types – A dictionary of column types with the column name as the key and the type as the value.
individual_level – Whether the table is at individual level, i.e. each row corresponds to one individual (e.g. patient, customer). Defaults to True.
avatar_data – The avatar table if there is one. Can be a path to a file or a pandas DataFrame.
- advise_parameters(table_name: str | None = None) -> None
Fill the parameters set with the server recommendation.
- Parameters:
table_name – The name of the table. If None, all tables will be used.
- upload_file(table_name: str, data: str | DataFrame, avatar_data: str | DataFrame | None = None)
Upload a file to the server.
- Parameters:
table_name – The name of the table the data belongs to.
data – The data to upload. Can be a path to a file or a pandas DataFrame.
avatar_data – The avatar table if there is one. Can be a path to a file or a pandas DataFrame.
- add_link(parent_table_name: str, parent_field: str, child_table_name: str, child_field: str, method: LinkMethod = LinkMethod())
Add a table link to the config.
- Parameters:
parent_table_name – The name of the parent table.
child_table_name – The name of the child table.
parent_field – The parent link key field (primary key) in the parent table.
child_field – The child link key field (foreign key) in the child table.
method – The method to use for linking the tables. Defaults to “linear_sum_assignment”.
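For a multitable setup, add_table and add_link combine roughly as follows. This is a minimal sketch: the table names, file paths, and column names are hypothetical, the individual_level=False choice for the child table is an assumption, and a MagicMock stands in for a real runner so the snippet executes without a server.

```python
from unittest.mock import MagicMock


def link_visits_to_patients(runner):
    """Register a parent and a child table, then link them (illustrative names)."""
    # Parent table: one row per patient, identified by its primary key.
    runner.add_table("patient", "data/patient.csv", primary_key="patient_id")
    # Child table: several visits may reference the same patient (assumption).
    runner.add_table(
        "visit",
        "data/visit.csv",
        primary_key="visit_id",
        foreign_keys=["patient_id"],
        individual_level=False,
    )
    # The link joins the parent's primary key to the child's foreign key.
    runner.add_link(
        parent_table_name="patient",
        parent_field="patient_id",
        child_table_name="visit",
        child_field="patient_id",
    )


# Dry run against a mock; in practice the runner comes from manager.create_runner(...).
mock_runner = MagicMock()
link_visits_to_patients(mock_runner)
print(mock_runner.add_link.call_args)
```

The link method is left at its default here; pass method=... only when a different linking strategy is needed.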
- set_parameters(table_name: str, k: int | None = None, ncp: int | None = None, use_categorical_reduction: bool | None = None, column_weights: dict[str, float] | None = None, exclude_variable_names: list[str] | None = None, exclude_replacement_strategy: ExcludeVariablesMethod | None = None, exclude_variable_method: ExcludeVariablesMethod | None = None, imputation_method: ImputeMethod | None = None, imputation_k: int | None = None, imputation_training_fraction: float | None = None, imputation_return_data_imputed: bool | None = None, open_dp_epsilon: float | None = None, fast_dp_epsilon: float | None = None, time_series_nf: int | None = None, time_series_projection_type: ProjectionType | None = None, time_series_nb_points: int | None = None, time_series_method: AlignmentMethod | None = None, known_variables: list[str] | None = None, target: str | None = None, quantile_threshold: int | None = None, data_augmentation_strategy: float | AugmentationStrategy | dict[str, float] | None = None, data_augmentation_target_column: str | None = None, data_augmentation_should_anonymize_original_table: bool | None = None, processors: list[AvatarizationProcessorParameters] | None = None, pseudonymized_columns: dict[str, PseudonymizationColumnConfig] | None = None, use_excluded_variables_in_metrics: bool = False)
Set the parameters for a given table.
This will overwrite any existing parameters for the table, including parameters set using advise_parameters().
- Parameters:
table_name – The name of the table.
k – Number of nearest neighbors to consider for KNN-based methods.
ncp – Number of dimensions to consider for the KNN algorithm.
use_categorical_reduction – Whether to transform categorical variables into a latent numerical space before projection.
column_weights – Dictionary mapping column names to their respective weights, indicating the importance of each variable during the projection process.
exclude_variable_names – List of variable names to exclude from the projection.
exclude_replacement_strategy – DEPRECATED: use exclude_variable_method instead.
exclude_variable_method – Strategy for replacing excluded variables. Options: ROW_ORDER, COORDINATE_SIMILARITY.
imputation_method – Method for imputing missing values. Options: ImputeMethod.KNN, ImputeMethod.MODE, ImputeMethod.MEDIAN, ImputeMethod.MEAN, ImputeMethod.FAST_KNN.
imputation_k – Number of neighbors to use for imputation if the method is KNN or FAST_KNN.
imputation_training_fraction – Fraction of the dataset to use for training the imputation model when using KNN or FAST_KNN.
imputation_return_data_imputed – Whether to return the data with imputed values.
open_dp_epsilon – Epsilon value for differential privacy using OpenDP implementation.
fast_dp_epsilon – Epsilon value for fastDP avatarization.
time_series_nf – In time series context, number of degrees of freedom to retain in time series projections.
time_series_projection_type – In time series context, type of projection for time series. Options: ProjectionType.FCPA (default) or ProjectionType.FLATTEN.
time_series_method – In time series context, method for aligning series. Options: AlignmentMethod.SPECIFIED, AlignmentMethod.MAX, AlignmentMethod.MIN, AlignmentMethod.MEAN.
time_series_nb_points – In time series context, number of points to generate for time series.
known_variables – List of known variables to be used for privacy metrics. These are variables that could be easily known by an attacker.
target – Target variable to predict, used for signal metrics.
quantile_threshold – Quantile threshold for privacy metrics calculations.
data_augmentation_strategy – Strategy for data augmentation. Can be a float representing the augmentation ratio, an AugmentationStrategy enum, or a dictionary mapping modality to their respective augmentation ratios.
data_augmentation_target_column – Target column for data augmentation when using a dictionary strategy or AugmentationStrategy.
data_augmentation_should_anonymize_original_table – SENSITIVE: Whether to anonymize the original table during data augmentation. Default is True.
processors – List of processor parameter objects (subclasses of AvatarizationProcessorParameters). Supports InterRecordRangeDifferenceParameters and RelativeDifferenceParameters. The order of the list determines the order in which processors are applied during preprocessing (and reversed during postprocessing). The transformations are transparent to the user: input and output have the same column structure at the end.
pseudonymized_columns – A mapping of column name to PseudonymizationColumnConfig describing the PII type and pseudonymization strategy to apply to each column. Only columns that should be pseudonymized need to be listed. Foreign key columns in child tables automatically inherit the pseudonymization mapping of the referenced parent primary key, so no explicit configuration is needed on child FK columns.
use_excluded_variables_in_metrics – When True, excluded variables are NOT passed to metrics parameters, allowing privacy and signal metrics to include them in calculations. When False (default), excluded variables are also excluded from metrics calculations.
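The ordering rule for processors (list order during preprocessing, reverse order during postprocessing) can be illustrated with a toy pipeline. The Scale and Shift classes below are hypothetical stand-ins, not the library's processor classes; they only show why undoing the steps in reverse order restores the original values.

```python
from dataclasses import dataclass


@dataclass
class Scale:
    """Hypothetical processor: multiply on the way in, divide on the way out."""
    factor: float

    def preprocess(self, values):
        return [v * self.factor for v in values]

    def postprocess(self, values):
        return [v / self.factor for v in values]


@dataclass
class Shift:
    """Hypothetical processor: add an offset on the way in, subtract it on the way out."""
    offset: float

    def preprocess(self, values):
        return [v + self.offset for v in values]

    def postprocess(self, values):
        return [v - self.offset for v in values]


def run_pipeline(values, processors):
    # Preprocessing: apply processors in list order.
    for processor in processors:
        values = processor.preprocess(values)
    transformed = values
    # (The anonymization step would sit here; it is omitted in this toy example.)
    # Postprocessing: undo the processors in reverse order.
    for processor in reversed(processors):
        values = processor.postprocess(values)
    return transformed, values


transformed, restored = run_pipeline([1.0, 2.0, 3.0], [Scale(2.0), Shift(10.0)])
print(transformed)
print(restored)
```

Undoing Shift before Scale is what brings the values back; applying the inverse steps in the original order would not.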
- update_parameters(table_name: str, **kwargs) -> None
Update specific parameters for the table. Only the parameters that are provided are changed; all other existing values are kept.
- Parameters:
table_name – The name of the table.
**kwargs – The parameters to update. Only parameters that are provided will be updated. See set_parameters for the full list of available parameters.
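The difference between set_parameters (overwrite) and update_parameters (merge) can be modelled with plain dictionaries. The two helper functions below are hypothetical stand-ins that mimic the documented behaviour; they are not the Runner's implementation.

```python
def overwrite_parameters(store, table_name, **params):
    """Model of set_parameters: previous parameters for the table are discarded."""
    store[table_name] = dict(params)


def merge_parameters(store, table_name, **params):
    """Model of update_parameters: only the provided keys change."""
    store.setdefault(table_name, {}).update(params)


store = {}
overwrite_parameters(store, "wbcd", k=15, ncp=5)

merge_parameters(store, "wbcd", k=20)  # ncp is preserved
after_update = dict(store["wbcd"])

overwrite_parameters(store, "wbcd", k=10)  # ncp is dropped
after_set = dict(store["wbcd"])

print(after_update)
print(after_set)
```

In practice this suggests calling update_parameters after advise_parameters when only a few advised values need adjusting.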
- delete_parameters(table_name: str, parameters_names: list[str] | None = None)
Delete parameters from the config.
- Parameters:
table_name – The name of the table.
parameters_names – The names of the parameters to delete. If None, all parameters will be deleted.
- delete_link(parent_table_name: str, child_table_name: str)
Delete a link from the config.
- Parameters:
parent_table_name – The name of the parent table.
child_table_name – The name of the child table.
- delete_table(table_name: str)
Delete a table from the config.
- Parameters:
table_name – The name of the table.
- get_yaml(path: str | None = None)
Get the yaml config.
- Parameters:
path – The path to the yaml file. If None, the default config will be returned.
- run(jobs_to_run: list[JobKind] = [JobKind.standard, JobKind.signal_metrics, JobKind.privacy_metrics, JobKind.report], ignore_warnings: bool = False)
Run avatarization jobs.
This method creates resources and launches the specified jobs in the correct order. If this runner has existing results from a previous run, a UserWarning is emitted with the old set_name so you can recover previous results if needed. Local state is then cleared and a new run starts.
- Parameters:
jobs_to_run (list[JobKind]) – List of job types to run. Defaults to all jobs in execution order.
ignore_warnings (bool) – Whether to ignore warnings about existing results when re-running a runner. Defaults to False.
Examples
>>> runner = manager.create_runner("my_dataset")
>>> runner.add_table("customers", data=df)
>>> runner.run()
>>>
>>> # Running again emits a UserWarning with the old set_name, then proceeds
>>> runner.run()
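Because a re-run reports the previous set through a UserWarning, it can be useful to capture that warning and keep its message. The sketch below uses only the standard warnings module; the rerun function is a hypothetical stand-in for a second runner.run() call, and the message text is made up for illustration.

```python
import warnings


def rerun(previous_set_name):
    """Stand-in for a second runner.run() call; the message wording is invented."""
    warnings.warn(
        f"Existing results from set '{previous_set_name}' will be cleared.",
        UserWarning,
    )
    return "started"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    status = rerun("demo-set")

# Keep the messages so the old set name can be logged before local state is cleared.
messages = [str(w.message) for w in caught if issubclass(w.category, UserWarning)]
print(status, messages)
```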
- get_status(job_name: JobKind)
Get the status of a job by name.
- Parameters:
job_name – The name of the job to get.
- delete(job_names: JobKind | str | list[JobKind | str] | None = None) -> BulkDeleteResponse
Delete one or more jobs launched by this runner.
When called without arguments, every job launched by this runner is deleted.
- Parameters:
job_names – A single job kind/name, a list of job kinds/names, or None to delete all launched jobs.
- Returns:
Response containing deleted and failed jobs.
- Return type:
BulkDeleteResponse
Examples
>>> runner.delete()  # all jobs
>>> runner.delete(JobKind.standard)  # single job
>>> runner.delete([JobKind.standard, JobKind.privacy_metrics])  # multiple jobs
- get_specific_result(table_name: str, job_name: JobKind, result: Results = Results.SHUFFLED) -> TypeResults
- get_all_results()
Get all results.
- Returns:
dict – A dictionary with the results of each job on every table.
Each job is a dictionary with the table name as key and the results as value.
The results are a dictionary with the result name as key and the data as value.
The data can be a pandas DataFrame or a dictionary depending on the result type.
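The nested job, table, result layout described above can be walked with three loops. This is a sketch under assumptions: the job, table, and result keys in the example dictionary are placeholders, and real values would be DataFrames or dictionaries rather than strings.

```python
def iter_results(all_results):
    """Yield (job, table, result_name, data) from a job -> table -> result mapping."""
    for job_name, tables in all_results.items():
        for table_name, results in tables.items():
            for result_name, data in results.items():
                yield job_name, table_name, result_name, data


# Placeholder structure only; keys and values are hypothetical.
example = {
    "standard": {"wbcd": {"shuffled": "<DataFrame>"}},
    "privacy_metrics": {"wbcd": {"metrics": {"placeholder": 0}}},
}

rows = list(iter_results(example))
for job_name, table_name, result_name, data in rows:
    print(job_name, table_name, result_name, type(data).__name__)
```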
- download_report(path: str | None = None, report_type: ReportType = ReportType.BASIC)
Download the report.
- Parameters:
path – The path to save the report. For a single report this is used as-is. When multiple PIA reports are returned (one per table), an index prefix is added to the filename only: dir/0_report.pdf, dir/1_report.pdf, etc.
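The naming rule for multiple reports can be reproduced with pathlib. The helper below is a hypothetical illustration of the rule as documented (index prefix on the filename only, directory untouched), not the library's own code.

```python
from pathlib import PurePosixPath


def indexed_report_paths(path, n_reports):
    """Return target paths for n_reports, prefixing the filename when there are several."""
    target = PurePosixPath(path)
    if n_reports == 1:
        # A single report keeps the path exactly as given.
        return [str(target)]
    # Several reports: prefix only the filename, keep the directory.
    return [str(target.with_name(f"{i}_{target.name}")) for i in range(n_reports)]


paths = indexed_report_paths("dir/report.pdf", 2)
print(paths)
```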
- print_parameters(table_name: str | None = None) -> None
Print the parameters for a table.
- Parameters:
table_name – The name of the table. If None, all parameters will be printed.
- kill()
Method not implemented yet.
- shuffled(table_name: str) -> DataFrame
Get the shuffled data.
- Parameters:
table_name – The name of the table to get the shuffled data from.
- Returns:
The shuffled data as a pandas DataFrame.
- Return type:
pd.DataFrame
- sensitive_unshuffled(table_name: str) -> DataFrame
Get the unshuffled data. This is sensitive data and should be used with caution.
- Parameters:
table_name – The name of the table to get the unshuffled data from.
- Returns:
The unshuffled data as a pandas DataFrame.
- Return type:
pd.DataFrame
- privacy_metrics(table_name: str) -> list[dict]
Get the privacy metrics.
- Parameters:
table_name – The name of the table to get the privacy metrics from.
- Returns:
The privacy metrics as a list of dictionaries.
- Return type:
list[dict]
- signal_metrics(table_name: str) -> list[dict]
Get the signal metrics.
- Parameters:
table_name – The name of the table to get the signal metrics from.
- Returns:
The signal metrics as a list of dictionaries.
- Return type:
list[dict]
- render_privacy_metrics_summary(open_in_browser: bool = False) -> dict
Get the aggregated privacy metrics summary across tables.
Only available for multi-table jobs.
- Parameters:
open_in_browser – Whether to save the summary to a file and open it in a browser. Deprecated and will be removed in the future, as the summary is now returned as a dict instead of an HTML file.
- Returns:
A nested dict {table_name: {reference: meta_metric}}.
- Return type:
dict
- render_signal_metrics_summary() -> dict
Get the aggregated signal metrics summary across tables.
Only available for multi-table jobs.
- Returns:
A nested dict {table_name: {reference: meta_metric}}.
- Return type:
dict
- metrics_summary() -> DataFrame
Get the combined privacy and signal metrics summary across tables as a DataFrame.
Only available for multi-table jobs.
- Returns:
A DataFrame indexed by table_name with MultiIndex columns (reference, metric), where reference is the top level and metric is privacy or signal, combining both meta-metrics for each table/reference pair.
- Return type:
pd.DataFrame
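The relationship between the two nested summaries and the combined (reference, metric) columns can be sketched with plain dictionaries. The function below is a hypothetical illustration of the layout, not the library's implementation; the table name, reference name, and values are placeholders.

```python
def combine_summaries(privacy, signal):
    """Merge two {table: {reference: value}} dicts into (reference, metric) columns."""
    combined = {}
    for metric, summary in (("privacy", privacy), ("signal", signal)):
        for table_name, by_reference in summary.items():
            row = combined.setdefault(table_name, {})
            for reference, value in by_reference.items():
                # The column key mirrors the two-level (reference, metric) layout.
                row[(reference, metric)] = value
    return combined


# Placeholder values only; names are hypothetical.
privacy = {"visit": {"patient": 0.9}}
signal = {"visit": {"patient": 0.8}}

combined = combine_summaries(privacy, signal)
print(combined)
```

Each outer key plays the role of a row label (table_name), and each tuple key plays the role of one two-level column.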
- render_plot(table_name: str, plot_kind: PlotKind, open_in_browser: bool = False)
Render a plot for a given table. The different plot kinds are defined in the PlotKind enum.
- Parameters:
table_name – The name of the table to get the plot from.
plot_kind – The kind of plot to render.
open_in_browser – Whether to save the plot to a file and open it in a browser.
- projections(table_name: str) -> tuple[DataFrame, DataFrame]
Get the projections.
- Parameters:
table_name – The name of the table to get the projections from.
- Returns:
The projections as a tuple of two pandas DataFrames.
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
- table_summary(table_name: str) -> DataFrame
Get the table summary.
- Parameters:
table_name – The name of the table to get the summary from.
- Returns:
The table summary as a dataframe.
- Return type:
pd.DataFrame
- from_yaml(yaml_path: str) -> None
Load configuration from a YAML file.
This replaces the current runner’s configuration with the one from the file.
Note: Table data files are not automatically loaded. Use upload_file() to provide data before running jobs.
- Parameters:
yaml_path (str) – The path to the YAML configuration file.
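Since from_yaml restores only the configuration, a reload typically has to be followed by upload_file for each table before run. The sketch below shows that order; the helper name, the YAML path, and the data file mapping are hypothetical, and a MagicMock stands in for a real runner so the snippet executes without a server.

```python
from unittest.mock import MagicMock


def restore_and_run(runner, yaml_path, data_files):
    """Reload a saved config, re-attach the data, then launch the jobs."""
    runner.from_yaml(yaml_path)
    # from_yaml does not load table data, so each table is uploaded again.
    for table_name, path in data_files.items():
        runner.upload_file(table_name, path)
    runner.run()


# Dry run against a mock; in practice the runner comes from manager.create_runner(...).
mock_runner = MagicMock()
restore_and_run(mock_runner, "config.yaml", {"wbcd": "fixtures/wbcd.csv"})
call_names = [call[0] for call in mock_runner.method_calls]
print(call_names)
```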