User guide

How to set up your email account

This section only applies if email-based login is enabled in the global configuration.

At the moment, you have to get in touch with your Octopize contact so that they can create your account.

Our current email provider is AWS. AWS must verify an email address before our platform can send emails to it.

You’ll thus receive an email from AWS asking you to verify your email address by clicking on a link. Once you have done so, you can follow the steps in the section about resetting your password.

How to reset your password

NB: This section only applies if email-based login is enabled in the global configuration, which is not the case by default.

If you forgot your password or need to set one, first call the forgotten_password endpoint:

from avatars.manager import Manager

manager = Manager()
manager.forgotten_password("yourmail@mail.com")

You’ll then receive an email containing a token. This token is only valid once, and expires after 24 hours. Use it to reset your password:

from avatars.manager import Manager

manager = Manager()
manager.reset_password(
    "yourmail@mail.com",
    "new_password",
    "new_password",  # the new password is entered twice (confirmation)
    "token-received-by-mail",
)

You’ll receive an email confirming your password was reset.

How to log in to the server

Avatar supports two authentication methods that are mutually exclusive:

Username/Password Authentication

⚠️ Deprecated — Username/password authentication is deprecated and will be removed in a future release. It is still functional for now, but you will receive a deprecation warning on each successful login. Please migrate to API key authentication as soon as possible. See How to create an API key for migration steps.

import os
from avatars.manager import Manager

manager = Manager()
manager.authenticate(
    username=os.environ.get("AVATAR_USERNAME"),
    password=os.environ.get("AVATAR_PASSWORD"),
)
# ⚠️ A DeprecationWarning will be emitted. See below for how to create an API key.

API Key Authentication

The recommended authentication method. If you have an API key, pass it directly — no call to authenticate() is needed:

import os
from avatars.manager import Manager

manager = Manager(api_key=os.environ.get("AVATAR_API_KEY"))
# Ready to use immediately.

Note: API key and username/password authentication are mutually exclusive. Choose one method based on your credentials.
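During migration you may have scripts that carry either kind of credential. One way to decide at runtime is to check which environment variables are set. The helper below is purely illustrative and is not part of the avatars API:

```python
import os

def pick_auth_method(env=None):
    """Return which Avatar authentication method the environment supports."""
    env = os.environ if env is None else env
    if env.get("AVATAR_API_KEY"):
        return "api_key"  # recommended: pass api_key= to Manager()
    if env.get("AVATAR_USERNAME") and env.get("AVATAR_PASSWORD"):
        return "password"  # deprecated: call manager.authenticate()
    raise RuntimeError("No Avatar credentials found in the environment.")

print(pick_auth_method({"AVATAR_API_KEY": "secret"}))  # api_key
```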

How to create an API key

API keys are the recommended way to authenticate with Avatar. They are long-lived credentials that do not require you to store your password in scripts or notebooks.

Prerequisites: You must be logged in with a username/password session to create your first API key (you need the session to call the creation endpoint). Once you have an API key you will never need to log in with a password again.

Step 1 — Log in once with your password:

import os
from avatars.manager import Manager

manager = Manager()
manager.authenticate(
    username=os.environ.get("AVATAR_USERNAME"),
    password=os.environ.get("AVATAR_PASSWORD"),
    should_verify_compatibility=False,
)

Step 2 — Create an API key:

from avatars.models import CreateApiKeyRequest, ExpirationDays

api_key_response = manager.auth_client.api_keys.create_api_key(
    CreateApiKeyRequest(
        name="my-key", expiration_days=ExpirationDays.integer_365
    )
)
# ⚠️  Save this value — it is shown only once and cannot be retrieved
# later! If you lose it, you can simply create a new API key by
# repeating steps 1 and 2.
print(api_key_response.get('api_key').get('plaintext'))

Step 3 — Store the key securely:

Add the key to your .env file or set it as an environment variable:

AVATAR_API_KEY=<your-api-key>
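You can sanity-check that the key is actually visible to your scripts before using it. The helper below is an illustrative sketch (if you keep the key in a .env file, you also need a loader such as python-dotenv, which is not shown here):

```python
import os

def load_api_key(env=None):
    """Return the API key from the environment, failing loudly if it is missing."""
    env = os.environ if env is None else env
    api_key = env.get("AVATAR_API_KEY")
    if not api_key:
        raise RuntimeError(
            "AVATAR_API_KEY is not set: export it or load your .env file first."
        )
    return api_key

# Explicit mapping for demonstration; in real use, just call load_api_key().
print(load_api_key({"AVATAR_API_KEY": "my-secret-key"}))  # my-secret-key
```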

Step 4 — Use the API key for all future sessions:

import os
from avatars.manager import Manager

manager = Manager(api_key=os.environ.get("AVATAR_API_KEY"))
# No authenticate() call needed — the manager is ready to use.

How to launch my first avatarization

When using Avatar, you will interact with an object called runner. This object serves as the interface for managing the avatarization process. With the runner, you can upload your datasets, configure parameters, execute the avatarization, and retrieve the results.

import secrets
runner = manager.create_runner(set_name=f"test_wbcd_{secrets.token_hex(4)}")
runner.add_table("wbcd", "fixtures/wbcd.csv") # upload the data
runner.set_parameters("wbcd", k=15) # choose parameters
runner.run() # execute the avatarization

How to handle a large dataset

Due to server limits, you can be constrained by both the number of rows and the number of dimensions in your dataset.

Handling a large number of rows

If your dataset contains a large number of rows, it will automatically be split into batches, and each batch will be anonymized independently of the others. The batches are then merged back together, so that the final dataset is the result of anonymizing the whole dataset.
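The split-anonymize-merge flow can be sketched in a few lines of plain Python. This is a conceptual illustration of the batching behaviour only, not the engine’s actual implementation, and anonymize_batch is a hypothetical placeholder:

```python
def batch_process(rows, batch_size, anonymize_batch):
    """Split rows into batches, process each batch independently, merge back."""
    merged = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]  # one independent batch
        merged.extend(anonymize_batch(batch))   # anonymized on its own
    return merged                               # final, whole dataset

# Toy stand-in for anonymization: double every value.
rows = list(range(10))
result = batch_process(rows, batch_size=4, anonymize_batch=lambda b: [x * 2 for x in b])
print(len(result) == len(rows))  # True: merging preserves the row count
```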

Handling a large number of dimensions

The number of dimensions is the number of continuous variables plus the number of modalities across categorical variables. The dimension limit is most often reached because one or a few categorical variables have a large number of modalities (high-cardinality variables).
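Under this definition, you can estimate the dimension count of your dataset before uploading it. The sketch below is illustrative; the engine’s exact accounting may differ:

```python
def count_dimensions(n_continuous, modalities_per_categorical):
    """Dimensions = continuous variables + total modalities of categorical ones."""
    return n_continuous + sum(modalities_per_categorical.values())

# Two continuous columns plus a high-cardinality 'city' column:
dims = count_dimensions(2, {"city": 500, "gender": 2})
print(dims)  # 504: dominated by the high-cardinality variable
```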

There are several solutions to bypass this limitation:

  • Encode the categorical variable into a continuous variable (frequency encoding, target encoding, …).

  • Reduce the number of modalities by grouping some into more general modalities. You can use the processor GroupModalities.

  • Use the argument use_categorical_reduction.

Setting use_categorical_reduction=True reduces the dimension of the categorical variables by encoding them as vectors, using the cat2vec word embedding. Note that this may reduce the utility of your dataset.
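As an illustration of the first option, frequency encoding replaces each modality with its relative frequency in the column, turning a categorical variable into a single continuous one. This is a plain-Python sketch; in practice you would typically use a dataframe or encoding library:

```python
from collections import Counter

def frequency_encode(values):
    """Map each category to its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

cities = ["Paris", "Nantes", "Paris", "Lyon", "Paris"]
print(frequency_encode(cities))  # [0.6, 0.2, 0.6, 0.2, 0.6]
```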

Understanding errors

Most of your actions will have a successful outcome. However, errors sometimes occur, and this section explains the kinds of errors that can happen and how to correct them.

  1. Timeout("The call timed out. Consider increasing the timeout with the 'timeout' parameter.")

    You’ll encounter this error when a call takes too long to complete on the server, most often during job execution or dataset upload/download. We encourage you to read the `handling timeouts <#handling-timeouts>`__ section to deal with these kinds of errors.

  2. Validation errors

    Validation errors happen due to bad user input. Our error messages rely heavily on HTTP status codes. In short, codes in the 400-499 range are user errors, and codes in the 500-599 range are server errors. More on those later.

    Here we’ll cover user errors, which you can remedy by modifying your parameters and trying again. The error message will always have the following form:

    Got error in HTTP request: POST https://company.octopize.app/reports.
    Error status 400 - privacy_metrics job status is not success:
    JobStatus.FAILURE
    

    You’ll have:

    • the HTTP request method (POST, GET, etc…)

    • the endpoint that was affected (/reports)

    • the status (400)

    • an informational message that details the exact error that is happening (privacy_metrics job status is not success: JobStatus.FAILURE)

    In this particular case, the user is calling the /reports endpoint to generate a report. Generating a report requires a successful privacy metrics job, whose metrics the report displays. Here, however, the privacy job ended in the JobStatus.FAILURE state. The fix is to look at the error message the privacy job produced, launch a new privacy job, and, once it succeeds, generate the report with it.

  3. JobStatus.FAILURE

    Jobs that fail do not raise an exception. Instead, you have to inspect the JobStatus held in the status property.

    job = runner.get_job(JobKind.standard)
    print(job.status)  # JobStatus.FAILURE
    print(job.exception)
    

    If the status is JobStatus.FAILURE, the exception property will contain an explanation of the error. You’ll then have to relaunch the job with the appropriate modifications to your input.

  4. Internal error

    Internal errors happen when there is an error on the server, meaning that we did not handle the error on our side, and something unexpected happened, for which we cannot give you an exact error message. These come with a 500 HTTP status code, and the message is internal error. In these cases, there is not much you can do except trying again with different parameters, hoping to not trigger the error again.

    When these happen, our error monitoring software catches these and notifies us instantly. You can reach out to your Octopize contact (support@octopize.io) for more information and help for troubleshooting, while we investigate on our side. We’ll be hard at work trying to resolve the bug, and push out a new version with the fix.
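If you want to react to these errors in code, the four fields described in the validation-error example above (method, endpoint, status, message) can be extracted with a regular expression. This is an illustrative sketch only; the exact message format may change between versions:

```python
import re

message = (
    "Got error in HTTP request: POST https://company.octopize.app/reports. "
    "Error status 400 - privacy_metrics job status is not success: "
    "JobStatus.FAILURE"
)

pattern = re.compile(
    r"HTTP request: (?P<method>\w+) (?P<url>\S+)\. "
    r"Error status (?P<status>\d+) - (?P<detail>.+)"
)
match = pattern.search(message)
print(match["method"])                    # POST
print(match["status"])                    # 400
print(400 <= int(match["status"]) < 500)  # True: a user error, fix input and retry
```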

Handling timeouts

Asynchronous calls

A lot of endpoints of the Avatar API are asynchronous: you request something that runs in the background, and you retrieve the result after some time using another method (e.g. runner.get_all_results for runner.run).

The default timeout for most calls to the engine is only a few seconds long. You will quite quickly reach a point where a job on the server takes longer than that to run.

Because the calls are asynchronous, you don’t need to sit and wait for the job to finish: you can simply take a break, come back after some time, and run the method requesting the result again.

Example:

from avatars.models import JobKind
job = runner.run(
    jobs_to_run=[
        JobKind.standard,
        JobKind.privacy_metrics,
        JobKind.signal_metrics
    ]
)

print(job)
# Take a coffee break, close the script, come back in 3 minutes

finished_job = runner.get_all_results()

print(finished_job)  # JobStatus.success

How to view my last jobs

A user can view all the jobs that they created. Jobs are listed by creation date. Job attributes such as the job ID, creation date, and status can be used to manage them (for example, to delete a job).

manager.get_last_jobs(count=10)  # get the last 10 jobs

To retrieve a summary of the last results instead:

manager.get_last_results(count=10)  # get the last 10 results

It is also possible to view the last jobs and results in the web application.

How to reload results from a previous run

If you have already run an avatarization and want to retrieve its results in a new session, use create_runner_from_name with the name you originally gave the run. This is also how you reload a job that was launched from the web application.

runner = manager.create_runner_from_name("runner_name")
df = runner.shuffled("wbcd")
metrics = runner.privacy_metrics("wbcd")

Advanced: load by UUID

If you know the exact UUID of a run (from runner.set_name), you can load it directly using create_runner_from_id:

runner = manager.create_runner_from_id(
    "550e8400-e29b-41d4-a716-446655440000"
)
df = runner.shuffled("wbcd")

If multiple runs share the same name, the most recent one is used automatically by create_runner_from_name. To inspect a specific version, use find_ids_by_name:

results = manager.find_ids_by_name("runner_name")
for set_name_id, jobs in results:
    print(set_name_id, [j.kind for j in jobs])

runner = manager.create_runner_from_id(results[5][0])
df = runner.shuffled("wbcd")

Re-running from a loaded runner

A loaded runner is primarily for accessing existing results. When run() is called on a runner that already has results, a warning is emitted with the old UUID so you can always recover the previous results:

runner = manager.create_runner_from_name("my_dataset")
df = runner.shuffled("wbcd")  # access old results

# To re-run :
runner.run(ignore_warnings=True)
# UserWarning: This runner already has results for set_name
# 'abc-123...'
# To access previous results later:
#   manager.create_runner_from_id('abc-123...')

How to delete jobs

Jobs can be deleted individually or in bulk using either the Runner or the Manager.

Delete one or more jobs from the current runner (after running it):

runner.run()
runner.delete()  # all jobs
runner.delete(
    [JobKind.standard, JobKind.privacy_metrics]
)  # multiple jobs

Delete a run by name using the manager. If multiple runs share the same name, a ValueError is raised with the commands to disambiguate:

manager.delete_job("my_dataset")

How to launch a job from a yaml

A user can launch a job using a YAML configuration file. If you are using the web application to run jobs, you can download a job’s configuration and reuse it with the Python client. This approach is particularly helpful when iterating on the anonymization process.

Here is an example Python script:

import secrets

job_name = "from_yaml_" + secrets.token_hex(4)
runner = manager.create_runner(job_name)
runner.from_yaml("fixtures/yaml_from_web.yaml")
# If needed, re-upload your data for each table: the data may have been
# deleted from the server if the job was launched long enough ago.
runner.upload_file("iris", data="fixtures/iris.csv")
runner.run()

How to change variables types

Sometimes, it is helpful to change the type of your variables. For instance, a numeric variable might only contain a few unique values, making it more appropriate to treat it as a categorical variable. This can optimize utility performance in your avatarization.

from avatar_yaml.models.schema import ColumnType
runner.add_table(
    "wbcd",
    data="fixtures/wbcd.csv",
    types={"Clump_Thickness": ColumnType.CATEGORY}
)

Alternatively, use pandas to do it:

import pandas as pd

df = pd.read_csv("fixtures/wbcd.csv")
df["Clump_Thickness"] = df["Clump_Thickness"].astype("string")
runner.add_table("wbcd", data=df)

How to launch an avatarization using differential privacy

You can use differential privacy in the avatarization pipeline.

runner.set_parameters("wbcd", open_dp_epsilon=10)

How to launch metrics independently

Data quality evaluation can be performed independently of the anonymization process by running only the metrics jobs on both the original and the anonymized datasets. For an accurate assessment, the data should not be shuffled.

For more details about our metrics, refer to our public documentation.

import secrets

from avatars.models import JobKind

job_name = "only_metrics" + secrets.token_hex(4)
runner = manager.create_runner(job_name)
# Original and anonymized data should be in the same order
runner.add_table(
    "iris",
    data="fixtures/iris.csv",
    avatar_data="fixtures/iris_avatarized.csv"
)
runner.set_parameters("iris", k=10)
runner.run(
    jobs_to_run=[JobKind.privacy_metrics, JobKind.signal_metrics]
)

How to render plots

You can generate plots to assess how well utility is preserved during the avatarization process, using four levels of analysis:

  • Univariate: Compare distributions with PlotKind.DISTRIBUTION or review mean and standard deviation summaries for the first 10 columns using PlotKind.AGGREGATE_STATS.

  • Bivariate: Examine correlations with PlotKind.CORRELATION and analyze correlation differences with PlotKind.CORRELATION_DIFFERENCE.

  • Multivariate: Visualize data projections using PlotKind.PROJECTION_2D and PlotKind.PROJECTION_3D.

  • Data structure: Explore variable contributions within the model using PlotKind.CONTRIBUTION.

Visualizations are a great way to fine-tune the parameters and understand your results.

To render a plot, simply use:

runner.render_plot("iris", PlotKind.PROJECTION_3D)

If you encounter issues displaying plots directly in your notebook (for example, in VS Code), or if you prefer to download the plot as an HTML file, you can use the open_in_browser=True parameter. This will save the plot as an HTML file and open it in your default web browser:

runner.render_plot("iris", PlotKind.PROJECTION_3D, open_in_browser=True)

How to create a PIA report

When running the avatarization, you can generate two reports: a technical report and a PIA (Privacy Impact Assessment) report. The PIA report helps determine whether the residual risk is compatible with the intended use of the data and with its level of sensitivity. While the technical evaluation quantifies the re-identification risk associated with potential attacks, the risk analysis report evaluates the likelihood of such attacks occurring and the resources an adversary would need to carry them out.

The PIA report can be customized with information about your dataset and the avatarization context:

  • SensitivityLevel: indicates the sensitivity level of the data being processed. The sensitivity increases when the data are personal or sensitive.

  • DataType: indicates the type of data being processed (e.g., health data, financial data, etc.). Different data types may have different privacy implications and regulatory requirements.

  • DataSubject: identifies the individual or group to whom the data relates, helping to contextualize privacy considerations.

  • DataRecipient: describes the intended recipients of the data, which can influence the assessment of privacy risks and necessary safeguards.

You can set these parameters when creating the runner:

from avatar_yaml.models.avatar_metadata import (
    SensitivityLevel,
    DataType,
    DataSubject,
    DataRecipient
)

runner = manager.create_runner(
    set_name="test_pia_report",
    pia_data_recipient=DataRecipient.OPENDATA,
    pia_data_type=DataType.HEALTH,
    pia_data_subject=DataSubject.PATIENTS,
    pia_sensitivity_level=SensitivityLevel.HIGH,
)

You can then download the PIA report as a DOCX file:

from avatar_yaml.models.parameters import ReportType
runner.download_report("pia_report.docx", report_type=ReportType.PIA)

How to set the report language

You can configure the language of the generated report. Two languages are available:

  • ReportLanguage.EN → English (default)

  • ReportLanguage.FR → French

When creating a runner, you can specify the report_language parameter:

from avatars.manager import Manager
from avatar_yaml.models.parameters import ReportLanguage

manager = Manager(...)
runner = manager.create_runner(
    set_name="my_job",
    report_language=ReportLanguage.FR  # or ReportLanguage.EN
)