.. DO NOT EDIT: auto-generated from doc/source/user_guide.md — edit the .md source and run `just format-doc`.

User guide
==========

- `User guide <#user-guide>`__

  - `How to setup your email account <#how-to-setup-your-email-account>`__
  - `How to reset your password <#how-to-reset-your-password>`__
  - `How to log in to the server <#how-to-log-in-to-the-server>`__

    - `Username/Password Authentication <#usernamepassword-authentication>`__
    - `API Key Authentication <#api-key-authentication>`__

  - `How to create an API key <#how-to-create-an-api-key>`__
  - `How to launch my first avatarization <#how-to-launch-my-first-avatarization>`__
  - `How to handle a large dataset <#how-to-handle-a-large-dataset>`__

    - `Handle large amount of rows <#handle-large-amount-of-rows>`__
    - `Handle large amount of dimensions <#handle-large-amount-of-dimensions>`__

  - `Understanding errors <#understanding-errors>`__
  - `Handling timeouts <#handling-timeouts>`__

    - `Asynchronous calls <#asynchronous-calls>`__

  - `How to view my last jobs <#how-to-view-my-last-jobs>`__
  - `How to reload results from a previous run <#how-to-reload-results-from-a-previous-run>`__
  - `How to delete jobs <#how-to-delete-jobs>`__
  - `How to launch a job from a yaml <#how-to-launch-a-job-from-a-yaml>`__
  - `How to change variables types <#how-to-change-variables-types>`__
  - `How to launch an avatarization using differential privacy <#how-to-launch-an-avatarization-using-differential-privacy>`__
  - `How to launch metrics independently <#how-to-launch-metrics-independently>`__
  - `How to render plots <#how-to-render-plots>`__
  - `How to create a PIA report <#how-to-create-a-pia-report>`__

How to setup your email account
-------------------------------

*This section is only needed if the use of emails to login is activated in the global configuration.*

At the moment, you have to get in touch with your Octopize contact so that they can create your account.

Our current email provider is AWS.
They need to verify an email address before our platform can send emails to it. You’ll thus get an email from AWS asking you to verify your email address by clicking on a link.

Once you have verified your email address, you can follow the steps in the section on `resetting your password <#how-to-reset-your-password>`__.

How to reset your password
--------------------------

**NB**: This section only applies if the use of emails to log in is activated in the global configuration. This is not the case by default.

If you forgot your password or need to set one, first call the ``forgotten_password`` endpoint:

.. code:: python

   from avatars.manager import Manager

   manager = Manager()
   manager.forgotten_password("yourmail@mail.com")

You’ll then receive an email containing a token. This token can only be used once and expires after 24 hours. Use it to reset your password:

.. code:: python

   from avatars.manager import Manager

   manager = Manager()
   manager.reset_password("yourmail@mail.com", "new_password", "new_password", "token-received-by-mail")

You’ll receive an email confirming that your password was reset.

How to log in to the server
---------------------------

Avatar supports two mutually exclusive authentication methods:

Username/Password Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**⚠️ Deprecated**: Username/password authentication is deprecated and will be removed in a future release. It still works for now, but you will receive a deprecation warning on each successful login. Please migrate to `API key authentication <#api-key-authentication>`__ as soon as possible. See `How to create an API key <#how-to-create-an-api-key>`__ for migration steps.

.. code:: python

   import os

   from avatars.manager import Manager

   manager = Manager()
   manager.authenticate(
       username=os.environ.get("AVATAR_USERNAME"),
       password=os.environ.get("AVATAR_PASSWORD"),
   )
   # ⚠️ A DeprecationWarning will be emitted. See below for how to create an API key.

API Key Authentication
~~~~~~~~~~~~~~~~~~~~~~

This is the recommended authentication method. If you have an API key, pass it directly to ``Manager``; no call to ``authenticate()`` is needed:

.. code:: python

   import os

   from avatars.manager import Manager

   manager = Manager(api_key=os.environ.get("AVATAR_API_KEY"))
   # Ready to use immediately.

**Note:** API key and username/password authentication are mutually exclusive. Choose one method based on your credentials.

How to create an API key
------------------------

API keys are the recommended way to authenticate with Avatar. They are long-lived credentials that do not require you to store your password in scripts or notebooks.

**Prerequisites:** You must be logged in with a username/password session to create your first API key (you need the session to call the creation endpoint). Once you have an API key, you can authenticate without your password.

**Step 1. Log in once with your password:**

.. code:: python

   import os

   from avatars.manager import Manager

   manager = Manager()
   manager.authenticate(
       username=os.environ.get("AVATAR_USERNAME"),
       password=os.environ.get("AVATAR_PASSWORD"),
       should_verify_compatibility=False,
   )

**Step 2. Create an API key:**

.. code:: python

   from avatars.models import CreateApiKeyRequest, ExpirationDays

   api_key_response = manager.auth_client.api_keys.create_api_key(
       CreateApiKeyRequest(name="my-key", expiration_days=ExpirationDays.integer_365)
   )
   print(api_key_response.get('api_key').get('plaintext'))
   # ⚠️ Save this value: it is shown only once and cannot be retrieved later!
   # If you lose it, you can simply create a new API key by repeating steps 1 and 2.

**Step 3. Store the key securely:**

Add the key to your ``.env`` file or set it as an environment variable:

.. code:: bash

   AVATAR_API_KEY=

**Step 4. Use the API key for all future sessions:**

.. code:: python

   import os

   from avatars.manager import Manager

   manager = Manager(api_key=os.environ.get("AVATAR_API_KEY"))
   # No authenticate() call needed: the manager is ready to use.

How to launch my first avatarization
------------------------------------

When using Avatar, you interact with an object called ``runner``. This object serves as the interface for managing the avatarization process. With the ``runner``, you can upload your datasets, configure parameters, execute the avatarization, and retrieve the results.

.. code:: python

   import secrets

   runner = manager.create_runner(set_name=f"test_wbcd_{secrets.token_hex(4)}")
   runner.add_table("wbcd", "fixtures/wbcd.csv")  # upload the data
   runner.set_parameters("wbcd", k=15)  # choose parameters
   runner.run()  # execute the avatarization

How to handle a large dataset
-----------------------------

Because of server limits, you may be constrained by the number of rows and the number of dimensions of your dataset.

Handle large amount of rows
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your dataset contains a large number of rows, it is automatically split into batches, and each batch is anonymized independently of the others. The batches are then merged back, so that the final dataset is the result of anonymizing the whole dataset.

Handle large amount of dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The number of dimensions is the number of continuous variables plus the number of modalities in categorical variables. The dimension limit is most often reached because one or several categorical variables have a large number of modalities (high-cardinality variables).

There are several ways to work around this limitation:

- Encode the categorical variable into a continuous variable (frequency encoding, target encoding, …).
- Reduce the number of modalities by grouping some into more general modalities. You can use the processor GroupModalities.
- Use the argument ``use_categorical_reduction``.

The parameter ``use_categorical_reduction=True`` reduces the dimension of the categorical variables by encoding them as vectors. This step uses the word embedding cat2vec. This solution could reduce the utility of your dataset.

Understanding errors
--------------------

Most of your actions will have a successful outcome. However, errors sometimes happen. This section explains the kinds of errors you may encounter and how to correct them.

1. ``Timeout("The call timed out. Consider increasing the timeout with the 'timeout' parameter.")``

   You’ll encounter this error when a call takes too long to complete on the server. Most of the time, this happens during job execution or dataset upload/download.

   We encourage you to read the `handling timeouts <#handling-timeouts>`__ section to deal with this kind of error.

2. Validation errors

   Validation errors happen due to bad user input.

   Our error messages rely heavily on `HTTP status codes `__. In short, codes in the 400-499 range are user errors, and codes in the 500-599 range are server errors. More on those later.

   Here we’ll cover the user errors, which you can remedy by modifying your parameters and trying again.

   The error message always has the following form:

   .. code:: text

      Got error in HTTP request: POST https://company.octopize.app/reports. Error status 400 - privacy_metrics job status is not success: JobStatus.FAILURE

   It contains:

   - the HTTP request method (``POST``, ``GET``, etc.)
   - the endpoint that was affected (``/reports``)
   - the status (``400``)
   - an informational message that details the exact error (``privacy_metrics job status is not success: JobStatus.FAILURE``)

   In this particular case, the user is calling the ``/reports`` endpoint to generate a report. Generating a report requires a successful privacy metrics job, because the report shows those metrics.
   However, in this case, the privacy job was in the ``JobStatus.FAILURE`` state. The fix is to look at the error message raised by the privacy job, launch another privacy job with the problem corrected, and generate the report from the new privacy job once it has succeeded.

3. ``JobStatus.FAILURE``

   Jobs that fail do not throw an exception. Instead, you have to inspect the ``JobStatus`` held in the ``status`` property.

   .. code:: python

      from avatars.models import JobKind

      job = runner.get_job(JobKind.standard)
      print(job.status)  # JobStatus.FAILURE
      print(job.exception)

   If the status is ``JobStatus.FAILURE``, the ``exception`` property contains an explanation of the error.

   You’ll have to relaunch the job with the appropriate modifications to your input.

4. Internal error

   Internal errors happen when something unexpected occurred on the server that we did not handle on our side, so we cannot give you an exact error message. These come with a 500 HTTP status code, and the message is ``internal error``.

   In these cases, there is not much you can do except try again with different parameters, hoping not to trigger the error again.

   When this happens, our error monitoring software catches the error and notifies us instantly.

   You can reach out to your Octopize contact (support@octopize.io) for more information and help with troubleshooting while we investigate on our side. We’ll be hard at work resolving the bug and releasing a new version with the fix.

Handling timeouts
-----------------

Asynchronous calls
~~~~~~~~~~~~~~~~~~

Many endpoints of the Avatar API are asynchronous: you request something that runs in the background, and you retrieve the result later with another method, such as ``runner.get_all_results`` for ``runner.run``.

The default timeout for most calls to the engine is short, only a few seconds.
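
Since the default timeout is short, one way to wait for a long-running job is to ask for the result at regular intervals rather than in a single blocking call. The sketch below is generic Python and not part of the Avatar client: ``poll_until_ready``, ``fetch``, ``interval_seconds`` and ``max_wait_seconds`` are hypothetical names, and ``fetch`` stands for any callable you write that returns ``None`` while the result is not ready. How such a callable would wrap the client (for example by catching the ``Timeout`` error) is an assumption to check against your client version.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until_ready(
    fetch: Callable[[], Optional[T]],
    interval_seconds: float = 5.0,
    max_wait_seconds: float = 600.0,
) -> T:
    """Call `fetch` repeatedly until it returns something other than None.

    `fetch` is a hypothetical, user-supplied callable: it should return None
    while the job is still running and the result once it is available.
    """
    deadline = time.monotonic() + max_wait_seconds
    while True:
        result = fetch()
        if result is not None:
            return result
        # Give up once the overall waiting budget is exhausted.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {max_wait_seconds} seconds")
        # Wait before asking again, to avoid hammering the server.
        time.sleep(interval_seconds)
```

For example, ``poll_until_ready(my_fetch, interval_seconds=30)`` would call a user-defined ``my_fetch`` roughly every 30 seconds, for up to 10 minutes with the default ``max_wait_seconds``.
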
You will quite quickly reach a point where a job on the server takes longer than the default timeout to run. Because the calls are asynchronous, you don’t need to sit and wait for the job to finish: you can take a break, come back after some time, and call the method that requests the result again.

Example:

.. code:: python

   from avatars.models import JobKind

   job = runner.run(jobs_to_run=[JobKind.standard, JobKind.privacy_metrics, JobKind.signal_metrics])
   print(job)

.. code:: python

   # Take a coffee break, close the script, come back in 3 minutes
   finished_job = runner.get_all_results()
   print(finished_job)  # JobStatus.success

How to view my last jobs
------------------------

You can view all the jobs that you created. Jobs are listed by creation date. Job attributes such as the job ID, creation date and status can be used to manage them (to delete a job, for example).

.. code:: python

   manager.get_last_results(count=10)  # get the last 10 jobs

You can also see your last jobs and results in the `web application `__.

How to reload results from a previous run
-----------------------------------------

If you have already run an avatarization and want to retrieve its results in a new session, use ``create_runner_from_name`` with the name you originally gave the run. This is also how you reload a job that was launched from the `web application `__.

.. code:: python

   runner = manager.create_runner_from_name("runner_name")
   df = runner.shuffled("wbcd")
   metrics = runner.privacy_metrics("wbcd")

Advanced: load by UUID
~~~~~~~~~~~~~~~~~~~~~~

If you know the exact UUID of a run (from ``runner.set_name``), you can load it directly using ``create_runner_from_id``:

.. code:: python

   runner = manager.create_runner_from_id("550e8400-e29b-41d4-a716-446655440000")
   df = runner.shuffled("wbcd")

If multiple runs share the same name, ``create_runner_from_name`` automatically uses the most recent one. To inspect a specific version, use ``find_ids_by_name``:

.. code:: python

   results = manager.find_ids_by_name("runner_name")
   for set_name_id, jobs in results:
       print(set_name_id, [j.kind for j in jobs])

   # Replace 0 with the index of the run you want to load.
   runner = manager.create_runner_from_id(results[0][0])
   df = runner.shuffled("wbcd")

Re-running from a loaded runner
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A loaded runner is primarily for **accessing existing results**. When ``run()`` is called on a runner that already has results, a **warning** is emitted with the old UUID so you can always recover the previous results:

.. code:: python

   runner = manager.create_runner_from_name("my_dataset")
   df = runner.shuffled("wbcd")  # access old results

   # To re-run:
   runner.run(ignore_warnings=True)
   # UserWarning: This runner already has results for set_name 'abc-123...'
   # To access previous results later:
   #   manager.create_runner_from_id('abc-123...')

How to delete jobs
------------------

Jobs can be deleted individually or in bulk using either the ``Runner`` or the ``Manager``.

**Delete one or more jobs from the current runner** (after running it):

.. code:: python

   runner.run()
   runner.delete()  # all jobs
   runner.delete([JobKind.standard, JobKind.privacy_metrics])  # multiple jobs

**Delete a run by name** using the manager. If multiple runs share the same name, a ``ValueError`` is raised with the commands to disambiguate:

.. code:: python

   manager.delete_job("my_dataset")

How to launch a job from a yaml
-------------------------------

You can launch a job using a YAML configuration file. If you use the web application to run jobs, you can download the configuration of a job and reuse it with the Python client. This approach is particularly helpful when iterating on the anonymization process.

Here is an example Python script:

.. code:: python

   import secrets

   job_name = "from_yaml_" + secrets.token_hex(4)
   runner = manager.create_runner(job_name)
   runner.from_yaml("fixtures/yaml_from_web.yaml")
   # If needed, upload your data for each table.
   # Data could have been deleted from the server if the job was launched a certain amount of time ago.
   runner.upload_file("iris", data="fixtures/iris.csv")
   runner.run()

How to change variables types
-----------------------------

Sometimes, it is helpful to change the type of your variables. For instance, a numeric variable might only contain a few unique values, making it more appropriate to treat it as a categorical variable. This can improve the utility of your avatarization.

.. code:: python

   from avatar_yaml.models.schema import ColumnType

   runner.add_table("wbcd", data="fixtures/wbcd.csv", types={"Clump_Thickness": ColumnType.CATEGORY})

Alternatively, use pandas to do it:

.. code:: python

   import pandas as pd

   df = pd.read_csv("fixtures/wbcd.csv")
   df["Clump_Thickness"] = df["Clump_Thickness"].astype("string")
   runner.add_table("wbcd", data=df)

How to launch an avatarization using differential privacy
---------------------------------------------------------

You can use differential privacy in the avatarization pipeline.

.. code:: python

   runner.set_parameters("wbcd", open_dp_epsilon=10)

How to launch metrics independently
-----------------------------------

Data quality evaluation can be performed independently of the anonymization process by running only the metrics jobs on both the original and the anonymized datasets. For an accurate assessment, the data should not be shuffled. For more details about our metrics, refer to our `public documentation `__.

.. code:: python

   import secrets

   from avatars.models import JobKind

   job_name = "only_metrics" + secrets.token_hex(4)
   runner = manager.create_runner(job_name)
   # Original and anonymized data should be in the same order
   runner.add_table("iris", data="fixtures/iris.csv", avatar_data="fixtures/iris_avatarized.csv")
   runner.set_parameters("iris", k=10)
   runner.run(jobs_to_run=[JobKind.privacy_metrics, JobKind.signal_metrics])

How to render plots
-------------------

You can generate plots to assess how well utility is preserved during the avatarization process, using four levels of analysis:

- **Univariate:** Compare distributions with ``PlotKind.DISTRIBUTION`` or review mean and standard deviation summaries for the first 10 columns using ``PlotKind.AGGREGATE_STATS``.
- **Bivariate:** Examine correlations with ``PlotKind.CORRELATION`` and analyze correlation differences with ``PlotKind.CORRELATION_DIFFERENCE``.
- **Multivariate:** Visualize data projections using ``PlotKind.PROJECTION_2D`` and ``PlotKind.PROJECTION_3D``.
- **Data structure:** Explore variable contributions within the model using ``PlotKind.CONTRIBUTION``.

Visualizations are a great way to fine-tune the parameters and understand your results.

To render a plot, simply use:

.. code:: python

   runner.render_plot("iris", PlotKind.PROJECTION_3D)

If you encounter issues displaying plots directly in your notebook (for example, in VS Code), or if you prefer to download the plot as an HTML file, you can use the ``open_in_browser=True`` parameter. This saves the plot as an HTML file and opens it in your default web browser:

.. code:: python

   runner.render_plot("iris", PlotKind.PROJECTION_3D, open_in_browser=True)

How to create a PIA report
--------------------------

When running the avatarization, you can **generate 2 reports**: a technical report and a PIA (Privacy Impact Assessment) report.

The PIA report helps determine whether the residual risk is compatible with the intended use of the data and with its level of sensitivity.
While the technical evaluation quantifies the re-identification risk associated with potential attacks, the risk analysis report evaluates the likelihood of such attacks occurring and the resources an adversary would need to carry them out.

The **PIA report can be customized** with information about your dataset and the avatarization context:

- `SensitivityLevel `__: indicates the sensitivity level of the data being processed. The level depends on whether the data are personal, sensitive, etc.
- `DataType `__: indicates the type of data being processed (e.g., health data, financial data, etc.). Different data types may have different privacy implications and regulatory requirements.
- `DataSubject `__: identifies the individual or group to whom the data relates, helping to contextualize privacy considerations.
- `DataRecipient `__: describes the intended recipients of the data, which can influence the assessment of privacy risks and necessary safeguards.

You can set these parameters when creating the runner:

.. code:: python

   from avatar_yaml.models.avatar_metadata import (
       DataRecipient,
       DataSubject,
       DataType,
       SensitivityLevel,
   )

   runner = manager.create_runner(
       set_name="test_pia_report",
       pia_data_recipient=DataRecipient.OPENDATA,
       pia_data_type=DataType.HEALTH,
       pia_data_subject=DataSubject.PATIENTS,
       pia_sensitivity_level=SensitivityLevel.HIGH,
   )

You can then download the PIA report as a DOCX file:

.. code:: python

   from avatar_yaml.models.parameters import ReportType

   runner.download_report("pia_report.docx", report_type=ReportType.PIA)
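
After downloading, it can be useful to confirm that the file on disk has the expected DOCX layout before sharing it. The sketch below relies only on the fact that a DOCX file is a ZIP archive containing a ``word/document.xml`` entry. The helper ``looks_like_docx`` is a hypothetical name and is not part of the Avatar client; it does not validate the content of the report itself.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path: str) -> bool:
    """Return True if `path` is a ZIP archive that contains word/document.xml."""
    file_path = Path(path)
    # A missing file or a file that is not a ZIP archive cannot be a DOCX.
    if not file_path.is_file() or not zipfile.is_zipfile(str(file_path)):
        return False
    try:
        with zipfile.ZipFile(str(file_path)) as archive:
            # Every DOCX stores its main document part under this name.
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

Calling ``looks_like_docx("pia_report.docx")`` returns ``False`` if the file is missing or is not a ZIP archive with that entry.
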