User guide¶
How to set up your email account¶
This section is only needed if logging in via email is activated in the global configuration.
At the moment, you have to get in touch with your Octopize contact so that they can create your account.
Our current email provider is AWS. They need to verify an email address before our platform can send emails to it.
You’ll thus get an email from AWS asking you to verify your email address by clicking on a link. Once you have done so, you can follow the steps in the section about resetting your password.
How to reset your password¶
NB: This section only applies if logging in via email is activated in the global configuration. This is not the case by default.
If you forgot your password or if you need to set one, first call the forgotten_password endpoint:
import os

from avatars.manager import Manager

manager = Manager(base_url=os.environ.get("BASE_URL"))
manager.forgotten_password("yourmail@mail.com")
You’ll then receive an email containing a token. This token is only valid once, and expires after 24 hours. Use it to reset your password:
import os

from avatars.manager import Manager

manager = Manager(base_url=os.environ.get("BASE_URL"))
manager.reset_password("yourmail@mail.com", "new_password", "new_password", "token-received-by-mail")
You’ll receive an email confirming your password was reset.
How to log in to the server¶
import os
# This is the client that you'll be using for all of your requests
from avatars.manager import Manager
import pandas as pd
import io
# Change this to your actual server endpoint, e.g. base_url="https://avatar.company.com"
manager = Manager(base_url=os.environ.get("AVATAR_BASE_URL"))
manager.authenticate(
    username=os.environ.get("AVATAR_USERNAME"),
    password=os.environ.get("AVATAR_PASSWORD"),
)
How to handle a large dataset¶
Due to server limits, you may be constrained by both the number of rows and the number of dimensions in your dataset.
Handle a large number of rows¶
If your dataset contains a large number of rows, it will automatically be split into batches, and each batch will be anonymized independently of the others. The batches are then merged back together, so that the final dataset is the result of anonymizing the whole dataset.
Handle a large number of dimensions¶
The number of dimensions is the number of continuous variables plus the number of modalities across categorical variables. The dimension limit is frequently reached due to a large number of modalities in one or several categorical variables (high-cardinality variables).
There are several solutions to bypass this limitation:
Encode the categorical variable into a continuous variable (frequency encoding, target encoding, …); see the sketch after this list.
Reduce the number of modalities by grouping some into more general modalities. You can use the processor GroupModalities.
Use the argument use_categorical_reduction. Setting use_categorical_reduction=True will reduce the dimension of the categorical variables by encoding their modalities as vectors, using the cat2vec word embedding. This solution could reduce the utility of your dataset.
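As a sketch of the first option, here is a minimal frequency-encoding example using pandas. It is independent of the avatars client, and the city column is a made-up example:
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Nantes", "Paris", "Lyon", "Paris"]})

# Replace each modality by its relative frequency in the dataset,
# turning one high-cardinality categorical column into a single
# continuous column instead of many dimensions.
frequencies = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(frequencies)
df = df.drop(columns=["city"])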
Understanding errors¶
Most of your actions will have a successful outcome. However, sometimes there will be errors, and this section is here to explain the kinds of errors that can happen and how to correct them.
Timeout("The call timed out. Consider increasing the timeout with the 'timeout' parameter.")
You’ll encounter this error when the call is taking too long to complete on the server. Most of the time, this will be during job execution or dataset upload/download. We encourage you to read the Handling timeouts section below to deal with these kinds of errors.
Validation errors¶
Validation errors happen due to bad user input. Our error messages rely heavily on HTTP status codes. In short, codes in the 400-499 range are user errors, and 500-599 are server errors. More on those later.
Here we’ll cover the user errors, which you can remedy by modifying your parameters and trying again. The error message will always be of the following form:
Got error in HTTP request: POST https://company.octopize.app/reports. Error status 400 - privacy_metrics job status is not success: JobStatus.FAILURE
You’ll have:
- the HTTP request method (POST, GET, etc.)
- the endpoint that was affected (/reports)
- the status (400)
- an informational message that details the exact error that is happening (privacy_metrics job status is not success: JobStatus.FAILURE)
In this particular case, the user is calling the /reports endpoint, trying to generate a report. Generating a report requires a successful privacy metrics job in order to show the metrics. However, in this case, the privacy job was in the JobStatus.FAILURE state. The fix is then to look at the error message that the privacy job produced, launch another privacy job, and once it is successful, launch the generation of the report with the new privacy job.
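Before generating the report, you can check the status of the privacy metrics job yourself. Here is a minimal sketch, reusing the runner from the login section; JobKind.privacy_metrics is an assumption, mirroring the "privacy_metrics" job name used elsewhere in this guide:
# Assumed: JobKind.privacy_metrics identifies the privacy metrics job.
privacy_job = runner.get_job(JobKind.privacy_metrics)
if privacy_job.status == JobStatus.FAILURE:
    print(privacy_job.exception)  # the reason the privacy job failed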
JobStatus.FAILURE¶
Jobs that fail do not throw an exception. Rather, you have to inspect the JobStatus that is in the status property:
job = runner.get_job(JobKind.standard)
print(job.status)  # JobStatus.FAILURE
print(job.exception)
If the status is JobStatus.FAILURE, the exception property will contain an explanation of the error. You’ll have to relaunch the job with the appropriate modifications to your input.
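For example, once your input is fixed, you can relaunch just the failed job and wait for its results, reusing the runner from the login section (assuming jobs_to_run accepts any subset of the job names):
# Relaunch only the job that failed, after correcting the input.
job = runner.run(jobs_to_run=["privacy_metrics"])
finished_job = runner.get_all_results()
print(finished_job)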
Internal error¶
Internal errors happen when there is an error on the server, meaning that we did not handle the error on our side and something unexpected happened, for which we cannot give you an exact error message. These come with a 500 HTTP status code, and the message is internal error. In these cases, there is not much you can do except trying again with different parameters, hoping not to trigger the error again.
When these happen, our error monitoring software catches them and notifies us instantly. You can reach out to your Octopize contact (support@octopize.io) for more information and help with troubleshooting while we investigate on our side. We’ll be hard at work trying to resolve the bug and push out a new version with the fix.
Handling timeouts¶
Asynchronous calls¶
A lot of endpoints of the Avatar API are asynchronous, meaning that you request something that will run in the background, and will return a result after some time using another method, like runner.get_all_results for runner.run.
The default timeout for most calls to the engine is not very high, i.e. a few seconds. You will quite quickly reach a point where a job on the server takes longer than that to run.
The calls being asynchronous, you don’t need to sit and wait for the job to finish: you can simply take a break, come back after some time, and run the method requesting the result again.
Example:
job = runner.run(jobs_to_run=["standard", "privacy_metrics", "signal_metrics"])
print(job)
# Take a coffee break, close the script, come back in 3 minutes
finished_job = runner.get_all_results()
print(finished_job) # JobStatus.success
However, sometimes you want your code to be blocking and wait for the job to finish, and only then return the result.
For that, you can simply increase the timeout:
# Will retry for 10 minutes, or until the job is finished.
# The timeout below is assumed to be given in seconds.
finished_job = runner.get_all_results(timeout=600)
How to view my last jobs¶
A user can view all the jobs that they created. Jobs are listed by creation date. Attributes of the jobs, such as job ID, creation date, and job status, can be used to manage them (job deletion, for example).
manager.get_last_results(count=10)  # get the last 10 jobs