User guide
==========

- `User guide <#user-guide>`__

  - `How to setup your email
    account <#how-to-setup-your-email-account>`__
  - `How to reset your password <#how-to-reset-your-password>`__
  - `How to log in to the server <#how-to-log-in-to-the-server>`__
  - `How to handle a large dataset <#how-to-handle-a-large-dataset>`__

    - `Handle large amount of rows <#handle-large-amount-of-rows>`__
    - `Handle large amount of
      dimensions <#handle-large-amount-of-dimensions>`__

  - `Understanding errors <#understanding-errors>`__
  - `Handling timeouts <#handling-timeouts>`__

    - `Asynchronous calls <#asynchronous-calls>`__

  - `How to view my last jobs <#how-to-view-my-last-jobs>`__

How to setup your email account
-------------------------------

*This section is only needed if logging in by email is activated in the
global configuration.*

At the moment, you have to get in touch with your Octopize contact so
that they can create your account.

Our current email provider is AWS. They need to verify an email address
before our platform can send emails to it.

You’ll thus get an email from AWS asking you to verify your email
address by clicking on a link. Once you have done so, you can follow the
steps in the section about
`resetting your password <#how-to-reset-your-password>`__.

How to reset your password
--------------------------

**NB**: This section is only relevant if logging in by email is
activated in the global configuration. This is not the case by default.

If you forgot your password or if you need to set one, first call the
forgotten_password endpoint:

.. raw:: html

   <!-- It is python, just doing this so that test-integration does not run this code (need mail config to run)  -->

.. code:: javascript

   import os

   from avatars.manager import Manager

   # Use a lowercase name so the instance does not shadow the Manager class
   manager = Manager(base_url=os.environ.get("BASE_URL"))
   manager.forgotten_password("yourmail@mail.com")

You’ll then receive an email containing a token. This token is only
valid once, and expires after 24 hours. Use it to reset your password:

.. code:: javascript

   import os

   from avatars.manager import Manager

   manager = Manager(base_url=os.environ.get("BASE_URL"))
   # Arguments: email, new password, new password again, token from the email
   manager.reset_password("yourmail@mail.com", "new_password", "new_password", "token-received-by-mail")

You’ll receive an email confirming your password was reset.

How to log in to the server
---------------------------

.. code:: python

   import os

   # This is the client that you'll be using for all of your requests
   from avatars.manager import Manager

   import pandas as pd
   import io

   # Change this to your actual server endpoint, e.g. base_url="https://avatar.company.com"
   manager = Manager(base_url=os.environ.get("AVATAR_BASE_URL"))
   manager.authenticate(
       username=os.environ.get("AVATAR_USERNAME"),
       password=os.environ.get("AVATAR_PASSWORD"),
   )
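
Note that ``os.environ.get`` returns ``None`` when a variable is not
set, which can lead to a confusing failure later on. As a minimal
sketch, a small helper can fail early with a clear message instead. The
name ``require_env`` is hypothetical and is not part of the avatars
client.

.. code:: python

   import os


   def require_env(name: str) -> str:
       """Return the value of an environment variable, or fail with a clear message."""
       value = os.environ.get(name)
       if not value:
           raise RuntimeError(f"Environment variable {name} is not set")
       return value


   # Example usage (hypothetical helper, same variable names as above):
   # manager = Manager(base_url=require_env("AVATAR_BASE_URL"))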

How to handle a large dataset
-----------------------------

Because of server limits, a dataset can be too large either in its
number of rows or in its number of dimensions.

Handle large amount of rows
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your dataset contains a large number of rows, it is automatically
split into batches, and each batch is anonymized independently of the
others. The batches are then merged back, so that the final dataset is
the anonymized version of the whole dataset.

Handle large amount of dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The number of dimensions is the number of continuous variables plus the
number of modalities in categorical variables. The dimension limit is
most often reached because one or a few categorical variables have a
large number of modalities (high-cardinality variables).
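
To estimate how many dimensions a dataset has before uploading it, the
definition above can be applied directly. The following is a minimal
sketch in plain Python; the function name ``count_dimensions`` and the
sample data are illustrative only, and the way the server treats missing
values is not covered here.

.. code:: python

   def count_dimensions(rows, categorical_columns):
       """One dimension per continuous column, plus one per distinct
       modality of each categorical column."""
       if not rows:
           return 0
       total = 0
       for column in rows[0].keys():
           if column in categorical_columns:
               # Each distinct value (modality) counts as one dimension
               total += len({row[column] for row in rows})
           else:
               total += 1
       return total


   rows = [
       {"age": 34, "income": 42000.0, "city": "Nantes"},
       {"age": 51, "income": 38000.0, "city": "Paris"},
       {"age": 27, "income": 51000.0, "city": "Nantes"},
   ]
   # 2 continuous variables + 2 modalities of "city"
   n_dimensions = count_dimensions(rows, categorical_columns={"city"})
   print(n_dimensions)  # 4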

There are several solutions to bypass this limitation:

- Encode the categorical variable into a continuous variable (frequency
  encoding, target encoding, …).
- Reduce the number of modalities by grouping some into more general
  modalities. You can use the processor GroupModalities.
- Use the argument ``use_categorical_reduction``

Setting ``use_categorical_reduction=True`` reduces the dimension of the
categorical variables by encoding them as vectors. This step uses the
cat2vec word embedding. This solution could reduce the utility of your
dataset.
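
The first two options can also be applied by hand before uploading the
data. The sketch below illustrates frequency encoding and the grouping
of rare modalities in plain Python. The function names are illustrative,
and this is not the implementation of the ``GroupModalities`` processor.

.. code:: python

   from collections import Counter


   def frequency_encode(values):
       """Replace each modality by its relative frequency in the column."""
       counts = Counter(values)
       total = len(values)
       return [counts[value] / total for value in values]


   def group_rare_modalities(values, min_count, other_label="other"):
       """Replace modalities seen fewer than min_count times by one label."""
       counts = Counter(values)
       return [v if counts[v] >= min_count else other_label for v in values]


   cities = ["Nantes", "Paris", "Nantes", "Lyon", "Nantes", "Brest"]

   # One continuous column instead of four modalities
   encoded = frequency_encode(cities)

   # Two modalities ("Nantes" and "other") instead of four
   grouped = group_rare_modalities(cities, min_count=2)

Frequency encoding maps modalities that are equally frequent to the same
value, so they can no longer be told apart afterwards.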

Understanding errors
--------------------

Most of your actions will have a successful outcome. However, sometimes
there will be errors. This section explains the kinds of errors that
can happen and how to correct them.

1. ``Timeout("The call timed out. Consider increasing the timeout with the 'timeout' parameter.")``

   You’ll encounter this error when the call is taking too long to
   complete on the server. Most of the time, this will be during job
   execution or dataset upload/download. We encourage you to read the
   `handling timeouts <#handling-timeouts>`__ section to deal with this
   kind of error.

2. Validation errors

   Validation errors happen due to bad user input. Our error messages
   rely heavily on `HTTP status
   codes <https://en.wikipedia.org/wiki/List_of_HTTP_status_codes>`__.
   In short, codes in the 400-499 range are user errors, and 500-599 are
   server errors. More on those later.

   Here we’ll cover the user errors, which you can remedy by modifying
   your parameters and trying again. The error message will always be of
   the following form:

   .. code:: text

      Got error in HTTP request: POST https://company.octopize.app/reports. Error status 400 - privacy_metrics job status is not success: JobStatus.FAILURE

   You’ll have:

   - the HTTP request method (``POST``, ``GET``, etc.)
   - the endpoint that was affected (``/reports``)
   - the status (``400``)
   - an informational message that details the exact error that is
     happening
     (``privacy_metrics job status is not success: JobStatus.FAILURE``)

   In this particular case, the user is calling the ``/reports``
   endpoint, trying to generate a report. Generating a report requires a
   successful privacy metrics job, because the report shows its metrics.
   However, in this case, the privacy job was in the
   ``JobStatus.FAILURE`` state. The fix is to look at the error message
   raised by the privacy job, launch another privacy job, and once that
   job succeeds, generate the report from it.

3. ``JobStatus.FAILURE``

   Jobs that fail do not throw an exception. Rather, you have to inspect
   the ``JobStatus`` that is in the ``status`` property.

   .. code:: python

      job = runner.get_job(JobKind.standard)
      print(job.status)  # JobStatus.FAILURE
      print(job.exception)

   If the status is ``JobStatus.FAILURE``, the ``exception`` property
   will contain an explanation of the error. You’ll have to relaunch the
   job with the appropriate modifications to your input.

4. Internal error

   Internal errors happen when there is an error on the server, meaning
   that we did not handle the error on our side, and something
   unexpected happened, for which we cannot give you an exact error
   message. These come with a 500 HTTP status code, and the message is
   ``internal error``. In these cases, there is not much you can do
   except try again with different parameters and hope not to trigger
   the error again.

   When these happen, our error monitoring software catches these and
   notifies us instantly. You can reach out to your Octopize contact
   (support@octopize.io) for more information and help for
   troubleshooting, while we investigate on our side. We’ll be hard at
   work trying to resolve the bug, and push out a new version with the
   fix.
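
When logging or triaging errors in a script, it can help to split an
error message of the form shown under validation errors into its parts.
The sketch below does this with a regular expression and classifies the
status code using the 400-499 and 500-599 ranges described above. The
pattern is derived only from the single example message in this guide,
and ``parse_error`` is a hypothetical helper, not part of the avatars
client.

.. code:: python

   import re

   # Pattern inferred from the example message above (an assumption)
   ERROR_PATTERN = re.compile(
       r"Got error in HTTP request: (?P<method>[A-Z]+) (?P<url>\S+)\. "
       r"Error status (?P<status>\d{3}) - (?P<message>.*)",
       re.DOTALL,
   )


   def parse_error(text):
       """Split an error message into method, URL, status, kind and message.

       Returns None when the text does not match the expected form.
       """
       match = ERROR_PATTERN.search(text)
       if match is None:
           return None
       status = int(match["status"])
       if 400 <= status <= 499:
           kind = "user error"
       elif 500 <= status <= 599:
           kind = "server error"
       else:
           kind = "other"
       return {
           "method": match["method"],
           "url": match["url"],
           "status": status,
           "kind": kind,
           "message": match["message"],
       }


   example = (
       "Got error in HTTP request: POST https://company.octopize.app/reports. "
       "Error status 400 - privacy_metrics job status is not success: "
       "JobStatus.FAILURE"
   )
   parsed = parse_error(example)
   print(parsed["kind"])  # user error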

Handling timeouts
-----------------

Asynchronous calls
~~~~~~~~~~~~~~~~~~

Many endpoints of the Avatar API are asynchronous: you request
something that runs in the background, and you retrieve the result
later using another method, such as ``runner.get_all_results`` for
``runner.run``.

The default timeout for most of the calls to the engine is not very
high, i.e. a few seconds long. You will quite quickly reach a point
where a job on the server is taking longer than that to run.

Because the calls are asynchronous, you don’t need to sit and wait for
the job to finish. You can take a break, come back after some time, and
call the method that requests the result again.

Example:

.. code:: python

   job = runner.run(jobs_to_run=["standard", "privacy_metrics", "signal_metrics"])

   print(job)

   # Take a coffee break, close the script, come back in 3 minutes

   finished_job = runner.get_all_results()

   print(finished_job)  # JobStatus.success

However, sometimes you want your code to be blocking and wait for the
job to finish, and only then return the result.

For that, you can simply increase the timeout:

.. code:: python

   # Will retry for 10 minutes, or until the job is finished.
   finished_job = runner.get_all_results()
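
If you prefer to control the waiting yourself, a generic polling loop is
one option. The following is a minimal sketch, not part of the avatars
client: ``wait_until_done`` is a hypothetical helper, ``fetch`` stands
for any call that returns the current state (for example a function
wrapping a status request), and ``is_done`` decides whether that state
is final.

.. code:: python

   import time


   def wait_until_done(fetch, is_done, timeout=600.0, interval=5.0):
       """Call fetch() every `interval` seconds until is_done(result) is
       true, or raise TimeoutError once `timeout` seconds have elapsed."""
       deadline = time.monotonic() + timeout
       while True:
           result = fetch()
           if is_done(result):
               return result
           if time.monotonic() >= deadline:
               raise TimeoutError(f"Not finished after {timeout} seconds")
           time.sleep(interval)


   # Demonstration with a fake job that succeeds on the third poll
   statuses = iter(["pending", "started", "success"])
   result = wait_until_done(
       fetch=lambda: next(statuses),
       is_done=lambda status: status == "success",
       timeout=10,
       interval=0,
   )
   print(result)  # success

Using ``time.monotonic`` rather than ``time.time`` keeps the deadline
unaffected by changes to the system clock.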

How to view my last jobs
------------------------

You can view all the jobs that you created. Jobs are listed by creation
date. Job attributes such as the job ID, creation date and job status
can be used to manage them (to delete a job, for example).

.. code:: python

   manager.get_last_results(count=10)  # get the last 10 jobs