
Developer guides – Docker & UI


This page introduces the usage of Docker and the architecture of the user interface/webserver in greater detail than the overview page.

Architecture Overview

This diagram should give a good overview of the architecture of the webserver.

API

The local client interacts with the remotely deployed code through a RESTful API. The API uses the Python framework FastAPI with a Uvicorn server. It provides the following routes (an example client call is shown after the route overview):

POST

  • /start: Starts a Docker container with the configuration given in the request body in .json format. The response to this request will contain the id of the started container.

GET

  • /data?id=<container_id>: Returns a stream of a .tar or .zip file of the results folder of the experiment.
  • /data/tensorboard?id=<container_id>: Returns the link to the TensorBoard of the running container.
  • /health?id=<container_id>: Returns the state of the container with the given id.
  • /logs?id=<container_id>: Returns the commandline output of the running process in the container.
  • /remove?id=<container_id>: Stops and removes a running container.
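
As an illustration, a client could call these routes from Python using the requests library roughly as follows. The host, port, token and configuration payload below are placeholders and not part of the actual deployment:

import requests

API_URL = 'http://localhost:8000'  # placeholder host and port
HEADERS = {'Authorization': '<calculated token>'}  # see the Security section below

config = {}  # placeholder: the experiment configuration expected by the recommerce framework
started = requests.post(f'{API_URL}/start', json=config, headers=HEADERS).json()
container_id = started['id']

health = requests.get(f'{API_URL}/health', params={'id': container_id}, headers=HEADERS).json()
print(health['status'])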

Responses

The API will return a .json response with the following fields (an example is shown below the list):

  • id: The id of the docker container.
  • status: The current status of the container, e.g. running or paused.
  • data: Any miscellaneous data (e.g. a link to a running TensorBoard or the container's logs).
  • stream: A stream generator, e.g. for streaming files.
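
For illustration, a response to a /health request might look like this (all values are placeholders):

{
    "id": "<container_id>",
    "status": "running",
    "data": null,
    "stream": null
}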

Running the API

To start the API on port 8000, go to the /docker directory and run:

uvicorn app:app --reload

Don't use --reload when deploying in production. Alternatively, you can run app.py from the docker directory. Keep in mind that all output will be written to log files located in /docker/log_files.

Security

To use the API, an AUTHORIZATION_TOKEN must be provided in the environment variables. For each API request, the value of the Authorization header is checked: the token given in the request is compared to a token calculated on the API side from the AUTHORIZATION_TOKEN and the current time.
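
The exact calculation is internal to the API; conceptually it could look similar to the following sketch, where the hashing scheme and the time granularity are assumptions for illustration only:

import hashlib
import os
import time

def calculate_token() -> str:
    # assumption: derive the token from the shared secret and the current hour
    secret = os.environ['AUTHORIZATION_TOKEN']
    current_hour = time.strftime('%Y-%m-%d %H')
    return hashlib.sha256(f'{secret}{current_hour}'.encode()).hexdigest()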

WARNING: Please keep in mind that the AUTHORIZATION_TOKEN must be kept secret. If it is revealed, it must be revoked and a new secret must be set. Furthermore, consider using transport encryption to ensure that the token cannot be stolen in transit.

We suggest running the API with HTTPS only for additional security. Doing this is straightforward with Uvicorn: create your own SSL certificates and set the ssl_keyfile and ssl_certfile arguments accordingly.
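
When starting the server programmatically, this could look roughly like this (the certificate and key paths are placeholders):

import uvicorn

# serve the FastAPI app over HTTPS; adjust the paths to your own certificate files
uvicorn.run('app:app', host='0.0.0.0', port=8000,
            ssl_keyfile='./key.pem', ssl_certfile='./cert.pem')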

Docker

We use Docker to isolate recommerce experiments. Before you continue reading, please make sure you understand the basic concepts of Docker and that Docker is installed on the machine on which you want to run multiple isolated recommerce experiments. There are different ways to use Docker with the recommerce framework.

Docker with API

The API and the Docker containers must run on the same machine. This should usually be a remote machine with a lot of GPU and CPU power. There is a docker_manager which can be used by the API to manage Docker containers; for this, the docker_manager uses the Docker SDK for Python. Whenever the code of the recommerce framework changes, it is necessary to update the Docker image. When executed, the docker_manager will automatically update or create the image. Depending on the internet connection this might take a while.

python ./docker/docker_manager.py
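
Internally, the docker_manager builds on the Docker SDK for Python. A minimal sketch of that approach (not the actual docker_manager code) looks like this:

import docker

client = docker.from_env()

# build or update the recommerce image from the Dockerfile in the current directory
image, build_logs = client.images.build(path='.', tag='recommerce')

# start a detached container from the image and inspect it
container = client.containers.run('recommerce', detach=True)
print(container.id, container.status)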

Docker natively

When using Docker on your local machine, we recommend using Docker Desktop. Once Docker is available, the recommerce image can be built. This can be done either by running the docker_manager or by using the following command in the directory where the Dockerfile is located:

docker build . -t recommerce

Building the image may take a while; it is about 7 GB in size. To see all current images on the system, use:

docker images

You can create and run a container from the recommerce image using the following command. Note that if the machine does not have a dedicated GPU, it might be necessary to omit the --gpus all flag.

docker run -it --entrypoint /bin/bash --gpus all recommerce

Running this command will start a container and automatically open an interactive shell. To list all running Docker containers, use:

docker ps -a

Stop a specific container with a <container_id> by using:

docker stop <container_id>

And remove it with:

docker rm <container_id>

Troubleshooting

The errors described here should only occur when trying to deploy the webserver/Docker setup to a new environment or virtual machine.

Running a Docker container with GPU-support

When trying to run a Docker container (with a GPU device request), you might encounter the following error:

failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: signal: segmentation fault, stdout: , stderr:: unknown

This error is caused by the local Linux distribution (on Windows, this pertains to the WSL instance used by Docker) not having the packages required for CUDA support installed. A proposed workaround is to update/downgrade the following packages:

apt install libnvidia-container1=1.4.0-1 libnvidia-container-tools=1.4.0-1 nvidia-container-toolkit=1.5.1-1

Issues in the nvidia-docker repository that describe this error can be found here, here, and here. Please note that we have not confirmed that the workaround solves this problem.

Starting a training session with GPU-support

Note: This error should no longer occur if the recommerce package was installed with the correct extra selected. We are still including this section for completeness.

When trying to start a training session on a remote or local machine (e.g. using recommerce -c training), you might encounter the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, apb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

This error occurs when the torch installation does not have CUDA support even though the machine supports CUDA. Confirm that the correct version of recommerce is installed, see Installation. In this case, recommerce should be installed with the gpu extra, which installs the following versions of torch:

torch==1.11.0+cu115
torchvision==0.12.0+cu115
torchaudio==0.11.0+cu115

It is possible to manually update these versions using:

pip install torch==1.11.0+cu115 torchvision==0.12.0+cu115 torchaudio==0.11.0+cu115 -f https://download.pytorch.org/whl/torch_stable.html

Webserver

The provided webserver acts as a client to the API and as a UI for users of the recommerce framework. It is implemented using the Python library Django. Furthermore, jQuery 3.6 and Bootstrap v5.0 are used.

There is a Django app called alpha_business_app which implements all webserver logic. The second app, users, is provided by Django to implement user management. Django follows the Model-View-Controller pattern: the views are HTML files found in /webserver/templates, while controllers and models can be found in webserver/alpha_business_app. All files in the webserver/alpha_business_app/models directory are models which represent the database.
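
A minimal sketch of what such a model class could look like is shown below; the class and field names are hypothetical and only illustrate the pattern used in the models directory:

from django.db import models

class ExampleConfig(models.Model):
    # hypothetical fields; the real model classes live in webserver/alpha_business_app/models
    name = models.CharField(max_length=100)
    episodes = models.IntegerField(null=True)  # unused values are stored as None/NULL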

General Commands

All listed commands can only be executed in the webserver directory. Before starting the webserver for the first time, the database needs to be initialized; to do so, run:

python ./manage.py migrate

To start the webserver on 127.0.0.1:2709, use the following command:

python ./manage.py runserver 2709

When starting the webserver, you will be greeted by the login page. To create a superuser and log in to the page, run:

python ./manage.py createsuperuser

Django will ask you for a username, a password and an e-mail address. It should be possible to leave the e-mail field blank. To manage other users, go to 127.0.0.1:2709/admin and log in with the admin/superuser credentials.

When more fields are added to a model or existing fields are changed, the database must be modified. To do so, run:

python ./manage.py makemigrations

This will write migration files. Do not forget to apply the migrations afterwards by running python ./manage.py migrate again.

Run tests by using the following command. Warning for the CI pipeline: if Django does not run any tests, the pipeline will still pass.

python ./manage.py test -v 2

Configuration Files

The configuration files are represented as database models in the webserver. This is an overview of the current classes:

Each class has all possible attributes, even if some agents or marketplaces do not implement them. There is one configuration object for each saved configuration file, so these objects are always "complete"; values that are not needed are set to None.

Generating Model and Template Files

In order to have "complete" model classes, it is necessary to know all fields that can appear in any configuration file. Therefore, the agents and markets in the recommerce framework need to provide the get_configurable_fields classmethod. Whenever the parameters of an agent or a market change, the webserver needs to update its models.
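
The exact signature and return format are defined by the recommerce framework; conceptually, such a classmethod could look like this sketch, where the field names and the tuple layout are assumptions for illustration:

class ExampleAgent:
    @classmethod
    def get_configurable_fields(cls) -> list:
        # assumed layout: (field_name, field_type, validation_rule)
        return [
            ('learning_rate', float, None),
            ('batch_size', int, None),
        ]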

At the moment, this can be done by using the on_recommerce_change.py script, located in /webserver/alpha_business_app. Run this script before starting the server. It will overwrite the model files for rl-config and sim_market-config, as well as the template files for rl and sim_market. The supported types for input fields and database fields are int, float, string and boolean. After running this script, make and run migrations to apply these changes.

Security

To use the webserver, either an .env.txt file is needed in the BP2021/webserver directory or the environment variables SECRET_KEY and API_TOKEN must be set.

Here is an example of an .env.txt file:

this_line_contains_the_secret_key_for_the_django_server
this_line_contains_the_master_secret_for_the_api

Remember to change these secrets if they are leaked to the public. Both secrets should be long random strings. Keep in mind that the master secret for the API (API_TOKEN) must be equal to the AUTHORIZATION_TOKEN on the API side.

Monitoring Containers

Sometimes it might be useful to monitor the performance of the Docker containers. For this, there is a monitoring tool on the API side. It uses a database to store information about the system performance during an experiment and a separate process to monitor containers. If monitoring is enabled in app.py (set should_run_monitoring to True), the docker_manager will report all actions on a container to the database. Furthermore, the container_health_checker is started; it tells the database about stopped containers. This makes it possible to figure out how long a container has been running. When the API is shut down, the monitoring tool stops working as well.

Find more information about this monitoring tool in this thesis.

Websocket

There is a websocket which can be used to receive notifications about exited containers. The websocket can be started by running the container_notification_websocket.py file within the /docker directory. The websocket will run on port 8001. The webserver will automatically connect to this websocket. Whenever a container exits, users will receive a push notification in their interface.
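
For manual testing, one could listen to the websocket with the websockets library, for example as follows; the host is a placeholder and the message format is not specified here:

import asyncio
import websockets

async def listen():
    # connect to the notification websocket started on port 8001
    async with websockets.connect('ws://localhost:8001') as socket:
        async for message in socket:
            print(message)

asyncio.run(listen())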

There is not yet a stable version of the websocket; see PR #519.