
Draft for slim containers #386

Draft: wants to merge 19 commits into base branch master
Conversation

@joschrew joschrew commented Jul 7, 2023

Description:
All processors run in their own Docker container as a processing worker. In addition, there are processing-server, rabbitmq and mongodb containers running. The executables for the processors are delegators to the ocrd network client.
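For illustration, such a delegator executable can be sketched roughly like this (a hedged sketch: the processor name and server address are placeholders, and the PR's generated template may differ):

```python
import subprocess
import sys

# Placeholders; the PR fills these in per processor from a template.
PROCESSOR_NAME = "ocrd-cis-ocropy-binarize"
PROCESSING_SERVER_ADDRESS = "http://localhost:8000"

def build_cmd(args):
    """Build the ocrd network client call that submits the job."""
    return [
        "ocrd", "network", "client", "processing", "processor",
        PROCESSOR_NAME, "--address", PROCESSING_SERVER_ADDRESS,
    ] + list(args)

if __name__ == "__main__":
    # Forward all CLI arguments unchanged to the network client.
    subprocess.run(build_cmd(sys.argv[1:]))
```

So calling e.g. `ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN` on the host merely submits a processing request to the server; the actual work happens in the processor's container.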

Related issue:
#69 (comment)

Example usage:

Clone this repo:
git clone [email protected]:joschrew/ocrd_all.git

Change to repository
cd ocrd_all

Core, ocrd_cis and ocrd_tesserocr are needed for the example run:
git submodule update --init core/ ocrd_cis/ ocrd_tesserocr/

Create the venv and docker-compose.yaml:
make -f Makefile-slim slim-venv

Create the data directory (necessary to make the workspaces available to the containers):
mkdir data

Start the containers:
docker-compose up -d

Get a workspace for testing:

curl "https://raw.githubusercontent.com/OCR-D/ocrd-webapi-implementation/main/things/example_ws.ocrd.zip" --output foo.zip
unzip foo.zip "data/*"

Activate the venv:
. venv2/bin/activate

Run a processor on the workspace:
ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN -m mets.xml

Collaborator @bertsky left a comment:

Early stage, dry review – still trying to make it run for me.

slim-containers-files/delegator_template.py (outdated review thread, resolved)
import subprocess

# Later the address (or rather the port) should be dynamic
processing_server_address = "http://localhost:8000"
Collaborator:

Why not the exposed OCRD_PS_PORT here? Or even better, instead of the host side (localhost), we should be fine with the Docker network's DNS:

Suggested change
processing_server_address = "http://localhost:8000"
processing_server_address = "http://ocrd-processing-server:8000"

Author:

How does that work (I mean the thing with the hostname; I think I know how to set the port via env)? When I am on the host, it cannot resolve ocrd-processing-server out of the box. What do I have to change additionally to make it possible to query the container from the host via its service/host name?

Collaborator:

Not from the host (network), but from the virtual Docker network (i.e. from inside another container). See Compose documentation. (The port then is the container internal port BTW.)

Author:

After trying some things I think I cannot solve this properly. First, the hostname cannot be the service name, because the delegator is run from the host. And I cannot read the port from .env, because I don't know the working directory the delegator is executed from. So I decided to set the port dynamically in the Makefile, like the processor name. The proposed suggestion at least does not work. Commit: 7736969

Collaborator:

First the hostname cannot be the service name because the delegator is run from the host.

Oh, right! Sorry, weak thinking on my side.

And I cannot read the port from .env because I don't know the working dir where the delegator is executed from.

Ok, good point. However, we could write all the .env values into the venv's activate via make, dynamically (i.e. setting the exact values used for the servers):

%/bin/activate:
	$(PYTHON) -m venv $(subst /bin/activate,,$@)
	. $@ && pip install --upgrade pip setuptools wheel
	@echo ". $(CURDIR)/.env" >> $@

IMO the .env should be the central source for configuration, so you should be able to modify it to your needs after it was generated.

Author @joschrew, Jul 27, 2023:

I see a problem with that approach: it only works with the venv activated in bash (not csh or fish), and invoking executables directly would not work with it either (e.g. venv/bin/ocrd-some-processor ...). Btw, in the example code @echo ". $(CURDIR)/.env" >> $@ does not work as-is: the variables have to be exported to be available to a started Python process, e.g. the processor.
But I think it is a good idea to use the .env for the config. For now I have implemented another approach: in the Makefile, the env path is written into the delegator, which then reads the port from it. I know it is not ideal, so I think we should talk about this again and agree on a proper solution.

Author:

I have reverted the last commit regarding this and decided to use the proposed approach.

Comment on lines 20 to 24
cmd = [
"ocrd", "network", "client", "processing", "processor",
processor_name, "--address", processing_server_address
]
subprocess.run(cmd + args[1:])
Collaborator:

The Processing Server API is asynchronous, and so is its network client. So this CLI will not give the same user experience as the native CLI (for example, you cannot script these calls).

Thus, either we add some callback mechanism here and block until the job is done, or we switch to the Processor Server API which is synchronous (but has no client CLI yet).

Collaborator:

It could be as simple as passing "--callback-url", "http://localhost:" + PORT + "/" here. The PORT should be generated before by an internal background task like so:

PORT = None
async def serve_callback():
    nonlocal PORT
    # set up web server
    def get(path):
        raise Exception("done")
    ...
    # run web server on arbitrary available port
    PORT = ...

server = asyncio.create_task(serve_callback())

Then after the subprocess.run, you can simply block by doing await server. The whole thing must be in a function that you asyncio.run(...).

Author @joschrew, Jul 19, 2023:

I have added a waiting mechanism with the Python built-in HTTPServer. I tried your proposed async approach but discarded it before I could completely finish. Async is probably possible, but it is considerably more complicated (that's why I finally opted out), and a non-async HTTP server has the same effect; the cost of waiting with a "normal" HTTP server instead of an async one is negligible.
I cannot claim to have much more than an idea of how it works internally, but I think a Python process with an async server gets scheduled by the operating system just like any other. The async machinery with polling and so on is "only" Python-internal. It would only gain us time if we had another async task running alongside that could execute while the server is waiting.
So I don't think async is helpful here, which is why I use the synchronous server for waiting.
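A minimal sketch of what such a blocking callback receiver with the stdlib HTTPServer could look like (hypothetical helper, not the PR's actual code): the delegator passes callback_url via --callback-url, submits the job, then blocks on the event until the Processing Server POSTs its job-done notification.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def wait_for_callback(host="localhost", port=0):
    """Start a one-off callback receiver; return (url, done_event, server)."""
    done = threading.Event()

    class CallbackHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Consume the request body, acknowledge, and signal completion.
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            self.send_response(200)
            self.end_headers()
            done.set()

        def log_message(self, *args):
            pass  # keep the CLI output quiet

    # port=0 lets the OS pick a free port; server_port holds the actual one.
    server = HTTPServer((host, port), CallbackHandler)
    callback_url = f"http://{host}:{server.server_port}/"
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return callback_url, done, server

# Usage sketch: submit the job with callback_url, then done.wait()
# after subprocess.run, and finally server.shutdown().
```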

Comment on lines +7 to +11
depends_on:
- ocrd-processing-server
- ocrd-mongodb
- ocrd-rabbitmq
# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
Collaborator:

If timing is an issue, I suggest changing the dependency type:

Suggested change
depends_on:
- ocrd-processing-server
- ocrd-mongodb
- ocrd-rabbitmq
# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
depends_on:
  ocrd-processing-server:
    condition: service_started
  ocrd-mongodb:
    condition: service_started
  ocrd-rabbitmq:
    condition: service_started

Author:

The underlying problem is the following: the container for the queue is started and running, but it takes 1-3 seconds until queue creation is possible. The processing worker, however, tries to create its queue right away.

This suggestion (service_started) does not work, because docker-compose considers the queue service started even though it is not actually ready to be used right away. I have a similar issue in another project and have already tried a few things to solve it. I think the only solution from the Docker side would be to implement a (manual) health check for the rabbitmq container, but for that I would have to extend the rabbitmq image, which I do not want to do.

For this PR to function, some extension to core is needed anyway. There I want to add an optional queue-creation timeout to the worker startup, so that it waits a few seconds before adding its queue, or tries again a few times. The restart fix was simply the fastest way to achieve that, which is why it is here, and I agree that it should be removed eventually. I will mark this as resolved as soon as the needed changes to core are made (which require one change to this PR as well).
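The "try again a few times" idea for the worker startup could look roughly like this generic helper (a sketch with hypothetical names, not core's actual API):

```python
import time

def retry(action, attempts=5, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds, sleeping between failed attempts.

    Covers the case where rabbitmq's container is up but the broker is
    not yet accepting queue declarations for the first few seconds.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)

# Usage sketch (hypothetical call):
# retry(lambda: channel.queue_declare(queue_name), delay=2.0)
```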


{{ processor_name }}:
  extends:
    file: slim-containers-files/{{ processor_group_name }}/docker-compose.yaml
Collaborator:

So IIUC in the final setup, when we have correct Dockerfiles and compose files in all modules, this will simply become {{ module_name }}/docker-compose.yaml?

Makefile-slim (outdated review thread, resolved)
slim-containers-files/ps-config.yaml (outdated review thread, resolved)
slim-containers-files/ps-config.yaml (outdated review thread, resolved)
Makefile-slim (outdated)
Comment on lines 1 to 12
export PYTHON ?= python3
VIRTUAL_ENV = $(CURDIR)/venv2
BIN = $(VIRTUAL_ENV)/bin
ACTIVATE_VENV = $(BIN)/activate
OCRD_MODULES = OCRD_CIS OCRD_TESSEROCR
OCRD_CIS = ocrd-cis-ocropy-binarize ocrd-cis-ocropy-dewarp
OCRD_TESSEROCR = ocrd-tesserocr-recognize ocrd-tesserocr-segment-region
PROCESSORS = $(foreach mod,$(OCRD_MODULES),$(foreach proc,$($(mod)), $(proc) ))
DELEGATORS = $(foreach proc,$(PROCESSORS),$(BIN)/$(proc))

slim-venv: docker-compose.yaml .env $(DELEGATORS) | $(VIRTUAL_ENV)

Collaborator:

None of this would be needed if you added to the existing Makefile directly. We already have (sensible+configurable) definitions for VIRTUAL_ENV, OCRD_MODULES (not a variable but the true submodule name) and OCRD_EXECUTABLES (your PROCESSORS). There is even an existing delegator mechanism (used for sub-venvs on some modules).

Makefile-slim (review thread, resolved)
Member @kba left a comment:

Looks good so far but I need a bit more explanation of all the parts involved, what gets generated and what is actually run.

slim-containers-files/Dummy-Core-Dockerfile (outdated review thread, resolved)
slim-containers-files/delegator_template.py (outdated review thread, resolved)
Changes not yet merged to core are needed for this PR to work. This
must be reset later
Author @joschrew:

I have started the process of addressing all comments, but this will take a while. I think I'll add them step by step. I will report back when that is finished.

@joschrew joschrew force-pushed the master branch 2 times, most recently from 1de5dbb to 10725b9 Compare July 20, 2023 07:00

processing_server_address = "http://localhost:{{ OCRD_PS_PORT }}"
env_path = "{{ OCRD_SLIM_ENV_PATH }}"
Member:

Maybe use https://github.com/theskumar/python-dotenv for that, so you don't have to parse sh variable assignments?
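For illustration, a minimal stdlib sketch of what reading the port from such a file could look like (python-dotenv's dotenv_values would additionally handle quoting, export prefixes and interpolation; the file path and variable name below are taken from the template above):

```python
def read_env(path):
    """Parse simple KEY=VALUE lines from a .env-style file.

    No quoting or interpolation support -- use python-dotenv for that.
    """
    values = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # skip blank lines, comments, and anything without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Usage sketch:
# port = read_env(env_path).get("OCRD_PS_PORT", "8000")
```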

This reverts commit cfd1a79. Goal is
to read .env when activating the venv and use the env-variable to set
the processing server port.
context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Member:

Suggested change
ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
ocrd network processing-worker ocrd-tesserocr-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue

context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
ocrd network processing-worker ocrd-tesserocr-segment-region --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Member:

We should now use the ocrd-tesserocr-segment-region worker syntax instead to get instance caching, right?
