
DM-39636: Migrate Ook into a FastAPI application #140

Merged · merged 70 commits into main from tickets/DM-39636 · Jul 20, 2023

Conversation


@jonathansick commented Jul 15, 2023

Backwards-incompatible changes

  • The app is rewritten as a FastAPI/Safir app, replacing its heritage as an aiohttp/Safir app. The app is also now deployed with Helm via Phalanx. Because of this, Ook should be considered an entirely new app, with no backwards compatibility with the previous version.
  • Ook no longer receives GitHub webhooks; the intent is to get GitHub webhook events from Squarebot (through Kafka) in the future.
  • Ook no longer receives Kafka messages from LTD Events since that app isn't available in the new Roundtable deployment. A new ingest trigger is being developed in the interim. Until then, ingests can be triggered manually via the POST /ook/ingest/ltd endpoint (see the sketch below).
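
For reference, a manual trigger might look like the following. This is a hedged sketch: the hostname and the payload field name are assumptions (product_slug mirrors the product_slugs support added later in this PR), not a documented contract.

```python
# Hypothetical manual trigger for an LTD ingest. The host and the payload
# field names are assumptions, not Ook's documented API.
import httpx

response = httpx.post(
    "https://roundtable.example.org/ook/ingest/ltd",
    json={"product_slug": "sqr-076"},
)
response.raise_for_status()
```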

New features

  • Ook is now a FastAPI/Safir app.
  • Ook uses Pydantic models for its Kafka message schemas.
  • Ook is now built around a service/domain/handler architecture, bringing it in line with SQuaRE's modern apps. The Kafka consumer is considered a handler.

Other changes

  • The change log is maintained with scriv.
  • Tests are now orchestrated through nox.

- pyproject.toml packaging
- nox for task running
- scriv change log
- markdown readme
- ruff for linting
- drop kustomize

Note that this should break CI; the next step is to start switching the
app itself over.
These won't work yet; this just begins moving the aiohttp pieces into the
right places.
This likely won't work yet, but starts to replace the aiohttp app.py
factory module.
This establishes our pattern of organizing the internal data model into
the domain. Now, the DocumentRecord is technically a persistence model.
This service works with DocumentRecord domain objects and takes actions
on the Algolia index.
This moves the classification module into ook.services and repackages it
as a service class. More work is likely needed to modernize this
service.
This implements the concept of a Kafka consumer that routes messages
based on the topic name and types of the Pydantic models, described in
https://sqr-076.lsst.io

This code should eventually be moved into Kafkit.
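
A minimal sketch of the routing idea, assuming aiokafka and JSON payloads to keep it self-contained (Ook's real consumer decodes Avro through its schema manager; the model and handler names below are placeholders):

```python
# Route Kafka records to handlers by topic name, deserializing into a
# Pydantic model per route. Names other than aiokafka's are illustrative.
import json

from aiokafka import AIOKafkaConsumer
from pydantic import BaseModel


class LtdUrlIngest(BaseModel):
    """Placeholder message model."""

    url: str


async def handle_ltd_url_ingest(message: LtdUrlIngest) -> None:
    ...  # trigger the ingest


# Each topic maps to a (model, handler) pair.
ROUTES = {
    "lsst.square-events.ook.ingest": (LtdUrlIngest, handle_ltd_url_ingest),
}


async def consume() -> None:
    consumer = AIOKafkaConsumer(*ROUTES, bootstrap_servers="localhost:9092")
    await consumer.start()
    try:
        async for record in consumer:
            model_type, handler = ROUTES[record.topic]
            await handler(model_type(**json.loads(record.value)))
    finally:
        await consumer.stop()
```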
- Drop the run command because it's not needed for our FastAPI pattern
- Migrate upload_doc_stub command to use the new AlgoliaDocIndexService.
Include Kafkit to provide Pydantic-based schema management and a Kafka
consumer/producer.
Also fix ruff linting issues.
We'll see if this actually renders and parses correctly with current
versions of Pydantic...
FastAPI now recommends using a lifespan context manager rather than the
on_event handlers.
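
For context, the recommended pattern looks like this (a generic FastAPI example, not Ook's actual main module):

```python
# FastAPI lifespan pattern that replaces @app.on_event("startup") and
# @app.on_event("shutdown") handlers.
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    ...  # start-up: create clients, start consumers
    yield
    ...  # shutdown: close clients, cancel tasks


app = FastAPI(lifespan=lifespan)
```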
This will need to be entirely re-implemented, so let's not keep it around
for now.
consumer_kafka_messages is run in a separate asyncio task that is
started when the app starts up.

This coroutine sets up the PydanticAIOKafkaConsumer.

The next step is to start hooking up the actual message routes.
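
A sketch of that task wiring, assuming the lifespan pattern above (consumer_kafka_messages stands in for the coroutine described here):

```python
# Run the Kafka consumer as a background asyncio task tied to the app's
# lifespan, so it starts and stops with the application.
import asyncio
import contextlib
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def consumer_kafka_messages() -> None:
    ...  # set up the PydanticAIOKafkaConsumer and loop over records


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    task = asyncio.create_task(consumer_kafka_messages())
    yield
    task.cancel()  # stop the consumer on shutdown
    with contextlib.suppress(asyncio.CancelledError):
        await task


app = FastAPI(lifespan=lifespan)
```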
The Kafka message schemas that Ook manages are now Pydantic models
(AvroBaseModel via dataclasses_avroschema)
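
As an illustration (the field and schema names are invented, and the import path moved between dataclasses_avroschema releases):

```python
# A Kafka message schema as a Pydantic model via dataclasses_avroschema.
# Releases current at the time of this PR exposed the Pydantic base in
# dataclasses_avroschema.avrodantic; newer releases use
# dataclasses_avroschema.pydantic.
from dataclasses_avroschema.avrodantic import AvroBaseModel


class UrlIngestV1(AvroBaseModel):
    """Illustrative message model; not one of Ook's real schemas."""

    url: str

    class Meta:
        namespace = "lsst.square-events.ook"
        schema_name = "urlIngest"


print(UrlIngestV1.avro_schema())  # emits the derived Avro schema as JSON
```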
This will be re-built later when the next-gen LTD is producing events
again.
The purpose of this service is to get metadata about a GitHub repository
as needed, often as a sub-service for document parsers.

This replaces the old github and utils modules, as well as the
githubcreationdate workflow (which is now the seed for this service).

We've optimized the algorithm for getting the date of the first commit
as well. No longer are we ignoring additional pages of content.
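
One common approach, and a plausible reading of this change, is to jump straight to the last page of the commit listing rather than stopping at page one. A hedged sketch with httpx (this illustrates the idea, not Ook's exact code):

```python
# Find a repository's first-commit date without walking every page: ask
# for one commit per page and follow the Link header's rel="last" URL.
import httpx


async def get_first_commit_date(owner: str, repo: str) -> str:
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    async with httpx.AsyncClient() as client:
        r = await client.get(url, params={"per_page": 1})
        r.raise_for_status()
        last = r.links.get("last")  # httpx parses the Link header
        if last:
            r = await client.get(last["url"])
            r.raise_for_status()
        # Commits are listed newest-first, so the last item on the last
        # page is the repository's first commit.
        return r.json()[-1]["commit"]["committer"]["date"]
```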
This service handles ingesting Lander v1 documents, which provide JSON-LD
metadata in metadata.jsonld files. It replaces the former ltdlander
workflow module.
This service replaces the ingest_ltd_sphinx_technote workflow. The
service is responsible for ingesting the original format of Sphinx
technotes.
This makes it easier to spin up a Kafka producer from either HTTP path
operations or in the handlers for the Kafka consumer.

Now the Kafka consumer gets its PydanticSchemaManager from the factory
rather than creating an entirely new one.
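
Schematically, the factory owns process-level resources and hands them to whichever context needs them. The names below are illustrative, not Ook's exact classes:

```python
# Sketch of a factory that shares one producer and schema manager between
# HTTP path operations and the Kafka consumer's handlers.
from dataclasses import dataclass
from typing import Any

from aiokafka import AIOKafkaProducer


@dataclass
class ProcessContext:
    """Long-lived resources created once at application start-up."""

    kafka_producer: AIOKafkaProducer
    schema_manager: Any  # the shared PydanticSchemaManager in Ook


class Factory:
    def __init__(self, context: ProcessContext) -> None:
        self._context = context

    def get_kafka_producer(self) -> AIOKafkaProducer:
        # Reuse the shared producer rather than constructing a new one.
        return self._context.kafka_producer
```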

The mocking and Kafka producer modules are based on Squarebot's; these
need to be moved up into Kafkit.
Since Ook won't be a webhook receiver, it doesn't need the GitHub webhook
secret. Instead, we'll get GitHub events through Squarebot.
This version includes FastAPI dependencies for Kafka apps.
This integrates the classification service with the POST /ingest/ltd
endpoint. The classification service can now send resolved ingest
requests to the Kafka ingest topic. The new LtdMetadataService provides
basic access to the LTD v1 API (read only).
@jonathansick force-pushed the tickets/DM-39636 branch 2 times, most recently from a093042 to f987c80 on July 19, 2023 21:03
This makes it easier in dev environments to understand when messages
are being consumed, and whether they're matched to a router and handled
successfully or not.
This isn't parsing correctly; need to check what units it's in.
The Algolia SearchClient manages an aiohttp client session. Therefore it
needs to be shut down.

It's a context manager, which is tricky to use as a dependency,
especially with our Factory pattern when it's run outside a FastAPI
response handler context. The way we work around this is to manually
call the __aenter__ and __aexit__ methods in the initialization and close
phases of the dependency.
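
A sketch of that workaround (the dependency class is illustrative; the double-underscore calls are the point):

```python
# Drive the SearchClient's async context-manager protocol by hand from a
# dependency's initialize/close hooks, since the client lives longer than
# any single response handler.
from algoliasearch.search_client import SearchClient


class AlgoliaClientDependency:
    """Holds a long-lived SearchClient outside an ``async with`` block."""

    def __init__(self) -> None:
        self._client: SearchClient | None = None

    async def initialize(self, app_id: str, api_key: str) -> None:
        self._client = SearchClient.create(app_id, api_key)
        await self._client.__aenter__()  # manually enter the context

    async def __call__(self) -> SearchClient:
        assert self._client is not None, "initialize() must be called first"
        return self._client

    async def aclose(self) -> None:
        if self._client is not None:
            await self._client.__aexit__(None, None, None)  # manual exit
```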
When running the async version, browse_objects_async, I get this error:

TypeError: 'async for' received an object from __aiter__ that does not implement __anext__: coroutine", "logger":"ook", "offset":6, "partition":0, "topic":"lsst.square-events.ook.ingest"}

See algolia/algoliasearch-client-python#559

The work-around is to use the sync API, browse_objects, until this is
resolved.
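
The interim usage, assuming the algoliasearch v2-style client (credentials and index name are placeholders):

```python
from algoliasearch.search_client import SearchClient

client = SearchClient.create("APP_ID", "API_KEY")  # placeholder credentials
index = client.init_index("documents")  # placeholder index name

# The sync iterator works; browse_objects_async currently raises the
# TypeError above (algolia/algoliasearch-client-python#559).
for record in index.browse_objects({"query": ""}):
    ...
```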
This will hopefully cause Pydantic to render the enum out into a string
with the enum's value.
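
One way to get that behavior is mixing str into the enum so Pydantic serializes the member's value; Pydantic v1's Config.use_enum_values is an alternative. A sketch with an invented enum:

```python
from enum import Enum

from pydantic import BaseModel


class ContentType(str, Enum):  # str mix-in => serializes as the value
    LTD_TECHNOTE = "ltd_technote"


class IngestRequest(BaseModel):
    content_type: ContentType


# Pydantic v1 shown; use model_dump_json() on Pydantic v2.
print(IngestRequest(content_type=ContentType.LTD_TECHNOTE).json())
# -> {"content_type": "ltd_technote"}
```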
This adds support for the product_slugs field in the POST /ingest/ltd
endpoint; with a TaskGroup we can easily run multiple
classification/queuing tasks at once.
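
The fan-out pattern looks roughly like this (asyncio.TaskGroup needs Python 3.11+; classify_and_queue is a stand-in for the real service call):

```python
import asyncio


async def classify_and_queue(product_slug: str) -> None:
    ...  # classify the product and send an ingest request to Kafka


async def ingest_products(product_slugs: list[str]) -> None:
    # TaskGroup waits for all tasks and propagates any exceptions.
    async with asyncio.TaskGroup() as tg:
        for slug in product_slugs:
            tg.create_task(classify_and_queue(slug))
```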
@jonathansick marked this pull request as ready for review July 20, 2023 17:57
@jonathansick merged commit a48046c into main Jul 20, 2023
4 checks passed
@jonathansick deleted the tickets/DM-39636 branch July 20, 2023 18:13