feat: ADR for incremental algolia indexing

openedx · Mar 18, 2024 · d1146b2
1 parent 8c5b50d
Showing 2 changed files with 157 additions and 0 deletions.
diff --git a/docs/decisions/0009-incremental-algolia-indexing.rst b/docs/decisions/0009-incremental-algolia-indexing.rst
@@ -0,0 +1,88 @@
+Incremental Algolia Indexing
+============================
+
+
+Status
+------
+Draft
+
+
+Context
+-------
+The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
+database. This index is entirely rebuilt at least nightly from a compendium of content records, resulting in a
+wholesale replacement of the prior Algolia index. This job is time-consuming and memory-intensive. It also relies
+heavily on separate but required processes that retrieve filtered subsets of content from external sources of
+truth, primarily Course Discovery, where synchronous tasks must be run regularly in a specific order. The result is
+a brittle system: a run is either entirely successful or entirely unsuccessful.
+
+
+Solution Approach
+-----------------
+The goals should include:
+
+- Implement new tasks that run alongside/augment the existing indexer until we're able to cut over entirely.
+- Support all current metadata types, though not necessarily all of them on day 1.
+- Support multiple triggers: event bus, on demand from Django admin, on a schedule, the existing
+  ``update_content_metadata`` job, etc. Invocation should not depend on other processes run synchronously beforehand.
+- Higher parallelization factor, i.e. 1 content item per celery task worker (and no task group coordination required).
+- Provide a content-oriented method of determining content catalog membership that does not rely on external services.
+
+
+Decision
+--------
+We want to follow updates to content with individual and incremental updates to Algolia. To do this we will both
+create new functionality and reuse some existing functionality of our Algolia indexing infrastructure.
+
+----------------------------------
+First, the existing indexing process begins with executing catalog queries against `search/all` to determine which
+courses exist and belong to which catalogs. In order for incremental updates to work we first need to provide the
+opposite semantic and instead be able to determine catalog membership from a given course (rather than courses from a
+given catalog). We can make use of the new `apps.catalog.filters` python implementation which can take a catalog query
+and a piece of content metadata and determine if the content matches the query (without the use of course discovery).
+----------------------------------
+
+The first thing to address is how and when we choose to invoke the indexing process. Previously, the bulk indexing
+logic relied on a completely separate task completing synchronously: in order to bulk index, content records
+first needed to be bulk updated. The purpose of the ``update_content_metadata`` job is twofold: first, to ingest
+content metadata from external service providers and standardize its format and enterprise representation; second,
+to build associations between those metadata records and customer catalogs by way of catalog query inclusion. Once
+this information is entirely read and saved within the catalog service, the system is ready to snapshot the state
+of content in the form of Algolia objects and entirely rebuild and replace our Algolia index.
+
+This "first A, then B" approach to wholesale rebuilding of our indices is time- and resource-intensive, brittle, and
+prone to outages. The system is also slow to recover from a partial or full failure, because everything must
+be rerun in a specific order.
+
+To remediate these symptoms, content records will be indexed on an individual object-shard/content metadata
+object basis, at the moment a record is saved to the ContentMetadata table. Tying the indexing process to the
+model's ``post_save`` signal will decouple the task from any other time-consuming bulk job. To avoid
+redundant or unneeded requests, the record will be evaluated on two levels before an indexing task is kicked off.
+First, the content's ``modified_at`` value must be newer than what was previously stored. Second, the content
+must have associations with catalog queries within the service.
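
The two pre-indexing checks in the paragraph above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the service's actual code: `ContentSnapshot` and `should_enqueue_indexing` are hypothetical names, and the sketch assumes the previously stored `modified_at` value is captured somewhere before the save (for example in a pre-save hook), since a `post_save` receiver only sees the new state.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ContentSnapshot:
    """Hypothetical, minimal view of a ContentMetadata row (not the real model)."""
    content_key: str
    modified_at: datetime
    catalog_query_ids: FrozenSet[int]


def should_enqueue_indexing(previous: Optional[ContentSnapshot], current: ContentSnapshot) -> bool:
    """Return True only if both checks described in the ADR pass."""
    # Second check: content with no catalog query associations is not indexed.
    if not current.catalog_query_ids:
        return False
    # A record with no stored predecessor is new, so there is nothing to compare against.
    if previous is None:
        return True
    # First check: modified_at must have been bumped past the stored value.
    return current.modified_at > previous.modified_at
```

A receiver would call this once per saved record and only enqueue the per-content celery task when it returns `True`. Content that loses its last query association would need separate handling to remove its shards, which this sketch does not cover.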
+
+In order to incrementally update the Algolia index we need to introduce the ability to replace individual
+object-shard documents in the index (today we just replace the whole index). This can be implemented by creating
+methods to determine which Algolia object-shards exist for a piece of content. Once we have the relevant IDs we can
+determine whether a create, update, or delete is required, and can reuse the existing processes that bulk construct
+our Algolia objects, applied on an individual basis. For simplicity's sake an update will likely be a delete
+followed by the creation of new objects.
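
One possible shape for the create/update/delete decision described above is sketched here. The names `ShardPlan` and `plan_shard_replacement` are hypothetical and the shard ID format in the usage note is invented for illustration; only the `objectID` attribute is a real Algolia record field. The sketch plans the operations and leaves the actual Algolia calls out.

```python
from typing import Dict, Iterable, List, NamedTuple


class ShardPlan(NamedTuple):
    """Hypothetical result type: what to remove from and write to the index."""
    operation: str
    delete_ids: List[str]
    save_objects: List[Dict]


def plan_shard_replacement(existing_ids: Iterable[str], new_objects: Iterable[Dict]) -> ShardPlan:
    """Plan an update as 'delete every existing shard, then create the new ones'."""
    delete_ids = sorted(set(existing_ids))
    save_objects = list(new_objects)
    if delete_ids and save_objects:
        operation = "update"   # delete followed by create
    elif save_objects:
        operation = "create"   # nothing indexed yet for this content
    elif delete_ids:
        operation = "delete"   # content no longer produces any shards
    else:
        operation = "noop"     # nothing indexed and nothing to index
    return ShardPlan(operation, delete_ids, save_objects)
```

For example, `plan_shard_replacement(["course-a-1", "course-a-0"], [{"objectID": "course-a-0"}])` yields an `"update"` plan that deletes both existing shards and saves the single new object. Deleting first keeps the logic simple at the cost of a brief window in which the content is absent from the index.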
+
+Incremental updates, through the act of saving individual records, will need to be triggered by something, such as
+polling of updated content from Course Discovery, consumption of event-bus events, and/or triggering based on a nightly
+Course Discovery crawl or Django Admin button. However, it is not the responsibility of the indexer, nor of this ADR,
+to determine when those events should occur; the indexing process should be able to handle record updates from any
+source.
+
+
+Consequences
+------------
+Ideally this incremental process will allow us to provide a closer to real-time index using fewer resources. It will
+also give us more flexibility to include non-course-discovery content in catalogs because we will
+no longer rely on a query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in
+the catalog service, regardless of their source.
+
+
+Alternatives Considered
+-----------------------
+No alternatives were considered.
diff --git a/docs/decisions/0010-incremental-content-metadata-updating.rst b/docs/decisions/0010-incremental-content-metadata-updating.rst
@@ -0,0 +1,69 @@
+Incremental Content Metadata Updating
+=====================================
+
+
+Status
+------
+Draft
+
+
+Context
+-------
+The Enterprise Catalog Service implicitly relies on external services as sources of truth for content surfaced to
+organizations within the suite of enterprise products and tools. For the most part this external source of truth has
+been assumed to be the `course-discovery` service. The ``update_content_metadata`` job has relied on `course-discovery`
+to not only expose the content metadata of courses, programs and pathways but also to determine customer catalog
+associations with specific subsets of content, meaning enterprise-curated content filters are evaluated externally, as
+a black box that decides what content belongs to which customers. This is burdensome both to the catalog service, which
+has little control over how the underlying content filtering logic functions, and to the external service, from which
+redundant data must be requested for each and every query filter. Should the catalog service own the responsibility of
+determining the associations between a single piece of content and any of the customers' catalogs, we would only
+have to request all data from external sources once for bulk jobs, and we could also easily support creation, updates
+and deletes of single pieces of content communicated to the catalog service on an individual basis.
+
+Decision
+--------
+The existing indexing process begins with executing catalog queries against `search/all` to determine which
+courses exist and belong to which catalogs. In order for incremental updates to work we first need to provide the
+opposite semantic and instead be able to determine catalog membership from a given piece of content (rather than
+courses from a given catalog). We can make use of the new `apps.catalog.filters` Python implementation which can take a
+catalog query and a piece of content metadata and determine if the content matches the query (without the use of course
+discovery).
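
To make the "opposite semantic" concrete, here is a toy sketch of deriving catalog query membership from a single piece of content. It is not the `apps.catalog.filters` API, whose signatures are not shown in this commit: `content_matches_query` and `catalog_queries_for_content` are hypothetical stand-ins, and the matching rule (every filter field must equal, or be one of, the allowed values) is a simplifying assumption.

```python
from typing import Dict, Set


def content_matches_query(content_filter: dict, metadata: dict) -> bool:
    """Toy matcher: every field in the filter must allow the content's value."""
    for field, expected in content_filter.items():
        # Treat a scalar filter value as a one-element list of allowed values.
        allowed = expected if isinstance(expected, list) else [expected]
        if metadata.get(field) not in allowed:
            return False
    return True


def catalog_queries_for_content(metadata: dict, queries: Dict[int, dict]) -> Set[int]:
    """Given one piece of content, return the IDs of all queries it belongs to."""
    return {
        query_id
        for query_id, content_filter in queries.items()
        if content_matches_query(content_filter, metadata)
    }
```

With queries `{1: {"content_type": "course"}, 2: {"content_type": ["program"]}}`, a course record maps to `{1}` without any call to course-discovery, which is the property the incremental flow depends on.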
+
+We will implement a two-sided approach to content updating that will be introduced in parallel with the existing
+``update_content_metadata`` tasks and can eventually replace the old infrastructure. The first side will be a bulk
+job, similar to the current ``update_content_metadata`` task, that queries external sources of content and updates any
+mismatched records, using `apps.catalog.filters` to determine the query-content association sets. The second will be an
+event signal receiver that processes individual content update events as they are received. The intention is for the
+majority of updates in the catalog service to happen at the moment content is updated in its external source and the
+signal is fired, with the bulk job later verifying the data and cleaning up should something go wrong.
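
The relationship between the two sides can be sketched as two entry points sharing one idempotent update step. Everything here is an assumption for illustration: the in-memory `store` dict stands in for the ContentMetadata table, and the event shape (`content_key`, `metadata`) is hypothetical rather than an actual event-bus schema.

```python
from typing import Dict, List


def apply_update(store: Dict[str, dict], content_key: str, metadata: dict) -> bool:
    """Write the record only if it differs from what is stored; report whether it changed."""
    if store.get(content_key) == metadata:
        return False
    store[content_key] = metadata
    return True


def handle_content_event(store: Dict[str, dict], event: dict) -> bool:
    """Event receiver side: process one individual content update as it arrives."""
    return apply_update(store, event["content_key"], event["metadata"])


def bulk_reconcile(store: Dict[str, dict], source_records: Dict[str, dict]) -> List[str]:
    """Bulk side: compare every source record and repair only the mismatches.

    Returns the keys that had to be fixed. Removal of records that disappeared
    from the source is intentionally left out of this sketch.
    """
    return [
        key for key, metadata in source_records.items()
        if apply_update(store, key, metadata)
    ]
```

If the event path has kept up, `bulk_reconcile` returns an empty list and writes nothing, so the bulk job acts as verification rather than the primary update path.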
+
+While this new process will remove the need to constantly query and burden the `course-discovery` `search/all`
+endpoint, we will most likely still need to request the full metadata of each course/content object, similar to how
+the current task handles the flow.
+
+An event-receiver-based approach to individual content updates also opens up the possibility of ingesting content
+from other sources of truth that are hooked up to the edX event bus. This means that it will be easier for enterprise
+to ingest content from many sources, instead of requiring that content to first pass through course-discovery.
+
+
+Consequences
+------------
+As alluded to earlier, this change means that we will no longer have to repeatedly request data from course-discovery's
+`search/all` endpoint, as we won't need to rely on the service to do our filtering, which was one of the main
+contributors to the long run time of the ``update_content_metadata`` task. Additionally, housing
+our own filtering logic will allow us to maintain and improve the functionality should we want additional
+features.
+
+The signal-based individual updates will also mean a significantly smaller window of lag for content
+updates propagating throughout the enterprise system.
+
+
+Alternatives Considered
+-----------------------
+There are a number of ways that individual content updates could be communicated to the catalog service. Event-bus
+based signal handling restricts the catalog service to sources of truth that have integrated with the event bus
+service/software. We considered instead exposing an API endpoint that would take in a content update event and process
+the data as needed. However, it was decided that this approach is brittle and prone to losing updates in transit, as
+it would be difficult to ensure the update was fully communicated and processed by the catalog service should anything
+go wrong.