diff --git a/docs/decisions/0009-incremental-algolia-indexing.rst b/docs/decisions/0009-incremental-algolia-indexing.rst
new file mode 100644
index 000000000..3de05f521
--- /dev/null
+++ b/docs/decisions/0009-incremental-algolia-indexing.rst
@@ -0,0 +1,88 @@

Incremental Algolia Indexing
============================


Status
------
Draft


Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
database. This index is rebuilt in its entirety at least nightly: the job works off a compendium of content records
and wholesale-replaces the prior Algolia index. The job is time consuming and memory intensive. It also relies heavily
on separate but required processes that retrieve filtered subsets of content from external sources of truth, primarily
Course Discovery, and those synchronous tasks must be run regularly and in a specific order. The result is a brittle
system: each run is either entirely successful or entirely unsuccessful.


Solution Approach
-----------------
The goals should include:

- Implement new tasks that run alongside and augment the existing indexer until we are able to cut over entirely.
- Support all current metadata types, though not necessarily all of them on day one.
- Support multiple methods of triggering: the event bus, on-demand from Django admin, on a schedule, from the existing
  ``update_content_metadata`` job, etc.

  - Invocation of the new indexing process should not rely on separate processes being run synchronously beforehand.

- Achieve a higher parallelization factor, i.e. one content item per celery task worker, with no task group
  coordination required.
- Provide a content-oriented method of determining content catalog membership that does not rely on external services.


Decision
--------
We want to follow updates to content with individual, incremental updates to Algolia. To do this we will both create
new functionality and reuse some existing functionality of our Algolia indexing infrastructure.

----------------------------------

First, the existing indexing process begins by executing catalog queries against `search/all` to determine which
courses exist and which catalogs they belong to. For incremental updates to work we first need to provide the opposite
semantic: determining catalog membership from a given course, rather than courses from a given catalog. We can make use
of the new `apps.catalog.filters` python implementation, which can take a catalog query and a piece of content metadata
and determine whether the content matches the query, without the use of course-discovery. A usage sketch follows below.

----------------------------------
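
Below is a minimal usage sketch of that content-oriented membership check. The module path, function name, and the
shapes of the filter and metadata dictionaries are assumptions made for illustration, not the final API.

.. code-block:: python

    # Illustrative only: the function name and dictionary shapes are assumed.
    from enterprise_catalog.apps.catalog import filters

    # A (simplified) content filter, as stored on a catalog query record.
    content_filter = {
        'content_type': 'course',
        'partner': 'edx',
    }

    # A (simplified) metadata record for a single course.
    course_metadata = {
        'key': 'edX+DemoX',
        'content_type': 'course',
        'partner': 'edx',
    }

    # True if this one piece of content satisfies the query's filter, evaluated
    # entirely inside the catalog service, with no call to course-discovery.
    is_member = filters.does_query_match_content(content_filter, course_metadata)

Looping that check over every stored catalog query yields the full set of catalogs a single content record belongs to.
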
Second, we need to address the way in which, and the moments when, we choose to invoke the indexing process.
Previously, the bulk indexing logic relied on a completely separate task completing synchronously: before anything
could be bulk indexed, content records first needed to be bulk updated. The ``update_content_metadata`` job's purpose
is two-fold: one, to ingest content metadata from external service providers and standardize its format and enterprise
representation; and two, to build associations between those metadata records and customer catalogs by way of catalog
query inclusion. Only once this information is entirely read and saved within the catalog service is the system ready
to snapshot the state of content in the form of Algolia objects and entirely rebuild and replace our Algolia index.

This "first A, then B" approach to wholesale rebuilding our indices is time and resource intensive as well as brittle
and prone to outages. The system is also slow to fix should a partial or full error occur, because everything must be
rerun in a specific order.

To remediate these symptoms, indexing will be handled on an individual content metadata record (and Algolia
object-shard) basis, and will happen at the moment a record is saved to the ``ContentMetadata`` table. Tying the
indexing process to the model's ``post_save()`` signal decouples the task from any other time consuming bulk job. To
avoid redundant or unneeded requests, a record will be evaluated on two levels before an indexing task is kicked off:
first, the content's metadata (its ``modified_at`` value) must be newer than what was previously stored; second, the
content must be associated with catalog queries within the service.

In order to incrementally update the Algolia index we need to introduce the ability to replace individual object-shard
documents in the index (today we simply replace the whole index). This can be implemented with methods that determine
which Algolia object-shards currently exist for a piece of content. Once we have the relevant IDs we can determine
whether a create, update, or delete is required, and can reuse the existing processes that bulk-construct our Algolia
objects, applied to an individual record. For simplicity's sake an update will likely be a delete followed by the
creation of new objects. A sketch of this flow follows below.
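
The following is a rough sketch, not the final implementation, of how the ``post_save()`` hook, the two pre-flight
checks, and the per-content celery task could fit together. The helper functions, the ``catalog_queries`` relation,
and the task name are all assumptions made for illustration.

.. code-block:: python

    # Hypothetical sketch of post_save driven incremental indexing; helper and
    # field names here are placeholders, not the service's real API.
    from celery import shared_task
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from enterprise_catalog.apps.catalog.models import ContentMetadata


    def _was_actually_modified(record):
        # Placeholder for the first check: the incoming metadata's modified_at must
        # be newer than whatever was previously stored/indexed for this record.
        return True


    def _replace_algolia_objects_for(record):
        # Placeholder: find the object-shard IDs that currently exist in Algolia for
        # this content, delete them, then create fresh objects by reusing the
        # existing bulk object-construction logic on this single record.
        pass


    @receiver(post_save, sender=ContentMetadata)
    def enqueue_incremental_index(sender, instance, **kwargs):
        # First check: the metadata must actually have changed.
        if not _was_actually_modified(instance):
            return
        # Second check: the content must belong to at least one catalog query.
        if not instance.catalog_queries.exists():
            return
        # One content item per celery task worker; no task group coordination.
        reindex_single_content.delay(instance.content_key)


    @shared_task
    def reindex_single_content(content_key):
        record = ContentMetadata.objects.get(content_key=content_key)
        # For simplicity's sake an "update" is a delete followed by new objects.
        _replace_algolia_objects_for(record)
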
Incremental updates, through the act of saving individual records, will still need to be triggered by something, such
as polling Course Discovery for updated content, consuming event-bus events, and/or a nightly Course Discovery crawl
or a Django admin button. However, it is not the responsibility of the indexer, nor of this ADR, to determine when
those events occur; the indexing process should be able to handle any source of content metadata record updates.


Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also give us more flexibility to include non-course-discovery content in catalogs, because we will no longer rely on a
query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in the catalog service,
regardless of their source.


Alternatives Considered
-----------------------
No alternatives were considered.

diff --git a/docs/decisions/0010-incremental-content-metadata-updating.rst b/docs/decisions/0010-incremental-content-metadata-updating.rst
new file mode 100644
index 000000000..abc921e19
--- /dev/null
+++ b/docs/decisions/0010-incremental-content-metadata-updating.rst
@@ -0,0 +1,69 @@

Incremental Content Metadata Updating
=====================================


Status
------
Draft


Context
-------
The Enterprise Catalog Service implicitly relies on external services as sources of truth for the content surfaced to
organizations through the suite of enterprise products and tools. For the most part this external source of truth has
been assumed to be the `course-discovery` service. The ``update_content_metadata`` job has relied on `course-discovery`
not only to expose the content metadata of courses, programs and pathways, but also to determine customer catalog
associations with specific subsets of content, meaning that enterprise curated content filters are evaluated externally
as a black-box answer to which content belongs to which customers. This is burdensome both to the catalog service,
which has little control over how the underlying content filtering logic works, and to the external service, which
must serve redundant data for each and every query filter. If the catalog service owned the responsibility of
determining the associations between a single piece of content and any of the customers' catalogs, not only would we
have to request all data from external sources just once for bulk jobs, but we could also easily support creates,
updates and deletes of single pieces of content communicated to the catalog service on an individual basis.


Decision
--------
The existing indexing process begins by executing catalog queries against `search/all` to determine which courses
exist and which catalogs they belong to. For incremental updates to work we first need to provide the opposite
semantic: determining catalog membership from a given piece of content, rather than courses from a given catalog. We
can make use of the new `apps.catalog.filters` python implementation, which can take a catalog query and a piece of
content metadata and determine whether the content matches the query, without the use of course-discovery.

We will implement a two-sided approach to content updating, introduced as parallel work alongside the existing
``update_content_metadata`` tasks so that it can eventually replace the old infrastructure. The first method is a bulk
job, similar to the current ``update_content_metadata`` task, which queries external sources of content and updates
any records that mismatch, using `apps.catalog.filters` to determine the query-content association sets. The second is
an event signal receiver that processes any individual content update events received (sketched at the end of this
section). The intention is for the majority of updates in the catalog service to happen at the moment the content is
updated in its external source and the signal is fired, with the bulk job later verifying and cleaning things up should
something go wrong.

While this new process will remove the need to constantly query, and thereby burden, the `course-discovery`
`search/all` endpoint, we will most likely still need to request the full metadata of each course/content object,
similar to how the current task handles that flow.

An event-receiver-based approach to individual content updates also opens up the possibility of ingesting content from
other sources of truth that are hooked up to the edX event bus. This will make it easier for enterprise to ingest
content from many sources, instead of relying on those services first going through course-discovery.
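
As a sketch of the signal-receiver half of this approach, the snippet below assumes a placeholder Django signal
standing in for whatever event-bus event delivers a single updated content record, along with assumed model field
names and an assumed ``does_query_match_content`` helper; none of these are the final API.

.. code-block:: python

    # Hypothetical sketch: an individual content update arrives, is upserted, and
    # has its catalog associations recomputed locally. Signal, field, and helper
    # names are placeholders for illustration.
    import django.dispatch

    from enterprise_catalog.apps.catalog import filters
    from enterprise_catalog.apps.catalog.models import CatalogQuery, ContentMetadata

    # Stand-in for whatever event-bus event carries one updated content record.
    content_metadata_changed = django.dispatch.Signal()


    @django.dispatch.receiver(content_metadata_changed)
    def handle_single_content_update(sender, content_metadata, **kwargs):
        # Upsert the record using the payload from the external source of truth.
        record, _ = ContentMetadata.objects.update_or_create(
            content_key=content_metadata['key'],
            defaults={'json_metadata': content_metadata},
        )
        # Evaluate every catalog query's filter against this one record inside the
        # catalog service, instead of asking course-discovery which catalogs match.
        matching_queries = [
            query for query in CatalogQuery.objects.all()
            if filters.does_query_match_content(query.content_filter, content_metadata)
        ]
        record.catalog_queries.set(matching_queries)
        # Saving and re-associating the record is what triggers the incremental
        # Algolia indexing described in ADR 0009.

The bulk verification job could reuse the same per-record association logic, driven by a full crawl of the external
source rather than by a signal.
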

Consequences
------------
As alluded to earlier, this change means we will no longer have to repeatedly request data from course-discovery's
`search/all` endpoint, because we will not rely on that service to perform our filtering logic; that reliance was one
of the main contributors to the long run time of the ``update_content_metadata`` task. Additionally, housing our own
filtering logic will allow us to maintain and improve upon the functionality should we want additional features.

The signal-based individual updates will also mean a significantly smaller window of lag for content updates
propagating throughout the enterprise system.


Alternatives Considered
-----------------------
There are a number of ways that individual content updates could be communicated to the catalog service. Event-bus
based signal handling restricts the catalog service to sources of truth that have integrated with the event bus
service/software. We considered instead exposing an API endpoint that would take in a content update event and process
the data as needed; however, we decided that this approach is brittle and prone to losing updates in transit, because
it would be difficult to ensure that an update was fully communicated and processed by the catalog service should
anything go wrong.