[RFC] Model Inference Caches #3055

Open
jngz-es opened this issue Oct 2, 2024 · 2 comments
Labels: RFC Request For Comments from the OpenSearch Community

jngz-es (Collaborator) commented Oct 2, 2024

Problem statement

Model inference is usually expensive, especially for large models, and repeated inference on identical inputs pays that cost again and again.

Motivation

With the caching feature, we can

  1. reduce the latency of model inference, and
  2. save the cost of model inference.

Proposed Design

Phase 0

We allow users to enable the cache feature for models.

Enable cache

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_enabled": true,
    "cache_config": {
        "eviction_policy": "lru",
        "ttl": 600, # 600s
        "capacity": 1000
    }
}

All cache parameters are optional. By default the cache is disabled. If the cache is enabled, the system reads the cache-related config; if the config is not present, the system uses default values.
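For example, the following register call (a hypothetical variation on the request above) enables the cache with all default settings by omitting cache_config entirely:

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_enabled": true
}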

Config parameters

eviction_policy - determines how items are evicted when the capacity limit is reached.
ttl - determines how long (in seconds) an item is held by the cache.
capacity - the soft limit on the cache size; it is overridden by the system hard limit if it exceeds that hard limit.

Disable cache

PUT /_plugins/_ml/models/<model_id>
{
  "cache_enabled": false
}

Disabling the cache for a model removes all of that model's data from the cache.

Update cache

PUT /_plugins/_ml/models/<model_id>
{
    "cache_config": {
        "eviction_policy": "lru",
        "ttl": 600, # 600s
        "capacity": 1000
    }
}

Storage

We use an OpenSearch index to store the cached data.
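As a minimal sketch of what that backing index could look like (the index name .plugins-ml-cache and the field names are illustrative assumptions, not part of this proposal):

PUT /.plugins-ml-cache
{
    "mappings": {
        "properties": {
            "model_id":    { "type": "keyword" },
            "response":    { "type": "object", "enabled": false },
            "create_time": { "type": "date", "format": "epoch_millis" }
        }
    }
}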

Cache key

The cache key is derived from the model id, the model config, and the user input.
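For illustration (the exact hashing scheme and field names are assumptions), a cached entry could use a hash of those three parts as the document id and store the raw response alongside a create_time for ttl checks:

# Hypothetical cached entry; _id = hash(model_id + model_config + user_input)
{
    "_id": "a1b2c3...",
    "model_id": "<model id>",
    "response": { "completion": "..." },
    "create_time": 1727846400000
}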

Cleanup

  • We check the ttl when getting an item from the cache; if it has expired, we remove it and treat it as a miss.
  • To avoid expired items lingering in storage forever, a periodic job deletes expired items (see the sketch after this list).
  • The same job also evicts data according to the configured eviction policy.
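For example, assuming the index layout sketched above, the periodic cleanup job could remove entries older than their ttl (600s here) with a delete-by-query:

POST /.plugins-ml-cache/_delete_by_query
{
    "query": {
        "range": {
            "create_time": {
                "lt": "now-600s"
            }
        }
    }
}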

Security

We leverage the existing model permission control for cache access control.

Phase 1

We introduce new cache APIs as a cache service to

  • provide flexibility for applying caching in different cases, such as in combination with models, agents, etc.
  • support remote caches.
  • support more cache features, such as setting a different ttl for different items in the cache.

API

Create cache

POST /_plugins/_ml/cache/_create
{
    "type": "Local/Remote",
    "name": "test cache",
    "description": "test cache",
    "connector": "connector_id" # required by remote type like Elasticache
    "config": {
        "eviction_policy": "lru",
        "ttl": 600,
        "capacity": 1000
    }
}

#Response
{
    "cache_id": "gW8Aa40BfUsSoeNTvOKI"
}

Get cache meta

# Get single cache meta data
GET /_plugins/_ml/cache/<cache_id>

#Response
{
    "cache_id": "gW8Aa40BfUsSoeNTvOKI",
    "type": "Local/Remote",
    "name": "test cache",
    "description": "test cache",
    "connector": "connector_id"
}

# Get all caches
GET /_plugins/_ml/cache

#Response
{
    "caches": [
        {
            "cache_id": "gW8Aa40BfUsSoeNTvOKI",
            "type": "Local/Remote",
            "name": "test cache",
            "description": "test cache",
            "connector": "connector_id"
        }
    ]
}

Delete cache

DELETE /_plugins/_ml/cache/<cache_id>

Cache set

PUT /_plugins/_ml/cache/<cache_id>/_set?ttl=600
{
    "key": "value (ex. model response)"
}

# multiple set
PUT /_plugins/_ml/cache/<cache_id>/_mset?ttl=600
{
    "key1": "value1",
    "key2": "value2",
    ...
}

Set automatically stores a create_time field for ttl calculation.

Cache get

GET /_plugins/_ml/cache/<cache_id>/_get
{
    "key": "key name"
}

# multiple get
GET /_plugins/_ml/cache/<cache_id>/_mget
{
    "keys": [key1, key2, ...]
}

If the ttl has expired, the get returns null and the key is removed from the cache.
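For illustration only (the response format is not finalized in this RFC), a hit and an expired or missing key might look like:

# Response on a cache hit
{
    "key name": "cached value"
}

# Response on a miss or an expired entry
{
    "key name": null
}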

Cache delete

DELETE /_plugins/_ml/cache/<cache_id>/_delete
{
    "key": "key name"
}

# multiple delete
DELETE /_plugins/_ml/cache/<cache_id>/_mdelete
{
    "keys": [key1, key2, ...]
}

Cache types

Local cache

We build the cache on top of OpenSearch index functionality. To simplify the design, we don't introduce a new distributed cache such as Redis or Memcached into the cluster; we use an OpenSearch index as the store for caching.

Remote cache

We leverage the existing connector mechanism to access a remote cache service such as ElastiCache and build a remote cache for customers. This requires a new connector type for caches rather than models: since we don't need a predict action, the connector needs get/set actions instead.

An example connector

POST /_plugins/_ml/connectors/_create
{
    "name": "Elasticache Connector",
    "description": "The connector to Elasticache",
    "version": 1,
    "protocol": "http",
    "parameters": {
        "host": "xxx.yyy.clustercfg.zzz1.cache.amazonaws.com",
        "port": "6379"
    },
    "credential": {
        "key": "..."
    },
    "actions": [
        {
            "action_type": "cache",
            "method": "get/put"
        }
    ]
}

Use case example

Cache for models

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_id": "<cache id>"
}
@jngz-es jngz-es self-assigned this Oct 2, 2024
@ylwu-amzn ylwu-amzn added RFC Request For Comments from the OpenSearch Community and removed untriaged labels Oct 7, 2024
brianf-aws (Contributor) commented

Hey Jing, this feature sounds amazing. It would help if you could provide an example of how this could be used, such as caching an embedding.

Besides LRU, what other eviction policies do you plan to implement?

zane-neo (Collaborator) commented Oct 8, 2024

@jngz-es, this feature looks good. Several questions:

  1. Are we going to support exact match or semantic match? If only exact match is supported, do we have an expected hit rate?
  2. Do we need to support enabling/disabling cache reads on the fly? E.g. I might not want cached data for a question because I'm seeking a different answer.
  3. Do we need to add user_id (if a user_id is present) to the cache key to avoid leaking private data?
