Add LLM semantic conventions #639
Conversation
Hi @nirga, thanks for the contribution! Please refer to the CONTRIBUTION guide. The markdown attribute tables are generated automatically, and you need to define the attributes via YAML files.
Good start, thanks @nirga !
docs/ai/llm-spans.md
Outdated
<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | Recommended |
How about adding `llm.response.duration`? This is requested to check the latency.
Isn't it covered already by the fact that an LLM request is a single span, which has a duration?
You mean we'd get this info just from the OTel span, right?
Yes, this information would be available from the Span. In the case of a streaming response some people want to know "time to first token", "max time/pause between tokens".
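As a rough sketch of how an instrumentation could derive those streaming timings itself, using a plain dict to stand in for span attributes (the attribute names here are hypothetical, not part of this PR):

```python
import time

def instrument_stream(token_iter, attributes):
    """Wrap a streaming LLM response, recording hypothetical timing
    attributes; overall duration still comes from the span itself."""
    start = time.monotonic()
    first_token_at = None
    prev = None
    max_gap = 0.0
    for token in token_iter:
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now - start        # time to first token
        if prev is not None:
            max_gap = max(max_gap, now - prev)  # max pause between tokens
        prev = now
        yield token
    # Hypothetical attribute names, for illustration only.
    attributes["llm.response.time_to_first_token"] = first_token_at
    attributes["llm.response.max_token_gap"] = max_gap
```

The wrapper yields tokens through unchanged, so the caller's streaming loop is unaffected; the attributes are filled in once the stream is exhausted.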
Thanks! I was merely copying and adapting #483. I'll work on converting this to YAML as well.
Some initial feedback. Once we define the yaml model files, some other efficiencies (duplication and naming conventions) become evident.
Also, can we add the openai metrics back into this PR or do you want that to be in a separate PR?
docs/ai/llm-spans.md
Outdated
<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | Recommended |
Yes, this information would be available from the Span. In the case of a streaming response some people want to know "time to first token", "max time/pause between tokens".
docs/ai/llm-spans.md
Outdated
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | Recommended |
| `llm.response.model` | string | The name of the LLM a response is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model actually used. If the LLM is a fine-tuned custom model, the value SHOULD have a more specific name than the base model that's been fine-tuned. | `gpt-4-0613` | Required |
Do we expect llm.request.model and llm.response.model to be different? The request and response are all recorded on one span, so these would be redundant?
Yes. For example, in OpenAI you ask for `gpt-4` and then get back a specific version like `gpt-4-0613` (I've also seen this in Anthropic, Replicate, and others).
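A minimal sketch of that distinction, recording both attributes on the same span (a plain dict stands in for span attributes, and the response payload shape is OpenAI-like but purely illustrative):

```python
def record_model_attributes(attributes, requested_model, response_payload):
    """Record the alias the caller asked for and the concrete model the
    vendor actually served; they often differ (gpt-4 vs gpt-4-0613)."""
    attributes["llm.request.model"] = requested_model
    # Vendors typically echo the resolved model name in the response body;
    # fall back to the requested name if they don't.
    attributes["llm.response.model"] = response_payload.get("model", requested_model)

attrs = {}
record_model_attributes(attrs, "gpt-4", {"model": "gpt-4-0613"})
```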
docs/ai/llm-spans.md
Outdated
|---|---|---|---|---|
| `llm.vendor` | string | The name of the LLM foundation model vendor, if applicable. If not using a vendor-supplied model, this field is left blank. | `openai` | Recommended |
| `llm.request.model` | string | The name of the LLM a request is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model requested. If the LLM is a fine-tuned custom model, the value SHOULD have a more specific name than the base model that's been fine-tuned. | `gpt-4` | Required |
| `llm.request.max_tokens` | int | The maximum number of tokens the LLM generates for a request. | `100` | Recommended |
`max_tokens` is prefixed with `request`, whereas other parameters such as temperature are not prefixed. Perhaps we should remove the `request` prefix or add it to the others. Instead of `request`, perhaps `parameter` is better.
Hmm... `parameter` would sound weird for `model`, no? `llm.parameter.model`. I've added the `request` prefix to all request parameters.
@joaopgrassi YAML files were added.
Great start!
Left some suggestions and questions.
brief: The name of the LLM foundation model vendor, if applicable.
examples: 'openai'
tag: llm-generic-request
- id: request.model
Why not just `llm.model`? Also, I assume there could be multiple model properties; perhaps `llm.model.name` would be more future-proof?
The reason is there are many providers (like OpenAI, Anthropic, Render, etc.) where you ask for a general version (like `gpt-4-turbo-preview`) but then get a specific version (like `gpt-4-0125-preview`). So we need a separation between the "request" model and the "actual" model.
Makes sense. Is it a common case that request and response models are different?
The question still applies: `llm.model.name`, could the `model` contain more info other than the name? Like `llm.request.model.name|version` etc.? Do we want to make "model" a top namespace that `request` and `response` can then reuse?
> The question still applies: `llm.model.name`, could the `model` contain more info other than the name? Like `llm.request.model.name|version` etc.? Do we want to make "model" a top namespace that `request` and `response` can then reuse?
This is a single input parameter in the services I've seen, not a separate name and version. The response model could be a different qualified identifier. For example, the request could be for 'gpt4' and the response could say 'gpt4-32k-turbo'.
tag: llm-generic-request
- id: request.stop_sequences
type: string
brief: Array of strings the LLM uses as a stop sequence.
If it's an array, should the type be `string[]`? If there are good reasons (such as perf) to keep it as string, how are values separated? Could you also provide an example of an array in examples?
What's the resolution here? Should it be of type `string[]`?
Yes, it should likely be `string[]`.
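For illustration, OTel attribute values may be homogeneous arrays of primitives, so `string[]` fits naturally; a small sketch with a plain dict standing in for span attributes:

```python
def set_string_array(attributes, key, values):
    """Store a string[] attribute, mirroring OTel's rule that array
    attribute values must be homogeneous primitives."""
    if not all(isinstance(v, str) for v in values):
        raise TypeError(f"{key}: string[] requires all-string values")
    attributes[key] = list(values)

attrs = {}
set_string_array(attrs, "llm.request.stop_sequences", ["\n\nHuman:", "###"])
```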
- llm.content.openai.tool
- llm.content.openai.completion.choice

- id: llm.content.openai.prompt
Do we need this event? It could be `llm.content.prompt` with OpenAI-specific attributes.
That's a good question. Given that OpenAI (specifically) has a really different way of modeling prompts and completions, I wonder if it won't be cumbersome to use the same event for both?
We can start with one event and add more once it proves too difficult; the spec will stay experimental for now anyway.
For the time being we can just list OpenAI-specific attributes and mention that they would appear on the events.
(Unless you already have good reasons to keep events separate)
Some frameworks (e.g. vllm) have OpenAI-compatible serving APIs.
Using the same event name can benefit this use case.
Adding OpenAI metrics and fixing markdown errors
A request to an LLM is modeled as a span in a trace.

The **span name** SHOULD be set to a low cardinality value representing the request made to an LLM.
It MAY be a name of the API endpoint for the LLM being called.
We don't usually put endpoints in span names. Perhaps we can stay vague and say that it should contain a specific operation name (e.g. `create_chat_completions`).
See also a comment on metrics regarding introducing an `llm.operation` attribute.
## Configuration

Instrumentations for LLMs MUST offer the ability to turn off capture of prompts and completions. This is for three primary reasons:
Please set the requirement level on the corresponding attributes to opt-in; then there will be no need to specify this requirement: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/attribute-requirement-level.md
We can just say that prompts and completions could be sensitive (and keep the explanation below).
2. Data size concerns. Although there is no specified limit to sizes, there are practical limitations in programming languages and telemetry systems. Some LLMs allow for extremely large context windows that end users may take full advantage of.
3. Performance concerns. Sending large amounts of data to a telemetry backend may cause performance issues for the application.

By default, these configurations SHOULD NOT capture prompts and completions.
> By default, these configurations SHOULD NOT capture prompts and completions.

We need to change the requirement level to opt-in, and then this sentence is redundant.
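An opt-in capture gate might look like this sketch (the environment variable name is made up purely for illustration; the real knob would be whatever the spec settles on):

```python
import os

def maybe_capture_content(attributes, prompt, completion):
    """Attach prompt/completion only when the user opts in, since the
    content can be sensitive, large, and costly to export."""
    opted_in = os.environ.get(
        "OTEL_INSTRUMENTATION_LLM_CAPTURE_CONTENT", "false"
    ).lower() == "true"
    if opted_in:
        attributes["llm.prompt"] = prompt
        attributes["llm.completion"] = completion
```

With the attributes marked opt-in in the YAML model, an instrumentation following this pattern is compliant by default without any extra normative text.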
<!-- semconv llm.request -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`llm.request.is_stream`](../attributes-registry/llm.md) | boolean | Whether the LLM responds with a stream. | `False` | Recommended |
BTW, is it important for observability? How would I use it?
brief: The name of the LLM foundation model vendor, if applicable.
examples: 'openai'
tag: llm-generic-request
- id: request.model
Makes sense. Is it a common case that request and response models are different?
|---|---|---|---|---|
| [`llm.prompt`](../attributes-registry/llm.md) | string | The full prompt string sent to an LLM in a request. [1] | `\\n\\nHuman:You are an AI assistant that tells jokes. Can you tell me a joke about OpenTelemetry?\\n\\nAssistant:` | Recommended |

**[1]:** The full prompt string sent to an LLM in a request. If the LLM accepts a more complex input like a JSON object, this field is blank, and the response is instead captured in an event determined by the specific LLM technology semantic convention.
It's OK to put JSON into this attribute. Once we have a way to specify what goes into the event payload, we'll move it there, and JSON, XML, or plain text will be perfectly fine.
Prompt vs. completion reflects the old Completions API, which OpenAI has deprecated; many other model providers never even started with a Completions API (e.g. Mistral: https://docs.mistral.ai/api/).
The current OpenAI API is ChatCompletion, which takes a messages array as input and produces one message as output.
https://platform.openai.com/docs/guides/text-generation/chat-completions-api
Also, if we consider the llm-span to be the "base" to be extended for different model providers and APIs, I'm not sure we should include things like prompt/completion (input/output) as part of it: different models/APIs will have totally different inputs/outputs, so trying to define a base input/output here doesn't help.
It probably makes more sense to define a span type for each API, not each vendor?
OpenAI has chat completion, embeddings, image generation, GPT-4V... it will be hard to capture all that in a single span type 'openai'.
> It probably makes more sense to define span type for each api not vendor? Open AI has chat completion, embedding, image generation, gpt-4v ... it will be hard to capture that in a single span type 'openai'.

Yes, I have the same concern. Do you think "inputs" and "outputs" would be the more generic representation across various APIs? Then we could add API-specific attributes for chat completions, image generation, etc.: `gen_ai.openai.chatcompletions.*`, `gen_ai.openai.images.*`, and so on. We should discuss this in the working group.
|---|---|---|---|---|
| [`llm.completion`](../attributes-registry/llm.md) | string | The full response string from an LLM in a response. [1] | `Why did the developer stop using OpenTelemetry? Because they couldnt trace their steps!` | Recommended |

**[1]:** The full response string from an LLM. If the LLM responds with a more complex output like a JSON object made up of several pieces (such as OpenAI's message choices), this field is the content of the response. If the LLM produces multiple responses, then this field is left blank, and each response is instead captured in an event determined by the specific LLM technology semantic convention.
I don't understand this sentence. Why leave this attribute blank and not put JSON there?
Also, I think we should create an event per message in the completion, at least when the response is streamed.
@@ -0,0 +1,372 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: MEtrics
- linkTitle: MEtrics
+ linkTitle: LLM metrics

(or OpenAI metrics depending on the discussion below)
@@ -0,0 +1,25 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: AI
nit: since it's LLM semconv, I think it should be in the llm folder and should have LLM title
to: database/README.md
--->

# Semantic Conventions for AI systems
- # Semantic Conventions for AI systems
+ # Semantic Conventions for LLM clients
brief: The name of the LLM foundation model vendor, if applicable.
examples: 'openai'
tag: llm-generic-request
- id: request.model
The question still applies: `llm.model.name`, could the `model` contain more info other than the name? Like `llm.request.model.name|version` etc.? Do we want to make "model" a top namespace that `request` and `response` can then reuse?
brief: The total number of tokens used in the LLM prompt and response.
examples: [280]
tag: llm-generic-response
- id: prompt
Should this be inside `request.prompt`?
> Should this be inside `request.prompt`?

These were intended to be attributes on span events, but we will be moving them to the Event body.
brief: The full prompt string sent to an LLM in a request.
examples: ['\\n\\nHuman:You are an AI assistant that tells jokes. Can you tell me a joke about OpenTelemetry?\\n\\nAssistant:']
tag: llm-generic-events
- id: completion
- - id: completion
+ - id: response.completion

No?
Thanks @nirga for the great work!
Added some comments.
| Value | Description |
|---|---|
| `prompt` | prompt |
| `completion` | completion |
Does embedding have a `completion` token type?
examples: ["stop1"]
tag: llm-generic-request
- id: response.id
type: string[]
Should `response.id` be of `string` type (instead of `string[]`)?
- ref: llm.request.max_tokens
tag: tech-specific-openai-request
- ref: llm.request.temperature
tag: tech-specific-openai-request
What is the default requirement level if not specified?
Default is Recommended.
- llm.content.openai.tool
- llm.content.openai.completion.choice

- id: llm.content.openai.prompt
Some frameworks (e.g. vllm) have OpenAI-compatible serving APIs.
Using the same event name can benefit this use case.
| [`llm.request.top_p`](../attributes-registry/llm.md) | double | The top_p sampling setting for the LLM request. | `1.0` | Recommended |
| [`llm.response.finish_reason`](../attributes-registry/llm.md) | string | The reason the model stopped generating tokens. | `stop` | Recommended |
| [`llm.response.id`](../attributes-registry/llm.md) | string[] | The unique identifier for the completion. | `[chatcmpl-123]` | Recommended |
| [`llm.response.model`](../attributes-registry/llm.md) | string | The name of the LLM a response is being made to. [2] | `gpt-4-0613` | Required |
Is this mutually exclusive with `llm.request.model`?
Yes, the requested model identifier is sometimes different than the response model identifier. For example, Azure OpenAI allows for a deployment name as the request model, but responds with the actual LLM model name. Other systems will add the current variant at the end of the model name in the response.
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`llm.prompt`](../attributes-registry/llm.md) | string | The full prompt string sent to an LLM in a request. [1] | `\\n\\nHuman:You are an AI assistant that tells jokes. Can you tell me a joke about OpenTelemetry?\\n\\nAssistant:` | Recommended |
Besides the plain text prompts, OpenAI's chat completion API also supports complicated inputs/outputs like images, function calls, etc. How do we plan to record these kinds of payloads into trace?
Besides the chat completion API, OpenAI also has a lot of other APIs like Embeddings, Images, Assistants, etc. How do we plan to support those scenarios?
We are discussing in the working group the requirements for an initial minimum PR to get into semantic-conventions. After this initial merge, we will be able to create additional proposals, issues, and PRs. We will likely reduce the surface area of this initial PR. Then proposals can be submitted for embeddings, images, etc.
Work continued in #825
Advancement towards #327
Changes
Continuing the work from #483. Introduces semantic conventions for modern AI systems.
I tried focusing on a minimal set, specifically supporting LLMs in general with some specific semantic conventions for OpenAI as its API is far more complex than others like Anthropic. Future PRs will address more foundation models as well as vector DBs and frameworks.
I'm trying to match this to what we've already started building with OpenLLMetry and will make the needed changes there once this is approved.
Merge requirement checklist