I'm currently working on tool-use PRs, and I'm seeing that some models are very sensitive to the exact prompt. It would therefore be useful to detect which model is being served inside the chat template and adjust the input accordingly.
For example, Llama 3.1 seems to perform better if the tool list is passed in the first user message, whereas Llama 3.2 seems to prefer the tools in the system prompt. In the Llama chat template this behavior is controlled by the tools_in_user_message flag, which can be passed in the tokenizer.apply_chat_template() call:
{%- if not tools_in_user_message is defined %}
{%- set tools_in_user_message = false %}
{%- endif %}
Passing extra flags is already supported in vLLM's OpenAI-compatible API via the chat_template_kwargs field in the request JSON, but this field is not exposed by the openai client library, making it awkward to use. It would therefore be nice if extra context could be injected into the chat template so it can conditionally produce different prompts.
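As an aside, one partial workaround today (assuming the official openai Python client, whose extra_body parameter merges additional fields into the request JSON) is to smuggle the field in that way. The helper below is hypothetical and only sketches the payload shape vLLM expects:

```python
# Hypothetical helper: build the extra_body payload carrying vLLM's
# per-request chat_template_kwargs field.
def make_extra_body(**template_kwargs):
    """Wrap template variables in vLLM's chat_template_kwargs field."""
    return {"chat_template_kwargs": dict(template_kwargs)}


# Example usage with the openai client (server URL and model are examples):
#   client.chat.completions.create(
#       model="meta-llama/Llama-3.1-8B-Instruct",
#       messages=messages,
#       tools=tools,
#       extra_body=make_extra_body(tools_in_user_message=True),
#   )
```

This still requires the caller to know which flags each model's template wants, which is exactly the knowledge this proposal would move into the template itself.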
To illustrate the idea, here is a simplistic PoC that adds the model name as a variable passed to the chat template:
$ git diff vllm/entrypoints/openai/serving_chat.py
diff --git a/vllm/entrypoints/openai/serving_chat.py b/vllm/entrypoints/openai/serving_chat.py
index eee8076b..69e21511 100644
--- a/vllm/entrypoints/openai/serving_chat.py
+++ b/vllm/entrypoints/openai/serving_chat.py
@@ -132,6 +132,9 @@ class OpenAIServingChat(OpenAIServing):
tool.model_dump() for tool in request.tools
]
+ chat_template_kwargs = request.chat_template_kwargs or {}
+ chat_template_kwargs["model"] = request.model
+
prompt: Union[str, List[int]]
is_mistral_tokenizer = isinstance(tokenizer, MistralTokenizer)
if is_mistral_tokenizer:
@@ -142,7 +145,7 @@ class OpenAIServingChat(OpenAIServing):
add_generation_prompt=request.add_generation_prompt,
tools=tool_dicts,
documents=request.documents,
- **(request.chat_template_kwargs or {}),
+ **chat_template_kwargs,
)
else:
prompt = apply_hf_chat_template(
@@ -152,7 +155,7 @@ class OpenAIServingChat(OpenAIServing):
add_generation_prompt=request.add_generation_prompt,
tools=tool_dicts,
documents=request.documents,
- **(request.chat_template_kwargs or {}),
+ **chat_template_kwargs,
)
except Exception as e:
logger.error("Error in applying chat template from request: %s", e)
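In plain Python, the merge the diff performs amounts to the following (hypothetical helper name, just restating the PoC's behavior):

```python
def build_template_kwargs(request_kwargs, model):
    """Merge per-request template kwargs with the served model name,
    mirroring the PoC diff: the template can then branch on `model`.
    Note the copy, so the request object itself is not mutated."""
    kwargs = dict(request_kwargs or {})
    kwargs["model"] = model
    return kwargs
```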
The chat template can then do something like this:
{%- if not tools_in_user_message is defined %}
{%- if model is defined and "3.1" in model %}
{%- set tools_in_user_message = true %}
{%- else %}
{%- set tools_in_user_message = false %}
{%- endif %}
{%- endif %}
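For illustration, the same decision in plain Python (a hypothetical helper; it shares the fragility of substring-matching the served model name, which is part of why something more robust is wanted):

```python
def tools_in_user_message(model=None):
    """Mirror the Jinja conditional above: put tools in the first user
    message only for model names containing "3.1"; default to False."""
    return model is not None and "3.1" in model
```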
Ideally, though, the mechanism would be something more robust than the example above.
cc: @njhill @K-Mistele
Alternatives
No response
Additional context
No response