Still confused about dataset format used for fine-tuning model using APIs #60

Open
Rhriti opened this issue Jun 27, 2024 · 5 comments

Comments

@Rhriti

Rhriti commented Jun 27, 2024

I've been diving into fine-tuning for this virtual hackathon, but I am confused about the dataset format used to fine-tune the model through its API. Below is the format shown in the 'capabilities' section under 'fine-tuning' in the docs.
{
"messages": [
{
"role": "user",
"content": "User interaction n°1 contained in document n°2"
},
{
"role": "assistant",
"content": "Bot interaction n°1 contained in document n°2"
},
{
"role": "user",
"content": "User interaction n°2 contained in document n°1"
},
{
"role": "assistant",
"content": "Bot interaction n°2 contained in document n°1"
}
]
}

But under the 'Guide' section, 'fine-tuning' uses a different format, which is later converted into binary and fed to the API. Here is that format:
{
"prompt":"Are there any specific neighborhoods in the city that are particularly well-served by public transportation?",
"prompt_id":"ea9a0c38af7a8f9602c40a2e666283e8bb5101adec34421ec4a7df90bb081666",

"messages":[
{
"content":"Are there any specific neighborhoods in the city that are particularly well-served by public transportation?",
"role":"user"
}]

Enlighten me!!

@pandora-s-git
Collaborator

Hi, I've asked on the Discord too, but where exactly did you see the second format? It should look more like the first one.

@Rhriti
Author

Rhriti commented Jun 27, 2024

image

This is after reformatting the JSONL file! Can someone explain what 'prompt' and 'prompt_id' are doing in there?

@pandora-s-git
Collaborator

@Rhriti do you still have this issue? The reformatting script should work. Could you share a copy of your notebook?

@Rhriti
Author

Rhriti commented Jun 29, 2024

@pandora-s-git The notebook is the same as the one attached in the guide. Just run it yourself.

@pandora-s-git
Collaborator

I've taken a look, and it's an artifact from the original dataset. The 'messages' field is the important part, so the required format is your first one. We may fix that later to make it clearer. Thanks for the report!
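
If you would rather not have the leftover fields in your file at all, here is a minimal sketch of how they could be dropped, assuming each line of the JSONL file is one JSON object with a 'messages' list. The helper name `keep_messages_only` and the sample record are made up for illustration; this is not the reformatting script from the guide.

```python
import json


def keep_messages_only(jsonl_text: str) -> str:
    """Return JSONL text where each record keeps only its "messages" key.

    Hypothetical helper for illustration; assumes every non-blank line
    is a JSON object that contains a "messages" list.
    """
    cleaned = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Drop dataset leftovers such as "prompt" and "prompt_id".
        cleaned.append(
            json.dumps({"messages": record["messages"]}, ensure_ascii=False)
        )
    return "\n".join(cleaned) + "\n"


# Made-up sample record that mimics the second format from this thread.
sample = json.dumps({
    "prompt": "Hi",
    "prompt_id": "abc",
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
})

print(keep_messages_only(sample))
```

Each output line then has the same shape as the first format above: a single object whose only key is 'messages'.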
