
The overall score is not matching with the principles #11

Open
ASC-Competition opened this issue Jan 2, 2024 · 1 comment

Comments

@ASC-Competition

Hi,
I found that some answers with a higher overall_score have a lower helpfulness_score in the evol_instruct.jsonl dataset, where the principle is 100% helpfulness.

For example, the scores of the 9th sample in the evol_instruct.jsonl dataset are as follows:

| model | helpfulness | honesty | instruction following | truthfulness | overall score |
| --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
| llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
| mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
| vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |

The answer from vicuna-33b has the highest helpfulness score but the lowest overall score.

My question is: should I pick the answer with the highest overall score or the highest helpfulness score as the preferred answer, or should I use the mean of the four principles?

Any suggestions would be appreciated, thanks.

@lifan-yuan
Collaborator

Hi,

Thanks for your interest.

The overall and fine-grained scores are annotated under different schemas and thus may not strictly match each other. Specifically, fine-grained scores are annotated according to our hand-written documentation, while overall scores rely entirely on GPT-4 itself, with the textual critique serving as the CoT rationale for scoring.

We investigated the effects of both kinds of scores in our paper (see Section 4.1) and found that using fine-grained scores was slightly better. But note that those experiments were based on the previous "bugged" version of overall scores (see this issue), and we are not sure whether the conclusion in the paper still applies to our updated scores.
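As an illustration of the fine-grained option, one could select the preferred answer by the mean of the four principle scores. This is a minimal sketch, not the paper's implementation; the dict field names are hypothetical stand-ins for whatever keys the jsonl records actually use, and the scores are copied from the table above.

```python
# Hypothetical sketch: prefer the completion with the highest mean of the
# four fine-grained principle scores, instead of the GPT-4 overall score.
# Field names are illustrative assumptions, not the dataset's real schema.

ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def mean_principle_score(completion):
    # Average of the four fine-grained aspect scores.
    return sum(completion[a] for a in ASPECTS) / len(ASPECTS)

def pick_preferred(completions):
    # Returns the completion with the highest mean fine-grained score;
    # on ties, max() keeps the first one encountered.
    return max(completions, key=mean_principle_score)

# The scores of the 9th sample from the table above:
completions = [
    {"model": "gpt-3.5-turbo", "helpfulness": 4, "honesty": 5,
     "instruction_following": 4, "truthfulness": 5, "overall": 7},
    {"model": "llama-2-70b-chat", "helpfulness": 4, "honesty": 4,
     "instruction_following": 5, "truthfulness": 5, "overall": 7.5},
    {"model": "mpt-30b-chat", "helpfulness": 3, "honesty": 4,
     "instruction_following": 3, "truthfulness": 5, "overall": 6.5},
    {"model": "vicuna-33b", "helpfulness": 5, "honesty": 4,
     "instruction_following": 4, "truthfulness": 5, "overall": 6.5},
]
print(pick_preferred(completions)["model"])
```

Note that on this particular sample, gpt-3.5-turbo, llama-2-70b-chat, and vicuna-33b all tie at a mean of 4.5, so an aggregation like this may still need a tie-breaker (for example, the overall score).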

Hope this helps.
