
Optimise extraction prompts via DSPy #154

Open
slobentanzer opened this issue May 11, 2024 · 4 comments

Comments

@slobentanzer
Contributor

slobentanzer commented May 11, 2024

There remain some questions about the right prompt for the behaviour of the different models; Llama-series models seem to handle prompts differently than GPT. As an initial experiment, DSPy will be used to generate optimal text extraction prompts for a selection of models (GPT, Llama, Mistral/Mixtral), which will then be examined for their differences.
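For context, a minimal sketch of what such a DSPy program could look like for the caption-extraction task; the signature, field names, and example values are illustrative assumptions, not BioChatter code, and the LM setup reflects the DSPy API around that time (newer releases use `dspy.LM`):

```python
import dspy

# Hypothetical signature for the figure-caption extraction task; field names
# and docstring are illustrative, not taken from BioChatter.
class ExtractFromCaption(dspy.Signature):
    """Extract the requested information from a figure caption."""

    figure_caption = dspy.InputField(desc="the figure legend to extract from")
    query = dspy.InputField(desc="what should be extracted")
    answer_format = dspy.InputField(desc="the expected output format")
    answer = dspy.OutputField(desc="the answer, strictly in the requested format")

# Configure a language model (DSPy API as of mid-2024; newer versions use dspy.LM).
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# Un-optimised baseline program: a single prediction step over the signature.
extract = dspy.Predict(ExtractFromCaption)
prediction = extract(
    figure_caption="Figure 1: TP53 expression in HeLa cells after treatment.",
    query="Which gene is mentioned?",
    answer_format="gene symbol",
)
print(prediction.answer)
```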

@slobentanzer
Contributor Author

@drAbreu could you update briefly with your recent experiences?

@drAbreu
Contributor

drAbreu commented Aug 27, 2024

A series of experiments was performed to investigate whether DSPy can improve the benchmarking results, specifically for the Llama family of models.

Unfortunately, it seems that it is not currently possible to use system prompts with the Llama models in DSPy. A bit of extra research has also shown me that the template we are using in the system prompt for information extraction, while understood by OpenAI models, is not understood by other models. One example is Claude, where having the template

FIGURE CAPTION: {{figure legend}} ##\n\n## QUERY: {{query}} ##\n\n## ANSWER FORMAT: {{format}}. Submit your answer EXTRICTLY in the format specified by {{format}}

leads the model to more or less fail, while taking the template out leads to good results. This is likely because Claude uses XML-like tags for prompt templating, as opposed to GPT. This clearly points to prompt engineering issues that will be model dependent.
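For illustration only, here is how that hash-delimited template compares to a hypothetical Claude-oriented variant using the XML-style tags that Anthropic's prompting guidelines recommend; neither string is the exact template used in the benchmark:

```python
# Roughly the hash-delimited style used so far (placeholders simplified).
gpt_style_template = (
    "FIGURE CAPTION: {figure_legend} ##\n\n"
    "## QUERY: {query} ##\n\n"
    "## ANSWER FORMAT: {format}. "
    "Submit your answer STRICTLY in the format specified by {format}"
)

# Hypothetical Claude-oriented variant: XML-like tags instead of ## delimiters.
claude_style_template = (
    "<figure_caption>{figure_legend}</figure_caption>\n"
    "<query>{query}</query>\n"
    "<answer_format>{format}</answer_format>\n"
    "Answer strictly in the format given in <answer_format>."
)
```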

This poses the question of whether our current benchmark for information_extraction is meaningful, since the issues of models other than GPT might arise from a lack of prompt understanding, and the benchmark would then reflect this rather than the models' actual capacity to extract the required information.

The idea of DSPy was to improve the prompt or the system prompt, increasing the quality of the LLM inferences. However, I do not see this happening in our information extraction.

I have been comparing GPT-3.5, GPT-4o and Claude 3 Opus using the baseline API results, and then several of the different DSPy approaches.

As shown below, Claude 3 Opus works better than any of the GPT models, with the surprise that gpt-4o is outperformed by GPT-3.5 :hug:

Also interesting is that the most basic DSPy uses (Signature and ChainOfThought) just make the models worse.

Few-shot learning is what actually provides the best results. The results are shown below:

| (ROUGE scores) | gpt-3.5-turbo | claude-3-opus-20240229 | gpt-4o |
| --- | --- | --- | --- |
| Baseline | 0.41 +/- 0.32 | 0.58 +/- 0.39 | 0.39 +/- 0.34 |
| DSPy Signature | 0.37 +/- 0.31 | 0.35 +/- 0.31 | 0.28 +/- 0.26 |
| DSPy ChainOfThought | 0.28 +/- 0.30 | 0.37 +/- 0.33 | 0.25 +/- 0.26 |
| DSPy LabeledFewShot | 0.48 +/- 0.37 | 0.66 +/- 0.34 | 0.44 +/- 0.30 |
| DSPy BootstrapFewShot | 0.47 +/- 0.35 | 0.58 +/- 0.40 | 0.43 +/- 0.26 |
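For reference, a minimal sketch of how the LabeledFewShot and BootstrapFewShot optimizers are typically compiled, assuming the hypothetical `ExtractFromCaption` signature sketched above and a toy training set; the metric here is a placeholder (the table above scores with ROUGE instead):

```python
import dspy
from dspy.teleprompt import LabeledFewShot, BootstrapFewShot

# Toy training set; real experiments would use the benchmark examples.
trainset = [
    dspy.Example(
        figure_caption="Figure 2: BRCA1 knockdown reduces colony formation.",
        query="Which gene is mentioned?",
        answer_format="gene symbol",
        answer="BRCA1",
    ).with_inputs("figure_caption", "query", "answer_format"),
]

# Placeholder metric: exact string match on the answer field.
def exact_match(example, prediction, trace=None):
    return example.answer.strip() == prediction.answer.strip()

program = dspy.Predict(ExtractFromCaption)  # signature from the sketch above

# LabeledFewShot simply inserts up to k labelled demonstrations into the prompt.
fewshot_program = LabeledFewShot(k=4).compile(program, trainset=trainset)

# BootstrapFewShot additionally generates candidate demonstrations with the
# model itself and keeps those that pass the metric.
bootstrapped_program = BootstrapFewShot(metric=exact_match).compile(
    program, trainset=trainset
)
```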

Introducing the system prompt as a learnable parameter does not actually improve anything. With this few-shot learning process, the system prompt is not modified even a tiny bit by the DSPy compiler.

The results do not change either.

@drAbreu
Contributor

drAbreu commented Aug 27, 2024

This experiment suggests that keeping track of the prompt engineering conventions of different model families may be important to make the framework as universal as possible.

@slobentanzer
Contributor Author

Very nice analysis, thanks! This aligns with my intuition that the model creators are doing many idiosyncratic things, and it would thus be valuable to know the peculiarities of each model family and account for them in the backend to get comparable results between models. I'll be off next week, but let's catch up in September. :)

> the issues of models other than GPT might arise from a lack of prompt understanding

In fact, I did suspect that, but I think it is still valid to test, because this is the application we use. The next step would be the extraction module I suggested, where we look at each model family and create family-specific prompts to improve their performance (see the sketch below). This would bump the BioChatter version, and we would hopefully see a positive trend in extraction performance for some of the models.
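As a rough illustration of what such family-specific prompting could look like (all names and templates below are hypothetical, not part of the current BioChatter API):

```python
# Hypothetical family-specific templates; keys and strings are illustrative only.
PROMPT_TEMPLATES = {
    "openai": (
        "FIGURE CAPTION: {caption} ##\n\n"
        "## QUERY: {query} ##\n\n"
        "## ANSWER FORMAT: {fmt}"
    ),
    "anthropic": (
        "<figure_caption>{caption}</figure_caption>\n"
        "<query>{query}</query>\n"
        "<answer_format>{fmt}</answer_format>"
    ),
    "llama": "[CAPTION] {caption}\n[QUERY] {query}\n[FORMAT] {fmt}",
}

def build_extraction_prompt(model_name: str, caption: str, query: str, fmt: str) -> str:
    """Select the prompt template matching the model family of `model_name`."""
    if model_name.startswith("gpt"):
        family = "openai"
    elif model_name.startswith("claude"):
        family = "anthropic"
    else:
        family = "llama"  # fallback for Llama/Mistral-style models
    return PROMPT_TEMPLATES[family].format(caption=caption, query=query, fmt=fmt)

# Example usage:
print(build_extraction_prompt("claude-3-opus-20240229", "Figure 1 ...", "Which gene?", "gene symbol"))
```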

> most basic DSPy uses (Signature and ChainOfThought) just make the models worse

That is very interesting and counterintuitive, although I am not surprised.

> with the surprise that gpt-4o is outperformed by GPT-3.5

We see this in many instances. My guess is that it has to do with the internal system instructions.
