Added support for the HyDE method in query analysis for the RAG module #1413

Open
wants to merge 17 commits into
base: main

Conversation

lanlanguai

Features
Added the HyDE method for query analysis in the RAG module, including an example for better understanding.
Fixed the static methods in TestRAGEmbeddingFactory not being callable. The previous code passed static methods as arguments to the parameterized tests, but those static method objects were not callable there, which raised a TypeError. This was resolved by converting the static methods to regular functions defined outside the class.
Feature Docs
No additional documentation provided.

Influence
As an optional step in RAG, query analysis rewrites queries to improve search results.
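
For readers unfamiliar with HyDE, the idea can be sketched as follows: an LLM writes a hypothetical answer document for the query, and retrieval is driven by the embedding of that document, optionally together with the original query. This is a minimal illustration, not the code in this PR. `fake_llm`, `embed`, `cosine`, `hyde_embedding_strs`, and `retrieve` are hypothetical names, and the bag-of-words "embedding" stands in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_embedding_strs(query: str, generate: Callable[[str], str], include_original: bool = True) -> List[str]:
    """Return the strings to embed: the hypothetical document, plus the query if requested."""
    strs = [generate(query)]
    if include_original:
        strs.append(query)
    return strs


def retrieve(query: str, docs: List[str], generate: Callable[[str], str], include_original: bool = True) -> str:
    """Pick the document closest to the combined embedding of the HyDE strings."""
    strs = hyde_embedding_strs(query, generate, include_original)
    # Summing the vectors is equivalent to averaging them under cosine similarity.
    combined = sum((embed(s) for s in strs), Counter())
    return max(docs, key=lambda d: cosine(combined, embed(d)))


def fake_llm(query: str) -> str:
    # Hypothetical stand-in for an LLM call that drafts an answer-like passage.
    return "bob likes to travel by train and enjoys hiking in the mountains"


DOCS = [
    "bob enjoys hiking in the mountains and travels by train",
    "the stock market closed higher on tuesday",
]
best = retrieve("what does bob like", DOCS, fake_llm)
```

The `include_original` flag here mirrors the option of the same name discussed in the review comments; its exact behavior in the PR may differ from this sketch.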

Result
All unit tests for the new features have passed.
The query-analysis process in the RAG module runs smoothly, effectively rewriting and optimizing queries for better search results.
Other
Added a detailed description of the changes and fixes made in the submission.

liaojianxing added 4 commits July 25, 2024 11:15
Mock functions (mock_openai_embedding, mock_azure_embedding, mock_gemini_embedding, and mock_ollama_embedding) have been added.
Reason for adding:
Fix the issue that static methods are not callable: the previous code passed the static methods as parameters to a parameterized test, but a static method was not a callable object there, resulting in a TypeError.
Factory.py
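
The fix described above can be illustrated with a small sketch. Inside a class body, a name decorated with `@staticmethod` is bound to the raw `staticmethod` wrapper, and that wrapper only became directly callable in Python 3.10. Module-level functions avoid the problem on every version. The function names come from the commit message; their bodies, the `FactoryTestsBefore` class, and the `CASES` list are assumptions for illustration only.

```python
class FactoryTestsBefore:
    """Hypothetical reconstruction of the problematic pattern."""

    @staticmethod
    def mock_embedding():
        return "mock"

    # At this point in the class body the name refers to the staticmethod
    # wrapper itself, not to the underlying function.
    captured = [mock_embedding]


# After the fix: plain module-level functions, callable everywhere.
def mock_openai_embedding():
    return "openai"


def mock_azure_embedding():
    return "azure"


# A list like this is what would be handed to pytest.mark.parametrize.
CASES = [
    ("openai", mock_openai_embedding),
    ("azure", mock_azure_embedding),
]

raw = FactoryTestsBefore.captured[0]
is_wrapper = isinstance(raw, staticmethod)
results = {name: fn() for name, fn in CASES}
```

On interpreters older than 3.10, calling `raw()` raises `TypeError: 'staticmethod' object is not callable`, which matches the error described in the commit message.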
@codecov-commenter


Codecov Report

Attention: Patch coverage is 15.62500% with 27 lines in your changes missing coverage. Please review.

Project coverage is 55.66%. Comparing base (c0abe17) to head (2819b2e).
Report is 12 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| metagpt/rag/query_analysis/HyDE.py | 0.00% | 14 Missing ⚠️ |
| metagpt/rag/factories/HyDEQueryTransformFactory.py | 0.00% | 13 Missing ⚠️ |


Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1413       +/-   ##
===========================================
+ Coverage   30.64%   55.66%   +25.01%     
===========================================
  Files         320      323        +3     
  Lines       19426    19458       +32     
===========================================
+ Hits         5954    10831     +4877     
+ Misses      13472     8627     -4845     


@@ -20,6 +20,10 @@ embedding:
  embed_batch_size: 100
  dimensions: # output dimension of embedding model

# RAG Analysis
hyde:
Collaborator

Use a nested structure like this to support more configuration under rag:
rag:
  query:
    hyde:
      include_original: True
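
One way the nested layout suggested above could map onto config classes is sketched below. The project itself uses pydantic-based `YamlModel` classes, so the standard-library dataclasses here are only a stand-in, and `RAGConfig`, `QueryConfig`, and `from_dict` are hypothetical names, not existing MetaGPT APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class HyDEConfig:
    # Whether the original query is embedded alongside the hypothetical document.
    include_original: bool = True


@dataclass
class QueryConfig:
    hyde: HyDEConfig = field(default_factory=HyDEConfig)


@dataclass
class RAGConfig:
    query: QueryConfig = field(default_factory=QueryConfig)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RAGConfig":
        """Build the nested config from the contents of the 'rag' mapping."""
        hyde_data = data.get("query", {}).get("hyde", {})
        return cls(query=QueryConfig(hyde=HyDEConfig(**hyde_data)))


# The dict a YAML loader would produce for the structure in the comment above.
parsed = {"rag": {"query": {"hyde": {"include_original": False}}}}
config = RAGConfig.from_dict(parsed["rag"])
```

Grouping everything under a single `rag` key leaves room for sibling sections (for example reranking options) without adding more top-level keys.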

api_key: "YOUR_API_KEY"
Collaborator

No need to commit this file if there are no related changes.

from pydantic import BaseModel

from metagpt.const import DATA_PATH, EXAMPLE_DATA_PATH
from metagpt.logs import logger
from metagpt.rag.engines import SimpleEngine
from metagpt.rag.factories.HyDEQueryTransformFactory import HyDEQueryTransformFactory
Collaborator

File names are usually lowercase with '_' as the separator.

@@ -212,6 +214,22 @@ async def init_and_query_es(self):
answer = await engine.aquery(TRAVEL_QUESTION)
self._print_query_result(answer)

async def use_HyDe(self):
Collaborator

Rename this to use_hyde, and keep the capitalization uniform elsewhere: HyDE, not HyDe.

@@ -51,6 +52,9 @@ class Config(CLIParams, YamlModel):
# RAG Embedding
embedding: EmbeddingConfig = EmbeddingConfig()

# RAG Analysis
hyde: HydeConfig = HydeConfig()
Collaborator

HyDEConfig

@@ -0,0 +1,5 @@
from metagpt.utils.yaml_model import YamlModel

Collaborator

Use rag_config.py to support an independent RAG configuration.


if self._include_original:
    embedding_strs.extend(query_bundle.embedding_strs)
logger.info(f" Hypothetical doc:{embedding_strs} ")
Collaborator

We usually don't log the embedding strings; they are too long and don't make a good log message.
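
One possible way to address this, sketched under the assumption that a short summary is enough for debugging: log the count, total length, and a truncated preview instead of the full strings. `summarize_for_log` and the logger name are hypothetical; the sketch uses the standard `logging` module rather than the project's own logger.

```python
import logging
from typing import List

logger = logging.getLogger("hyde_example")


def summarize_for_log(strs: List[str], max_chars: int = 80) -> str:
    """Describe a list of strings without dumping their full content."""
    total = sum(len(s) for s in strs)
    preview = strs[0][:max_chars] if strs else ""
    # Mark the preview as truncated only when something was actually cut off.
    suffix = "..." if strs and len(strs[0]) > max_chars else ""
    return f"{len(strs)} string(s), {total} chars, preview: {preview!r}{suffix}"


embedding_strs = ["x" * 500, "original query"]
summary = summarize_for_log(embedding_strs)
logger.info("Hypothetical doc: %s", summary)
```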

engine = SimpleEngine.from_docs(input_files=[TRAVEL_DOC_PATH])
# create HyDE query engine
hyde_query_transformr = HyDEQueryTransformFactory().create_hyde_query_transform()
hyde_query_engine = TransformQueryEngine(engine, hyde_query_transformr)
Collaborator

How can this be integrated into SimpleEngine, rather than using TransformQueryEngine directly?
What I mean is a single engine entry point that supports query rewriting, reranking, and so on.
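
A rough sketch of what such a single entry point might look like, assuming optional hooks for query rewriting and reranking. `PipelineEngine` and its fields are hypothetical and are not part of MetaGPT's `SimpleEngine`; the keyword retriever is a toy stand-in for a real index.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PipelineEngine:
    """Hypothetical single-entry engine with optional rewrite and rerank stages."""

    retrieve: Callable[[str], List[str]]
    query_transform: Optional[Callable[[str], str]] = None
    rerank: Optional[Callable[[str, List[str]], List[str]]] = None

    def query(self, question: str) -> List[str]:
        # Stage 1: optionally rewrite the query (HyDE would plug in here).
        search_text = self.query_transform(question) if self.query_transform else question
        # Stage 2: retrieve candidates using the (possibly rewritten) text.
        nodes = self.retrieve(search_text)
        # Stage 3: optionally rerank against the original question.
        if self.rerank:
            nodes = self.rerank(question, nodes)
        return nodes


corpus = ["bob likes trains", "alice likes planes", "trains are fast"]


def keyword_retrieve(text: str) -> List[str]:
    """Toy retriever: return documents sharing at least one word with the text."""
    words = set(text.lower().split())
    return [doc for doc in corpus if words & set(doc.split())]


engine = PipelineEngine(
    retrieve=keyword_retrieve,
    query_transform=lambda q: q + " trains",
    rerank=lambda q, nodes: sorted(nodes, key=len),
)
answers = engine.query("what does bob like")
plain_answers = PipelineEngine(retrieve=keyword_retrieve).query("what does bob like")
```

With this shape, callers construct one engine and opt into each stage through configuration instead of wrapping the engine in a separate query engine class.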

# 1. save docs
engine = SimpleEngine.from_docs(input_files=[TRAVEL_DOC_PATH])
# create HyDE query engine
hyde_query_transformr = HyDEQueryTransformFactory().create_hyde_query_transform()
Collaborator

Add comparison results on a dataset with and without the HyDE method.

@@ -23,13 +23,9 @@ rag:
# RAG Query Analysis
query_analysis:
hyde:
include_original: true # In the query rewrite, determines whether to include the original
include_original: True # In the query rewrite, determines whether to include the original
Collaborator

Use true, not True.

@@ -0,0 +1,63 @@
from typing import Any, Dict, Optional
from llama_index.core.llms import LLM
Collaborator

Why import this? It is not used.

api_version: ""
embed_batch_size: 100
dimensions: # output dimension of embedding model
embedding:
Collaborator

Don't change this embedding section.

liaojianxing added 4 commits August 20, 2024 14:54
@lanlanguai
Author

The configurations and results from running metagpt/rag/benchmark/hotpotqa.py with and without the HyDE method are as follows:

| Model | Sample_Size | HyDE_Used | Exact_Match | F1_Score |
|---|---|---|---|---|
| deepseek | 20 | yes | 0.1 | 0.289846 |
| deepseek | 20 | no | 0.1 | 0.265604 |
| gpt4-o | 20 | yes | 0.55 | 0.726190 |
| gpt4-o | 20 | no | 0.45 | 0.626190 |
| gpt4-o | 100 | yes | 0.6 | 0.752560 |
| gpt4-o | 100 | no | 0.57 | 0.741560 |
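
For context on the two metric columns, the sketch below shows the SQuAD-style definitions of Exact Match and token-level F1 that are commonly used for HotpotQA answers. Whether metagpt/rag/benchmark/hotpotqa.py computes them exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Per-question scores like these are averaged over the sample, so with a sample size of 20 each question moves Exact Match by 0.05.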


3 participants