Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pinecone integration #189

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Pinecone integration #189

wants to merge 7 commits into from

Conversation

iuliadmtru
Copy link
Contributor

@iuliadmtru iuliadmtru commented Aug 6, 2024

This is an attempt to integrate Pinecone into PromptingTools. The preparation and retrieval steps are not necessary with Pinecone. The result of the Pinecone query serves as context in the RAGResult.

Questions

  • The new types have a PT prefix such as to not cause confusion with Pinecone.jl (there's already a PineconeIndex there, which is a different thing). That is not great.
  • I added a NoChunker type such that PTPineconeIndexer can still have a chunker. Is that best? Probably not.
  • I think the dispatch of find_tags might not be necessary.
  • Is it necessary to have a new abstract type (AbstractPTPineconeIndex)? Probably not.
  • I also added a new template, JuliaRAGAssistant, and I put that under persona-task. I'm not sure that's the best place.

@iuliadmtru iuliadmtru force-pushed the pinecone branch 2 times, most recently from 27729fc to f6c01c8 Compare August 6, 2024 18:53
@svilupp
Copy link
Owner

svilupp commented Aug 6, 2024

Thanks for the PR! I'll review it properly later.

The first reaction was that we won't need a custom index (to avoid duplication), but you might be right that it's the easiest way.

Questions for you:

  • do you keep your document chunks or use the Pinecone to maintain them? (I think so)
  • do you embed your documents with OpenAI or use Pinecone?
  • do you rerank documents inside Pinecone (I think not?)

(some notes for me)
I'll check for the following functionality

  • upload documents + upload embeddings (index composition/some postprocessing of ChunkIndex?)
  • getting embeddings (get_embeddings) if we use the embedding by Pinecone
  • find closest docs (find_closest if we only return chunk positions, maybe we could hack documents retrieved in a special CandidateChunks)
  • getting document chunks (either chunks of subindex or maybe use getindex())

@iuliadmtru
Copy link
Contributor Author

do you keep your document chunks or use the Pinecone to maintain them?

do you embed your documents with OpenAI or use Pinecone?

The chunks and embeddings get uploaded to Pinecone and then Pinecone takes care of sorting/clustering and all that when retrieving.

do you rerank documents inside Pinecone

Well, no, but it's possible I think: https://docs.pinecone.io/guides/inference/rerank

"_type": "metadatamessage"
},
{
"content": "Act as a world-class Julia language programmer with access to the latest Julia-related knowledge via Context Information. \n\n**Instructions:**\n- Answer the question based only on the provided Context.\n- Be precise and answer only when you're confident in the high quality of your answer.\n- Be brief and concise.\n\n**Context Information:**\n---\n{{context}}\n---\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be further down the line but I'd recommend adding stronger wording about refusing to answer when the context is bad / wrong to avoid weak answer. Knowing when to refuse is a key for hallucination minimization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be removed from here altogether. It has nothing to do with Pinecone.

Project.toml Outdated
@@ -7,15 +7,18 @@ version = "0.45.0"
AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
DotEnv = "4dc1fcf4-5e3b-5448-94ab-0c38ec0385c1"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just a Draft, but these deps would have to go out and Pinecone would be an extension

src/Experimental/RAGTools/generation.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/preparation.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/preparation.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/retrieval.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/retrieval.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/retrieval.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/types.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/types.jl Outdated Show resolved Hide resolved
@svilupp
Copy link
Owner

svilupp commented Aug 8, 2024

Apologies for scattering comments all over!

One thing, Pinecone should be relevant for "retrieve" step. It should need very little changed for generate! step (only build_context needs work because we build overlap/window extension).
I think it would be good to step back and decide on what do you want to be handled by PT.

Eg, If you say:

  • we want to chunk documents externally (not sure why)
  • we don't want to build index in PT (ie, chunk, embed, upload)
  • we DO want to manage embeddings by PT (risking inconsistency/bad results otherwise)
  • we DON't want to use any tag/tag filtering (predicate pushdown)
  • we DON't want reranking

Option 1) The the simplest path should be to have a lightweight PineconeIndex + method for retrieve (manually writing everything) + method for build_context (to avoid creating the overlaps which are not possible with 1:1 retrieval).

Option 2) (this also promotes more re-use and optional addition of other methods as the RAG paradigm progresses)
However, the retrieve would be so butchered that it might be better to:

  • define No-op for get_tags, find_tags (we have that so just extend), rerank
  • add methods for get_embeddings, find_closest (-> uses Pinecone.query), build_context

Option 3) (I would recommend this)
I'd keep the current paradigm: the index (stores chunks + embedding/ remotely) + candidatechunks (stores query results/or its reference).
The difference would be that we create a custom candidatechunks object that will hold the chunks, metadata, sources (not the "references" to index)
We create support for chunks(index,candidatechunks) or we continue the getindex with index[candidatechunks,:chunks].
This is slightly more work/thinking upfront, but then you can re-use all existing methods for embeddings, reranking, build_context, etc. the benefit is that to improve your RAG, you can just try different methods without having to "tweak" each one of them first.

The last option requires also some methods for find_closest (your "query"). And a few extensions to support the No-Op with new Index (eg, in find_tags).

What do you think?

@iuliadmtru
Copy link
Contributor Author

Sorry for butchering your API! I didn't mean to actually hit click on creating this PR, it is definitely far from "ready"!

Thanks for all the comments. I would go for your recommendation (option 3), but I am still unsure how to go about that. @pankgeorg what do you say?

@iuliadmtru
Copy link
Contributor Author

iuliadmtru commented Aug 15, 2024

I integrated all feedback, I hope. There's still some TODOs left and some choices left to make, as we discussed. I'll try to include everything below:

Comments

  • The view should be properly implemented. Probably this will be done by adding a new struct, SubManagedIndex, similar to SubChunkIndex.
  • Names are still a bit weird, I just added them like you first suggested @svilupp cause I couldn't find something better (AbstractCandidateWithChunks).
  • Right now, build_index is just calling the PineconeIndex constructor. This could be an option for creating an Index (just calling the constructor); another option could be handling chunking and embedding as well, and adding them to Pinecone. Maybe both would be best?

indexid(cc::CandidateWithChunks) = cc.index_id
positions(cc::CandidateWithChunks) = cc.positions
scores(cc::CandidateWithChunks) = cc.scores
chunks(cc::CandidateWithChunks) = cc.chunks
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will need these accessors but we can clean it up later

Copy link
Owner

@svilupp svilupp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good so far! We can clean it up later.

As you pointed out, the build_index and the SubManagedIndex are index (for the view op).

It would be amazing to add a minimal example (you can grab some example documents from the docstrings of airag) to show how it works end-to-end

@iuliadmtru
Copy link
Contributor Author

iuliadmtru commented Aug 26, 2024

TODOs:

  • Add chunking, embedding etc in build_index
  • Add SubManagedIndex + view implementation
  • Write docs properly
  • Add examples in the docs
  • Clean up unnecessary methods
  • Clean up dependencies

@iuliadmtru
Copy link
Contributor Author

I got to the cleaning up point. There are also a bunch of TODOs scattered around, minor aspects that need fixing. I will do the end-to-end example once those are finished.

@svilupp would you mind having another look at this, please?

Copy link
Owner

@svilupp svilupp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! A few minor questions around the Pinecone index and its fields.

Btw do you have like a minimal running example? I have a feeling this wouldnt work for airag

Project.toml Outdated Show resolved Hide resolved
src/Experimental/RAGTools/generation.jl Outdated Show resolved Hide resolved

Dispatch for `AbstractManagedIndex`.
"""
function answer!(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a separate method? I believe the index isnt used so maybe we just need to allow the new type in the definition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was an ambiguity issue with this, that's why I added a different function instead of providing the index type as a Union.

src/Experimental/RAGTools/preparation.jl Outdated Show resolved Hide resolved
src/Experimental/RAGTools/retrieval.jl Outdated Show resolved Hide resolved
end
HasKeywords(::PineconeIndex) = false
HasEmbeddings(::PineconeIndex) = true
embeddings(index::PineconeIndex) = index.embeddings
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have these accessors and fields if we know that the data isn’t there? we hold the data in the individual candidatechunk objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are removed, then getting the embeddings in retrieve should be changed.

@iuliadmtru
Copy link
Contributor Author

I integrated your feedback @svilupp. I hope I understood your comments well enough. I left some TODOs and also added a MWE for airag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants