Document viewer #486

blakerosenthal · 2024-08-12T18:36:02Z

Add API endpoints for document content
Add source content viewer to UI

Closes #466

pmeier · 2024-08-12T20:00:39Z

ragna/deploy/_api/core.py

+        with get_session() as session:
+            _, metadata = database.get_document(session, user=user, id=id)
+            if "path" not in metadata:
+                raise HTTPException(
+                    status_code=400,
+                    detail="Document path not found",
+                )
+            with aiofiles.open(metadata["path"], "rb") as file:
+                while content := await file.read(1024):
+                    yield content


Our documents have a .read() method that should be used here:

ragna/ragna/core/_document.py

Lines 75 to 77 in 2066dcd

    @abc.abstractmethod
    def read(self) -> bytes: ...

It is not async for now, but there is no need to require documents to be on the local file system. We just need the conversion to the Document object:

ragna/ragna/deploy/_api/schemas.py

Lines 29 to 37 in 2066dcd

    def to_core(self) -> ragna.core.Document:
        return ragna.core.LocalDocument(
            id=self.id,
            name=self.name,
            # TEMP: setting an empty metadata dict for now.
            # Will be resolved as part of the "managed ragna" work:
            # https://github.com/Quansight/ragna/issues/256
            metadata={},
        )

Maybe this is also a good time to resolve this TODO first?
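
A minimal sketch of the idea above, assuming a simplified stand-in for ragna.core.Document. The `InMemoryDocument` class and the `document_content` helper are hypothetical and not part of ragna; they only illustrate that an endpoint relying on `read()` does not need a path on the local file system.

```python
import abc
import uuid
from typing import Optional


class Document(abc.ABC):
    """Simplified stand-in for ragna.core.Document (assumption, not the real class)."""

    def __init__(self, name: str, id: Optional[uuid.UUID] = None) -> None:
        self.id = id or uuid.uuid4()
        self.name = name

    @abc.abstractmethod
    def read(self) -> bytes: ...


class InMemoryDocument(Document):
    """Hypothetical document whose content never touches the local file system."""

    def __init__(self, name: str, content: bytes) -> None:
        super().__init__(name)
        self._content = content

    def read(self) -> bytes:
        return self._content


def document_content(document: Document) -> bytes:
    # Only the abstract read() method is used, so the caller does not
    # care where the bytes actually come from.
    return document.read()
```

With this shape, the "path" lookup in the metadata would no longer be needed by the endpoint itself.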

ragna/deploy/_ui/api_wrapper.py

blakerosenthal · 2024-08-26T18:40:41Z

@pmeier I have two outstanding questions before being ready for a full review:

1. Are we only supporting PDFs right now, or do we want to include all supported types? It doesn't seem too difficult to just send any blob to the browser and let the browser decide whether to display it or offer it as a download.
2. I'm currently just calling the Document.read() method and sending the entire blob to the browser wrapped in a fastapi.Response. I'd be interested in your thoughts on how this should be chunked and streamed properly.

Here's a video of how it looks so far.

Screencast.from.2024-08-26.11-33-04.mp4

pmeier · 2024-08-26T19:50:24Z

> Are we only supporting PDFs right now, or do we want to include all supported types? It doesn't seem too difficult to just send any blob to the browser and let the browser decide whether to display it or offer it as a download.

Does the browser handle this automatically? Meaning, we just pass it a blob and the browser figures out whether it can show it (.pdf, .md, .txt) or offer it as download (.pptx, .docx)? If yes, that would be awesome and we should really have this feature for everything.

> I'm currently just calling the Document.read() method and sending the entire blob to the browser wrapped in a fastapi.Response. I'd be interested in your thoughts on how this should be chunked and streamed properly.

FastAPI has a FileResponse that handles this properly. Downside is that this requires a file on disk to work. Short-term I'd be ok to just read the data and write it to a temporary file that we remove in a background task after the response has gone out. Long-term we need to build our own response type that does the same, but starts from bytes or an iterable thereof.
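
The short-term approach could look roughly like the following sketch. It uses only the standard library; the helper names `write_temporary_copy` and `remove_temporary_copy` are hypothetical, and wiring them into the actual endpoint is left out.

```python
import os
import tempfile


def write_temporary_copy(content: bytes, suffix: str = "") -> str:
    """Write content to a temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so a file-based response can still read it afterwards.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as file:
        file.write(content)
    return file.name


def remove_temporary_copy(path: str) -> None:
    """Cleanup step intended to run after the response has gone out."""
    if os.path.exists(path):
        os.remove(path)
```

In a FastAPI endpoint, the returned path would presumably be handed to FileResponse and `remove_temporary_copy` scheduled as a background task on that response.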

> Here's a video of how it looks so far.

Looks good. #466 also states that

> Ideally the new pane would scroll to and highlight the exact source content in the file.

Is that possible / planned?

blakerosenthal · 2024-08-26T20:04:08Z

> Does the browser handle this automatically? Meaning, we just pass it a blob and the browser figures out whether it can show it (.pdf, .md, .txt) or offer it as download (.pptx, .docx)? If yes, that would be awesome and we should really have this feature for everything.

I thiiink so as long as the right mimetype is specified in the response header. I will confirm and let you know what I find out.

> FastAPI has a FileResponse that handles this properly. Downside is that this requires a file on disk to work. Short-term I'd be ok to just read the data and write it to a temporary file that we remove in a background task after the response has gone out. Long-term we need to build our own response type that does the same, but starts from bytes or an iterable thereof.

Okay, I'd be interested in seeing the level of effort on building our own. It seems like FileResponse is mostly just a wrapper around StreamingResponse that sets the right headers.
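
To gauge the effort, here is a rough standard-library sketch of the two pieces such a response type would need when starting from bytes: an iterator over chunks and the matching headers. The names `iter_chunks` and `content_headers` and the 64 KiB default are assumptions for illustration, not existing ragna or FastAPI APIs.

```python
from typing import Dict, Iterator


def iter_chunks(content: bytes, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield content in consecutive pieces of at most chunk_size bytes."""
    for start in range(0, len(content), chunk_size):
        yield content[start : start + chunk_size]


def content_headers(content: bytes, media_type: str) -> Dict[str, str]:
    """Headers a file-like response would typically set for the body."""
    return {
        "content-type": media_type,
        "content-length": str(len(content)),
    }
```

An iterator like this is the kind of input StreamingResponse accepts, so the remaining work would mostly be deciding which headers to set.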

> Ideally the new pane would scroll to and highlight the exact source content in the file.
>
> Is that possible / planned?

I need to think a bit more about how this could work. Might be best suited for a followup PR.

blakerosenthal · 2024-08-29T15:41:44Z

@pmeier This is ready for another look! This is what the source viewer looks like now after putting the accordion widget back and adding a new button. I also added some MIME types to our supported documents so the browser knows what to do with them when it receives the blob. In my browser, PDFs open in a new tab, text is displayed as HTML, and Word and PowerPoint files are downloaded. For sources with page numbers, there's now also an anchor that scrolls the view to the right page for in-browser sources.
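
One way such a page anchor could work, as a sketch only: built-in browser PDF viewers commonly honor a `#page=N` URL fragment. The helper name `with_page_anchor` and the URLs in the usage are hypothetical, and the PR may build the anchor differently.

```python
from typing import Optional


def with_page_anchor(url: str, page: Optional[int]) -> str:
    """Append a '#page=N' fragment so a PDF viewer opens at that page."""
    if page is None:
        return url
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Drop any existing fragment before adding the page fragment.
    base = url.split("#", 1)[0]
    return f"{base}#page={page}"
```

Sources without page numbers simply keep the plain URL, which matches the behavior described above for non-paginated documents.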

ragna/core/_document.py

ragna/deploy/_api/schemas.py

ragna/deploy/_ui/central_view.py

pmeier

This looks great, thanks Blake! One question though: in case the browser downloads the file, I get something like "{uuid.UUID}.docx". Why isn't this the proper file name?

blakerosenthal · 2024-09-02T21:43:03Z

> This looks great, thanks Blake! One question though: in case the browser downloads the file, I get something like "{uuid.UUID}.docx". Why isn't this the proper file name?

That's a typo, thanks for catching!
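
For context, browsers usually take the download name from the Content-Disposition response header. A minimal sketch of building that header value from the document name follows; the `content_disposition` helper is hypothetical and not necessarily how the PR does it.

```python
from urllib.parse import quote


def content_disposition(filename: str, inline: bool = True) -> str:
    """Build a Content-Disposition header value carrying the document name."""
    disposition = "inline" if inline else "attachment"
    # Percent-encode the name for the RFC 5987 'filename*' form so that
    # spaces and non-ASCII characters survive the HTTP header.
    encoded = quote(filename, safe="")
    return f"{disposition}; filename*=UTF-8''{encoded}"
```

Using "inline" lets the browser display types it understands and still fall back to a download, in which case the given name is used.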

pmeier · 2024-09-05T19:10:03Z

ragna/core/_document.py

@@ -24,6 +24,15 @@ class DocumentUploadParameters(BaseModel):
    data: dict


+_MIME_TYPES = {


As discussed offline, let's use mimetypes from the standard library.
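
A short sketch of what that could look like with `mimetypes.guess_type`. The `guess_mime_type` wrapper and its fallback value are assumptions for illustration.

```python
import mimetypes


def guess_mime_type(name: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from a file name, falling back to a generic default."""
    # guess_type returns a (type, encoding) tuple; type is None if unknown.
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type or default
```

This would replace the hand-maintained `_MIME_TYPES` mapping, at the cost of depending on the platform's MIME type tables for less common extensions.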

pmeier · 2024-09-05T19:12:07Z

ragna/core/_document.py

    ):
        self.id = id or uuid.uuid4()
        self.name = name
        self.metadata = metadata
        self.handler = handler or self.get_handler(name)
+        self.mime_type = mime_type or self.parse_mime_type(name)


We need to store this in the DB as well. Otherwise any MIME type set by the user will be overridden as soon as we pull it from the DB, because the default would be used.
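
The problem and the fix can be sketched as a round trip through storage. Everything below is a simplified stand-in: `DocumentRecord`, `to_record`, and `from_record` are hypothetical names, and the real ragna ORM model will look different.

```python
import mimetypes
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    """Hypothetical database row; the mime_type column is the point here."""

    name: str
    mime_type: str


class Document:
    """Simplified stand-in for the Document constructor shown in the diff."""

    def __init__(self, name: str, mime_type: Optional[str] = None) -> None:
        self.name = name
        # Fall back to a guess from the name only when nothing was given.
        self.mime_type = (
            mime_type
            or mimetypes.guess_type(name)[0]
            or "application/octet-stream"
        )


def to_record(document: Document) -> DocumentRecord:
    return DocumentRecord(name=document.name, mime_type=document.mime_type)


def from_record(record: DocumentRecord) -> Document:
    # Passing the stored value keeps a user-supplied MIME type intact.
    return Document(name=record.name, mime_type=record.mime_type)
```

Without the `mime_type` column, `from_record` could only pass the name, and a user-supplied type such as "text/markdown" for a ".txt" file would silently revert to the guessed default.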

pmeier reviewed Aug 12, 2024

blakerosenthal force-pushed the document-viewer branch from 2b38d59 to 8f7c25e on August 19, 2024 21:42

blakerosenthal added 6 commits August 26, 2024 16:11

use buttons for source info in right sidebar

54fe725

add document content endpoint

0ff0e77

WIP

28586a8

more wip

cb2713f

works; no streaming or caching

074e144

handle more file types

0d17d06

blakerosenthal force-pushed the document-viewer branch from 68d7973 to 0d17d06 on August 26, 2024 23:28

blakerosenthal added 4 commits August 26, 2024 16:42

add go-to-page function

ec7505d

put accordion back

5f567a5

cleanup

4e69201

remove superfluous test

456de41

blakerosenthal marked this pull request as ready for review August 29, 2024 15:35

pmeier reviewed Aug 30, 2024

pmeier added 2 commits August 30, 2024 14:34

Merge branch 'corpus-dev' into document-viewer

c2f69cc

mypy

9d68c32

pmeier reviewed Aug 30, 2024

blakerosenthal added 2 commits September 2, 2024 14:39

revert async callback

23c0d3d

use document_name in file download

41533ad

blakerosenthal added 2 commits September 5, 2024 09:17

lookup MIME-type on Document

81bdf33

don't use mutable param

0dfccb8

pmeier reviewed Sep 5, 2024
