"query in context" field does not work #1633

lukavdplas · 2024-07-18T15:09:12Z

What went wrong?

If you try to download search results with snippets ("query in context"), the download will fail.

What did you expect to happen?

The download should not raise an error.

When this feature still worked, it would add a variable number of untitled columns as the last columns in the CSV. Those would contain snippets of text surrounding matches to the query.

Screenshot

No response

Where did you find the bug?

https://ianalyzer.hum.uu.nl
https://peopleandparliament.hum.uu.nl
https://peace.sites.uu.nl
a server hosted elsewhere (i.e. not by the research software lab)
a local server

Version

5.9.0

Steps to reproduce

Go to https://ianalyzer.hum.uu.nl/search/troonredes?query=nederland&highlight=200 - or go to a search page for a corpus, enter a query term and turn on highlighting
In the field selection for the download menu, click "select all fields" and then select "query in context"
Click "download"

lukavdplas · 2024-07-18T15:32:06Z

The error is in backend/download/create_csv.py. This module uses csv.DictWriter to create CSVs, but this writer does not support variable column numbers.

The DictWriter must receive a list of all fieldnames up front. We don't have that list, because each snippet of a document gets a new column; that means we don't know the number of query-in-context columns before we iterate through the data.
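For illustration, here is a minimal sketch of that failure outside our codebase. The `write_rows` helper and the column names (`id`, `speech`, `context_1`) are made up for this example and are not taken from create_csv.py; only the `csv.DictWriter` behaviour is the real thing. With the default `extrasaction='raise'`, a row that contains a key missing from `fieldnames` raises a `ValueError`.

```python
import csv
import io


def write_rows(rows, fieldnames):
    """Write dict rows to an in-memory CSV with a fixed list of fieldnames."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # Raises ValueError if the row has a key that is not in fieldnames.
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical row: one regular field plus one snippet column that was
# not known when the writer was created.
rows = [{'id': '1', 'speech': 'some text', 'context_1': 'a snippet'}]

failure = None
try:
    write_rows(rows, ['id', 'speech'])
except ValueError as error:
    failure = str(error)

print(failure)
```

Passing `extrasaction='ignore'` would silence the error, but it would also silently drop the snippet columns, so that is not a fix by itself.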

I think this may have worked in the past because all the CSV data was loaded in memory before writing the file. I remember that at some point, we rewrote the download module to avoid this, so it won't overload the memory for large files.

Some solutions, none of them very attractive:

Use csv.writer instead of csv.DictWriter, which won't complain about a variable number of columns, or about some columns lacking headers. However, the resulting CSV is not great, because the number of columns will vary. Some readers will parse them without issue, some won't.
Add headers for an abundant number of context columns (e.g. 100). Cap off the number of snippets at this number.
Reformat the CSV to have a single context column, which contains all snippets separated by newlines.
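A rough sketch of what the second and third options could look like, to make the trade-off concrete. Everything here is hypothetical: the helper names, the `context_N` / `context` column names and the cap of 3 (instead of 100) are assumptions for the example, not how create_csv.py is structured. Both variants keep the fieldnames known up front, so they still work with `csv.DictWriter` and with streaming the rows.

```python
import csv
import io

MAX_SNIPPETS = 3  # hypothetical cap; the proposal above mentions e.g. 100


def capped_context_row(row, snippets, max_snippets=MAX_SNIPPETS):
    """Option 2: a fixed number of context columns, extra snippets dropped."""
    out = dict(row)
    for index in range(max_snippets):
        value = snippets[index] if index < len(snippets) else ''
        out[f'context_{index + 1}'] = value
    return out


def joined_context_row(row, snippets):
    """Option 3: a single context column, snippets separated by newlines."""
    out = dict(row)
    out['context'] = '\n'.join(snippets)
    return out


def to_csv(rows, fieldnames):
    """Write dict rows to an in-memory CSV; fieldnames are fixed up front."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


base = {'id': '1'}
snippets = ['first match', 'second match']

capped_fields = ['id'] + [f'context_{i + 1}' for i in range(MAX_SNIPPETS)]
print(to_csv([capped_context_row(base, snippets)], capped_fields))
print(to_csv([joined_context_row(base, snippets)], ['id', 'context']))
```

In the joined variant the csv module quotes the cell because it contains a newline, so a standards-following reader gets the snippets back intact. Whether spreadsheet programs display multi-line cells nicely is a separate question.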

lukavdplas · 2024-07-18T15:41:53Z

By the way, I was wondering why this isn't picked up by unit tests. The reason is that this test fixture is misleading:

I-analyzer/backend/download/tests/test_csv_results.py

Lines 44 to 49 in a0da9e1

    
    @pytest.fixture()
    def result_csv_with_highlights(csv_directory):
        route = 'parliament-netherlands_query=test'
        fields = ['speech']
        file = create_csv.search_results_csv(hits(mock_es_result), fields, route, 0)
        return file

The name suggests that it's creating a CSV with highlight snippets, but that doesn't match the parameters passed to the search_results_csv function. So it appears like we're testing CSVs with content snippets, but that's not actually happening.

lukavdplas added the bug (something isn't working right) and backend (changes to the django backend) labels Jul 18, 2024

lukavdplas mentioned this issue Jul 19, 2024

Feature/download tab #1635

Merged
