Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"query in context" field does not work #1633

Open
2 of 5 tasks
lukavdplas opened this issue Jul 18, 2024 · 2 comments
Open
2 of 5 tasks

"query in context" field does not work #1633

lukavdplas opened this issue Jul 18, 2024 · 2 comments
Labels
backend changes to the django backend bug something isn't working right

Comments

@lukavdplas
Copy link
Contributor

What went wrong?

If you try to download a search results with snippets ("query in context"), the download will fail.

What did you expect to happen?

The download should not raise an error.

When this feature still worked, it would add a variable number of untitled columns as the last columns in the CSV. Those would contain snippets of text surrounding matches to the query.

Screenshot

No response

Where did you find the bug?

Version

5.9.0

Steps to reproduce

@lukavdplas lukavdplas added bug something isn't working right backend changes to the django backend labels Jul 18, 2024
@lukavdplas
Copy link
Contributor Author

lukavdplas commented Jul 18, 2024

The error is in backend/download/create_csv.py. This module uses csv.DictWriter to create CSVs, but this writer does not support variable column numbers.

The DictWriter must receive a list of all fieldnames up front. We don't have that list, because each snippet of a document gets a new column; that means we don't know the number of query-in-context columns before we iterate through the data.

I think this may have worked in the past because all the CSV data was loaded in memory before writing the file. I remember that at some point, we rewrote the download module to avoid this, so it won't overload the memory for large files.

Some solutions, none of them very attractive:

  • Use csv.writer instead of csv.Dictwriter, which won't complain about a variable number of columns, or about some columns lacking headers. However, the resulting CSV is not great, because the number of columns will vary. Some readers will parse them without issue, some won't.
  • Add headers for an abundant number of context columns (e.g. 100). Cap off the number of snippets at this number.
  • Reformat the CSV to have a single context column, which contains all snippets separated by newlines.

@lukavdplas
Copy link
Contributor Author

By the way, I was wondering why this isn't picked up by unit tests. The reason is that this test fixture is misleading:

@pytest.fixture()
def result_csv_with_highlights(csv_directory):
route = 'parliament-netherlands_query=test'
fields = ['speech']
file = create_csv.search_results_csv(hits(mock_es_result), fields, route, 0)
return file

The name suggests that it's creating a CSV with highlight snippets, but that doesn't match the parameters passed to the search_results_csv function. So it appears like we're testing CSVs with content snippets, but that's not actually happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend changes to the django backend bug something isn't working right
Projects
None yet
Development

No branches or pull requests

1 participant