
Severe issue in Search()._scrape() #610

Open · bumatic opened this issue Sep 27, 2023 · 1 comment

bumatic commented Sep 27, 2023

internetarchive version: 3.5.0 (Python version and OS appear irrelevant to this issue)

When searching with queries that return more than 10,000 items, e.g. mediatype:software, the following error is always raised:

# in Search()._scrape()
if i != num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')
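For reference, a minimal reproduction sketch (assuming the standard search_items() entry point; iterating over a large result set goes through Search()._scrape()):

from internetarchive import search_items

# Any query matching more than 10,000 items forces the scrape API path.
search = search_items('mediatype:software')

# Iterating the full result set eventually trips the i != num_found check
# and raises ReadTimeout, even though all documents were received.
for result in search:
    pass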

When tracing the issue, I found that the named r.request.url is retrieved correctly. In fact, my results contained one more item than the API reported: for the query mediatype:software, i ends up as 1043904 while num_found is 1043903.

I don’t know why the API returns more results than it reports for this query, but raising a ReadTimeout based on the condition i != num_found is too restrictive, especially since self._handle_scrape_error(j) is invoked earlier and should already catch errors.

Nevertheless, I assume this condition was included for a reason that I cannot figure out right now, so I can only offer rough ideas for resolving the issue. Two that come to mind: remove the if conditional altogether (and potentially enhance self._handle_scrape_error(j)), or weaken the condition to if i < num_found, as sketched below.
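A minimal sketch of the weakened check (not the library’s actual code; it only raises when the scrape returned fewer documents than the API promised):

# Only treat *missing* documents as a failed scrape; one extra document
# beyond num_found is apparently harmless.
if i < num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')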

P.S.: I checked for duplicate issues but could not find any. I can provide a complete traceback, but since I identified the problem it seemed redundant. Let me know if I am wrong and you want me to post it anyway.

jjjake (Owner) commented Oct 2, 2023

Thanks for the report @bumatic. This was added to deal with an issue on the archive.org side of things (a timeout on the backend causing the search API to fail silently). The aggressive doc-count check is there to avoid someone thinking they dumped a full result set when in fact they haven’t.

Let me look into this more and give it some thought. Thanks again for reporting, and sorry for the trouble.
