Add option to only crawl website and not run warc2zim conversion #297

Closed
benoit74 opened this issue May 13, 2024 · 6 comments
@benoit74
Collaborator

For debugging purposes, it might be useful to run only the crawl and skip the warc2zim conversion (which might be known to fail, or even hang forever in an infinite loop).

We should add a --crawl-only CLI argument to support this scenario (and integrate this in the Zimfarm obviously).
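To illustrate the proposal, here is a minimal sketch of how such a flag could gate the conversion step. It is not zimit's actual code: the `--url` argument, the `zimit-sketch` program name, and the `crawl` and `convert` callables are hypothetical placeholders standing in for the crawler and for warc2zim.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only --crawl-only comes from the proposal above.
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument("--url", required=True, help="Seed URL to crawl")
    parser.add_argument(
        "--crawl-only",
        action="store_true",
        help="Run the crawl and keep the WARCs, but skip the warc2zim conversion",
    )
    return parser


def run(argv, crawl, convert):
    """Crawl, then convert unless --crawl-only was passed.

    `crawl` and `convert` are hypothetical callables standing in for the
    crawler and for warc2zim.
    """
    args = build_parser().parse_args(argv)
    warc_dir = crawl(args.url)
    if args.crawl_only:
        # Stop here: the WARCs stay available for debugging.
        return warc_dir
    return convert(warc_dir)
```

With `--crawl-only`, `run` returns whatever the crawl step produced and never calls `convert`; without it, the conversion runs as usual.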

@rgaudin
Member

rgaudin commented May 13, 2024

PR looks good but I'm not sure about the operational value of this

> which might be known to fail

How is that a problem if we can keep the WARCs (and even upload them via Zimfarm)? warc2zim is a speedy process

> or even hang forever in a dead loop

Is this an existing problem? Where's the ticket about this? openzim/warc2zim#132 is zimit1 AFAIK

@benoit74
Collaborator Author

> How is that a problem if we can keep the WARCs (and even upload them via Zimfarm)? warc2zim is a speedy process

Speedy compared to the crawl, yes. But still time consuming for probably nothing.

> Is this an existing problem?

It happened to me during development on my machine, and I don't see why it would not happen in production. I most probably hit a case this morning, but I have not yet had time to collect material to open the corresponding issue.

@benoit74
Collaborator Author

openzim/warc2zim#246

@benoit74
Collaborator Author

But you're right that from an operational point of view, it would make more sense to adapt the Zimfarm worker so that logs and artifacts are uploaded even when a cancellation is requested. It would probably help in multiple cases.
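A rough sketch of that idea, assuming nothing about how the Zimfarm worker is really structured: the `CancellationRequested` exception and the `scrape`, `upload_logs` and `upload_artifacts` callables are all hypothetical names. The point is only that the uploads sit in a `finally` clause, so they run whether the task succeeds, fails, or is cancelled.

```python
class CancellationRequested(Exception):
    """Hypothetical signal raised when a task cancellation is requested."""


def run_task(scrape, upload_logs, upload_artifacts):
    """Run a scraper task and always upload what it left behind.

    All three callables are hypothetical stand-ins, not Zimfarm APIs.
    """
    status = "succeeded"
    try:
        scrape()
    except CancellationRequested:
        status = "cancelled"
    except Exception:
        status = "failed"
    finally:
        # Runs on success, failure, and cancellation alike.
        upload_logs()
        upload_artifacts()
    return status
```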

@rgaudin
Member

rgaudin commented May 13, 2024

Yes, it's a huge frustration point for most scrapers

@benoit74
Collaborator Author

Then, let's close this in favor of openzim/zimfarm#965

benoit74 closed this as not planned on May 13, 2024