How does crawl4ai handle Not Found pages and time-limit-exceeded results in the crawl loop? #145
Comments
After working through my project, I found this is how the response looks for unresolved webpages:
and
If this happens inside arun_many, will it put the entire crawler function into an error state? Could we have something like a failed_urls list that records each failed URL along with the reason? Also, during crawling the token limit of my model was exceeded, which resulted in an infinite loop in my program, so I added an exception handler for it.
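A rough sketch of the kind of failed_urls bookkeeping being suggested here, assuming the AsyncWebCrawler API and CrawlResult fields such as success, status_code, and error_message (exact field names may vary by crawl4ai version; the URLs are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/exists",
        "https://example.com/does-not-exist",  # hypothetical missing page
    ]
    failed_urls = []  # (url, reason) pairs, as suggested above

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)

    for result in results:
        # Treat anything that is not a clean success or that returned an
        # error-range status code as a failure and record why.
        if not result.success or (result.status_code and result.status_code >= 400):
            reason = result.error_message or f"HTTP {result.status_code}"
            failed_urls.append((result.url, reason))

    for url, reason in failed_urls:
        print(f"FAILED: {url} -> {reason}")

asyncio.run(main())
```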
@Mahizha-N-S Thx for the suggestion, appreciate it. For pages that do not exist, like a 404, there are two things to note. The returned result has success set to true, but the content is whatever that website returns, because not all websites actually return a 404 status code. However, the status code is also part of the result, so you can filter based on it. The other thing is that the latest version has a page timeout parameter, so you can set the page timeout to any value you want. Regarding the token limit, I don't quite follow; if you share the code snippet, I can try it on my end.
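The filtering-by-status-code and page timeout points above might look roughly like this; the page_timeout keyword (in milliseconds, passed directly to arun) follows the versions around the time of this issue and may have moved into a run config object in later releases:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/maybe-missing",  # placeholder URL
            page_timeout=30_000,  # give up on the page load after 30 seconds
        )

        # success can be True even for a missing page, because the site still
        # returned some content; filter on the status code instead.
        if result.status_code == 404:
            print("Page not found, skipping")
        elif result.success:
            print(result.markdown[:200])
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```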
@unclecode Thanks for the reply, I get what you are saying about the timeout — is this updated in the docs examples for reference? Regarding the model limit being exceeded: I meant that if I use Groq or any other token-limited provider, and there are many URLs to scrape, I observed in the terminal that the error log repeated in a loop. So maybe if the model errors during arun_many, we could catch the exception? This is just what I observed, hope it makes sense.
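One way to keep a provider error (such as a Groq token or rate limit) from stalling the whole run is to catch it per URL and move on. This is only a sketch: the LLMExtractionStrategy arguments, the provider string, and passing the strategy directly to arun are assumptions based on the crawl4ai API of that period, and the model name and key are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

    # Hypothetical Groq-backed extraction; adjust provider/model/key to your setup.
    strategy = LLMExtractionStrategy(
        provider="groq/llama-3.1-8b-instant",
        api_token="YOUR_GROQ_KEY",
        instruction="Summarize the page",
    )

    failed_urls = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url, extraction_strategy=strategy)
                if not result.success:
                    failed_urls.append((url, result.error_message))
            except Exception as exc:
                # Catch provider-side errors (token/rate limits etc.) so one
                # failing call does not loop or abort the remaining URLs.
                failed_urls.append((url, str(exc)))

    print(failed_urls)

asyncio.run(main())
```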
While crawling multiple URLs, how does the crawler handle a web URL that is not found (404 Page Not Found)?