How does crawl4ai handle pages not found and time-limit-exceeded results in a loop of the crawl function #145

Open
Mahizha-N-S opened this issue Oct 8, 2024 · 3 comments

Comments

@Mahizha-N-S

While crawling multiple URLs, how does the crawler handle a web URL that is not found on the net (404 Page Not Found)?

@unclecode unclecode self-assigned this Oct 8, 2024
@unclecode unclecode added the question Further information is requested label Oct 8, 2024
@Mahizha-N-S (Author) commented Oct 16, 2024

After working through my project, I found this is the response for unresolved webpages:


```
[ERROR] 🚫 Failed to crawl https://www.whattodonowiamnotapersonseemee.com/, error: Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"

{"level":"ERROR","time":"Wed Oct 16 2024 12:18:20 IST+0530","name":"FastAPI Python Server","msg":"Error in crawling https://www.whattodonowiamnotapersonseemee.com/, Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"
"}
```

And this is the response when the HTML content can't be extracted:

```
[ERROR] 🚫 Failed to crawl https://mercedes-benz.com, error: Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str
{"level":"ERROR","time":"Wed Oct 16 2024 12:17:32 IST+0530","name":"FastAPI Python Server","msg":"Error in crawling https://mercedes-benz.com, Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str"}
{"level":"INFO","time":"Wed Oct 16 2024 12:18:22 IST+0530","name":"FastAPI Python Server","msg":"Skipping URL: https://mercedes-benz.com due to empty content."}
```

If this happens inside arun_many, will it put the entire crawler function into an error state? Could we have something like a failed_urls list that records each failed URL along with the reason (see the sketch at the end of this comment)?

Also, during crawling the token limit of my model was exceeded, which resulted in an infinite loop of crawling in my program, so I added an exception handler for it.
Hope this gives ideas for enhancement!
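
A minimal sketch of the failed_urls idea, built by the caller on top of arun_many; this is an assumption about how it could be done today, not a built-in crawl4ai feature. It assumes arun_many returns one CrawlResult per URL (with success and error_message fields) rather than raising for the whole batch.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://mercedes-benz.com",
        "https://www.whattodonowiamnotapersonseemee.com/",
    ]
    async with AsyncWebCrawler() as crawler:
        # Assumed: a failed URL yields a result with success=False
        # instead of aborting the whole batch.
        results = await crawler.arun_many(urls=urls)

    # The requested failed_urls list: each failed URL with its reason.
    failed_urls = [(r.url, r.error_message) for r in results if not r.success]
    for url, reason in failed_urls:
        print(f"FAILED {url}: {reason}")

asyncio.run(main())
```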

@Mahizha-N-S Mahizha-N-S changed the title from "How does the crawl4ai handle the pages with 404 Not Found" to "How does crawl4ai handle pages not found and time-limit-exceeded results in a loop of the crawl function" Oct 16, 2024
@unclecode (Owner)

@Mahizha-N-S Thx for the suggestion, appreciate it. For pages that do not exist, like a 404, there are two things to know. First, success is true in the returned result, and the content is whatever that website returns, because not all websites actually return a 404 status code; but the status code is also part of the result, so you can filter based on it. Second, the latest version has a page timeout parameter, so you can set the page timeout to any amount you want (a sketch of both is below). Regarding the token limit, I don't understand it yet; if you share a code snippet, I can try it on my end.
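
A minimal sketch of the two points above, assuming the page_timeout parameter (in milliseconds) is passed directly to arun as in versions current at the time, and that the HTTP status code is exposed as status_code on the result; exact parameter placement may differ between crawl4ai versions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/might-not-exist",
            page_timeout=30000,  # assumed: navigation timeout in ms
        )
        # success can be True even for a soft 404 (the site returned *some*
        # page), so filter on the status code carried in the result.
        if result.status_code and result.status_code >= 400:
            print(f"Got HTTP {result.status_code}, treating as not found")
        elif not result.success:
            print(f"Crawl error: {result.error_message}")
        else:
            print(result.markdown[:200])

asyncio.run(main())
```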

@Mahizha-N-S (Author)

@unclecode Thanks for the reply. I got what you are saying regarding the time-out; is this updated in the docs examples for reference? Regarding the model limit being exceeded: I meant that if I use Groq or any other token-limited provider, and there are many URLs to scrape, I observed in the terminal that the error log was looping. So maybe if the model errors during arun_many, we could catch the exception (a rough sketch of that idea is below)? This was just what I observed; hope it makes sense.
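
A hypothetical sketch of the workaround described above: wrap each LLM-backed crawl in a try/except so a rate- or token-limit error from the provider fails that one URL instead of looping. The provider string, API key placeholder, and instruction are illustrative assumptions, not values from this thread.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def crawl_with_guard(urls):
    strategy = LLMExtractionStrategy(
        provider="groq/llama3-8b-8192",  # placeholder provider/model
        api_token="YOUR_GROQ_KEY",       # placeholder key
        instruction="Summarize the page content",
    )
    failed_urls = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url, extraction_strategy=strategy)
                if not result.success:
                    failed_urls.append((url, result.error_message))
            except Exception as exc:
                # A Groq token/rate-limit error lands here once, is recorded,
                # and the loop moves on instead of retrying forever.
                failed_urls.append((url, f"model error: {exc}"))
    return failed_urls
```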
