How does crawl4ai handle pages not found and time-limit-exceeded results in a loop of the crawl function #145

Open
Mahizha-N-S opened this issue Oct 8, 2024 · 3 comments

Comments

@Mahizha-N-S

While crawling multiple URLs, how does the crawler handle a web URL that is not found on the net (404 Page Not Found)?

@unclecode unclecode self-assigned this Oct 8, 2024
@unclecode unclecode added the question Further information is requested label Oct 8, 2024
@Mahizha-N-S (Author) commented Oct 16, 2024

After working through my project, I found this is the response for unresolved webpages:


```
[ERROR] 🚫 Failed to crawl https://www.whattodonowiamnotapersonseemee.com/, error: Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"

{"level":"ERROR","time":"Wed Oct 16 2024 12:18:20 IST+0530","name":"FastAPI Python Server","msg":"Error in crawling https://www.whattodonowiamnotapersonseemee.com/, Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"
"}
```

And this is the response when the HTML content can't be extracted:

```
[ERROR] 🚫 Failed to crawl https://mercedes-benz.com, error: Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str
{"level":"ERROR","time":"Wed Oct 16 2024 12:17:32 IST+0530","name":"FastAPI Python Server","msg":"Error in crawling https://mercedes-benz.com, Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str"}
{"level":"INFO","time":"Wed Oct 16 2024 12:18:22 IST+0530","name":"FastAPI Python Server","msg":"Skipping URL: https://mercedes-benz.com due to empty content."}
```

If this happens inside arun_many, will it put the entire crawler function into an error state? Could we have something like a failed_urls list that records each failed URL along with the reason (see the sketch at the end of this comment)?

Also, during crawling the token limit of my model was exceeded, which resulted in an infinite loop of crawling in my program, so I added an exception handler for it.
Hope this gives ideas for enhancement!
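
A minimal sketch of the failed_urls idea, built by the caller on top of arun_many; this is an assumption about how it could be done today, not a built-in crawl4ai feature. It assumes arun_many returns one CrawlResult per URL (with success and error_message fields) rather than raising for the whole batch.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://mercedes-benz.com",
        "https://www.whattodonowiamnotapersonseemee.com/",
    ]
    async with AsyncWebCrawler() as crawler:
        # Assumed: a failed URL yields a result with success=False
        # instead of aborting the whole batch.
        results = await crawler.arun_many(urls=urls)

    # The requested failed_urls list: each failed URL with its reason.
    failed_urls = [(r.url, r.error_message) for r in results if not r.success]
    for url, reason in failed_urls:
        print(f"FAILED {url}: {reason}")

asyncio.run(main())
```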

@Mahizha-N-S Mahizha-N-S changed the title from "How does the crawl4ai handle the pages with 404 Not Found" to "How does crawl4ai handle pages not found and time-limit-exceeded results in a loop of the crawl function" Oct 16, 2024
@unclecode (Owner)

@Mahizha-N-S Thx for the suggestion, appreciate it. For pages that do not exist, like a 404, there are two things to know. First, success is true in the returned result, and the content is whatever that website returns, because not all websites actually return a 404 status code; but the status code is also part of the result, so you can filter based on it. Second, the latest version has a page timeout parameter, so you can set the page timeout to any amount you want (a sketch of both is below). Regarding the token limit, I don't understand it yet; if you share a code snippet, I can try it on my end.
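
A minimal sketch of the two points above, assuming the page_timeout parameter (in milliseconds) is passed directly to arun as in versions current at the time, and that the HTTP status code is exposed as status_code on the result; exact parameter placement may differ between crawl4ai versions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/might-not-exist",
            page_timeout=30000,  # assumed: navigation timeout in ms
        )
        # success can be True even for a soft 404 (the site returned *some*
        # page), so filter on the status code carried in the result.
        if result.status_code and result.status_code >= 400:
            print(f"Got HTTP {result.status_code}, treating as not found")
        elif not result.success:
            print(f"Crawl error: {result.error_message}")
        else:
            print(result.markdown[:200])

asyncio.run(main())
```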

@Mahizha-N-S (Author)

@unclecode Thanks for the reply. I got what you are saying regarding the time-out; is this updated in the docs examples for reference? Regarding the model limit being exceeded: I meant that if I use Groq or any other token-limited provider, and there are many URLs to scrape, I observed in the terminal that the error log was looping. So maybe if the model errors during arun_many, we could catch the exception (a rough sketch of that idea is below)? This was just what I observed; hope it makes sense.
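
A hypothetical sketch of the workaround described above: wrap each LLM-backed crawl in a try/except so a rate- or token-limit error from the provider fails that one URL instead of looping. The provider string, API key placeholder, and instruction are illustrative assumptions, not values from this thread.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def crawl_with_guard(urls):
    strategy = LLMExtractionStrategy(
        provider="groq/llama3-8b-8192",  # placeholder provider/model
        api_token="YOUR_GROQ_KEY",       # placeholder key
        instruction="Summarize the page content",
    )
    failed_urls = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url, extraction_strategy=strategy)
                if not result.success:
                    failed_urls.append((url, result.error_message))
            except Exception as exc:
                # A Groq token/rate-limit error lands here once, is recorded,
                # and the loop moves on instead of retrying forever.
                failed_urls.append((url, f"model error: {exc}"))
    return failed_urls
```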
