
Bypassing automated crawler detection by Firewalls #136

Open
dnmahendra opened this issue Oct 6, 2024 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@dnmahendra

Is there a solution for websites behind WAFs like PerimeterX, Cloudflare, Akamai, etc.?

@unclecode unclecode self-assigned this Oct 9, 2024
@unclecode unclecode added the question Further information is requested label Oct 9, 2024
@unclecode
Owner

Thank you for raising this important question about bypassing Web Application Firewalls (WAFs) like PerimeterX, Cloudflare, and Akamai. While completely bypassing advanced WAFs can be challenging and may raise ethical concerns, Crawl4ai already has several features that can help mitigate some basic anti-bot measures:

  1. User-Agent Customization: You can set a custom User-Agent to mimic legitimate browser requests:

    crawler.crawler_strategy.update_user_agent("Your Custom User-Agent")
  2. Proxy Support: Use proxies to distribute requests across different IP addresses:

    crawler = AsyncWebCrawler(proxy="http://your-proxy-url:port")
  3. JavaScript Execution: Crawl4ai can execute JavaScript, which is crucial for rendering dynamic content:

    result = await crawler.arun(url="https://example.com", js_code="Your JavaScript Code")
  4. Session-Based Crawling: Maintain sessions to mimic human-like browsing behavior:

    result = await crawler.arun(url="https://example.com", session_id="unique_session_id")
  5. Custom Headers: Set custom headers to include necessary cookies or authentication information:

    crawler.crawler_strategy.set_custom_headers({"Cookie": "your_cookie_value"})

These features can help in many cases, though they are not aimed specifically at defeating WAFs. For tougher scenarios we're considering additional features such as enhanced browser fingerprinting, CAPTCHA handling, human-like behaviour simulation, and more.

I'd love to hear more about your specific use case. Are there particular websites or WAFs you're encountering issues with?

@DungNguyen83
Copy link

It would be nice if you could confirm your code can scrape this page: https://www.realestate.com.au/property-house-qld-indooroopilly-145860356? @unclecode
