The web scraping landscape has changed dramatically. Five years ago, rotating a few datacenter proxies and setting a realistic User-Agent header was enough to scrape most sites. Today, companies like Cloudflare, Akamai, PerimeterX, and DataDome deploy sophisticated bot detection that analyzes browser fingerprints, mouse movements, TLS signatures, and behavioral patterns.
Here's how we handle these defenses in our production scraping infrastructure.
Understanding the Detection Stack
Modern anti-bot systems operate at multiple layers:
- Network layer: IP reputation scoring, ASN-based blocking (datacenter vs. residential), rate limiting, and geographic anomaly detection
- TLS layer: JA3/JA4 fingerprinting of the TLS handshake to identify automated clients. A plain HTTP library such as Python's requests or curl presents a very different TLS fingerprint than a real Chrome browser
- HTTP layer: Header order analysis, HTTP/2 frame fingerprinting, and cookie validation
- Browser layer: JavaScript challenges that check for navigator properties, WebGL rendering, canvas fingerprints, and automation markers like navigator.webdriver
- Behavioral layer: Mouse movement patterns, scroll behavior, click timing, and session flow analysis
IP and Proxy Management
The foundation of any scraping operation is a well-managed proxy infrastructure:
Proxy Types and When to Use Them
- Datacenter proxies: Fast, cheap, and reliable. Best for sites with minimal bot protection (government portals, academic databases, public data registries)
- Residential proxies: Real consumer IP addresses. Required for sites using IP reputation scoring (e-commerce, social media, travel booking sites)
- Mobile proxies: Carrier-grade NAT IPs shared by thousands of real users. Nearly impossible to block without affecting legitimate traffic. Used for the most heavily protected targets
- ISP proxies: Static residential IPs. Good for maintaining session consistency while appearing residential
Smart Rotation Strategy
Simple round-robin rotation is not enough. Our proxy manager implements:
- Health scoring: Each proxy gets a success rate score. Proxies dropping below 80% are quarantined for 30 minutes
- Session stickiness: For sites that validate IP consistency within a session, we pin a proxy for the duration of that crawl session
- Geographic targeting: Matching proxy location to the target site's expected user geography
- Cost optimization: Using cheaper datacenter proxies first and falling back to residential only when needed
Browser Fingerprint Management
When using headless browsers, you need to look like a real user's browser. This means managing:
- User-Agent consistency: The UA string must match the actual browser version, OS, and platform
- Navigator properties: navigator.webdriver must be false, navigator.plugins must be populated, and navigator.languages must match the proxy location
- Canvas and WebGL: Fingerprint randomization to avoid cross-session tracking
- Screen resolution: Realistic viewport sizes that match the declared device type
- Font enumeration: Consistent font lists that match the declared OS
We use custom Playwright configurations with stealth patches that modify these properties at the browser launch level, avoiding the common pitfall of injecting overrides via runtime JavaScript, which detection scripts can spot by inspecting property descriptors and the toString output of patched functions.
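As an illustration of keeping these properties mutually consistent (not our actual launch code), here is a helper that assembles matching profile and geography data. The keyword names correspond to Playwright's browser.new_context() options; the profile and geo tables themselves are invented for the example.

```python
# Hypothetical fingerprint profiles. Each bundles properties that must stay
# mutually consistent (UA string, platform, viewport); mixing them is a tell.
PROFILES = {
    "win-chrome": {
        "user_agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"),
        "viewport": {"width": 1920, "height": 1080},
    },
    "mac-chrome": {
        "user_agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"),
        "viewport": {"width": 1440, "height": 900},
    },
}

# Locale/timezone pairs keyed by proxy exit country, so navigator.languages
# and the browser clock agree with the IP's geography.
GEO = {
    "us": {"locale": "en-US", "timezone_id": "America/New_York"},
    "de": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
}

def build_context_options(profile: str, proxy_country: str) -> dict:
    """Return keyword arguments for Playwright's browser.new_context()."""
    opts = dict(PROFILES[profile])
    opts.update(GEO[proxy_country])
    return opts
```

Usage would look like context = browser.new_context(**build_context_options("win-chrome", "de")), pairing the profile with the proxy's exit country at pick time.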
CAPTCHA Handling
CAPTCHAs are the most visible anti-bot measure. Here's how we handle different types:
reCAPTCHA v2 (Image Challenges)
Solved using third-party CAPTCHA solving services that route challenges to human solvers. Average solve time: 15–30 seconds. We pre-solve tokens in parallel to minimize crawl delays.
reCAPTCHA v3 (Score-Based)
No visible challenge — instead assigns a bot probability score based on behavior. The key is maintaining a high score by simulating realistic browsing patterns: page dwell time, scroll events, and mouse movement before making the target request.
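One simple way to generate plausible mouse movement is a quadratic Bezier curve with positional jitter, whose waypoints are then replayed through the browser (e.g. Playwright's page.mouse.move). This is an illustrative sketch of the idea, not a recipe that guarantees a good v3 score.

```python
import random

def human_mouse_path(start, end, steps=25):
    """Waypoints along a jittered quadratic Bezier curve from start to end,
    approximating the curved, slightly noisy arc of a real hand."""
    (x0, y0), (x1, y1) = start, end
    # A random control point pulls the path off the straight line.
    cx = (x0 + x1) / 2 + random.uniform(-100, 100)
    cy = (y0 + y1) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x + random.uniform(-1.5, 1.5),
                       y + random.uniform(-1.5, 1.5)))
    points[0], points[-1] = start, end   # endpoints stay exact
    return points
```

Varying the per-step sleep between moves (rather than a fixed interval) matters as much as the path shape, since constant-velocity cursors are an easy behavioral tell.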
hCaptcha
Similar to reCAPTCHA v2 but with different challenge types. We use dedicated hCaptcha solving APIs that handle the accessibility cookie flow for faster resolution.
Cloudflare Turnstile
A newer challenge that runs in the background. We handle this by using real browser sessions with proper TLS fingerprints and letting the Turnstile JavaScript execute naturally.
Adaptive Crawling Strategies
Static crawl configurations break when anti-bot systems update. Our spiders adapt in real-time:
- Response code monitoring: Detecting 403, 429, and soft-block pages (200 status but challenge content) to trigger strategy changes
- Automatic throttling: Reducing request rate when block rates exceed 5% and ramping back up when they normalize
- Rendering fallback: Starting with simple HTTP requests and escalating to headless browser only when JavaScript challenges are detected
- Session recycling: Rotating browser profiles and cookie jars at regular intervals to avoid long-lived session fingerprints
- Time-of-day optimization: Crawling during peak traffic hours when bot detection thresholds are typically higher
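The monitoring and throttling rules above can be condensed into a small controller. The 5% threshold and the 403/429/soft-block triggers match the description above; the challenge-marker strings and backoff constants are illustrative, since real deployments match per-vendor challenge pages.

```python
from collections import deque

# Markers indicating a soft block: HTTP 200 but challenge content.
CHALLENGE_MARKERS = ("cf-challenge", "Just a moment", "captcha")

class AdaptiveThrottle:
    def __init__(self, base_delay=1.0, window=100, block_threshold=0.05):
        self.delay = base_delay
        self.base_delay = base_delay
        self.block_threshold = block_threshold
        self.outcomes = deque(maxlen=window)   # sliding window, True = blocked

    def record(self, status: int, body: str) -> bool:
        """Record one response; adjust the inter-request delay; return
        whether the response was a block (hard or soft)."""
        soft_block = status == 200 and any(m in body for m in CHALLENGE_MARKERS)
        blocked = status in (403, 429) or soft_block
        self.outcomes.append(blocked)
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.block_threshold:
            self.delay = min(self.delay * 2, 60.0)   # back off exponentially
        elif rate == 0 and len(self.outcomes) == self.outcomes.maxlen:
            # A clean full window: ramp back down toward the base rate.
            self.delay = max(self.delay / 2, self.base_delay)
        return blocked
```

The crawler sleeps for self.delay between requests and can escalate to the headless-browser path whenever record() starts returning True for 200-status pages.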
Ethical Considerations
We believe in responsible scraping. Our approach:
- Respect robots.txt and crawl-delay directives
- Avoid overloading target servers — our crawl rate never exceeds what a few hundred human users would generate
- Only extract publicly available data
- Comply with applicable laws including CFAA, GDPR, and local data protection regulations
Need help scraping a site with aggressive bot protection? Talk to our team — we've handled everything from Cloudflare Enterprise to custom-built WAFs.