In e-commerce, pricing is one of the biggest levers for conversion: shoppers compare across sites in seconds, and even a 1% price difference can shift market share quickly. Yet most retailers still rely on spot-checking competitor prices manually or using tools that only cover a fraction of their catalog. At scale (100K+ SKUs across dozens of competitor sites) you need purpose-built scraping infrastructure.
What to Extract
A comprehensive price monitoring dataset goes beyond just the sticker price. Here's the full data model we typically extract per product:
- Product identifiers: Title, SKU, UPC/EAN, ASIN, brand, and category hierarchy
- Pricing data: Regular price, sale price, MAP price, price per unit, bulk/tiered pricing, and currency
- Availability: In-stock status, stock quantity (if visible), estimated delivery date, and fulfillment method (FBA, FBM, dropship)
- Seller information: Seller name, rating, review count, and Buy Box winner (on marketplaces)
- Promotions: Coupon codes, bundle deals, loyalty pricing, and flash sale indicators
- Metadata: Product URL, scrape timestamp, and data freshness indicator
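The field list above can be sketched as a single record type. This is an illustrative schema, not a fixed contract: field names are our assumptions, and a subset of the fields is shown for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PriceRecord:
    """Illustrative sketch of the per-product record described above."""
    # Product identifiers
    title: str
    sku: str
    upc: Optional[str] = None
    brand: Optional[str] = None
    category_path: list[str] = field(default_factory=list)  # e.g. ["Home", "Kitchen"]
    # Pricing data
    regular_price: Optional[float] = None
    sale_price: Optional[float] = None
    currency: str = "USD"
    # Availability
    in_stock: bool = False
    stock_qty: Optional[int] = None
    # Seller information / promotions
    seller_name: Optional[str] = None
    has_coupon: bool = False
    # Metadata
    url: str = ""
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Keeping the timestamp timezone-aware makes freshness scoring unambiguous when workers run in different regions.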
Infrastructure for 200K+ SKUs Daily
Scraping at this scale requires distributed infrastructure. Here's the architecture we use:
Distributed Crawling
We run Scrapy spiders across multiple worker nodes using Scrapy-Redis for job distribution. Each worker pulls URLs from a shared queue, processes them, and pushes results to a central pipeline. This gives us horizontal scalability: adding more workers increases throughput roughly linearly, up to the point where the target site's rate limits become the bottleneck.
- Worker count: Typically 8–16 workers for a 200K SKU catalog
- Crawl time: Full catalog refresh in 2–4 hours depending on target site complexity
- Scheduling: Daily full crawls with hourly spot-checks on high-priority SKUs
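Wiring Scrapy to a shared Redis queue comes down to a few project settings; a minimal sketch (the `REDIS_URL` value is a placeholder for your own instance):

```python
# settings.py — minimal Scrapy-Redis wiring (sketch; tune for your project)

# Use the Redis-backed scheduler so all workers share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across workers via a shared Redis fingerprint set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so an interrupted crawl resumes where it left off.
SCHEDULER_PERSIST = True

# Placeholder — point this at your Redis instance.
REDIS_URL = "redis://localhost:6379"
```

Every worker node runs the same spider process; whichever node is free pulls the next URL from the shared queue.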
Proxy Management
E-commerce sites aggressively block scrapers. Our proxy layer includes:
- Rotating residential proxies with country-specific exit nodes
- Automatic proxy health scoring — slow or blocked proxies get deprioritized
- Session-sticky proxies for sites that track IP consistency within a browse session
- Datacenter proxies for less-protected sites to optimize cost
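The health-scoring idea is simple enough to sketch in a few lines. This is a toy version: the EMA weight and the block penalty are illustrative assumptions, not tuned values.

```python
from collections import defaultdict

class ProxyScorer:
    """Toy sketch of proxy health scoring: slow or frequently blocked
    proxies sink in the ranking and get picked last."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"latency": 1.0, "blocks": 0, "requests": 0})

    def record(self, proxy: str, latency_s: float, blocked: bool) -> None:
        s = self.stats[proxy]
        s["requests"] += 1
        # Exponential moving average keeps the score responsive to recent health.
        s["latency"] = 0.8 * s["latency"] + 0.2 * latency_s
        if blocked:
            s["blocks"] += 1

    def score(self, proxy: str) -> float:
        s = self.stats[proxy]
        block_rate = s["blocks"] / max(s["requests"], 1)
        # Lower is better: observed latency plus a heavy penalty for blocks.
        return s["latency"] + 10.0 * block_rate

    def best(self, proxies: list[str]) -> str:
        return min(proxies, key=self.score)
```

In production the same signal would also feed retirement: a proxy whose score stays bad over a window gets dropped from the pool entirely.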
Browser Rendering
Many modern e-commerce sites load pricing via JavaScript (React, Next.js, Nuxt). For these, we use headless Chromium through Playwright with:
- Selective rendering — only enabling JS for pages that need it
- Request interception to block images, fonts, and analytics scripts (3–5x faster)
- Stealth plugins to avoid headless browser detection
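The request-interception filter reduces to a small predicate over resource type and URL. The block list below is an assumption (analytics hosts vary by site); the Playwright wiring is shown as a comment since it needs a live browser.

```python
# Resource types we skip when rendering price pages — images, media, and
# fonts are never needed to extract a price.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}
# Hypothetical analytics hosts to drop; extend per target site.
BLOCKED_URL_HINTS = ("google-analytics.com", "doubleclick.net", "facebook.net")

def should_block(resource_type: str, url: str) -> bool:
    """Decide whether a request is unnecessary for extracting price data."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    return any(hint in url for hint in BLOCKED_URL_HINTS)

# Wiring this into Playwright's request interception looks roughly like:
#
#   page.route("**/*", lambda route: route.abort()
#              if should_block(route.request.resource_type, route.request.url)
#              else route.continue_())
```

Dropping these requests is where most of the 3-5x rendering speedup comes from.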
Data Quality Pipeline
Raw scraped data is messy. Our quality pipeline applies several layers of validation:
- Schema validation: Every record must have required fields (title, price, URL, timestamp)
- Price sanity checks: Flagging prices that deviate more than 50% from the 7-day moving average
- Currency normalization: Converting all prices to a base currency with daily exchange rates
- Duplicate detection: Matching products across sellers using UPC, title similarity, and image hashing
- Freshness scoring: Marking stale data when a product page returns a 404 or redirect
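The price sanity check from the list above fits in one function. The 50% threshold matches the rule stated; the function name and return convention are ours.

```python
def flag_price_anomaly(new_price: float, history: list[float],
                       threshold: float = 0.5) -> bool:
    """Flag a scraped price that deviates more than `threshold` (50% by
    default) from the moving average of recent observations. `history`
    holds up to 7 days of prior prices. Returns True when the new price
    should be quarantined for review rather than published."""
    if not history:
        return False  # no baseline yet — accept and start building history
    avg = sum(history) / len(history)
    return abs(new_price - avg) / avg > threshold
```

A flagged record usually means a parse error (a per-unit price captured as the total, a bundle price, a currency mix-up) rather than a genuine price move, which is why quarantine beats silent acceptance.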
Handling Dynamic Pricing and A/B Tests
Many retailers now use dynamic pricing — prices change based on time of day, user location, browsing history, or demand signals. To capture accurate competitor prices:
- We scrape from multiple geographic locations using region-specific proxies
- We run clean browser profiles without cookies to avoid personalized pricing
- We capture prices at consistent times to enable accurate day-over-day comparison
- We flag suspected A/B test variants when we see different prices from different sessions
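The A/B-test flag in the last bullet can be expressed as a check over per-session observations. The session-keyed input shape is an assumption; real detection would also rule out regional and time-based price differences first.

```python
def detect_ab_variant(session_prices: dict[str, float]) -> bool:
    """Flag a suspected A/B price test: multiple clean, same-region
    sessions scraped the same product URL around the same time but
    saw different prices."""
    distinct = {round(p, 2) for p in session_prices.values()}
    return len(distinct) > 1
```

When the flag fires, we typically record all observed variants rather than picking one, so downstream consumers see the full price range being tested.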
Common Use Cases
Dynamic Repricing
Feed competitor prices into your repricing engine to automatically adjust your own prices within predefined rules. Example: "Match the lowest competitor price minus 2%, but never go below our floor margin of 15%."
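The quoted rule translates directly into code. This is a sketch under our own parameter names (`undercut`, `floor_margin`); a real repricing engine layers many such rules.

```python
def reprice(competitor_prices: list[float], unit_cost: float,
            undercut: float = 0.02, floor_margin: float = 0.15) -> float:
    """Match the lowest competitor price minus `undercut` (2%), but never
    drop below the price that preserves `floor_margin` (15%) over cost."""
    # Margin is (price - cost) / price, so the floor price is cost / (1 - margin).
    floor = unit_cost / (1 - floor_margin)
    target = min(competitor_prices) * (1 - undercut)
    return round(max(target, floor), 2)
```

With a unit cost of 8.50, competitors at 12.99 and 11.49 yield 11.26 (2% under the lowest), while a competitor at 9.99 would push the target below the 10.00 floor, so the floor wins.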
MAP Compliance Monitoring
Track whether resellers are respecting your minimum advertised price. Automated alerts flag violations within hours, giving your brand team the data they need to enforce agreements.
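The violation check itself is a simple filter over scraped seller listings. The `(seller, price)` pair shape is a simplified stand-in for the seller records described earlier; the optional tolerance absorbs rounding noise.

```python
def find_map_violations(listings: list[tuple[str, float]],
                        map_price: float,
                        tolerance: float = 0.0) -> list[str]:
    """Return sellers advertising below the minimum advertised price.
    `tolerance` (e.g. 0.01 for 1%) ignores trivial rounding differences."""
    return [seller for seller, price in listings
            if price < map_price * (1 - tolerance)]
```

The harder part in practice is matching listings to the right product (via UPC and title similarity, as in the dedup step above), not the comparison itself.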
Assortment Gap Analysis
Compare your product catalog against competitors to identify gaps. If a competitor carries 500 products in a category and you carry 300, the delta represents potential revenue you're leaving on the table.
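Once products are matched on a shared key, the gap is a set difference. Matching on UPC alone is a simplifying assumption here; real matching also uses title similarity and image hashing, as noted in the dedup step.

```python
def assortment_gap(our_upcs: set[str], competitor_upcs: set[str]) -> set[str]:
    """Products the competitor carries that we don't, matched by UPC."""
    return competitor_upcs - our_upcs
```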
Market Entry Research
Before launching in a new market, scrape local competitors to understand price points, popular products, and typical margin structures. This data informs your go-to-market pricing strategy.
Delivery Formats
We deliver price monitoring data in the format that fits your workflow:
- CSV/Excel: For ad-hoc analysis and small teams
- API endpoint: For integration with repricing engines and BI dashboards
- Database sync: Direct writes to your PostgreSQL, BigQuery, or Snowflake instance
- S3/GCS bucket: Daily data drops for data engineering teams
- Custom dashboard: Interactive UI with historical charts, alerts, and competitor comparison views
Ready to stop guessing and start tracking? Request a free sample of competitor price data for your market.