Using SurfSky with Scrapy

SurfSky provides two Scrapy extensions to help you integrate cloud browsers into your scraping projects:

  1. scrapy-cloud-browser - Uses SurfSky's custom browser automation
  2. scrapy-playwright-cloud-browser - Uses Playwright for browser automation with SurfSky

Both extensions route your requests through SurfSky's cloud-hosted anti-detection browsers, helping you avoid blocks when scraping challenging websites.

Installation

pip install scrapy-cloud-browser
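
For the Playwright-based variant, install its own package instead (assuming it is distributed the same way, e.g. via PyPI):

pip install scrapy-playwright-cloud-browser

The rest of this page configures scrapy-cloud-browser.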

Configuration

Add the following to your Scrapy project's settings.py:

# Required: Use asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Route HTTP and HTTPS requests through the cloud browser download handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_cloud_browser.CloudBrowserHandler",
    "https": "scrapy_cloud_browser.CloudBrowserHandler",
}

# Configure SurfSky connection
CLOUD_BROWSER = {
    "API_HOST": "https://app.surfsky.io/api",  # SurfSky API endpoint
    "API_TOKEN": "your_api_token_here",  # Your SurfSky API token
    "NUM_BROWSERS": 2,  # Number of parallel browser instances
    "PROXIES": [  # List of proxies to use
        "http://user:[email protected]:8080",
        "http://user:[email protected]:8080",
    ],
    "PROXY_ORDERING": "random",  # How to assign proxies: "random" or "round-robin"
    "PAGES_PER_BROWSER": 100,  # Pages to process before recycling browser
    "START_SEMAPHORES": 10,  # Limit concurrent browser startups

    # Optional: configure browser settings
    "BROWSER_SETTINGS": {
        "inactive_kill_timeout": 35,
    },

    # Optional: configure fingerprint
    "FINGERPRINT": {
        "os": "win",
    },
}
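
Hard-coding the token is fine for a quick test, but you will usually want to keep it out of version control. A minimal sketch that reads it from an environment variable instead (the name SURFSKY_API_TOKEN is our example, not something the extension requires):

import os

CLOUD_BROWSER = {
    "API_HOST": "https://app.surfsky.io/api",
    "API_TOKEN": os.environ["SURFSKY_API_TOKEN"],  # fails fast if the variable is unset
    # ...remaining options as shown above
}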

Example Spider

import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Process the response as usual
        yield {
            "title": response.css("title::text").get(),
            # Decode header names/values to str so the item exports cleanly
            "headers": dict(response.headers.to_unicode_dict()),
        }
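
Note that the spider contains no SurfSky-specific code: the download handler fetches each request through a cloud browser before the response reaches parse(). Run it like any other Scrapy spider, for example exporting the items to JSON:

scrapy crawl basic_spider -o items.json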

For detailed browser settings and fingerprint configuration options, please refer to the API Reference.