Using SurfSky with Scrapy
SurfSky provides two Scrapy extensions to help you integrate cloud browsers into your scraping projects:
- scrapy-cloud-browser - Uses SurfSky's custom browser automation
- scrapy-playwright-cloud-browser - Uses Playwright for browser automation with SurfSky
Both extensions let you drive SurfSky's cloud-hosted anti-detection browsers from your spiders, helping you scrape challenging websites without being blocked.
scrapy-cloud-browser
Installation
```
pip install scrapy-cloud-browser
```
Configuration
Add the following to your Scrapy project's settings.py:
```python
# Required: use the asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Route HTTP and HTTPS downloads through the cloud browser handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_cloud_browser.CloudBrowserHandler",
    "https": "scrapy_cloud_browser.CloudBrowserHandler",
}

# Configure the SurfSky connection
CLOUD_BROWSER = {
    "API_HOST": "https://app.surfsky.io/api",  # SurfSky API endpoint
    "API_TOKEN": "your_api_token_here",        # Your SurfSky API token
    "NUM_BROWSERS": 2,                         # Number of parallel browser instances
    "PROXIES": [                               # List of proxies to use
        "http://user:[email protected]:8080",
        "http://user:[email protected]:8080",
    ],
    "PROXY_ORDERING": "random",    # How to assign proxies: "random" or "round-robin"
    "PAGES_PER_BROWSER": 100,      # Pages to process before recycling a browser
    "START_SEMAPHORES": 10,        # Limit on concurrent browser startups
    # Optional: browser settings
    "BROWSER_SETTINGS": {
        "inactive_kill_timeout": 35,
    },
    # Optional: fingerprint configuration
    "FINGERPRINT": {
        "os": "win",
    },
}
```
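A real token is better kept out of version control. As a minimal sketch using only the standard library (SURFSKY_API_TOKEN is an assumed variable name, not something the extension requires), you can read it from the environment instead:

```python
import os

CLOUD_BROWSER = {
    "API_HOST": "https://app.surfsky.io/api",
    # SURFSKY_API_TOKEN is a hypothetical environment variable name;
    # indexing os.environ raises KeyError if it is unset, failing fast at startup
    "API_TOKEN": os.environ["SURFSKY_API_TOKEN"],
    # ... remaining keys as shown above
}
```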
Example Spider
```python
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Process the response as usual
        yield {
            "title": response.css("title::text").get(),
            "headers": dict(response.headers),
        }
```
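You can run the spider with scrapy crawl basic_spider as usual. Alternatively, here is a minimal sketch of running it programmatically with Scrapy's CrawlerProcess, which picks up settings.py (including CLOUD_BROWSER) via get_project_settings():

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# BasicSpider is the class defined above; in a standalone script,
# import it from your project's spiders module.
process = CrawlerProcess(get_project_settings())
process.crawl(BasicSpider)
process.start()  # blocks until the crawl finishes
```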
scrapy-playwright-cloud-browser
Installation
```
pip install scrapy-playwright-cloud-browser
```
Configuration
Add the following to your Scrapy project's settings.py:
```python
# Required: use the asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Register the download handlers and the extension
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright_cloud_browser.CloudBrowserHandler",
    "https": "scrapy_playwright_cloud_browser.CloudBrowserHandler",
}

EXTENSIONS = {
    "scrapy_playwright_cloud_browser.CloudBrowserExtension": 100,
}

# Configure the SurfSky connection (same options as scrapy-cloud-browser)
CLOUD_BROWSER = {
    "API_HOST": "https://app.surfsky.io/api",  # SurfSky API endpoint
    "API_TOKEN": "your_api_token_here",        # Your SurfSky API token
    "NUM_BROWSERS": 2,                         # Number of parallel browser instances
    "PROXIES": [                               # List of proxies to use
        "http://user:[email protected]:8080",
        "http://user:[email protected]:8080",
    ],
    "PROXY_ORDERING": "random",    # How to assign proxies: "random" or "round-robin"
    "PAGES_PER_BROWSER": 100,      # Pages to process before recycling a browser
    "START_SEMAPHORES": 10,        # Limit on concurrent browser startups
    # Optional: browser settings
    "BROWSER_SETTINGS": {
        "inactive_kill_timeout": 35,
    },
    # Optional: fingerprint configuration
    "FINGERPRINT": {
        "os": "win",
    },
}

# Playwright-specific settings
PLAYWRIGHT_BROWSER_TYPE = "chromium"
```
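Upstream scrapy-playwright also exposes settings of its own, such as a default navigation timeout; assuming the cloud-browser handler honors them the same way, you might add, for example:

```python
# Assumption: honored as in upstream scrapy-playwright.
# Abort navigations that take longer than 30 seconds (value in milliseconds).
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000
```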
Example Spider with Playwright Features
```python
import scrapy
from scrapy_playwright.page import PageMethod


class PlaywrightSpider(scrapy.Spider):
    name = "playwright_spider"

    def start_requests(self):
        # Use Playwright-specific features
        yield scrapy.Request(
            url="https://example.com",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "h1"),
                    # You can add more Playwright methods here
                ],
            },
        )

    def parse(self, response):
        # Process the response as usual
        yield {
            "title": response.css("title::text").get(),
            "h1": response.css("h1::text").get(),
        }
```
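If you need the Playwright page object itself (for screenshots, scrolling, and so on), scrapy-playwright supports the playwright_include_page meta key; assuming the cloud-browser handler behaves the same way, here is a minimal sketch (PageAccessSpider is a hypothetical name):

```python
import scrapy


class PageAccessSpider(scrapy.Spider):
    name = "page_access_spider"  # hypothetical spider for illustration

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,  # expose the page via response.meta
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="example.png")
        await page.close()  # close pages you requested to free the browser
        yield {"title": response.css("title::text").get()}
```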
For detailed browser settings and fingerprint configuration options, please refer to the API Reference.