Speed Optimization
Surfsky provides powerful capabilities for running multiple browser instances in parallel to optimize your automation workflows. Thanks to our advanced Chromium core-level optimizations and enterprise-grade Kubernetes infrastructure, you can scale up to 1,000 concurrent browser instances (depending on your subscription plan) while maintaining stability and performance.
Each browser instance operates in complete isolation, with its own unique fingerprint and characteristics. This means that anti-bot systems can only correlate browser instances based on your automation patterns and behaviors, not through any inherent browser signatures or fingerprints. This architectural design provides several key advantages:
- True Browser Isolation: Each instance maintains its own independent state and fingerprint
- Resource Optimization: Chromium-level optimizations ensure efficient RAM and CPU usage
- High Reliability: Kubernetes orchestration ensures stable operation at scale
- Flexible Scaling: Easily scale from a few browsers to hundreds based on your needs
Here's how to leverage these features effectively.
Basic Parallelization Example
If you're interested in specific implementation examples or need code for particular use cases, please contact our team. We're happy to provide additional examples and guidance tailored to your needs.
Here's a simplified example of how to implement parallel processing:
```python
import asyncio
from dataclasses import dataclass

from core.executor import TaskExecutor
from core.config import ExecutorConfig
from core.pipeline.types import BaseTask
from core.pipeline.result import Result


@dataclass
class ScrapedData:
    title: str
    status: int


class WebScraper(BaseTask):
    async def main(self, browser, url) -> Result[ScrapedData]:
        try:
            async with browser.managed_page() as page:
                response = await page.goto(url)
                title = await page.title()
                return Result.success(ScrapedData(
                    title=title,
                    status=response.status,
                ))
        except Exception as e:
            return Result.failure(f"Error: {e}")


async def main():
    # Configure parallel execution
    config = ExecutorConfig(
        browser_count=10,           # Number of parallel browsers
        max_browser_tasks=5,        # Tasks per browser before recycling
        max_task_attempts=3,        # Retry attempts per task
        fingerprint={"os": "mac"},  # Browser fingerprint
    )

    # Initialize scraper and executor
    scraper = WebScraper()
    executor = TaskExecutor(config)

    # Your URLs to process
    urls = [
        "https://example1.com",
        "https://example2.com",
        # ... more URLs
    ]

    # Execute tasks in parallel
    results, metrics = await executor.execute(urls, scraper)


asyncio.run(main())
```
Optimization Strategies
1. Browser Pool Management
```python
# Configure optimal browser pool size
config = ExecutorConfig(
    browser_count=10,        # Adjust based on system resources
    max_browser_tasks=5,     # Balance between reuse and freshness
    max_browser_attempts=3,  # Retry limit for browser issues
    task_timeout=30,         # Timeout for individual tasks
    attempt_delay=2,         # Delay between retry attempts
)
```
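Choosing `browser_count` is a trade-off between throughput, your plan's concurrent-browser limit, and how quickly the batch must finish. As a rough sizing heuristic (not part of the Surfsky API; the function name and parameters here are illustrative), you can estimate the pool size from the batch size, average task duration, and your deadline:

```python
import math

# Illustrative sizing helper (not part of the Surfsky SDK): estimate how many
# parallel browsers are needed to finish a batch within a deadline, capped by
# your plan's concurrent-browser limit.
def needed_browser_count(total_tasks: int,
                         avg_task_seconds: float,
                         deadline_seconds: float,
                         plan_limit: int = 100) -> int:
    required = math.ceil(total_tasks * avg_task_seconds / deadline_seconds)
    return max(1, min(required, plan_limit))

# e.g. 1,000 tasks at ~6 s each, to be done within 10 minutes -> 10 browsers
```

Start from an estimate like this, then adjust based on observed success rates and resource usage.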
2. Proxy Rotation
```python
from itertools import cycle, islice

def get_proxies(count: int) -> list[str]:
    countries = ["US", "UK", "DE", "FR"]  # Target countries
    # Cycle through the target countries until the pool reaches the requested size
    return [
        f"socks5://user:[email protected]:1080?country={country}"
        for country in islice(cycle(countries), count)
    ]

config = ExecutorConfig(
    browser_count=10,
    proxies=get_proxies(20),  # Maintain a proxy pool larger than the browser count
)
```
3. Error Handling and Retries
```python
import asyncio

class ResilientScraper(BaseTask):
    async def main(self, browser, url) -> Result:
        try:
            async with browser.managed_page() as page:
                # Add custom retry logic
                for attempt in range(3):
                    try:
                        await page.goto(url, timeout=10000)
                        return Result.success(await self.extract_data(page))
                    except Exception:
                        if attempt == 2:  # Last attempt
                            raise
                        await asyncio.sleep(2)  # Delay between attempts
        except Exception as e:
            return Result.failure(str(e))
```
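The same retry pattern can be factored into a reusable helper. This is a generic sketch, not part of the Surfsky SDK: it retries any async operation with exponential backoff plus a small jitter, and re-raises the last error once attempts are exhausted.

```python
import asyncio
import random

# Generic retry helper (illustrative, not a Surfsky API): run an async
# operation, backing off exponentially between failed attempts.
async def with_retries(operation, attempts: int = 3,
                       base_delay: float = 2.0, max_delay: float = 30.0):
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Jitter avoids many browsers retrying in lock-step
            await asyncio.sleep(delay + random.uniform(0, base_delay))
```

A task's `main` can then wrap `page.goto` (or any flaky step) in `with_retries` instead of hand-rolling the loop each time.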
Best Practices
Resource Management
- Monitor the rate-limiting headers returned with each response:
```
x-ratelimit-limit: 200            # Maximum requests per minute
x-ratelimit-limit-hour: 3000      # Maximum requests per hour
x-ratelimit-remaining: 198        # Remaining requests this minute
x-ratelimit-remaining-hour: 2998  # Remaining requests this hour
```
- Monitor the active browser count using the /active or /profiles endpoint
- If you exceed your plan's browser limit, you'll receive a 429 (Too Many Requests) error
- If you encounter errors, provide the tracing UUID from the response headers to support:
```
x-cloud-tracing-uuid: fea367e7cfc840818508754b5f1c1f51
```
- Use `max_browser_tasks` to recycle browsers periodically
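A client can react to these headers before hitting the 429. The sketch below (header names taken from the examples above; the function itself is illustrative, not part of the SDK) pauses for the rest of the rate window once the per-minute budget is exhausted:

```python
# Illustrative throttling decision based on the x-ratelimit-* headers above:
# returns how many seconds to wait before issuing the next request.
def throttle_delay(headers: dict[str, str], window_seconds: int = 60) -> float:
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0          # Budget left: proceed immediately
    return float(window_seconds)  # Budget exhausted: wait out the minute window
```

Calling this on each response and sleeping for the returned duration keeps the client under the per-minute limit proactively.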
Browser Lifecycle Management
Browsers can be closed in several ways:
Automatic Closure
- Browsers are automatically closed after being inactive for `inactive_kill_timeout` seconds
- This helps prevent resource wastage from forgotten sessions
Automation Framework Methods
- Using `browser.close()` in Playwright/Puppeteer
- These methods internally send CDP (Chrome DevTools Protocol) commands to close the WebSocket connection
API Endpoints
- Browser sessions can also be terminated through Surfsky's API
Always properly close your browser sessions when done to maintain optimal resource usage and stay within your plan limits.
Proxy Strategy
- Maintain a larger proxy pool than browser count
- Rotate proxies based on geographic needs
- Monitor proxy health and performance
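Monitoring proxy health can be as simple as counting failures per proxy and excluding proxies that cross a threshold. The class below is an illustrative sketch (not a Surfsky feature); the names and threshold are assumptions:

```python
from collections import defaultdict

# Minimal proxy-health tracker: record failures per proxy and hand out
# only proxies that haven't crossed the failure threshold.
class ProxyPool:
    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.proxies = proxies
        self.max_failures = max_failures
        self.failures = defaultdict(int)

    def healthy(self) -> list[str]:
        return [p for p in self.proxies
                if self.failures[p] < self.max_failures]

    def report_failure(self, proxy: str) -> None:
        self.failures[proxy] += 1
```

Feeding only `pool.healthy()` into `ExecutorConfig(proxies=...)` on each batch keeps degraded proxies out of rotation.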
Error Handling
- Implement graceful retries
- Add appropriate delays between attempts
- Log and monitor failure patterns
Performance Monitoring
```python
# Track execution metrics
print(f"Success rate: {(metrics.completed / metrics.total) * 100:.1f}%")
print(f"Average processing time: {metrics.avg_time:.2f}s")
print(f"Failed tasks: {metrics.failed}")
```
Common Pitfalls
Over-parallelization
- Exceeding your plan's concurrent browser limit will trigger rate limiting (429 error)
- Can trigger rate limiting from target sites
Insufficient Error Handling
- Not accounting for network issues
- Missing retry logic for temporary failures
Poor Resource Management
- Not recycling browsers after heavy use
- Memory leaks from unclosed resources
Remember to monitor your automation's performance and adjust these parameters based on your specific use case and target website's requirements.
Understanding Network Latency
Round Trip Time (RTT) Considerations
When moving from local development to Surfsky's cloud infrastructure, it's important to understand how network latency affects your automation:
Local vs Cloud Execution
- Local development has near-instant RTT
- Cloud execution involves network latency for each request
- Additional services (like CAPTCHA solving or proxies) add extra latency
CDP Command Overhead
- Each page interaction may trigger multiple CDP commands
- Sequential operations accumulate RTT
- What seems fast locally may be slower in production
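A back-of-the-envelope model makes the accumulation concrete (the RTT figures below are assumed for illustration): sequential CDP commands pay the round trip once per command, while commands issued concurrently pay it roughly once overall.

```python
# Illustrative cost model for CDP command latency.
def sequential_cost_ms(commands: int, rtt_ms: float) -> float:
    # Each command waits for the previous one: RTTs add up
    return commands * rtt_ms

def parallel_cost_ms(commands: int, rtt_ms: float) -> float:
    # All commands in flight at once share roughly one round trip
    return rtt_ms

# e.g. 50 commands at 40 ms RTT: 2000 ms sequential vs ~40 ms parallel
```

This is why a script that feels instant against a local browser (sub-millisecond RTT) can slow down noticeably against a remote one.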
Optimization Strategies
Implement Parallel Processing
Instead of sequential operations, batch independent requests together:
```python
# Slower: Sequential processing
for element in elements:
    await get_element_text(element)

# Faster: Parallel processing
await asyncio.gather(
    *(get_element_text(element) for element in elements)
)
```
Choose the Right Framework
- WebSocket-based frameworks (Playwright, Puppeteer) offer better performance
- HTTP-based frameworks (Selenium) may have higher latency
- Consider low-level frameworks for maximum performance
Framework-Specific Optimizations
Example for Playwright:
```python
# Slower: Simulates keystrokes
await page.keyboard.type("text")

# Faster: Direct value setting
await page.locator("input").fill("text")
```
Geographic Optimization
- Use browsers in regions close to your infrastructure
- Regional proximity can improve performance by 8-9x
- Consider multi-region deployment for global operations
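The arithmetic behind the proximity gains is straightforward (the RTT values below are assumed, illustrative numbers): a form fill that issues a few dozen sequential CDP commands multiplies every millisecond of RTT.

```python
# Illustrative only: a form fill issuing 30 sequential CDP commands,
# with assumed same-region vs cross-region round-trip times.
def total_ms(commands: int, rtt_ms: float) -> float:
    return commands * rtt_ms

near = total_ms(30, 5)    # same-region RTT ~5 ms   -> 150 ms total
far = total_ms(30, 45)    # cross-region RTT ~45 ms -> 1350 ms total
speedup = far / near      # ~9x from proximity alone
```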
When designing your automation workflow, always consider the impact of network latency and implement parallel processing where possible to achieve optimal performance.
Running your automation close to Surfsky's infrastructure can lead to dramatic performance improvements, especially for operations requiring multiple CDP commands like form filling and submissions.