How to scrape Perplexity with minimal infrastructure
Perplexity AI serves over 10 million searches daily. The platform combines AI reasoning with real-time web search, delivering structured results (citations, shopping cards, travel listings) that direct API access completely misses.
The challenge: Perplexity wasn’t designed for programmatic access. The platform uses Server-Sent Events streaming and sophisticated query intent detection that traditional scraping tools can’t handle.
After analyzing 5+ million Perplexity responses, we’ve reverse-engineered the complete process for extracting its data. This guide shows you exactly how to scrape Perplexity and capture the rich structured data that makes it such a powerful AI-powered search engine.
Table of contents
- Why scrape Perplexity responses?
- Understanding Perplexity’s architecture
- The Server-Sent Events parsing challenge
- Building the scraping infrastructure
- Query intent detection and data extraction
- Using cloro’s managed Perplexity scraper
Why scrape Perplexity responses?
Perplexity’s API responses are nothing like what users see in the UI.
What you miss with the API:
- The actual search experience users get
- Real-time web sources with citations
- Query intent detection and structured data
- Shopping cards and travel recommendations
Why it matters: because the API diverges so far from the UI, you can’t verify the information users actually see or influence your SEO visibility without scraping the real interface.
The math: Scraping costs up to 10x less than direct AI usage while providing the real search experience.
Use cases:
- Verification: Check what Perplexity actually tells users
- SEO: Monitor how Perplexity sources and cites information
- E-commerce: Track product recommendations and pricing
- Travel: Monitor hotel listings and travel data
Perplexity is a leader in the new wave of AI search engines.
Understanding Perplexity’s architecture
Perplexity combines multiple sophisticated systems to deliver its AI-powered search results:
Perplexity’s response generation process:
- Query Analysis: Classifies search intent (shopping, travel, media, general)
- Search Integration: Performs real-time web searches across multiple sources
- AI Synthesis: Uses LLMs to synthesize information with citations
- Structured Extraction: Automatically extracts rich data objects based on intent
- Streaming Response: Delivers results via Server-Sent Events (SSE)
Key technical challenges:
Multi-modal response structure:
// Perplexity combines text, sources, and rich data objects
{
answer: "AI-generated response with citations [1][2]",
sources: ["https://example.com/source1", "https://example.com/source2"],
shoppingCards: [...], // When shopping intent detected
videos: [...], // When media intent detected
hotels: [...] // When travel intent detected
}
Server-Sent Events format:
event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Hello"}}]}
event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Hello world"}}]}
event: message
data: {"final_sse_message": true, "blocks": [...], "web_results": [...]}
Query intent detection:
- Shopping queries → Product cards with pricing
- Travel queries → Hotel listings and places
- Media queries → Videos and images
- General queries → Text with citations
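If you only need the detected intent, you can read it straight off the final SSE message. A minimal sketch, assuming the answer_modes field and answer_mode_type values that the extraction code later in this guide relies on (the exact set of mode values Perplexity emits may change):
from typing import Optional, Set

def detect_answer_modes(final_message_data: Optional[dict]) -> Set[str]:
    """Return the answer mode types (e.g. SHOPPING) reported in the final SSE message."""
    if not final_message_data:
        return set()
    modes = set()
    for mode in final_message_data.get("answer_modes", []):
        if isinstance(mode, dict) and mode.get("answer_mode_type"):
            modes.add(mode["answer_mode_type"])
    return modes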
Anti-bot detection:
- Request pattern analysis
- Browser fingerprinting
- Rate limiting with exponential backoff
- Dynamic content loading challenges
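Rate limiting in particular is worth handling on your side rather than hammering the site. A minimal retry sketch with exponential backoff and jitter; this is generic asyncio code, not tied to any Perplexity-specific endpoint:
import asyncio
import random

async def with_backoff(operation, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async operation, doubling the wait (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Example: await with_backoff(lambda: scraper.scrape_perplexity(query))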
The Server-Sent Events parsing challenge
The core of Perplexity scraping lies in parsing their SSE stream and extracting structured data blocks:
SSE event structure:
# Raw Perplexity SSE example
event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Recent"}}]}
event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "developments"}}]}
event: message
data: {"final_sse_message": true, "blocks": [...], "web_results": [...]}
Parsing challenges:
- Multi-event streaming: Content arrives in multiple SSE events
- Final message detection: Only the last event contains complete structured data
- Block-based structure: Different data types are in separate blocks
- Mixed content types: Text, sources, media, and structured objects combined
Python SSE parsing implementation:
import json
from typing import List, Dict, Any, Optional
def get_last_final_message(sse_response: str) -> Optional[dict]:
"""
Extract the last message with final=true from Perplexity SSE response.
"""
messages = sse_response.strip().split("\n\n")
for message in reversed(messages):
if not message.startswith("event: message"):
continue
# Extract the data line
lines = message.split("\n")
for line in lines:
if line.startswith("data: "):
try:
data = json.loads(line[6:]) # Remove 'data: ' prefix
# Check if this is the final message
if data.get("final_sse_message"):
return data
except json.JSONDecodeError:
continue
return None
def extract_answer_text(final_message_data: Optional[dict]) -> str:
"""
Extract the answer text from the final message data.
"""
if not final_message_data:
return ""
blocks = final_message_data.get("blocks", [])
for block in blocks:
if "markdown_block" in block:
return block["markdown_block"].get("answer", "")
return ""
Source extraction from web results:
def extract_perplexity_sources(final_message_data: Optional[dict]) -> List[Dict[str, Any]]:
"""
Extract sources from Perplexity SSE response.
"""
sources = []
if not final_message_data:
return sources
# Extract web_results from blocks
blocks = final_message_data.get("blocks", [])
for block in blocks:
# Check for web_result_block
if "web_result_block" in block:
web_results = block["web_result_block"].get("web_results", [])
for idx, result in enumerate(web_results, start=1):
sources.append({
"position": idx,
"label": result.get("name", ""),
"url": result.get("url", ""),
"description": result.get("snippet") or result.get("meta_data", {}).get("description"),
})
return sources
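Here is how the two helpers fit together on a captured stream. The sample below mirrors the raw SSE example above; a real capture contains many more incremental events:
sample_sse = (
    "event: message\n"
    'data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Recent"}}]}\n'
    "\n"
    "event: message\n"
    'data: {"final_sse_message": true, "blocks": ['
    '{"markdown_block": {"answer": "Recent developments..."}}, '
    '{"web_result_block": {"web_results": [{"name": "Example", '
    '"url": "https://example.com", "snippet": "A sample source."}]}}]}\n'
)

final_message = get_last_final_message(sample_sse)
print(extract_answer_text(final_message))         # -> "Recent developments..."
print(extract_perplexity_sources(final_message))  # -> [{"position": 1, "label": "Example", ...}]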
Building the scraping infrastructure
Let’s build the complete Perplexity scraping system:
Required components:
- Browser automation: Playwright for dynamic content rendering
- SSE interception: Network request capture and parsing
- Intent detection: Query analysis for data extraction
- Structured data parsing: Shopping cards, media, travel data extraction
Complete scraper implementation:
import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List, Optional
class PerplexityScraper:
def __init__(self):
self.captured_responses = []
async def setup_sse_interceptor(self, page: Page):
"""Set up Server-Sent Events interception."""
async def handle_response(response):
# Capture Perplexity SSE responses
if 'rest/sse/perplexity_ask' in response.url:
response_body = await response.text()
self.captured_responses.append(response_body)
page.on('response', handle_response)
async def scrape_perplexity(self, query: str, country: str = 'US') -> Dict[str, Any]:
"""Main scraping function."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
page = await context.new_page()
# Set up SSE interception
await self.setup_sse_interceptor(page)
try:
# Navigate to Perplexity
await page.goto('https://www.perplexity.ai', timeout=20_000)
# Handle any modals or popups
await self.remove_dialogs(page)
# Fill and submit query
await page.wait_for_selector('#ask-input', state="visible", timeout=10_000)
await page.fill('#ask-input', query)
await page.click('[data-testid="submit-button"]', timeout=5_000)
# Wait for SSE response
await self.wait_for_perplexity_response(page)
# Parse the captured response
if self.captured_responses:
raw_response = self.captured_responses[0]
return self.parse_perplexity_response(raw_response)
else:
raise Exception("No SSE response captured")
finally:
await browser.close()
async def remove_dialogs(self, page: Page):
"""Remove any modal dialogs or popups."""
await page.evaluate("""
// Remove all portal elements
const elements = document.querySelectorAll("[data-type='portal']");
elements.forEach(element => {
element.remove();
});
""")
async def wait_for_perplexity_response(self, page: Page, timeout: int = 60):
"""Wait for Perplexity SSE response completion."""
for _ in range(timeout * 2): # Check every 500ms
# Check if we have captured responses
if self.captured_responses:
# Verify response contains final message
final_message = get_last_final_message(self.captured_responses[0])
if final_message:
return
await asyncio.sleep(0.5)
raise Exception(f"Response timeout after {timeout} seconds")
def parse_perplexity_response(self, sse_response: str) -> Dict[str, Any]:
"""Parse the raw Perplexity SSE response into structured data."""
# Extract final message data
final_message_data = get_last_final_message(sse_response)
# Extract core content
text = extract_answer_text(final_message_data)
sources = extract_perplexity_sources(final_message_data)
result = {
'text': text,
'sources': sources,
}
# Extract shopping products if shopping intent detected
if has_shopping_intent(final_message_data):
shopping_cards = extract_perplexity_shopping_products(final_message_data)
if shopping_cards:
result['shopping_cards'] = shopping_cards
# Extract media content
media = extract_perplexity_media(final_message_data)
if media['videos']:
result['videos'] = media['videos']
if media['images']:
result['images'] = media['images']
# Extract travel data
if has_places_intent(final_message_data):
hotels_places = extract_perplexity_hotels_and_places(final_message_data)
if hotels_places['hotels']:
result['hotels'] = hotels_places['hotels']
if hotels_places['places']:
result['places'] = hotels_places['places']
# Extract related queries
related_queries = extract_related_queries(final_message_data)
if related_queries:
result['related_queries'] = related_queries
return result
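A minimal sketch of running the scraper end to end. It assumes Playwright and its Chromium build are installed (pip install playwright, then playwright install chromium); the selectors and SSE endpoint used above may change whenever Perplexity updates its UI:
async def main():
    scraper = PerplexityScraper()
    result = await scraper.scrape_perplexity(
        "What are the latest developments in quantum computing 2025?"
    )
    print(result["text"][:200])
    for source in result["sources"]:
        print(source["position"], source["url"])

if __name__ == "__main__":
    asyncio.run(main())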
Query intent detection and data extraction
Perplexity automatically detects different query types and extracts corresponding structured data:
Shopping intent detection:
def has_shopping_intent(final_message_data: Optional[dict]) -> bool:
"""
Check if the response indicates shopping intent.
"""
if not final_message_data:
return False
# Check answer modes for shopping
answer_modes = final_message_data.get("answer_modes", [])
for mode in answer_modes:
if isinstance(mode, dict) and mode.get("answer_mode_type") == "SHOPPING":
return True
# Check classifier results
classifier_results = final_message_data.get("classifier_results", {})
return classifier_results.get("shopping_intent", False)
def extract_perplexity_shopping_products(final_message_data: Optional[dict]) -> List[Dict[str, Any]]:
"""
Extract shopping products from Perplexity response.
"""
shopping_cards = []
if not final_message_data:
return shopping_cards
blocks = final_message_data.get("blocks", [])
for block in blocks:
# Extract from shopping_block
if "shopping_block" in block:
shopping_block = block["shopping_block"]
products = shopping_block.get("products", [])
for product in products:
if isinstance(product, dict):
product_info = {
"title": product.get("name"),
"url": product.get("url"),
"description": product.get("description"),
"price": product.get("price"),
"original_price": product.get("original_price"),
"rating": product.get("rating"),
"num_reviews": product.get("num_reviews"),
"image_urls": product.get("image_urls", []),
"merchant": product.get("merchant"),
"id": product.get("id"),
"variants": product.get("variants", []),
"offers": product.get("offers", [])
}
shopping_cards.append({
"products": [product_info],
"tags": shopping_block.get("tags", [])
})
return shopping_cards
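For price monitoring it is usually easier to work with a flat product list than with the nested card structure above. A small sketch that flattens shopping_cards into one row per product:
def flatten_shopping_cards(shopping_cards: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten the card/product nesting into one row per product."""
    rows = []
    for card in shopping_cards:
        for product in card.get("products", []):
            rows.append({
                "title": product.get("title"),
                "price": product.get("price"),
                "merchant": product.get("merchant"),
                "rating": product.get("rating"),
                "url": product.get("url"),
                "tags": card.get("tags", []),
            })
    return rows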
Media content extraction:
def extract_perplexity_media(final_message_data: Optional[dict]) -> dict:
"""
Extract media items (videos and images) from Perplexity response.
"""
videos = []
images = []
if not final_message_data:
return {"videos": videos, "images": images}
blocks = final_message_data.get("blocks", [])
for block in blocks:
# Extract from media_block
if "media_block" in block:
media_block = block["media_block"]
media_items = media_block.get("media_items", [])
for item in media_items:
if isinstance(item, dict):
media_item = {
"title": item.get("name"),
"url": item.get("url"),
"thumbnail": item.get("thumbnail"),
"medium": item.get("medium", "").lower(),
"source": item.get("source"),
}
# Add image dimensions
for dim_field in ["image_width", "image_height", "thumbnail_width", "thumbnail_height"]:
if dim_field in item:
try:
media_item[dim_field] = int(item[dim_field])
except (ValueError, TypeError):
pass
medium = item.get("medium", "").lower()
if medium == "video":
videos.append(media_item)
elif medium == "image":
images.append(media_item)
return {"videos": videos, "images": images}
Travel data extraction:
def extract_perplexity_hotels_and_places(final_message_data: Optional[dict]) -> dict:
"""
Extract hotels and places from Perplexity response.
"""
hotels = []
places = []
if not final_message_data:
return {"hotels": hotels, "places": places}
blocks = final_message_data.get("blocks", [])
for block in blocks:
# Extract from hotels_mode_block
if "hotels_mode_block" in block:
hotel_block = block["hotels_mode_block"]
hotel_places = hotel_block.get("places", [])
for place in hotel_places:
if isinstance(place, dict):
hotel_item = {
"name": place.get("name"),
"url": place.get("url", ""),
"rating": place.get("rating"),
"num_reviews": place.get("num_reviews"),
"address": place.get("address", []) if isinstance(place.get("address"), list) else [place.get("address", "")],
"phone": place.get("phone"),
"description": place.get("description"),
"image_url": place.get("image_url"),
"images": place.get("images", []),
"lat": place.get("lat"),
"lng": place.get("lng"),
"price_level": place.get("price_level"),
"categories": place.get("categories", [])
}
hotels.append(hotel_item)
# Extract from maps_mode_block
elif "maps_mode_block" in block:
maps_block = block["maps_mode_block"]
map_places = maps_block.get("places", [])
for place in map_places:
if isinstance(place, dict):
place_item = {
"name": place.get("name"),
"url": place.get("url", ""),
"address": place.get("address", []) if isinstance(place.get("address"), list) else [place.get("address", "")],
"rating": place.get("rating"),
"lat": place.get("lat"),
"lng": place.get("lng"),
"categories": place.get("categories", []),
"map_url": place.get("map_url"),
"images": place.get("images", [])
}
places.append(place_item)
return {"hotels": hotels, "places": places}
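For travel monitoring you will typically want this data in tabular form. A minimal sketch that writes the extracted hotels to CSV; the column selection is just an example:
import csv

def write_hotels_csv(hotels: List[Dict[str, Any]], path: str = "hotels.csv") -> None:
    """Write one row per hotel with a few commonly used fields."""
    fields = ["name", "rating", "num_reviews", "price_level", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for hotel in hotels:
            writer.writerow({field: hotel.get(field) for field in fields})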
Related queries extraction:
def extract_related_queries(final_message_data: Optional[dict]) -> List[str]:
"""
Extract related queries from Perplexity response.
"""
if not final_message_data:
return []
# Extract from related_queries field (preferred source)
queries = final_message_data.get("related_queries", [])
if isinstance(queries, list):
related = [q.strip() for q in queries if isinstance(q, str) and q.strip()]
if related:
return related
# Check related_query_items for text fields
query_items = final_message_data.get("related_query_items", [])
if isinstance(query_items, list):
related = []
for item in query_items:
if isinstance(item, dict):
text = item.get("text")
if isinstance(text, str) and text.strip() and text.strip() not in related:
related.append(text.strip())
if related:
return related
return []
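Related queries are also a convenient way to expand coverage: each answer suggests follow-up prompts you can scrape next. A sketch of a simple breadth-first expansion loop, assuming the PerplexityScraper class defined earlier (captured_responses is reset between queries because the class reads the first captured response):
async def expand_queries(seed_query: str, max_queries: int = 10) -> Dict[str, Dict[str, Any]]:
    """Scrape a seed query, then follow related queries until the budget is spent."""
    scraper = PerplexityScraper()
    queue = [seed_query]
    results: Dict[str, Dict[str, Any]] = {}
    while queue and len(results) < max_queries:
        query = queue.pop(0)
        if query in results:
            continue
        scraper.captured_responses = []  # reset so the next scrape reads its own SSE stream
        results[query] = await scraper.scrape_perplexity(query)
        queue.extend(results[query].get("related_queries", []))
    return results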
Using cloro’s managed Perplexity scraper
Building and maintaining a reliable Perplexity scraper requires significant infrastructure and ongoing maintenance. That’s why we built cloro: a managed API that handles all of that complexity for you.
Simple API integration:
import requests
import json
# Your search query
query = "What are the latest developments in quantum computing 2025?"
# API request to cloro
response = requests.post(
'https://api.cloro.dev/v1/monitor/perplexity',
headers={'Authorization': 'Bearer YOUR_API_KEY'},
json={
'prompt': query,
'country': 'US',
'include': {
'markdown': True,
'html': True
}
}
)
result = response.json()
print(json.dumps(result, indent=2))
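In production you will want some error handling around this call. A minimal sketch that retries on HTTP 429 and surfaces other errors; the retry behavior is our own assumption here, so check cloro's documentation for its exact rate-limit and error semantics:
import time

def scrape_with_cloro(prompt: str, api_key: str, max_attempts: int = 3) -> dict:
    """POST the prompt to cloro's Perplexity endpoint, retrying on rate limits."""
    for attempt in range(max_attempts):
        response = requests.post(
            'https://api.cloro.dev/v1/monitor/perplexity',
            headers={'Authorization': f'Bearer {api_key}'},
            json={'prompt': prompt, 'country': 'US'},
            timeout=120,
        )
        if response.status_code == 429 and attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # back off before retrying
            continue
        response.raise_for_status()
        return response.json()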
What cloro handles automatically:
- Query intent detection: Automatic classification of shopping, travel, media, and general queries
- SSE parsing: Complete Server-Sent Events handling and data extraction
- Anti-bot evasion: Advanced techniques to avoid detection and blocking
- Rate limiting: Intelligent request scheduling and backoff strategies
- Structured data extraction: Automatic parsing of shopping cards, media, and travel data
- Error handling: Comprehensive retry logic and error recovery
- Scalability: Distributed infrastructure for high-volume requests
Rich structured output you get:
{
"status": "success",
"result": {
"text": "Recent developments in quantum computing include breakthrough error correction methods...",
"sources": [
{
"position": 1,
"url": "https://example.com/quantum-breakthrough",
"label": "MIT Technology Review",
"description": "Scientists achieve 99.9% qubit fidelity in room temperature conditions..."
}
],
"shopping_cards": [
{
"products": [
{
"title": "Quantum Computing Book",
"url": "https://example.com/product",
"price": "$89.99",
"rating": 4.8,
"num_reviews": 1250,
"image_urls": ["https://example.com/image.jpg"],
"merchant": "TechBooks",
"offers": [...]
}
],
"tags": ["education", "quantum"]
}
],
"videos": [
{
"title": "Quantum Computing Explained",
"url": "https://youtube.com/watch?v=example",
"thumbnail": "https://example.com/thumb.jpg",
"medium": "video",
"source": "youtube"
}
],
"hotels": [
{
"name": "Quantum Research Hotel",
"url": "https://example.com/hotel",
"rating": 4.5,
"address": ["123 Tech Street", "Innovation City"],
"price_level": "$$$",
"categories": ["Hotel", "Business"]
}
],
"related_queries": [
"What companies are leading quantum computing?",
"How does quantum error correction work?"
]
}
}
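Because every response follows the same shape, downstream handling stays simple. A small sketch that pulls the citation URLs and any product prices out of a payload like the one above:
def summarize_result(payload: dict) -> None:
    """Print citation URLs and any product prices from a cloro Perplexity response."""
    result = payload.get("result", {})
    for source in result.get("sources", []):
        print(f'[{source["position"]}] {source["url"]}')
    for card in result.get("shopping_cards", []):
        for product in card.get("products", []):
            print(product.get("title"), product.get("price"))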
Benefits of using cloro:
- 99.9% uptime vs. DIY solutions that frequently break
- P50 latency < 30s vs. manual scraping that takes hours
- Automatic query intent detection without implementing complex classifiers
- Rich structured data extraction for shopping, travel, media, and general queries
- No infrastructure costs - we handle browsers, proxies, and maintenance
- Compliance - ethical scraping practices and rate limiting
- Scalability - handle thousands of requests with consistent quality
Start scraping Perplexity today.
The insights from Perplexity’s AI-powered search are too valuable to ignore. Whether you’re monitoring market trends, conducting research, tracking competitive intelligence, or building automated workflows, access to structured Perplexity data provides incredible opportunities.
For most developers and businesses, we recommend using cloro’s Perplexity scraper. You get:
- Immediate access to reliable scraping infrastructure
- Automatic query intent detection and structured data extraction
- Real-time web source integration with proper attribution
- Built-in anti-bot evasion and rate limiting
- Comprehensive error handling and retries
- Rich structured output for shopping, travel, media, and general queries
The cost of building and maintaining this infrastructure yourself typically runs $3,000-7,000/month in development time, browser instances, proxy services, and ongoing maintenance.
For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Perplexity frequently updates their detection systems and response formats.
The window of opportunity is closing. As more businesses discover the value of AI-powered search intelligence, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their Perplexity presence now will build advantages that become increasingly difficult to overcome.
Ready to unlock Perplexity data for your business? Get started with cloro’s API to start accessing AI-powered search intelligence.
Don’t let your competitors define how AI presents information in your industry. Start scraping Perplexity today.