
How to scrape Perplexity with minimal infrastructure


Perplexity AI serves over 10 million searches daily. The platform combines AI reasoning with real-time web search, delivering rich, structured results that direct API access completely misses.

The challenge: Perplexity wasn’t designed for programmatic access. The platform uses Server-Sent Events streaming and sophisticated query intent detection that traditional scraping tools can’t handle.

After analyzing 5+ million Perplexity responses, we’ve reverse-engineered their complete data extraction process. This guide will show you exactly how to scrape Perplexity and extract the rich structured data that makes it a powerful AI-powered search engine.

Table of contents

  • Why scrape Perplexity responses?
  • Understanding Perplexity’s architecture
  • The Server-Sent Events parsing challenge
  • Building the scraping infrastructure
  • Query intent detection and data extraction
  • Using cloro’s managed Perplexity scraper
  • Start scraping Perplexity today

Why scrape Perplexity responses?

Perplexity’s API responses are nothing like what users see in the UI.

What you miss with the API:

  • The actual search experience users get
  • Real-time web sources with citations
  • Query intent detection and structured data
  • Shopping cards and travel recommendations

Why it matters: Because the API output diverges so much from the UI, scraping the web interface is the only way to verify what Perplexity actually tells users or to monitor and improve how your content shows up in its answers.

The math: Scraping can cost up to 10x less than direct API usage while still capturing the real search experience users get.

Use cases:

  • Verification: Check what Perplexity actually tells users
  • SEO: Monitor how Perplexity sources and cites information
  • E-commerce: Track product recommendations and pricing
  • Travel: Monitor hotel listings and travel data

Perplexity is a leader in the new wave of AI Search Engines.

Understanding Perplexity’s architecture

Perplexity combines multiple sophisticated systems to deliver its AI-powered search results:

Perplexity’s response generation process:

  1. Query Analysis: Classifies search intent (shopping, travel, media, general)
  2. Search Integration: Performs real-time web searches across multiple sources
  3. AI Synthesis: Uses LLMs to synthesize information with citations
  4. Structured Extraction: Automatically extracts rich data objects based on intent
  5. Streaming Response: Delivers results via Server-Sent Events (SSE)

Key technical challenges:

Multi-modal response structure:

// Perplexity combines text, sources, and rich data objects
{
  answer: "AI-generated response with citations [1][2]",
  sources: ["https://example.com/source1", "https://example.com/source2"],
  shoppingCards: [...], // When shopping intent detected
  videos: [...], // When media intent detected
  hotels: [...] // When travel intent detected
}

Server-Sent Events format:

event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Hello"}}]}

event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Hello world"}}]}

event: message
data: {"final_sse_message": true, "blocks": [...], "web_results": [...]}

Query intent detection:

  • Shopping queries → Product cards with pricing
  • Travel queries → Hotel listings and places
  • Media queries → Videos and images
  • General queries → Text with citations
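
On the scraping side, intent mostly shows up as which block types appear in the final SSE message. A small reference mapping is sketched below; the block names are the ones used later in this guide, but the grouping itself is an assumption you should verify against real responses:

# Rough client-side mapping from detected intent to the block types to look for.
# Block names match those used in the extraction code later in this guide;
# the grouping is an assumption, not an official Perplexity contract.
INTENT_BLOCKS = {
    "shopping": ["shopping_block"],
    "travel": ["hotels_mode_block", "maps_mode_block"],
    "media": ["media_block"],
    "general": ["markdown_block", "web_result_block"],
}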

Anti-bot detection:

  • Request pattern analysis
  • Browser fingerprinting
  • Rate limiting with exponential backoff
  • Dynamic content loading challenges
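
From the scraper’s side, the practical consequence of rate limiting is that blocked or failed requests should be retried with exponential backoff rather than immediately re-sent. A minimal sketch is below; scrape_once stands in for your own request coroutine, and the attempt counts and delays are illustrative rather than tuned values:

import asyncio
import random

async def with_backoff(scrape_once, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry a scraping coroutine with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await scrape_once()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            # Delays grow as 2s, 4s, 8s, ... with random jitter added so the
            # retry pattern is not perfectly regular.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)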

The Server-Sent Events parsing challenge

The core of Perplexity scraping lies in parsing their SSE stream and extracting structured data blocks:

SSE event structure:

# Raw Perplexity SSE example
event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "Recent"}}]}

event: message
data: {"final_sse_message": false, "blocks": [{"markdown_block": {"answer": "developments"}}]}

event: message
data: {"final_sse_message": true, "blocks": [...], "web_results": [...]}

Parsing challenges:

  1. Multi-event streaming: Content arrives in multiple SSE events
  2. Final message detection: Only the last event contains complete structured data
  3. Block-based structure: Different data types are in separate blocks
  4. Mixed content types: Text, sources, media, and structured objects combined

Python SSE parsing implementation:

import json
from typing import List, Dict, Any, Optional

def get_last_final_message(sse_response: str) -> Optional[dict]:
    """
    Extract the last message with final=true from Perplexity SSE response.
    """
    messages = sse_response.strip().split("\n\n")

    for message in reversed(messages):
        if not message.startswith("event: message"):
            continue

        # Extract the data line
        lines = message.split("\n")
        for line in lines:
            if line.startswith("data: "):
                try:
                    data = json.loads(line[6:])  # Remove 'data: ' prefix

                    # Check if this is the final message
                    if data.get("final_sse_message"):
                        return data
                except json.JSONDecodeError:
                    continue

    return None

def extract_answer_text(final_message_data: Optional[dict]) -> str:
    """
    Extract the answer text from the final message data.
    """
    if not final_message_data:
        return ""

    blocks = final_message_data.get("blocks", [])

    for block in blocks:
        if "markdown_block" in block:
            return block["markdown_block"].get("answer", "")

    return ""

Source extraction from web results:

def extract_perplexity_sources(final_message_data: Optional[dict]) -> List[Dict[str, Any]]:
    """
    Extract sources from Perplexity SSE response.
    """
    sources = []

    if not final_message_data:
        return sources

    # Extract web_results from blocks
    blocks = final_message_data.get("blocks", [])

    for block in blocks:
        # Check for web_result_block
        if "web_result_block" in block:
            web_results = block["web_result_block"].get("web_results", [])

            for idx, result in enumerate(web_results, start=1):
                sources.append({
                    "position": idx,
                    "label": result.get("name", ""),
                    "url": result.get("url", ""),
                    "description": result.get("snippet") or result.get("meta_data", {}).get("description"),
                })

    return sources
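
To see these helpers working together, here is a small end-to-end usage sketch. The SSE payload is hand-written and deliberately simplified: the field names follow the structures shown above, but real responses contain many more blocks and fields:

sample_sse = (
    'event: message\n'
    'data: {"final_sse_message": false, "blocks": '
    '[{"markdown_block": {"answer": "Recent"}}]}\n'
    '\n'
    'event: message\n'
    'data: {"final_sse_message": true, "blocks": ['
    '{"markdown_block": {"answer": "Recent developments include... [1]"}}, '
    '{"web_result_block": {"web_results": [{"name": "Example Source", '
    '"url": "https://example.com/article", "snippet": "A short summary."}]}}]}\n'
)

final_data = get_last_final_message(sample_sse)
print(extract_answer_text(final_data))         # "Recent developments include... [1]"
print(extract_perplexity_sources(final_data))  # [{"position": 1, "label": "Example Source", ...}]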

Building the scraping infrastructure

Let’s build the complete Perplexity scraping system:

Required components:

  1. Browser automation: Playwright for dynamic content rendering
  2. SSE interception: Network request capture and parsing
  3. Intent detection: Query analysis for data extraction
  4. Structured data parsing: Shopping cards, media, travel data extraction

Complete scraper implementation:

import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List, Optional

class PerplexityScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_sse_interceptor(self, page: Page):
        """Set up Server-Sent Events interception."""

        async def handle_response(response):
            # Capture Perplexity SSE responses
            if 'rest/sse/perplexity_ask' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)

        page.on('response', handle_response)

    async def scrape_perplexity(self, query: str, country: str = 'US') -> Dict[str, Any]:
        """Main scraping function."""

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

            # Set up SSE interception
            await self.setup_sse_interceptor(page)

            try:
                # Navigate to Perplexity
                await page.goto('https://www.perplexity.ai', timeout=20_000)

                # Handle any modals or popups
                await self.remove_dialogs(page)

                # Fill and submit query
                await page.wait_for_selector('#ask-input', state="visible", timeout=10_000)
                await page.fill('#ask-input', query)
                await page.click('[data-testid="submit-button"]', timeout=5_000)

                # Wait for SSE response
                await self.wait_for_perplexity_response(page)

                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_perplexity_response(raw_response)
                else:
                    raise Exception("No SSE response captured")

            finally:
                await browser.close()

    async def remove_dialogs(self, page: Page):
        """Remove any modal dialogs or popups."""
        await page.evaluate("""
            // Remove all portal elements
            const elements = document.querySelectorAll("[data-type='portal']");
            elements.forEach(element => {
                element.remove();
            });
        """)

    async def wait_for_perplexity_response(self, page: Page, timeout: int = 60):
        """Wait for Perplexity SSE response completion."""

        for _ in range(timeout * 2):  # Check every 500ms
            # Check if we have captured responses
            if self.captured_responses:
                # Verify response contains final message
                final_message = get_last_final_message(self.captured_responses[0])
                if final_message:
                    return

            await asyncio.sleep(0.5)

        raise Exception("Response timeout after 60 seconds")

    def parse_perplexity_response(self, sse_response: str) -> Dict[str, Any]:
        """Parse the raw Perplexity SSE response into structured data."""

        # Extract final message data
        final_message_data = get_last_final_message(sse_response)

        # Extract core content
        text = extract_answer_text(final_message_data)
        sources = extract_perplexity_sources(final_message_data)

        result = {
            'text': text,
            'sources': sources,
        }

        # Extract shopping products if shopping intent detected
        if has_shopping_intent(final_message_data):
            shopping_cards = extract_perplexity_shopping_products(final_message_data)
            if shopping_cards:
                result['shopping_cards'] = shopping_cards

        # Extract media content
        media = extract_perplexity_media(final_message_data)
        if media['videos']:
            result['videos'] = media['videos']
        if media['images']:
            result['images'] = media['images']

        # Extract travel data
        if has_places_intent(final_message_data):
            hotels_places = extract_perplexity_hotels_and_places(final_message_data)
            if hotels_places['hotels']:
                result['hotels'] = hotels_places['hotels']
            if hotels_places['places']:
                result['places'] = hotels_places['places']

        # Extract related queries
        related_queries = extract_related_queries(final_message_data)
        if related_queries:
            result['related_queries'] = related_queries

        return result
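
A minimal way to run this scraper sketch end to end (assuming Playwright and its Chromium build are installed, and that the helper functions from the previous and following sections are in scope):

# Prerequisites (assumed): pip install playwright && playwright install chromium
import asyncio

async def main():
    scraper = PerplexityScraper()
    result = await scraper.scrape_perplexity(
        "What are the latest developments in quantum computing 2025?"
    )
    print(result["text"][:200])
    for source in result["sources"]:
        print(source["position"], source["url"])

if __name__ == "__main__":
    asyncio.run(main())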

Query intent detection and data extraction

Perplexity automatically detects different query types and extracts corresponding structured data:

Shopping intent detection:

def has_shopping_intent(final_message_data: Optional[dict]) -> bool:
    """
    Check if the response indicates shopping intent.
    """
    if not final_message_data:
        return False

    # Check answer modes for shopping
    answer_modes = final_message_data.get("answer_modes", [])
    for mode in answer_modes:
        if isinstance(mode, dict) and mode.get("answer_mode_type") == "SHOPPING":
            return True

    # Check classifier results
    classifier_results = final_message_data.get("classifier_results", {})
    return classifier_results.get("shopping_intent", False)

def extract_perplexity_shopping_products(final_message_data: Optional[dict]) -> List[Dict[str, Any]]:
    """
    Extract shopping products from Perplexity response.
    """
    shopping_cards = []

    if not final_message_data:
        return shopping_cards

    blocks = final_message_data.get("blocks", [])

    for block in blocks:
        # Extract from shopping_block
        if "shopping_block" in block:
            shopping_block = block["shopping_block"]
            products = shopping_block.get("products", [])

            for product in products:
                if isinstance(product, dict):
                    product_info = {
                        "title": product.get("name"),
                        "url": product.get("url"),
                        "description": product.get("description"),
                        "price": product.get("price"),
                        "original_price": product.get("original_price"),
                        "rating": product.get("rating"),
                        "num_reviews": product.get("num_reviews"),
                        "image_urls": product.get("image_urls", []),
                        "merchant": product.get("merchant"),
                        "id": product.get("id"),
                        "variants": product.get("variants", []),
                        "offers": product.get("offers", [])
                    }

                    shopping_cards.append({
                        "products": [product_info],
                        "tags": shopping_block.get("tags", [])
                    })

    return shopping_cards

Media content extraction:

def extract_perplexity_media(final_message_data: Optional[dict]) -> dict:
    """
    Extract media items (videos and images) from Perplexity response.
    """
    videos = []
    images = []

    if not final_message_data:
        return {"videos": videos, "images": images}

    blocks = final_message_data.get("blocks", [])

    for block in blocks:
        # Extract from media_block
        if "media_block" in block:
            media_block = block["media_block"]
            media_items = media_block.get("media_items", [])

            for item in media_items:
                if isinstance(item, dict):
                    medium = item.get("medium", "").lower()
                    media_item = {
                        "title": item.get("name"),
                        "url": item.get("url"),
                        "thumbnail": item.get("thumbnail"),
                        "medium": medium,
                        "source": item.get("source"),
                    }

                    # Add image dimensions
                    for dim_field in ["image_width", "image_height", "thumbnail_width", "thumbnail_height"]:
                        if dim_field in item:
                            try:
                                media_item[dim_field] = int(item[dim_field])
                            except (ValueError, TypeError):
                                pass

                    # Route the item by its medium
                    if medium == "video":
                        videos.append(media_item)
                    elif medium == "image":
                        images.append(media_item)

    return {"videos": videos, "images": images}

Travel data extraction:

def extract_perplexity_hotels_and_places(final_message_data: Optional[dict]) -> dict:
    """
    Extract hotels and places from Perplexity response.
    """
    hotels = []
    places = []

    if not final_message_data:
        return {"hotels": hotels, "places": places}

    blocks = final_message_data.get("blocks", [])

    for block in blocks:
        # Extract from hotels_mode_block
        if "hotels_mode_block" in block:
            hotel_block = block["hotels_mode_block"]
            hotel_places = hotel_block.get("places", [])

            for place in hotel_places:
                if isinstance(place, dict):
                    hotel_item = {
                        "name": place.get("name"),
                        "url": place.get("url", ""),
                        "rating": place.get("rating"),
                        "num_reviews": place.get("num_reviews"),
                        "address": place.get("address", []) if isinstance(place.get("address"), list) else [place.get("address", "")],
                        "phone": place.get("phone"),
                        "description": place.get("description"),
                        "image_url": place.get("image_url"),
                        "images": place.get("images", []),
                        "lat": place.get("lat"),
                        "lng": place.get("lng"),
                        "price_level": place.get("price_level"),
                        "categories": place.get("categories", [])
                    }

                    hotels.append(hotel_item)

        # Extract from maps_mode_block
        elif "maps_mode_block" in block:
            maps_block = block["maps_mode_block"]
            map_places = maps_block.get("places", [])

            for place in map_places:
                if isinstance(place, dict):
                    place_item = {
                        "name": place.get("name"),
                        "url": place.get("url", ""),
                        "address": place.get("address", []) if isinstance(place.get("address"), list) else [place.get("address", "")],
                        "rating": place.get("rating"),
                        "lat": place.get("lat"),
                        "lng": place.get("lng"),
                        "categories": place.get("categories", []),
                        "map_url": place.get("map_url"),
                        "images": place.get("images", [])
                    }

                    places.append(place_item)

    return {"hotels": hotels, "places": places}

Related queries extraction:

def extract_related_queries(final_message_data: Optional[dict]) -> List[str]:
    """
    Extract related queries from Perplexity response.
    """
    if not final_message_data:
        return []

    # Extract from related_queries field (preferred source)
    queries = final_message_data.get("related_queries", [])
    if isinstance(queries, list):
        related = [q.strip() for q in queries if isinstance(q, str) and q.strip()]
        if related:
            return related

    # Check related_query_items for text fields
    query_items = final_message_data.get("related_query_items", [])
    if isinstance(query_items, list):
        related = []
        for item in query_items:
            if isinstance(item, dict):
                text = item.get("text")
                if isinstance(text, str):
                    text = text.strip()
                    if text and text not in related:
                        related.append(text)
        if related:
            return related

    return []

Using cloro’s managed Perplexity scraper

Building and maintaining a reliable Perplexity scraper requires significant infrastructure and ongoing maintenance. That’s why we built cloro - a managed API that handles all the complexity for you.

Simple API integration:

import requests
import json

# Your search query
query = "What are the latest developments in quantum computing 2025?"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/perplexity',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': query,
        'country': 'US',
        'include': {
            'markdown': True,
            'html': True
        }
    }
)

result = response.json()
print(json.dumps(result, indent=2))

What cloro handles automatically:

  • Query intent detection: Automatic classification of shopping, travel, media, and general queries
  • SSE parsing: Complete Server-Sent Events handling and data extraction
  • Anti-bot evasion: Advanced techniques to avoid detection and blocking
  • Rate limiting: Intelligent request scheduling and backoff strategies
  • Structured data extraction: Automatic parsing of shopping cards, media, and travel data
  • Error handling: Comprehensive retry logic and error recovery
  • Scalability: Distributed infrastructure for high-volume requests

Rich structured output you get:

{
  "status": "success",
  "result": {
    "text": "Recent developments in quantum computing include breakthrough error correction methods...",
    "sources": [
      {
        "position": 1,
        "url": "https://example.com/quantum-breakthrough",
        "label": "MIT Technology Review",
        "description": "Scientists achieve 99.9% qubit fidelity in room temperature conditions..."
      }
    ],
    "shopping_cards": [
      {
        "products": [
          {
            "title": "Quantum Computing Book",
            "url": "https://example.com/product",
            "price": "$89.99",
            "rating": 4.8,
            "num_reviews": 1250,
            "image_urls": ["https://example.com/image.jpg"],
            "merchant": "TechBooks",
            "offers": [...]
          }
        ],
        "tags": ["education", "quantum"]
      }
    ],
    "videos": [
      {
        "title": "Quantum Computing Explained",
        "url": "https://youtube.com/watch?v=example",
        "thumbnail": "https://example.com/thumb.jpg",
        "medium": "video",
        "source": "youtube"
      }
    ],
    "hotels": [
      {
        "name": "Quantum Research Hotel",
        "url": "https://example.com/hotel",
        "rating": 4.5,
        "address": ["123 Tech Street", "Innovation City"],
        "price_level": "$$$",
        "categories": ["Hotel", "Business"]
      }
    ],
    "related_queries": [
      "What companies are leading quantum computing?",
      "How does quantum error correction work?"
    ]
  }
}

Benefits of using cloro:

  • 99.9% uptime vs. DIY solutions that frequently break
  • P50 latency < 30s vs. manual scraping that takes hours
  • Automatic query intent detection without implementing complex classifiers
  • Rich structured data extraction for shopping, travel, media, and general queries
  • No infrastructure costs - we handle browsers, proxies, and maintenance
  • Compliance - ethical scraping practices and rate limiting
  • Scalability - handle thousands of requests with consistent quality

Start scraping Perplexity today.

The insights from Perplexity’s AI-powered search are too valuable to ignore. Whether you’re monitoring market trends, conducting research, tracking competitive intelligence, or building automated workflows, access to structured Perplexity data provides incredible opportunities.

For most developers and businesses, we recommend using cloro’s Perplexity scraper. You get:

  • Immediate access to reliable scraping infrastructure
  • Automatic query intent detection and structured data extraction
  • Real-time web source integration with proper attribution
  • Built-in anti-bot evasion and rate limiting
  • Comprehensive error handling and retries
  • Rich structured output for shopping, travel, media, and general queries

The cost of building and maintaining this infrastructure yourself typically runs $3,000-7,000/month in development time, browser instances, proxy services, and ongoing maintenance.

For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Perplexity frequently updates their detection systems and response formats.

The window of opportunity is closing. As more businesses discover the value of AI-powered search intelligence, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their Perplexity presence now will build advantages that become increasingly difficult to overcome.

Ready to unlock Perplexity data for your business? Get started with cloro’s API to start accessing AI-powered search intelligence.

Don’t let your competitors define how AI presents information in your industry. Start scraping Perplexity today.