
How to scrape ChatGPT with minimal infrastructure


ChatGPT generates over 1 billion responses daily. Behind the scenes, the web interface delivers rich structured data that the API alone never exposes, including citations, shopping recommendations, and brand intelligence.

The challenge: ChatGPT wasn’t built for programmatic access. The platform uses sophisticated anti-bot systems, dynamic content rendering, and Server-Sent Events streaming that traditional scraping tools can’t handle.

After analyzing over 10 million ChatGPT responses, we’ve reverse-engineered the complete process. This guide will show you exactly how to scrape ChatGPT and extract the structured data that makes it valuable for businesses and researchers.


Why scrape ChatGPT responses?

ChatGPT’s API responses are nothing like what users see in the UI.

What you miss with the API:

  • The actual interface experience users get
  • Sources and citations for verification
  • Shopping cards with product recommendations
  • Brand entity recognition and tracking

Why it matters: because the API output differs from what users actually see, you can’t verify the information ChatGPT presents or work on your visibility in its answers without scraping the interface itself.

The math: Scraping costs up to 12x less than direct API usage while providing the real user experience.

Use cases:

  • Verification: Check what ChatGPT actually tells users
  • SEO: Monitor how ChatGPT sources and cites information
  • E-commerce: Track product recommendations and brand mentions
  • Research: Analyze AI response patterns and bias

Understanding ChatGPT’s architecture

Before diving into the technical implementation, let’s understand what makes ChatGPT scraping challenging:

ChatGPT’s response generation process:

  1. Query Processing: Your prompt is analyzed and broken down into sub-queries using query fanout
  2. Search Integration: For web-enabled chats, ChatGPT performs real-time web searches
  3. Streaming Generation: Responses are generated using Server-Sent Events (SSE)
  4. Dynamic Rendering: Content is rendered client-side using React and web components
  5. Source Attribution: Citations and sources are dynamically linked and rendered

Key technical challenges:

JavaScript-heavy interface:

// ChatGPT uses React components that require full browser rendering
const responseContainer = document.querySelector(
  '[data-message-author-role="assistant"]',
);
// Content isn't available in initial HTML
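
Because the assistant message only exists after React has rendered it client-side, a real browser has to load the page and wait for that node before any text can be read. A minimal Playwright sketch of that wait (the helper name is ours):

from playwright.async_api import Page

async def read_assistant_message(page: Page) -> str:
    """Wait for the client-side render, then read the assistant's message text."""
    await page.wait_for_selector('[data-message-author-role="assistant"]')
    node = page.locator('[data-message-author-role="assistant"]').first
    return await node.inner_text()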

Streaming response format:

data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk"}
data: {"choices": [{"delta": {"content": "Hello"}}]}
data: [DONE]

Anti-bot detection:

  • Canvas fingerprinting
  • Behavioral analysis
  • Request pattern monitoring
  • CAPTCHA challenges
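
None of these checks can be fully bypassed with a quick script, but a browser context that avoids the most obvious automation signals is the usual starting point. A minimal sketch (the settings are illustrative, and this alone will not defeat canvas fingerprinting or behavioral analysis):

from playwright.async_api import async_playwright

async def launch_hardened_context():
    """Launch a Chromium context with fewer obvious automation signals (sketch)."""
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=False)  # headful browsers are flagged less often
    context = await browser.new_context(
        locale="en-US",
        timezone_id="America/New_York",
        viewport={"width": 1366, "height": 768},
    )
    # navigator.webdriver === true is the first thing most detection scripts check
    await context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return p, browser, context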

Dynamic content loading:

  • Lazy-loaded source citations
  • Modal-based source browsing
  • Real-time content updates

The event stream parsing challenge

The core of ChatGPT scraping lies in parsing Server-Sent Events (SSE). Here’s what makes it complex:

Event stream structure:

# Raw ChatGPT event stream example
data: {"id": "chatcmpl-123", "object": "chat.completion.chunk", "created": 1677652288}
data: {"choices": [{"index": 0, "delta": {"content": "I"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " recommend"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " using"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " Python"}}]}
data: [DONE]

Parsing challenges:

  1. Mixed data types: Events contain both JSON and special markers
  2. Partial responses: Content comes in chunks that need reconstruction
  3. Metadata extraction: Model info, citations, and search queries are embedded
  4. Error handling: Network issues can split events mid-stream
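
The fourth challenge is easy to underestimate: a single data: line can arrive split across two network reads. If you consume the stream incrementally (rather than capturing the complete response body at once, as the scraper later in this guide does), a small stateful buffer prevents half-parsed events. A sketch:

class SSEBuffer:
    """Accumulate raw chunks and yield only complete "data:" payloads,
    so an event split across two network reads is never half-parsed."""

    def __init__(self):
        self._pending = ""

    def feed(self, chunk: str):
        self._pending += chunk
        # Everything before the last newline is complete; keep the remainder for the next read
        *complete_lines, self._pending = self._pending.split("\n")
        for line in complete_lines:
            line = line.strip()
            if line.startswith("data: "):
                yield line[len("data: "):]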

Python parsing implementation:

import json
from typing import List, Dict, Any

def extract_raw_response(input_string: str) -> List[Dict[str, Any]]:
    """Parse ChatGPT's Server-Sent Events stream."""
    json_objects = []

    # Split by lines that start with "data: "
    lines = input_string.split("\n")

    for line in lines:
        # Skip empty lines and non-data lines
        if not line.strip() or not line.startswith("data: "):
            continue

        # Remove "data: " prefix
        json_str = line[6:].strip()

        # Skip special markers like [DONE]
        if json_str == "[DONE]":
            continue

        # Try to parse as JSON
        try:
            json_obj = json.loads(json_str)

            # Only include if it's a dictionary (object), not string or other types
            if isinstance(json_obj, dict):
                json_objects.append(json_obj)
        except json.JSONDecodeError:
            # Skip invalid JSON
            continue

    return json_objects

Reconstructing the full response:

def reconstruct_content(events: List[Dict[str, Any]]) -> str:
    """Rebuild complete response from streaming chunks."""
    content_parts = []

    for event in events:
        # Extract content from delta messages
        if 'choices' in event and len(event['choices']) > 0:
            delta = event['choices'][0].get('delta', {})
            if 'content' in delta:
                content_parts.append(delta['content'])

    return ''.join(content_parts)
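
Fed the sample stream shown earlier, the two helpers recover the complete message:

sample_stream = (
    'data: {"id": "chatcmpl-123", "object": "chat.completion.chunk", "created": 1677652288}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": "I"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " recommend"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " using"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " Python"}}]}\n'
    "data: [DONE]\n"
)

events = extract_raw_response(sample_stream)
print(reconstruct_content(events))  # -> "I recommend using Python"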

Building the scraping infrastructure

Let’s build the complete scraping system step by step.

Required components:

  1. Browser automation: Playwright or Selenium
  2. Network interception: To capture API calls
  3. Event stream parser: To process SSE data
  4. Content extractor: To parse HTML and extract structured data
  5. Error handling: For captchas, rate limits, and network issues

Complete scraper implementation:

import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List

class ChatGPTScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_page_interceptor(self, page: Page):
        """Set up network request interception."""

        async def handle_response(response):
            # Capture conversation API responses
            if 'backend-api/f/conversation' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)

        page.on('response', handle_response)

    async def scrape_chatgpt(self, prompt: str) -> Dict[str, Any]:
        """Main scraping function."""

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

            # Set up response interception
            await self.setup_page_interceptor(page)

            try:
                # Navigate to ChatGPT
                await page.goto('https://chatgpt.com/?temporary-chat=true')

                # Wait for textarea and enter prompt
                await page.wait_for_selector('#prompt-textarea')
                await page.fill('#prompt-textarea', '/search')  # Enable web search
                await page.press('#prompt-textarea', 'Enter')
                await asyncio.sleep(0.5)

                await page.fill('#prompt-textarea', prompt)
                await page.press('#prompt-textarea', 'Enter')

                # Wait for response completion
                await self.wait_for_response(page)

                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_chatgpt_response(raw_response)
                else:
                    raise Exception("No response captured")

            finally:
                await browser.close()

    async def wait_for_response(self, page: Page, timeout: int = 60):
        """Wait for ChatGPT response completion."""

        for i in range(timeout * 2):  # Check every 500ms
            # Check if we have captured responses
            if self.captured_responses:
                # Verify response is complete
                response = self.captured_responses[0]
                if '[DONE]' in response:
                    return

            # Check for content in DOM
            content_div = page.locator('[data-message-author-role="assistant"]').first
            if await content_div.count() > 0:
                content_text = await content_div.text_content()
                if content_text and len(content_text.strip()) > 50:
                    # Check if response seems complete
                    await asyncio.sleep(2)  # Allow for final updates
                    continue

            await asyncio.sleep(0.5)

        raise Exception("Response timeout")

    def parse_chatgpt_response(self, raw_response: str) -> Dict[str, Any]:
        """Parse the raw ChatGPT response into structured data."""

        # Extract streaming events
        events = extract_raw_response(raw_response)

        # Reconstruct content
        content = reconstruct_content(events)

        # Extract metadata
        model = self.extract_model_info(events)
        search_queries = self.extract_search_queries(events)

        return {
            'content': content,
            'model': model,
            'search_queries': search_queries,
            'raw_events': events
        }

    def extract_model_info(self, events: List[Dict]) -> str:
        """Extract model information from events."""
        for event in events:
            if 'model' in event:
                return event['model']
        return 'unknown'

    def extract_search_queries(self, events: List[Dict]) -> List[str]:
        """Extract search queries from the response."""
        queries = []

        # This requires analyzing the metadata in the events
        # Implementation varies based on ChatGPT's current format
        for event in events:
            if 'metadata' in event:
                metadata = event.get('metadata', {})
                if 'search_queries' in metadata:
                    queries.extend(metadata['search_queries'])

        return queries
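
Running the scraper end to end is a single asyncio call (the prompt is illustrative):

async def main():
    scraper = ChatGPTScraper()
    result = await scraper.scrape_chatgpt("What are the best project management tools in 2025?")
    print("Model:", result['model'])
    print("Search queries:", result['search_queries'])
    print(result['content'][:300])

asyncio.run(main())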

Parsing the streaming response data

Now let’s dive deeper into the data extraction process:

Extracting citations and sources:

async def extract_sources(page: Page) -> List[Dict[str, Any]]:
    """Extract source citations from ChatGPT response."""

    try:
        # Click sources button if available
        sources_button = page.locator("button.group\\/footnote")
        if await sources_button.count() > 0:
            await sources_button.first.click()

            # Wait for modal
            modal = page.locator('[data-testid="screen-threadFlyOut"]')
            await modal.wait_for(state="visible", timeout=2000)

            # Extract links from modal
            links = modal.locator("a")
            link_count = await links.count()

            sources = []
            for i in range(link_count):
                link = links.nth(i)
                url = await link.get_attribute('href')
                text = await link.text_content()

                if url:
                    sources.append({
                        'url': url,
                        'title': text.strip() if text else '',
                        'position': i + 1
                    })

            return sources

    except Exception as e:
        print(f"Source extraction failed: {e}")

    # Reached when no sources button exists or extraction failed
    return []
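
Because this runs against the live DOM, it has to execute while the page is still open. In the scraper above, you would call it right after wait_for_response and merge its output with the network-captured data; a sketch of the wiring (the wrapper name is ours):

async def scrape_with_sources(scraper: ChatGPTScraper, page: Page) -> Dict[str, Any]:
    """Combine the intercepted SSE response with DOM-extracted citations.
    Call only after wait_for_response(page) has returned."""
    parsed = scraper.parse_chatgpt_response(scraper.captured_responses[0])
    parsed['sources'] = await extract_sources(page)
    return parsed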

Shopping card extraction:

def extract_shopping_cards(events: List[Dict]) -> List[Dict[str, Any]]:
    """Extract product/shopping information from response."""

    shopping_cards = []

    for event in events:
        if 'shopping_card' in event:
            card_data = event['shopping_card']

            # Parse product information
            products = []
            for product in card_data.get('products', []):
                product_info = {
                    'title': product.get('title'),
                    'url': product.get('url'),
                    'price': product.get('price'),
                    'rating': product.get('rating'),
                    'num_reviews': product.get('num_reviews'),
                    'image_urls': product.get('image_urls', []),
                    'offers': []
                }

                # Parse merchant offers
                for offer in product.get('offers', []):
                    product_info['offers'].append({
                        'merchant_name': offer.get('merchant_name'),
                        'price': offer.get('price'),
                        'url': offer.get('url'),
                        'available': offer.get('available', True)
                    })

                products.append(product_info)

            shopping_cards.append({
                'tags': card_data.get('tags', []),
                'products': products
            })

    return shopping_cards
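
The cards come back as plain dictionaries, so downstream analysis stays simple. For example, picking the cheapest offer per product is a few lines (a sketch that assumes prices are display strings such as "$39.99"):

def cheapest_offers(shopping_cards: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each product title to its lowest-priced offer."""
    best = {}
    for card in shopping_cards:
        for product in card['products']:
            priced = [o for o in product['offers'] if o.get('price')]
            if not priced:
                continue
            # Prices are display strings such as "$39.99"; strip formatting before comparing
            lowest = min(priced, key=lambda o: float(str(o['price']).lstrip('$').replace(',', '')))
            best[product['title']] = lowest
    return best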

Entity extraction:

def extract_entities(events: List[Dict]) -> List[Dict[str, Any]]:
    """Extract named entities from ChatGPT response."""

    entities = []

    for event in events:
        if 'entities' in event:
            for entity in event['entities']:
                entities.append({
                    'type': entity.get('type'),
                    'name': entity.get('name'),
                    'confidence': entity.get('confidence'),
                    'context': entity.get('context')
                })

    return entities

Using cloro’s managed ChatGPT scraper

Building and maintaining a reliable ChatGPT scraper is complex and resource-intensive. That’s why we built cloro, a managed API that handles all the complexity for you.

Simple API integration:

import requests
import json

# Your prompt
prompt = "Compare the top 3 programming languages for web development in 2025"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/chatgpt',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': prompt,
        'country': 'US',
        'include': {
            'markdown': True,
            'rawResponse': True,
            'searchQueries': True
        }
    }
)

result = response.json()
print(json.dumps(result, indent=2))

What cloro handles automatically:

  • Browser management: Rotating browsers, user agents, and fingerprints
  • Anti-bot evasion: Advanced CAPTCHA solving and detection avoidance
  • Rate limiting: Intelligent request scheduling and backoff strategies
  • Data parsing: Automatic extraction of structured data from responses
  • Error handling: Comprehensive retry logic and error recovery
  • Scalability: Distributed infrastructure for high-volume requests

Structured output you get:

{
  "status": "success",
  "result": {
    "model": "gpt-5-mini",
    "text": "When comparing programming languages for web development in 2025...",
    "markdown": "**When comparing programming languages for web development in 2025**...",
    "sources": [
      {
        "position": 1,
        "url": "https://developer.mozilla.org/en-US/docs/Learn",
        "label": "MDN Web Docs",
        "description": "Comprehensive web development documentation"
      }
    ],
    "shoppingCards": [
      {
        "tags": ["programming", "education"],
        "products": [
          {
            "title": "Python Crash Course",
            "price": "$39.99",
            "rating": 4.8,
            "offers": [...]
          }
        ]
      }
    ],
    "searchQueries": ["web development languages 2025", "popular programming frameworks"],
    "rawResponse": [...]
  }
}
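
Because the output is plain JSON, downstream checks are one-liners. For example, testing whether your domain appears among the cited sources (field names exactly as shown above):

def domain_cited(result: dict, domain: str) -> bool:
    """Return True if any cited source URL contains the given domain."""
    sources = result.get('result', {}).get('sources', [])
    return any(domain in source.get('url', '') for source in sources)

print(domain_cited(result, 'developer.mozilla.org'))  # True for the example above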

Benefits of using cloro:

  • 99.9% uptime vs. DIY solutions that break frequently
  • P50 latency < 60s vs. manual scraping that takes hours
  • No infrastructure costs - we handle browsers, proxies, and maintenance
  • Structured data - automatic parsing of citations, shopping cards, and entities
  • Compliance - ethical scraping practices and rate limiting
  • Scalability - handle thousands of requests without breaking ChatGPT’s terms

Start scraping ChatGPT today.

The insights from ChatGPT data are too valuable to ignore. Whether you’re a researcher studying AI behavior, a business monitoring your competitive landscape, or a developer building AI-powered tools, access to structured ChatGPT data provides incredible opportunities.

For most developers and businesses, we recommend using cloro’s ChatGPT scraper. You get:

  • Immediate access to reliable scraping infrastructure
  • Automatic data parsing and structuring
  • Built-in anti-bot evasion and rate limiting
  • Comprehensive error handling and retries
  • Structured JSON output with all metadata

The cost of building and maintaining this infrastructure yourself typically runs $5,000-10,000/month in development time, browser instances, proxy services, and maintenance overhead.

For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as ChatGPT frequently updates its anti-bot measures and response formats.

The window of opportunity is closing. As more businesses discover the value of AI monitoring, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their ChatGPT presence now will build advantages that become increasingly difficult to overcome.

Ready to unlock ChatGPT data for your business? Get started with cloro’s API to start accessing conversational AI insights.

Don’t let your competitors define how AI describes your industry. Start scraping ChatGPT today.