How to scrape ChatGPT with minimal infrastructure
ChatGPT generates over 1 billion responses daily. Behind the scenes, the web interface delivers rich structured data that direct API calls never return, including citations, shopping recommendations, and brand intelligence.
The challenge: ChatGPT wasn’t built for programmatic access. The platform uses sophisticated anti-bot systems, dynamic content rendering, and Server-Sent Events streaming that traditional scraping tools can’t handle.
After analyzing over 10 million ChatGPT responses, we’ve reverse-engineered the complete process. This guide will show you exactly how to scrape ChatGPT and extract the structured data that makes it valuable for businesses and researchers.
Table of contents
- Why scrape ChatGPT responses?
- Understanding ChatGPT’s architecture
- The event stream parsing challenge
- Building the scraping infrastructure
- Parsing the streaming response data
- Extracting structured data from responses
- Using cloro’s managed ChatGPT scraper
Why scrape ChatGPT responses?
ChatGPT’s API responses are nothing like what users see in the UI.
What you miss with the API:
- The actual interface experience users get
- Sources and citations for verification
- Shopping cards with product recommendations
- Brand entity recognition and tracking
Why it matters: without the interface-level data, you can't verify what ChatGPT actually tells users or track how it sources and cites information for SEO.
The math: Scraping costs up to 12x less than direct API usage while providing the real user experience.
Use cases:
- Verification: Check what ChatGPT actually tells users
- SEO: Monitor how ChatGPT sources and cites information
- E-commerce: Track product recommendations and brand mentions
- Research: Analyze AI response patterns and bias
Understanding ChatGPT’s architecture
Before diving into the technical implementation, let’s understand what makes ChatGPT scraping challenging:
ChatGPT’s response generation process:
- Query Processing: Your prompt is analyzed and broken down into sub-queries using query fanout
- Search Integration: For web-enabled chats, ChatGPT performs real-time web searches
- Streaming Generation: Responses are generated using Server-Sent Events (SSE)
- Dynamic Rendering: Content is rendered client-side using React and web components
- Source Attribution: Citations and sources are dynamically linked and rendered
Key technical challenges:
JavaScript-heavy interface:
// ChatGPT uses React components that require full browser rendering
const responseContainer = document.querySelector(
  '[data-message-author-role="assistant"]',
);
// Content isn't available in initial HTML
Streaming response format:
data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk"}
data: {"choices": [{"delta": {"content": "Hello"}}]}
data: [DONE]
Anti-bot detection:
- Canvas fingerprinting
- Behavioral analysis
- Request pattern monitoring
- CAPTCHA challenges
Dynamic content loading:
- Lazy-loaded source citations
- Modal-based source browsing
- Real-time content updates
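As a starting point against the detection and dynamic-loading challenges above, launch the browser with a realistic context. The sketch below uses standard Playwright options (user agent, viewport, locale, timezone); the specific values are placeholders, and this is hardening, not a guaranteed bypass of fingerprinting or behavioral analysis:
from playwright.async_api import async_playwright

async def launch_realistic_context():
    """Launch Chromium with a more realistic browser context."""
    p = await async_playwright().start()
    # Headful sessions look less like automation than headless ones
    browser = await p.chromium.launch(headless=False)
    context = await browser.new_context(
        # Illustrative values; rotate them across sessions
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1366, "height": 768},
        locale="en-US",
        timezone_id="America/New_York",
    )
    return p, browser, context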
The event stream parsing challenge
The core of ChatGPT scraping lies in parsing Server-Sent Events (SSE). Here’s what makes it complex:
Event stream structure:
# Raw ChatGPT event stream example
data: {"id": "chatcmpl-123", "object": "chat.completion.chunk", "created": 1677652288}
data: {"choices": [{"index": 0, "delta": {"content": "I"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " recommend"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " using"}}]}
data: {"choices": [{"index": 0, "delta": {"content": " Python"}}]}
data: [DONE]
Parsing challenges:
- Mixed data types: Events contain both JSON and special markers
- Partial responses: Content comes in chunks that need reconstruction
- Metadata extraction: Model info, citations, and search queries are embedded
- Error handling: Network issues can split events mid-stream
Python parsing implementation:
import json
from typing import List, Dict, Any

def extract_raw_response(input_string: str) -> List[Dict[str, Any]]:
    """Parse ChatGPT's Server-Sent Events stream."""
    json_objects = []
    # Split the stream into lines; only "data: " lines carry events
    lines = input_string.split("\n")
    for line in lines:
        # Skip empty lines and non-data lines
        if not line.strip() or not line.startswith("data: "):
            continue
        # Remove the "data: " prefix
        json_str = line[6:].strip()
        # Skip special markers like [DONE]
        if json_str == "[DONE]":
            continue
        # Try to parse the payload as JSON
        try:
            json_obj = json.loads(json_str)
            # Only keep objects (dicts), not strings or other types
            if isinstance(json_obj, dict):
                json_objects.append(json_obj)
        except json.JSONDecodeError:
            # Skip invalid or partial JSON
            continue
    return json_objects
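As noted above, network issues can split events mid-stream. If you consume the stream incrementally rather than from a fully captured string, keep a small buffer so you only parse complete lines and carry the partial tail forward. A minimal sketch, assuming chunks arrive as already-decoded strings:
class SSEBuffer:
    """Accumulate stream chunks and parse only complete "data: " lines."""

    def __init__(self):
        self.pending = ""

    def feed(self, chunk: str) -> List[Dict[str, Any]]:
        self.pending += chunk
        lines = self.pending.split("\n")
        # The last element may be an incomplete line; keep it for the next chunk
        self.pending = lines.pop()
        # Reuse the line-level parser defined above on the complete lines
        return extract_raw_response("\n".join(lines))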
Reconstructing the full response:
def reconstruct_content(events: List[Dict[str, Any]]) -> str:
    """Rebuild the complete response from streaming chunks."""
    content_parts = []
    for event in events:
        # Extract content from delta messages
        if 'choices' in event and len(event['choices']) > 0:
            delta = event['choices'][0].get('delta', {})
            if 'content' in delta:
                content_parts.append(delta['content'])
    return ''.join(content_parts)
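Putting the two helpers together on the sample stream from earlier:
sample_stream = (
    'data: {"id": "chatcmpl-123", "object": "chat.completion.chunk", "created": 1677652288}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": "I"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " recommend"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " using"}}]}\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " Python"}}]}\n'
    'data: [DONE]\n'
)

events = extract_raw_response(sample_stream)
print(reconstruct_content(events))  # -> "I recommend using Python"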
Building the scraping infrastructure
Let’s build the complete scraping system step by step.
Required components:
- Browser automation: Playwright or Selenium
- Network interception: To capture API calls
- Event stream parser: To process SSE data
- Content extractor: To parse HTML and extract structured data
- Error handling: For captchas, rate limits, and network issues
Complete scraper implementation:
import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List

class ChatGPTScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_page_interceptor(self, page: Page):
        """Set up network response interception."""
        async def handle_response(response):
            # Capture conversation API responses
            if 'backend-api/f/conversation' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)
        page.on('response', handle_response)

    async def scrape_chatgpt(self, prompt: str) -> Dict[str, Any]:
        """Main scraping function."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()
            # Set up response interception
            await self.setup_page_interceptor(page)
            try:
                # Navigate to ChatGPT in a temporary chat
                await page.goto('https://chatgpt.com/?temporary-chat=true')
                # Wait for the textarea and enter the prompt
                await page.wait_for_selector('#prompt-textarea')
                await page.fill('#prompt-textarea', '/search')  # Enable web search
                await page.press('#prompt-textarea', 'Enter')
                await asyncio.sleep(0.5)
                await page.fill('#prompt-textarea', prompt)
                await page.press('#prompt-textarea', 'Enter')
                # Wait for the response to finish streaming
                await self.wait_for_response(page)
                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_chatgpt_response(raw_response)
                else:
                    raise Exception("No response captured")
            finally:
                await browser.close()

    async def wait_for_response(self, page: Page, timeout: int = 60):
        """Wait for ChatGPT response completion."""
        for _ in range(timeout * 2):  # Check every 500ms
            # Return once a captured response has finished streaming
            if self.captured_responses:
                response = self.captured_responses[0]
                if '[DONE]' in response:
                    return
            # Check for assistant content in the DOM
            content_div = page.locator('[data-message-author-role="assistant"]').first
            if await content_div.count() > 0:
                content_text = await content_div.text_content()
                if content_text and len(content_text.strip()) > 50:
                    # Content is rendering; allow time for final updates
                    await asyncio.sleep(2)
                    continue
            await asyncio.sleep(0.5)
        raise Exception("Response timeout")

    def parse_chatgpt_response(self, raw_response: str) -> Dict[str, Any]:
        """Parse the raw ChatGPT response into structured data."""
        # Extract streaming events using the parser defined earlier
        events = extract_raw_response(raw_response)
        # Reconstruct the full text content
        content = reconstruct_content(events)
        # Extract metadata
        model = self.extract_model_info(events)
        search_queries = self.extract_search_queries(events)
        return {
            'content': content,
            'model': model,
            'search_queries': search_queries,
            'raw_events': events
        }

    def extract_model_info(self, events: List[Dict]) -> str:
        """Extract model information from events."""
        for event in events:
            if 'model' in event:
                return event['model']
        return 'unknown'

    def extract_search_queries(self, events: List[Dict]) -> List[str]:
        """Extract search queries from the response metadata."""
        queries = []
        # The exact location varies with ChatGPT's current event format
        for event in events:
            if 'metadata' in event:
                metadata = event.get('metadata', {})
                if 'search_queries' in metadata:
                    queries.extend(metadata['search_queries'])
        return queries
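Running the scraper end to end (the prompt is just an example):
async def main():
    scraper = ChatGPTScraper()
    result = await scraper.scrape_chatgpt(
        "What are the best CRM tools for small businesses?"
    )
    print(result['model'])
    print(result['content'][:500])
    print(result['search_queries'])

if __name__ == "__main__":
    asyncio.run(main())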
Parsing the streaming response data
Now let’s dive deeper into the data extraction process:
Extracting citations and sources:
async def extract_sources(page: Page) -> List[Dict[str, Any]]:
    """Extract source citations from the ChatGPT response."""
    sources = []
    try:
        # Click the sources button if it is present
        sources_button = page.locator("button.group\\/footnote")
        if await sources_button.count() > 0:
            await sources_button.first.click()
            # Wait for the sources modal
            modal = page.locator('[data-testid="screen-threadFlyOut"]')
            await modal.wait_for(state="visible", timeout=2000)
            # Extract links from the modal
            links = modal.locator("a")
            link_count = await links.count()
            for i in range(link_count):
                link = links.nth(i)
                url = await link.get_attribute('href')
                text = await link.text_content()
                if url:
                    sources.append({
                        'url': url,
                        'title': text.strip() if text else '',
                        'position': i + 1
                    })
        return sources
    except Exception as e:
        print(f"Source extraction failed: {e}")
        return []
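One way to wire this into the scraper is to call it inside scrape_chatgpt after the response finishes streaming and before the browser closes; the snippet below is a sketch of that change:
# Inside ChatGPTScraper.scrape_chatgpt, after wait_for_response(page):
sources = await extract_sources(page)
if self.captured_responses:
    result = self.parse_chatgpt_response(self.captured_responses[0])
    result['sources'] = sources
    return result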
Shopping card extraction:
def extract_shopping_cards(events: List[Dict]) -> List[Dict[str, Any]]:
    """Extract product/shopping information from the response."""
    shopping_cards = []
    for event in events:
        if 'shopping_card' in event:
            card_data = event['shopping_card']
            # Parse product information
            products = []
            for product in card_data.get('products', []):
                product_info = {
                    'title': product.get('title'),
                    'url': product.get('url'),
                    'price': product.get('price'),
                    'rating': product.get('rating'),
                    'num_reviews': product.get('num_reviews'),
                    'image_urls': product.get('image_urls', []),
                    'offers': []
                }
                # Parse merchant offers
                for offer in product.get('offers', []):
                    product_info['offers'].append({
                        'merchant_name': offer.get('merchant_name'),
                        'price': offer.get('price'),
                        'url': offer.get('url'),
                        'available': offer.get('available', True)
                    })
                products.append(product_info)
            shopping_cards.append({
                'tags': card_data.get('tags', []),
                'products': products
            })
    return shopping_cards
Entity extraction:
def extract_entities(events: List[Dict]) -> List[Dict[str, Any]]:
    """Extract named entities from the ChatGPT response."""
    entities = []
    for event in events:
        if 'entities' in event:
            for entity in event['entities']:
                entities.append({
                    'type': entity.get('type'),
                    'name': entity.get('name'),
                    'confidence': entity.get('confidence'),
                    'context': entity.get('context')
                })
    return entities
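These extractors can be folded into the parsing step so shopping cards and entities travel with the rest of the structured output; a combined helper might look like this:
def parse_full_response(raw_response: str) -> Dict[str, Any]:
    """Combine content, shopping cards, and entities into one record."""
    events = extract_raw_response(raw_response)
    return {
        'content': reconstruct_content(events),
        'shopping_cards': extract_shopping_cards(events),
        'entities': extract_entities(events),
    }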
Using cloro’s managed ChatGPT scraper
Building and maintaining a reliable ChatGPT scraper is complex and resource-intensive. That’s why we built cloro - a managed API that handles all the complexity for you.
Simple API integration:
import requests
import json

# Your prompt
prompt = "Compare the top 3 programming languages for web development in 2025"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/chatgpt',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': prompt,
        'country': 'US',
        'include': {
            'markdown': True,
            'rawResponse': True,
            'searchQueries': True
        }
    }
)

result = response.json()
print(json.dumps(result, indent=2))
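In production you will also want to check the HTTP status and the payload before trusting the result; the field names below follow the sample response shown further down:
response.raise_for_status()  # Raise on 4xx/5xx HTTP errors

if result.get('status') != 'success':
    raise RuntimeError(f"cloro returned an error payload: {result}")

answer_text = result['result']['text']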
What cloro handles automatically:
- Browser management: Rotating browsers, user agents, and fingerprints
- Anti-bot evasion: Advanced CAPTCHA solving and detection avoidance
- Rate limiting: Intelligent request scheduling and backoff strategies
- Data parsing: Automatic extraction of structured data from responses
- Error handling: Comprehensive retry logic and error recovery
- Scalability: Distributed infrastructure for high-volume requests
Structured output you get:
{
  "status": "success",
  "result": {
    "model": "gpt-5-mini",
    "text": "When comparing programming languages for web development in 2025...",
    "markdown": "**When comparing programming languages for web development in 2025**...",
    "sources": [
      {
        "position": 1,
        "url": "https://developer.mozilla.org/en-US/docs/Learn",
        "label": "MDN Web Docs",
        "description": "Comprehensive web development documentation"
      }
    ],
    "shoppingCards": [
      {
        "tags": ["programming", "education"],
        "products": [
          {
            "title": "Python Crash Course",
            "price": "$39.99",
            "rating": 4.8,
            "offers": [...]
          }
        ]
      }
    ],
    "searchQueries": ["web development languages 2025", "popular programming frameworks"],
    "rawResponse": [...]
  }
}
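As a quick example of what you can do with that payload, a simple brand-monitoring check might look for your own domain among the cited sources (the domain and field names here follow the sample above):
MY_DOMAIN = "developer.mozilla.org"  # Replace with the domain you want to track

sources = result['result'].get('sources', [])
mentions = [s for s in sources if MY_DOMAIN in s.get('url', '')]

if mentions:
    for source in mentions:
        print(f"Cited at position {source['position']}: {source['label']}")
else:
    print("Domain not cited in this response")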
Benefits of using cloro:
- 99.9% uptime vs. DIY solutions that break frequently
- P50 latency < 60s vs. manual scraping that takes hours
- No infrastructure costs - we handle browsers, proxies, and maintenance
- Structured data - automatic parsing of citations, shopping cards, and entities
- Compliance - ethical scraping practices and rate limiting
- Scalability - handle thousands of requests without breaking ChatGPT’s terms
Start scraping ChatGPT today.
The insights from ChatGPT data are too valuable to ignore. Whether you’re a researcher studying AI behavior, a business monitoring your competitive landscape, or a developer building AI-powered tools, access to structured ChatGPT data provides incredible opportunities.
For most developers and businesses, we recommend using cloro’s ChatGPT scraper. You get:
- Immediate access to reliable scraping infrastructure
- Automatic data parsing and structuring
- Built-in anti-bot evasion and rate limiting
- Comprehensive error handling and retries
- Structured JSON output with all metadata
The cost of building and maintaining this infrastructure yourself typically runs $5,000-10,000/month in development time, browser instances, proxy services, and maintenance overhead.
For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as ChatGPT frequently updates its anti-bot measures and response formats.
The window of opportunity is closing. As more businesses discover the value of AI monitoring, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their ChatGPT presence now will build advantages that become increasingly difficult to overcome.
Ready to unlock ChatGPT data for your business? Get started with cloro’s API to start accessing conversational AI insights.
Don’t let your competitors define how AI describes your industry. Start scraping ChatGPT today.