How to scrape Google Gemini with minimal infrastructure
Google Gemini generates hundreds of millions of responses daily. Behind the scenes, the web interface delivers rich structured data that the direct API misses entirely: source citations with confidence scores, proper markdown formatting, and real-time web integration.
The challenge: Gemini wasn’t built for programmatic access. The platform uses sophisticated anti-bot systems, internal API endpoints with complex nested JSON, and session validation that traditional scraping tools can’t handle.
After analyzing millions of Gemini responses, we’ve reverse-engineered the complete process. This guide will show you exactly how to scrape Gemini and extract the structured data with confidence scoring that makes it valuable for businesses and researchers.
Why scrape Google Gemini responses?
Google Gemini’s API responses are nothing like what users see in the UI.
What you miss with the API:
- The actual interface experience users get
- Rich source citations with confidence levels
- Proper markdown formatting and structure
- Real-time web integration and context
Why it matters: the API alone can't show you what Gemini actually presents to users, so there's no way to verify its answers or judge source reliability without scraping.
The math: Scraping costs up to 12x less than direct API usage while providing the real user experience with confidence scoring.
Use cases:
- Verification: Check what Gemini actually tells users with source confidence
- SEO: Monitor how Gemini sources and cites information
- Market research: Extract comprehensive responses with formatted markdown
- Content analysis: Analyze how Gemini structures information with reliability scoring
You might also be interested in how to scrape Google AI Mode for a different perspective on Google’s AI search.
Understanding Gemini’s architecture
Before diving into the technical implementation, let’s understand what makes Gemini scraping challenging:
Gemini’s response generation process:
- Query Processing: Your prompt is analyzed and sent to the Bard backend
- Internal API Calls: The web app issues HTTP POST requests to internal BardFrontendService endpoints
- JSON Array Responses: Responses arrive as deeply nested JSON arrays
- Dynamic Rendering: Content is rendered client-side with source citations
- Confidence Scoring: Sources are assigned confidence levels based on reliability
Key technical challenges:
Internal API format:
// Gemini uses complex nested JSON arrays
const response = {
  0: [
    2,
    "response_data",
    {
      4: [0, "text_content", [{ 1: "confidence_levels" }]],
    },
  ],
};
// Content isn't available in standard API format
Complex response structure:
[0, [2, "nested_response_data", {"4": [0, "content", [sources]]}]]
Anti-bot detection:
- Canvas fingerprinting
- Request pattern monitoring
- CAPTCHA challenges
- Cookie-based session validation
Dynamic source loading:
- Confidence-level based source ordering
- Real-time web integration
- Nested JSON parsing requirements
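The anti-bot measures above can't be fully bypassed with a stock automation setup, but a more realistic browser profile reduces friction considerably. The sketch below uses Playwright's persistent context so cookies and session state survive between runs; the specific locale, user agent, and profile path are illustrative assumptions, not values Gemini requires:

from playwright.async_api import async_playwright

async def launch_realistic_context():
    """Launch a persistent browser profile that looks closer to a real user.

    The user agent, locale, and profile directory below are assumptions for
    illustration; adjust them to match the environment you actually run in.
    """
    p = await async_playwright().start()
    context = await p.chromium.launch_persistent_context(
        user_data_dir="./gemini-profile",    # keeps cookies/session between runs
        headless=False,                       # headless browsers are easier to flag
        locale="en-US",
        viewport={"width": 1366, "height": 768},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
    )
    return p, context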
The internal API parsing challenge
The core of Gemini scraping lies in parsing the nested JSON arrays returned by internal Bard endpoints. Here's what makes it tricky:
Event stream structure:
# Raw Gemini internal API response example
[0, [2, "response_data", {
    "4": [0, "Hello", [
        {"1": 85, "2": ["https://example.com", "Source Title", "Description"]}
    ]]
}]]
Parsing challenges:
- Nested arrays: Response data is deeply nested in JSON arrays
- Mixed indexing: Content and sources use different array positions
- Confidence extraction: Source confidence levels require specific path navigation
- Error handling: Network issues can corrupt the nested structure
Python parsing implementation:
import json
from typing import List, Dict, Any, Optional


def get_final_response(event_stream_body: str) -> Optional[Any]:
    """Extract the final complete response from an event stream."""
    lines: List[str] = event_stream_body.strip().split("\n")
    largest_response: Optional[Any] = None
    largest_size: int = 0

    for line in lines:
        try:
            data: Any = json.loads(line)
            line_size: int = len(line)
            if line_size > largest_size:
                largest_size = line_size
                largest_response = data
        except (json.JSONDecodeError, IndexError, TypeError):
            continue

    if not largest_response:
        return None

    return json.loads(largest_response[0][2])


def extract_response_text(response_object: Any) -> str:
    """Extract the main text content from nested response."""
    return response_object[4][0][1][0]


def extract_sources(response_object: Any) -> List[Dict[str, Any]]:
    """Extract sources with confidence levels from response."""
    sources: List[Dict[str, Any]] = []
    try:
        citations_objects = response_object[4][0][2][1]
        for idx, citation_object in enumerate(citations_objects, start=1):
            confidence_level = citation_object[1][2]
            url = citation_object[2][0][0]
            label = citation_object[2][0][1]
            description = citation_object[2][0][3]
            sources.append({
                "position": idx,
                "label": label,
                "url": url,
                "description": description,
                "confidence_level": confidence_level,
            })
    except (json.JSONDecodeError, IndexError, TypeError, KeyError, AttributeError):
        pass
    return sources
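To sanity-check these helpers before pointing them at live traffic, you can run them against a small synthetic event stream. The payload below is shaped purely to match the index paths the functions expect; real Gemini responses use the same kind of nesting, but the exact layout shifts over time:

import json

# Synthetic response object shaped to match the index paths used above.
response_object = [
    None, None, None, None,
    [[
        None,
        ["Hello from Gemini."],                     # text at [4][0][1][0]
        [None, [                                    # citations at [4][0][2][1]
            [
                None,
                [None, None, 85],                   # confidence at [1][2]
                [["https://example.com", "Example Source", None, "Sample description"]],
            ],
        ]],
    ]],
]

# Wrap it the way the event stream does: an outer array whose [0][2] element
# is the JSON-encoded response object.
event_stream_body = json.dumps([["rpc_marker", None, json.dumps(response_object)]])

final = get_final_response(event_stream_body)
print(extract_response_text(final))   # -> Hello from Gemini.
print(extract_sources(final))         # -> one source with confidence_level 85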
Building the scraping infrastructure
Let’s build the complete scraping system step by step.
Required components:
- Browser automation: Playwright for JavaScript-heavy interface
- Network interception: To capture internal Bard API calls
- JSON parser: To process nested response arrays
- Content extractor: To parse HTML and extract structured data
Complete scraper implementation:
import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List, Optional


class GeminiScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_page_interceptor(self, page: Page):
        """Set up network request interception for Bard endpoints."""
        async def handle_response(response):
            # Capture Bard frontend API responses
            if 'BardChatUi/data/assistant.lamda.BardFrontendService/StreamGenerate' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)

        page.on('response', handle_response)

    async def scrape_gemini(self, prompt: str) -> Dict[str, Any]:
        """Main scraping function."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

            # Set up response interception
            await self.setup_page_interceptor(page)

            try:
                # Navigate to Gemini
                await page.goto('https://gemini.google.com/app')

                # Wait for textarea and enter prompt
                await page.wait_for_selector('[role="textbox"]')
                await page.fill('[role="textbox"]', prompt)
                await page.press('[role="textbox"]', 'Enter')

                # Wait for response completion
                await self.wait_for_response(page)

                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_gemini_response(raw_response)
                else:
                    raise Exception("No response captured")
            finally:
                await browser.close()

    async def wait_for_response(self, page: Page, timeout: int = 60):
        """Wait for Gemini response completion."""
        for i in range(timeout * 2):  # Check every 500ms
            # Check if we have captured responses
            if self.captured_responses:
                return

            # Check for content in DOM
            content_div = page.locator('message-content').first
            if await content_div.count() > 0:
                content_text = await content_div.text_content()
                if content_text and len(content_text.strip()) > 50:
                    await asyncio.sleep(2)  # Allow for final updates
                    continue

            await asyncio.sleep(0.5)

        raise Exception("Response timeout")

    def parse_gemini_response(self, raw_response: str) -> Dict[str, Any]:
        """Parse the raw Gemini response into structured data."""
        # Extract final response from event stream
        final_response = get_final_response(raw_response)

        # Extract text and sources
        text = extract_response_text(final_response)
        sources = extract_sources(final_response)

        return {
            'text': text,
            'sources': sources,
        }
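With the parsing helpers from the previous section in the same module, running the scraper is a small asyncio wrapper. Treat the prompt and printed fields below as a minimal smoke test; the selectors and intercepted endpoint in the class above are the parts most likely to need updating when Gemini changes its frontend:

import asyncio

async def main() -> None:
    scraper = GeminiScraper()
    result = await scraper.scrape_gemini(
        "What are the latest developments in renewable energy in 2025?"
    )
    print(result["text"][:300])
    for source in result["sources"]:
        print(source["position"], source["confidence_level"], source["url"])

if __name__ == "__main__":
    asyncio.run(main())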
Parsing the streaming response data
Now let’s dive deeper into the data extraction process:
Extracting markdown with inline sources:
async def extract_markdown_with_sources(page: Page, sources: List[Dict]) -> str:
    """Extract markdown content with inline source citations."""
    try:
        # Wait for source chips to be visible
        chip_locator = "source-inline-chip .button"
        if await page.locator(chip_locator).count() > 0:
            await page.locator(chip_locator).first.wait_for(state="visible")

        # Get the main content HTML
        content_html = await page.locator("message-content").first.inner_html()

        # Convert HTML to markdown with source links
        markdown = convert_html_to_markdown_with_links(
            content_html,
            [[source] for source in sources],
            chip_locator
        )
        return markdown
    except Exception as e:
        print(f"Markdown extraction failed: {e}")
        return ""


async def extract_html_content(page: Page, request_id: str) -> str:
    """Extract full HTML content for upload."""
    try:
        full_html = await page.content()
        # Upload to storage service
        uploaded_url = await upload_html(request_id, full_html)
        return uploaded_url
    except Exception as e:
        print(f"HTML extraction failed: {e}")
        return ""
Complete response parsing with all data types:
from typing import TypedDict, List, NotRequired, Optional


class GeminiLinkData(TypedDict):
    position: int
    label: str
    url: str
    description: str
    confidence_level: int


class GeminiResult(TypedDict):
    text: str
    sources: List[GeminiLinkData]
    markdown: NotRequired[str]
    html: NotRequired[Optional[str]]


async def parse_complete_gemini_response(
    page: Page,
    request_data: Dict[str, Any],
    event_stream_body: str
) -> GeminiResult:
    """Parse Gemini response with all optional data types."""
    include_markdown = request_data.get("include", {}).get("markdown", False)
    include_html = request_data.get("include", {}).get("html", False)

    # Extract core data
    final_response = get_final_response(event_stream_body)
    text = extract_response_text(final_response)
    sources = extract_sources(final_response)

    result: GeminiResult = {
        "text": text,
        "sources": sources,
    }

    # Add optional data
    if include_markdown:
        result["markdown"] = await extract_markdown_with_sources(page, sources)

    if include_html:
        result["html"] = await extract_html_content(page, request_data["requestId"])

    return result
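The request_data dictionary only needs the keys the function actually reads: an include block and, when HTML upload is requested, a requestId. A minimal call, assuming you already hold the intercepted event stream body and the live page, might look like this:

# Inside the coroutine where you already hold the Playwright page and the
# intercepted event stream body:
request_data = {
    "requestId": "demo-123",            # only read when include.html is True
    "include": {"markdown": True, "html": False},
}

result = await parse_complete_gemini_response(page, request_data, event_stream_body)
print(result["text"])
if "markdown" in result:
    print(result["markdown"][:200])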
Using cloro’s managed Gemini scraper
Building and maintaining a reliable Gemini scraper is complex and resource-intensive. That’s why we built cloro - a managed API that handles all the complexity for you.
Simple API integration:
import requests
import json

# Your prompt
prompt = "What are the latest developments in renewable energy in 2025?"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/gemini',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': prompt,
        'country': 'US',
        'include': {
            'markdown': True,
            'html': True,
            'sources': True
        }
    }
)

result = response.json()
print(json.dumps(result, indent=2))
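In production you'll also want to check the HTTP status and the body before trusting the payload. Assuming the status field shown in the sample output below, a defensive wrapper might look like this (the retry and backoff policy is an assumption for illustration, not documented cloro behavior):

import time

import requests


def fetch_gemini_response(prompt: str, api_key: str, retries: int = 3) -> dict:
    """Call the cloro endpoint with basic status checking and retries.

    The retry/backoff policy here is illustrative; adjust it to whatever
    rate limits your plan actually allows.
    """
    for attempt in range(retries):
        response = requests.post(
            'https://api.cloro.dev/v1/monitor/gemini',
            headers={'Authorization': f'Bearer {api_key}'},
            json={'prompt': prompt, 'country': 'US',
                  'include': {'markdown': True, 'html': False, 'sources': True}},
            timeout=120,
        )
        if response.ok:
            payload = response.json()
            if payload.get('status') == 'success':
                return payload['result']
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    raise RuntimeError(f"Gemini request failed after {retries} attempts")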
What cloro handles automatically:
- Browser management: Rotating browsers, user agents, and fingerprints
- Anti-bot evasion: Advanced CAPTCHA solving and detection avoidance
- Rate limiting: Intelligent request scheduling and backoff strategies
- Data parsing: Automatic extraction of structured data from responses
- Error handling: Comprehensive retry logic and error recovery
- Scalability: Distributed infrastructure for high-volume requests
Structured output you get:
{
  "status": "success",
  "result": {
    "text": "The renewable energy sector has seen remarkable developments in 2025...",
    "sources": [
      {
        "position": 1,
        "url": "https://energy.gov/solar-innovations",
        "label": "DOE Solar Innovations Report",
        "description": "Latest breakthroughs in solar panel efficiency and storage technology",
        "confidence_level": 92
      }
    ],
    "markdown": "**The renewable energy sector** has seen remarkable developments in 2025...",
    "html": "https://storage.cloud.html/uploaded-gemini-response.html"
  }
}
Benefits of using cloro:
- 99.9% uptime vs. DIY solutions that break frequently
- P50 latency < 45s vs. manual scraping that takes hours
- No infrastructure costs - we handle browsers, proxies, and maintenance
- Structured data - automatic parsing of sources with confidence levels and markdown
- Compliance - ethical scraping practices and rate limiting
- Scalability - handle thousands of requests without breaking Gemini’s terms
Conclusion
Start scraping Gemini today.
The insights from Gemini data are too valuable to ignore. Whether you’re a researcher studying AI behavior, a business monitoring your competitive landscape, or a developer building AI-powered tools, access to structured Gemini data provides incredible opportunities with unique confidence scoring.
For most developers and businesses, we recommend using cloro’s Gemini scraper. You get:
- Immediate access to reliable scraping infrastructure
- Automatic data parsing with confidence scoring
- Built-in anti-bot evasion and rate limiting
- Comprehensive error handling and retries
- Structured JSON output with all metadata
The cost of building and maintaining this infrastructure yourself typically runs $5,000-10,000/month in development time, browser instances, proxy services, and maintenance overhead.
For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Gemini frequently updates its anti-bot measures and response formats.
The window of opportunity is closing. As more businesses discover the value of AI monitoring, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their Gemini presence now will build advantages that become increasingly difficult to overcome.
Ready to unlock Gemini data for your business? Get started with cloro’s API to start accessing advanced AI conversation data.
Don’t let your competitors define how AI describes your industry. Start scraping Gemini today.