
How to scrape Google Gemini with minimal infrastructure


Google Gemini generates hundreds of millions of responses daily. Behind the scenes, the web interface delivers rich structured data with confidence scoring that direct APIs completely miss, including detailed sources, proper markdown formatting, and real-time web integration.

The challenge: Gemini wasn’t built for programmatic access. The platform uses sophisticated anti-bot systems, internal API endpoints with complex nested JSON, and session validation that traditional scraping tools can’t handle.

After analyzing millions of Gemini responses, we’ve reverse-engineered the complete process. This guide will show you exactly how to scrape Gemini and extract the structured data with confidence scoring that makes it valuable for businesses and researchers.

Why scrape Google Gemini responses?

Google Gemini’s API responses are nothing like what users see in the UI.

What you miss with the API:

  • The actual interface experience users get
  • Rich source citations with confidence levels
  • Proper markdown formatting and structure
  • Real-time web integration and context

Why it matters: because the API omits this data, you can’t verify what Gemini actually shows users or judge how reliable its sources are without scraping the interface itself.

The math: Scraping costs up to 12x less than direct API usage while providing the real user experience with confidence scoring.

Use cases:

  • Verification: Check what Gemini actually tells users with source confidence
  • SEO: Monitor how Gemini sources and cites information
  • Market research: Extract comprehensive responses with formatted markdown
  • Content analysis: Analyze how Gemini structures information with reliability scoring

You might also be interested in how to scrape Google AI Mode for a different perspective on Google’s AI search.

Understanding Gemini’s architecture

Before diving into the technical implementation, let’s understand what makes Gemini scraping challenging:

Gemini’s response generation process:

  1. Query Processing: Your prompt is analyzed and sent to the Bard backend
  2. Internal API Calls: Gemini makes HTTP POST requests to the Bard frontend endpoints
  3. JSON Array Responses: Responses come as structured nested JSON data
  4. Dynamic Rendering: Content is rendered client-side with source citations
  5. Confidence Scoring: Sources are assigned confidence levels based on reliability

Key technical challenges:

Internal API format:

// Gemini uses complex nested JSON arrays
const response = {
  0: [
    2,
    "response_data",
    {
      4: [0, "text_content", [{ 1: "confidence_levels" }]],
    },
  ],
};
// Content isn't available in standard API format

Complex response structure:

[0, [2, "nested_response_data", {"4": [0, "content", [sources]]}]]

Anti-bot detection:

  • Canvas fingerprinting
  • Request pattern monitoring
  • CAPTCHA challenges
  • Cookie-based session validation
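
These defenses are hard to defeat outright, but you can avoid the most obvious signals. The sketch below is a minimal starting point, not a guaranteed bypass: the proxy address and profile directory are placeholders you’d supply yourself. It launches Playwright with a persistent, logged-in profile, a realistic locale, timezone, and viewport, and routes traffic through a residential proxy so cookie-based session validation sees a consistent identity instead of a fresh headless browser on every run.

# Minimal hardening sketch - placeholder values, not a guaranteed bypass.
import asyncio
from playwright.async_api import async_playwright

PROXY_SERVER = "http://your-residential-proxy:8080"  # hypothetical placeholder
USER_DATA_DIR = "./gemini-profile"                    # reused, logged-in profile

async def launch_hardened_context():
    p = await async_playwright().start()
    # A persistent context keeps cookies between runs, so session validation
    # sees the same "user" rather than a brand-new, suspicious browser.
    context = await p.chromium.launch_persistent_context(
        USER_DATA_DIR,
        headless=False,                      # headless browsers are easier to flag
        proxy={"server": PROXY_SERVER},
        locale="en-US",
        timezone_id="America/New_York",
        viewport={"width": 1366, "height": 768},
    )
    return p, context

async def main():
    p, context = await launch_hardened_context()
    page = await context.new_page()
    await page.goto("https://gemini.google.com/app")
    # ... drive the page as shown later in this guide ...
    await context.close()
    await p.stop()

if __name__ == "__main__":
    asyncio.run(main())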

Dynamic source loading:

  • Confidence-level based source ordering
  • Real-time web integration
  • Nested JSON parsing requirements

The internal API parsing challenge

The core of Gemini scraping lies in parsing deeply nested JSON arrays from internal Bard endpoints. Here’s what makes it tricky:

Event stream structure:

# Raw Gemini internal API response example
[0, [2, "response_data", {
  "4": [0, "Hello", [
    {"1": 85, "2": ["https://example.com", "Source Title", "Description"]}
  ]]
}]]

Parsing challenges:

  1. Nested arrays: Response data is deeply nested in JSON arrays
  2. Mixed indexing: Content and sources use different array positions
  3. Confidence extraction: Source confidence levels require specific path navigation
  4. Error handling: Network issues can corrupt the nested structure

Python parsing implementation:

import json
from typing import List, Dict, Any, Optional

def get_final_response(event_stream_body: str) -> Optional[Any]:
    """Extract the final complete response from an event stream."""
    lines: List[str] = event_stream_body.strip().split("\n")

    largest_response: Optional[Any] = None
    largest_size: int = 0

    for line in lines:
        try:
            data: Any = json.loads(line)
            line_size: int = len(line)
            if line_size > largest_size:
                largest_size = line_size
                largest_response = data

        except (json.JSONDecodeError, IndexError, TypeError):
            continue

    if not largest_response:
        return None

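    # The usable payload travels as a JSON string nested at [0][2] of the largest line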
    return json.loads(largest_response[0][2])

def extract_response_text(response_object: Any) -> str:
    """Extract the main text content from nested response."""
    return response_object[4][0][1][0]

def extract_sources(response_object: Any) -> List[Dict[str, Any]]:
    """Extract sources with confidence levels from response."""
    sources: List[Dict[str, Any]] = []

    try:
        citations_objects = response_object[4][0][2][1]

        for idx, citation_object in enumerate(citations_objects, start=1):
            confidence_level = citation_object[1][2]
            url = citation_object[2][0][0]
            label = citation_object[2][0][1]
            description = citation_object[2][0][3]
            sources.append({
                "position": idx,
                "label": label,
                "url": url,
                "description": description,
                "confidence_level": confidence_level,
            })

    except (json.JSONDecodeError, IndexError, TypeError, KeyError, AttributeError):
        pass

    return sources
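
As a quick sanity check, here’s a synthetic one-line event stream shaped to match the index paths these helpers expect: the text at [4][0][1][0], and the whole payload wrapped as a JSON string at [0][2] of the stream line. Real Gemini payloads are much larger, but the mechanics are identical:

import json

# Synthetic payload shaped to match the index paths used above (illustrative only).
payload = [None, None, None, None, [[None, ["Hello from Gemini"], None]]]

# A fake one-line "event stream": the payload travels as a JSON string at [0][2].
event_stream_body = json.dumps([[None, None, json.dumps(payload)]])

final = get_final_response(event_stream_body)
print(extract_response_text(final))   # -> Hello from Gemini
print(extract_sources(final))         # -> [] (no citations in this toy payload)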

Building the scraping infrastructure

Let’s build the complete scraping system step by step.

Required components:

  1. Browser automation: Playwright for JavaScript-heavy interface
  2. Network interception: To capture internal Bard API calls
  3. JSON parser: To process nested response arrays
  4. Content extractor: To parse HTML and extract structured data

Complete scraper implementation:

import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List, Optional

class GeminiScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_page_interceptor(self, page: Page):
        """Set up network request interception for Bard endpoints."""

        async def handle_response(response):
            # Capture Bard frontend API responses
            if 'BardChatUi/data/assistant.lamda.BardFrontendService/StreamGenerate' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)

        page.on('response', handle_response)

    async def scrape_gemini(self, prompt: str) -> Dict[str, Any]:
        """Main scraping function."""

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

            # Set up response interception
            await self.setup_page_interceptor(page)

            try:
                # Navigate to Gemini
                await page.goto('https://gemini.google.com/app')

                # Wait for textarea and enter prompt
                await page.wait_for_selector('[role="textbox"]')
                await page.fill('[role="textbox"]', prompt)
                await page.press('[role="textbox"]', 'Enter')

                # Wait for response completion
                await self.wait_for_response(page)

                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_gemini_response(raw_response)
                else:
                    raise Exception("No response captured")

            finally:
                await browser.close()

    async def wait_for_response(self, page: Page, timeout: int = 60):
        """Wait for Gemini response completion."""

        for i in range(timeout * 2):  # Check every 500ms
            # Check if we have captured responses
            if self.captured_responses:
                return

            # Check for content in DOM
            content_div = page.locator('message-content').first
            if await content_div.count() > 0:
                content_text = await content_div.text_content()
                if content_text and len(content_text.strip()) > 50:
                    await asyncio.sleep(2)  # Allow for final updates
                    continue

            await asyncio.sleep(0.5)

        raise Exception("Response timeout")

    def parse_gemini_response(self, raw_response: str) -> Dict[str, Any]:
        """Parse the raw Gemini response into structured data."""

        # Extract final response from event stream
        final_response = get_final_response(raw_response)

        # Extract text and sources
        text = extract_response_text(final_response)
        sources = extract_sources(final_response)

        return {
            'text': text,
            'sources': sources,
        }
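
Running it end to end looks roughly like this, assuming the parsing helpers from the previous section (get_final_response, extract_response_text, extract_sources) live in the same module:

async def main():
    scraper = GeminiScraper()
    result = await scraper.scrape_gemini(
        "What are the latest developments in renewable energy in 2025?"
    )
    print(result["text"][:200])
    for source in result["sources"]:
        print(source["position"], source["confidence_level"], source["url"])

if __name__ == "__main__":
    asyncio.run(main())

Note that gemini.google.com normally requires a signed-in Google account, so in practice you’ll want the persistent, logged-in profile shown earlier rather than a fresh browser context.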

Parsing the streaming response data

Now let’s dive deeper into the data extraction process:

Extracting markdown with inline sources:

async def extract_markdown_with_sources(page: Page, sources: List[Dict]) -> str:
    """Extract markdown content with inline source citations."""

    try:
        # Wait for source chips to be visible
        chip_locator = "source-inline-chip .button"
        if await page.locator(chip_locator).count() > 0:
            await page.locator(chip_locator).first.wait_for(state="visible")

        # Get the main content HTML
        content_html = await page.locator("message-content").first.inner_html()

        # Convert HTML to markdown with source links
        markdown = convert_html_to_markdown_with_links(
            content_html,
            [[source] for source in sources],
            chip_locator
        )

        return markdown

    except Exception as e:
        print(f"Markdown extraction failed: {e}")
        return ""

async def extract_html_content(page: Page, request_id: str) -> str:
    """Extract full HTML content for upload."""

    try:
        full_html = await page.content()

        # Upload to storage service
        uploaded_url = await upload_html(request_id, full_html)
        return uploaded_url

    except Exception as e:
        print(f"HTML extraction failed: {e}")
        return ""
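
The convert_html_to_markdown_with_links and upload_html helpers above are left to your own implementation. As a rough stand-in for the former, a generic HTML-to-markdown converter such as markdownify produces usable output, though it won’t weave the source chips into inline citations the way a purpose-built converter would:

# Simplified stand-in: plain HTML-to-markdown conversion, no inline source chips.
# pip install markdownify
from markdownify import markdownify as md

def convert_html_to_markdown_basic(content_html: str) -> str:
    """Convert Gemini's response HTML to markdown (source chips are ignored)."""
    return md(content_html)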

Complete response parsing with all data types:

from typing import Any, Dict, List, Optional, TypedDict
from typing import NotRequired  # Python 3.11+; on older versions, import from typing_extensions

class GeminiLinkData(TypedDict):
    position: int
    label: str
    url: str
    description: str
    confidence_level: int

class GeminiResult(TypedDict):
    text: str
    sources: List[GeminiLinkData]
    markdown: NotRequired[str]
    html: NotRequired[Optional[str]]

async def parse_complete_gemini_response(
    page: Page,
    request_data: Dict[str, Any],
    event_stream_body: str
) -> GeminiResult:
    """Parse Gemini response with all optional data types."""

    include_markdown = request_data.get("include", {}).get("markdown", False)
    include_html = request_data.get("include", {}).get("html", False)

    # Extract core data
    final_response = get_final_response(event_stream_body)
    text = extract_response_text(final_response)
    sources = extract_sources(final_response)

    result: GeminiResult = {
        "text": text,
        "sources": sources,
    }

    # Add optional data
    if include_markdown:
        result["markdown"] = await extract_markdown_with_sources(page, sources)

    if include_html:
        result["html"] = await extract_html_content(page, request_data["requestId"])

    return result
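
For reference, the request_data dictionary is assumed to carry a requestId plus an include block with the optional output toggles; the values below are hypothetical, only the field names mirror what the function reads:

# Example of the assumed request shape; the requestId value is made up.
async def handle_request(page: Page, event_stream_body: str) -> GeminiResult:
    request_data = {
        "requestId": "req_123",                        # used as the HTML upload key
        "include": {"markdown": True, "html": False},  # toggles the optional fields
    }
    return await parse_complete_gemini_response(page, request_data, event_stream_body)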

Using cloro’s managed Gemini scraper

Building and maintaining a reliable Gemini scraper is complex and resource-intensive. That’s why we built cloro - a managed API that handles all the complexity for you.

Simple API integration:

import requests
import json

# Your prompt
prompt = "What are the latest developments in renewable energy in 2025?"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/gemini',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': prompt,
        'country': 'US',
        'include': {
            'markdown': True,
            'html': True,
            'sources': True
        }
    }
)

result = response.json()
print(json.dumps(result, indent=2))

What cloro handles automatically:

  • Browser management: Rotating browsers, user agents, and fingerprints
  • Anti-bot evasion: Advanced CAPTCHA solving and detection avoidance
  • Rate limiting: Intelligent request scheduling and backoff strategies
  • Data parsing: Automatic extraction of structured data from responses
  • Error handling: Comprehensive retry logic and error recovery
  • Scalability: Distributed infrastructure for high-volume requests

Structured output you get:

{
  "status": "success",
  "result": {
    "text": "The renewable energy sector has seen remarkable developments in 2025...",
    "sources": [
      {
        "position": 1,
        "url": "https://energy.gov/solar-innovations",
        "label": "DOE Solar Innovations Report",
        "description": "Latest breakthroughs in solar panel efficiency and storage technology",
        "confidence_level": 92
      }
    ],
    "markdown": "**The renewable energy sector** has seen remarkable developments in 2025...",
    "html": "https://storage.cloud.html/uploaded-gemini-response.html"
  }
}
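
Because the sources array is plain JSON, ranking or filtering citations by confidence is straightforward. Continuing from the requests example above (the threshold of 80 is arbitrary):

# Rank the returned sources by confidence and keep only the strongest ones.
sources = result["result"]["sources"]
high_confidence = sorted(
    (s for s in sources if s["confidence_level"] >= 80),  # arbitrary example threshold
    key=lambda s: s["confidence_level"],
    reverse=True,
)

for s in high_confidence:
    print(f'{s["confidence_level"]:>3}  {s["label"]}  {s["url"]}')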

Benefits of using cloro:

  • 99.9% uptime vs. DIY solutions that break frequently
  • P50 latency < 45s vs. manual scraping that takes hours
  • No infrastructure costs - we handle browsers, proxies, and maintenance
  • Structured data - automatic parsing of sources with confidence levels and markdown
  • Compliance - ethical scraping practices and rate limiting
  • Scalability - handle thousands of requests without breaking Gemini’s terms

Conclusion

Start scraping Gemini today.

The insights from Gemini data are too valuable to ignore. Whether you’re a researcher studying AI behavior, a business monitoring your competitive landscape, or a developer building AI-powered tools, access to structured Gemini data provides incredible opportunities with unique confidence scoring.

For most developers and businesses, we recommend using cloro’s Gemini scraper. You get:

  • Immediate access to reliable scraping infrastructure
  • Automatic data parsing with confidence scoring
  • Built-in anti-bot evasion and rate limiting
  • Comprehensive error handling and retries
  • Structured JSON output with all metadata

The cost of building and maintaining this infrastructure yourself typically runs $5,000-10,000/month in development time, browser instances, proxy services, and maintenance overhead.

For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Gemini frequently updates its anti-bot measures and response formats.

The window of opportunity is closing. As more businesses discover the value of AI monitoring, competition for attention in AI responses intensifies. Companies that start monitoring and optimizing their Gemini presence now will build advantages that become increasingly difficult to overcome.

Ready to unlock Gemini data for your business? Get started with cloro’s API to start accessing advanced AI conversation data.

Don’t let your competitors define how AI describes your industry. Start scraping Gemini today.