How to scrape Microsoft Copilot with minimal infrastructure
Microsoft Copilot processes millions of enterprise and development queries daily. The web interface delivers specialized Microsoft ecosystem knowledge and technical documentation that direct APIs completely miss.
The challenge: Copilot wasn’t designed for programmatic access. The platform streams responses over WebSocket events in real time and manages sessions through browser cookies, which traditional HTTP-based scraping tools can’t handle.
After analyzing 3+ million Copilot responses, we’ve reverse-engineered its WebSocket architecture. This guide shows you exactly how to scrape Copilot and extract the Microsoft ecosystem intelligence that makes it valuable for enterprise applications.
Table of contents
- Why scrape Microsoft Copilot responses?
- Understanding Copilot’s WebSocket architecture
- The WebSocket event parsing challenge
- Building the scraping infrastructure
- Parsing streaming text and citations
- Using cloro’s managed Copilot scraper
Why scrape Microsoft Copilot responses?
Microsoft Copilot provides unique AI-assisted responses that integrate with Microsoft’s ecosystem and real-time web search.
What makes Copilot responses valuable:
- The actual conversational responses users see in the chat interface
- Integrated source citations and web search results for verification
- Context-aware responses that leverage Microsoft’s ecosystem knowledge
- Real-time information integration with current web data
- Streaming response generation with dynamic updates
Why it matters: Copilot responses represent a unique approach to AI assistance that combines conversational AI with real-time web search and Microsoft ecosystem integration, providing insights that can’t be obtained through other means.
Use cases:
- Verification: Check what Copilot actually tells users
- SEO: Monitor how Copilot sources and cites information
- Compliance: Ensure accuracy of AI-generated responses
- Research: Analyze response patterns and sources
For a similar guide on another platform, check out how to scrape ChatGPT.
Understanding Copilot’s WebSocket architecture
Copilot uses a straightforward real-time communication system based on WebSocket events:
How the scraper captures Copilot’s response generation:
- Page Navigation: Load the Copilot web interface
- WebSocket Connection: Intercept WebSocket messages from the chat endpoint
- Event Collection: Capture all JSON events sent via WebSocket
- Response Parsing: Process the collected events once completion is detected
- Microsoft Knowledge: The finished response draws on Copilot’s deep Microsoft product and service knowledge base
Key technical challenges:
WebSocket event interception:
// Copilot sends events via WebSocket from:
// copilot.microsoft.com/c/api/chat
// Events are simple JSON objects:
{
"event": "appendText",
"text": "To improve team productivity..."
}
{
"event": "citation",
"title": "Microsoft 365 Documentation",
"url": "https://docs.microsoft.com/..."
}
{
"event": "done"
}
Event types to handle:
- appendText: Text content chunks
- citation: Source citations embedded inline
- done: Completion marker
The WebSocket event parsing challenge
The core of Copilot scraping lies in parsing real-time WebSocket events and reconstructing structured responses:
WebSocket event stream:
# Raw Copilot WebSocket events example
{
"event": "appendText",
"text": "To improve team productivity using Microsoft 365"
}
{
"event": "citation",
"title": "Microsoft 365 Documentation",
"url": "https://docs.microsoft.com/en-us/microsoft-365/"
}
{
"event": "appendText",
"text": ", I recommend implementing SharePoint for document collaboration"
}
{
"event": "done"
}
Parsing challenges:
- Real-time streaming: Content arrives via WebSocket events, not HTTP responses
- Mixed event types: Text and citation events are interleaved
- Citation pill grouping: Multiple citations can be grouped together
- Event ordering: Citation positions must be tracked accurately
Python WebSocket parsing implementation:
import json
from typing import List, Dict, Any
class CopilotWebSocketParser:
def __init__(self):
self.text_parts = []
self.citation_pills: List[List[Dict[str, Any]]] = []
self.current_pill: List[Dict[str, Any]] = []
self.citation_position = 1
self.last_event_was_citation = False
self.is_complete = False
def parse_websocket_events(self, events: List[Dict[str, Any]]) -> Dict[str, Any]:
"""
Parse Copilot WebSocket events into structured response.
"""
for event in events:
event_type = event.get("event")
# Collect text chunks
if event_type == "appendText":
text_chunk = event.get("text", "")
self.text_parts.append(text_chunk)
# If we were building a citation pill and now see text, save the pill
if self.last_event_was_citation and self.current_pill:
self.citation_pills.append(self.current_pill)
self.current_pill = []
self.last_event_was_citation = False
# Collect citations
elif event_type == "citation":
citation_data = {
"position": self.citation_position,
"label": event.get("title", ""),
"url": event.get("url", ""),
"description": None
}
self.current_pill.append(citation_data)
self.citation_position += 1
self.last_event_was_citation = True
# Check for completion
elif event_type == "done":
self.is_complete = True
break
# Don't forget to add the last citation pill if it exists
if self.current_pill:
self.citation_pills.append(self.current_pill)
# Combine all text parts
full_text = "".join(self.text_parts)
# Flatten citation pills into unique sources
sources = self.flatten_citation_pills()
return {
"text": full_text,
"sources": sources,
"is_complete": self.is_complete
}
def flatten_citation_pills(self) -> List[Dict[str, Any]]:
"""
Flatten grouped citation pills into unique sources with corrected positions.
"""
seen_urls = set()
sources: List[Dict[str, Any]] = []
position = 1
for pill in self.citation_pills:
for citation_data in pill:
url = citation_data["url"]
if url not in seen_urls:
seen_urls.add(url)
sources.append({
"position": position,
"label": citation_data["label"],
"url": url,
"description": citation_data["description"]
})
position += 1
return sources
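To sanity-check the parser, here’s a minimal usage sketch that feeds it the sample event stream shown above (the payloads mirror that example):
# Minimal usage sketch: parse the sample event stream from above
sample_events = [
    {"event": "appendText", "text": "To improve team productivity using Microsoft 365"},
    {"event": "citation", "title": "Microsoft 365 Documentation",
     "url": "https://docs.microsoft.com/en-us/microsoft-365/"},
    {"event": "appendText", "text": ", I recommend implementing SharePoint for document collaboration"},
    {"event": "done"},
]

parser = CopilotWebSocketParser()
result = parser.parse_websocket_events(sample_events)

print(result["text"])         # Reconstructed response text
print(result["sources"])      # One unique source at position 1
print(result["is_complete"])  # True, because a "done" event was seen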
Citation pill grouping logic:
def group_consecutive_citations(events: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
"""
Group consecutive citation events into citation pills.
Copilot groups multiple citations that appear together.
"""
citation_pills = []
current_pill = []
last_was_citation = False
for event in events:
if event.get("event") == "citation":
citation_data = {
"position": len(current_pill) + 1,
"label": event.get("title", ""),
"url": event.get("url", ""),
"description": None
}
current_pill.append(citation_data)
last_was_citation = True
elif event.get("event") == "appendText" and last_was_citation:
# Text event after citations means the pill is complete
if current_pill:
citation_pills.append(current_pill)
current_pill = []
last_was_citation = False
# Add the final pill if it exists
if current_pill:
citation_pills.append(current_pill)
return citation_pills
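For intuition about the pill behavior: two back-to-back citation events followed by text collapse into a single pill. A small sketch (the second URL is a placeholder for illustration):
# Sketch: consecutive citations are grouped into one pill
events = [
    {"event": "appendText", "text": "Use SharePoint and Teams together"},
    {"event": "citation", "title": "SharePoint Documentation",
     "url": "https://learn.microsoft.com/en-us/sharepoint/"},
    {"event": "citation", "title": "Teams Documentation",
     "url": "https://learn.microsoft.com/en-us/microsoftteams/"},  # placeholder URL
    {"event": "appendText", "text": " for document collaboration."},
    {"event": "done"},
]

pills = group_consecutive_citations(events)
print(len(pills))     # 1 pill
print(len(pills[0]))  # containing 2 citations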
Building the scraping infrastructure
Let’s build the complete Copilot scraping system:
Required components:
- Browser automation: Playwright with WebSocket support
- WebSocket interception: Real-time event capture
- Event parsing: Text and citation extraction from mixed events
- Session persistence: Cookie stashing for sustained access
Complete scraper implementation:
import asyncio
import json
from playwright.async_api import async_playwright, Page
from typing import Dict, Any, List, Optional, Union
class MicrosoftCopilotScraper:
def __init__(self):
self.copilot_events: List[Dict[str, Any]] = []
self.received_done_event = False
async def setup_websocket_interceptor(self, page: Page):
"""Set up WebSocket event interception."""
def on_websocket(ws):
# Intercept Copilot chat WebSocket
if "copilot.microsoft.com/c/api/chat" in ws.url:
ws.on("framereceived", self.websocket_message_handler)
page.on("websocket", on_websocket)
def websocket_message_handler(self, message: Union[str, bytes]):
"""Handle incoming WebSocket messages."""
parsed = json.loads(message)
if parsed.get("event") == "done":
self.received_done_event = True
self.copilot_events.append(parsed)
async def scrape_copilot(self, query: str, country: str = 'US') -> Dict[str, Any]:
"""Main scraping function."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
page = await context.new_page()
# Set up WebSocket interception
await self.setup_websocket_interceptor(page)
try:
# Navigate to Copilot
await page.goto('https://copilot.microsoft.com/', timeout=20_000)
# Handle landing page and mode selection
await self.handle_copilot_landing(page)
# Fill and submit query
await page.wait_for_selector("#userInput", state="visible", timeout=10_000)
await page.fill("#userInput", query)
await page.keyboard.press("Enter")
# Wait for response completion
await self.wait_for_copilot_response(page)
# Parse the captured events
parser = CopilotWebSocketParser()
result = parser.parse_websocket_events(self.copilot_events)
# Extract additional data if needed
if result.get("text"):
# Get HTML content for markdown conversion if needed
html_content = await self.extract_html_content(page)
result["html_content"] = html_content
return result
finally:
await browser.close()
async def handle_copilot_landing(self, page: Page):
"""Handle Copilot landing page and mode selection."""
# Wait for mode selection buttons
await page.wait_for_selector(
"[data-testid='composer-chat-mode-quick-button'], [data-testid='composer-chat-mode-smart-button']",
timeout=5_000
)
# Navigate to chat mode using keyboard shortcuts (matching actual code)
for _ in range(2):
await page.keyboard.press("Tab")
await asyncio.sleep(0.1)
await page.keyboard.press("Enter")
await asyncio.sleep(1)
# Additional navigation to input field
for _ in range(5):
await page.keyboard.press("Tab")
await asyncio.sleep(0.1)
await page.keyboard.press("Enter")
await asyncio.sleep(0.5)
async def wait_for_copilot_response(self, page: Page, timeout: int = 60):
"""Wait for Copilot response completion."""
for _ in range(timeout * 2): # Check every 500ms
await self.solve_captcha_if_needed(page)
# If response got captured, we can return
if self.received_done_event:
break
await asyncio.sleep(0.5)
else:
raise Exception("Never received Copilot response after 60 seconds")
async def solve_captcha_if_needed(self, page: Page):
"""Handle captcha challenges if encountered."""
# Simplified captcha handling
try:
# This would integrate with your captcha solving service
pass
except Exception:
pass
async def extract_html_content(self, page: Page) -> str:
"""Extract HTML content from Copilot response."""
try:
# Get the AI message content
html_content = await page.locator(
"[class*='group/ai-message-item']"
).first.inner_html(timeout=2_000)
return html_content or ""
except Exception:
return ""
Cookie management for session persistence:
# Simple cookie management based on actual implementation
from typing import Dict, List, Optional
class CookieStash:
"""Manage cookies for persistent sessions across scrapes."""
def __init__(self):
self.cookies_cache = {}
async def save_cookies(self, proxy_ip: str, domain: str, cookies: List[Dict]):
"""Save cookies for reuse."""
cache_key = f"{proxy_ip}:{domain}"
self.cookies_cache[cache_key] = cookies
async def get_cookies(self, proxy_ip: str, domain: str) -> Optional[List[Dict]]:
"""Retrieve cached cookies."""
cache_key = f"{proxy_ip}:{domain}"
return self.cookies_cache.get(cache_key)
# Usage in scraper (matching actual code); `proxy` is the proxy object assigned to this session
cookie_stash = CookieStash()
# Load existing cookies before navigation
existing_cookies = await cookie_stash.get_cookies(proxy.ip, "https://copilot.microsoft.com/")
if existing_cookies:
try:
await page.context.add_cookies(existing_cookies)
except Exception as e:
print(f"Failed to load cached cookies: {e}")
# Save cookies after successful session
cookies = await page.context.cookies()
await cookie_stash.save_cookies(proxy.ip, "https://copilot.microsoft.com/", cookies)
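The in-memory cache above only lives for a single process. If you want sessions to survive restarts, a file-backed variant is straightforward; the sketch below assumes storing cookies on disk is acceptable in your environment (the file name is arbitrary):
# Sketch: file-backed cookie stash so sessions survive process restarts
import json
from pathlib import Path
from typing import Dict, List, Optional

class FileCookieStash:
    """Persist cookies to disk, keyed by proxy IP and domain."""
    def __init__(self, path: str = "copilot_cookies.json"):
        self.path = Path(path)

    def _load(self) -> Dict[str, List[Dict]]:
        # Read the full cache from disk, or start empty
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    async def save_cookies(self, proxy_ip: str, domain: str, cookies: List[Dict]):
        data = self._load()
        data[f"{proxy_ip}:{domain}"] = cookies
        self.path.write_text(json.dumps(data))

    async def get_cookies(self, proxy_ip: str, domain: str) -> Optional[List[Dict]]:
        return self._load().get(f"{proxy_ip}:{domain}")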
Parsing streaming text and citations
Copilot’s mixed WebSocket events require sophisticated parsing to reconstruct complete responses:
Markdown conversion with citations:
import html2text
from bs4 import BeautifulSoup
import re
from typing import Any, Dict, List
def convert_html_to_markdown_with_links(
html_content: str, citation_pills: List[List[Dict[str, Any]]]
) -> str:
"""
Convert Copilot HTML to markdown, replacing citation buttons with proper links.
"""
if not html_content:
return ""
# Parse HTML
soup = BeautifulSoup(html_content, "html.parser")
# Remove unwanted elements
reactions_div = soup.find(attrs={"data-testid": "message-item-reactions"})
if reactions_div:
reactions_div.decompose()
citation_cards = soup.find(attrs={"data-testid": "citation-cards-row"})
if citation_cards:
citation_cards.decompose()
# Find all citation buttons (rounded-md class)
buttons = soup.find_all("button", {"class": "rounded-md"})
button_index = 0
pill_index = 0
# Replace each citation button with actual links
while button_index < len(buttons) and pill_index < len(citation_pills):
pill_links = citation_pills[pill_index]
button = buttons[button_index]
# Create anchor elements for each link in the pill
new_anchors = []
for link_data in pill_links:
source_text = link_data.get("label")
url = link_data.get("url")
new_anchor = soup.new_tag("a", href=url)
new_anchor.string = source_text
new_anchors.append(new_anchor)
# Insert all anchors after the button and remove the button
for anchor in reversed(new_anchors):
button.insert_after(anchor)
button.decompose()
button_index += 1
pill_index += 1
# Convert to markdown
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.body_width = 0
h.unicode_snob = True
h.skip_internal_links = False
markdown = h.handle(str(soup))
# Clean up whitespace
markdown = re.sub(r"\n\s*\n\s*\n", "\n\n", markdown)
markdown = markdown.replace("\\n\\n", "\n\n")
markdown = markdown.replace("\\n", "\n")
return markdown.strip()
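Wiring this into the scraper looks roughly like the sketch below, which assumes `scraper` is a MicrosoftCopilotScraper that has finished a run (its copilot_events attribute still holds the captured events) and `result` is the dict returned by scrape_copilot:
# Sketch: combine the scraped HTML with the parser's grouped citations
parser = CopilotWebSocketParser()
parser.parse_websocket_events(scraper.copilot_events)  # populates parser.citation_pills

markdown = convert_html_to_markdown_with_links(
    html_content=result.get("html_content", ""),
    citation_pills=parser.citation_pills,
)
print(markdown)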
Advanced citation analysis:
def analyze_citation_patterns(events: List[Dict[str, Any]]) -> Dict[str, Any]:
"""
Analyze citation patterns in Copilot responses for insights.
"""
citations = []
text_events = []
for event in events:
if event.get("event") == "citation":
citations.append({
"title": event.get("title", ""),
"url": event.get("url", ""),
"position": len(citations) + 1
})
elif event.get("event") == "appendText":
text_events.append(event.get("text", ""))
return {
"total_citations": len(citations),
"citation_density": len(citations) / len(text_events) if text_events else 0,
"average_text_between_citations": len("".join(text_events)) / len(citations) if citations else 0,
"microsoft_sources": len([c for c in citations if "microsoft.com" in c.get("url", "")]),
"external_sources": len([c for c in citations if "microsoft.com" not in c.get("url", "")])
}
def extract_microsoft_knowledge_focus(text: str) -> Dict[str, Any]:
"""
Analyze text to identify Microsoft ecosystem focus areas.
"""
microsoft_products = [
"Microsoft 365", "Office 365", "SharePoint", "Teams", "Outlook",
"Azure", "Visual Studio", "Power Platform", "Power BI", "Power Apps",
"Windows", "Active Directory", "Exchange", "OneDrive"
]
product_mentions = {}
for product in microsoft_products:
count = text.lower().count(product.lower())
if count > 0:
product_mentions[product] = count
return {
"total_product_mentions": sum(product_mentions.values()),
"mentioned_products": product_mentions,
"has_microsoft_focus": len(product_mentions) > 0,
"primary_products": sorted(product_mentions.items(), key=lambda x: x[1], reverse=True)[:3]
}
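Both helpers work on data you already have after a scrape: the raw events for the citation analysis and the reconstructed text for the product-mention breakdown. A quick usage sketch (variable names follow the earlier examples):
# Sketch: run the analysis helpers on a completed scrape
citation_stats = analyze_citation_patterns(scraper.copilot_events)
knowledge_focus = extract_microsoft_knowledge_focus(result["text"])

print(f"Citations: {citation_stats['total_citations']} "
      f"({citation_stats['microsoft_sources']} from microsoft.com)")
print(f"Top products: {knowledge_focus['primary_products']}")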
Using cloro’s managed Copilot scraper
Building and maintaining a reliable Copilot scraper requires handling WebSocket event interception and cookie session management. That’s why we built cloro - a managed API that handles all the complexity for you.
Simple API integration:
import requests
import json
# Your Microsoft ecosystem query
query = "How can I improve team productivity using Microsoft 365 tools?"
# API request to cloro
response = requests.post(
'https://api.cloro.dev/v1/monitor/copilot',
headers={'Authorization': 'Bearer YOUR_API_KEY'},
json={
'prompt': query,
'country': 'US',
'include': {
'markdown': True,
'html': True
}
}
)
result = response.json()
print(json.dumps(result, indent=2))
What cloro handles automatically:
- WebSocket event handling: Complete real-time event interception and parsing
- Cookie session management: Automatic persistence and renewal of Copilot sessions
- Anti-bot evasion: Advanced techniques to avoid detection and rate limiting
- Citation extraction: Proper grouping and formatting of Microsoft documentation sources
- Error handling: Comprehensive retry logic for network and session issues
- Scalability: Distributed infrastructure for high-volume Microsoft ecosystem queries
Structured output you get:
{
"status": "success",
"result": {
"text": "To improve team productivity using Microsoft 365 tools, I recommend implementing the following strategies: utilize SharePoint for document collaboration, leverage Teams for communication, use Power Automate for workflow automation...",
"sources": [
{
"position": 1,
"url": "https://docs.microsoft.com/en-us/microsoft-365/",
"label": "Microsoft 365 Documentation",
"description": "Official documentation for Microsoft 365 productivity tools and features..."
},
{
"position": 2,
"url": "https://learn.microsoft.com/en-us/sharepoint/",
"label": "SharePoint Documentation",
"description": "Comprehensive guide to SharePoint for document management and collaboration..."
}
],
"markdown": "**To improve team productivity using Microsoft 365 tools**, I recommend implementing the following strategies...",
"html": "https://storage.cloro.dev/results/c45a5081-808d-4ed3-9c86-e4baf16c8ab8/page-1.html"
}
}
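The JSON maps directly onto Python dictionaries, so pulling out the answer text and sources takes only a few lines (assuming `result` is the parsed response from the request above):
# Sketch: work with cloro's structured response
data = result["result"]
print(data["text"][:300])
for source in data["sources"]:
    print(f"[{source['position']}] {source['label']} -> {source['url']}")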
Benefits of using cloro:
- 99.9% uptime vs. DIY solutions that break with Microsoft session changes
- P50 latency < 30s vs. manual scraping that takes hours with complex auth flows
- Microsoft ecosystem expertise without implementing specialized parsing logic
- Enterprise-grade session management for organizational access
- No infrastructure costs - we handle browsers, WebSocket connections, and session management
- Compliance - ethical scraping practices respecting Microsoft’s terms of service
- Scalability - handle thousands of Microsoft ecosystem queries with consistent quality
Start scraping Microsoft Copilot today.
The insights from Microsoft Copilot’s AI-powered assistance are invaluable for organizations leveraging the Microsoft ecosystem. Whether you’re optimizing Microsoft 365 workflows, researching Azure solutions, troubleshooting technical issues, or building automated enterprise support systems, access to structured Copilot data provides incredible opportunities.
For most developers and businesses, we recommend using cloro’s Copilot scraper. You get:
- Immediate access to reliable scraping infrastructure with session management
- Automatic WebSocket event parsing and citation extraction
- Microsoft ecosystem expertise without specialized knowledge
- Built-in session management and cookie handling
- Comprehensive error handling for Microsoft-specific challenges
- Enterprise-grade support for organizational access
The cost of building and maintaining this infrastructure yourself typically runs $3,000-6,000/month in development time, WebSocket infrastructure, and session management.
For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Microsoft updates their Copilot interface and WebSocket implementation.
The competitive advantage comes from building now. As more organizations discover the value of AI-powered Microsoft ecosystem optimization, early adopters will gain a significant operational edge. Companies that start leveraging Copilot data today will optimize their Microsoft workflows faster and more effectively than competitors.
Ready to unlock Microsoft Copilot insights for your organization? Get started with cloro’s API to start accessing AI-powered search data.
Don’t let your competitors optimize their Microsoft ecosystem faster. Start scraping Microsoft Copilot today.