How to scrape Microsoft Copilot with minimal infrastructure
Microsoft Copilot processes millions of enterprise and development queries daily. The web interface delivers specialized Microsoft ecosystem knowledge and technical documentation that direct APIs completely miss.
The challenge: Copilot wasn’t designed for programmatic access. The platform streams responses over WebSocket events in real time and manages sessions through browser cookies, which traditional HTTP-based scraping tools can’t handle.
After analyzing 3+ million Copilot responses, we’ve reverse-engineered its WebSocket architecture. This guide shows you exactly how to scrape Copilot and extract the Microsoft ecosystem intelligence that makes it valuable for enterprise applications.
Table of contents
- Why scrape Microsoft Copilot responses?
- Understanding Copilot’s WebSocket architecture
- The WebSocket event parsing challenge
- Building the scraping infrastructure
- Parsing streaming text and citations
- Using cloro’s managed Copilot scraper
Why scrape Microsoft Copilot responses?
Microsoft Copilot provides unique AI-assisted responses that integrate with Microsoft’s ecosystem and real-time web search.
What makes Copilot responses valuable:
- The actual conversational responses users see in the chat interface
- Integrated source citations and web search results for verification
- Context-aware responses that leverage Microsoft’s ecosystem knowledge
- Real-time information integration with current web data
- Streaming response generation with dynamic updates
Why it matters: Copilot responses represent a unique approach to AI assistance that combines conversational AI with real-time web search and Microsoft ecosystem integration, providing insights that can’t be obtained through other means.
Use cases:
- Verification: Check what Copilot actually tells users
- SEO: Monitor how Copilot sources and cites information
- Compliance: Ensure accuracy of AI-generated responses
- Research: Analyze response patterns and sources
For a similar guide on another platform, check out how to scrape ChatGPT.
Understanding Copilot’s WebSocket architecture
Copilot uses a straightforward real-time communication system based on WebSocket events:
How the scraper captures Copilot’s response generation:
- Page Navigation: Load the Copilot web interface
- WebSocket Connection: Intercept WebSocket messages from the chat endpoint
- Event Collection: Capture all JSON events sent via WebSocket
- Response Parsing: Process the collected events once completion is detected
- Microsoft Knowledge: The finished response draws on Copilot’s deep Microsoft product and service knowledge base
Key technical challenges:
WebSocket event interception:
// Copilot sends events via WebSocket from:
// copilot.microsoft.com/c/api/chat
// Events are simple JSON objects:
{
"event": "appendText",
"text": "To improve team productivity..."
}
{
"event": "citation",
"title": "Microsoft 365 Documentation",
"url": "https://docs.microsoft.com/..."
}
{
"event": "done"
}
Event types to handle:
- appendText: Text content chunks
- citation: Source citations embedded inline
- done: Completion marker
The WebSocket event parsing challenge
The core of Copilot scraping lies in parsing real-time WebSocket events and reconstructing structured responses:
WebSocket event stream:
# Raw Copilot WebSocket events example
{
"event": "appendText",
"text": "To improve team productivity using Microsoft 365"
}
{
"event": "citation",
"title": "Microsoft 365 Documentation",
"url": "https://docs.microsoft.com/en-us/microsoft-365/"
}
{
"event": "appendText",
"text": ", I recommend implementing SharePoint for document collaboration"
}
{
"event": "done"
}
Parsing challenges:
- Real-time streaming: Content arrives via WebSocket events, not HTTP responses
- Mixed event types: Text and citation events are interleaved
- Citation pill grouping: Multiple citations can be grouped together
- Event ordering: Citation positions must be tracked accurately
Python WebSocket parsing implementation:
import json
from typing import List, Dict, Any
class CopilotWebSocketParser:
def __init__(self):
self.text_parts = []
self.citation_pills: List[List[Dict[str, Any]]] = []
self.current_pill: List[Dict[str, Any]] = []
self.citation_position = 1
self.last_event_was_citation = False
self.is_complete = False
def parse_websocket_events(self, events: List[Dict[str, Any]]) -> Dict[str, Any]:
"""
Parse Copilot WebSocket events into structured response.
"""
for event in events:
event_type = event.get("event")
# Collect text chunks
if event_type == "appendText":
text_chunk = event.get("text", "")
self.text_parts.append(text_chunk)
# If we were building a citation pill and now see text, save the pill
if self.last_event_was_citation and self.current_pill:
self.citation_pills.append(self.current_pill)
self.current_pill = []
self.last_event_was_citation = False
# Collect citations
elif event_type == "citation":
citation_data = {
"position": self.citation_position,
"label": event.get("title", ""),
"url": event.get("url", ""),
"description": None
}
self.current_pill.append(citation_data)
self.citation_position += 1
self.last_event_was_citation = True
# Check for completion
elif event_type == "done":
self.is_complete = True
break
# Don't forget to add the last citation pill if it exists
if self.current_pill:
self.citation_pills.append(self.current_pill)
# Combine all text parts
full_text = "".join(self.text_parts)
# Flatten citation pills into unique sources
sources = self.flatten_citation_pills()
return {
"text": full_text,
"sources": sources,
"is_complete": self.is_complete
}
def flatten_citation_pills(self) -> List[Dict[str, Any]]:
"""
Flatten grouped citation pills into unique sources with corrected positions.
"""
seen_urls = set()
sources: List[Dict[str, Any]] = []
position = 1
for pill in self.citation_pills:
for citation_data in pill:
url = citation_data["url"]
if url not in seen_urls:
seen_urls.add(url)
sources.append({
"position": position,
"label": citation_data["label"],
"url": url,
"description": citation_data["description"]
})
position += 1
return sources
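To sanity-check the parser, here’s a minimal usage sketch that feeds it the sample event stream shown above (the payloads mirror that example):
# Minimal usage sketch: parse the sample event stream from above
sample_events = [
    {"event": "appendText", "text": "To improve team productivity using Microsoft 365"},
    {"event": "citation", "title": "Microsoft 365 Documentation",
     "url": "https://docs.microsoft.com/en-us/microsoft-365/"},
    {"event": "appendText", "text": ", I recommend implementing SharePoint for document collaboration"},
    {"event": "done"},
]

parser = CopilotWebSocketParser()
result = parser.parse_websocket_events(sample_events)

print(result["text"])         # Reconstructed response text
print(result["sources"])      # One unique source at position 1
print(result["is_complete"])  # True, because a "done" event was seen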
Citation pill grouping logic:
def group_consecutive_citations(events: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
"""
Group consecutive citation events into citation pills.
Copilot groups multiple citations that appear together.
"""
citation_pills = []
current_pill = []
last_was_citation = False
for event in events:
if event.get("event") == "citation":
citation_data = {
"position": len(current_pill) + 1,
"label": event.get("title", ""),
"url": event.get("url", ""),
"description": None
}
current_pill.append(citation_data)
last_was_citation = True
elif event.get("event") == "appendText" and last_was_citation:
# Text event after citations means the pill is complete
if current_pill:
citation_pills.append(current_pill)
current_pill = []
last_was_citation = False
# Add the final pill if it exists
if current_pill:
citation_pills.append(current_pill)
return citation_pills
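For intuition about the pill behavior: two back-to-back citation events followed by text collapse into a single pill. A small sketch (the second URL is a placeholder for illustration):
# Sketch: consecutive citations are grouped into one pill
events = [
    {"event": "appendText", "text": "Use SharePoint and Teams together"},
    {"event": "citation", "title": "SharePoint Documentation",
     "url": "https://learn.microsoft.com/en-us/sharepoint/"},
    {"event": "citation", "title": "Teams Documentation",
     "url": "https://learn.microsoft.com/en-us/microsoftteams/"},  # placeholder URL
    {"event": "appendText", "text": " for document collaboration."},
    {"event": "done"},
]

pills = group_consecutive_citations(events)
print(len(pills))     # 1 pill
print(len(pills[0]))  # containing 2 citations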
Building the scraping infrastructure
Let’s build the complete Copilot scraping system:
Required components:
- Browser automation: Playwright with WebSocket support
- WebSocket interception: Real-time event capture
- Event parsing: Text and citation extraction from mixed events
- Session persistence: Cookie stashing for sustained access
Complete scraper implementation:
import asyncio
import json
from playwright.async_api import async_playwright, Page
from typing import Dict, Any, List, Optional, Union
class MicrosoftCopilotScraper:
def __init__(self):
self.copilot_events: List[Dict[str, Any]] = []
self.received_done_event = False
async def setup_websocket_interceptor(self, page: Page):
"""Set up WebSocket event interception."""
def on_websocket(ws):
# Intercept Copilot chat WebSocket
if "copilot.microsoft.com/c/api/chat" in ws.url:
ws.on("framereceived", self.websocket_message_handler)
page.on("websocket", on_websocket)
def websocket_message_handler(self, message: Union[str, bytes]):
"""Handle incoming WebSocket messages."""
parsed = json.loads(message)
if parsed.get("event") == "done":
self.received_done_event = True
self.copilot_events.append(parsed)
async def scrape_copilot(self, query: str, country: str = 'US') -> Dict[str, Any]:
"""Main scraping function."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
page = await context.new_page()
# Set up WebSocket interception
await self.setup_websocket_interceptor(page)
try:
# Navigate to Copilot
await page.goto('https://copilot.microsoft.com/', timeout=20_000)
# Handle landing page and mode selection
await self.handle_copilot_landing(page)
# Fill and submit query
await page.wait_for_selector("#userInput", state="visible", timeout=10_000)
await page.fill("#userInput", query)
await page.keyboard.press("Enter")
# Wait for response completion
await self.wait_for_copilot_response(page)
# Parse the captured events
parser = CopilotWebSocketParser()
result = parser.parse_websocket_events(self.copilot_events)
# Extract additional data if needed
if result.get("text"):
# Get HTML content for markdown conversion if needed
html_content = await self.extract_html_content(page)
result["html_content"] = html_content
return result
finally:
await browser.close()
async def handle_copilot_landing(self, page: Page):
"""Handle Copilot landing page and mode selection."""
# Wait for mode selection buttons
await page.wait_for_selector(
"[data-testid='composer-chat-mode-quick-button'], [data-testid='composer-chat-mode-smart-button']",
timeout=5_000
)
# Navigate to chat mode using keyboard shortcuts (matching actual code)
for _ in range(2):
await page.keyboard.press("Tab")
await asyncio.sleep(0.1)
await page.keyboard.press("Enter")
await asyncio.sleep(1)
# Additional navigation to input field
for _ in range(5):
await page.keyboard.press("Tab")
await asyncio.sleep(0.1)
await page.keyboard.press("Enter")
await asyncio.sleep(0.5)
async def wait_for_copilot_response(self, page: Page, timeout: int = 60):
"""Wait for Copilot response completion."""
for _ in range(timeout * 2): # Check every 500ms
await self.solve_captcha_if_needed(page)
# If response got captured, we can return
if self.received_done_event:
break
await asyncio.sleep(0.5)
else:
raise Exception("Never received Copilot response after 60 seconds")
async def solve_captcha_if_needed(self, page: Page):
"""Handle captcha challenges if encountered."""
# Simplified captcha handling
try:
# This would integrate with your captcha solving service
pass
except Exception:
pass
async def extract_html_content(self, page: Page) -> str:
"""Extract HTML content from Copilot response."""
try:
# Get the AI message content
html_content = await page.locator(
"[class*='group/ai-message-item']"
).first.inner_html(timeout=2_000)
return html_content or ""
except Exception:
return ""
Cookie management for session persistence:
# Simple cookie management based on actual implementation
from typing import Dict, List, Optional
class CookieStash:
"""Manage cookies for persistent sessions across scrapes."""
def __init__(self):
self.cookies_cache = {}
async def save_cookies(self, proxy_ip: str, domain: str, cookies: List[Dict]):
"""Save cookies for reuse."""
cache_key = f"{proxy_ip}:{domain}"
self.cookies_cache[cache_key] = cookies
async def get_cookies(self, proxy_ip: str, domain: str) -> Optional[List[Dict]]:
"""Retrieve cached cookies."""
cache_key = f"{proxy_ip}:{domain}"
return self.cookies_cache.get(cache_key)
# Usage in scraper (matching actual code); `proxy` is the proxy object assigned to this session
cookie_stash = CookieStash()
# Load existing cookies before navigation
existing_cookies = await cookie_stash.get_cookies(proxy.ip, "https://copilot.microsoft.com/")
if existing_cookies:
try:
await page.context.add_cookies(existing_cookies)
except Exception as e:
print(f"Failed to load cached cookies: {e}")
# Save cookies after successful session
cookies = await page.context.cookies()
await cookie_stash.save_cookies(proxy.ip, "https://copilot.microsoft.com/", cookies)
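The in-memory cache above only lives for a single process. If you want sessions to survive restarts, a file-backed variant is straightforward; the sketch below assumes storing cookies on disk is acceptable in your environment (the file name is arbitrary):
# Sketch: file-backed cookie stash so sessions survive process restarts
import json
from pathlib import Path
from typing import Dict, List, Optional

class FileCookieStash:
    """Persist cookies to disk, keyed by proxy IP and domain."""
    def __init__(self, path: str = "copilot_cookies.json"):
        self.path = Path(path)

    def _load(self) -> Dict[str, List[Dict]]:
        # Read the full cache from disk, or start empty
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    async def save_cookies(self, proxy_ip: str, domain: str, cookies: List[Dict]):
        data = self._load()
        data[f"{proxy_ip}:{domain}"] = cookies
        self.path.write_text(json.dumps(data))

    async def get_cookies(self, proxy_ip: str, domain: str) -> Optional[List[Dict]]:
        return self._load().get(f"{proxy_ip}:{domain}")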
Parsing streaming text and citations
Copilot’s mixed WebSocket events require sophisticated parsing to reconstruct complete responses:
Markdown conversion with citations:
import html2text
from bs4 import BeautifulSoup
import re
from typing import Any, Dict, List
def convert_html_to_markdown_with_links(
html_content: str, citation_pills: List[List[Dict[str, Any]]]
) -> str:
"""
Convert Copilot HTML to markdown, replacing citation buttons with proper links.
"""
if not html_content:
return ""
# Parse HTML
soup = BeautifulSoup(html_content, "html.parser")
# Remove unwanted elements
reactions_div = soup.find(attrs={"data-testid": "message-item-reactions"})
if reactions_div:
reactions_div.decompose()
citation_cards = soup.find(attrs={"data-testid": "citation-cards-row"})
if citation_cards:
citation_cards.decompose()
# Find all citation buttons (rounded-md class)
buttons = soup.find_all("button", {"class": "rounded-md"})
button_index = 0
pill_index = 0
# Replace each citation button with actual links
while button_index < len(buttons) and pill_index < len(citation_pills):
pill_links = citation_pills[pill_index]
button = buttons[button_index]
# Create anchor elements for each link in the pill
new_anchors = []
for link_data in pill_links:
source_text = link_data.get("label")
url = link_data.get("url")
new_anchor = soup.new_tag("a", href=url)
new_anchor.string = source_text
new_anchors.append(new_anchor)
# Insert all anchors after the button and remove the button
for anchor in reversed(new_anchors):
button.insert_after(anchor)
button.decompose()
button_index += 1
pill_index += 1
# Convert to markdown
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.body_width = 0
h.unicode_snob = True
h.skip_internal_links = False
markdown = h.handle(str(soup))
# Clean up whitespace
markdown = re.sub(r"\n\s*\n\s*\n", "\n\n", markdown)
markdown = markdown.replace("\\n\\n", "\n\n")
markdown = markdown.replace("\\n", "\n")
return markdown.strip()
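Wiring this into the scraper looks roughly like the sketch below, which assumes `scraper` is a MicrosoftCopilotScraper that has finished a run (its copilot_events attribute still holds the captured events) and `result` is the dict returned by scrape_copilot:
# Sketch: combine the scraped HTML with the parser's grouped citations
parser = CopilotWebSocketParser()
parser.parse_websocket_events(scraper.copilot_events)  # populates parser.citation_pills

markdown = convert_html_to_markdown_with_links(
    html_content=result.get("html_content", ""),
    citation_pills=parser.citation_pills,
)
print(markdown)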
Advanced citation analysis:
def analyze_citation_patterns(events: List[Dict[str, Any]]) -> Dict[str, Any]:
"""
Analyze citation patterns in Copilot responses for insights.
"""
citations = []
text_events = []
for event in events:
if event.get("event") == "citation":
citations.append({
"title": event.get("title", ""),
"url": event.get("url", ""),
"position": len(citations) + 1
})
elif event.get("event") == "appendText":
text_events.append(event.get("text", ""))
return {
"total_citations": len(citations),
"citation_density": len(citations) / len(text_events) if text_events else 0,
"average_text_between_citations": len("".join(text_events)) / len(citations) if citations else 0,
"microsoft_sources": len([c for c in citations if "microsoft.com" in c.get("url", "")]),
"external_sources": len([c for c in citations if "microsoft.com" not in c.get("url", "")])
}
def extract_microsoft_knowledge_focus(text: str) -> Dict[str, Any]:
"""
Analyze text to identify Microsoft ecosystem focus areas.
"""
microsoft_products = [
"Microsoft 365", "Office 365", "SharePoint", "Teams", "Outlook",
"Azure", "Visual Studio", "Power Platform", "Power BI", "Power Apps",
"Windows", "Active Directory", "Exchange", "OneDrive"
]
product_mentions = {}
for product in microsoft_products:
count = text.lower().count(product.lower())
if count > 0:
product_mentions[product] = count
return {
"total_product_mentions": sum(product_mentions.values()),
"mentioned_products": product_mentions,
"has_microsoft_focus": len(product_mentions) > 0,
"primary_products": sorted(product_mentions.items(), key=lambda x: x[1], reverse=True)[:3]
}
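Both helpers work on data you already have after a scrape: the raw events for the citation analysis and the reconstructed text for the product-mention breakdown. A quick usage sketch (variable names follow the earlier examples):
# Sketch: run the analysis helpers on a completed scrape
citation_stats = analyze_citation_patterns(scraper.copilot_events)
knowledge_focus = extract_microsoft_knowledge_focus(result["text"])

print(f"Citations: {citation_stats['total_citations']} "
      f"({citation_stats['microsoft_sources']} from microsoft.com)")
print(f"Top products: {knowledge_focus['primary_products']}")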
Using cloro’s managed Copilot scraper
Building and maintaining a reliable Copilot scraper requires handling WebSocket event interception and cookie session management. That’s why we built cloro - a managed API that handles all the complexity for you.
Simple API integration:
import requests
import json
# Your Microsoft ecosystem query
query = "How can I improve team productivity using Microsoft 365 tools?"
# API request to cloro
response = requests.post(
'https://api.cloro.dev/v1/monitor/copilot',
headers={'Authorization': 'Bearer YOUR_API_KEY'},
json={
'prompt': query,
'country': 'US',
'include': {
'markdown': True,
'html': True
}
}
)
result = response.json()
print(json.dumps(result, indent=2))
What cloro handles automatically:
- WebSocket event handling: Complete real-time event interception and parsing
- Cookie session management: Automatic persistence and renewal of Copilot sessions
- Anti-bot evasion: Advanced techniques to avoid detection and rate limiting
- Citation extraction: Proper grouping and formatting of Microsoft documentation sources
- Error handling: Comprehensive retry logic for network and session issues
- Scalability: Distributed infrastructure for high-volume Microsoft ecosystem queries
Structured output you get:
{
"status": "success",
"result": {
"text": "To improve team productivity using Microsoft 365 tools, I recommend implementing the following strategies: utilize SharePoint for document collaboration, leverage Teams for communication, use Power Automate for workflow automation...",
"sources": [
{
"position": 1,
"url": "https://docs.microsoft.com/en-us/microsoft-365/",
"label": "Microsoft 365 Documentation",
"description": "Official documentation for Microsoft 365 productivity tools and features..."
},
{
"position": 2,
"url": "https://learn.microsoft.com/en-us/sharepoint/",
"label": "SharePoint Documentation",
"description": "Comprehensive guide to SharePoint for document management and collaboration..."
}
],
"markdown": "**To improve team productivity using Microsoft 365 tools**, I recommend implementing the following strategies...",
"html": "https://storage.cloro.dev/results/c45a5081-808d-4ed3-9c86-e4baf16c8ab8/page-1.html"
}
}
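The JSON maps directly onto Python dictionaries, so pulling out the answer text and sources takes only a few lines (assuming `result` is the parsed response from the request above):
# Sketch: work with cloro's structured response
data = result["result"]
print(data["text"][:300])
for source in data["sources"]:
    print(f"[{source['position']}] {source['label']} -> {source['url']}")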
Benefits of using cloro:
- 99.9% uptime vs. DIY solutions that break with Microsoft session changes
- P50 latency < 30s vs. manual scraping that takes hours with complex auth flows
- Microsoft ecosystem expertise without implementing specialized parsing logic
- Enterprise-grade session management for organizational access
- No infrastructure costs - we handle browsers, WebSocket connections, and session management
- Compliance - ethical scraping practices respecting Microsoft’s terms of service
- Scalability - handle thousands of Microsoft ecosystem queries with consistent quality
Start scraping Microsoft Copilot today.
The insights from Microsoft Copilot’s AI-powered assistance are invaluable for organizations leveraging the Microsoft ecosystem. Whether you’re optimizing Microsoft 365 workflows, researching Azure solutions, troubleshooting technical issues, or building automated enterprise support systems, access to structured Copilot data provides incredible opportunities.
For most developers and businesses, we recommend using cloro’s Copilot scraper. You get:
- Immediate access to reliable scraping infrastructure with session management
- Automatic WebSocket event parsing and citation extraction
- Microsoft ecosystem expertise without specialized knowledge
- Built-in session management and cookie handling
- Comprehensive error handling for Microsoft-specific challenges
- Enterprise-grade support for organizational access
The cost of building and maintaining this infrastructure yourself typically runs $3,000-6,000/month in development time, WebSocket infrastructure, and session management.
For advanced users needing custom solutions, the technical approach outlined above provides the foundation for building your own scraping system. Be prepared for ongoing maintenance as Microsoft updates their Copilot interface and WebSocket implementation.
The competitive advantage comes from building now. As more organizations discover the value of AI-powered Microsoft ecosystem optimization, early adopters will gain a significant operational edge. Companies that start leveraging Copilot data today will optimize their Microsoft workflows faster and more effectively than competitors.
Ready to unlock Microsoft Copilot insights for your organization? Get started with cloro’s API to start accessing AI-powered search data.
Don’t let your competitors optimize their Microsoft ecosystem faster. Start scraping Microsoft Copilot today.