How to find all URLs on a domain

You think you know your website.

You have a navigation bar. You have a footer. You have a database.

But if you actually scan your domain, you will find ghosts: Old landing pages from 2019 marketing campaigns. Staging subdomains indexed by mistake. Orphaned blog posts with zero internal links.

Finding all URLs on a domain is not just a housekeeping task. It is the foundation of:

  • SEO Audits: You can’t fix what you can’t see.
  • Site Migrations: Ensuring no link is left behind.
  • Competitive Intelligence: Seeing exactly what your competitor is publishing.
  • Security: Finding exposed admin panels or sensitive files.

There is no single “magic button” to find everything. You need a multi-layered approach.

Here is the complete playbook for mapping 100% of a domain.

Level 1: The polite way (sitemaps)

Before you break out the heavy artillery, try the front door.

Most modern CMSs (WordPress, Shopify, Webflow) generate an XML sitemap automatically. It is meant for Googlebot, but you can read it too.

Step 1: Check robots.txt

Go to domain.com/robots.txt. This is the instruction manual for crawlers. Often, developers explicitly list the sitemap location here.

User-agent: *
Disallow: /admin
Sitemap: https://domain.com/sitemap_index.xml

Step 2: Check standard sitemap paths

If it’s not in robots.txt, guess. Try these common URLs:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemap.php
  • /sitemap.txt
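
If you would rather not check each path by hand, a quick sketch like the one below can probe them for you. It assumes the requests library is installed and uses cloro.dev purely as an illustrative target.

import requests

domain = "https://cloro.dev"
candidates = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.php", "/sitemap.txt"]

for path in candidates:
    url = domain + path
    try:
        # A 200 response is a good hint that a sitemap lives here
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            print(f"Found: {url}")
    except requests.RequestException:
        pass  # unreachable or timed out; move on to the next candidate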

Step 3: Parse it

Sitemaps are often nested. The “Index” sitemap links to “Post” sitemaps and “Product” sitemaps. You need to follow the chain.
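
If you want to automate that chain-following, here is a minimal sketch using Python's built-in XML parser. It assumes the standard sitemaps.org namespace; the starting URL is the one from the robots.txt example above.

import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_sitemap_urls(sitemap_url, found=None):
    if found is None:
        found = set()
    xml_text = requests.get(sitemap_url, timeout=10).text
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        # Index sitemap: recurse into each child sitemap it lists
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            collect_sitemap_urls(loc.text.strip(), found)
    else:
        # Regular sitemap: collect the page URLs
        for loc in root.findall("sm:url/sm:loc", NS):
            found.add(loc.text.strip())
    return found

urls = collect_sitemap_urls("https://domain.com/sitemap_index.xml")
print(f"Found {len(urls)} URLs across the sitemap tree")

Note that some sites serve gzipped sitemaps (.xml.gz), which this sketch does not handle.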

Visual Helper: If the XML is hard to read, paste the URL into a tool like XML Sitemap Validator to get a clean list.

Level 2: The hacker way (Google Dorks)

Sometimes, a website doesn’t want you to find a page. It’s not in the sitemap. It’s not in the menu.

But if Google has indexed it, you can find it using Search Operators (also known as Google Dorks).

The site: operator

Go to Google and type:

site:cloro.dev

This returns every page Google has indexed for that domain.

Advanced dorking strategies:

  • Find subdomains: site:cloro.dev -inurl:www (Excludes URLs on the main www host so subdomains surface).
  • Find documents: site:cloro.dev filetype:pdf (Finds hidden whitepapers).
  • Find Excel sheets: site:cloro.dev filetype:xlsx (Often exposes sensitive pricing data).
  • Find login pages: site:cloro.dev inurl:login

Why this works: Google’s crawler is more aggressive than any tool you run on your laptop. It has been indexing the site for years. It remembers pages the owner forgot they published.

Check out our guide on Google Search Parameters to master these filters.

Level 3: The archivist way (Wayback Machine)

What about pages that were deleted? Or pages that are currently offline?

The Internet Archive (Wayback Machine) has been taking snapshots of the web since 1996. You can query their API to find every URL they have ever seen for a domain.
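
If you prefer to stay in Python rather than install a CLI tool, you can hit the Archive's CDX endpoint directly. A minimal sketch, assuming the public CDX server parameters (the domain is illustrative):

import requests

domain = "cloro.dev"
response = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": domain,
        "matchType": "domain",   # include subdomains
        "output": "json",
        "fl": "original",        # only return the original URL field
        "collapse": "urlkey",    # de-duplicate identical URLs
    },
    timeout=60,
)
rows = response.json()
urls = {row[0] for row in rows[1:]}  # the first row is the header
print(f"{len(urls)} archived URLs for {domain}")

Be patient: for large domains the response can run to tens of thousands of rows.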

The tool: waybackurls

If you are comfortable with the command line, there is a legendary tool by Tom Hudson (tomnomnom) called waybackurls.

Installation (Go required):

go install github.com/tomnomnom/waybackurls@latest

Usage:

echo "cloro.dev" | waybackurls > urls.txt

This will dump thousands of URLs into a text file in seconds. You will find:

  • Old API endpoints (/api/v1/...)
  • Deprecated staging environments (dev.domain.com)
  • Broken redirects

Pro Tip: This is how bug bounty hunters find vulnerabilities. They look for old, unpatched pages that the developer forgot to delete.

Level 4: The developer way (Python)

If you want to build your own map—fresh, live, and custom—you need to build a crawler.

A crawler starts at the homepage, finds all the links, visits those links, finds their links, and repeats until there is nowhere left to go.

Here is a simple but functional Python script using requests and BeautifulSoup. If you are new to making requests in Python, check out our guide on converting curl to Python.

Prerequisites:

pip install requests beautifulsoup4

The Code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time

target_url = "https://cloro.dev"
domain_name = urlparse(target_url).netloc
visited_urls = set()
urls_to_visit = {target_url}

# User-Agent to look like a real browser (avoid 403 blocks)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def get_all_links(url):
    try:
        response = requests.get(url, headers=headers, timeout=5)
        soup = BeautifulSoup(response.text, "html.parser")
        links = set()

        for a_tag in soup.find_all("a"):
            href = a_tag.get("href")
            if not href:
                continue

            # Convert relative URLs to absolute URLs
            href = urljoin(url, href)
            parsed_href = urlparse(href)

            # Clean the URL (remove query params for deduplication)
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path

            # Only keep internal links (match the host or a subdomain, not a substring)
            if (parsed_href.netloc == domain_name or parsed_href.netloc.endswith("." + domain_name)) and href not in visited_urls:
                links.add(href)

        return links
    except Exception as e:
        print(f"Error crawling {url}: {e}")
        return set()

print(f"Starting crawl of {target_url}...")

while urls_to_visit:
    current_url = urls_to_visit.pop()
    if current_url in visited_urls:
        continue

    print(f"Crawling: {current_url}")
    visited_urls.add(current_url)

    # Get new links
    new_links = get_all_links(current_url)
    urls_to_visit.update(new_links)

    # Be polite! Don't crash their server.
    time.sleep(0.5)

print(f"\nFound {len(visited_urls)} unique URLs:")
for url in visited_urls:
    print(url)

Warning: This script is basic. It doesn’t handle JavaScript rendering (React/Vue/Angular sites). For that, you would need Playwright or Selenium—similar to the techniques used in scraping Google AI Mode.
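
If you do need JavaScript rendering, a minimal sketch with Playwright might look like this. It assumes you have run pip install playwright and playwright install chromium; the target URL is illustrative.

from playwright.sync_api import sync_playwright

target_url = "https://cloro.dev"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(target_url, wait_until="networkidle")  # wait for client-side JS to settle
    # Pull hrefs out of the fully rendered DOM
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    browser.close()

print(f"Found {len(set(links))} links after rendering")

You would still need to wrap this in the same visit-and-queue loop as the requests version to crawl an entire site.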

Level 5: The pro tools

If you don’t want to code, use the industry standards. These tools handle JavaScript, cookies, and rate limiting out of the box.

1. Screaming Frog SEO Spider

The undisputed king. It installs locally on your Mac/PC.

  • Pros: Extremely deep crawling, finds broken links (404s), visualizes site architecture.
  • Cons: Paid license required for >500 URLs. UI looks like an Excel spreadsheet from 1999.

2. Ahrefs / SEMrush

These are cloud-based. They don’t crawl your site live (usually); they show you what they have indexed over time.

  • Pros: Shows you which pages have the most backlinks.
  • Cons: Expensive subscriptions ($100+/mo).

3. Hexomatic / Browse AI

No-code scraping platforms.

  • Pros: Great for extracting data from the URLs once found (e.g., getting all prices from all product pages).
  • Cons: Can be slow for massive sites.

The problem of “Orphan Pages”

Here is the scary part: A standard crawler (Level 4 & 5) cannot find Orphan Pages.

An Orphan Page is a page that exists on the server but has zero internal links pointing to it. If you don’t link to it, the crawler can’t click to it.

How to find Orphan Pages:

  1. Cross-reference: Compare your “Crawled URLs” list (Screaming Frog) with your “Sitemap URLs” list. Anything in the sitemap but not the crawl is an orphan (see the sketch after this list).
  2. Google Analytics: Check your “Landing Pages” report for the last year. Users might be arriving at pages via email links or social ads that aren’t linked on your menu.
  3. Log File Analysis: This is the nuclear option. You ask the server admin for the “Access Logs.” This shows every single URL that anyone has requested from the server. It reveals everything.
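
Here is a minimal sketch of the cross-reference step (#1), assuming you have exported both lists to plain text files with one URL per line (the filenames are just placeholders):

# Anything in the sitemap that the crawler never reached is a candidate orphan page
with open("sitemap_urls.txt") as f:
    sitemap_urls = {line.strip() for line in f if line.strip()}

with open("crawled_urls.txt") as f:
    crawled_urls = {line.strip() for line in f if line.strip()}

orphans = sitemap_urls - crawled_urls
print(f"{len(orphans)} potential orphan pages:")
for url in sorted(orphans):
    print(url)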

Monitoring your digital footprint

Finding URLs on your domain is step one. But what about finding where your URLs are appearing on the rest of the web—specifically in the “hidden web” of AI answers?

Traditional crawlers stop at the website boundary. They can’t see inside ChatGPT or Perplexity.

That is where cloro comes in.

You might map your domain perfectly, but if an AI engine is hallucinating a pricing page that doesn’t exist, or linking users to a broken 404, your audit is incomplete.

The modern discovery stack:

  1. Screaming Frog: To map your physical structure.
  2. Google Search Console: To map your search visibility.
  3. cloro: To map your AI visibility and ensure the robots are citing the right pages.

Knowing your domain is good. Knowing how the world sees your domain is better.