Extract URLs from Sitemap
Crawl XML sitemaps to extract all URLs for bulk analysis and monitoring.
The Extract URLs from Sitemap node reads XML sitemaps to produce comprehensive lists of a website's URLs. It is essential for SEO audits, competitive analysis, and bulk website monitoring tasks.
When to Use It
Use this node to:
- Get complete URL lists for SEO audits
- Analyze competitor website structure
- Monitor large websites for changes
- Bulk check page status across entire sites
- Feed URLs into scraping or analysis workflows
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| XML Sitemap URL | Text | Yes | The sitemap.xml URL you want to extract URLs from |
| Limit | Number | No | Maximum number of URLs to extract |
How It Works
This node reads XML sitemap files and extracts all the URLs listed within them. Sitemaps are files that websites use to tell search engines about their pages.
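To illustrate what this extraction involves, here is a minimal Python sketch (not the node's actual implementation) that pulls `<loc>` values out of a standard sitemap, assuming it follows the sitemaps.org protocol namespace:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol that standard sitemaps follow.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text, limit=None):
    """Return the <loc> values from a <urlset> sitemap, optionally capped at `limit`."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
    return urls[:limit] if limit is not None else urls

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_urls(sample))           # ['https://example.com/', 'https://example.com/about']
print(extract_urls(sample, limit=1))  # ['https://example.com/']
```

The optional `limit` argument mirrors the node's Limit input.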
Common Sitemap Locations
Most websites have sitemaps at these standard locations:
- `https://example.com/sitemap.xml`
- `https://example.com/sitemap_index.xml`
- `https://example.com/sitemaps/sitemap.xml`
You can also find sitemap URLs in:
- The `robots.txt` file (usually at `https://example.com/robots.txt`)
- Google Search Console
- Website footer links
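Sitemap declarations in `robots.txt` are plain `Sitemap:` lines, so discovering them is straightforward. A small sketch (hypothetical helper name) that collects those declarations from fetched `robots.txt` content:

```python
def sitemaps_from_robots(robots_txt):
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt file.

    The directive is matched case-insensitively, as crawlers generally do.
    """
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]

robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml"""

print(sitemaps_from_robots(robots))
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml']
```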
Sitemap Types
Standard Sitemaps:
- List all website pages in XML format
- Include last modification dates
- Show page priority and update frequency
Sitemap Index Files:
- Point to multiple sitemap files
- Common for large websites
- May contain thousands of URLs across multiple files
Specialized Sitemaps:
- News sitemaps (news articles)
- Image sitemaps (image content)
- Video sitemaps (video content)
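The difference between a standard sitemap and a sitemap index shows up in the XML root element: `<urlset>` lists pages, while `<sitemapindex>` lists further sitemap files. A sketch (assuming the sitemaps.org namespace) of telling the two apart:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def classify_sitemap(xml_text):
    """Return ('index', child sitemap URLs) for a sitemap index,
    or ('urlset', page URLs) for a standard sitemap."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("}sitemapindex"):
        return "index", [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
    return "urlset", [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

print(classify_sitemap(index_xml))
```

For an index, a crawler then fetches each child sitemap and extracts its URLs in turn.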
Output
The node returns:
- URLs - List of all URLs found in the sitemap
- Total Count - Number of URLs extracted
- Last Modified - When each URL was last updated (if available)
- Priority - Page priority as specified in sitemap (if available)
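The `lastmod` and `priority` fields are optional children of each `<url>` element, so they are only present when the sitemap provides them. A sketch of collecting the fields above per URL (the output shape here is illustrative, not the node's exact format):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_entries(xml_text):
    """Collect loc/lastmod/priority per <url>, plus a total count (missing fields become None)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for name in ("loc", "lastmod", "priority"):
            el = url.find("sm:" + name, NS)
            entry[name] = el.text.strip() if el is not None and el.text else None
        entries.append(entry)
    return {"total_count": len(entries), "entries": entries}

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod><priority>1.0</priority></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

result = extract_entries(sample)
print(result["total_count"])  # 2
print(result["entries"][1])   # {'loc': 'https://example.com/blog', 'lastmod': None, 'priority': None}
```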
Tips
Finding Sitemaps:
- Check `/robots.txt` for sitemap declarations
- Try common sitemap URLs first
- Look in Google Search Console for verified sitemaps
- Some sites have multiple sitemaps for different content types
Large Sitemaps:
- Use the limit parameter for initial testing
- Large sites may have sitemap index files linking to multiple sitemaps
- Consider processing in batches for very large sites
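Batch processing can be sketched as simple chunking of the extracted URL list, so downstream steps handle a bounded number of URLs at a time (the batch size here is an arbitrary example):

```python
def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = ["https://example.com/page-%d" % n for n in range(5)]
print([len(b) for b in batches(urls, 2)])  # [2, 2, 1]
```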
Error Handling:
- Not all websites have sitemaps
- Some sitemaps may be incomplete or outdated
- Private or restricted sitemaps may not be accessible
FAQ
What if a website doesn't have a sitemap?
Not all websites have sitemaps. You can try common sitemap URLs or check the robots.txt file. For sites without sitemaps, consider using web scraping to find links or manually compile URL lists.
Can I extract URLs from sitemap index files?
Yes, the node can handle sitemap index files that reference multiple sitemaps. It will follow the references and extract URLs from all linked sitemaps.
How do I handle very large sitemaps?
Use the limit parameter to extract a subset of URLs first. For complete analysis of large sites, consider processing the sitemap in batches or focusing on specific sections.
What if I get access denied errors?
Some websites restrict access to their sitemaps or require specific user agents. The sitemap might be protected or the URL might be incorrect. Try accessing it directly in your browser first.
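If the sitemap loads fine in a browser but not programmatically, the server may be rejecting default library user agents. A sketch of building a request with an explicit User-Agent (the agent string and helper name are examples, not a requirement of this node):

```python
import urllib.request

def sitemap_request(url, user_agent="Mozilla/5.0 (compatible; sitemap-checker/1.0)"):
    """Build a request with an explicit User-Agent; some hosts answer
    default library user agents with 403 errors."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# To actually fetch (requires network access):
# with urllib.request.urlopen(sitemap_request("https://example.com/sitemap.xml"), timeout=30) as resp:
#     xml_text = resp.read().decode("utf-8")

req = sitemap_request("https://example.com/sitemap.xml")
print(req.get_header("User-agent"))  # Mozilla/5.0 (compatible; sitemap-checker/1.0)
```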
Can I use this to monitor competitors?
Yes, extracting competitor sitemaps is a common competitive analysis technique. Combine with other nodes to track their content strategy, new pages, and site structure changes.
How often should I extract sitemap data?
For monitoring purposes, weekly or monthly extraction is usually sufficient unless you’re tracking rapidly changing sites. Use schedulers to automate regular sitemap analysis.