Web pages
Realm can sync any publicly accessible website and add its content to your knowledge base.
Setting up web scraping
Configure Patterns (Optional)
If you want to sync only certain URLs, add include or exclude patterns.
URL patterns
Use patterns to control what content is synced:
Include patterns
Specify patterns to only sync matching URLs:

| Pattern | Effect |
|---|---|
| /docs/* | Only sync pages under /docs |
| /blog/2024/* | Only sync 2024 blog posts |
| */api/* | Only sync API documentation |

Exclude patterns
Specify patterns to skip matching URLs:

| Pattern | Effect |
|---|---|
| /login | Skip login page |
| /admin/* | Skip admin pages |
| *.pdf | Skip PDF files |

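
To reason about how include and exclude patterns might combine, here is a minimal Python sketch. It is not Realm's implementation. It assumes patterns are matched against the URL path only, that `*` matches any run of characters (including `/`), and that exclude patterns take precedence over include patterns. The `should_sync` helper is hypothetical and exists only for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_sync(url, include=(), exclude=()):
    """Hypothetical helper: decide whether a URL passes the patterns.

    Assumptions (not confirmed Realm behavior): patterns match the URL
    path only, and an exclude match always wins over an include match.
    """
    path = urlparse(url).path or "/"
    # Any matching exclude pattern skips the URL.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is synced.
    if not include:
        return True
    # Otherwise at least one include pattern must match.
    return any(fnmatchcase(path, pattern) for pattern in include)


if __name__ == "__main__":
    include = ["/docs/*"]
    exclude = ["*.pdf", "/login"]
    for url in (
        "https://example.com/docs/intro",
        "https://example.com/docs/guide.pdf",
        "https://example.com/blog/post",
    ):
        print(url, should_sync(url, include, exclude))
```

Testing a handful of representative URLs against your patterns this way, before enabling a sync, can reveal patterns that are broader or narrower than intended.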
Use cases
Product Documentation
Sync public product docs from your website
Help Centers
Import external help center content
Competitor Research
Track competitor public documentation
Partner Content
Sync partner documentation and resources
Best practices
Respect robots.txt
Only scrape websites that allow crawling. Before adding a site, check its robots.txt file, which is served from the site root at /robots.txt.
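
One way to check a site's rules by hand is Python's standard `urllib.robotparser` module. The sketch below parses robots.txt content supplied as a string so that it runs offline. The `is_allowed` helper, the sample rules, and the `*` user agent are illustrative assumptions, and the user agent Realm actually sends is not specified here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Hypothetical helper: check a URL against robots.txt content."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, invented for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/docs/intro"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/admin/users"))
```

To check a live site instead, call `set_url()` with the site's robots.txt URL and then `read()` on the parser before calling `can_fetch()`.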
Use Specific Patterns
Be specific with include/exclude patterns to avoid syncing irrelevant content.
Monitor Content
Periodically review scraped content to ensure it remains relevant and accurate.
Start Small
Begin with a specific section of a website before expanding to more pages.
Limitations
- Only works with publicly accessible websites
- Dynamic content (JavaScript-rendered) may not be fully captured
- Rate limiting may apply to prevent overloading target sites

