Web pages

Realm can sync any publicly accessible website and add its content to your knowledge base.

Setting up web scraping

1

Navigate to Data sources

Go to the Data sources page in Realm.
2

Find Web Scraping

Find “Web Scraping” and click Manage, then “+ Add site”.
3

Enter URL

Enter the URL you want to sync (e.g., https://example.com/docs).
4

Configure Patterns (Optional)

To sync only certain URLs under that site, add include or exclude patterns.
5

Save

Click Save to start the sync.

URL patterns

Use patterns to control what content is synced:

Include patterns

Specify patterns to only sync matching URLs:
Pattern          Effect
/docs/*          Only sync pages under /docs
/blog/2024/*     Only sync 2024 blog posts
*/api/*          Only sync API documentation

Exclude patterns

Specify patterns to skip matching URLs:
Pattern          Effect
/login           Skip login page
/admin/*         Skip admin pages
*.pdf            Skip PDF files
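
To preview which URLs a set of patterns would select before saving them, the tables above can be approximated locally. This is a minimal sketch, not Realm's actual matcher: it assumes patterns are shell-style globs matched against the URL path, that `*` can span `/`, and that exclude patterns take precedence over include patterns. The function name `should_sync` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_sync(url, include=(), exclude=()):
    """Return True if `url` would be synced under the given patterns.

    Assumptions (not confirmed by the Realm docs): patterns are glob-style,
    are matched against the URL path only, and exclude wins over include.
    """
    path = urlparse(url).path or "/"

    # Any matching exclude pattern skips the URL outright.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False

    # With include patterns present, only matching URLs are synced.
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)

    # No include patterns: everything not excluded is synced.
    return True


if __name__ == "__main__":
    urls = [
        "https://example.com/docs/intro",
        "https://example.com/docs/guide.pdf",
        "https://example.com/blog/2023/post",
        "https://example.com/login",
    ]
    for url in urls:
        print(url, should_sync(url, include=["/docs/*"], exclude=["*.pdf", "/login"]))
```

Running a list of known URLs through a check like this makes it easier to spot a pattern that is too broad or too narrow before the first sync.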

Use cases

Product Documentation

Sync public product docs from your website

Help Centers

Import external help center content

Competitor Research

Track competitor public documentation

Partner Content

Sync partner documentation and resources

Best practices

  • Only scrape websites that allow crawling. Check the site’s robots.txt file.
  • Be specific with include/exclude patterns to avoid syncing irrelevant content.
  • Periodically review scraped content to ensure it remains relevant and accurate.
  • Begin with a specific section of a website before expanding to more pages.
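
The robots.txt check mentioned above can be scripted with Python's standard library. The sketch below is illustrative only: the sample robots.txt content is made up, the helper name `is_crawl_allowed` is hypothetical, and the user agent `*` is an assumption, since the user agent Realm's crawler sends is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used so the example runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Allow: /docs/
"""


def is_crawl_allowed(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the rules can come from any source.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/docs/intro",
        "https://example.com/admin/users",
    ):
        print(url, is_crawl_allowed(SAMPLE_ROBOTS, url))
```

To check a live site instead, `RobotFileParser` can download the file itself: call `set_url()` with the site's robots.txt address, then `read()`, before calling `can_fetch()`.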

Limitations

  • Only works with publicly accessible websites
  • Dynamic content (JavaScript-rendered) may not be fully captured
  • Rate limiting may apply to prevent overloading target sites