Web pages

Setting up web scraping

Navigate to Data sources

Go to the Data sources page in Realm.

Find Web Scraping

Open “Web Scraping”, then click Add Site.

Enter URL

Enter the URL you want to sync (for example, https://example.com/docs).

Configure Patterns (Optional)

If you want to sync only certain URLs, add include or exclude patterns.

Save

Click Save to start the sync.

URL patterns

Use patterns to control what content is synced:

Include patterns

Specify include patterns to sync only the URLs that match:

Pattern	Effect
`/docs/*`	Only sync pages under /docs
`/blog/2024/*`	Only sync 2024 blog posts
`/api/`	Only sync API documentation

Exclude patterns

Specify patterns to skip matching URLs:

Pattern	Effect
`/login`	Skip login page
`/admin/*`	Skip admin pages
`*.pdf`	Skip PDF files
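
To make the interaction between include and exclude patterns concrete, here is a minimal Python sketch. It is not Realm's implementation: it assumes patterns are glob-style and matched against the URL path, that exclude patterns take precedence over include patterns, and that an empty include list means "sync everything". The function name `should_sync` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_sync(url, include=(), exclude=()):
    """Decide whether a URL would be synced under the given patterns.

    Assumptions (not confirmed by the Realm docs): glob-style matching on
    the URL path, and exclude patterns win over include patterns.
    """
    path = urlsplit(url).path or "/"

    # Any matching exclude pattern skips the URL outright.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False

    # With no include patterns, everything not excluded is synced.
    if not include:
        return True

    # Otherwise the path must match at least one include pattern.
    return any(fnmatchcase(path, pattern) for pattern in include)


if __name__ == "__main__":
    urls = [
        "https://example.com/docs/setup",
        "https://example.com/docs/guide.pdf",
        "https://example.com/blog/2024/launch",
    ]
    for url in urls:
        print(url, should_sync(url, include=["/docs/*"], exclude=["*.pdf"]))
```

Under these assumptions, `/docs/*` combined with `*.pdf` keeps HTML pages under /docs and drops PDF files anywhere on the site. Check how Realm actually evaluates patterns before relying on a particular precedence.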

Use cases

Product Documentation

Sync public product docs from your website

Help Centers

Import external help center content

Competitor Research

Track competitor public documentation

Partner Content

Sync partner documentation and resources

Best practices

Respect robots.txt

Only scrape websites that allow crawling. Before adding a site, check its robots.txt file, which is served at /robots.txt on the site root.
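
As a rough way to run this check before adding a site, the sketch below uses Python's standard `urllib.robotparser` module to test a URL against robots.txt rules. The sample rules and the helper name `allowed_by_robots` are illustrative, and the user agent that Realm's crawler sends is not stated here, so the wildcard agent `*` is an assumption.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed when the text has already been downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; a real site's robots.txt will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    for url in ("https://example.com/docs/intro",
                "https://example.com/admin/users"):
        print(url, allowed_by_robots(SAMPLE_ROBOTS, url))
```

To check a live site instead of pasted text, `RobotFileParser` can fetch the file itself through `set_url()` followed by `read()`.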

Use Specific Patterns

Be specific with include/exclude patterns to avoid syncing irrelevant content.

Monitor Content

Periodically review scraped content to ensure it remains relevant and accurate.

Start Small

Begin with a specific section of a website before expanding to more pages.
