DatoCMS Site Search

How the crawling works

The DatoCMS Site Search crawler is an in-house spider we created, not a standard library. We will try to explain its behaviours here.

The crawling process starts from the URL you configure as the "Starting URL" in your build trigger settings. From there, it will recursively follow all the hyperlinks pointing to your domain. It will also look for URLs you provide in sitemaps (see below).

User Agent

The User-Agent used by our crawler is DatoCmsSearchBot.

How can I control what pages will be crawled on my site?

DatoCmsSearchBot respects the robots.txt directives user-agent, allow, and disallow (case-insensitive). We also support a simple * wildcard and the $ end-of-path indicator (ignoring possible query strings).

In the example below, DatoCmsSearchBot won't crawl documents that are under /do-not-crawl/ or /not-allowed/ or that end in .json.

User-agent: DatoCmsSearchBot      # DatoCMS's user agent
Disallow: /do-not-crawl/          # disallow this directory
Disallow: /*.json$                # disallow all JSON files

User-agent: *                     # any robot
Disallow: /not-allowed/           # disallow this directory

DatoCmsSearchBot does not currently support the crawl-delay directive in robots.txt and robots meta tags on HTML pages such as nofollow and noindex.

At the moment we do not support robots.txt with multiple groups for the same user agent.

Allow and Disallow: Order matters!

If your robots.txt requires more complex rules, DatoCmsSearchBot only respects the first matching Allow or Disallow directive for any given URL. For example, with a robots.txt like:

# INCORRECT robots.txt example that accidentally disallows /other/

User-agent: DatoCmsSearchBot
Allow: /blog/
Disallow: /
Allow: /other/

/blog/my-article, /blog/2/, etc. will be crawled
/other/my-page will NOT be crawled, even though it's Allowed, because it would've first matched the Disallow: / line right above it
In other words, ALL paths not starting with /blog will be skipped because of the Disallow: / rule. This is counterintuitive because it is the order of the directives, not their specificity, that our crawler respects.

If you wish to allow both /blog/ and /other/, then you MUST place their Allow directives first, like:

# Correct robots.txt example that allows both /blog/ and /other/

User-agent: DatoCmsSearchBot
Allow: /blog/
Allow: /other/
Disallow: /

This way, everything under /blog/ and /other/ will be crawled
Everything else will be skipped

Sitemaps

In addition to following the links within pages, if your website provides a Sitemap file, the crawler will use it as an additional source of URLs to crawl. Sitemap Index files are also supported.

The crawler will first look for sitemap directives in the robots.txt file. If a robots.txt file does not exist, or it does not offer any sitemap directive, the crawler will try with /sitemap.xml under the root of your domain.

Ensure the URLs in your sitemaps match your domain!

Any link to domains different than the one configured as the "Website frontend URL" in your build trigger settings will be ignored by the bot.

Using a User-Agent custom suffix

For each search index, you can specify a custom suffix to be added to the standard User-Agent: for example, if you define "Docs", our crawler will use DatoCmsSearchBotDocs as a User-Agent.

By using a custom User-Agent and a custom-tailored robots.txt, you can restrict the crawling to specific subsets of your website and therefore provide even more refined search experiences.

Here is an example: with the following robots.txt, you can create two separate search indexes (one with the suffix "Docs", the other with the suffix "Blog") for the documentation and the blog sections of a website:

User-agent: DatoCmsSearchBotDocs
Allow: /docs/
Disallow: /

User-agent: DatoCmsSearchBotBlog
Allow: /blog/
Disallow: /

Language Detection

Through the HTML global lang attribute present on a page — or language-detection heuristics, if the attribute is missing — we detect the language of every crawled page, so that indexing will happen with proper stemming.

That is, if the visitor searches for "cats", we'll also return results for "cat", "catlike", "catty", etc.

Plain HTML only

The crawler does not execute JavaScript on the spidered pages, it only parses plain HTML. If your website is a Single Page App, you'll need to setup pre-rendering to make it readable by our bot.

Excluding content from indexing

To give your users the best experience, it's often useful to instruct DatoCmsSearchBot to exclude certain parts of your pages from indexing — ie. website headers and footers. Those sections are repeated in every page, thus can only degrade your search results.

To do that, you can simply add a data-datocms-noindex attribute to the HTML elements of your page you want to exclude: everything cointained in those elements will be ignored during indexing.

1
<body>
2
  <div class="header" data-datocms-noindex>
3
    ...
4
  </div>
5
  <div class="main-content">
6
    ...
7
  </div>
8
  <div class="footer" data-datocms-noindex>
9
    ...
10
  </div>
11
</body>

Crawling time

The time needed to finish the crawling operation depends on the number of pages in your website and your hosting's performances, but normally it's about ~20 indexed pages/sec.