The DatoCMS Site Search crawler is an in-house spider we created, not a standard library. This page explains its behavior.
The crawling process starts from the URL you configure as the "Website frontend URL" in your build trigger settings. From there, it will recursively follow all the hyperlinks pointing to your domain. It will also look for URLs you provide in sitemaps (see below).
The User-Agent used by our crawler is DatoCMSSearchBot.
DatoCMSSearchBot respects the robots.txt directives user-agent, allow, and disallow (case-insensitive). We also support a simple * wildcard and the $ end-of-path indicator (ignoring possible query strings).
In the example below, DatoCMSSearchBot won't crawl documents that are under /do-not-crawl/ or /not-allowed/, or that end in .json.
User-agent: DatoCMSSearchBot   # DatoCMS's user agent
Disallow: /do-not-crawl/       # disallow this directory
Disallow: /*.json$             # disallow all JSON files

User-agent: *                  # any robot
Disallow: /not-allowed/        # disallow this directory
DatoCMSSearchBot does not currently support the crawl-delay directive in robots.txt, nor robots meta tags on HTML pages such as nofollow and noindex.
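As an illustration, a standard robots meta tag like the following is currently ignored by DatoCMSSearchBot, so it won't prevent a page from being crawled or indexed:

<meta name="robots" content="noindex, nofollow">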
If your robots.txt requires more complex rules, keep in mind that DatoCMSSearchBot only respects the first matching Allow or Disallow directive for any given URL. For example, with a robots.txt like:
# INCORRECT robots.txt example that accidentally disallows /other/
User-agent: DatoCmsSearchBot
Allow: /blog/
Disallow: /
Allow: /other/
/blog/my-article, /blog/2/, etc. will be crawled.
/other/my-page will NOT be crawled, even though it's Allowed, because it would have first matched the Disallow: / line right above it.
In other words, ALL paths not starting with /blog will be skipped because of the Disallow: / rule. This is counterintuitive because it is the order of the directives, not their specificity, that our crawler respects.
If you wish to allow both /blog/ and /other/, then you MUST place their Allow directives first, like:
# Correct robots.txt example that allows both /blog/ and /other/
User-agent: DatoCmsSearchBot
Allow: /blog/
Allow: /other/
Disallow: /
This way, everything under /blog/ and /other/ will be crawled.
Everything else will be skipped.
In addition to following the links found within pages, if your website provides a Sitemap file, the crawler will utilize it as an extra source of URLs to crawl. Sitemap Index files are also supported.
The crawler will first look for sitemap directives in the robots.txt file. If a robots.txt file does not exist, or it does not offer any sitemap directive, the crawler will try /sitemap.xml at the root of your domain.
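For example, a robots.txt can declare the sitemap location with a sitemap directive (the domain below is just a placeholder):

Sitemap: https://www.example.com/sitemap.xml

The sitemap itself is a standard sitemaps.org XML file listing the URLs you want crawled, along the lines of this minimal sketch (the URL is again a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/my-article</loc>
  </url>
</urlset>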
Any link pointing to a domain different from the one configured as the "Website frontend URL" in your build trigger settings will be ignored by the bot.
We detect the language of every crawled page through the HTML global lang attribute, or through language-detection heuristics when the attribute is missing, so that indexing happens with proper stemming.
That is, if the visitor searches for "cats", we'll also return results for "cat", "catlike", "catty", etc.
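For example, a page can declare its language explicitly on the root element (English here is just an illustration):

<html lang="en">
  ...
</html>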
The crawler does not execute JavaScript on the spidered pages; it only parses plain HTML. If your website is a Single Page App, you'll need to set up pre-rendering to make it readable by our bot.
To give your users the best experience, it's often useful to instruct DatoCMSSearchBot to exclude certain parts of your pages from indexing, e.g. website headers and footers. Those sections are repeated on every page, so they can only degrade your search results.
To do that, you can simply add a data-datocms-noindex attribute to the HTML elements of your page you want to exclude: everything contained in those elements will be ignored during indexing.
<body>
  <div class="header" data-datocms-noindex>
    ...
  </div>
  <div class="main-content">
    ...
  </div>
  <div class="footer" data-datocms-noindex>
    ...
  </div>
</body>
The time needed to complete a crawling operation depends on the number of pages on your website and on your hosting's performance, but it normally proceeds at roughly 20 indexed pages per second.