Go to the Project Settings > Build triggers section of your project and select a build trigger;
Check the Site search option and specify your Website frontend URL: that's the address from which crawling will begin;
Press the Publish changes button. This will start a rebuild of the frontend, followed by a spidering of the website.
Anytime you want, you can also trigger a respidering of your frontend using a specific CMA endpoint.
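As a sketch of what such a call could look like, here's a minimal Python snippet that builds the respider request. The endpoint path (`/build-triggers/:id/reindex`), the token, and the build trigger ID are assumptions for illustration — check the Content Management API documentation for the exact route and required headers.

```python
import urllib.request

# Placeholders — substitute your own values.
API_TOKEN = "YOUR_FULL_ACCESS_API_TOKEN"
BUILD_TRIGGER_ID = "4567"

def build_reindex_request(build_trigger_id: str, token: str) -> urllib.request.Request:
    """Builds (but does not send) a POST request asking DatoCMS to respider the site.

    Note: the endpoint path below is an assumption for illustration;
    verify it against the official CMA reference before using it.
    """
    url = f"https://site-api.datocms.com/build-triggers/{build_trigger_id}/reindex"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )

req = build_reindex_request(BUILD_TRIGGER_ID, API_TOKEN)
print(req.get_method(), req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would then enqueue a fresh spidering of your frontend without republishing the site.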
Once the publishing of the website ends, in the Project Settings > Deployment > Activity log section you will see that DatoCMS will start spidering your website. When the spidering ends (it may take a while, depending on the size of your website), you'll see a Site spidering completed with success event in your log.
Clicking on the Show details link will show you the complete list of spidered pages.
The spidering starts from the URL you configure as Website frontend URL in your build trigger settings, and recursively follows all the hyperlinks pointing to your domain. If your website has a Sitemap file (sitemap.xml under the root of your domain), we'll use it as well. Sitemap Index files are also supported.
By default, our spider will look for sitemap.xml under the root of your domain. If you're using a sitemap index, the index itself should also be named sitemap.xml, but the sitemaps it links to can be named anything you like.
Alternatively, you can use a robots.txt file to specify the location of your sitemap(s), using directives like Sitemap: https://example.com/sitemap.xml or Sitemap: https://example.com/sitemap-index.xml, each on its own line.
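For example, a minimal robots.txt exposing two sitemaps could look like this (the filenames are illustrative):

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-index.xml
```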
Through the HTML global lang attribute present on a page (or language-detection heuristics, if it's missing), we detect the language of every spidered page, so that indexing happens with proper stemming. That is, if a visitor searches for "cats", we'll also return results for "cat", "catlike", "catty", etc.
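Declaring the language explicitly lets the indexer skip the heuristics and pick the right stemmer straight away. For an English page, that's just:

```html
<!-- The lang attribute on the root element declares the page language -->
<html lang="en">
  <!-- ... page content ... -->
</html>
```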
The crawler does not execute JavaScript on the spidered pages; it only parses plain HTML. If your website is a Single Page App, you'll need to set up pre-rendering to make it readable by our bot. The User-Agent used by our crawler is DatoCmsSearchBot.
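If you serve pre-rendered HTML only to crawlers, you can recognize ours by its User-Agent token. A minimal sketch in Python (the full User-Agent string in the example is assumed; only the DatoCmsSearchBot token comes from the docs):

```python
def is_datocms_search_bot(user_agent: str) -> bool:
    """Returns True when the request comes from the DatoCMS search crawler."""
    return "DatoCmsSearchBot" in user_agent

# Hypothetical incoming request header: serve the pre-rendered
# snapshot of the page when the crawler is detected.
ua = "Mozilla/5.0 (compatible; DatoCmsSearchBot)"
serve_prerendered = is_datocms_search_bot(ua)
print(serve_prerendered)  # → True
```

The same check can live in whatever layer routes requests to your pre-renderer (an edge function, a reverse proxy rule, etc.).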
The time needed to complete the spidering depends on the number of pages in your website and your hosting's performance, but the crawler normally indexes around 20 pages per second; a 6,000-page site, for example, takes roughly five minutes.