How to solve Siteimprove crawling issues

If your crawl is failing

There are a few reasons your crawl may fail:

  • Most common: Your security tool may perceive the Siteimprove crawler as a threat and block it. The solution? You need to add Siteimprove’s IP addresses to your site’s safelist/whitelist.

  • For very large sites, crawls may take so long that a new crawl is initiated before the previous crawl is complete. The solution? Submit a Siteimprove support request and request less frequent crawls.

  • Your home page redirects (Siteimprove can't use a redirected home page). The solution? Ideally, remove the redirect. Alternatively, you may use a sitemap as the indexing URL instead of the home page. Submit a Siteimprove support request to arrange this.

  • There's an issue with your robots.txt file. Please refer to Siteimprove's update to the robots.txt parser.

If you're certain that none of these reasons apply, submit a Siteimprove Support Request for assistance.


Sites on old, shared servers

If your site lives on an old, shared server, Siteimprove’s crawler may cause performance issues for your users while it’s crawling.

To solve this, try the following in the order that they appear:

  1. Use a robots.txt file to reduce the crawl speed. Review the DAP site’s robots.txt file as an example.
  2. If you cannot add a robots.txt file, talk to us about other options:
    1. Have Siteimprove reduce the crawl speed
    2. Reduce the crawl frequency (e.g., monthly or every other month instead of every 4-5 days)
    3. Limit the number of pages crawled (requires DAP approval)
    4. Add URL exclusions (requires DAP approval; may only be used for large numbers of highly repetitive pages)
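
A robots.txt file to slow the crawler might look like the sketch below. This is an illustration, not a confirmed configuration: the user-agent name and the delay value are assumptions you should verify against your server logs and Siteimprove's own documentation, and the non-standard Crawl-delay directive is only effective if the crawler honors it.

```txt
# robots.txt — served from the site root, e.g.
# https://plantsaregreat.berkeley.edu/robots.txt
# (hypothetical URL from this page's examples)

# Ask Siteimprove's crawler to wait 10 seconds between requests.
# "SiteimproveBot" is an assumed user-agent name — confirm the exact
# token in your access logs before relying on this rule.
User-agent: SiteimproveBot
Crawl-delay: 10
```

Because robots.txt rules are applied per user-agent, this slows only the matching crawler and leaves search engines and ordinary visitors unaffected.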

Very large sites (over 10,000 pages)

Issues that may occur with very large sites:

  1. Your crawl may fail if the previous crawl is still running when a new one starts. Submit a Siteimprove support request and ask to reduce the crawling frequency.
  2. You may have thousands of pages of highly repetitive results in your Siteimprove report. This can happen with calendar plug-ins that create thousands of pages for future dates, even when those dates are event-free. This increases crawling times and adds to the processing burden. The solution? Work with us to identify URLs that may be excluded. (Exclusions must be approved by DAP.)

Siteimprove isn’t finding all of your pages

Siteimprove is designed to work by using a domain or subdomain’s home page as the indexing URL. It starts on this page and then crawls through the site to locate all of the pages it contains.

If a page isn't linked to anywhere (an orphan), Siteimprove will not find it.

Sometimes a department has many websites that have evolved over the years, and for some of those sites, the home page is no longer used. It may be unpublished, or it may redirect to a different website, even though the site still contains other active content (or “subsites”).

Siteimprove cannot use a defunct or redirected home page as an indexing URL. And for several reasons, we don’t want page-level URLs (subsites) entered as websites in Siteimprove.

  • berkeley.edu is a domain. OK to use in SI.
  • plantsaregreat.berkeley.edu is a subdomain. OK to use in SI.
  • plantsaregreat.berkeley.edu/hydrangeas might be a page or a subsite (also called a subdirectory). Not OK to use in SI.
  • plantsaregreat.berkeley.edu/sitemap is a page also, but an exception will be made for this URL only.

So what do we do if plantsaregreat.berkeley.edu now redirects to plantsareawesome.berkeley.edu, but the old site still contains lots of content that we want to keep alive without moving it over?

We will use a sitemap to solve this problem.


How to create an HTML sitemap

A sitemap is a simple web page on your site that links to the active subsites you need Siteimprove to monitor. It does not have to include all of the pages of your site, because the crawler will find the pages that your subsites link to. (But it can include all of the pages if you prefer.)

For example, let’s say our subsite, plantsaregreat.berkeley.edu/hydrangeas acts as a landing page that links to these pages:

  • plantsaregreat.berkeley.edu/hydrangeas/oakleaf
  • plantsaregreat.berkeley.edu/hydrangeas/climbing
  • plantsaregreat.berkeley.edu/hydrangeas/mophead
  • plantsaregreat.berkeley.edu/hydrangeas/lacecap

Although we can list these child pages in our sitemap (and that’s great for accessibility), for Siteimprove’s crawler, we only have to include plantsaregreat.berkeley.edu/hydrangeas in our sitemap because the crawler will find the child pages through the links.

What a sitemap looks like

In our example, we'll create a subsite at the URL plantsaregreat.berkeley.edu/sitemap. This URL must never change; if it does, it will stop working in Siteimprove.

We'll include only the 2nd-level pages that function as subsites, because these link out to all the 3rd-level pages, which link out to 4th-level pages, and so on. If you have orphaned pages that are not linked to anywhere, Siteimprove will not find them.

Heading 1 (page title): Plants are Great - Sitemap

  • Hydrangeas 
  • Succulents
  • Cacti
  • Ficuses
  • Ferns
  • Azaleas
  • Flowering perennials
  • Roses

[Note: This example doesn’t use actual links because these aren’t real web pages, but you will need to use active hyperlinks.]
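
In HTML, that sitemap page could look like the sketch below. The URLs other than /hydrangeas are hypothetical, invented to match the plant names above; the essential point is that each entry is an ordinary hyperlink the crawler can follow.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Plants are Great - Sitemap</title>
</head>
<body>
  <!-- Heading 1 doubles as the page title for accessibility -->
  <h1>Plants are Great - Sitemap</h1>
  <ul>
    <li><a href="https://plantsaregreat.berkeley.edu/hydrangeas">Hydrangeas</a></li>
    <!-- Hypothetical URLs below — substitute your real subsite URLs -->
    <li><a href="https://plantsaregreat.berkeley.edu/succulents">Succulents</a></li>
    <li><a href="https://plantsaregreat.berkeley.edu/cacti">Cacti</a></li>
    <!-- …one link per remaining subsite (Ficuses, Ferns, Azaleas,
         Flowering perennials, Roses)… -->
  </ul>
</body>
</html>
```

A plain bulleted list of links like this is all the crawler needs; there is no special markup or metadata required for an HTML sitemap.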

That’s it! Easy peasy.

Once your sitemap is ready, submit a Siteimprove Support Request so we can add it as the indexing URL for your site.

If you decide to include all of the pages in your sitemap (or even just the 2nd and 3rd level child pages), there are free tools that can help you create a more detailed sitemap.

What about XML sitemaps?

XML sitemaps are different from HTML sitemaps. Siteimprove uses them too, but for the purpose of replacing a home page as the indexing URL, we want an HTML sitemap.


Resources

[Image: Visualization of a home page as a large dot with lines connecting to child pages as smaller dots.]


Image credit: Sitebulb