Troubleshooting completed Audits with errors | Sitebulb Support

Sitebulb has several mechanisms designed to help you set up successful crawls, such as pre-audit and domain resolution checks. Occasionally, however, an audit appears to start and complete successfully, but a lack of crawled URLs or the final data indicates that Sitebulb ran into errors.

In this guide, we’ll address troubleshooting completed crawls that have errors. Follow the steps below to diagnose your crawl and implement the necessary solutions to ensure that Sitebulb can find all the relevant URLs on your site and successfully complete the crawl.

If your crawl has completed with errors, you will most likely notice that the final Audit has returned fewer URLs than expected. The error may also be flagged by Sitebulb in the Audit overview. To diagnose why this may be the case, follow these steps:

Check the audit overview for error messages

In some cases (see the examples below), Sitebulb will be able to flag up the error straight away with a message displayed on your Audit Overview. If this is the case, jump to the next section to find the relevant error and how to troubleshoot it.

Check your Internal Report for errors and failures

If you suspect your crawl is not complete, but no error message is present in the Audit Overview, check the HTTP responses of your internal URLs for errors and for more information about the crawl by following these steps:

Navigate to your Internal URLs report - you’ll find it in the left-hand navigation menu of your Audit.

Review the distribution of the HTTP responses returned by the crawled pages by looking at the ‘Crawled URLs by Depth’ and ‘HTTP Status Codes’ graphs in the report overview.

Identify Timeouts, Forbidden, and Error URLs - if the crawl has encountered errors, you should be able to identify these straight away within the graphs. You can then jump into the respective URL Lists for more information about the server response.
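
If you export the status codes from a URL List, the same triage can be done in a few lines of code. The following is a minimal Python sketch, not part of Sitebulb: the bucket names and the sample list of status codes are assumptions made for illustration only.

```python
from collections import Counter


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse bucket for triage.

    The bucket names are illustrative and are not Sitebulb's own labels.
    """
    if code == 403:
        return "forbidden"
    if code == 408:
        return "timeout"
    if code == 429:
        return "rate_limited"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "other"


def summarise(codes):
    """Count how many URLs fall into each bucket."""
    return Counter(classify_status(code) for code in codes)


# Hypothetical status codes, standing in for an exported URL List.
sample_codes = [200, 200, 301, 403, 403, 408, 429, 500, 503, 404]
summary = summarise(sample_codes)
```

A large share of URLs in the forbidden, timeout, rate_limited, or server_error buckets points to the matching troubleshooting section below.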

Once you have identified the errors, move on to the next section to learn how to troubleshoot them.

Audit stopped due to a 429 (Too Many Requests) error

An HTTP status response of 429 ‘Too many requests’ is a method of rate limiting by the website server. This happens when a website server employs software to detect unnatural behaviour and blocks requests if it sees too many within a given time period.
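
To illustrate how this kind of rate limiting can work, here is a minimal Python sketch of a sliding-window limiter. It is an assumption about how a server might behave, not a description of any particular server or of Sitebulb itself; the class name and the limit of 5 requests per second are hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` within any `window_seconds` period."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # arrival times of accepted requests

    def handle(self, now: float) -> int:
        """Return the HTTP status the server would send for a request at `now`."""
        # Forget accepted requests that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return 429  # Too Many Requests
        self._timestamps.append(now)
        return 200


# A client sending 10 requests per second exceeds the hypothetical limit.
limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)
fast = [limiter.handle(now=i * 0.1) for i in range(10)]

# The same client at 4 requests per second stays within it.
limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)
slow = [limiter.handle(now=i * 0.25) for i in range(10)]
```

In this sketch the fast client receives 429 responses once it has used up its allowance for the window, while the slower client is never blocked.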

Once Sitebulb runs into a 429 ‘Too many requests’ response, your audit will stop, even if it has not crawled all of the queued URLs. We do this to avoid producing a completed audit in which many URLs carry a 429 response code, which would provide no useful data.

If the audit has stopped due to encountering a 429 response, you will see this indicated in the audit overview with the following message:

“Audit Stopped!

The audit stopped early because: Website Crawler: The Website returned a HTTP Status of 429 ‘Too Many Requests’. Audit was stopped, as no more URLs could be audited.”

How to address 429 HTTP responses

Slow down the crawl

Some servers may have limits on how many requests a client can send in a specified amount of time, so slowing down your crawl can help you stay within this limit and finish your audit successfully.
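
As an illustration of what a URLs-per-second limit does, the sketch below spaces requests evenly so that no more than a set number are sent each second. It is a simplified Python model, not Sitebulb's implementation; the `Throttle` class and the value of 2 requests per second are assumptions, and a fake clock is used so the example runs instantly.

```python
import time


class Throttle:
    """Space calls so that at most `max_per_second` happen each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may go out

    def wait(self):
        """Block until the next request is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval


# Demo with a fake clock: "sleeping" just advances the recorded time.
fake_now = [0.0]
sleeps = []


def fake_sleep(seconds):
    sleeps.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_per_second=2, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real crawler would fetch one URL after each wait
```

Halving the rate doubles the gap between requests, which is why a lower limit makes it less likely that the server's threshold is reached.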

You can limit the speed of your crawl by lowering the URLs per second limit under crawl settings.

Adjust the settings of your project, then use the green ‘Finish this audit and update the audit reports and hints’ button on your Audit Overview to restart the crawl and finish auditing.

Whitelist Sitebulb

If slowing down the crawl does not work, you will need to get your IP address or the crawl User Agent whitelisted to be able to crawl this site.

You can do this by either:

Whitelisting Sitebulb’s IP address
Whitelisting the User Agent you are crawling with

Both of these methods are explained in our guide on How to whitelist Sitebulb for crawling.

Audit stopped early due to a 403 response

If your website requires authentication (i.e., returns a 401 or 403 status), you will most likely see this flagged up at the pre-audit and domain resolution checks.

However, in some cases, you may find that while the start URL does return an HTTP status of 200 ‘Success’, subsequent pages or sections of your site do not return the same successful status, leading to an incomplete crawl.

When a URL returns a 403 ‘Forbidden’ response, Sitebulb will not be able to successfully crawl the page, and therefore, no content or internal links are collected.

How to address 403 HTTP responses

Whitelist Sitebulb

A 403 HTTP response indicates a lack of authorization, so you will need to whitelist Sitebulb on your website server(s) by either:

Whitelisting Sitebulb’s IP address
Whitelisting a custom User Agent, or choosing an existing whitelisted UA within your Audit Settings

Both of these methods are explained in our guide on How to whitelist Sitebulb for crawling.
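
One way to check whether a server treats User Agents differently is to request the same URL with each one and compare the status codes. The following Python sketch uses only the standard library; the function names are hypothetical, and you would supply your own URL and User Agent strings. It does not reproduce Sitebulb's crawler, and it cannot test IP-based whitelisting, because the request comes from whichever machine runs the script.

```python
import urllib.error
import urllib.request


def status_for(url, user_agent, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for this User Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises 4xx and 5xx responses as exceptions; keep the code.
        return error.code


def compare_user_agents(url, user_agents, opener=urllib.request.urlopen):
    """Map each User Agent string to the status code returned for `url`."""
    return {ua: status_for(url, ua, opener) for ua in user_agents}
```

If the whitelisted User Agent returns 200 while another returns 403, the block is likely based on the User Agent rather than the IP address.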

Audit stopped early due to a server error (5xx)

HTTP 5XX server errors are generic responses that catch server or website errors. They are usually not related to Sitebulb, your machine, or your internet connection.

If your pages do not return a successful response from the server, they will not be rendered correctly, and therefore their content and internal links won’t be processed. This means that Sitebulb will not be able to discover new links, causing the crawl to finish early.

How to address server errors

5XX errors are difficult to troubleshoot because a range of issues can trigger them, and they are often transient, so we recommend trying to audit this website again at a different time.

When you’re ready to try again, use the green ‘Re-audit Failed URLs’ button on the Audit Overview to restart the crawl.

If the errors persist, ask your developer or server admin to check the server logs to diagnose the issue.

Audit stopped early due to a 408 response

A 408 HTTP response indicates that Sitebulb did not receive a timely response from the server, so the page rendering timed out. This can be the result of the server being overloaded.

How to address 408 HTTP responses

Slow the crawl down

Since 408 responses are likely to be caused by the server being overloaded, you can try re-crawling at a slower speed.

You can limit the speed of your crawl by lowering the ‘Max HTML URLs per second’ limit and increasing the ‘Render Timeout’ under crawl settings.

Crawl during off-peak hours

We also recommend re-crawling your website during quieter periods, such as overnight or at weekends, so that you are not adding requests while the server is already handling high traffic.

You can implement this by using Sitebulb’s scheduling settings to determine when the crawl should start.

Try re-auditing failed URLs

Once you have adjusted your crawl speed and schedule, try re-auditing your failed URLs.

You should see the option to ‘Re-Audit Failed URLs’ just under the crawl details in your Audit Overview.

If you hit this button, Sitebulb will queue up just the failed URLs (Not Found, Errors, etc.) and crawl them in order to extract content and any other missed internal links.

Missing URLs but no crawl errors?

If your crawl seems to have run successfully, but you believe not all relevant pages were picked up by the crawler, you may simply need to adjust your settings and crawl sources in order to ensure Sitebulb can find all URLs.

Check out the recommendations in this document:

Ensuring Sitebulb can find all URLs on your website

Audit failed and no URLs were crawled

How to Whitelist Sitebulb for Crawling

Problems starting an audit: Pre-audit and domain resolution failures

Using SPA to Troubleshoot Audit Settings

Choosing the right settings for efficient auditing