When you’re setting up new crawls with Sitebulb, you probably want to get to the finalised crawl data as soon as possible. If you find yourself sitting in front of the progress screen, wondering about the speed of your crawl, this is for you.
Let’s discuss Sitebulb’s crawl speed and how you can control it.
What Influences Crawl Speed
Crawl speed is influenced by a broad range of factors, some of which are within your control. In broad terms, the key things affecting crawl speed are:
Time to First Byte (TTFB)
Before Sitebulb can crawl and analyse your pages, it needs to send a request to your website server and wait for a reply. TTFB is the time between the HTTP request and when the first byte of a response begins to arrive. This means that the speed of your website server directly influences the speed of your crawl.
As Sitebulb crawls your pages, you will see the TTFB metric listed in the Progress screen under the URL Log section.
Sitebulb’s page processing time is added to the TTFB time for each page, and in turn, this processing time is influenced by your audit settings. We’ll discuss this below.
By default, Sitebulb applies a 10-second TTFB limit to each page. If the page does not respond within this timeframe, Sitebulb will move on to the next page, and you’ll see a corresponding timeout message in the final Audit. This is essentially a safeguard to ensure that the crawler doesn’t get stuck waiting for pages to load. You can’t change this setting.
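If you want a rough idea of a page's TTFB outside of Sitebulb, the sketch below times a single request using Python's standard library. It is only an approximation and not how Sitebulb measures TTFB: the timer also includes connection setup, it stops once the response headers have been read, and the timeout applies to each socket operation rather than to the request as a whole. The function name measure_ttfb and the constant TTFB_LIMIT_SECONDS are made up for this illustration.

```python
import http.client
import time
from urllib.parse import urlsplit

# Hypothetical constant mirroring the 10-second limit described above.
TTFB_LIMIT_SECONDS = 10.0


def measure_ttfb(url, timeout=TTFB_LIMIT_SECONDS):
    """Return the approximate seconds until response headers arrive.

    Returns None if the server does not answer in time or the request fails.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)

    # Rebuild the request target from the path and optional query string.
    path = parts.path or "/"
    if parts.query:
        path = f"{path}?{parts.query}"

    start = time.perf_counter()
    try:
        conn.request("GET", path)
        conn.getresponse()  # returns once the status line and headers are read
        return time.perf_counter() - start
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections and malformed responses.
        return None
    finally:
        conn.close()
```

Calling measure_ttfb("https://example.com/") returns a number of seconds, or None if the page did not respond in time.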
Type of Crawler
Your choice of crawl method will also impact the speed of your crawl:
The HTML Crawler (Default Option) uses 'traditional' HTML extraction. As such, it is notably faster than the Chrome Crawler.
The Chrome Crawler renders pages, including their JavaScript, using headless Chrome. This requires downloading all page resources, which results in a longer crawl time. If you need to crawl a website that heavily relies on JavaScript or uses a JavaScript framework, the Chrome Crawler is the preferred choice.
Audit Settings
Each audit and feature that you enable in your settings will have a small impact on the time it takes Sitebulb to process your URLs, but this is generally minimal. However, some Reports are particularly resource-intensive and will likely have a notable impact on speed when enabled. They include:
Performance
Spelling
Accessibility
Generally, we recommend that you only enable the reports that are relevant to your auditing goals. For more, check out our article on Choosing the right settings for efficient auditing.
Crawl Speed Settings
The Crawler Settings section of your audit setup gives you direct control over the speed of the crawl by allowing you to adjust the available resources and URLs per second limit.
Increasing the number of Threads or Instances of Chrome gives Sitebulb more resources, which can speed up the crawl.
Adding a Max HTML URLs per Second limit stops Sitebulb from crawling too fast, so in effect, it caps the speed of the crawl.
The Crawler Settings will be slightly different depending on whether you are crawling with the HTML or Chrome crawler. For a full breakdown of each setting, check out the Crawler Settings reference documentation.
Here’s how you can control crawl speed by adjusting your crawler settings:
Number of Threads or Chrome Instances
You will see this setting referred to as ‘Threads’ when crawling with the HTML crawler and ‘Chrome Instances’ when using the Chrome crawler. In both cases, it controls the amount of resources available to Sitebulb for crawling.
In the case of Chrome instances, for example, this setting is the number of headless Chrome instances Sitebulb will fire up concurrently to use for crawling, and therefore how many pages it will be fetching and rendering at any one time.
Max HTML URLs per Second
When the Limit URL Speed setting is enabled, Sitebulb will not download more than the specified number of HTML URLs per second. This limit refers specifically to internal HTML URLs.
The Max HTML URLs per Second limit is a bit like a speed limiter in a car: it does not guarantee that Sitebulb will reach or maintain the speed you have set, but rather it limits the maximum speed you can ever hit, even if you put your foot down (that is, max out the available resources).
As Sitebulb crawls, you can monitor the speed of your crawl in your Progress screen:
It is worth noting that the Avg.Speed metric in the progress screen accounts for ALL URLs processed - so this speed could be higher than your Max HTML URLs per Second limit, as it accounts for HTML URLs + page resources and external URLs.
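To see why the average speed can exceed the HTML limit, here is a small worked example. The URL counts and the 60-second window are made-up numbers for illustration, not figures from a real crawl, and the helper urls_per_second is hypothetical.

```python
def urls_per_second(url_count, elapsed_seconds):
    """Average number of URLs processed per second over a time window."""
    return url_count / elapsed_seconds


# Hypothetical counts for one minute of crawling.
elapsed_seconds = 60
html_urls = 300       # internal HTML URLs (the only ones the limit applies to)
resource_urls = 600   # page resources such as images, CSS and JavaScript
external_urls = 100   # external URLs

html_speed = urls_per_second(html_urls, elapsed_seconds)
overall_speed = urls_per_second(
    html_urls + resource_urls + external_urls, elapsed_seconds
)

print(f"HTML URLs per second: {html_speed:.1f}")     # 5.0
print(f"All URLs per second:  {overall_speed:.1f}")  # 16.7
```

In this made-up case, a crawl that stays within a limit of 5 HTML URLs per second would still show an overall average of around 16.7 URLs per second.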
Render Timeout
When crawling with Chrome, you also need to account for the time needed to render the page and all of its resources. The render timeout setting determines how long Sitebulb will pause to wait for content to render before parsing the HTML. Increasing this value will slow down Sitebulb, as the crawler waits longer for each page to render before it can move on.
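As a rough illustration of that effect, the sketch below adds up the time one Chrome instance spends on a single page. It is a simplified model, not Sitebulb's internals: it assumes the crawler waits for the full render timeout on every page, and the function names and timings are hypothetical.

```python
def seconds_per_page(ttfb, render_wait, processing):
    """Simplified per-page time: fetch, wait for rendering, then process."""
    return ttfb + render_wait + processing


def pages_per_second_per_instance(ttfb, render_wait, processing):
    """How many pages a single Chrome instance could complete per second."""
    return 1 / seconds_per_page(ttfb, render_wait, processing)


# Hypothetical timings in seconds: 0.5 TTFB and 0.5 processing in both cases.
short_wait = pages_per_second_per_instance(0.5, render_wait=1.0, processing=0.5)
long_wait = pages_per_second_per_instance(0.5, render_wait=5.0, processing=0.5)

print(f"1s render wait: {short_wait:.2f} pages/second per instance")  # 0.50
print(f"5s render wait: {long_wait:.2f} pages/second per instance")   # 0.17
```

Under these assumptions, raising the render wait from 1 second to 5 seconds cuts the throughput of each instance to a third.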
Conclusion and Caveats
The Threads or Chrome Instances setting is used in conjunction with the Max HTML URLs per Second limit to control crawl speed. If you remove the URLs/second limit, Sitebulb will crawl as fast as it is able to with the number of instances selected.
For example, if you select 5 Chrome instances, Sitebulb will be fetching/rendering 5 pages at any one time. Assuming we have not set a limit, how many URLs can be crawled per second is entirely based on how quickly the pages can be fetched (TTFB) and rendered. If your pages have a TTFB of 500ms, for example, it will take half a second before Sitebulb can even start rendering the page.
So to some extent, it takes a little bit of experimentation to find the right balance for each site you need to crawl. If, while crawling a site, Sitebulb takes 2 seconds to fetch, render, and fully process one URL, you may need 10 instances of Chrome in order to steadily crawl at 5 URLs/second. On speedier sites or with different crawl settings, you might only need 5 or 6 instances of Chrome to achieve the same result.
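The arithmetic behind these numbers can be sketched as follows. This is a back-of-the-envelope model that assumes every instance is kept busy and every URL takes the same time. The function names are hypothetical and are not part of Sitebulb.

```python
import math


def instances_needed(target_urls_per_second, seconds_per_url):
    """Instances required to sustain a target speed at a given per-URL time."""
    return math.ceil(target_urls_per_second * seconds_per_url)


def estimated_speed(instances, seconds_per_url, max_urls_per_second=None):
    """Estimated URLs per second, capped by the optional speed limit."""
    speed = instances / seconds_per_url
    if max_urls_per_second is not None:
        speed = min(speed, max_urls_per_second)
    return speed


# The example from the text: 2 seconds per URL, aiming for 5 URLs/second.
print(instances_needed(5, 2.0))   # 10

# A speedier site that needs only 1 second per URL.
print(instances_needed(5, 1.0))   # 5

# The limit acts as a cap: 10 instances could manage 5 URLs/second,
# but a limit of 3 holds the crawl at 3.
print(estimated_speed(10, 2.0, max_urls_per_second=3))  # 3

# With too few instances, the limit is never reached.
print(estimated_speed(5, 2.0, max_urls_per_second=10))  # 2.5
```

Real crawls will vary from page to page, so treat the result as a starting point for experimentation rather than a guarantee.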
Decreased Crawl Speed Example
Limit URL Speed: yes
Instances of Chrome: 1
Max HTML URLs Per Second: 1
With the above settings applied, the Avg Speed is around 0.6 URLs per second, and in this 1 min clip Sitebulb managed to crawl 28 URLs. This is significantly slower than a normal everyday crawl, which is really helpful if you need to crawl a site at a slower rate, for instance if you're getting a 429 Too Many Requests response code.
Increased Crawl Speed Example
Limit URL Speed: yes
Instances of Chrome: 8
Max HTML URLs Per Second: 10
With the crawl settings set to higher values, the crawl is much quicker: the speed is between 7 and 11 URLs per second. In this 1 min clip, Sitebulb managed to crawl 194 URLs, significantly more than in the decreased crawl speed example above.