Concurrent Crawling

Crawling multiple websites concurrently on Sitebulb Cloud

Updated over a week ago

For faster and more efficient auditing, Large and Custom Sitebulb Cloud plans allow you to crawl multiple websites concurrently.

To do this effectively, you need to take into account the number of resources available, the resources allocated to each crawl, and the speed at which you wish to crawl.

Resources = Threads or Chrome Instances

The underlying resource you need to take into account is available 'Threads' when using the HTML Crawler, or 'Chrome Instances' when using the Chrome Crawler. The size and capacity of your Cloud server determine the number of resources available for crawling.

You will see the maximum number of available resources on your Cloud server in the highlighted message under Crawler Settings:

This is the maximum number of threads or Chrome instances that can be used for crawling.

By default, Sitebulb is set to crawl with 5 threads or instances of Chrome. This setting can be adjusted under Crawler Settings to control the speed of the crawler.

Distributing resources for concurrent crawling

For audits to run concurrently, the total number of resources allocated across all the audits must not exceed the total number of resources available. With a total of 24 available threads, you could run three audits with 8 threads each, concurrently.

If you were to adjust one of your crawls to use 10 threads, a second crawl to use 8 threads, and a third to use 8 threads, these would not all be able to run concurrently, since together they need 26 threads and only 24 are available. The third audit will be queued until enough resources become available.
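The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule described here, not Sitebulb's actual scheduler: the function name `plan_audits` and the audit names are hypothetical, and the sketch assumes audits are started strictly in the order they were requested.

```python
def plan_audits(requested, available):
    """Split audits into those that can start now and those that must queue.

    requested: list of (audit_name, threads) pairs, in start order.
    available: total threads (or Chrome instances) on the Cloud server.
    Assumption: strict first-come, first-served. Once one audit has to
    queue, every later audit queues behind it.
    """
    running, queued = [], []
    used = 0
    for name, threads in requested:
        if not queued and used + threads <= available:
            running.append(name)
            used += threads
        else:
            queued.append(name)
    return running, queued, available - used


# Three audits at 8 threads each fit exactly into 24 threads.
print(plan_audits([("site-a", 8), ("site-b", 8), ("site-c", 8)], 24))
# (['site-a', 'site-b', 'site-c'], [], 0)

# 10 + 8 + 8 = 26 threads, so the third audit has to wait.
print(plan_audits([("site-a", 10), ("site-b", 8), ("site-c", 8)], 24))
# (['site-a', 'site-b'], ['site-c'], 6)
```

In the second case, 6 threads are left idle while the third audit waits, so lowering the first crawl to 8 threads would let all three run together.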

Things to consider

To fully take advantage of concurrent crawling, there are a few things to consider about how you set up your audits:

#1 Speed affects everything

With Sitebulb Cloud you could choose to crawl every site as fast as possible, using all the resources available for maximum speed. This would, however, remove the ability to crawl concurrently, as all the threads would be used for one audit.

On top of that, we always advise caution with speed, since crawling a website too fast can cause the website server to crash. Instead, we recommend crawling responsibly, which Sitebulb will try to do by default.

Typically it makes more sense to have 2 or 3 audits running at slower speeds, but you may certainly find exceptions to this rule.

#2 Think about your biggest websites

You may have one or two massive websites that will take a long time to crawl - Sitebulb Cloud can handle millions of URLs, so you might have these ticking away in the background for several days while other audits are being run at the same time.

If this is the case, it would make sense to space out your big sites so you are not crawling them at the same time, and other smaller audits can be completed while your big crawls are running.

#3 Don't schedule everything all at once

Sitebulb Cloud is perfectly suited to run regular recurring audits. But scheduling all your monthly client audits for the first day of the month means that they will all be drawing from the same pool of resources at once - and you may end up with lots of queued audits.

Instead, it would be reasonably straightforward to schedule your audits across the first few days of the month. This will allow you to manage available resources while still retaining the desired 'data ready when you need it.'

Make sure that you pay attention to the Crawler Settings, so resources are well distributed among the audits scheduled to run on the same day.
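If you want to plan this on paper before setting up your schedules, the idea can be sketched as a simple "first fit" allocation. This is a rough planning aid under stated assumptions, not a Sitebulb feature: the function name `spread_over_days` and the client names are hypothetical, and it assumes every audit scheduled for a given day may be running at the same time.

```python
def spread_over_days(audits, available, days):
    """Assign each audit to the first day whose thread budget still has room.

    audits: list of (audit_name, threads) pairs.
    available: total threads (or Chrome instances) on the Cloud server.
    days: number of days to spread the schedule across.
    Largest audits are placed first, so they get spaced out before the
    smaller ones fill the gaps.
    """
    schedule = [[] for _ in range(days)]
    used = [0] * days
    unplaced = []
    for name, threads in sorted(audits, key=lambda a: a[1], reverse=True):
        for day in range(days):
            if used[day] + threads <= available:
                schedule[day].append(name)
                used[day] += threads
                break
        else:
            # No day has enough spare threads for this audit.
            unplaced.append(name)
    return schedule, unplaced


audits = [("client-a", 10), ("client-b", 8), ("client-c", 8), ("client-d", 5)]
print(spread_over_days(audits, available=24, days=3))
# ([['client-a', 'client-b', 'client-d'], ['client-c'], []], [])
```

Anything returned in the second list would not fit within the chosen number of days at its current thread setting, which suggests either adding a day or lowering the threads for some crawls.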

It also makes sense to set weekly/monthly scheduled audits to run outside of working hours, so if you need to run other audits during the day, the resources will be available to do this.

#4 Adjust your Concurrent Crawling settings

You may have more than one queued audit at any one time. Your admin user can adjust how this queue is handled, and how many audits can run at any one time under Server Settings:
