Concurrent Crawling

Crawling multiple websites concurrently on Sitebulb Cloud

Updated over 4 months ago

On the 'Large' and 'Custom' Sitebulb Cloud plans, you can crawl multiple websites simultaneously. To do this effectively, you need to consider the resources available and how fast you want to crawl.

Resources = threads

The underlying resource you need to take into account is 'threads'. You will see threads mentioned in Sitebulb's Crawler Settings:

Threads warning

The highlighted notice explains the maximum number of threads that can be used for crawling - this depends on the size of your cloud package (smaller plans = fewer threads).

Just underneath, there is a dropdown entitled Number of Threads, which can be adjusted to fine-tune the speed of the audit. More threads mean a faster crawl, and fewer threads slow the audit down.

If you kept the crawler at the default value of 5 threads, you could run 4 audits at the same time, with 5 threads each (because 5 x 4 = 20 - the maximum number of threads available on the cloud server in this example).

However, imagine you adjusted the crawler to use all 20 threads, and then set your audit running. Since this one audit is now using every available thread, any other audits you try to run at the same time cannot start - they are queued and wait until this audit has finished.
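The thread arithmetic in the two scenarios above can be sketched in a few lines of Python. This is only an illustrative model, not Sitebulb's actual scheduler: the function name allocate_audits, the site names, and the rule that an audit starts only when its full thread count is free are all assumptions made for the example.

```python
def allocate_audits(max_threads, requested):
    """Split requested audits into running and queued lists.

    requested is a list of (audit_name, threads) pairs in start order.
    Simplifying assumption: an audit starts only if its full thread
    count is free; otherwise it waits in the queue.
    """
    running, queued = [], []
    free = max_threads
    for name, threads in requested:
        if threads <= free:
            running.append(name)
            free -= threads
        else:
            queued.append(name)
    return running, queued, free


# Five audits at the default of 5 threads on a 20-thread server.
default_speed = [("site-a", 5), ("site-b", 5), ("site-c", 5),
                 ("site-d", 5), ("site-e", 5)]
print(allocate_audits(20, default_speed))
# (['site-a', 'site-b', 'site-c', 'site-d'], ['site-e'], 0)

# One audit using all 20 threads leaves nothing for the next one.
full_speed = [("site-a", 20), ("site-b", 5)]
print(allocate_audits(20, full_speed))
# (['site-a'], ['site-b'], 0)
```

In both cases the server ends up with 0 free threads; the difference is whether those threads are shared across four audits or consumed by one.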

Threads = Chrome instances

The above example applies to the HTML Crawler. If you switch to the Chrome Crawler (which renders JavaScript), you will see a slightly different message:

Chrome Instances

As you can see, Chrome instances are allocated in the same way that threads are, so they can be considered one and the same in that regard (1 Chrome instance = 1 thread).

Things to consider

You can take advantage of concurrent crawling to enable your team to be more efficient. There are a few things to consider about how you set up your audits:

#1 Speed affects everything

With Sitebulb Cloud you could choose to crawl every site as fast as possible, using all the resources available for maximum speed. This would, however, remove the ability to crawl concurrently, as all the threads would be used for one audit.

On top of that, we always advise caution with speed, since crawling a website too fast can cause the website's server to crash. Instead, we recommend crawling responsibly, which Sitebulb will try to do by default.

Typically it makes more sense to have 2 or 3 audits running at slower speeds, but you may certainly find exceptions to this rule.

#2 Think about your biggest websites

You may have one or two massive websites that will take a long time to crawl - Sitebulb Cloud can handle millions of URLs, so you might have these ticking away in the background for several days while other audits are being run at the same time. If this is the case, it would make sense to space out your big sites so you are not crawling them at the same time.

#3 Don't schedule everything all at once

Sitebulb Cloud is perfectly suited to regular recurring audits. But scheduling all your monthly client audits for the first day of the month means that your server will end up getting hammered on the 1st every single month - and you'll have tons of audits that will end up queued.

Instead, it would be reasonably straightforward to schedule a handful for the 1st, a handful for the 2nd, a handful for the 3rd, and so on, while still having the data ready when you need it.
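As a rough illustration of spreading audits across the start of the month, the sketch below assigns a fixed number of audits to each day. Scheduling itself is set up in Sitebulb, so this is just a planning aid: the function name stagger_audits, the client names, and the figure of 3 audits per day are all hypothetical.

```python
def stagger_audits(audits, per_day, first_day=1):
    """Assign audits to consecutive days of the month, per_day at a time.

    Returns a dict mapping day of month -> list of audit names.
    Hypothetical helper: it does not check that the days fit within
    the length of the month.
    """
    schedule = {}
    for index, audit in enumerate(audits):
        day = first_day + index // per_day
        schedule.setdefault(day, []).append(audit)
    return schedule


clients = ["client-a", "client-b", "client-c", "client-d",
           "client-e", "client-f", "client-g"]
for day, names in stagger_audits(clients, per_day=3).items():
    print(day, names)
# 1 ['client-a', 'client-b', 'client-c']
# 2 ['client-d', 'client-e', 'client-f']
# 3 ['client-g']
```

Choosing a per-day figure that fits within your server's thread limit at your usual crawl speed should keep most of these audits out of the queue.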

It also makes sense to set weekly/monthly scheduled audits to run outside of working hours, so if you need to run other audits during the day, the resources will be available to do this.
