For certain websites or audits, you may wish to have more control over how Sitebulb will crawl, for example increasing the speed or adding crawl limits. You can do this via the Crawler Settings menu option on the left-hand side.
Crawler type
The Crawler Type has two options:
HTML Crawler - this is the default option, and will be suitable for most websites. The HTML Crawler uses 'traditional' HTML extraction, and is the quickest option.
Chrome Crawler - select this option if your website uses a JavaScript framework or renders content through JavaScript. The Chrome Crawler will render the page using a version of headless Chrome (essentially, a Chrome browser without a user interface). In order to render, the Chrome Crawler will need to download all the page resources, so you can expect a longer crawl time.
Using the correct crawler is critical if you're doing JavaScript SEO. To help determine if a website needs to be crawled using the Chrome Crawler, you can:
Run a sample crawl using the Chrome Crawler and compare the response and rendered HTML in your final Sitebulb audit, to determine whether important page content depends on JavaScript rendering.
Use the Single Page Analysis tool to analyze the Response vs. Render results for key pages.
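The response-vs-render check described above boils down to asking: does key content appear in the raw server response, or only after JavaScript runs? Below is a minimal sketch of that comparison in Python, using hypothetical hard-coded HTML samples; in practice the rendered HTML would come from a headless browser, as the Chrome Crawler does.

```python
# Hypothetical example: the server response contains an empty app shell,
# and the content only exists after JavaScript rendering.
RESPONSE_HTML = "<html><body><div id='app'></div></body></html>"
RENDERED_HTML = (
    "<html><body><div id='app'>"
    "<h1>Product name</h1><p>Key selling copy</p>"
    "</div></body></html>"
)

def depends_on_js(response_html: str, rendered_html: str, phrase: str) -> bool:
    """True if `phrase` only appears after JavaScript rendering."""
    return phrase in rendered_html and phrase not in response_html

print(depends_on_js(RESPONSE_HTML, RENDERED_HTML, "Product name"))  # True
```

If important content (headings, body copy, links) only shows up on the rendered side, the site needs the Chrome Crawler.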
Crawl speed (HTML Crawler)
The crawl speed settings will be different depending on the type of crawler you have selected. The HTML crawler provides the following settings to control the crawler speed:
Number of Threads - this setting controls how much CPU usage is allocated to Sitebulb, and therefore how many requests Sitebulb can send to the website server at any one time. In general, the more threads you use, the faster your audit can run. The maximum number of threads you can set is capped by the number of logical processors in your machine.
Limit URL Speed - a toggle you can use to switch the "max URLs/second" speed cap on and off. If switched on, Sitebulb will not crawl faster than the specified Max HTML URLs per Second limit.
Max HTML URLs per Second - When URL speed is limited, Sitebulb will not download more than the specified number of HTML URLs per second at any one time.
Note that the Max HTML URLs per Second limit acts as a maximum, but it does not mean that Sitebulb will always meet the limit, since other factors like TTFB, and the Number of Threads, can impact crawl speed.
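To make the "maximum, not a guarantee" point concrete, here is a simplified rate-limiter sketch (not Sitebulb's actual implementation) showing how a URLs-per-second cap typically works: the limiter only enforces a minimum gap between requests, so slow responses or busy threads can still push real throughput below the cap.

```python
import time

class UrlRateLimiter:
    """Cap fetches at `max_per_second` by enforcing a minimum interval.

    Illustrates the idea behind the Max HTML URLs per Second setting:
    it is a ceiling, and actual throughput can still be lower if each
    request is slow (high TTFB) or no thread is free to send it.
    """

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self.last_request = 0.0

    def wait_for_slot(self) -> None:
        now = time.monotonic()
        delay = self.min_interval - (now - self.last_request)
        if delay > 0:
            time.sleep(delay)
        self.last_request = time.monotonic()

limiter = UrlRateLimiter(max_per_second=5)
start = time.monotonic()
for _ in range(5):
    limiter.wait_for_slot()   # the actual URL fetch would go here
elapsed = time.monotonic() - start  # at least ~0.8s: 4 gaps of 0.2s
```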
To learn more about crawl speed and how to control it with Sitebulb settings, read our dedicated documentation on How to control crawl speed.
To learn more about crawling fast, we suggest you read our documentation How to Crawl Really Fast. Alternatively, to examine the benefits of a more measured approach, check out our article, 'How to Crawl Responsibly.'
Crawl speed (Chrome Crawler)
The Chrome Crawler speed configuration section will look slightly different.
Adjusting these values will affect how fast Sitebulb is able to crawl:
Render Timeout - this determines how long Sitebulb will pause to wait for content to render, before parsing the HTML. Increasing this value will slow down Sitebulb, as the crawler waits longer for each page to render before it can move on.
Instances of Chrome - this setting is equivalent to the ‘Number of Threads’ you can set when using the HTML crawler. It determines how many requests Sitebulb can send to the website server at any one time, by determining how many logical processors will be used for rendering with headless Chrome. The higher the value you use, the faster Sitebulb can crawl, within the limitations of logical processors available on your machine.
Limit URL Speed - when ticked, the crawler will not exceed the Max HTML URLs per Second limit set below.
Max HTML URLs per Second - Limit the speed of the crawler by capping the number of HTML URLs crawled per second. Lower speeds limit the number of concurrent connections, which helps prevent server slowdown for website users.
Note that this limit acts as a maximum, but it does not mean that Sitebulb will always meet the limit, since other factors like TTFB, Instances of Chrome and Render Timeout can impact crawl speed.
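The Render Timeout trade-off above can be sketched as a simple wait-with-timeout loop (a conceptual illustration, not Sitebulb's implementation): the crawler polls until the page reports it is ready or the timeout lapses, so a longer timeout helps slow pages render fully but costs crawl speed on every page that needs the full wait.

```python
import time

def wait_for_render(is_ready, timeout: float, poll: float = 0.05) -> bool:
    """Poll `is_ready()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True   # content rendered before the timeout
        time.sleep(poll)
    return is_ready()      # one final check at the deadline

# Simulate a page whose content "appears" after 0.1 seconds.
start = time.monotonic()
rendered = wait_for_render(lambda: time.monotonic() - start > 0.1, timeout=1.0)
```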
To learn more about crawl speed and how to control it with Sitebulb settings, read our dedicated documentation on How to control crawl speed.
To learn more about crawling fast, we suggest you read our documentation How to Crawl Really Fast. Alternatively, to examine the benefits of a more measured approach, then check out these articles, 'How to Crawl Responsibly' and 'How to control URLs/second for chrome crawler.'
Crawled URL limits
There are two ways to limit the crawler so that it crawls and audits fewer (or more) URLs:
Maximum URLs to Audit - The total number of URLs Sitebulb will crawl. Once it hits this limit, Sitebulb will stop crawling and generate the reports. The maximum URLs you are able to crawl per Audit depends on your Sitebulb plan.
You may choose to reduce this limit in order to produce a sample audit.
Maximum Crawl Depth - The maximum number of page levels Sitebulb will crawl in your site hierarchy (where the homepage is level 0, and all URL links followed from the homepage are 1 level deep, and so on). This is useful if you have extremely deep pagination that keeps spawning new pages.
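Crawl depth is easiest to picture as a breadth-first traversal with a depth cap. The sketch below (a hypothetical link graph, not Sitebulb's code) shows how a Maximum Crawl Depth of 1 keeps the crawl to the homepage plus the pages it links to directly:

```python
from collections import deque

# Hypothetical site: the homepage links to two category pages,
# one of which links to deeper paginated pages.
LINKS = {
    "/": ["/category-a", "/category-b"],
    "/category-a": ["/category-a/page-2"],
    "/category-a/page-2": ["/category-a/page-3"],
    "/category-b": [],
}

def crawl(start: str, max_depth: int) -> set[str]:
    seen = {start}
    queue = deque([(start, 0)])        # homepage is level 0
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                   # don't follow links any deeper
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

print(sorted(crawl("/", max_depth=1)))
# ['/', '/category-a', '/category-b']
```

Raising the cap to 2 would pull in `/category-a/page-2` as well, and so on down the pagination chain.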
Optional Crawler Settings
Finally, you have the option to enable a number of crawler settings that change how Sitebulb behaves in particular instances. These settings are disabled by default, and should only be enabled if necessary.
Enable Cookies - checking this option will save cookies throughout the crawl, which is necessary for crawling some websites. The default value of ‘No’ is suitable for most websites.
Analyse 4XX URLs - By default, Sitebulb does not analyze the HTML of pages that return 4XX HTTP status codes. When this option is enabled, Sitebulb will also analyze the HTML of these pages.
Parse Tables - By default, Sitebulb does not parse HTML tables, since the data in tables is often not relevant from a technical SEO perspective, and parsing large tables will slow the crawler down. Only enable this setting if necessary.
Advanced Chrome Crawler Settings
These settings control how and when the Chrome crawler renders the HTML. Most websites will render correctly with default settings, but having control over these can be useful when diagnosing rendering issues, or if Sitebulb isn't rendering the expected content.
Considered Load Event - Sets the event that Sitebulb will use to determine that the page has finished loading. ‘Wait for the Load event’ is the recommended setting, as this fires when the whole page has loaded, including all dependent resources (stylesheets, fonts, images).
Navigation Timeout - Sets a time frame for how long Sitebulb will wait for the Load Event to fire. Once this time has lapsed, Sitebulb will abort the request and will display ‘Timed Out’ as the HTTP Status in the audit results.
Enable Infinite Scroll - When enabled, Sitebulb will scroll down the page 10,000 pixels, which will load in lazy-loaded images and content.
Flatten Shadow Dom - Enabled by default to mimic Googlebot’s behavior - content inside shadow roots is inlined ('flattened') into the rendered HTML, so the crawler can see content that would otherwise be hidden inside the Shadow DOM. Disabling the feature can help diagnose rendering issues, but it is not recommended for standard setups.
Flatten Iframes - Enabled by default to mimic Googlebot’s behavior - iframes are fitted into a div element in the rendered HTML of the parent page.
Incognito (Session Isolation) - Enable this setting to crawl in Incognito mode - this will isolate each session and avoid ‘cross-tab tracking’.
Enable Service Workers - This setting is relevant specifically to websites that render content via service workers. Only enable if this is relevant or when diagnosing rendering issues.