Sometimes you will want (or need) to restrict the crawler so that it crawls the website the way you want it to.
For example:
Certain folders on the website account for thousands of extra pages, and offer very little insight, so you would prefer not to crawl them.
You are working on a specific section of the website, so you only want to crawl certain areas of it.
When the crawler encounters faceted navigation, it spawns thousands (or millions!) of variations of effectively the same URL, which all get added to the crawl queue and take forever to crawl.
The website adds click path parameters to every URL, which also have a canonical to the original URL, so your website audit becomes much less usable.
Your page resources include resources from external domains, which you wish to exclude.
In all of these cases, you will need to configure the crawler to exclude certain URLs so that they do not end up being added to the crawl queue, which you can do via the Include & Exclude URLs option from the left-hand menu of the audit setup.
As you navigate through the tabs in this window, you will see that there are 6 different ways to limit the crawl, each of which is covered below.
Internal URL Exclusions
Using Internal URL Exclusions is a method for restricting the crawler that allows you to specify URLs or entire directories to avoid. This tab also allows you to select URL Query String Parameters exclusion settings.
URL Paths to Exclude
Any URL that matches the excluded list will not be crawled at all. This also means that any URL only reachable via an excluded URL will not be crawled, even if it does not match the excluded list.
As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/
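The effect of a path exclusion, including the knock-on effect for pages only reachable through an excluded URL, can be sketched in a few lines of Python. This is an illustrative model only, not Sitebulb's implementation: the `is_excluded` and `crawl` functions, the `example.com` URLs, and the toy link graph are all hypothetical, and the sketch assumes an exclusion rule matches whenever it appears anywhere in the URL path.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical exclusion list, one path fragment per entry.
EXCLUDED_PATHS = ["/product/"]

def is_excluded(url, excluded_paths=EXCLUDED_PATHS):
    """Assumed rule: a URL is excluded if its path contains any listed fragment."""
    path = urlsplit(url).path
    return any(fragment in path for fragment in excluded_paths)

# Toy link graph: page URL -> URLs it links to (made up for illustration).
LINKS = {
    "https://example.com/": ["https://example.com/product/", "https://example.com/blog/"],
    "https://example.com/product/": ["https://example.com/hidden-page/"],
    "https://example.com/blog/": [],
    "https://example.com/hidden-page/": [],
}

def crawl(start_url, links=LINKS):
    """Breadth-first crawl that never queues an excluded URL."""
    seen = {start_url}
    queue = deque([start_url])
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen and not is_excluded(target):
                seen.add(target)
                queue.append(target)
    return crawled

print(crawl("https://example.com/"))
```

In this toy graph, `/hidden-page/` does not match the exclusion rule, but it is never crawled because the only link to it sits on the excluded `/product/` page.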
Exclude Internal URL Query String Parameters
By default, Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, for example on sites with a large, crawlable, faceted search system.
To stop Sitebulb from crawling all URLs with (any) parameters at all, untick the 'Crawl Parameters' box. In the box below for 'Safe Query String Parameters', you can add in parameters which you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').
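As a rough illustration of how a 'Crawl Parameters' switch and a safe list could interact, here is a minimal Python sketch. The `should_crawl` function and its arguments are hypothetical names, and the sketch assumes that a URL mixing safe and non-safe parameters is skipped; the tool's actual behaviour in that case may differ.

```python
from urllib.parse import urlsplit, parse_qsl

def should_crawl(url, crawl_parameters=False, safe_parameters=("page", "p")):
    """Decide whether a URL passes the (assumed) query string rules."""
    names = [name for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    if not names or crawl_parameters:
        # No parameters at all, or parameter crawling is switched on.
        return True
    # Assumption: every parameter on the URL must be on the safe list.
    return all(name in safe_parameters for name in names)

for candidate in [
    "https://example.com/shoes",
    "https://example.com/shoes?page=2",
    "https://example.com/shoes?colour=red",
]:
    print(candidate, should_crawl(candidate))
```

With the defaults above, the plain URL and the paginated URL are crawled, while the faceted `colour=red` URL is skipped.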
Internal URL Inclusions
Using Internal URL Paths Inclusion settings is a method for restricting the crawler to only the URLs or directories specified.
As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/
It is worth noting a couple of things:
Excluded URLs override included URLs, so ensure your rules do not clash.
Your Start URL must contain at least one link to an included URL, otherwise the crawler will simply crawl 1 URL and then stop. To work around this, you can use the URL Seed List settings.
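The precedence between the two lists can be modelled with a small Python sketch. The `is_allowed` function and the example paths are hypothetical, and the sketch assumes simple "path contains fragment" matching rather than Sitebulb's real matching logic.

```python
from urllib.parse import urlsplit

def is_allowed(url, included_paths, excluded_paths):
    """Apply exclusions first, then inclusions (assumed precedence model)."""
    path = urlsplit(url).path
    if any(fragment in path for fragment in excluded_paths):
        return False  # an exclusion always wins over an inclusion
    if not included_paths:
        return True   # no inclusion rules means nothing else is restricted
    return any(fragment in path for fragment in included_paths)

included = ["/product/"]
excluded = ["/product/old/"]
print(is_allowed("https://example.com/product/new/", included, excluded))
print(is_allowed("https://example.com/product/old/x", included, excluded))
print(is_allowed("https://example.com/blog/", included, excluded))
```

Here the second URL matches both lists and is rejected, which is the kind of clash worth checking for before starting an audit.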
URL Seed List
For any URLs included in the seed list, Sitebulb will also parse the HTML on these pages and extract links - in addition to the Start URL and any other pages crawled.
This feature is useful alongside inclusion rules when the Start URL doesn't contain links to all the paths you wish to crawl.
You can also use this feature, in conjunction with Subdomain Options, to ensure all your subdomains are found and crawled.
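To see why a seed list helps when inclusion rules are in place, consider this small Python model of a crawl queue. Everything here is hypothetical (the `crawl_with_seeds` function, the link graph, the `example.com` URLs), and it assumes seed URLs are simply added to the queue alongside the Start URL.

```python
from collections import deque

# Toy link graph: the Start URL does not link to anything under /product/.
LINKS = {
    "https://example.com/": ["https://example.com/about/"],
    "https://example.com/about/": [],
    "https://example.com/product/": ["https://example.com/product/a/"],
    "https://example.com/product/a/": [],
}

def crawl_with_seeds(start_url, seed_urls=(), included_paths=("/product/",), links=LINKS):
    """Breadth-first crawl that only follows links matching an inclusion rule."""
    queue = deque([start_url, *seed_urls])
    seen = set(queue)
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            if any(fragment in target for fragment in included_paths):
                seen.add(target)
                queue.append(target)
    return crawled

print(crawl_with_seeds("https://example.com/"))
print(crawl_with_seeds("https://example.com/", seed_urls=["https://example.com/product/"]))
```

Without the seed, the crawl stops after the Start URL; with the seed, the `/product/` pages are discovered and crawled.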
URL Rewriting
URL Rewriting is a method for instructing Sitebulb to modify the URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, which can completely mess up your audit... unless you use URL rewriting.
You use URL Rewriting to strip parameters, so for example:
Can become:
And you end up with 'clean' URLs in your audit.
To set up the example above, you would enter the parameter 'ut_source' in the box. If you also wish to add other parameters, add one per line.
Then, you can test your settings at the bottom by entering example URLs.
Alternatively, the tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. With the latter option you do not need to write parameters into the box, as everything will be stripped.
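If you want to reason about what a rewrite rule will do before testing it in the settings window, the three options described above can be approximated in Python. This is a sketch under assumptions, not Sitebulb's rewriting engine: `rewrite_url` and its arguments are hypothetical names, the example URLs are made up, and the lower case option is assumed to apply to the whole URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_url(url, strip_params=("ut_source",), remove_all=False, lowercase=False):
    """Strip listed parameters (or all of them) and optionally lower-case the URL."""
    parts = urlsplit(url)
    if remove_all:
        query = ""
    else:
        kept = [
            (name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in strip_params
        ]
        query = urlencode(kept)
    rewritten = urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
    return rewritten.lower() if lowercase else rewritten

print(rewrite_url("https://example.com/page?ut_source=menu"))
print(rewrite_url("https://example.com/page?ut_source=menu&page=2"))
print(rewrite_url("https://example.com/Page?a=1&b=2", remove_all=True, lowercase=True))
```

Note that `urlencode` may re-encode parameter values, so the kept parameters can differ cosmetically from the input on URLs with unusual characters.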
External Domain Exclusions
The Include & Exclude URLs settings also allow you to set exclusions for external domains at the audit level. On the External Domain Exclusions tab you can specify domains that should be excluded in the External Page analysis or from Page Resources.
If you want to exclude both links and page resources from a certain domain, you will need to include it in both lists.
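The point about the two lists being independent can be shown with a short Python sketch. The list names, the domains, and the `is_external_excluded` function are all hypothetical, and the sketch assumes an exact hostname match.

```python
from urllib.parse import urlsplit

# Two separate, hypothetical lists: one for external links, one for page resources.
EXCLUDED_LINK_DOMAINS = {"ads.example.net", "shared.example.org"}
EXCLUDED_RESOURCE_DOMAINS = {"shared.example.org"}

def is_external_excluded(url, kind):
    """kind is 'link' or 'resource'; each kind consults only its own list."""
    host = urlsplit(url).hostname or ""
    if kind == "link":
        return host in EXCLUDED_LINK_DOMAINS
    if kind == "resource":
        return host in EXCLUDED_RESOURCE_DOMAINS
    raise ValueError(f"unknown kind: {kind!r}")

print(is_external_excluded("https://ads.example.net/banner.js", "link"))
print(is_external_excluded("https://ads.example.net/banner.js", "resource"))
print(is_external_excluded("https://shared.example.org/lib.js", "resource"))
```

In this model `ads.example.net` is only on the links list, so a script loaded from it would still appear in Page Resources; `shared.example.org` is on both lists and is excluded everywhere.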
Block Third Party URLs
Third party URLs, like ads and tracking scripts, can cause audits to get bloated with URLs that generally don't need to be audited. Blocking them results in a much cleaner audit, so Sitebulb does this by default. Untick this option to include them in your audit.
The HubSpot platform, in particular, can spawn a large number of tracking scripts, so these are also blocked by default.
There is one case where you may wish to unblock tracking scripts in order to get better audit data: when the website is set up to dynamically insert or change on-page data via Google Tag Manager (read about how to set this up here).