There are numerous ways in which you can stop Sitebulb from crawling specific URLs, paths or domains. This guide consolidates all of these methods, to help you understand which rules you will need to customise your crawl.
As a core premise, internal and external URLs are treated differently by the software, so they will each have their own section below.
Internal URLs Inclusion and Exclusion rules
There are 4 ways in which you can exclude particular internal URLs from being crawled:
Excluding specific URLs or paths
Including specific URLs or paths (subtle but important difference)
Excluding query string parameters
Rewriting URLs on the fly
In all of these cases, you configure the crawler so that certain URLs are never added to the crawl queue. You do this via the Include & Exclude URLs settings in the left-hand menu of the audit setup.
As you navigate through the top tabs, you will see the different ways to exclude URLs, each of which is covered below.
#1 Excluding specific URLs or paths
Excluded URLs restrict the crawler by letting you specify URLs or entire directories to avoid.
Any URL that matches the excluded list will not be crawled at all. This also means that any URL reachable only via an excluded URL will not be crawled either, even if it does not itself match the excluded list.
The list is pre-filled with some common patterns, which you can either overwrite or add to using the lines underneath. As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/
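To make the knock-on effect of an exclusion concrete, here is a minimal sketch of a crawl queue in Python. It is not Sitebulb's implementation: the link graph is hypothetical, and the matching rule (a pattern matches when it appears anywhere in the URL) is an assumption made for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/product/", "/blog/"],
    "/product/": ["/product/features/", "/hidden-offer/"],
    "/blog/": ["/blog/post-1/"],
}

EXCLUDED = ["/product/"]  # the same rule as in the example above


def is_excluded(url, patterns):
    # Assumption: a rule matches when it appears anywhere in the URL.
    return any(pattern in url for pattern in patterns)


def crawl(start, links, excluded):
    """Breadth-first crawl that never queues an excluded URL."""
    seen, queue, crawled = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen and not is_excluded(target, excluded):
                seen.add(target)
                queue.append(target)
    return crawled


print(crawl("/", LINKS, EXCLUDED))  # ['/', '/blog/', '/blog/post-1/']
```

Note that '/hidden-offer/' does not match the rule, yet it is never crawled, because the only page linking to it is excluded.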
#2 Including specific URLs or paths
Included URLs also restrict the crawler, but in the opposite way: the crawl is limited to only the URLs or directories you specify.
As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/
It is worth noting a couple of things:
Excluded URLs override included URLs, so ensure your rules do not clash.
Your Start URL must contain at least one link to an included URL, otherwise the crawler will simply crawl 1 URL and then stop.
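The two notes above can be sketched as a single decision function. This is an illustrative model only: the function name, the substring matching, and the sample links are assumptions, not Sitebulb's actual logic.

```python
def should_crawl(url, included, excluded):
    """Decide whether a discovered internal URL joins the crawl queue."""
    # Assumption: a rule matches when it appears anywhere in the URL.
    if any(rule in url for rule in excluded):
        return False  # excluded rules take priority over included rules
    if not included:
        return True  # no include rules means no restriction
    return any(rule in url for rule in included)


included = ["/product/"]
excluded = []

# Hypothetical links found on the Start URL.
links_on_start_page = ["/about/", "/blog/", "/contact/"]
queued = [u for u in links_on_start_page if should_crawl(u, included, excluded)]
print(queued)  # empty: only the Start URL itself would be crawled
```

In this sketch the Start URL links to nothing under '/product/', so the queue stays empty and the crawl ends after one URL.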
#3 Excluding query string parameters
By default Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, such as on sites with a large, crawlable, faceted search system.
To stop Sitebulb crawling all URLs with (any) parameters at all, untick the 'Crawl Parameters' box. In the box below for 'Safe Query String Parameters', you can add in parameters which you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').
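As a rough sketch of how these two settings interact, the function below models 'Crawl Parameters' as a boolean and 'Safe Query String Parameters' as a set. The function name is hypothetical, and the handling of URLs that mix safe and unsafe parameters (skipped here) is an assumption rather than documented Sitebulb behaviour.

```python
from urllib.parse import urlsplit, parse_qsl


def allowed_by_parameter_rules(url, crawl_parameters, safe_params):
    """Return True if the URL passes the query string settings."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    if not names or crawl_parameters:
        return True  # no parameters, or parameters are allowed in general
    # Assumption: every parameter on the URL must be in the safe list.
    return all(name in safe_params for name in names)


safe = {"page", "p"}
print(allowed_by_parameter_rules("https://example.com/shoes?page=2", False, safe))
print(allowed_by_parameter_rules("https://example.com/shoes?colour=red", False, safe))
```

With 'Crawl Parameters' unticked, the paginated URL is still crawled while the faceted 'colour' URL is skipped.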
#4 Rewriting URLs on the fly
URL Rewriting is a method for instructing Sitebulb to modify URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, and they can completely mess up your audit... unless you use URL rewriting.
You use URL Rewriting to strip parameters, so for example:
Can become:
And you end up with 'clean' URLs in your audit.
To set up the example above, you would enter the parameter 'ut_source' in the box in the middle of the page. If you also wish to add other parameters, add one per line.
Alternatively, the tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. With the latter option you do not need to write individual parameters into the box, as every parameter is stripped.
Then, you can test your settings at the bottom by entering example URLs.
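If it helps to picture the rewriting step, here is a minimal Python sketch of the three options (strip named parameters, lower-case the URL, remove all parameters). The function and the example.com URLs are hypothetical illustrations, not Sitebulb's code or the URLs from this guide.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def rewrite_url(url, strip_params=(), lowercase=False, remove_all_params=False):
    """Return the 'clean' form of a discovered URL."""
    parts = urlsplit(url)
    if remove_all_params:
        query = ""
    else:
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        kept = [(name, value) for name, value in pairs if name not in strip_params]
        query = urlencode(kept)
    rewritten = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )
    return rewritten.lower() if lowercase else rewritten


url = "https://example.com/Page?ut_source=nav&page=2"  # hypothetical URL
print(rewrite_url(url, strip_params={"ut_source"}))
print(rewrite_url(url, remove_all_params=True))
print(rewrite_url(url, strip_params={"ut_source"}, lowercase=True))
```

Stripping only 'ut_source' keeps the 'page' parameter, whereas removing all parameters leaves just the path.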
Excluding External URLs
When it comes to external URLs, it is worth noting that Sitebulb does not actually 'crawl' them in the first place - it merely does an HTTP status check on them. This allows you to check for broken links and redirects, without extracting and following links from another website (and accidentally crawling the entire internet...).
Excluding external URLs can be controlled in two different sections:
In the audit settings, which only affects a specific audit
In the global settings, which affects every audit
#1 Audit Settings
When setting up your audit, make sure that 'Search Engine Optimization' is toggled on in the 'Audit Data' section (it is always on by default), then hit the 'Advanced Settings' button to open up the options underneath.
If you wish for Sitebulb to not check links to external websites, you need to uncheck this option.
#2 Global settings
While the above options give you most of the flexibility you need, sometimes you may require a bit more control. For instance, you may want to check external links and get their status codes in general, but NOT for one specific domain.
The URL Profiler site, for instance, links out to t.co a bunch of times:
In order to exclude only these t.co links, you need to go to the global settings, navigate to Excluded External URLs and add 't.co' to the Excluded Hosts.
The typical use case for this is if you do want to check external links in general, but you know that you have tens or hundreds of thousands of links to a specific domain and you don't want them included in your audit as they make it more difficult to navigate. For instance, social sharing links on every single product page of an ecommerce store.
Excluding external subdomains
A quick note on external subdomains, as they are treated differently to 'internal' subdomains (i.e. subdomains of the start URL).
Consider these external links to Majestic's site from URL Profiler:
If I only wanted to exclude the link to the blog subdomain, I would need to add this rule to the Excluded Hosts:
blog.majestic.com
But if I wanted to exclude all of the links in the table above, I would need to add this rule to the Excluded Hosts:
majestic.com
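The difference between the two rules can be modelled as follows. This is a sketch under the assumption that an Excluded Hosts entry matches that host and any subdomain of it, which is consistent with the behaviour described above but is not taken from Sitebulb's code. The function name and URL paths are hypothetical.

```python
from urllib.parse import urlsplit


def host_is_excluded(url, excluded_hosts):
    """Return True if the URL's host matches an Excluded Hosts rule."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: a rule matches the host itself and all of its subdomains.
    return any(host == rule or host.endswith("." + rule) for rule in excluded_hosts)


print(host_is_excluded("https://blog.majestic.com/post", ["blog.majestic.com"]))
print(host_is_excluded("https://majestic.com/", ["blog.majestic.com"]))
print(host_is_excluded("https://blog.majestic.com/post", ["majestic.com"]))
```

Matching on a leading dot means a rule for 'majestic.com' would not accidentally catch an unrelated host that merely ends in the same letters.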
Excluding external paths
By adding paths to the list of Excluded Paths you will stop any external URLs that include these paths from being scheduled and checked by the Sitebulb crawler.
Adding in 'tweet' would exclude:
Any URLs that had /tweet/ in the folder name (e.g. https://example.com/tweet/abc)
Any URLs that had tweet in the filename (e.g. https://example.com/abc/tweet.php)
You can make the rule more specific: for instance, adding 'tweet.php' will only match URLs that contain that exact string.
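A minimal sketch of this kind of path matching, assuming a rule matches when it appears anywhere in the URL's path (the function name is hypothetical and the exact matching Sitebulb performs may differ):

```python
from urllib.parse import urlsplit


def path_is_excluded(url, excluded_paths):
    """Return True if any Excluded Paths rule appears in the URL's path."""
    path = urlsplit(url).path
    # Assumption: simple substring matching against the path component.
    return any(rule in path for rule in excluded_paths)


urls = ["https://example.com/tweet/abc", "https://example.com/abc/tweet.php"]
print([path_is_excluded(u, ["tweet"]) for u in urls])
print([path_is_excluded(u, ["tweet.php"]) for u in urls])
```

The broad rule 'tweet' catches both URLs, while the narrower 'tweet.php' catches only the second.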