
Include & Exclude URLs settings


Sometimes you may find that you want (or need) to restrict the crawler in order for it to crawl the website as you want it to be crawled.

For example:

  • Certain folders on the website account for thousands of extra pages, and offer very little insight, so you would prefer not to crawl them.

  • You are working on a specific section of the website, so you only want to crawl the relevant directories.

  • When the crawler encounters faceted navigation, it spawns thousands (or millions!) of variations of effectively the same URL, which all get added to the crawl queue and take forever to crawl.

  • The website adds click path parameters to every URL, which also have a canonical to the original URL, so your website audit becomes much less usable.

  • Your page resources include resources from external domains, which you wish to exclude.

In all of these cases, you will need to configure the crawler to exclude certain URLs so that they do not end up being added to the crawl queue, which you can do via the Include & Exclude URLs tab in the Audit Settings, which you'll find in the left-hand menu.

Include and Exclude URLs Configuration

As you navigate through the tabs in the Include & Exclude URLs window, you will see that there are 6 different ways to limit the crawl, each of which is covered below.

Internal URL Exclusions

Internal URL Exclusions are a method for restricting the crawler that allows you to specify URLs or entire directories Sitebulb should not crawl.

This tab also allows you to select URL Query String Parameters exclusion settings.

Exclude Internal URL Paths

Any URL that matches the list of Exclude Internal URL Paths will not be crawled at all. This also means that any URL only reachable via an excluded URL will not be crawled, even if it does not match the excluded list.

As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the following line to the Exclude Internal URL Paths list:
/product/
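Conceptually, an exclusion rule behaves like a path-prefix match against each discovered URL. The sketch below is illustrative only and not Sitebulb's actual implementation; the `is_excluded` helper and the rule list are assumptions:

```python
from urllib.parse import urlparse

# Hypothetical exclusion rules, one path per line, mirroring the settings box
EXCLUDE_PATHS = ["/product/"]

def is_excluded(url: str) -> bool:
    """Return True if the URL's path matches an excluded path prefix."""
    path = urlparse(url).path
    return any(path.startswith(rule) for rule in EXCLUDE_PATHS)

print(is_excluded("https://sitebulb.com/product/crawl-maps/"))  # True: skipped
print(is_excluded("https://sitebulb.com/resources/guides/"))    # False: crawled
```

Note that because excluded pages are never fetched, any page linked only from `/product/` pages would also go undiscovered, as described above.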

Internal URL Exclusions

Exclude Internal URL Query String Parameters

By default, Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, for example on sites with a large, crawlable faceted search system.

To stop Sitebulb from crawling all URLs with (any) parameters, untick the 'Crawl Parameters' box.

If there are some parameters that you do want to crawl, use the 'Safe Query String Parameters' box to add in parameters that you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').
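The effect of these two settings can be sketched as a simple filter: with 'Crawl Parameters' unticked, a parameterised URL only enters the crawl queue if every parameter is on the safe list. This is an illustrative sketch under those assumptions, not Sitebulb's code:

```python
from urllib.parse import urlparse, parse_qsl

# Hypothetical allow-list mirroring the 'Safe Query String Parameters' box
SAFE_PARAMS = {"page", "p"}

def should_crawl(url: str, crawl_parameters: bool = False) -> bool:
    """Decide whether a URL with query string parameters gets crawled."""
    params = parse_qsl(urlparse(url).query)
    if not params or crawl_parameters:
        return True  # no parameters, or 'Crawl Parameters' is ticked
    return all(key in SAFE_PARAMS for key, _ in params)

print(should_crawl("https://example.com/blog?page=2"))             # True: kept
print(should_crawl("https://example.com/shop?colour=red&size=m"))  # False: dropped
```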

Query String Parameter Exclusions

Internal URL Inclusions

Using Internal URL Paths Inclusion settings is a method for restricting the crawler to only the URLs or directories specified.

As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/

Internal URL Inclusions

Inclusion and Exclusion rules do not apply to Page Resources

The Include & Exclude URLs settings only apply to HTML URLs. That means if you are crawling with Chrome and have your Page Resources report enabled, Sitebulb will still pick up and report on all of your resources - the exclusion rules do not apply to these.

Other Caveats

Excluded URLs override included URLs, so ensure your rules do not clash.
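The precedence rule above can be sketched as follows: a URL matching an exclusion is always dropped, even if it also matches an inclusion. The rule lists and function name here are illustrative assumptions:

```python
from urllib.parse import urlparse

# Hypothetical overlapping rules: include '/product/' but exclude a sub-folder
INCLUDE_PATHS = ["/product/"]
EXCLUDE_PATHS = ["/product/legacy/"]

def crawl_allowed(url: str) -> bool:
    """Exclusions take precedence over inclusions."""
    path = urlparse(url).path
    if any(path.startswith(rule) for rule in EXCLUDE_PATHS):
        return False  # exclusion always wins
    if INCLUDE_PATHS:
        return any(path.startswith(rule) for rule in INCLUDE_PATHS)
    return True  # no inclusion rules: everything not excluded is allowed

print(crawl_allowed("https://sitebulb.com/product/crawl-maps/"))  # True
print(crawl_allowed("https://sitebulb.com/product/legacy/old/"))  # False
```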

Your Start URL must contain at least one link to an included URL; otherwise, the crawler will simply crawl 1 URL and then stop. To work around this, you can use the URL Seed List settings.

URL Seed List

The URL Seed list essentially provides Sitebulb with extra 'Start URLs'. For any URLs included in the seed list, Sitebulb will also parse the HTML on these pages and extract links, in addition to the Start URL and any other pages crawled.

This feature is particularly useful in conjunction with inclusion rules, in cases where the Start URL does not contain links to all the paths you wish to crawl.

You can also use this feature, in conjunction with Subdomain Options, to ensure all your subdomains are found and crawled.
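Conceptually, the seed list just widens the starting point of the crawl: the initial queue holds the Start URL plus every seed URL, and each one is fetched and parsed for links. A minimal sketch, with made-up URLs and a `initial_frontier` helper that is purely illustrative:

```python
from collections import deque

def initial_frontier(start_url: str, seed_urls: list) -> deque:
    """Build the initial crawl queue: the Start URL plus every seed URL.

    Each seed page is parsed for links just like the Start URL, so sections
    (or subdomains) unreachable from the Start URL can still be discovered."""
    frontier = deque([start_url])
    for seed in seed_urls:
        if seed not in frontier:
            frontier.append(seed)
    return frontier

frontier = initial_frontier(
    "https://example.com/",
    ["https://blog.example.com/", "https://docs.example.com/"],
)
print(list(frontier))  # all three URLs, Start URL first
```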

Seed list

URL Rewriting

URL Rewriting is a method for instructing Sitebulb to modify the URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically, these URLs are canonicalized to the 'non-parameterized' version, which really just completely messes up your audit...unless you use URL rewriting.

Use URL Rewriting to strip parameters. For example, a tracked URL such as:

https://example.com/page?ut_source=homepage

Can become:

https://example.com/page

And you end up with 'clean' URLs in your audit.

To set up the example above, you would enter the parameter 'ut_source' in the box. If you also wish to add other parameters, add one per line.

URL Rewriting

Then, you can test your settings at the bottom by entering example URLs.
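The rewriting step amounts to dropping the listed parameters from each URL's query string while keeping everything else. A sketch of that logic, assuming the 'ut_source' parameter from the example above (the URL and domain are made up):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters to strip, one per line in the settings box
STRIP_PARAMS = {"ut_source"}

def rewrite(url: str) -> str:
    """Remove the listed parameters, leaving any others intact."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in STRIP_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(rewrite("https://example.com/page?ut_source=homepage&page=2"))
# https://example.com/page?page=2
```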

Test URL Rewriting Settings

Alternatively, the tickboxes at the top allow you to automatically rewrite all upper case characters to lower case, or to remove ALL parameters, respectively. The latter option means you do not need to list individual parameters in the box; everything will be stripped.
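Those two tickbox options can be sketched the same way; the function and flag names below are illustrative, not Sitebulb's actual settings API:

```python
from urllib.parse import urlparse, urlunparse

def rewrite_all(url: str, lowercase: bool = True, strip_all_params: bool = True) -> str:
    """Apply the two tickbox rewrites: lowercase the URL, drop every parameter."""
    if lowercase:
        url = url.lower()
    if strip_all_params:
        url = urlunparse(urlparse(url)._replace(query=""))
    return url

print(rewrite_all("https://Example.com/Shop?Colour=Red"))
# https://example.com/shop
```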

External Domain Exclusions

The Include & Exclude URLs settings also allow you to set exclusions for external domains at the audit level. On the External Domain Exclusions tab, you can specify domains that should be excluded in the External Page analysis or from Page Resources.

If you want to exclude both links and page resources from a certain domain, you will need to include it in both lists.

External Domains Exclusions

External Domain Exclusions do not apply to Page Resources when Performance is enabled

When you enable the performance report, Sitebulb has to analyze and include all resource URLs in the audit in order to accurately report on performance and performance issues.

External Domain Exclusion rules won't apply to Page Resources URLs when Performance is enabled.

Block Third-Party URLs

Third-party URLs, like ads and tracking scripts, can cause audits to become bloated with URLs that generally don't need to be audited. Blocking them results in a much cleaner audit, so Sitebulb does this by default - untick this option to include them in your audit.

The HubSpot platform, in particular, can spawn a lot of tracking scripts, so again, this is blocked by default.

Block Scripts

There is one instance where you may wish to unblock tracking scripts in order to get better audit data: when the website is set up to dynamically insert or change on-page data via Google Tag Manager (read about how to set this up here).
