Include & Exclude URLs settings

When you run a Sitebulb audit, the crawler starts at your homepage (or a URL you specify) and follows every link it finds, building up a picture of your website. But most websites have pages you don’t want to include in the audit - login pages, search results, faceted navigation URLs, staging environments, and more.

The Include & Exclude URLs settings give you precise control over what the crawler does and doesn’t visit. You can:

  • Exclude specific pages, folders, or URL patterns from being crawled

  • Include only certain sections of your site, ignoring everything else

  • Provide seed URLs to make sure the crawler finds important pages

  • Rewrite URLs to handle duplicate content caused by parameters or casing

  • Block third-party resources to speed up your audit

  • Exclude links found inside specific page elements like sidebars or navigation widgets

  • Control subdomain handling to decide whether subdomains are treated as part of your site

These settings are found in the “Include & Exclude URLs” section of the audit setup sidebar, under Crawler Settings. Inside, you will find eight sub-tabs. There is also a separate “Subdomain Options” tab in the sidebar.

When You Don’t Need These Settings

If you simply want to crawl your entire website with no restrictions, you don’t need to change anything here. By default, Sitebulb crawls all internal pages it can find (subject to robots.txt rules, crawl depth limits, and page limits). The default settings work well for most standard audits.

When You Should Use These Settings

You should configure these settings when:

  • Your site has large sections you want to exclude (thousands of faceted navigation pages on an e-commerce site)

  • You only want to audit a specific section (just the blog, or just the product pages)

  • Your site generates duplicate URLs through query parameters (tracking codes, session IDs)

  • You want to speed up the audit by blocking unnecessary third-party scripts

  • You need to audit a specific set of known URLs

Understanding the Pattern Syntax

Most of the tabs in this feature use a pattern language based on robots.txt directive syntax. If you’ve ever written a robots.txt file, you’ll recognise it immediately. If you haven’t, don’t worry - it’s straightforward once you understand a few rules.

What Patterns Match Against

This is important: patterns match against the path and query string portion of the URL only. They do not match against the protocol (https://) or the domain name (www.example.com).

Part of the URL      Included in matching?
https://             No
www.example.com      No
/blog/article        Yes
?page=2              Yes

So the pattern you would write to match this URL is /blog/article* or /blog/* - you never include the domain.
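In code terms, the matcher only ever sees the portion of the URL from the first / after the domain onwards. A quick way to see what the match target of any given URL would be (a stdlib sketch for illustration only - Sitebulb's internals aren't published):

```python
from urllib.parse import urlsplit

def match_target(url: str) -> str:
    """Return the part of a URL that patterns are matched against:
    the path plus the query string, with protocol and domain dropped."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

print(match_target("https://www.example.com/blog/article?page=2"))
# → /blog/article?page=2
```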

Wildcards and Special Characters

Character   What it does                                          Example
*           Matches any sequence of characters (including none)   /blog/* matches /blog/, /blog/my-post, /blog/2024/january/post
$           Anchors the match to the exact end of the URL         /blog$ matches /blog but NOT /blog/post or /blog?page=2
/           The forward slash that starts every URL path          Always begin patterns with / or *

How Matching Works

Without a leading *: The pattern matches from the start of the URL path. Think of it as “the URL must begin with this.”

/blog/*

This matches: /blog/, /blog/my-post, /blog/2024/post
This does NOT match: /en/blog/my-post (because the path starts with /en, not /blog)

With a leading *: The pattern matches anywhere in the URL path. Think of it as “the URL must contain this.”

*/blog/*

This matches: /blog/my-post, /en/blog/my-post, /fr/blog/2024/post

With a trailing $: The pattern must match the exact end of the URL. Think of it as “the URL must end exactly here.”

/women$

This matches: /women
This does NOT match: /women/coats, /women?page=2

Practical Examples

Pattern      What it matches                                          What it does NOT match
/blog/*      /blog/my-post, /blog/                                    /en/blog/my-post
*/blog/*     /blog/my-post, /en/blog/my-post                          /blogroll/page
/products$   /products exactly                                        /products/shoes, /products?sort=price
*/filter/*   /shoes/filter/size/10, /us/products/filter/colour/red    /filter-tips/
*?sort=      /products?sort=price, /shoes?sort=name                   /products (no query string)

Patterns Starting with ?

If you write a pattern that starts with a question mark (to match query parameters), Sitebulb automatically treats it as if you wrote *? - meaning it will match the query string anywhere in the URL.

So these two patterns are equivalent:

?cardSize=
*?cardSize=

Both match: /browse/home-garden?cardSize=small

Multiple Patterns

Enter one pattern per line. Each pattern is evaluated independently - a URL only needs to match one pattern to be affected.

/blog/*
/news/*
*/archive/*

This would match any URL whose path starts with /blog/ or /news/, or contains /archive/ anywhere.

Note: Patterns are case-sensitive. The pattern /Blog/* will NOT match the URL /blog/my-post. Make sure your pattern’s letter casing matches the actual URLs on your website. If your site uses mixed casing, consider using the URL Rewriting feature to convert all URLs to lowercase before the patterns are applied.

Note: Do not use regular expressions (regex). Only the robots.txt wildcard syntax described above is supported. Writing regex patterns like ^/blog/.*$ will not work as expected.
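To build intuition for the wildcard rules above, here is a small Python sketch that mimics them by translating a pattern into a regular expression internally. It illustrates the documented behaviour only - it is not Sitebulb's actual implementation:

```python
import re

def pattern_matches(pattern: str, target: str) -> bool:
    """Mimic robots.txt-style matching against a URL path + query string.

    '*' matches any run of characters, a trailing '$' anchors the match
    to the exact end of the URL, and a pattern starting with '?' is
    treated as '*?'. Matching is case-sensitive.
    """
    if pattern.startswith("?"):
        pattern = "*" + pattern
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters in the literal parts, join with '.*'
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start; add '$' only when the pattern did
    return re.match(body + ("$" if anchored else ""), target) is not None

print(pattern_matches("/blog/*", "/blog/my-post"))      # True
print(pattern_matches("*/blog/*", "/en/blog/my-post"))  # True
print(pattern_matches("/women$", "/women/coats"))       # False
```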

Internal URL Exclusions

The first sub-tab, and the one you’ll likely use most often. This is where you tell Sitebulb which internal pages or sections to skip during the crawl.

There are two separate controls on this tab: path exclusions and query parameter exclusions.

Exclude Internal URL Paths

This is the main text area where you enter URL patterns for pages you want to exclude.

What It Does

Any internal URL that matches one of your patterns will not be crawled. Sitebulb will still record that the URL exists (if it finds a link to it), but it won’t visit the page, download its content, or follow any links on it.

When You’d Use It

  • Faceted navigation on e-commerce sites: Product listing pages often generate thousands of filtered URLs (/shoes/filter/colour/red/size/10). These are rarely useful to audit and can dramatically slow down your crawl.

  • User-generated content sections: Comment pages, forum threads, or review pages that you don’t need to audit.

  • Staging or test content: URLs like /staging/* or /test/* that shouldn’t be part of a production audit.

  • Paginated archives: If you have deep pagination (/blog/page/2, /blog/page/3, …, /blog/page/500), you might want to exclude it.

  • Search result pages: Internal search generates unique URLs for every query, which can make your crawl balloon in size.

Examples of What TO Do

Scenario                                          Pattern                 Explanation
Exclude all blog posts                            /blog/*                 Matches anything starting with /blog/
Exclude faceted navigation                        */filter/*              Matches /filter/ anywhere in the URL path
Exclude search results                            /search*                Matches /search, /search?q=shoes, /search-results
Exclude a specific page                           /about/old-team-page$   The $ ensures only this exact path is matched
Exclude multiple language versions of a section   */careers/*             Matches /careers/, /en/careers/, /fr/careers/

Examples of What NOT to Do

Mistake                                Why it's wrong                                                                            Correct version
blog/*                                 Missing leading / - URL paths always start with /, so this pattern will never match       /blog/*
https://example.com/blog/*             Patterns don't include the protocol or domain                                             /blog/*
/Blog/* when your URLs are /blog/...   Patterns are case-sensitive                                                               /blog/*
^/blog/.*$                             This is regex, not robots.txt syntax                                                      /blog/*
/products (no $ or *)                  Matches /products, /products/shoes, /productsearch - probably more than you intended      /products$ for the exact page, or /products/* for everything under it

Exclude Internal URL Query String Parameters

Below the path exclusion text area, you’ll find a separate section for handling query string parameters. Query strings are the parts of URLs that come after the ? - for example, in /products?sort=price&page=2, the query parameters are sort and page.

The “Crawl Parameters” Checkbox

By default, Sitebulb does crawl URLs with query parameters. This means /products, /products?sort=price, and /products?sort=name are all treated as separate pages.

If you untick “Crawl Parameters”, Sitebulb will ignore all URLs that contain query parameters. Only the base URL (/products) would be crawled.

Safe Query String Parameters

When you untick “Crawl Parameters”, a new text area appears: Safe Query String Parameters. This lets you make exceptions. Enter parameter names (one per line) that you still want Sitebulb to crawl.

For example, if your site uses page for pagination and you want to crawl paginated pages but nothing else:

  1. Untick “Crawl Parameters”

  2. In the Safe Query String Parameters box, type:

    page

Now Sitebulb will crawl /products?page=2 but skip /products?sort=price or /products?sessionid=abc123.
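The behaviour of the checkbox plus the safe list can be sketched as follows. One detail is an assumption: the article doesn't state how a URL mixing safe and unsafe parameters is handled, so this sketch requires every parameter to be on the safe list:

```python
from urllib.parse import urlsplit, parse_qsl

def should_crawl(url: str, crawl_parameters: bool, safe_params: set) -> bool:
    """Sketch of the 'Crawl Parameters' checkbox plus the safe list.

    Assumption: a parameterised URL is crawled only if *every* query
    parameter is on the safe list (mixed-URL handling is undocumented).
    """
    query = urlsplit(url).query
    if crawl_parameters or not query:
        return True
    return all(name in safe_params
               for name, _ in parse_qsl(query, keep_blank_values=True))

print(should_crawl("/products?page=2", False, {"page"}))      # True
print(should_crawl("/products?sort=price", False, {"page"}))  # False
```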

When You’d Use It

  • Tracking parameters: Marketing campaigns add parameters like utm_source, utm_medium, fbclid, gclid. These create thousands of duplicate URLs.

  • Session IDs: Some older sites add session identifiers to URLs.

  • Sort and filter parameters: E-commerce sites often have sort, order, view parameters that create duplicate content.

  • Pagination: If you want to crawl paginated pages but nothing else with parameters, use the safe list.

Note: If your site relies heavily on query parameters for navigation (/products?category=shoes), be careful with this setting. Turning off parameter crawling could cause Sitebulb to miss important pages.

External URL Exclusions

This tab works identically to the path exclusion part of Internal URL Exclusions, but for external URLs - links pointing to other websites.

What It Does

When Sitebulb finds links to external websites, it normally records them and checks their HTTP status (to find broken links, for example). External URL exclusions tell Sitebulb to skip checking certain external URL patterns entirely.

When You’d Use It

  • Known redirect services: If your site links to a URL shortener or redirect service that always returns the same status, you might want to exclude it.

  • Affiliate links: Links to affiliate networks that you don’t need to audit.

  • Social media profile links: If you link to social media from every page, these will appear thousands of times in your audit.

  • Login-required external pages: External pages behind login walls that always return errors.

Examples

Scenario                                                   Pattern              Explanation
Exclude all links to a specific path on an external site   /affiliate/*         Skips external URLs whose path starts with /affiliate/
Exclude tracking redirect URLs                             */click*             Skips external URLs with /click anywhere in the path
Exclude a specific external page                           /profile/sitebulb$   Skips only this exact path on external sites

Note: Remember, these patterns match against the path and query string of external URLs, not the domain. If you want to exclude an entire external domain, use the External Domain Exclusions tab instead.

Internal URL Inclusions

Inclusions are like a special case of exclusions. Instead of saying “crawl everything except these”, inclusions say “crawl only these and nothing else.”

What It Does

When you add inclusion patterns, Sitebulb will only crawl internal URLs that match at least one of your patterns. Everything else is ignored. Think of it as putting a spotlight on just the pages you care about.

When You’d Use It

  • Auditing a single section: You only want to audit the /blog/ section of a large website.

  • Auditing a specific language version: You only want to audit /en/ pages on a multilingual site.

  • Focused technical audit: You want to check only product pages (/products/*) for structured data issues.

  • New section launch: You just launched a new /resources/ section and want to audit only that.

Examples

Scenario                             Pattern                                              Explanation
Only crawl the blog                  /blog/*                                              Only URLs starting with /blog/ will be crawled
Only crawl the English version       /en/*                                                Only URLs under the /en/ path
Only crawl products and categories   /products/* on one line, /categories/* on the next   URLs matching either pattern will be crawled

Important: Your Start URL Must Link to Included Pages

This is a crucial point that catches many people out. Sitebulb starts crawling from your start URL (usually your homepage) and follows links. If your homepage doesn’t contain any links to URLs that match your inclusion patterns, the crawl will find nothing.

Example of the problem:

  • You set an inclusion pattern of /blog/*

  • Your homepage links to /about, /products, /contact - but not directly to any /blog/ page

  • Result: Sitebulb never finds a matching page and the crawl returns almost no results

Solutions:

  1. Make sure your start URL links to at least one page matching your inclusion pattern

  2. Use the URL Seed List tab to provide direct URLs that match your patterns

  3. Change your start URL to a page within the included section (set your start URL to https://example.com/blog/)

Links on Included Pages

When you set inclusion patterns, a second text area appears: Links on Included Pages. This is an optional, more advanced feature.

What It Does

Normally, when you use inclusion patterns, Sitebulb only crawls pages that match those patterns. But pages within your included section might link to pages outside it - for example, a blog post might link to a product page. Without this setting, those linked product pages would be ignored.

“Links on Included Pages” lets you define a secondary set of patterns. If a page matching your primary inclusion patterns links to a page matching these secondary patterns, that linked page will also be crawled - even though it doesn’t match the primary inclusion rules.

When You’d Use It

  • Blog audit with linked product pages: You’re auditing /blog/* but want to also check any product pages that blog posts link to.

  • Section audit with cross-links: You’re auditing /resources/* but resources link to /tools/* pages that you also want to include.

Example

Primary inclusion pattern:

/blog/*

Links on Included Pages pattern:

/products/*

Result: Sitebulb crawls all /blog/* pages. For any /products/* page linked from a blog post, Sitebulb will also crawl that product page.

Note: Secondary inclusion only works one level deep. If a secondary-included product page links to another page (say /reviews/product-123), that review page will NOT be crawled unless it matches the primary inclusion rules. The secondary patterns only rescue pages directly linked from primary-included pages.

URL Seed List

The URL Seed List is conceptually different from the other tabs. Instead of patterns that match many URLs, you provide specific, complete URLs that you want Sitebulb to visit.

What It Does

Seed URLs are added to the crawl queue alongside the URLs Sitebulb discovers by following links. The crawler will visit each seed URL and then follow any links it finds on those pages (subject to your other inclusion/exclusion rules).

When You’d Use It

  • Orphan page detection: You have pages that aren’t linked from anywhere on your site, but you want them included in the audit.

  • Supporting inclusion rules: Your inclusion patterns target a section that isn’t linked from your start URL (see the note in Internal URL Inclusions above).

  • Specific page audit: You have a list of exact URLs you want checked.

  • Sitemap-like input: You exported a list of URLs from another tool and want to make sure they’re all crawled.

  • Deep pages: Important pages buried deep in your site architecture that the crawler might not reach within its depth limit.

How to Use It

Enter one full URL per line, including the protocol and domain, for example:

https://www.example.com/orphaned-page
https://www.example.com/blog/hidden-post

Important Differences from Patterns

Seed URLs                                            Pattern Rules
Full URLs with protocol and domain                   Path-only patterns without protocol or domain
Match a single specific page                         Match many pages at once
Add pages to the crawl                               Remove pages from the crawl (exclusions) or restrict the crawl (inclusions)
The crawler will follow links found on these pages   Patterns don't cause the crawler to visit new pages, they just filter

Note: Seed URLs are still subject to your exclusion rules. If you add a seed URL that matches an exclusion pattern, it will still be excluded. Make sure your seed URLs don’t conflict with your exclusion rules.

URL Rewriting

URL Rewriting transforms URLs before the crawler processes them. This is different from exclusions - instead of skipping URLs, it changes them into a normalised form.

What It Does

URL rewriting modifies URLs as Sitebulb discovers them, before any crawl decisions are made. This helps consolidate duplicate URLs that only differ in casing or unnecessary parameters.

Settings

There are two catch-all options you can enable:

1 - Convert All URLs to Lowercase

When enabled, every internal URL is converted to lowercase before being processed.


When you’d use it: Your site treats /Products/Shoes and /products/shoes as the same page, but the links on your site use inconsistent casing. Without this setting, Sitebulb would crawl both as separate pages.

Example:

  • Before rewriting: /Products/Shoes, /products/shoes, /PRODUCTS/SHOES

  • After rewriting: all become /products/shoes (crawled once)

Note: Only enable this if your server genuinely treats URLs as case-insensitive. On Linux servers, /Products/ and /products/ are often different pages. If in doubt, test by visiting both URLs in a browser.

2 - Remove All Query String Parameters

When enabled, everything after the ? in a URL is stripped.

When you’d use it: Your site appends many different parameters that don’t change the page content, and you want a clean audit without parameter variations.

Example:

  • Before rewriting: /products/shoes?sort=price&utm_source=google&ref=homepage

  • After rewriting: /products/shoes

Parameters to Remove

Instead of removing all parameters, you can specify individual parameter names to remove. Enter one per line.

When you’d use it: You want to keep meaningful parameters (like page for pagination) but remove tracking parameters.

Example - parameters to remove:

utm_source
utm_medium
utm_campaign
fbclid
gclid
ref

With these settings, the URL /blog/post?page=2&utm_source=google&fbclid=abc becomes /blog/post?page=2.
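Putting the rewriting rules together, the transformation works roughly like this stdlib sketch (the parameter set is the example list above, not a built-in Sitebulb default):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Example removal list from the article - not a built-in default
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "fbclid", "gclid", "ref"}

def rewrite_url(url: str, lowercase: bool = False,
                remove: set = TRACKING_PARAMS) -> str:
    """Apply lowercase conversion and parameter removal to a URL."""
    parts = urlsplit(url)
    path = parts.path.lower() if lowercase else parts.path
    # Keep only the parameters that are not on the removal list
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in remove]
    return urlunsplit((parts.scheme, parts.netloc, path,
                       urlencode(kept), parts.fragment))

print(rewrite_url("/blog/post?page=2&utm_source=google&fbclid=abc"))
# → /blog/post?page=2
```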

Live Test Tool

The URL Rewriting tab includes a live preview tool. Enter a URL and see immediately how your rewriting rules would transform it. Use this to verify your settings before starting the audit.


Note: URL rewriting happens early in the processing pipeline, before inclusion and exclusion patterns are evaluated. This means your patterns should match the rewritten form of URLs, not the original. For example, if you enable lowercase conversion, write your exclusion patterns in lowercase.


External Domain Exclusions

While External URL Exclusions (described above) let you exclude specific URL paths on external sites, External Domain Exclusions let you block entire domains.

What It Does

Any external links pointing to excluded domains will be skipped entirely. Sitebulb won’t check their status, follow their redirects, or include them in your audit data.

There are two separate lists:

  1. Links (External HTML Sources): Excludes domains for external page links (the URLs you see in <a href=""> tags).

  2. Page Resources: Excludes domains for external resources like images, scripts, and stylesheets.

When You’d Use It

  • Known safe external domains: Domains you link to frequently that you don’t need to audit (your own CDN).

  • Slow external domains: External sites that respond slowly and are slowing down your crawl.

  • Internal tool domains: Links to internal tools or intranets that aren’t publicly accessible.

  • Resource CDNs: Block external resource domains you don’t care about checking.

How to Use It

Enter one domain per line, without the protocol:

cdn.example.com
tracking.analytics-service.com

Wildcard Subdomains

You can use *. at the beginning to match all subdomains of a domain:

*.analytics-service.com

This matches tracking.analytics-service.com, cdn.analytics-service.com, www.analytics-service.com, and any other subdomain.
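The matching logic is simple enough to sketch. One assumption to flag: the examples above list only true subdomains, so this sketch treats *. as matching subdomains only, not the apex domain itself:

```python
def domain_excluded(host: str, rule: str) -> bool:
    """Check a hostname against an exclusion entry.

    A plain entry matches that exact host. A '*.' prefix matches any
    subdomain of the base domain (assumption: the apex domain itself
    does not match a '*.' rule).
    """
    if rule.startswith("*."):
        return host.endswith("." + rule[2:])
    return host == rule

print(domain_excluded("tracking.analytics-service.com",
                      "*.analytics-service.com"))           # True
print(domain_excluded("partner-site.com", "partner-site.com"))  # True
```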

Examples

Scenario                                        What to enter              Where to enter it
Stop checking links to your partner site        partner-site.com           Links (External HTML Sources)
Block a slow CDN from being checked             slow-cdn.example.com       Page Resources
Block all subdomains of an analytics provider   *.analytics-provider.com   Page Resources

Note: This tab uses plain domain names, NOT the robots.txt pattern syntax used in other tabs. Don’t add paths, protocols, or wildcards other than the *. prefix.

Third Party Resources

This tab controls which third-party resources (scripts, stylesheets, images, fonts, etc.) the crawler loads when visiting your pages. Blocking unnecessary resources can significantly speed up your audit.

What It Does

When Sitebulb visits a page, it can load all the external resources that page references - JavaScript files, CSS stylesheets, fonts, images, tracking pixels, and more. Many of these resources are third-party services (analytics, advertising, chat widgets, consent banners) that aren’t relevant to your SEO audit.

Blocking these resources makes the crawl faster and reduces bandwidth usage without affecting the audit data you care about.

Resource Blocking Level

A dropdown at the top provides four preset levels:

All (Block Everything): blocks all third-party resources - ads, tracking, analytics, social media, consent banners, chat widgets, fonts, images, and media

Third Party: blocks common third-party service categories

Tracking Only: blocks only tracking and advertising scripts

None (Block Nothing): no resources are blocked - everything loads normally

For most audits, Third Party or Tracking Only is a good starting point. Use None only if you specifically need to analyse third-party resource loading (for page speed audits that depend on third-party scripts).

Custom Rules

Below the preset dropdown, three sub-tabs let you fine-tune the blocking:

Domain Exclusions

Enter domains (one per line) whose resources should be blocked. No protocol needed.

ads.doubleclick.net
pixel.facebook.com

URL Path Exclusions

This sub-tab uses a different pattern syntax from the rest of the Include & Exclude URLs feature. Here, you use glob-style patterns that include the protocol:

*://*.tracking-service.com/*
*://ads.example.com/pixel*

The *:// matches any protocol (http or https). The * works as a wildcard for any characters, similar to the robots.txt syntax, but the full URL including protocol is used here.

Pattern                         What it matches
*://*.tracking-service.com/*    Any URL on any subdomain of tracking-service.com
*://ads.example.com/pixel*      Any URL starting with /pixel on ads.example.com
*://*.cdn.example.com/fonts/*   Font files from any subdomain of cdn.example.com
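Glob semantics like these can be approximated with Python's fnmatch module (fnmatchcase keeps matching case-sensitive). This is an illustration of glob-style matching, not Sitebulb's actual matcher - note that in glob syntax a literal '?' in a pattern would act as a single-character wildcard:

```python
from fnmatch import fnmatchcase

# In this format the whole URL, including the protocol, is the target
print(fnmatchcase("https://cdn.tracking-service.com/pixel.js",
                  "*://*.tracking-service.com/*"))   # True
print(fnmatchcase("https://ads.example.com/pixel?id=1",
                  "*://ads.example.com/pixel*"))     # True
```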

Allowed Domains

The reverse of domain exclusions - domains entered here will never be blocked, even if they match a blocking category from the preset level. Use this for third-party services that are essential to how your pages render.

essential-service.example.com
required-cdn.example.com

When you’d use it: You’ve set the blocking level to “Third Party” but your site uses a third-party JavaScript framework from a CDN that’s needed for the page to render properly. Add that CDN to the allowed list.

Note: The URL Path Exclusions sub-tab within Third Party Resources uses a different pattern format (full URL with protocol, glob-style) compared to the robots.txt style syntax used in the Internal/External URL Exclusions tabs. Don’t mix up the formats.

Exclude Link Settings

This tab works fundamentally differently from all the others. Instead of matching URLs, it matches HTML elements on the page using CSS selectors.

What It Does

When Sitebulb parses a page, it extracts all the links it finds. Exclude Link Settings lets you tell Sitebulb to ignore links found inside specific HTML containers. The links are skipped entirely - the crawler won’t follow them or record them.

This operates at the HTML parsing stage, before any URL-level rules are applied.

When You’d Use It

  • Faceted navigation: E-commerce sites often have filter sidebars that generate thousands of link variations. Instead of writing complex URL patterns, you can simply exclude the sidebar container.

  • Related products widgets: “You might also like” sections that link to hundreds of other products.

  • Comment sections: User comments containing links you don’t want in your audit.

  • Footer mega-menus: If your footer contains hundreds of links that you don’t need to crawl.

  • Tag clouds or category lists: Navigation widgets that create many low-value links.

How to Use It

Enter standard CSS selectors, one per line. If you’ve ever used CSS or browser developer tools, these will be familiar. If not, here are the most useful types:

Selector

What it targets

Example

.class-name

Any element with that CSS class

.facet-nav

#element-id

The element with that specific ID

#related-products

element

All elements of that type

aside

.parent .child

Elements matching .child inside elements matching .parent

.sidebar .widget

Practical Examples

E-commerce site with faceted navigation: Your product listing page has a sidebar with class facet-navigation that contains hundreds of filter links. Add:

.facet-navigation

All links inside any element with the class facet-navigation will be ignored.

Blog with a related posts widget: Your blog has a <div id="related-posts"> section at the bottom of each article. Add:

#related-posts

Multiple elements to exclude:

.facet-nav
#related-products
.sidebar-widget
.comment-section
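Conceptually, this works like the following stdlib sketch, which extracts links while skipping everything inside containers that match simple .class / #id selectors. Real CSS selector support is far richer - this is illustrative only:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs, ignoring links inside containers whose class or id
    matches an excluded selector. Supports only '.class' and '#id'
    selectors; void tags without end tags (e.g. bare <img>) would throw
    off the depth count in this simplified sketch."""

    def __init__(self, excluded):
        super().__init__()
        self.excluded = excluded
        self.excluded_depth = 0  # nesting level inside an excluded container
        self.links = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        for sel in self.excluded:
            if sel.startswith(".") and sel[1:] in classes:
                return True
            if sel.startswith("#") and attrs.get("id") == sel[1:]:
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if self.excluded_depth:
            self.excluded_depth += 1          # still inside excluded container
        elif self._matches(attrs):
            self.excluded_depth = 1           # entering an excluded container
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self.excluded_depth:
            self.excluded_depth -= 1

html = ('<div><a href="/keep">Keep</a>'
        '<div class="facet-nav"><a href="/skip">Skip</a></div></div>')
parser = LinkExtractor({".facet-nav"})
parser.feed(html)
print(parser.links)  # → ['/keep']
```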

How to Find the Right CSS Selector

  1. Open your website in a browser (Chrome, Firefox, Edge)

  2. Right-click on the element containing the links you want to exclude

  3. Select “Inspect” or “Inspect Element”

  4. Look at the HTML element that wraps the links

  5. Note its class name (in the class="..." attribute) or ID (in the id="..." attribute)

  6. Use .class-name or #id-name as your selector

Note: These are standard CSS selectors, not URL patterns. Don’t enter URLs or robots.txt patterns here.

Note: Links excluded this way are completely invisible to the crawl. They won’t appear in any reports and the crawler won’t follow them. Make sure you’re not accidentally excluding important navigation.

Subdomain Options

This is a separate top-level tab in the audit setup sidebar (not a sub-tab of Include & Exclude URLs), but it’s closely related to URL inclusion and exclusion behaviour.

What It Does

When Sitebulb encounters links to subdomains of your primary domain (blog.example.com when auditing www.example.com), this setting controls how those subdomains are treated.

The Three Options

Check HTTP Status (default): Sitebulb checks whether the subdomain URLs return a valid response (200, 301, 404, etc.) but doesn't crawl them as part of your site. They appear as external links.

Audit and Report: Sitebulb treats subdomain URLs as internal pages and fully crawls them. Use this when your subdomains are part of the same website (shop.example.com alongside www.example.com).

Exclude All: Sitebulb completely ignores links to subdomains. They won't be checked or recorded.

Including and Excluding Specific Subdomains

Depending on your chosen option, additional text areas appear where you can list specific subdomains:

When “Audit and Report” is selected: You can list specific subdomains to exclude from the audit. All other subdomains will be audited.

When “Exclude All” is selected: You can list specific subdomains to include in the audit. All other subdomains will be excluded.

Enter subdomain names (without the main domain), one per line:

blog
shop
help

Note: When both inclusion and exclusion subdomain lists are in play, inclusions take priority over exclusions. If a subdomain appears in both lists, it will be included.

When You’d Use It

  • Multi-subdomain website: Your brand has www., blog., shop., and help. subdomains and you want to audit them all together. Choose “Audit and Report.”

  • Single focus: You only want to audit www.example.com and don’t care about other subdomains. Choose “Exclude All.”

  • Selective subdomain audit: You want to audit www. and blog. but not staging. or dev.. Choose “Audit and Report” and exclude staging and dev.

How Rules Work Together

When you configure multiple rules across different tabs, Sitebulb evaluates them in a specific order. Understanding this order helps you avoid unexpected behaviour.

Evaluation Order

For every URL the crawler encounters, Sitebulb follows this decision process:

  1. URL Rewriting is applied first - the URL is transformed (lowercase, parameter removal) before any other rules are checked

  2. Has this URL already been scheduled or excluded? If yes, skip it (unless secondary inclusion applies)

  3. Is this an external URL on an excluded domain? If yes, skip it

  4. Does this external URL match an external URL exclusion pattern? If yes, skip it

  5. Is this a third-party resource that should be blocked? If yes, skip it

  6. Does this internal URL match an internal URL exclusion pattern? If yes, skip it

  7. Is this a parameterised URL, and are parameters disallowed? If yes, skip it

  8. Are inclusion rules active? If yes, the URL must match an inclusion pattern (or be rescued by secondary inclusion)
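The ordering can be sketched as a single decision function. This is heavily simplified: it collapses steps 2-7 into one exclusion check, omits secondary inclusion, and uses Python's fnmatch as a stand-in for the robots.txt-style matcher:

```python
from fnmatch import fnmatchcase

def matches(pattern: str, path: str) -> bool:
    # Trailing '$' anchors the end; otherwise anything may follow the match
    if pattern.endswith("$"):
        return fnmatchcase(path, pattern[:-1])
    return fnmatchcase(path, pattern + "*")

def crawl_decision(path, rewrite_fn, exclusions, inclusions):
    path = rewrite_fn(path)                        # 1. rewriting runs first
    if any(matches(p, path) for p in exclusions):  # exclusions always win
        return "skip"
    if inclusions and not any(matches(p, path) for p in inclusions):
        return "skip"                              # inclusion list must match
    return "crawl"

print(crawl_decision("/blog/archive/2023/post", str.lower,
                     ["*/archive/*"], ["/blog/*"]))  # skip
print(crawl_decision("/blog/my-post", str.lower,
                     ["*/archive/*"], ["/blog/*"]))  # crawl
```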

The Golden Rule: Exclusions Beat Inclusions

If a URL matches both an exclusion pattern and an inclusion pattern, the exclusion wins. Exclusions are evaluated first, and an excluded URL will not be crawled regardless of any inclusion rules.

Example of a conflict:

  • Exclusion pattern: */archive/*

  • Inclusion pattern: /blog/*

  • URL: /blog/archive/2023/post

  • Result: Excluded - because it matches the exclusion pattern

To avoid this, make sure your exclusion and inclusion patterns don’t overlap in unintended ways.

The Exception: Secondary Inclusion

The one scenario where an excluded URL can be “rescued” is through secondary inclusion (“Links on Included Pages”). If a URL was excluded but it matches a secondary inclusion pattern AND it was linked from a primary-included page, the secondary inclusion can override the exclusion.

However, this only works one level deep and is a niche scenario. For most users, the simpler rule is: exclusions always win.

CSS Selectors Run First

The Exclude Link Settings (CSS selectors) operate at the HTML parsing stage - before URLs even enter the scheduling pipeline. If a link is inside an excluded CSS selector, the URL is never extracted from the page in the first place, so none of the URL-based rules even see it.

URL Rewriting Affects Pattern Matching

Because URL rewriting happens before pattern matching, your patterns need to match the rewritten URLs, not the originals.

Example:

  • You enable “Convert all URLs to lowercase”

  • Your exclusion pattern is /Blog/*

  • A URL /Blog/my-post is rewritten to /blog/my-post

  • Result: Your pattern /Blog/* does NOT match /blog/my-post because of the case difference

  • Fix: Change your pattern to /blog/*

Common Mistakes and How to Avoid Them

1. Missing the Leading Forward Slash

The mistake: Writing products/* instead of /products/*.

Why it’s wrong: All URL paths start with /. Without it, the pattern has no way to match the beginning of a path. The crawler will never find a URL whose path starts with products (no slash).

The fix: Always start your patterns with / (to match from the beginning of the path) or * (to match anywhere in the path).

2. Forgetting That Patterns Are Case-Sensitive

The mistake: Writing /blog/* when your site’s URLs use /Blog/my-post.

Why it’s wrong: Pattern matching is case-sensitive. /blog/* and /Blog/* are completely different patterns.

The fix: Check the actual URLs on your site (look in the browser address bar). Match the exact casing. If your site uses inconsistent casing, enable “Convert all URLs to lowercase” in the URL Rewriting tab and write all your patterns in lowercase.

3. Setting Inclusion Rules Without a Path to Matching Pages

The mistake: Adding an inclusion pattern for /resources/* when your start URL (homepage) doesn’t link to any /resources/ page.

Why it’s wrong: Sitebulb follows links from your start URL. If no link leads to a page matching your inclusion pattern, the crawler discovers nothing.

The fix: Either change your start URL to a page within the included section, or add seed URLs (in the URL Seed List tab) that match your inclusion patterns.

4. Not Realising Exclusions Override Inclusions

The mistake: Adding an exclusion for */archive/* and then wondering why /blog/archive/2023/post isn’t crawled, even though /blog/* is in your inclusions.

Why it’s wrong: Exclusion rules are always evaluated first and always win.

The fix: Review both your exclusion and inclusion lists together. If you want certain URLs to be crawled, make sure they don’t match any exclusion pattern. Remove or narrow the conflicting exclusion.

5. Using the Wrong Syntax in the Wrong Tab

The mistake: Entering a full URL with protocol in the Internal URL Exclusions tab, or entering a path-only pattern in the Third Party Resources URL Path Exclusions.

Why it’s wrong: Different tabs use different formats:

  • Most tabs: path-only patterns using robots.txt syntax

  • Third Party Resources (URL Path Exclusions): full URL patterns with protocol (*://)

  • External Domain Exclusions: plain domain names

  • Seed URLs: full URLs with protocol and domain

  • Exclude Link Settings: CSS selectors

The fix: Refer to the specific tab’s instructions and use the correct format for that tab.

6. Overly Broad Patterns

The mistake: Using /p* to exclude /products/ but accidentally also excluding /pages/, /privacy-policy, and /pricing.

Why it’s wrong: * is greedy and matches everything. A short prefix pattern like /p* will match far more URLs than you intend.

The fix: Be as specific as possible. Use /products/* instead of /p*. Test your patterns mentally against URLs you do want to crawl to make sure they won’t be accidentally caught.
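You can test this mentally, or mechanically. A sketch of the overly broad pattern against some plausible paths (hypothetical URLs, `fnmatch` as the stand-in matcher):

```python
from fnmatch import fnmatchcase

paths = ["/products/widget", "/pages/about", "/privacy-policy", "/pricing"]

# The broad pattern catches every path that merely starts with "p":
[p for p in paths if fnmatchcase(p, "/p*")]
# -> all four paths

# The specific pattern catches only what was intended:
[p for p in paths if fnmatchcase(p, "/products/*")]
# -> ["/products/widget"]
```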

7. Forgetting That Secondary Inclusion Is One Level Deep

The mistake: Setting up secondary inclusion expecting it to cascade through multiple levels of linked pages.

Why it’s wrong: If Page A (primary included) links to Page B (secondary included), and Page B links to Page C, Page C is NOT automatically included. Only the direct links from primary-included pages are rescued by secondary inclusion.

The fix: If you need deeper cascading, expand your primary inclusion patterns instead of relying on secondary inclusion.
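The one-level-deep behaviour can be modelled with a toy crawl over a hypothetical link graph. This is a sketch of the rule as described above, not Sitebulb's implementation; the pages and patterns are invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical link graph: page -> pages it links to.
links = {
    "/blog/post-a": ["/downloads/guide"],       # primary-included page
    "/downloads/guide": ["/downloads/extra"],   # secondary-included page
}

primary = ["/blog/*"]
secondary = ["/downloads/*"]

def crawl(start):
    crawled = set()
    frontier = [start]
    while frontier:
        page = frontier.pop()
        crawled.add(page)
        for url in links.get(page, []):
            if any(fnmatchcase(url, p) for p in primary):
                frontier.append(url)  # primary inclusion cascades normally
            elif any(fnmatchcase(url, p) for p in secondary) and any(
                fnmatchcase(page, p) for p in primary
            ):
                # Rescued by secondary inclusion, but its own links
                # are NOT followed - one level deep only.
                crawled.add(url)
    return crawled

crawl("/blog/post-a")
# -> {"/blog/post-a", "/downloads/guide"}; "/downloads/extra" is never reached
```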

Quick Reference

Summary of All Tabs

| Tab | What it controls | Pattern format | Input type |
| --- | --- | --- | --- |
| Internal URL Exclusions | Which internal pages to skip | Path-only, robots.txt syntax | Patterns (one per line) |
| Internal URL Exclusions (Parameters) | Whether to crawl URLs with query strings | Parameter names | Names (one per line) + checkbox |
| External URL Exclusions | Which external URL paths to skip checking | Path-only, robots.txt syntax | Patterns (one per line) |
| Internal URL Inclusions | Restrict crawl to only matching pages | Path-only, robots.txt syntax | Patterns (one per line) |
| Links on Included Pages | Extra pages to crawl one level deep from included pages | Path-only, robots.txt syntax | Patterns (one per line) |
| URL Seed List | Specific URLs to add to the crawl | Full URLs with protocol and domain | URLs (one per line) |
| URL Rewriting | Normalise URLs before processing | Parameter names | Names (one per line) + checkboxes |
| External Domain Exclusions | Entire domains to skip for external links/resources | Domain names (with optional *. prefix) | Domains (one per line) |
| Third Party Resources | Block third-party scripts, styles, fonts, etc. | Preset levels + domain names + glob-style URL patterns with protocol | Mixed (dropdown + domains + patterns) |
| Exclude Link Settings | Ignore links found inside specific page elements | CSS selectors | Selectors (one per line) |
| Subdomain Options | How to treat links to your subdomains | Subdomain names (without main domain) | Dropdown + names (one per line) |

Pattern Syntax Cheat Sheet

| Pattern | Meaning | Example match |
| --- | --- | --- |
| /path/* | Anything starting with /path/ | /path/to/page |
| */path/* | Anything containing /path/ anywhere | /en/path/to/page |
| /exact-page$ | Only this exact path, nothing after it | /exact-page |
| *?param= | Any URL with this query parameter | /page?param=value |
| /path/*/end | Path with any middle segment | /path/something/end |
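The cheat sheet rows can be reproduced with a small translator from this wildcard syntax to a regular expression. This is an illustrative model of the syntax, not Sitebulb's matcher: `*` becomes `.*`, a trailing `$` anchors the pattern to the end of the URL, and everything else is matched literally from the start of the path.

```python
import re

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the
    # pattern to the end of the URL. Everything else is literal.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return regex + ("$" if anchored else "")

def matches(url, pattern):
    # Patterns match from the start of the path unless they begin with '*'.
    return re.match(pattern_to_regex(pattern), url) is not None

matches("/path/to/page", "/path/*")            # True
matches("/en/path/to/page", "*/path/*")        # True
matches("/exact-page", "/exact-page$")         # True
matches("/exact-page/sub", "/exact-page$")     # False: $ forbids anything after
matches("/page?param=value", "*?param=")       # True
matches("/path/something/end", "/path/*/end")  # True
```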

Default Behaviour (No Rules Configured)

When you don’t configure any inclusion or exclusion rules:

  • All internal pages are crawled (up to your page limit and crawl depth)

  • All external links are checked for HTTP status

  • URLs with different query parameters are treated as separate pages

  • Third-party resources are loaded normally

  • All subdomains are checked for HTTP status but not fully crawled

  • Robots.txt rules on your site are still respected (this is separate from user-defined rules)
