When you run a Sitebulb audit, the crawler starts at your homepage (or a URL you specify) and follows every link it finds, building up a picture of your website. But most websites have pages you don’t want to include in the audit - login pages, search results, faceted navigation URLs, staging environments, and more.
The Include & Exclude URLs settings give you precise control over what the crawler does and doesn’t visit. You can:
Exclude specific pages, folders, or URL patterns from being crawled
Include only certain sections of your site, ignoring everything else
Provide seed URLs to make sure the crawler finds important pages
Rewrite URLs to handle duplicate content caused by parameters or casing
Block third-party resources to speed up your audit
Exclude links found inside specific page elements like sidebars or navigation widgets
Control subdomain handling to decide whether subdomains are treated as part of your site
These settings are found in the “Include & Exclude URLs” section of the audit setup sidebar, under Crawler Settings. Inside, you will find eight sub-tabs. There is also a separate “Subdomain Options” tab in the sidebar.
When You Don’t Need These Settings
If you simply want to crawl your entire website with no restrictions, you don’t need to change anything here. By default, Sitebulb crawls all internal pages it can find (subject to robots.txt rules, crawl depth limits, and page limits). The default settings work well for most standard audits.
When You Should Use These Settings
You should configure these settings when:
Your site has large sections you want to exclude (thousands of faceted navigation pages on an e-commerce site)
You only want to audit a specific section (just the blog, or just the product pages)
Your site generates duplicate URLs through query parameters (tracking codes, session IDs)
You want to speed up the audit by blocking unnecessary third-party scripts
You need to audit a specific set of known URLs
Understanding the Pattern Syntax
Most of the tabs in this feature use a pattern language based on robots.txt directive syntax. If you’ve ever written a robots.txt file, you’ll recognise it immediately. If you haven’t, don’t worry - it’s straightforward once you understand a few rules.
What Patterns Match Against
This is important: patterns match against the path and query string portion of the URL only. They do not match against the protocol (https://) or the domain name (www.example.com).
For example, given the URL https://www.example.com/blog/article?page=2:
Part of the URL | Included in matching? |
https:// (protocol) | No |
www.example.com (domain) | No |
/blog/article (path) | Yes |
?page=2 (query string) | Yes |
So the pattern you would write to match this URL is /blog/article* or /blog/* - you never include the domain.
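To make this concrete, here is a short standard-library Python sketch that extracts the portion of a URL that patterns are matched against (this is an illustration of the rule, not Sitebulb's own code):

```python
from urllib.parse import urlsplit

def matchable_part(url):
    """Return the part of a URL that patterns match against:
    the path plus the query string. The protocol and domain
    are discarded before matching."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

print(matchable_part("https://www.example.com/blog/article?page=2"))
# -> /blog/article?page=2
```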
Wildcards and Special Characters
Character | What it does | Example |
* | Matches any sequence of characters (including none) | /blog/* matches /blog/ and /blog/my-post |
$ | Anchors the match to the exact end of the URL | /women$ matches /women but not /women/coats |
/ | The forward slash that starts every URL path | Always begin patterns with / or * |
How Matching Works
Without a leading *: The pattern matches from the start of the URL path. Think of it as “the URL must begin with this.”
/blog/*
This matches: /blog/, /blog/my-post, /blog/2024/post
This does NOT match: /en/blog/my-post (because the path starts with /en, not /blog)
With a leading *: The pattern matches anywhere in the URL path. Think of it as “the URL must contain this.”
*/blog/*
This matches: /blog/my-post, /en/blog/my-post, /fr/blog/2024/post
With a trailing $: The pattern must match the exact end of the URL. Think of it as “the URL must end exactly here.”
/women$
This matches: /women
This does NOT match: /women/coats, /women?page=2
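The rules above can be sketched in a few lines of Python. This is a minimal, illustrative translation of the wildcard syntax into regular expressions, not Sitebulb's implementation (for brevity it omits the automatic *-prefixing of patterns that start with ?, described below):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt-style pattern into a regex:
    '*' matches any run of characters, a trailing '$' anchors
    the end, and everything else matches literally from the
    start of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return "^" + body + ("$" if anchored else "")

def matches(pattern, path_and_query):
    return re.match(pattern_to_regex(pattern), path_and_query) is not None

print(matches("/blog/*", "/blog/my-post"))       # True
print(matches("/blog/*", "/en/blog/my-post"))    # False: not anchored here
print(matches("*/blog/*", "/en/blog/my-post"))   # True: leading * matches anywhere
print(matches("/women$", "/women/coats"))        # False: $ anchors the end
```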
Practical Examples
Pattern | What it matches | What it does NOT match |
/blog/* | /blog/, /blog/my-post, /blog/2024/post | /en/blog/my-post |
*/blog/* | /blog/my-post, /en/blog/my-post | /bloggers (no /blog/ segment) |
/women$ | /women only | /women/coats, /women?page=2 |
?page= | /products?page=2 | /products (no query string) |
*/filter/* | /shoes/filter/colour/red | /filter-options (no /filter/ segment) |
Patterns Starting with ?
If you write a pattern that starts with a question mark (to match query parameters), Sitebulb automatically treats it as if you wrote *? - meaning it will match the query string anywhere in the URL.
So these two patterns are equivalent:
?cardSize=
*?cardSize=
Both match: /browse/home-garden?cardSize=small
Multiple Patterns
Enter one pattern per line. Each pattern is evaluated independently - a URL only needs to match one pattern to be affected.
/blog/*
/news/*
*/archive/*
This would match any URL containing /blog/, /news/ at the start, or /archive/ anywhere in the path.
Note: Patterns are case-sensitive. The pattern /Blog/* will NOT match the URL /blog/my-post. Make sure your pattern’s letter casing matches the actual URLs on your website. If your site uses mixed casing, consider using the URL Rewriting feature to convert all URLs to lowercase before the patterns are applied.
Note: Do not use regular expressions (regex). Only the robots.txt wildcard syntax described above is supported. Writing regex patterns like ^/blog/.*$ will not work as expected.
Internal URL Exclusions
The first sub-tab, and the one you’ll likely use most often. This is where you tell Sitebulb which internal pages or sections to skip during the crawl.
There are two separate controls on this tab: path exclusions and query parameter exclusions.
Exclude Internal URL Paths
This is the main text area where you enter URL patterns for pages you want to exclude.
What It Does
Any internal URL that matches one of your patterns will not be crawled. Sitebulb will still record that the URL exists (if it finds a link to it), but it won’t visit the page, download its content, or follow any links on it.
When You’d Use It
Faceted navigation on e-commerce sites: Product listing pages often generate thousands of filtered URLs (/shoes/filter/colour/red/size/10). These are rarely useful to audit and can dramatically slow down your crawl.
User-generated content sections: Comment pages, forum threads, or review pages that you don’t need to audit.
Staging or test content: URLs like /staging/* or /test/* that shouldn’t be part of a production audit.
Paginated archives: If you have deep pagination (/blog/page/2, /blog/page/3, …, /blog/page/500), you might want to exclude it.
Search result pages: Internal search generates unique URLs for every query, which can make your crawl balloon in size.
Examples of What TO Do
Scenario | Pattern | Explanation |
Exclude all blog posts | /blog/* | Matches anything starting with /blog/ |
Exclude faceted navigation | */filter/* | Matches /shoes/filter/colour/red and any other URL containing /filter/ |
Exclude search results | /search* | Matches /search and any URL starting with /search, including query variations |
Exclude a specific page | /old-page$ | The $ ensures only this exact path is excluded, nothing beneath it |
Exclude multiple language versions of a section | */archive/* | Matches /en/archive/, /fr/archive/ and any other prefixed archive section |
Examples of What NOT to Do
Mistake | Why it’s wrong | Correct version |
products/* | Missing leading / - URL paths always start with a slash | /products/* |
https://www.example.com/blog/* | Patterns don’t include the protocol or domain | /blog/* |
/Blog/* (when the site uses /blog/) | Patterns are case-sensitive | /blog/* |
^/blog/.*$ | This is regex, not robots.txt syntax | /blog/* |
/p* | This matches /pages, /privacy-policy and /pricing as well as /products | /products/* |
Exclude Internal URL Query String Parameters
Below the path exclusion text area, you’ll find a separate section for handling query string parameters. Query strings are the parts of URLs that come after the ? - for example, in /products?sort=price&page=2, the query parameters are sort and page.
The “Crawl Parameters” Checkbox
By default, Sitebulb does crawl URLs with query parameters. This means /products, /products?sort=price, and /products?sort=name are all treated as separate pages.
If you untick “Crawl Parameters”, Sitebulb will ignore all URLs that contain query parameters. Only the base URL (/products) would be crawled.
Safe Query String Parameters
When you untick “Crawl Parameters”, a new text area appears: Safe Query String Parameters. This lets you make exceptions. Enter parameter names (one per line) that you still want Sitebulb to crawl.
For example, if your site uses page for pagination and you want to crawl paginated pages but nothing else:
Untick “Crawl Parameters”
In the Safe Query String Parameters box, type:
page
Now Sitebulb will crawl /products?page=2 but skip /products?sort=price or /products?sessionid=abc123.
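The safe-list logic can be sketched like this. It is an illustration of the behaviour described above, not Sitebulb's code, and the safe parameter name page is taken from the example:

```python
from urllib.parse import urlsplit, parse_qsl

def should_crawl(url, crawl_parameters=False, safe_params=("page",)):
    """With 'Crawl Parameters' unticked, a URL with a query string
    is only crawled if every parameter is on the safe list."""
    query = urlsplit(url).query
    if not query or crawl_parameters:
        return True
    return all(name in safe_params for name, _ in parse_qsl(query))

print(should_crawl("https://example.com/products?page=2"))        # True
print(should_crawl("https://example.com/products?sort=price"))    # False
print(should_crawl("https://example.com/products"))               # True: no query string
```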
When You’d Use It
Tracking parameters: Marketing campaigns add parameters like utm_source, utm_medium, fbclid, gclid. These create thousands of duplicate URLs.
Session IDs: Some older sites add session identifiers to URLs.
Sort and filter parameters: E-commerce sites often have sort, order, view parameters that create duplicate content.
Pagination: If you want to crawl paginated pages but nothing else with parameters, use the safe list.
Note: If your site relies heavily on query parameters for navigation (/products?category=shoes), be careful with this setting. Turning off parameter crawling could cause Sitebulb to miss important pages.
External URL Exclusions
This tab works identically to the path exclusion part of Internal URL Exclusions, but for external URLs - links pointing to other websites.
What It Does
When Sitebulb finds links to external websites, it normally records them and checks their HTTP status (to find broken links, for example). External URL exclusions tell Sitebulb to skip checking certain external URL patterns entirely.
When You’d Use It
Known redirect services: If your site links to a URL shortener or redirect service that always returns the same status, you might want to exclude it.
Affiliate links: Links to affiliate networks that you don’t need to audit.
Social media profile links: If you link to social media from every page, these will appear thousands of times in your audit.
Login-required external pages: External pages behind login walls that always return errors.
Examples
Scenario | Pattern | Explanation |
Exclude all links to a specific path on an external site | /partner-content/* | Skips external URLs whose path starts with /partner-content/ |
Exclude tracking redirect URLs | *?redirect= | Skips external URLs with a redirect query parameter |
Exclude specific external page | /terms$ | Skips only this exact path on external sites |
Note: Remember, these patterns match against the path and query string of external URLs, not the domain. If you want to exclude an entire external domain, use the External Domain Exclusions tab instead.
Internal URL Inclusions
Inclusions are like a special case of exclusions. Instead of saying “crawl everything except these”, inclusions say “crawl only these and nothing else.”
What It Does
When you add inclusion patterns, Sitebulb will only crawl internal URLs that match at least one of your patterns. Everything else is ignored. Think of it as putting a spotlight on just the pages you care about.
When You’d Use It
Auditing a single section: You only want to audit the /blog/ section of a large website.
Auditing a specific language version: You only want to audit /en/ pages on a multilingual site.
Focused technical audit: You want to check only product pages (/products/*) for structured data issues.
New section launch: You just launched a new /resources/ section and want to audit only that.
Examples
Scenario | Pattern | Explanation |
Only crawl the blog | /blog/* | Only URLs starting with /blog/ are crawled |
Only crawl the English version | /en/* | Only URLs under the /en/ prefix are crawled |
Only crawl products and categories | /products/* and /categories/* (one per line) | URLs matching either pattern will be crawled |
Important: Your Start URL Must Link to Included Pages
This is a crucial point that catches many people out. Sitebulb starts crawling from your start URL (usually your homepage) and follows links. If your homepage doesn’t contain any links to URLs that match your inclusion patterns, the crawl will find nothing.
Example of the problem:
You set an inclusion pattern of /blog/*
Your homepage links to /about, /products, /contact - but not directly to any /blog/ page
Result: Sitebulb never finds a matching page and the crawl returns almost no results
Solutions:
Make sure your start URL links to at least one page matching your inclusion pattern
Use the URL Seed List tab to provide direct URLs that match your patterns
Change your start URL to a page within the included section (set your start URL to https://example.com/blog/)
Links on Included Pages
When you set inclusion patterns, a second text area appears: Links on Included Pages. This is an optional, more advanced feature.
What It Does
Normally, when you use inclusion patterns, Sitebulb only crawls pages that match those patterns. But pages within your included section might link to pages outside it - for example, a blog post might link to a product page. Without this setting, those linked product pages would be ignored.
“Links on Included Pages” lets you define a secondary set of patterns. If a page matching your primary inclusion patterns links to a page matching these secondary patterns, that linked page will also be crawled - even though it doesn’t match the primary inclusion rules.
When You’d Use It
Blog audit with linked product pages: You’re auditing /blog/* but also want to check any product pages that blog posts link to.
Section audit with cross-links: You’re auditing /resources/* but resources link to /tools/* pages that you also want to include.
Example
Primary inclusion pattern:
/blog/*
Links on Included Pages pattern:
/products/*
Result: Sitebulb crawls all /blog/* pages. For any /products/* page linked from a blog post, Sitebulb will also crawl that product page.
Note: Secondary inclusion only works one level deep. If a secondary-included product page links to another page (say /reviews/product-123), that review page will NOT be crawled unless it matches the primary inclusion rules. The secondary patterns only rescue pages directly linked from primary-included pages.
URL Seed List
The URL Seed List is conceptually different from the other tabs. Instead of patterns that match many URLs, you provide specific, complete URLs that you want Sitebulb to visit.
What It Does
Seed URLs are added to the crawl queue alongside the URLs Sitebulb discovers by following links. The crawler will visit each seed URL and then follow any links it finds on those pages (subject to your other inclusion/exclusion rules).
When You’d Use It
Orphan page detection: You have pages that aren’t linked from anywhere on your site, but you want them included in the audit.
Supporting inclusion rules: Your inclusion patterns target a section that isn’t linked from your start URL (see the note in Internal URL Inclusions above).
Specific page audit: You have a list of exact URLs you want checked.
Sitemap-like input: You exported a list of URLs from another tool and want to make sure they’re all crawled.
Deep pages: Important pages buried deep in your site architecture that the crawler might not reach within its depth limit.
How to Use It
Enter one full URL per line, including the protocol and domain:
https://www.example.com/blog/important-post
https://www.example.com/products/featured-item
https://www.example.com/landing-page-2024
Important Differences from Patterns
Seed URLs | Pattern Rules |
Full URLs with protocol and domain | Path-only patterns without protocol or domain |
Match a single specific page | Match many pages at once |
Add pages to the crawl | Remove pages from the crawl (exclusions) or restrict the crawl (inclusions) |
The crawler will follow links found on these pages | Patterns don’t cause the crawler to visit new pages, they just filter |
Note: Seed URLs are still subject to your exclusion rules. If you add a seed URL that matches an exclusion pattern, it will still be excluded. Make sure your seed URLs don’t conflict with your exclusion rules.
URL Rewriting
URL Rewriting transforms URLs before the crawler processes them. This is different from exclusions - instead of skipping URLs, it changes them into a normalised form.
What It Does
URL rewriting modifies URLs as Sitebulb discovers them, before any crawl decisions are made. This helps consolidate duplicate URLs that only differ in casing or unnecessary parameters.
Settings
There are two catch-all settings you can enable:
1 - Convert All URLs to Lowercase
When enabled, every internal URL is converted to lowercase before being processed.
When you’d use it: Your site treats /Products/Shoes and /products/shoes as the same page, but the links on your site use inconsistent casing. Without this setting, Sitebulb would crawl both as separate pages.
Example:
Before rewriting: /Products/Shoes, /products/shoes, /PRODUCTS/SHOES
After rewriting: all become /products/shoes (crawled once)
Note: Only enable this if your server genuinely treats URLs as case-insensitive. On Linux servers, /Products/ and /products/ are often different pages. If in doubt, test by visiting both URLs in a browser.
2 - Remove All Query String Parameters
When enabled, everything after the ? in a URL is stripped.
When you’d use it: Your site appends many different parameters that don’t change the page content, and you want a clean audit without parameter variations.
Example:
Before rewriting: /products/shoes?sort=price&utm_source=google&ref=homepage
After rewriting: /products/shoes
Parameters to Remove
Instead of removing all parameters, you can specify individual parameter names to remove. Enter one per line.
When you’d use it: You want to keep meaningful parameters (like page for pagination) but remove tracking parameters.
Example - parameters to remove:
utm_source
utm_medium
utm_campaign
fbclid
gclid
ref
With these settings, the URL /blog/post?page=2&utm_source=google&fbclid=abc becomes /blog/post?page=2.
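The same transformation can be sketched with the standard library. This is an illustration of the rewriting behaviour, not Sitebulb's implementation; the parameter names are the ones listed above:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

REMOVE = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid", "ref"}

def strip_params(url, remove=REMOVE):
    """Drop the named query parameters, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in remove]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("https://example.com/blog/post?page=2&utm_source=google&fbclid=abc"))
# -> https://example.com/blog/post?page=2
```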
Live Test Tool
The URL Rewriting tab includes a live preview tool. Enter a URL and see immediately how your rewriting rules would transform it. Use this to verify your settings before starting the audit.
Note: URL rewriting happens early in the processing pipeline, before inclusion and exclusion patterns are evaluated. This means your patterns should match the rewritten form of URLs, not the original. For example, if you enable lowercase conversion, write your exclusion patterns in lowercase.
External Domain Exclusions
While External URL Exclusions (described above) let you exclude specific URL paths on external sites, External Domain Exclusions let you block entire domains.
What It Does
Any external links pointing to excluded domains will be skipped entirely. Sitebulb won’t check their status, follow their redirects, or include them in your audit data.
There are two separate lists:
Links (External HTML Sources): Excludes domains for external page links (the URLs you see in <a href=""> tags).
Page Resources: Excludes domains for external resources like images, scripts, and stylesheets.
When You’d Use It
Known safe external domains: Domains you link to frequently that you don’t need to audit (your own CDN).
Slow external domains: External sites that respond slowly and are slowing down your crawl.
Internal tool domains: Links to internal tools or intranets that aren’t publicly accessible.
Resource CDNs: Block external resource domains you don’t care about checking.
How to Use It
Enter one domain per line, without the protocol:
cdn.example.com
tracking.analytics-service.com
Wildcard Subdomains
You can use *. at the beginning to match all subdomains of a domain:
*.analytics-service.com
This matches tracking.analytics-service.com, cdn.analytics-service.com, www.analytics-service.com, and any other subdomain.
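A sketch of how this matching can be reasoned about (illustrative only, not Sitebulb's matching code; whether *. also matches the bare base domain is an assumption left out here):

```python
def domain_excluded(host, rules):
    """A plain entry matches that exact host; a '*.' prefix
    matches any subdomain of the base domain."""
    for rule in rules:
        if rule.startswith("*."):
            # '*.example.com' -> host must end with '.example.com'
            if host.endswith(rule[1:]):
                return True
        elif host == rule:
            return True
    return False

print(domain_excluded("tracking.analytics-service.com", ["*.analytics-service.com"]))  # True
print(domain_excluded("analytics-service.net", ["*.analytics-service.com"]))           # False
```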
Examples
Scenario | What to enter | Where to enter it |
Stop checking links to your partner site | partner-site.com | Links (External HTML Sources) |
Block a slow CDN from being checked | cdn.slow-provider.com | Page Resources |
Block all subdomains of an analytics provider | *.analytics-service.com | Page Resources |
Note: This tab uses plain domain names, NOT the robots.txt pattern syntax used in other tabs. Don’t add paths, protocols, or wildcards other than the *. prefix.
Third Party Resources
This tab controls which third-party resources (scripts, stylesheets, images, fonts, etc.) the crawler loads when visiting your pages. Blocking unnecessary resources can significantly speed up your audit.
What It Does
When Sitebulb visits a page, it can load all the external resources that page references - JavaScript files, CSS stylesheets, fonts, images, tracking pixels, and more. Many of these resources are third-party services (analytics, advertising, chat widgets, consent banners) that aren’t relevant to your SEO audit.
Blocking these resources makes the crawl faster and reduces bandwidth usage without affecting the audit data you care about.
Resource Blocking Level
A dropdown at the top provides four preset levels:
Level | What it blocks |
All (Block Everything) | Blocks all third-party resources: ads, tracking, analytics, social media, consent banners, chat widgets, fonts, images, and media |
Third Party | Blocks common third-party service categories |
Tracking Only | Blocks only tracking and advertising scripts |
None (Block Nothing) | No resources are blocked - everything loads normally |
For most audits, Third Party or Tracking Only is a good starting point. Use None only if you specifically need to analyse third-party resource loading (for page speed audits that depend on third-party scripts).
Custom Rules
Below the preset dropdown, three sub-tabs let you fine-tune the blocking:
Domain Exclusions
Enter domains (one per line) whose resources should be blocked. No protocol needed.
ads.doubleclick.net
pixel.facebook.com
URL Path Exclusions
This sub-tab uses a different pattern syntax from the rest of the Include & Exclude URLs feature. Here, you use glob-style patterns that include the protocol:
*://*.tracking-service.com/*
*://ads.example.com/pixel*
The *:// matches any protocol (http or https). The * works as a wildcard for any characters, similar to the robots.txt syntax, but the full URL including protocol is used here.
Pattern | What it matches |
*://*.tracking-service.com/* | Any URL on any subdomain of tracking-service.com |
*://ads.example.com/pixel* | Any URL starting with http(s)://ads.example.com/pixel |
*://*.cdn.example.com/*.woff* | Font files from any subdomain of cdn.example.com |
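Python's fnmatch module implements essentially this glob style, so it is handy for sanity-checking a pattern before you rely on it (a convenient approximation, not a guarantee of identical behaviour):

```python
from fnmatch import fnmatchcase

# Glob patterns here match the FULL URL, protocol included -
# unlike the path-only robots.txt patterns used in the other tabs.
print(fnmatchcase("https://cdn.tracking-service.com/t.js",
                  "*://*.tracking-service.com/*"))   # True
print(fnmatchcase("/tracking-service.com/t.js",
                  "*://*.tracking-service.com/*"))   # False: no protocol part
```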
Allowed Domains
The reverse of domain exclusions - domains entered here will never be blocked, even if they match a blocking category from the preset level. Use this for third-party services that are essential to how your pages render.
essential-service.example.com
required-cdn.example.com
When you’d use it: You’ve set the blocking level to “Third Party” but your site uses a third-party JavaScript framework from a CDN that’s needed for the page to render properly. Add that CDN to the allowed list.
Note: The URL Path Exclusions sub-tab within Third Party Resources uses a different pattern format (full URL with protocol, glob-style) compared to the robots.txt style syntax used in the Internal/External URL Exclusions tabs. Don’t mix up the formats.
Exclude Link Settings
This tab works fundamentally differently from all the others. Instead of matching URLs, it matches HTML elements on the page using CSS selectors.
What It Does
When Sitebulb parses a page, it extracts all the links it finds. Exclude Link Settings lets you tell Sitebulb to ignore links found inside specific HTML containers. The links are skipped entirely - the crawler won’t follow them or record them.
This operates at the HTML parsing stage, before any URL-level rules are applied.
When You’d Use It
Faceted navigation: E-commerce sites often have filter sidebars that generate thousands of link variations. Instead of writing complex URL patterns, you can simply exclude the sidebar container.
Related products widgets: “You might also like” sections that link to hundreds of other products.
Comment sections: User comments containing links you don’t want in your audit.
Footer mega-menus: If your footer contains hundreds of links that you don’t need to crawl.
Tag clouds or category lists: Navigation widgets that create many low-value links.
How to Use It
Enter standard CSS selectors, one per line. If you’ve ever used CSS or browser developer tools, these will be familiar. If not, here are the most useful types:
Selector | What it targets | Example |
.class-name | Any element with that CSS class | .facet-navigation |
#id-name | The element with that specific ID | #related-posts |
element | All elements of that type | aside |
element[attribute] | Elements matching an attribute value | div[class*="sidebar"] |
Practical Examples
E-commerce site with faceted navigation: Your product listing page has a sidebar with class facet-navigation that contains hundreds of filter links. Add:
.facet-navigation
All links inside any element with the class facet-navigation will be ignored.
Blog with a related posts widget: Your blog has a <div id="related-posts"> section at the bottom of each article. Add:
#related-posts
Multiple elements to exclude:
.facet-nav
#related-products
.sidebar-widget
.comment-section
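To illustrate the idea (this is not Sitebulb's implementation), here is a stdlib-only Python sketch that extracts link hrefs while skipping anything inside a container with an excluded class. It supports only simple class-based exclusion and assumes well-formed HTML where every opened tag is closed:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values, ignoring links nested inside any element
    whose class list includes an excluded class. A sketch only: it
    handles '.class'-style exclusions, not full CSS selectors."""

    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.depth = 0     # tag-nesting depth inside an excluded container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self.depth or classes & self.excluded:
            self.depth += 1    # entering, or nested inside, an excluded container
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

html = ('<div class="facet-navigation"><a href="/shoes/filter/red">Red</a></div>'
        '<a href="/blog/">Blog</a>')
parser = LinkExtractor(["facet-navigation"])
parser.feed(html)
print(parser.links)   # ['/blog/']
```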
How to Find the Right CSS Selector
Open your website in a browser (Chrome, Firefox, Edge)
Right-click on the element containing the links you want to exclude
Select “Inspect” or “Inspect Element”
Look at the HTML element that wraps the links
Note its class name (after class=") or ID (after id=")
Use .class-name or #id-name as your selector
Note: These are standard CSS selectors, not URL patterns. Don’t enter URLs or robots.txt patterns here.
Note: Links excluded this way are completely invisible to the crawl. They won’t appear in any reports and the crawler won’t follow them. Make sure you’re not accidentally excluding important navigation.
Subdomain Options
This is a separate top-level tab in the audit setup sidebar (not a sub-tab of Include & Exclude URLs), but it’s closely related to URL inclusion and exclusion behaviour.
What It Does
When Sitebulb encounters links to subdomains of your primary domain (blog.example.com when auditing www.example.com), this setting controls how those subdomains are treated.
The Three Options
Option | What it does |
Check HTTP Status (default) | Sitebulb checks whether the subdomain URLs return a valid response (200, 301, 404, etc.) but doesn’t crawl them as part of your site. They appear as external links. |
Audit and Report | Sitebulb treats subdomain URLs as internal pages and fully crawls them. Use this when your subdomains are part of the same website (for example, blog.example.com and www.example.com serving one site). |
Exclude All | Sitebulb completely ignores links to subdomains. They won’t be checked or recorded. |
Including and Excluding Specific Subdomains
Depending on your chosen option, additional text areas appear where you can list specific subdomains:
When “Audit and Report” is selected: You can list specific subdomains to exclude from the audit. All other subdomains will be audited.
When “Exclude All” is selected: You can list specific subdomains to include in the audit. All other subdomains will be excluded.
Enter subdomain names (without the main domain), one per line:
blog
shop
help
Note: When both inclusion and exclusion subdomain lists are in play, inclusions take priority over exclusions. If a subdomain appears in both lists, it will be included.
When You’d Use It
Multi-subdomain website: Your brand has www., blog., shop., and help. subdomains and you want to audit them all together. Choose “Audit and Report.”
Single focus: You only want to audit www.example.com and don’t care about other subdomains. Choose “Exclude All.”
Selective subdomain audit: You want to audit www. and blog. but not staging. or dev.. Choose “Audit and Report” and exclude staging and dev.
How Rules Work Together
When you configure multiple rules across different tabs, Sitebulb evaluates them in a specific order. Understanding this order helps you avoid unexpected behaviour.
Evaluation Order
For every URL the crawler encounters, Sitebulb follows this decision process:
URL Rewriting is applied first - the URL is transformed (lowercase, parameter removal) before any other rules are checked
Has this URL already been scheduled or excluded? If yes, skip it (unless secondary inclusion applies)
Is this an external URL on an excluded domain? If yes, skip it
Does this external URL match an external URL exclusion pattern? If yes, skip it
Is this a third-party resource that should be blocked? If yes, skip it
Does this internal URL match an internal URL exclusion pattern? If yes, skip it
Is this a parameterised URL, and are parameters disallowed? If yes, skip it
Are inclusion rules active? If yes, the URL must match an inclusion pattern (or be rescued by secondary inclusion)
The Golden Rule: Exclusions Beat Inclusions
If a URL matches both an exclusion pattern and an inclusion pattern, the exclusion wins. Exclusions are evaluated first, and an excluded URL will not be crawled regardless of any inclusion rules.
Example of a conflict:
Exclusion pattern: */archive/*
Inclusion pattern: /blog/*
URL: /blog/archive/2023/post
Result: Excluded - because it matches the exclusion pattern
To avoid this, make sure your exclusion and inclusion patterns don’t overlap in unintended ways.
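The decision logic above can be sketched in a few lines. This is an illustration of "exclusions beat inclusions", not Sitebulb's code; fnmatchcase approximates the robots.txt-style * wildcard for patterns ending in *, and is case-sensitive like Sitebulb's matching:

```python
from fnmatch import fnmatchcase

def crawl_decision(path, exclusions, inclusions):
    """Exclusions are checked first and always win; when inclusion
    rules are active, the URL must also match one of them."""
    if any(fnmatchcase(path, p) for p in exclusions):
        return False
    if inclusions:
        return any(fnmatchcase(path, p) for p in inclusions)
    return True

print(crawl_decision("/blog/archive/2023/post", ["*/archive/*"], ["/blog/*"]))
# -> False: the exclusion wins even though the inclusion also matches
```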
The Exception: Secondary Inclusion
The one scenario where an excluded URL can be “rescued” is through secondary inclusion (“Links on Included Pages”). If a URL was excluded but it matches a secondary inclusion pattern AND it was linked from a primary-included page, the secondary inclusion can override the exclusion.
However, this only works one level deep and is a niche scenario. For most users, the simpler rule is: exclusions always win.
CSS Selectors Run First
The Exclude Link Settings (CSS selectors) operate at the HTML parsing stage - before URLs even enter the scheduling pipeline. If a link is inside an excluded CSS selector, the URL is never extracted from the page in the first place, so none of the URL-based rules even see it.
URL Rewriting Affects Pattern Matching
Because URL rewriting happens before pattern matching, your patterns need to match the rewritten URLs, not the originals.
Example:
You enable “Convert all URLs to lowercase”
Your exclusion pattern is /Blog/*
A URL /Blog/my-post is rewritten to /blog/my-post
Result: Your pattern /Blog/* does NOT match /blog/my-post because of the case difference
Fix: Change your pattern to /blog/*
Common Mistakes and How to Avoid Them
1. Missing the Leading Forward Slash
The mistake: Writing products/* instead of /products/*.
Why it’s wrong: All URL paths start with /. Without it, the pattern has no way to match the beginning of a path. The crawler will never find a URL whose path starts with products (no slash).
The fix: Always start your patterns with / (to match from the beginning of the path) or * (to match anywhere in the path).
2. Forgetting That Patterns Are Case-Sensitive
The mistake: Writing /blog/* when your site’s URLs use /Blog/my-post.
Why it’s wrong: Pattern matching is case-sensitive. /blog/* and /Blog/* are completely different patterns.
The fix: Check the actual URLs on your site (look in the browser address bar). Match the exact casing. If your site uses inconsistent casing, enable “Convert all URLs to lowercase” in the URL Rewriting tab and write all your patterns in lowercase.
3. Setting Inclusion Rules Without a Path to Matching Pages
The mistake: Adding an inclusion pattern for /resources/* when your start URL (homepage) doesn’t link to any /resources/ page.
Why it’s wrong: Sitebulb follows links from your start URL. If no link leads to a page matching your inclusion pattern, the crawler discovers nothing.
The fix: Either change your start URL to a page within the included section, or add seed URLs (in the URL Seed List tab) that match your inclusion patterns.
4. Not Realising Exclusions Override Inclusions
The mistake: Adding an exclusion for */archive/* and then wondering why /blog/archive/2023/post isn’t crawled, even though /blog/* is in your inclusions.
Why it’s wrong: Exclusion rules are always evaluated first and always win.
The fix: Review both your exclusion and inclusion lists together. If you want certain URLs to be crawled, make sure they don’t match any exclusion pattern. Remove or narrow the conflicting exclusion.
5. Using the Wrong Syntax in the Wrong Tab
The mistake: Entering a full URL with protocol in the Internal URL Exclusions tab, or entering a path-only pattern in the Third Party Resources URL Path Exclusions.
Why it’s wrong: Different tabs use different formats:
Most tabs: path-only patterns using robots.txt syntax
Third Party Resources (URL Path Exclusions): full URL patterns with protocol (*://)
External Domain Exclusions: plain domain names
Seed URLs: full URLs with protocol and domain
Exclude Link Settings: CSS selectors
The fix: Refer to the specific tab’s instructions and use the correct format for that tab.
6. Overly Broad Patterns
The mistake: Using /p* to exclude /products/ but accidentally also excluding /pages/, /privacy-policy, and /pricing.
Why it’s wrong: * is greedy and matches everything. A short prefix pattern like /p* will match far more URLs than you intend.
The fix: Be as specific as possible. Use /products/* instead of /p*. Test your patterns mentally against URLs you do want to crawl to make sure they won’t be accidentally caught.
7. Forgetting That Secondary Inclusion Is One Level Deep
The mistake: Setting up secondary inclusion expecting it to cascade through multiple levels of linked pages.
Why it’s wrong: If Page A (primary included) links to Page B (secondary included), and Page B links to Page C, Page C is NOT automatically included. Only the direct links from primary-included pages are rescued by secondary inclusion.
The fix: If you need deeper cascading, expand your primary inclusion patterns instead of relying on secondary inclusion.
Quick Reference
Summary of All Tabs
Tab | What it controls | Pattern format | Input type |
Internal URL Exclusions | Which internal pages to skip | Path-only, robots.txt syntax | Patterns (one per line) |
Internal URL Exclusions (Parameters) | Whether to crawl URLs with query strings | Parameter names | Names (one per line) + checkbox |
External URL Exclusions | Which external URL paths to skip checking | Path-only, robots.txt syntax | Patterns (one per line) |
Internal URL Inclusions | Restrict crawl to only matching pages | Path-only, robots.txt syntax | Patterns (one per line) |
Links on Included Pages | Extra pages to crawl one level deep from included pages | Path-only, robots.txt syntax | Patterns (one per line) |
URL Seed List | Specific URLs to add to the crawl | Full URLs with protocol and domain | URLs (one per line) |
URL Rewriting | Normalise URLs before processing | Parameter names | Names (one per line) + checkboxes |
External Domain Exclusions | Entire domains to skip for external links/resources | Domain names (with optional *. prefix) | Domains (one per line) |
Third Party Resources | Block third-party scripts, styles, fonts, etc. | Preset levels + domain names + glob-style URL patterns with protocol | Mixed (dropdown + domains + patterns) |
Exclude Link Settings | Ignore links found inside specific page elements | CSS selectors | Selectors (one per line) |
Subdomain Options | How to treat links to your subdomains | Subdomain names (without main domain) | Dropdown + names (one per line) |
Pattern Syntax Cheat Sheet
Pattern | Meaning | Example match |
/blog/* | Anything starting with /blog/ | /blog/my-post |
*/blog/* | Anything containing /blog/ | /en/blog/my-post |
/women$ | Only this exact path, nothing after it | /women |
?page= | Any URL with this query parameter | /products?page=2 |
/blog/*/comments | Path with any middle segment | /blog/my-post/comments |
Default Behaviour (No Rules Configured)
When you don’t configure any inclusion or exclusion rules:
All internal pages are crawled (up to your page limit and crawl depth)
All external links are checked for HTTP status
URLs with different query parameters are treated as separate pages
Third-party resources are loaded normally
All subdomains are checked for HTTP status but not fully crawled
Robots.txt rules on your site are still respected (this is separate from user-defined rules)