Sitebulb's content extraction feature allows you to configure the crawler to collect custom data points as it crawls. These will be presented in your audit in addition to the data the crawler collects by default, such as H1s, H2s, title tags, meta descriptions, etc.
This guide covers the basic process for setting up content extraction within Sitebulb, including several examples.
Adding content extraction to your website audit
To get started, navigate to your Audit Settings and enable Content Extraction in the corresponding settings tab.
Once you have toggled Content Extraction on, use the green Add Rule button to set and test your extraction rules.
The next step will look slightly different depending on whether you are using Sitebulb Desktop or Sitebulb Cloud, but the key features of the Add Rule window are the same on both. At the top of the window, you will see the following tabs:
Rule - Use the Rule tab to set up your extraction parameters. On this tab, you will find three fields to fill in:
Rule Name: This name is used as a column heading in reports.
CSS Selector Path: This is the CSS selector that Sitebulb will use to identify the element to extract.
Extract Data using: This determines whether Sitebulb should extract data using text, HTML, a specific attribute, or a regex pattern.
Data - On the Data tab, you can determine the operation Sitebulb will perform on your selected element and the type of data the content extraction should return.
Operation: This setting determines which extraction operation Sitebulb should perform - extracting the first matched item, all matched items, a word count, or simply indicating that a match exists.
Type of Data: This determines what type of data Sitebulb will return, which may be text, numbers, URLs, etc. (see the sketch after this list for how these options fit together).
URLs - Use the URLs tab to determine on which pages Sitebulb should perform the content extraction.
Test - The Test tab will ONLY be available when setting up content extraction on Sitebulb Desktop.
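If it helps to think of these settings in code, here is a rough analogue of what they control. This is a minimal Python/BeautifulSoup sketch, not Sitebulb's implementation; the HTML, selector, and values are invented for illustration.

```python
# Illustrative only: a rough BeautifulSoup analogue of the Rule/Data settings.
# The HTML, selector, and rule below are invented for this example.
import re
from bs4 import BeautifulSoup

html = '<div class="price" data-sku="A1">£4.99</div><div class="price">£7.50</div>'
soup = BeautifulSoup(html, "html.parser")

# CSS Selector Path: which elements to look at
matches = soup.select("div.price")

# 'Extract Data using': what to pull from a matched element
text_value = matches[0].get_text(strip=True)          # Text -> '£4.99'
inner_html = matches[0].decode_contents()             # HTML -> '£4.99'
attribute = matches[0].get("data-sku")                # Attribute -> 'A1'
regex_hit = re.search(r"[\d.]+", text_value).group()  # Regex -> '4.99'

# 'Operation': how many results to keep
first_item = matches[0].get_text(strip=True)           # First matched item
all_items = [m.get_text(strip=True) for m in matches]  # All matched items
word_count = len(first_item.split())                   # Word count
match_exists = len(matches) > 0                        # Match exists
```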
Content Extraction setup on Sitebulb Desktop
When using Sitebulb Desktop, the Add Rule button will open up the on-screen rule wizard, which allows you to easily select the element you want to extract by loading in an example URL and using the point-and-click feature.
To set up your content extraction rule on Desktop, follow these steps:
Enter an example URL and click the green Go button to load it in. This URL should be an example of the kind of pages you wish to collect data from.
Once the page has loaded, find the element you want to scrape, then point and click to select it. This will automatically fill in the 'CSS Selector Path' field.
Enter a name for this rule - you want to choose a name that meaningfully reflects the data you will be collecting. Be succinct, as the Rule name will be used as a column heading in reports.
Now you can configure the remaining settings if necessary. The default settings should work for most basic text extractions.
Check the 'Test' tab to ensure that the data extracted is what you expected.
Click Add Rule.
Super straightforward.
Here's a quick gif example showing me extracting the 3rd breadcrumb from one of our documentation pages:
Once you've added your rule, you can stop there, or just keep adding more rules. You will see all your rules in the audit setup page, ready for you to start the audit.
Once you're done adding rules and any other audit setup configurations, hit Start Now at the bottom right of the screen to start the audit.
Content Extraction Setup on Sitebulb Cloud
The on-screen Content Extraction wizard is only available in the desktop version, as the technology used to pull through the live view of the website is not compatible with running Sitebulb Cloud in your browser.
The Content Extraction setup features are the same as the ones on Desktop, but you will not have the option to point-and-click to choose the CSS selector path.
If you would prefer to use the point-and-click setup wizard, you can connect to the Cloud server through your desktop application in order to use the Content Extraction helper while setting up your project.
How to manually choose your CSS Selector path
Alternatively, you can manually input your CSS selector path. Follow the steps below to find the correct selector:
Decide what element you'll need to extract
Use Chrome Dev Tools to identify the CSS selector. To do this, navigate to the element you want to extract > right click > Inspect
Once you have identified the element that you are trying to extract, right click > Copy > Copy selector
Paste this selector into the Content Extractor tool (a quick way to sanity-check the selector first is sketched after these steps)
Configure the remaining settings if necessary.
Click Add Rule.
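Chrome's 'Copy selector' output is often over-specific (long chains of nth-child steps) and can end up matching only the exact page you inspected, so it is worth checking it against a couple of URLs first. A minimal sketch of that sanity check, assuming Python with requests and BeautifulSoup installed; the URL and selector are placeholders:

```python
# Quick sanity check that a selector copied from Chrome DevTools matches something.
# Replace the URL and selector with your own; both here are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"
selector = "#main > div.content > h1"   # pasted from Copy > Copy selector

soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
match = soup.select_one(selector)
print(match.get_text(strip=True) if match else "No match - refine the selector")
```

If the selector only matches the page you inspected, simplifying it down to a stable class or id usually makes the rule more robust across the site.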
Viewing extracted data
Once your audit is complete, you will find the Content Extraction report in the left-hand menu.
On the Content Extraction overview, you'll find a list of your extraction rules alongside key data:
Total Found is the total number of matching elements that Sitebulb extracted for each rule.
Found on URLs is the number of unique URLs on which Sitebulb found a match for the rule's selector.
Switching to the URLs tab shows you the URLs alongside the extracted data, with one column per rule on the right.
As always with URL Lists, you can add or remove columns so that you can easily combine technical crawl data with your extracted data. You can also create filters on the data to gain additional insights.
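If you export the URL List to CSV, the same sort of filtering is easy to reproduce outside Sitebulb too. A minimal pandas sketch, assuming a hypothetical export where each rule is a column; the file and column names are placeholders:

```python
# Illustrative post-export analysis; the file and column names are placeholders.
import pandas as pd

df = pd.read_csv("content_extraction_export.csv")

# e.g. find pages where the 'Product Name' rule matched but 'Price' did not
missing_price = df[df["Product Name"].notna() & df["Price"].isna()]
print(missing_price[["URL", "Product Name"]])
```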
Content Extraction use cases
The Content Extraction feature gives you additional flexibility when analysing data. Some key examples include:
Collecting author names and the number of comments on a blog, to understand which writer gets the best engagement.
Collecting breadcrumb links to help categorise and segment page types.
Collecting product and price data from competitor websites for benchmarking.
Extracting the character count of your product descriptions to find pages with short descriptions.
Counting matching elements on a page, for example CTA buttons (both of these are illustrated in the sketch below).
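As a flavour of those last two items, here is what the equivalent logic looks like as a standalone script. In Sitebulb you would get the same results by choosing the count operations rather than writing code; the HTML and selectors below are invented:

```python
# Standalone illustration of the counting/length use cases above.
# Selectors and HTML are invented; Sitebulb's count operations do this for you.
from bs4 import BeautifulSoup

html = """
<p class="description">Short blurb.</p>
<a class="cta">Buy now</a><a class="cta">Subscribe</a>
"""
soup = BeautifulSoup(html, "html.parser")

description = soup.select_one("p.description").get_text(strip=True)
print(len(description))              # character count of the description
print(len(soup.select("a.cta")))     # number of CTA buttons on the page
```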
You should NOT use the content extraction feature to extract all page content. If you're looking to extract full-page content, consider using the Saving Crawl Data features.
Content extraction examples
In the examples below, I will show you some straightforward use cases for content extraction data, and how to set up the content extraction rules for each case. This will make use of the additional data options mentioned above.
Extracting e-commerce product data
Let's say I'm crawling a competitor's e-commerce site and I want to scrape some pricing data. I need to grab the product name and the price.
In the browser window, I load up a product page as my Example URL and use the point-and-click method to select the product name.
You will see the selected element highlighted in red, and the CSS Selector Path field is populated automatically. I will enter 'Product Name' as the Rule Name.
I can verify that the selector is set up correctly by navigating to the Test tab on the far right. Sure enough, the product name correctly appears in the green box, so I am confident with my selector, and I hit the green Add Rule button in the bottom right.
For the next rule I want to scrape the price, so I scroll down on the loaded page and point-and-click on the price to fill in the CSS Selector Path.
You can then proceed to test and add the rule. When you run your audit, you'll now see two columns containing the product name and price for all of the product pages.
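If you wanted to replicate those two rules outside Sitebulb, the equivalent scrape looks roughly like the sketch below. The URL and selectors are placeholders, since yours will come from the point-and-click wizard:

```python
# Rough equivalent of the two rules above; URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example-shop.com/product/123", timeout=10).text
soup = BeautifulSoup(page, "html.parser")

product_name = soup.select_one("h1.product-title")   # 'Product Name' rule
price = soup.select_one("span.price")                # 'Price' rule
print(product_name.get_text(strip=True) if product_name else None,
      price.get_text(strip=True) if price else None)
```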
Extracting blog engagement data
Let's say I've got a popular blog and I want to figure out which of my posts garners the most attention.
I am looking to extract the engagement data, specifically, the number of comments on each blog post.
I use the point and click feature to select the comments metric on the page, and name the rule appropriately.
The 'Test' tab shows me I have the exact text data as displayed on screen:
"1983 Comments"
This is great! Although...
...it would actually be a bit cleaner if I just extracted the number, without the 'Comments' bit.
This is where we can use one of the more advanced customisation options in the Data tab. I will switch the Type of Data option to 'Number':
Now, when we test again, the test result is a number (1983). Perfect!
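Conceptually, the 'Number' option just coerces the matched text to a numeric value, something like pulling the first run of digits out of the string. A rough illustration of that idea (not Sitebulb's actual parsing code):

```python
# Illustrative number coercion, similar in spirit to the 'Number' data type.
import re

raw = "1983 Comments"
match = re.search(r"\d[\d,]*", raw)          # grab the first run of digits
comments = int(match.group().replace(",", "")) if match else None
print(comments)  # 1983
```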
Let's also set up a rule to extract the views data, as this should also be a good barometer for successful content:
I will select the element on the page and name the rule appropriately.
This blog reports 1.5M Views. As the data presented is not strictly a number (1.5M vs 1,500,000), I will leave the Type of Data option set to 'Text'.
Now, once we add the second rule, we can see on the Content Extraction settings page that we have different formats for our two different rules:
Once the analysis has run, we can easily sort or filter the data to find the best-performing content:
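One wrinkle with leaving the views as text: '1.5M' will not sort numerically. If you export the data, a small helper can normalise abbreviated counts first. This is my own post-processing convention for illustration, not a Sitebulb feature:

```python
# Convert abbreviated counts such as '1.5M Views' or '12.3K Views' to integers.
# The K/M suffix handling is an assumption for illustration.
import re

def parse_views(raw):
    match = re.search(r"([\d.]+)\s*([KM]?)", raw, re.IGNORECASE)
    if not match:
        return None
    value, suffix = float(match.group(1)), match.group(2).upper()
    return int(value * {"": 1, "K": 1_000, "M": 1_000_000}[suffix])

print(parse_views("1.5M Views"))   # 1500000
print(parse_views("1983 Views"))   # 1983
```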
Extracting directory listings data
Sometimes this sort of data scraping is useful for activities beyond website auditing - for example, for sales prospecting. Let's say I have my own flour company and I want to try and sell to commercial bakeries. Setting Sitebulb up as a scraper would allow me to collect useful prospecting information from directory listings websites, en masse.
If I pick a site like Yell.com, I could easily scrape URLs for some local bakeries (using a free scraping extension such as Linkclump), then utilise Sitebulb in list mode and add some content extractors. I want to get the business listings URLs, which are URLs of the form https://www.yell.com/biz/business-name/.
I need to enter one of these URLs as the Example URL, then I can start adding my selectors, such as 'Business Name'.
Since I'm prospecting, I might be interested in the phone number:
There is a built-in Type of Data for 'Telephone', which is actually not needed in this example, but might be useful on other sites where the formatting is not so clean:
Adding the 'Address' requires a few clicks before I manage to grab the correct selector, which I can verify with the Test tab:
Finally, we want the website URL. This one requires a bit more work, as we aren't actually interested in the text as displayed on screen; in fact we want to grab a value from the underlying HTML. This time we need to change the 'Extract Data using' option from 'Text' to 'Inner HTML', and then also set the Type of Data to 'URL' from the drop-down menu.
This is where the Test tab really comes into its own, allowing us to verify at each step what data the tool will try to collect.
Again, it takes a bit of clicking around to select the correct element, and we can make use of the pre-built data type for 'URL' to simplify things:
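In code terms, the difference with this last rule is that the URL lives in the markup (typically an href attribute) rather than in the link's visible text. A minimal illustration with invented HTML - Yell's real markup will differ:

```python
# Pulling a URL out of a link's href attribute instead of its visible text.
# The HTML and selector are invented; the real site's markup will differ.
from bs4 import BeautifulSoup

html = '<a class="business-website" href="https://some-bakery.example">Visit website</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("a.business-website")
print(link.get_text(strip=True))   # 'Visit website' - the on-screen text
print(link.get("href"))            # the URL we actually want
```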
This guide covers a lot of the typical use cases and methodologies you will need to set up your extractors correctly. The important thing to realise is that while the point-and-click interface is super useful, you still need to test and make adjustments as you go.