How to crawl large websites
The aim of this post is to provide a methodology for crawling large websites with Sitebulb.
Crawling large websites is a tricky subject, primarily because of the number of unknowns. Until you crawl the website, you may not know if you're working with a 1,000-page website or a 100,000-page website. And that is before you start thinking about embedded resources, external links, and subdomains.
But when you are dealing with a big website, crawl data can be incredibly valuable, as it can reveal patterns and issues that you'd struggle to detect otherwise.
Considerations
Before we jump into our example and recommendations, here are some considerations that should inform how you plan your crawl and audit settings.
The size of your Sitebulb plan
The size of your plan determines the maximum number of URLs you can crawl per Audit. Take this limit into consideration when planning your crawl. When dealing with large websites, you may need to exclude external URLs and/or page resources from your crawl, for example, to ensure that all internal URLs can be audited.
The data you need
We know that SEOs LOVE data. The more, the better, right?
Well, when it comes to crawling huge websites, being measured in your approach and limiting your crawl to the data you need can help you run much faster, more efficient audits. Trust us on this.
We’ve talked about audit settings and how to select the right ones in this dedicated article. Here, we’re focusing specifically on what URLs you want to include in your Audit when crawling a large site. Here are some things to consider:
External URLs - Sitebulb carries out external link analysis by default. That means it will check and report on the HTTP status of any external URLs found as it crawls your website. This may not be a priority for your analysis, so excluding external links from the audit can be a first step to limiting its size.
Subdomains - Just like external URLs, Sitebulb checks the HTTP status of subdomains by default. If you don’t need this data, opt to exclude these from your audit to limit its size.
Page Resources - Consider whether you require data on Page Resources at all, or whether you only require data regarding certain resources. More on this later.
Parameterised URLs - Depending on the setup of your site and navigation, you may encounter large numbers of URLs with query string parameters, which can bloat your crawl. Don't worry if you're unsure about what to choose here; we'll address how to identify and limit these below.
Whole site vs directories - The last thing to consider is which part of the site you will be crawling. Do you need an overview of every single URL, or are you focusing your analysis on a specific area or group of pages?
How large is a large website?
This depends on perspective. An experienced enterprise SEO who is familiar with 1 million+ page websites might see a 5,000-page site as tiny. But to a solo in-house SEO in their first job, it can feel enormous.
For the sake of argument, we'll draw the line at 100,000 URLs. This is the point at which you might need to start thinking about which crawl and analysis options you have switched on in Sitebulb.
In general, for sites smaller than 100,000 URLs, you should be pretty safe turning on whichever crawl options you like (although it’s always worth considering the factors above to ensure you have clean and relevant data).
How to Crawl Large Websites
There are some simple steps we recommend following before hitting the START button on a full crawl of any large website:
Run a Sample Audit - run an audit of a limited number of pages with your desired settings, in order to determine whether the data you're getting is accurate and relevant.
Analyse the data - look at the results Sitebulb has returned and identify what's relevant and what isn't. Are you getting all the data you need? Are you getting irrelevant data (parameterised URLs, irrelevant areas of the site, external URLs, etc.)?
Limit your crawl and adjust your settings accordingly.
Worked example: Patient.info
To illustrate what the steps above mean in practice, we’ll work through an example.
Say I needed to crawl the Patient website, a UK-based health advice site for doctors and patients alike.
If I were actually working with the client, one of my initial questions would be about the scale of the site, but since I'm not, we'll lean on our friend Google.
Using the site: search, we can assume there are at least 281,000 indexable pages on this website. However, this does NOT mean that we will crawl exactly 281,000 pages.
Not included in this total are noindexed URLs, disallowed URLs, canonicalised URLs, page resource URLs, external links, and links to subdomains - yet all of this could end up in the scope of your Audit if you are not careful.
If anything, what Google displays can only really be considered a lower bound for the purposes of planning a crawl.
At this point, we know it's a pretty big site, so we should carry out a Sample Audit before going ahead with a full crawl.
Running a Sample Audit
The point of a Sample Audit is to crawl a small subset of the website in order to get a feel for what will be included in a full audit with your desired settings.
1. Set up a new Project
We first need to set up a new Project in Sitebulb.
2. Select the Sample Audit and set your limits
Under ‘Audit Type,’ select Sample. Once selected, the sample crawl settings will appear, which is how we will limit the crawl.
In this case, we are only going to crawl 10 levels deep, with a maximum of 1,500 URLs at each level (Sitebulb will choose 1,500 random URLs to crawl at each level). That caps the sample at 15,000 URLs at most (10 levels × 1,500).
Once we hit Save and Continue, Sitebulb will go off and perform a number of 'pre-audit checks', such as checking the robots.txt file to make sure we can actually crawl the website in the first place.
3. Select your Audit Data and Settings
You’ll then land on the Audit Settings page, where you can select the audit data you want to include in your final Audit, along with other crawl settings.
Check out this article on choosing the right settings for efficient auditing if you’re unsure.
Choose these settings as you would for your final audit of the site. If you are unsure of what to choose, go with the default settings - these will work for most websites.
4. Monitor Crawl Progress
Once you click ‘Start Now’, Sitebulb will start analysing your pages, and you can monitor the crawl in the Progress screen.
Keeping an eye on the speed will give you an idea of how fast the site can be crawled and, therefore, how long the main audit might take with the resources and limits you have set under Crawler Settings.
In this case, we're at around 8 URLs/second, which is relatively fast.
You can also experiment with increasing the crawl speed to see how fast it will comfortably go (watch out for errors creeping in; this typically means you are going too fast).
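To put a number on it, here's a back-of-the-envelope sketch (plain Python arithmetic, not a Sitebulb feature) converting an observed crawl rate into a rough duration:

```python
# Rough ETA: convert an observed crawl rate into an estimated audit duration.
def eta_hours(total_urls: int, urls_per_second: float) -> float:
    """Hours needed to crawl total_urls at a steady rate."""
    return total_urls / urls_per_second / 3600

# e.g. the ~300,000 internal URLs suggested by the site: search,
# at the 8 URLs/s observed in our sample:
print(f"{eta_hours(300_000, 8):.1f} hours")  # ~10.4 hours
```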
Analysing the Sample Audit Data
Once the Sample Audit has finished running, we can use the data collected to make inferences about how a full Audit would work on the site.
Type of URLs Crawled
One thing we can clearly see is that roughly one page resource URL was crawled for every internal URL.
This means that if we included Page Resources in our main Audit, we would roughly double the number of URLs we need to crawl, so 300,000 suddenly turns into 600,000.
There was also roughly one external URL crawled for every internal one, so across 300,000 internal pages, this would add another 300,000 external URLs.
The point of doing this is to build up a profile of the website crawl and answer the question: 'What would it look like to do a full crawl of everything on the site?'
From there, we can estimate how long a full Audit will take and decide whether we are willing to wait that long for all the data.
Indexability
Another element worth considering is how many indexable pages our sample crawl turned up, which we can get from the Indexability report:
The data in this report is potentially game-changing, because all our assumptions so far are based on how many URLs are currently indexable.
So, for example, if you came across 2 'Not Indexable' pages for every one that is indexable, your crawl would likely be three times bigger than expected.
In this case, it's not such a massive problem, but we still have one Not Indexable page for every 10 Indexable.
We can now build up our profile of what a full Audit might look like:
We have a baseline of around 300,000 URLs from Google site: search
If we decide to crawl page resources, this could add another 300,000 URLs (1 to 1)
External URLs add another 300,000 URLs (1 to 1)
Not Indexable URLs would add another 30,000 URLs (1 to 10)
So if we wanted to crawl everything, we'd be looking at crawling around 930,000 URLs. At a rate of 8 URLs/second, this would take about 32 hours to complete.
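To make the arithmetic explicit, here is the same estimate as a minimal Python sketch, using the ratios observed in our sample (your ratios will differ, so treat these numbers as placeholders):

```python
# Full-crawl estimate built from the Sample Audit ratios above.
baseline = 300_000               # indexable internal URLs (Google site: search)
page_resources = baseline * 1.0  # ~1 page resource URL per internal URL
external = baseline * 1.0        # ~1 external URL per internal URL
not_indexable = baseline * 0.1   # ~1 Not Indexable page per 10 Indexable

total = baseline + page_resources + external + not_indexable
crawl_rate = 8                   # URLs/second, observed during the sample

print(f"{total:,.0f} URLs, ~{total / crawl_rate / 3600:.0f} hours at {crawl_rate} URLs/s")
# 930,000 URLs, ~32 hours at 8 URLs/s
```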
Internal URLs
The final thing to check is the internal URLs crawled by Sitebulb and their relevance to the goal of your audit. This will help us limit the scope of our final Audit.
This is especially relevant to e-commerce websites, where product variants and faceted navigation can create large numbers of parameterised URLs that may not add much value to your analysis.
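If you want to quantify the parameter bloat, a minimal sketch like the one below can help, assuming you've exported the sample's internal URLs to a plain text file (the filename here is hypothetical):

```python
# Tally query-string parameters in a list of crawled URLs to spot bloat.
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

param_counts = Counter()
with open("internal_urls.txt") as f:  # hypothetical export, one URL per line
    for line in f:
        for key, _ in parse_qsl(urlsplit(line.strip()).query):
            param_counts[key] += 1

# The most frequent parameters are the first candidates for exclusion.
for param, count in param_counts.most_common(10):
    print(f"{param}: {count:,}")
```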
Adjusting Settings & Limiting the Audit
At this point, we have a fairly good estimate for how long a full Audit would take to run with the HTML Crawler and our desired settings.
If we wanted to move forward with the full Audit, we could ask ourselves whether we really want to wait this long, or whether we'd be comfortable omitting some of the data.
There's no way to avoid crawling 'Not Indexable' URLs (Sitebulb only knows they're not indexable once it has crawled them...), but we can exclude external URLs and page resources, which would keep the total down closer to the 300,000 we started with.
We’ll go ahead and set up our full crawl by starting a new Project and selecting the default Standard Audit under ‘Audit Type’.
Excluding External Links
You can disable External Link analysis in the Search Engine Optimisation Advanced Settings to stop Sitebulb from checking the HTTP status of links to external websites.
Exclude Subdomains
You can also stop Sitebulb from crawling subdomain URLs by unticking this option in the Subdomain Options tab.
Disabling both External URL and Subdomain checks is the best way to limit the size of the Audit without deliberately excluding internal pages on the site. Of course, whether you are willing to do so will depend on your website configuration and how valuable this data would be if included.
Excluding Page Resources
You could stop Sitebulb from crawling page resources altogether by disabling this setting. Alternatively, open the Advanced Settings panel for more granular control over the types of resources you want Sitebulb to crawl.
Note that the Performance & Mobile Friendly report requires Page Resources to be enabled.
Limiting the crawl to selected Internal HTML URLs
This may mean excluding irrelevant directories or limiting the crawl to a particular area of the site. You can use the Include & Exclude URLs settings to do this.
You can also exclude all parameterised URLs, or limit the crawl to only the Safe Query String Parameters you specify, in order to avoid bloating your data with irrelevant parameterised URLs.
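Before committing to an exclusion rule, it can be worth previewing its impact against the URLs from your Sample Audit. Here's a hedged sketch (the pattern, parameter names, and filename are all hypothetical, and Sitebulb's Include & Exclude settings have their own syntax - check the documentation for that):

```python
# Preview how many sampled URLs a candidate exclusion pattern would remove.
import re

exclude = re.compile(r"[?&](sort|filter|sessionid)=")  # hypothetical parameters

with open("internal_urls.txt") as f:  # hypothetical export from the Sample Audit
    urls = [line.strip() for line in f if line.strip()]

excluded = [u for u in urls if exclude.search(u)]
print(f"Rule would exclude {len(excluded):,} of {len(urls):,} URLs")
```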
Managing Large Crawls on Desktop
A final note on how to manage your computer when running large crawls on Sitebulb Desktop. Ideally, you want to leave Sitebulb to run continuously, complete the crawl, and generate all the reports - this is where Sitebulb Cloud is the ideal choice.
However, with bigger Audits run on Sitebulb Desktop, it may not be possible to leave your computer on for a long period (e.g., you may need to shut your machine down overnight), so the best option is to use the 'Pause' feature to make sure any crawl interruptions are controlled.
While an Audit is in progress, just hit the purple Pause button in the top right.
Once you've done this, wait around 5 seconds for the purple button at the top left to change from 'Pausing' to 'Paused.'
At this point, you'll notice that there is now an option to 'Resume' in the top right.
Once an Audit has been paused, you can close Sitebulb down and shut your machine down safely. Then, when you reopen Sitebulb the next day, you'll see a message informing you that you have an incomplete Audit, so you can jump back in and get it going again.
To see a list of paused audits, click to view the Paused tab:
Hit View Progress to return to the progress screen, where you can then hit Resume to set the Audit running again.
If pausing is not an option, and/or you regularly need to crawl large sites continuously for several days, Sitebulb Cloud may be a better choice for you and your team. It allows you to run large audits uninterrupted and at scale, as well as collaborate on shared projects. More about Sitebulb Cloud here.
A Word on Disk Space
A final consideration when crawling large websites on Sitebulb Desktop is to make sure you have enough space on your hard drive for the Audit. One of the reasons Sitebulb can crawl so many pages is that it writes the data to disk instead of holding it in RAM.
A site with 100,000+ URLs could take up anything from 250 MB to around 2 GB.
There's no easy way to know how much space you'll need before you start the Audit, but bear in mind that the more data you collect, the more disk space will be required.
In particular, 'Link Analysis' can take up a lot of disk space, as the numbers can grow huge very quickly.
To put this into perspective, I crawled a site recently with 1.6 million internal URLs. I crawled it once with 'Link Analysis' switched off, and this took up 6 GB of space on my hard drive. I crawled it again and switched on 'Link Analysis', and it was 36 GB!
Why so large? It had 142,600,000 links! That's why.
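Working backwards from those figures gives a rough per-link footprint, which you can use to sanity-check free disk space before a big crawl (assuming, loosely, that the cost scales linearly with link count):

```python
# Rough per-link disk cost implied by the example above:
# 36 GB with Link Analysis vs 6 GB without, across 142,600,000 links.
GIB = 1024 ** 3
extra_bytes = (36 - 6) * GIB
links = 142_600_000

per_link = extra_bytes / links
print(f"~{per_link:.0f} bytes per link")                      # ~226 bytes/link

# Scale to a hypothetical site with 10 million links:
n = 10_000_000
print(f"~{n * per_link / GIB:.1f} GB extra for {n:,} links")  # ~2.1 GB
```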
TL;DR
Do a Sample Audit.
Analyse the Sample Audit to understand potential timeframes and limitations.
Adjust your crawl settings to limit the Audit (optional).
When running the full Audit, pause and resume if you are unable to leave the crawler running continuously.
Make sure you've got enough disk space on your machine (if running on Sitebulb Desktop).