Authentication for crawling staging sites

How to add authentication details to crawl staging sites with Sitebulb

Updated over 2 months ago

Dev websites and staging servers are often protected by authentication, meaning that you need to log in using a username/password in order to access the page content. These sites may also require you to adjust your robots politeness settings in order to work around rules like root path disallows and noindex directives.

This article takes you through the authentication methods and settings available in Sitebulb that will allow you to successfully crawl staging websites. These can all be found via the Authentication tab on the left-hand menu within Audit Settings.

How to crawl sites with HTTP authentication

When HTTP authentication is in place, the server will 'challenge' a client request and prompt the client to provide authentication information, so you cannot access the page content at all until you authenticate. You will be required to enter your login credentials at the stage where Sitebulb does the domain check, when setting up the Project.

The start URL will be flagged up as 'Unauthorised' during the domain resolution test carried out at the point of setting up a new Project. At this point, you can use the green 'Save and Continue' button to move on to the Audit Settings step, where you can add HTTP Authentication:

Site requires authentication

Once you click 'Save and Continue', a pop-up modal will prompt you to enter the HTTP Authentication credentials.

Enter the HTTP username and password, then hit 'Try Again'. Assuming you have entered them correctly, Sitebulb will progress to the Audit Settings page as expected.

Enter HTTP Authentication credentials

You can also find and adjust the HTTP Authentication setting under the Authentication tab in Audit Settings.

Enable the HTTP Authentication checkbox, then enter your HTTP Username and HTTP Password. When you start your audit, Sitebulb will use these details to log in to the website.

HTTP Authentication
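
For context, this is the same mechanism you can reproduce with any HTTP client. The sketch below is a minimal illustration in Python using the requests library, assuming Basic authentication and a hypothetical staging URL and credentials - it simply shows the unauthenticated request being challenged and the same request succeeding once credentials are supplied, which is roughly what Sitebulb does with the details you save here.

```python
# Minimal sketch of HTTP (Basic) authentication, assuming a hypothetical
# staging URL and credentials. Illustration only; Sitebulb sends the saved
# username and password for you during the audit.
import requests

STAGING_URL = "https://staging.example.com/"  # hypothetical staging site

# Without credentials, the server challenges the request.
response = requests.get(STAGING_URL)
print(response.status_code)  # typically 401 Unauthorized

# With the HTTP username and password, the same request is authorised.
response = requests.get(STAGING_URL, auth=("http-user", "http-password"))
print(response.status_code)  # 200 once the credentials are accepted
```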

How to crawl sites with forms-based authentication

If your website uses forms-based authentication, it is likely that the start URL will return a 200 status code during the domain resolution test carried out at the point of setting up a new Project, but will include an embedded form which requires authentication before you can access the full page content.

If your dev website redirects to a login page when visited, make sure to use the login URL as your start URL.

At this point, you can add your Forms Authentication details under Advanced Settings:

Select Yes in the Forms Authentication dropdown:

This will open a small browser window, taking you to the site you are going to crawl.

Navigate to your login form, and enter your site's login details.

Once you have successfully logged in, click the green Add Authentication button at the bottom of the page.

Add forms authentication

You may also need to enable Cookies under Advanced Settings for Sitebulb to be able to crawl successfully.
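
If it helps to picture what happens behind that browser window: forms-based authentication generally works by posting your credentials to the login form, after which the server sets a session cookie that authenticates subsequent requests - which is why cookies may need to be enabled. The sketch below is a generic illustration in Python using the requests library, with a hypothetical login URL and form field names; it is not how Sitebulb is implemented, just the underlying idea.

```python
# Generic illustration of forms-based authentication, assuming hypothetical
# login URL, form field names and credentials.
import requests

LOGIN_URL = "https://staging.example.com/login"          # hypothetical
PROTECTED_URL = "https://staging.example.com/dashboard"  # hypothetical

# A Session keeps the cookie the server sets after a successful login,
# so later requests in the same session return the full page content.
session = requests.Session()
session.post(LOGIN_URL, data={"username": "you", "password": "secret"})

page = session.get(PROTECTED_URL)
print(page.status_code)
```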

Authenticating Sitebulb with Custom HTTP Headers

You may also come across dev websites that require a custom header to be sent with the request in order to access the requested content. Custom HTTP Headers can be used to whitelist and authenticate Sitebulb.

You can add Custom HTTP Headers details at the point of setting up your new project under Advanced Settings on the New Project screen:

You will also find your Custom HTTP Headers settings under the Authentication tab. Scroll down to the relevant area and add the header name and value provided by your dev team:
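
As an illustration of what the server is checking for, the sketch below sends a request with a made-up header name and value using Python's requests library; the actual name and value are whatever your dev team has configured on the server.

```python
# Illustration of sending a custom HTTP header, with a hypothetical header
# name and value. Use the exact details supplied by your dev team.
import requests

STAGING_URL = "https://staging.example.com/"  # hypothetical staging site

headers = {"X-Staging-Access": "your-secret-token"}  # hypothetical header

response = requests.get(STAGING_URL, headers=headers)
print(response.status_code)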

Authenticating Sitebulb with Cookies

Although less common, you may also encounter website servers that require a specific cookie to be sent with the request in order to authenticate it and provide the desired response.

You can set your custom cookie value through the Custom HTTP Headers settings, found under the Authentication tab:
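
Under the hood this is just another request header: the sketch below (Python requests, with a hypothetical cookie name and value) shows a required cookie being sent as a standard Cookie header alongside the request.

```python
# Illustration of sending a required cookie, with a hypothetical cookie
# name and value. A cookie is sent as a standard 'Cookie' request header.
import requests

STAGING_URL = "https://staging.example.com/"  # hypothetical staging site

headers = {"Cookie": "staging_access=your-cookie-value"}  # hypothetical

response = requests.get(STAGING_URL, headers=headers)
print(response.status_code)
```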

Robots.txt Settings for Staging Sites

Staging sites are likely to have directives in place to prevent the content from being crawled and indexed. Since Sitebulb respects robots directives by default, you will need to adjust your robots Politeness settings to ensure that Sitebulb can crawl successfully.

Navigate to the Robots Directives tab in your Audit Settings and find the Politeness section. Here, you can tick 'Is Staging Site' - this tells Sitebulb to ignore the root path disallow in the robots.txt file while still respecting the other robots directives.

Additionally, you can choose to fully disable the 'Respect Robots Directives' setting, and even enable the crawl disallowed and nofollow settings, giving Sitebulb full access to the content you may want to crawl and save.
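
To see why the default settings block a typical staging site, the sketch below uses Python's standard urllib.robotparser with hypothetical robots.txt content: a root path disallow means no URL on the site can be fetched until you tick 'Is Staging Site' or disable the directives.

```python
# Why a root path disallow blocks everything: hypothetical staging
# robots.txt content parsed with Python's standard robots.txt parser.
import urllib.robotparser

staging_robots = [
    "User-agent: *",
    "Disallow: /",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(staging_robots)

# No URL on the site is crawlable while the root path is disallowed.
print(parser.can_fetch("Sitebulb", "https://staging.example.com/any-page"))  # False
```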
