A Guide to Crawling Enterprise Sites Efficiently


Time and resource constraints are two of the top SEO challenges.

These pain points often stem from clients, colleagues, and management misunderstanding or underestimating the efficacy of SEO initiatives, which in turn leads to underinvestment.

While we should be fighting for more budget and doing a better job of communicating the value of SEO, we also need to be more efficient with the time and resources that are available to us.

Challenges for Enterprise SEO Pros

SEO professionals working with enterprise businesses often have larger budgets at their disposal, but the sheer scale of these sites means that time and resources are still of the essence.

Enterprise sites can have URLs numbering in the millions, tens of millions or even hundreds of millions.

It’s unlikely that the time and resources are going to be available to crawl a site of that magnitude in full.

So you’re going to need to get tactical when crawling a mammoth site.

In this post, I’m going to introduce time-efficient approaches to crawling larger sites, so that you can squeeze out more timely insights with fewer resources.

How Should You Go About Crawling an Enterprise Site?

While the prospect of crawling an enterprise site can be daunting, fortunately, it isn’t usually necessary to crawl every page on a site.

In most cases, you only need enough data to validate issues, which won’t necessarily mean obtaining a full set of data.

This approach is called segmented crawling and involves breaking your site down into smaller sections that include only the areas and volumes of pages needed to understand the lay of the land.

Segmented Crawling

You can think of segmented crawling like working from an incomplete picture.

Even without every piece, you can still make out the overall gist.

In the same way, you can design segments consisting of carefully selected URL groupings which, together, can help you to understand a site’s main trends and patterns without having to crawl every single URL.

Segmented crawling is an excellent solution that allows you to bypass the constraints of scale and time encountered when tackling large sites.

Segmented crawling is all about assembling the smallest segments possible that will give you a representative picture of the website as a whole.

Creating these nuanced and minimal segments can’t be achieved overnight.

It’s going to take time to gain a thorough understanding of an enterprise site, and how long will be determined, in part, by the constraints you’re operating under.

What Are Your Constraints?

The scope of your initial crawls is going to be determined, to some extent, by the time and resources you have available.

You will need to set up crawls based on how quickly you need the insights and how much resource you have available to crawl your site.

Segmented crawling comes down to a trade-off between the completeness of the data and the timeliness of the insights.

Phase 1: Designing Segments

Now that we know what segmented crawling is, and some of the considerations you’ll need to take into account before crawling an enterprise site, let’s take a look at how you can go about designing these segmented crawls.

Setting off Sample Crawls

To get an initial understanding of a site, it’s worth running an unrestricted sample crawl of around 100,000 URLs.

This first crawl will enable you to see what useless URLs are coming through (such as faceted pages, URLs with parameters, etc.) so you can exclude them from future crawls without wasting more resources on crawling similar junk URLs.

Once you’ve filtered out all of the junk URLs from your sample crawl within your settings, you’ll want to run the largest crawl you can to get as big a sample of meaningful URLs as possible.

This crawl will be the closest you get to a complete picture of the site within a single crawl and, from here, you will want to look at ways you can cut the site into more targeted segments going forward.

Slicing Your Site

Here are just a few ways that you can segment a site to get more focused insights from your crawls.

Vertical Slice or Single Channel/Section

You could take a vertical slice of a site, which will give you a sample of URLs from each level of the site and reveal how they are connected.

If you’re only interested in a particular section of a site then you can restrict your crawl to only look at that part of the site.

For example, a publisher might only want to evaluate the state of a particular section of news on their site or a B2B site might want to assess the performance of their blog.

This method can also be applied to analyzing the mobile version of a site, where you may want to keep this separate from the desktop crawl.

Limited Level Crawl

Crawling a specified number of levels of a site is a practical way to get a feel for the breadth of a site without going down URL rabbit holes and crawling lower-lying URLs unnecessarily.

For example, if you are crawling an international site with several language versions, you may only want to crawl the first three levels initially to see how the site’s different language versions are connected.
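
If you want to prototype a level-limited crawl outside of a dedicated tool, the logic is simple enough to sketch. Below is a minimal, hypothetical Python example (the start URL, depth limit, and naive regex-based link extraction are assumptions for illustration, not a production crawler):

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def crawl_levels(start_url, max_depth=3, limit=1000):
        """Breadth-first crawl that stops at a fixed link depth."""
        seen = {start_url}
        queue = deque([(start_url, 0)])
        results = []
        while queue and len(results) < limit:
            url, depth = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue
            results.append((url, depth))
            if depth == max_depth:
                continue  # don't follow links below the chosen level
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                # stay on the same host and avoid revisiting URLs
                if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return results

    # e.g. crawl_levels("https://www.example.com/", max_depth=3)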

Taxonomy/Categorization

Another approach is to crawl the categories or site taxonomy to get a slice that reveals the core structure or architecture of the site but excludes groups of pages like paginated sets.

This might be a useful approach to take when analyzing ecommerce sites or other sites with complex categorization.

Page Templates

A further resource-efficient approach is to only crawl the different page templates.

A site is going to have a limited number of page types compared to its total number of URLs.

Crawling instances of each page template will give you a good understanding of the key issues that exist, and solving these will likely benefit all pages using that template.
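
As a rough illustration of the idea, you could bucket a URL list by template and keep only a handful of examples from each before crawling. The template patterns below are hypothetical and would need to reflect your own site’s URL structure:

    import re
    from collections import defaultdict

    # Hypothetical template patterns - replace with your site's real URL structure
    TEMPLATES = {
        "product": re.compile(r"^/products/[^/]+/?$"),
        "category": re.compile(r"^/category/[^/]+/?$"),
        "blog_post": re.compile(r"^/blog/\d{4}/\d{2}/[^/]+/?$"),
    }

    def sample_by_template(paths, per_template=50):
        """Keep only a small sample of URL paths for each page template."""
        buckets = defaultdict(list)
        for path in paths:
            for name, pattern in TEMPLATES.items():
                if pattern.match(path) and len(buckets[name]) < per_template:
                    buckets[name].append(path)
                    break
        return buckets

    # e.g. sample_by_template(["/products/blue-widget", "/category/widgets"])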

Setting up Frequent & Focused Benchmark Crawls

Deciding on how to slice a site for the purpose of analysis is going to depend on the nature of the site and what you’re aiming to achieve, and will likely involve a combination of the different methods above, or perhaps others.

Once you’ve set up different segments, you’ll begin to develop a better understanding of the site and will be able to refine the scope of these crawls so they are as targeted as possible.

Ideally, you want to get to a point where you can establish lots of targeted, benchmark crawls that consist of small segments (~10k URLs) that you can run regularly (weekly is ideal), and ad hoc when you need to test a release.

Phase 2: Creating Segmented Crawls

Now that we’ve explored some of the approaches you can use to segment your site for the purposes of auditing, let’s take a more hands-on approach to how you can create these segments.

To reiterate, the architectures of large sites vary widely, so you need to know the specific patterns of a site to crawl it efficiently.

We’ve covered some of the different ways you can partially crawl a site; this involves running multiple crawls to narrow your focus down to targeted benchmark crawls.

Here are some ways you can restrict the scope of crawls to create segments.

Disclaimer: I work for DeepCrawl and will use some examples from our tool because it’s the crawler that I’m most familiar with. However, that isn’t to say that you wouldn’t be able to achieve the same results with other crawlers.

Domain Scope

Enterprise sites can cover many separate business units which live on separate domains and/or subdomains.

Unless you’re running an exploratory crawl, it’s likely that you’ll want to include some rules to define the domain scope. This can be achieved in numerous ways by:

  • Choosing a preferred domain mapping.

DeepCrawl domain mapping

  • Defining exclusion rules to prohibit the crawling of parts of the site that won’t form part of your audit. Conversely, you might want to create inclusion rules for the specific parts of the site you do want included in your crawl.

DeepCrawl domain scope

  • Configuring your crawls to analyze separate mobile sites with a mobile user-agent.

DeepCrawl mobile crawling

Excluding

One way to make your crawls more focused is to use exclusion and inclusion rules once pages have been understood. Big sites are nearly always made up of a large number of data items.

Often there’s one particular thing that causes an enterprise site to have millions of URLs (e.g., item pages on an auctions site).

As such, these data items are likely to share a lot of commonalities, meaning they don’t all need to be crawled.

Digging down to find out what isn’t contributing to the site so you can exclude it in the future is a great way of reducing the size of crawls, enabling you to crawl more efficiently.

For example, when our team began working with a large review site, they started by crawling several company profile pages to understand internal linking.

Once the linking patterns on these pages had been understood, it was no longer necessary to crawl every company profile page; the team could extrapolate that knowledge across all of them.
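
To make the idea of domain scope and inclusion/exclusion rules concrete, here is a minimal Python sketch of the kind of filtering a crawler applies before fetching a URL. The hosts and patterns are assumptions for illustration, not DeepCrawl’s actual implementation:

    import re
    from urllib.parse import urlparse

    # Assumed scope configuration - adjust to your own site
    ALLOWED_HOSTS = {"www.example.com", "blog.example.com"}
    INCLUDE_PATTERNS = [re.compile(r"^/blog/")]      # only crawl these sections
    EXCLUDE_PATTERNS = [re.compile(r"^/blog/tag/")]  # ...but skip tag archives

    def in_scope(url):
        """Return True if a URL passes the domain and include/exclude rules."""
        parsed = urlparse(url)
        if parsed.netloc not in ALLOWED_HOSTS:
            return False
        if any(p.search(parsed.path) for p in EXCLUDE_PATTERNS):
            return False
        # at least one inclusion rule must match
        return any(p.search(parsed.path) for p in INCLUDE_PATTERNS)

    # e.g. in_scope("https://www.example.com/blog/post-1")  -> True
    #      in_scope("https://www.example.com/blog/tag/seo") -> False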

Parameter Removal

A further way of removing these junk URLs from your crawls is by setting up exclusion rules for pages with parameters that don’t change the content.

These URLs will increase the size of your crawls unnecessarily and if Google isn’t crawling these pages, then neither should you.

The example below shows an exclusion rule which means that any URLs with UTM parameters won’t be crawled.

DeepCrawl URL Exclusion

An alternative to excluding specific parameters would be to exclude URLs with more than a specified number of parameters.

For example, if you wanted to exclude URLs with 5 or more parameters you could use the following regex string:

\?[^&]+(&[^&]+){4,}
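
If you want to sanity-check patterns like these before adding them to a crawl, a quick script will do it. The UTM pattern below is an assumed example of how such a rule might be written; adjust both patterns to your own crawler’s exclusion syntax:

    import re

    # Exclude any URL carrying UTM tracking parameters (assumed pattern)
    utm_rule = re.compile(r"[?&]utm_[a-z]+=")
    # Exclude URLs with five or more query parameters
    many_params_rule = re.compile(r"\?[^&]+(&[^&]+){4,}")

    tests = [
        "https://www.example.com/page?utm_source=newsletter",
        "https://www.example.com/page?a=1&b=2&c=3&d=4&e=5",
        "https://www.example.com/page?a=1&b=2",
    ]
    for url in tests:
        excluded = bool(utm_rule.search(url) or many_params_rule.search(url))
        print(url, "-> excluded" if excluded else "-> crawled")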

Pagination

Pagination is another reason why crawls can unnecessarily increase in size.

Without setting any rules, crawlers (and search engine bots) can waste time crawling paginated sets that don’t provide any valuable content and, more importantly, in this case, don’t further your understanding of the site.

This can easily be remedied with a regex string that excludes the majority of the paginated series.

For example, you might add a rule that restricts crawls to the first three pages in a paginated set, as after that there is usually nothing new to learn.

The exclusion rule outlined in the example below will exclude everything except the first three pages as part of a crawl.

DeepCrawl pagination exclusion
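
The exact pattern depends on how a site structures its pagination. As a hypothetical example, if page numbers live in a page query parameter, a rule like the following would exclude page 4 and beyond:

    import re

    # Assumes pagination uses a ?page= or &page= parameter
    deep_pagination = re.compile(r"[?&]page=(?:[4-9]|\d{2,})\b")

    print(bool(deep_pagination.search("/widgets?page=2")))   # False - kept in the crawl
    print(bool(deep_pagination.search("/widgets?page=14")))  # True  - excluded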

Sampling

A slightly different approach that will decrease the size of your crawls is sampling.

Rather than running crawls that look at all pages, with a given set of rules you can specify that only a sample of the pages are actually crawled.

Sampling is useful because you don’t need to crawl every page to understand a site’s issues; often a smaller subset of these pages will yield the same insights more quickly and with fewer resources.

For example, a large listings site looking to understand millions of item pages may choose to crawl a sample of 10 percent of them by using DeepCrawl’s Page Grouping feature to identify the key issues without crawling every page.

DeepCrawl page grouping
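
One simple way to approximate this kind of sampling yourself is to hash each URL and keep a fixed percentage, so that repeat crawls select the same subset. This is a rough sketch of the idea, not how DeepCrawl’s Page Grouping works internally:

    import hashlib

    def in_sample(url, percent=10):
        """Deterministically keep roughly `percent` of URLs based on a hash."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < percent

    urls = ["https://www.example.com/items/%d" % i for i in range(1000)]
    sampled = [u for u in urls if in_sample(u, percent=10)]
    print(len(sampled), "of", len(urls), "URLs kept for crawling")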

Exclude Mobile/AMP

Previously, I mentioned that you might want to keep the mobile and desktop versions of a site in separate crawls.

This can be configured during the crawl setup by excluding the crawling of URLs in mobile and AMPHTML alternate tags.

Making this exclusion will help you to separate out the crawling of desktop, mobile, and AMP pages.

DeepCrawl link restrictions

Exclude Nofollowed Pages

A further exclusion you might want to consider is the crawling of nofollow links.

If a link is nofollowed then that page isn’t intended for crawling or indexing and, therefore, doesn’t need to be a part of your audit.

Within the setup of a crawl, you can choose to disable the option to crawl nofollow links.

DeepCrawl crawl restrictions

Including External Data

Layering external data on top of a crawl provides a ton of useful insights for all sites great and small.

However, with enterprise sites you need to be aware that, even though restricting a crawl’s scope will automatically filter the URLs from external data sources, that data still needs to be retrieved from APIs or uploaded manually first.

Pre-filtering the data will save you some time, so here are some ways you can do this:

Custom Search Console Properties

Within Search Console, you can create custom properties such as subfolders to break down large data sets and speed up the process of crawling.

Search Console property set

Sitemaps

It isn’t necessary to include all sitemaps as part of a segmented crawl.

It is much more efficient to only include the sitemaps which are relevant.

DeepCrawl sitemap upload

Log Files & Analytics

Using inclusion rules, it is usually possible to generate a filtered log summary report which only includes the data for the URLs in the crawl.
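
As a rough illustration of the same idea applied to raw access logs, the sketch below keeps only the log lines whose requested path matches your inclusion rules (the log format and the patterns are assumptions):

    import re

    INCLUDE = [re.compile(r"^/blog/")]  # same inclusion rules as the crawl

    def filter_log(path="access.log"):
        """Yield log lines whose requested URL matches the crawl's inclusion rules."""
        # Assumes a common/combined log format: ... "GET /blog/post-1 HTTP/1.1" ...
        request = re.compile(r'"[A-Z]+ (\S+) HTTP')
        with open(path) as handle:
            for line in handle:
                match = request.search(line)
                if match and any(rule.search(match.group(1)) for rule in INCLUDE):
                    yield line

    # e.g. sum(1 for _ in filter_log("access.log")) counts hits on in-scope URLs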

Summary

Understanding the patterns, trends, and issues on enterprise sites is a process that takes time.

Hopefully this post has given you some ideas on how to get to this understanding more efficiently.

Image Credits

Featured Image: Unsplash
All screenshots taken by author, September 2018
