Why & How to Tackle Technical SEO Before Link Building




When you consider a link building campaign, you may not be completely reaping the benefits of your SEO efforts if you ignore technical SEO.

The best results happen when you consider all the points of your website’s SEO:

In fact, there are situations when you must tackle technical SEO before ever thinking about getting links.

If your website is weak in technical SEO areas, or extremely confusing for search engines, it won’t perform as well regardless of the quality and quantity of backlinks you have.

Your top goals with technical SEO are to make sure that your site:

  • Is easily crawled by search engines.
  • Has top cross-platform compatibility.
  • Loads quickly on both desktop and mobile.
  • Employs efficient implementation of WordPress plugins.
  • Does not have any issues with misconfigured Google Analytics code.

These five points illustrate why it’s important to tackle technical SEO before link building.

If your site is unable to be crawled or is otherwise deficient in technical SEO best practices, you may suffer from poor site performance.

The following chapter discusses why and how you should be tackling technical SEO before starting a link building campaign.

Make Sure Your Site is Easily Crawled by Search Engines

Your HTTPS Secure Implementation

If you have recently made the jump to an HTTPS secure implementation, you may not have had the chance to audit or otherwise identify issues with your secure certificate installation.

A surface-level audit at the outset can help you identify any major issues affecting your transition to HTTPS.

Major issues can arise later on when the purchase of the SSL certificate did not initially take into account what the site would be doing later.

One thing to keep in mind is that you must take great care in purchasing your certificate and making sure it covers all the subdomains you want it to.

If you don’t, you may end up with some issues as a result, such as not being able to redirect URLs.

If you don’t get a full wildcard certificate, and you have URL parameters on a subdomain – using absolute URLs – that your certificate doesn’t cover, you won’t be able to redirect those URLs to https://.

This is why it pays to be mindful of the options you choose during the purchase of your SSL certificate because it can negatively affect your site later.

No Errant Redirects or Too Many Redirects Bogging Down Site Performance

It’s easy to create an HTTPS secure implementation with errant redirects.

For this reason, a bird’s-eye view of the site’s current redirect states will be helpful in correcting this issue.

It can also be easy to create conflicting redirects if you don’t keep watch on the redirects you are creating.

In addition, it’s easy to let redirects run out of control and end up with tens of redirects (or more) per site URL, which in turn bogs down site performance.

The easiest way to fix this issue moving forward: make sure that your redirects are all created in a 1:1 ratio.

You should not have 10-15 or more redirect URLs per URL on your site.

If you do, something is seriously wrong.
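To make the 1:1 idea concrete, here is a minimal Python sketch that walks a redirect map and counts how many hops each URL takes to reach its final destination. The `redirects` dictionary and the example.com URLs are hypothetical; in practice the map would come from a crawler export.

```python
def redirect_hops(redirects, start, max_hops=20):
    """Follow `start` through the redirect map; return (final_url, hop_count).

    Raises ValueError on a redirect loop or when the chain exceeds max_hops.
    """
    seen = {start}
    url, hops = start, 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or overly long chain at {url}")
        seen.add(url)
    return url, hops


# Hypothetical redirect map: source URL -> target URL.
redirects = {
    "http://example.com/a": "https://example.com/a",       # 1:1, one hop
    "http://example.com/b": "http://www.example.com/b",    # start of a chain
    "http://www.example.com/b": "https://www.example.com/b",
}

for source in redirects:
    final, hops = redirect_hops(redirects, source)
    flag = "OK" if hops == 1 else "CHAIN"
    print(f"{flag:5} {source} -> {final} ({hops} hop(s))")
```

Any URL flagged as CHAIN is a candidate for pointing straight at its final destination.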

Example of correct redirects

Content on HTTPS & HTTP URLs Should Not Load at the Same Time

In a correct implementation, one version redirects to the other; the two should never both serve content.

If you have both of them loading at the same time, something is wrong with the secure version of your site.

Type your site’s URLs into your browser and test the https:// and http:// versions separately.

If both URLs load, you are displaying two versions of your content, and duplicate URLs can lead to duplicate content issues.
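The same check can be expressed as a small decision rule. The Python sketch below does not make any network requests; it only classifies the status codes and `Location` header you would observe when requesting the http:// and https:// versions of a page. The function name and the labels it returns are illustrative assumptions, not part of any tool.

```python
def classify_protocol_setup(http_status, http_location, https_status):
    """Classify a page from the responses to its http:// and https:// URLs.

    http_status / https_status: integer HTTP status codes.
    http_location: Location header of the http:// response, or None.
    """
    redirects_to_https = (
        http_status in (301, 302, 307, 308)
        and http_location is not None
        and http_location.startswith("https://")
    )
    if redirects_to_https and https_status == 200:
        return "ok: http redirects to https"
    if http_status == 200 and https_status == 200:
        return "duplicate: both versions serve content"
    return "check manually"


print(classify_protocol_setup(301, "https://example.com/", 200))
print(classify_protocol_setup(200, None, 200))
```

A "duplicate" result is the situation described above: both protocols return content, so a redirect is missing.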

To make sure that you do not run into this issue again, you will want to do one of the following, depending on your site’s platform:

  • Create a full redirect pattern in .htaccess (on Apache / cPanel servers)
  • Use a redirect plugin in WordPress to force the redirects from http://

Instead, this is an example of exactly what we want to display to users and search engines:

How to Create Redirects in .htaccess on Apache / cPanel Servers

You can perform global redirects at the server level in .htaccess on Apache / cPanel servers.

InMotion Hosting has a great tutorial on how to force this redirect on your own web host. For our purposes, we’ll focus on the following two examples.

To force all web traffic to use HTTPS, use the following code.

You want to make sure to add this code above any code that has a similar prefix (RewriteEngine On, RewriteCond, etc.)

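# Send every non-HTTPS request to the same host and path over HTTPS
RewriteEngine On
RewriteCond %{HTTPS} !on
# Skip cPanel and Comodo domain-validation files so certificate checks still work
RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$
RewriteCond %{REQUEST_URI} !^/\.well-known/pki-validation/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]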

If you want to redirect only a specified domain, you will want to use the following lines of code in your .htaccess file:

# example.com is a placeholder here; substitute your own domain
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$
RewriteCond %{REQUEST_URI} !^/\.well-known/pki-validation/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]

Don’t forget to change any URLs in the above examples to match your own domain name.

There are other solutions in that tutorial which may work for your site.

WARNING: if you do not have confidence in your abilities to make the correct changes at the server level on your server, please make sure to have your server company/IT person perform these fixes for you.

You can screw up something major with these types of redirects if you do not know exactly what you are doing.

Secure Site redirect plugin

Use a Plugin If You Are Operating a WordPress Site

The easiest way to fix these redirect issues, especially if you operate a WordPress site, is to just use a plugin.

There are many plugins that can force http:// to https:// redirects but here are a few that will help make this process as painless as possible:

Caution about plugins – don’t just add another plugin if you’re already using too many plugins.

You may want to investigate if your server can use similar redirect rules mentioned above (such as if you are using an NGINX-based server).

It must be stated here: plugin weight can affect site speed negatively, so don’t always assume that the latest plugin will help you.

All Links On-site Should Be Changed From HTTP:// To HTTPS://

Even if you perform the redirects above, you should perform this step.

This is especially true if you are using absolute URLs, as opposed to relative URLs, where the former always displays the hypertext transfer protocol that you’re using.

If you are using the latter, this is less important and you probably don’t need to pay much attention to this.

Why do you need to change links on-site when you are using absolute URLs?

Because Google can and will crawl all of those links and this can result in duplicate content issues.

It seems like a waste of time, but it’s really not. You are making sure the end result is that Google sees exactly the site you want them to see.

One version.

One set of URLs.

One set of content.

No confusion.

Examples of links that should be changed from http:// to https://
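As a rough illustration of this cleanup, the Python sketch below rewrites absolute http:// links that point at your own domain and leaves links to other sites alone. The function name, the sample HTML, and the example.com domain are assumptions for illustration; on a real site you would run a similar replacement against your templates or database, after taking a backup.

```python
import re


def upgrade_internal_links(html, domain):
    """Rewrite absolute http:// links for `domain` (and www.`domain`) to https://.

    Links to other domains are left untouched.
    """
    pattern = re.compile(
        r"http://((?:www\.)?" + re.escape(domain) + r")(?=[/\"'?#\s]|$)",
        re.IGNORECASE,
    )
    return pattern.sub(r"https://\1", html)


sample = (
    '<a href="http://example.com/about">About</a> '
    '<a href="http://other-site.test/page">Elsewhere</a>'
)
print(upgrade_internal_links(sample, "example.com"))
```

The lookahead at the end of the pattern keeps the match from firing on look-alike hosts such as example.com.other-site.test.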

No 404s From HTTP:// To HTTPS:// Transitions

A sudden spike in 404s can make your site almost impossible to crawl, especially if links between http:// and https:// pages still exist.

Difficulty crawling a site is one of the most common issues that can result from a spike in 404s.

Crawl budget is also wasted when too many 404s show up, and Google may not find the pages that it should.

Why this impacts site performance, and why it matters:

John Mueller of Google has said that crawl budget doesn’t matter except for extremely large sites:

“Google’s John Mueller said on Twitter that he believes that crawl budget optimization is overrated in his mind. He said for most sites, it doesn’t make a difference and that it only can help really massive sites.

John wrote “IMO crawl-budget is over-rated.” “Most sites never need to worry about this. It’s an interesting topic, and if you’re crawling the web or running a multi-billion-URL site, it’s important, but for the average site owner less so,” he added.”

Still, a great article by Yauhen Khutarniuk, Head of SEO at SEO PowerSuite, explains why crawl budget is worth your attention:

“Quite logically, you should be concerned with crawl budget because you want Google to discover as many of your site’s important pages as possible. You also want it to find new content on your site quickly. The bigger your crawl budget (and the smarter your management of it), the faster this will happen.”

It’s important to optimize for crawl budget because you want Google to find new content on your site quickly while discovering as many of your high-priority pages as possible.

How to Fix Any 404s You May Have

Primarily, you want to redirect any 404s from the old URL to the new, existing URL.

Check out Benj Arriola’s Search Engine Journal article for more information on 404s vs. soft 404s, and how to fix them.

One of the easier ways, especially if you have a WordPress site, would be to crawl the site with Screaming Frog and perform a bulk upload of your 301 redirect rules using the Redirection WordPress plugin.

Otherwise, you may have to create redirect rules in .htaccess.
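If you go the .htaccess route, the rules can be generated from a simple old-URL-to-new-URL mapping rather than typed by hand. The sketch below is one way to do that in Python, emitting `Redirect 301` lines (a directive from Apache’s mod_alias). The mapping and URLs are hypothetical, and paths containing spaces would need quoting before use.

```python
from urllib.parse import urlsplit


def build_redirect_rules(mapping):
    """Turn {old_url: new_url} into `Redirect 301` lines for .htaccess."""
    rules = []
    for old_url, new_url in sorted(mapping.items()):
        # mod_alias matches on the URL path, so drop the scheme and host.
        old_path = urlsplit(old_url).path or "/"
        rules.append(f"Redirect 301 {old_path} {new_url}")
    return rules


# Hypothetical mapping of 404ing URLs to their live replacements.
mapping = {
    "http://example.com/old-page": "https://example.com/new-page",
    "http://example.com/2018/promo": "https://example.com/offers",
}
print("\n".join(build_redirect_rules(mapping)))
```

Each old URL maps to exactly one destination, which keeps the redirects in the 1:1 ratio recommended earlier.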

Your URL Structure Should Not Be Overly Complex

The structure of your URLs is an important consideration when getting your site ready for technical SEO.

You must pay attention to things like randomly generated dynamic parameters that are being indexed, URLs that are not easy to understand, and other factors that will cause issues with your technical SEO implementation.

These are all important factors because they can lead to indexation issues that will hurt your site’s performance.

More Human-Readable URLs

When you create URLs, you likely think about where the content is going, and then you create URLs automatically.

This can hurt you, however.

This is because automatically generated URLs can follow a few different formats, none of which are very human-readable.

For example:

  • /content/date/time/keyword
  • /content/date/time/string-of-numbers
  • /content/category/date/time/
  • /content/category/date/time/parameters/

None of these formats that you encounter are very human readable, are they?

This matters because communicating the content behind the URL properly is a large part of matching user intent.

It matters even more today for accessibility reasons.

The more readable your URLs are, the better:

  • Search engines can use them to gauge how people engage with those URLs compared with URLs they do not engage with.
  • If someone sees your URL in the search results, they may be more apt to click on it because they can see how closely it matches what they are searching for. In short – match that user search intent, and you’ve got another customer.

This is why considering this part of URL structure is so important when you are auditing a site.

Many existing sites may be using outdated or confusing URL structures, leading to poor user engagement.

Identifying which URLs can be more human readable can create better user engagement across your site.

Duplicate URLs

One important technical SEO consideration that should be ironed out before any link building is duplicate content.

When it comes to duplicate content issues, these are the main causes:

  • Content that is significantly duplicated across sections of the website.
  • Scraped content from other websites.
  • Duplicate URLs where only one piece of content exists.

This can hurt because it confuses search engines when more than one URL represents one piece of content.

Search engines will rarely show the same piece of content twice, so ignoring duplicate URLs dilutes their ability to find and serve up the version you want.

Avoid Using Dynamic Parameters

While dynamic parameters are, in and of themselves, not a problem from an SEO perspective, if you cannot manage your creation of them, and get consistent in their use, this can become a significant problem later.

Jes Scholz has an amazing article on Search Engine Journal covering the basics of dynamic parameters and URL handling and how it can affect SEO. If you are not familiar with dynamic parameters, I suggest reading her article ASAP before proceeding with the rest of this section.

Scholz explains that parameters are used for the following purposes:

  • Tracking
  • Reordering
  • Filtering
  • Identifying
  • Pagination
  • Searching
  • Translating

When you get to the point that your URL’s dynamic parameters are causing a problem, it usually comes down to basic mismanagement of the creation of these URLs.

In the case of tracking, the problem is using many different dynamic parameters when creating links that search engines crawl.

In the case of reordering, the problem is using these parameters to reorder lists and groups of items, which creates indexable duplicate pages that search engines then crawl.

You can inadvertently trigger excessive duplicate content issues if you don’t keep your dynamic parameters to a manageable level.

You should never need 50 URLs with UTM parameters to track the results of certain types of campaigns.

The creation of these dynamic URLs for one piece of content can really add up over time if you aren’t carefully managing their creation and will dilute the quality of your content along with its capability to perform in search engine results.

It leads to keyword cannibalization and on a large enough scale can severely impact your ability to compete.
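One way to see how much duplication your parameters create is to strip the tracking parameters from a list of crawled URLs and count what is left. The Python sketch below uses only the standard library; the `utm_` prefix list and the sample URLs are assumptions for illustration, and the function also drops any #fragment.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url, prefixes=("utm_",)):
    """Return `url` without query parameters whose names start with `prefixes`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(prefixes)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical crawl export: three URLs, one underlying page.
crawled = [
    "https://example.com/shoes?utm_source=newsletter&utm_campaign=fall",
    "https://example.com/shoes?utm_source=twitter",
    "https://example.com/shoes",
]
unique_pages = {strip_tracking_params(url) for url in crawled}
print(len(crawled), "crawled URLs ->", len(unique_pages), "unique page(s)")
```

A large gap between the two counts suggests that parameter variants are multiplying the URLs search engines have to crawl for the same content.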

Shorter URLs Are Better Than Longer URLs

A long-held SEO best practice has been that shorter URLs are better than longer URLs.

Google’s John Mueller has discussed this:

“What definitely plays a role here is when we have two URLs that have the same content, and we try to pick one to show in the search results, we will pick the short one. So that is specifically around canonicalization.

It doesn’t mean it is a ranking factor, but it means if we have two URLs and one is really short and sweet and this other one has this long parameter attached to it and we know they show exactly the same content we will try to pick the shorter one.

There are lots of exceptions there, different factors that come into play, but everything else being equal – you have a shorter one and a longer one, we will try to pick the shorter one.”

There is also empirical evidence that Google ranks shorter URLs for more terms than long, highly specific ones.

If your site contains super long URLs everywhere, you may want to optimize them into better, shorter URLs that better reflect the article’s topic and user intent.

Examples of overly complex URLs

Make Sure Your Site Has Top Cross-Platform Compatibility & Fast Page Speed

Site glitches and other problems can arise when your site is not coded correctly.

These glitches can result from badly-nested DIV tags resulting in a glitched layout, code with bad syntax resulting in call-to-action elements disappearing, and bad site management resulting in the careless implementation of on-page elements.

Cross-platform compatibility can be affected along with page speed, resulting in greatly reduced performance and user engagement, long before link building ever becomes a consideration.

Nip some of these issues in the bud before they become major problems later.

Many of these technical SEO issues come down to poor site management and poor coding.

The more that you tackle these technical SEO issues at the beginning with more consistent development and website management best practices, the better off you’ll be later when your link building campaign takes off.

Poorly Coded Site Design

When you have a poorly coded site design, your user experience and engagement can suffer.

This is yet another element of technical SEO that can be easily overlooked.

A poorly coded site design can manifest in several ways:

  • Poor page speed.
  • Glitches in the design appearing on different platforms.
  • Forms not working where they should (impacting conversions).
  • Other calls to action not working on mobile devices (and desktop).
  • Tracking code that’s not being accurately monitored (leading to poor choices in your SEO decision-making).

Any of these issues can spell disaster for your site when it can’t properly report on performance, capture leads, or engage with users to its fullest potential.

This is why these things should always be considered and tackled on-site before moving to link building.

If you don’t, you may wind up with weaknesses in your marketing campaigns that will be even harder to pin down, or worse – you may never find them.

All of these elements of a site design must be addressed and otherwise examined to make sure that they are not causing any major issues with your SEO.

Pages Are Slow to Load

In July 2018, Google rolled out page speed as a ranking factor in its mobile algorithm for all users.

Slow-loading pages can affect everything, so page speed is something you should pay attention to on an ongoing basis, not just for rankings but for all of your users.

What should you be on the lookout for when it comes to issues that impact page speed?

Slow Loading Images

If your site has many images approaching 1 MB (1 megabyte) in file size, you have a problem.

With average connection speeds now above 27.22 Mbps download on mobile and above 59.60 Mbps on fixed broadband, this is realistically becoming less of an issue, but it can still be one.

You will still face slower loading pages when you have such large images on your site. If you use a tool like GTMetrix, you can see how fast your site loads these images.

Typical page speed analysis best practices say that you should take three snapshots of your site’s page speed.

Average out the three snapshots, and that’s your site’s average page speed.

For most sites, the recommendation is that images should be at most 35-50 KB each. The right target depends on resolution and pixel density (including whether you are accommodating the higher pixel densities of iPhones and other devices).
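A quick way to apply a size budget like this is to list every image that exceeds it. The Python sketch below assumes you already have file sizes in bytes (for example, from a crawl export); the file names, sizes, and the 50 KB default are illustrative only.

```python
def oversized_images(image_sizes, limit_kb=50):
    """Return the names of images larger than `limit_kb` kilobytes, largest first."""
    limit_bytes = limit_kb * 1024
    too_big = [(size, name) for name, size in image_sizes.items() if size > limit_bytes]
    return [name for size, name in sorted(too_big, reverse=True)]


# Hypothetical file sizes in bytes.
image_sizes = {"hero.jpg": 980_000, "logo.png": 18_000, "team.jpg": 120_000}
print(oversized_images(image_sizes))
```

Sorting largest first puts the images with the biggest potential savings at the top of the list.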

Also, use lossless compression in graphics applications like Adobe Photoshop in order to achieve the best quality possible while resizing images.

Efficient Coding Best Practices

Some people believe that standard coding best practices say that you should be using W3C valid coding.

Google’s Webmaster Guidelines recommend using valid W3C coding to code your site.

Use valid HTML

But John Mueller (and even Matt Cutts) has mentioned in the past that it’s not critical to focus on W3C-valid coding for ranking reasons.

Search Engine Journal’s Roger Montti discusses this conundrum in even further detail in 6 Reasons Why Google Says Valid HTML Matters.

But that’s the key phrase there: focusing on it for ranking purposes.

You will find at the top of Google, for different queries, all sorts of websites that subscribe to different coding best practices, and not every site validates via the W3C.

Despite a lack of focus on that type of development best practice for ranking purposes, there are plenty of reasons why using W3C valid coding is a great idea, and why it can put you ahead of your competitors who are not doing it.

Before any further discussion takes place, it needs to be noted from a developer perspective:

  • W3C-standard validated code is not always good code.
  • Bad code is not always invalid code.
  • W3C validation should not be the be-all, end-all evaluation of a piece of coding work.
  • But validation services like the W3C validator should be used for debugging reasons.
  • Using the W3C validator will help you evaluate your work more easily and avoid major issues as your site becomes larger and more complex after completion of the project.

But in the end, which is better, and why?

Picking a coding standard, being consistent with your coding best practices, and sticking with them is generally better than not.

When you pick a coding standard and stick with it, you introduce less complexity and less of a chance that things can go wrong after the final site launch.

While some see W3C’s code validator as an unnecessary evil, it does provide rhyme and reason to making sure that your code is valid.

For example, if your syntax is invalid in your header, or you don’t self-close tags properly, W3C’s code validator will reveal these mistakes.

If, during development, you transferred over an existing WordPress theme from, say, XHTML 1.0 to HTML5 for server compatibility reasons, you may notice thousands of errors.

It means that you have incompatibility problems with the DOCTYPE in the theme and the language that is actually being used.

This happens frequently when someone copies and pastes old code into a new site implementation without regard to any coding rules whatsoever.

This can be disastrous to cross-platform compatibility.

Also, this simple check can help you reveal exactly what’s working (or not working) under the hood right now code-wise.

Efficient coding best practices come into play in avoiding mistakes like inadvertently putting multiple closing DIV tags where they shouldn’t go or being careless about how you code the layout.

All of these coding errors can be a huge detriment to the performance of your site, both from a user and search engine perspective.
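For the specific case of stray or missing closing DIV tags, a rough check is easy to script. The sketch below uses Python’s built-in `html.parser` to count unclosed and extra closing `div` tags; it is a simplified illustration and not a substitute for the W3C validator, which checks far more than tag balance.

```python
from html.parser import HTMLParser


class DivBalanceChecker(HTMLParser):
    """Count <div> tags that are never closed and </div> tags with no opener."""

    def __init__(self):
        super().__init__()
        self.open_divs = 0
        self.stray_closing = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs += 1

    def handle_endtag(self, tag):
        if tag == "div":
            if self.open_divs > 0:
                self.open_divs -= 1
            else:
                self.stray_closing += 1


def check_divs(html):
    checker = DivBalanceChecker()
    checker.feed(html)
    checker.close()
    return {"unclosed": checker.open_divs, "stray_closing": checker.stray_closing}


print(check_divs("<div><p>Fine</p></div>"))
print(check_divs("<div>Extra closing tag</div></div>"))
```

Non-zero counts point to the kind of nesting mistake that can produce a glitched layout on some platforms.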

Common Ways Too Many WordPress Plugins Can Harm Your Site

Using Too Many Plugins

Plugins can become major problems when their use is not kept in check.

Why is this? How can this be – aren’t plugins supposed to help?

In reality, if you don’t manage your plugins properly, you can run into major site performance issues down the line.

Here are some reasons why.

Extra HTTP Requests

All files that load on your site generate requests to the server, known as HTTP requests.

Every time someone requests your page, all of your page elements load (images, video, graphics, plugins, everything), and all of these elements require an HTTP request to be transferred.

The more HTTP requests you have, the more these extra plugins will slow down your site.

This can be mostly a matter of milliseconds, and for most websites does not cause a huge issue.

It can, however, be a major bottleneck if your site is a large one, and you have hundreds of plugins.

Keeping your plugin use in check is a great idea, to make sure that your plugins are not causing a major bottleneck and causing slow page speeds.

Increased Database Queries Due to Extra Plugins

WordPress uses SQL databases in order to process queries and maintain its infrastructure.

If your site is on WordPress, it’s important to know that every plugin you add will send out extra database queries.

These extra queries can add up, and cause bottleneck issues that will negatively affect your site’s page speed.

The more you load plugins up, the slower your site will get.

If you don’t manage the database queries well, you can run into serious issues with your website’s performance, and it will have nothing to do with how your images load.

It also depends on your host.

If you suffer from a large website with too many plugins and too little in the way of resources, now may be the time for an audit to see exactly what’s happening.

The Other Problem With Plugins: They Increase the Probability of Your Website Crashing

When the right plugins are used, you don’t have to worry (much) about keeping an eye on them.

You should, however, be mindful of when plugins are usually updated, and how they work with your WordPress implementation to make sure your website stays functional.

If you auto-update your plugins, you may have an issue one day where a plugin does not play nice with other plugins. This could cause your site to crash.

This is why it is so important to manage your WordPress plugins.

And make sure that you don’t exceed what your server is capable of.

This Is Why It’s Important to Tackle Technical SEO Before Link Building

Many technical SEO issues can rear their ugly head and affect your site’s SERP performance long before link building enters the equation.

That’s why it’s important to tackle technical SEO before you start link building.

Any technical SEO issues can cause significant drops in website performance long before link building ever becomes a factor.

Start with a thorough technical SEO audit to reveal and fix any on-site issues.

It will help identify any weaknesses in your site, and these changes will all work together with link building to create an even better online presence for you, and your users.

Any link building will be for naught if search engines (or your users) can’t accurately crawl, navigate, or otherwise use your site.


Timeframe: Month 1, 2, 3 and every quarter

Results Detected: 1-4 months after implementation

Tools needed:

  • Screaming Frog
  • DeepCrawl
  • Ahrefs (or Moz)
  • Google Search Console
  • Google Analytics

Link building benefits of technical SEO:

  • Technical SEO will help you get the maximum performance out of your links.
  • Technical SEO elements like a clean site structure and an understanding of PR flow are key for internal link placement.

Image Credits

Featured Image: Paulo Bobita
In-post images/screenshots taken by author, July 2019


Crawl data analysis of 2 billion links from 90 million domains offers a glimpse into today’s web




The web is not only essential for people working in digital marketing, but for everyone. We professionals in this field need to understand the big picture of how the web functions for our daily work. We also know that optimizing our customers’ sites is not just about the sites themselves, but also about improving their presence on the web, where they are connected to other sites by links.

To get an overall view of information about the web we need data, lots of data. And we need it on a regular basis. Some organizations provide open data for this purpose, like HTTP Archive, which collects and permanently stores the web’s digitized content and offers it as a public dataset. A second example is Common Crawl, an organization that crawls the web every month. Their web archive has been collecting petabytes of data since 2011. In their own words, “Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”

In this article, a quick data analysis of Common Crawl’s recent public data and metrics will be presented to offer a glimpse into what’s happening on the web today.

This data analysis was performed on almost two billion edges of nearly 90 million hosts. For the purposes of this article, the term “edge” will be used as a reference to a link. An edge from one host (domain) to another is counted only once if there is at least one link from one host to the other host. Also note that the PageRank of a host depends on the number of links it receives from other hosts, not on the number it gives to others.
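The counting rule is easy to misread, so here is a minimal Python sketch of it: duplicate links between the same pair of hosts collapse into one edge, and each host’s incoming and outgoing counts are derived from those edges. The host names are made up, and excluding links from a host to itself is an assumption based on the “other hosts” wording in the column descriptions.

```python
from collections import defaultdict


def host_degrees(links):
    """links: iterable of (source_host, target_host) pairs, one per hyperlink.

    Returns (n_in_hosts, n_out_hosts) dictionaries counted over unique edges.
    """
    # A set keeps each (source, target) pair once, however many links exist.
    edges = {(source, target) for source, target in links if source != target}
    n_in, n_out = defaultdict(int), defaultdict(int)
    for source, target in edges:
        n_out[source] += 1
        n_in[target] += 1
    return dict(n_in), dict(n_out)


# Hypothetical links: a.test links to b.test twice, but that is one edge.
links = [
    ("a.test", "b.test"),
    ("a.test", "b.test"),
    ("a.test", "c.test"),
    ("b.test", "c.test"),
    ("c.test", "c.test"),
]
n_in, n_out = host_degrees(links)
print("incoming:", n_in)
print("outgoing:", n_out)
```

Because every edge adds one to some host’s outgoing count and one to another host’s incoming count, the averages of the two columns must be equal, which is consistent with the identical means reported in the statistics of the final dataset.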

There is also a dependency between the number of links given to hosts and the number of subdomains of a host. This is not a great surprise given that, of the nearly 90 million hosts, the one receiving links from the maximum number of hosts is “googleapis.com,” while the host sending links to the maximum number of hosts is “blogspot.com.” And the host having the maximum number of hosts (subdomains) is “wordpress.com.”

The public Common Crawl data include crawls from May, June and July 2019.

The main data analysis is performed on the following three compressed Common Crawl files.

These two datasets are used for the additional data analysis concerning the top 50 U.S. sites.

The Common Crawl data provided in three compressed files belongs to their recent domain-level graph. First, in the “domain vertices” file, there are 90 million nodes (naked domains). In the “domain edges” file, there are two billion edges (links). Lastly, the “domain ranks” file contains the rankings of naked domains by their PageRank and harmonic centrality.

Harmonic centrality is a centrality measure like PageRank used to discover the importance of the nodes in a graph. Since 2017, Common Crawl has been using harmonic centrality in their crawling strategy for prioritization by link analysis. Additionally, in the “domain ranks” dataset, the domains are sorted according to their harmonic centrality values, not their PageRank values. Although harmonic centrality doesn’t correlate with PageRank on the final dataset, it correlates with PageRank in the top 50 U.S. sites data analysis. There is a compelling video, “A Modern View of Centrality Measures,” in which Paolo Boldi presents a comparison of PageRank and harmonic centrality measurements on the Hollywood graph. He states that harmonic centrality selects top nodes better than PageRank.
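For readers who have not met the measure before, the Python sketch below computes harmonic centrality on a tiny made-up graph using its textbook definition: a node’s score is the sum of 1/d over every other node that can reach it, where d is the shortest-path distance to the node. This is an illustration only; how Common Crawl computes and scales its published values at web scale is not covered here.

```python
from collections import deque


def harmonic_centrality(edges):
    """edges: iterable of (source, target) links. Returns {node: score}."""
    nodes = set()
    predecessors = {}
    for source, target in edges:
        nodes.update((source, target))
        predecessors.setdefault(target, set()).add(source)

    scores = {}
    for node in nodes:
        # Breadth-first search backwards along links gives the shortest
        # distance from every node that can reach `node`.
        distance = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for previous in predecessors.get(current, ()):
                if previous not in distance:
                    distance[previous] = distance[current] + 1
                    queue.append(previous)
        scores[node] = sum(1 / d for other, d in distance.items() if other != node)
    return scores


# Hypothetical four-host graph: a -> c, b -> c, c -> d.
scores = harmonic_centrality([("a", "c"), ("b", "c"), ("c", "d")])
for host in sorted(scores):
    print(host, scores[host])
```

In this toy graph, host d scores as high as host c even though it receives only one direct link, because hosts two steps away still contribute half a point each.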

[All Common Crawl data used in this article is from May, June and July 2019.]

Preview of Common Crawl “domain vertices” dataset

Preview of Common Crawl “domain edges” dataset

Preview of Common Crawl “domain ranks” dataset sorted by harmonic centrality

Preview of the final dataset obtained from the three main Common Crawl datasets (“domain vertices,” “domain edges” and “domain ranks”), sorted by PageRank

Column names:

  • host_rev: Reversed host name, for example ‘googleapis.com’ becomes ‘com.googleapis’
  • n_in_hosts: Number of other hosts which the host receives at least one link from
  • n_out_hosts: Number of other hosts which the host sends at least one link to
  • harmonicc_pos: Harmonic centrality position of the host
  • harmonicc_val: Harmonic centrality value of the host
  • pr_pos: PageRank position of the host
  • pr_val: PageRank value of the host
  • n_hosts: Number of  hosts (subdomains) belonging to the host
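The reversed host notation used in host_rev simply flips the order of the dot-separated labels. A minimal Python sketch, using host names that appear in the statistics of this dataset:

```python
def reverse_host(host):
    """Flip the order of the dot-separated labels in a host name."""
    return ".".join(reversed(host.split(".")))


print(reverse_host("googleapis.com"))
print(reverse_host("com.blogspot"))
```

Applying the function twice returns the original name, so the same helper converts in both directions.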

Statistics of Common Crawl final dataset

*link: Counted as a link if there is at least one link from one host to the other

  • Number of incoming hosts:
    • Mean, min, max of n_in_hosts = 21.63548751, 0, 20081619
    • The reversed host receiving links from the maximum number of hosts is ‘com.googleapis’.
  • Number of outgoing hosts:
    • Mean, min, max of n_out_hosts = 21.63548751, 0, 7813499
    • The reversed host sending links to the maximum number of hosts is ‘com.blogspot’.
  • PageRank
    • Mean, min, max of pr_val = 1.13303402e-08, 0., 0.02084144
  • Harmonic centrality
    • Mean, min, max of harmonicc_val = 10034682.46655859, 0., 29977668.
  • Number of hosts (subdomains)
    • Mean, min, max of n_hosts = 5.04617139, 1, 7034608
    • The reversed host having the maximum number of hosts (subdomains) is ‘com.wordpress’.
  • Correlations
    • correlation(n_in_hosts, n_out_hosts) = 0.11155189
    • correlation(n_in_hosts, n_hosts) = 0.07653162
    • correlation(n_out_hosts, n_hosts) = 0.60220516
    • correlation(n_in_hosts, pr_val) = 0.96545709
    • correlation(n_out_hosts, pr_val) = 0.08552065
    • correlation(n_in_hosts, harmonicc_val) = 0.00527706
    • correlation(n_out_hosts, harmonicc_val) = 0.00440205
    • correlation(pr_val, harmonicc_val) = 0.00400214
    • correlation(pr_val, n_hosts) = 0.05847027
    • correlation(harmonicc_val, n_hosts) = 0.00042441

The correlation results show that the number of incoming hosts (n_in_hosts) is correlated with both the PageRank value (pr_val) and the number of outgoing hosts (n_out_hosts); the former correlation is very strong (about 0.97), while the latter is weak (about 0.11). There is also a dependency between the number of outgoing hosts and the number of hosts (n_hosts), that is, the subdomains of a host (about 0.60).
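The article does not say which correlation coefficient was used. Assuming it is the common Pearson coefficient, the Python sketch below shows how such a value is computed for two columns; the sample numbers are made up and are not taken from the Common Crawl data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical columns for five hosts.
n_in_hosts = [1, 4, 10, 50, 200]
pr_val = [0.001, 0.003, 0.008, 0.04, 0.15]
print(round(pearson(n_in_hosts, pr_val), 3))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. The function divides by zero if either column is constant, so a real analysis would guard against that case.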

Data visualization: Distribution of PageRank

The graph below presents the plot of the count of pr_val values. It shows us that the distribution of PageRank across almost 90 million hosts is highly right-skewed, meaning the majority of the hosts have very low PageRank.
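Right skew can also be checked numerically. The sketch below is one possible check, not the author's method: it computes the population (moment) skewness and compares mean with median on a hypothetical sample in which most values are small and one is very large.

```python
import statistics


def skewness(values):
    """Population (moment) skewness; positive values indicate a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical sample (assumption): many low values and one large outlier.
sample = [1, 1, 1, 2, 2, 3, 5, 40]

print("mean:", statistics.fmean(sample))
print("median:", statistics.median(sample))
print("skewness:", skewness(sample))
```

In a right-skewed sample the mean sits above the median because the few large values pull it upward, which is the pattern the PageRank, n_hosts, n_in_hosts and n_out_hosts plots describe.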

Distribution of the number of hosts

The following graph presents the plot of the count of n_hosts (subdomains) values. It shows us that the distribution of the number of hosts (subdomains) across almost 90 million hosts is highly right-skewed, meaning the majority of the hosts have a low number of subdomains.

Distribution of the number of incoming hosts

The graph below presents the plot of the count of n_in_hosts (number of incoming hosts) values. It shows us that this distribution is right-skewed, too.

Distribution of number of outgoing hosts

The following graph shows the plot of the count of n_out_hosts (number of outgoing hosts) values. Again, this distribution is also right-skewed.

Distribution of harmonic centrality 

The following graph presents the plot of the count of harmonicc_val column values. It shows that the distribution of harmonicc_val across almost 90 million hosts is not highly right-skewed like the PageRank or number-of-hosts distributions. It is not a perfect Gaussian distribution, but it is closer to Gaussian than the distributions of PageRank and number of hosts. This distribution is multimodal.

Scatter plot of number of incoming hosts vs number of outgoing hosts

The graph below presents the scatter plot of n_in_hosts on the x-axis and n_out_hosts on the y-axis. It shows that the numbers of outgoing and incoming hosts are not directly dependent on each other overall. In other words, when the number of hosts linking to a host increases, the number of hosts it links out to does not necessarily increase. Hosts without a significant number of incoming hosts link out to other hosts readily, whereas hosts with a large number of incoming hosts are less generous.

Scatter plot of number of incoming hosts vs. PageRank 

The graph below presents the scatter plot of the n_in_hosts values on the x-axis and the pr_val values of hosts on the y-axis. It shows us that there is a correlation between the number of incoming hosts to a host and its PageRank. In other words, the more hosts link to a host, the greater its PageRank value tends to be.

Scatter plot of number of outgoing hosts vs. PageRank 

The graph below presents the scatter plot of n_out_hosts on the x-axis and the pr_val values of hosts on the y-axis. It shows us that the correlation seen between the number of incoming hosts and PageRank does not exist between the number of outgoing hosts and PageRank.

Scatter plot of PageRank and harmonic centrality 

As the majority of hosts have low PageRank, we see a vertical line when we scatter plot the PageRank and harmonic centrality values of hosts. However, the hosts’ PageRank values begin to detach from the mass when their harmonic centrality value approaches 1.5e7, and the detachment accelerates beyond that value.

Top 50 US sites

Top 50 U.S. sites data are selected from the final Common Crawl dataset obtained at the beginning. Their hosts are reversed in order to match the column “host_rev” in the final Common Crawl dataset. For example, “” becomes “”. Below is a preview of this selection. There are 49 sites instead of 50 because “” doesn’t exist in the Common Crawl dataset but “” does.
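Reversing a host name, as described above, amounts to flipping the order of its dot-separated labels. The following is a minimal sketch; the function name `reverse_host` and the `example.com` host are hypothetical, and whether the original analysis lowercased hosts or dropped a leading "www" is not stated in the text.

```python
def reverse_host(host):
    """Reverse the dot-separated labels of a host name."""
    labels = host.strip().lower().split(".")
    return ".".join(reversed(labels))


# Hypothetical host used only for illustration.
print(reverse_host("www.example.com"))  # com.example.www
```

Applying the function twice returns the original (lowercased) host, so the same helper can convert a "host_rev" value back to its usual form.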

The Majestic Million public dataset is also imported. The preview of this file is below.

These two datasets, the top 50 U.S. sites with their Common Crawl metrics and the Majestic Million dataset, are merged. The refips and refsubnets values are summed by reversed host.
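The merge-and-sum step can be sketched with plain dictionaries. Everything below is an assumption made for illustration: the rows are invented, the join is an inner join on "host_rev", and the field names simply mirror those mentioned in the article (refips, refsubnets, refips_sum, refsubnets_sum).

```python
from collections import defaultdict

# Hypothetical rows; real data would come from the two source files.
top_sites = [
    {"host_rev": "com.example", "pr_val": 0.002, "n_in_hosts": 1200},
    {"host_rev": "org.example", "pr_val": 0.001, "n_in_hosts": 800},
]
majestic = [
    {"host_rev": "com.example", "refips": 500, "refsubnets": 300},
    {"host_rev": "com.example", "refips": 120, "refsubnets": 90},
    {"host_rev": "org.example", "refips": 60, "refsubnets": 40},
    {"host_rev": "net.example", "refips": 10, "refsubnets": 5},
]


def sum_by_host(rows):
    """Sum refips and refsubnets over all rows sharing a reversed host."""
    totals = defaultdict(lambda: {"refips_sum": 0, "refsubnets_sum": 0})
    for row in rows:
        totals[row["host_rev"]]["refips_sum"] += row["refips"]
        totals[row["host_rev"]]["refsubnets_sum"] += row["refsubnets"]
    return totals


def merge(sites, rows):
    """Inner join: keep only sites that also appear in the summed rows."""
    totals = sum_by_host(rows)
    return [
        {**site, **totals[site["host_rev"]]}
        for site in sites
        if site["host_rev"] in totals
    ]


merged = merge(top_sites, majestic)
for row in merged:
    print(row)
```

Summing before joining keeps one row per reversed host, so a site that appears under several entries in the second file contributes a single combined total.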

The preview of this final dataset is below.

Statistics of top 50 US sites final dataset

  • Number of incoming hosts: 
    • mean, min, max of n_in_hosts  = 1565724.63265306, 1015, 16537551
  • Number of outgoing hosts:
    • mean, min, max of n_out_hosts  = 80812.70833333, 28., 2529655
  • PageRank
    • mean, min, max of pr_val  = 0.00105891, 9.73490741e-07, 0.01285745 
  • Harmonic centrality
    • mean, min, max of harmonicc_val  = 18871331.16326531, 14605537., 27867704
  • Number of hosts (subdomains)
    • mean, min, max of n_hosts  = 36426.79591837, 22, 1555402

From this dataset, which combines the Common Crawl data and the Majestic Million data for the top 50 U.S. sites, a pairwise scatter plot of the metrics pr_val, n_in_hosts, n_out_hosts, harmonicc_val, refips_sum and refsubnets_sum is created and can be seen below.

This pairwise scatter plot shows us that the PageRank of the top 50 U.S. sites is somewhat correlated with all the metrics used in this graph except the number of outgoing hosts (n_out_hosts).

The correlation heatmap of these metrics is shown below.


The data analysis of the top 50 U.S. sites shows a dependency between the number of incoming hosts and two Majestic metrics: referring IP addresses (refips) and referring subnets (refsubnets, the subdivisions of an IP network that point to the target domain). Harmonic centrality is correlated with the PageRank, number of incoming hosts, refips and refsubnets of the hosts.

Across the almost 90 million host ranks and their two billion edges (an edge is a link counted only once, even if there are many links from a single host), there is a strong correlation between PageRank and the number of incoming edges to each host. However, we can’t say the same for the number of outgoing edges from hosts.
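The edge-counting rule, where many links between the same pair of hosts count once, can be sketched as follows. The host names are hypothetical, and dropping links from a host to itself is an assumption based on the column definitions, which count links to and from other hosts.

```python
from collections import defaultdict


def host_degrees(links):
    """Collapse hyperlinks into host-level edges and count them per host.

    links: iterable of (source_host, target_host) pairs, one per hyperlink.
    Returns (n_in_hosts, n_out_hosts) as dictionaries keyed by host.
    """
    # A set keeps each (source, target) pair once; self-links are dropped.
    edges = {(src, dst) for src, dst in links if src != dst}
    n_in = defaultdict(int)
    n_out = defaultdict(int)
    for src, dst in edges:
        n_out[src] += 1
        n_in[dst] += 1
    return dict(n_in), dict(n_out)


# Hypothetical hyperlinks; the first pair appears twice on purpose.
links = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("c.example", "b.example"),
    ("b.example", "b.example"),
]

n_in, n_out = host_degrees(links)
print("n_in_hosts:", n_in)
print("n_out_hosts:", n_out)
```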

In this data analysis, we find a correlation between the number of subdomains and the number of outgoing edges from one host to other hosts. The distribution of PageRank on this web graph is highly right-skewed, meaning the majority of the hosts have very low PageRank.

Ultimately, the main data analysis tells us that the majority of domains on the web have low PageRank, a low number of incoming and outgoing edges and a low number of host subdomains. We know this because all of these features have the same highly right-skewed type of data distribution.

PageRank is still a popular and well-known centrality measure. One of the reasons for its success is its performance on data whose distribution is comparable to the distribution of edges across domains.

Common Crawl is an invaluable and neglected public data source for SEO. Its tremendous volume of data is technically not easy to access even though it is public. However, it provides a “domain ranks” file once every three months that is relatively easy to analyze compared to the raw monthly crawl data. Due to a lack of resources, we cannot crawl the web and calculate the centrality measures ourselves, but we can take advantage of this extremely useful resource to analyze our customers’ websites and their competitors’ rankings, along with their connections on the web.

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

About The Author

Aysun Akarsu is a trilingual data scientist specializing in machine intelligence for digital marketing who wants to help companies make data-driven decisions to reach a broader, qualified audience. Aysun writes regularly about SEO data analysis on her blog, SearchDatalogy.



If Google says H1s don’t matter for rankings, why should you use them? Here’s why




On October 3, Webmaster Trends Analyst John Mueller delivered an edition of #AskGoogleWebmasters describing how Google approaches H1 headings with regards to ranking. The explanation caused a bit of a stir.

What Google said

“Our systems aren’t too picky and we’ll try to work with the HTML as we find it — be it one H1 heading, multiple H1 headings or just styled pieces of text without semantic HTML at all,” Mueller said.

In other words, Mueller is saying Google’s systems don’t have to rely on a specific heading structure to identify the main focus of the content on a page.

What’s the fuss?

Mueller’s answer would appear to counter a longstanding “best practice” to use and optimize a single H1 and subsequent headings on a page. This is even reflected in the weighting of +2 that headings were given in our own most recent Periodic Table of SEO Factors.

“This seems to directly contradict years of SEO advice I’ve been given by all the SEO experts,” Dr. John Grohol, founder of, tweeted, expressing a reaction shared by many. Others cited their own experiences of seeing how H1 implementations can affect organic visibility.

How headings are designed to be used

The hierarchy of headings communicates what the content on a page is about as well as how ideas are grouped, making it easy for users to navigate the page. Applying multiple H1s or skipping headings altogether can create a muddled page structure and make a page harder to read.
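A page's heading outline can be checked mechanically. The sketch below uses Python's standard `html.parser` module to flag the two problems named above, more or fewer than one H1 and skipped heading levels. The class and function names are hypothetical, and the "exactly one H1" rule reflects the convention discussed in this article rather than any Google requirement.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect the levels of h1 to h6 tags in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def audit_headings(html):
    """Return a list of human-readable issues found in the heading outline."""
    parser = HeadingOutline()
    parser.feed(html)
    issues = []
    h1_count = parser.levels.count(1)
    if h1_count != 1:
        issues.append(f"expected one h1, found {h1_count}")
    for prev, cur in zip(parser.levels, parser.levels[1:]):
        if cur > prev + 1:
            issues.append(f"h{prev} is followed by h{cur}, skipping a level")
    return issues


# Hypothetical page fragment with two h1 tags and a skipped level.
page = "<h1>Title</h1><h3>Detail</h3><h1>Another title</h1>"
for issue in audit_headings(page):
    print(issue)
```

Moving back up the hierarchy, for example from an h3 to an h2, is not flagged, since starting a new section at a higher level is normal.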

Accessibility is also a significant reason to use headings, a point made even more salient now that the courts have ruled that websites fall under the Americans with Disabilities Act.

“Heading markup will allow assistive technologies to present the heading status of text to a user,” the World Wide Web Consortium’s Web Content Accessibility Guidelines (WCAG) explains. “A screen reader can recognize the code and announce the text as a heading with its level, beep or provide some other auditory indicator. Screen readers are also able to navigate heading markup which can be an effective way for screen reader users to more quickly find the content of interest. Assistive technologies that alter the authored visual display will also be able to provide an appropriate alternate visual display for headings that can be identified by heading markup.”

Joost de Valk, founder of Yoast SEO WordPress plugin, noted that most WordPress themes are designed to have a single H1 heading just for post titles — “not for SEO (although that won’t hurt) but for decent accessibility.”

SEO consultant Alan Bleiweiss pointed to a WebAIM survey that found 69% of screen reader users use headings to navigate through a page and 52% find heading levels very useful.

Many SEOs are concerned that Google’s lack of emphasis on accessibility standards, including rel=prev/next, may disincentivize site owners from implementing them, potentially making content harder to understand for users who depend on screen-reading technology, such as the visually impaired. Do that at your own risk.

H1s and SEO

“It is naive to think that Google completely ignores the H1 tag,” Hamlet Batista, CEO and founder of RankSense, told Search Engine Land.

“I’ve seen H1s used in place of title tags in the SERPs. So, it is a good idea to make the H1 the key topic of the page; in case this happens, you have a reasonably good headline,” Batista said, adding that having multiple H1s may provide less control of what text could appear in the search results if the H1 is used instead of the title.

Others said headings hiccups have hurt rankings.

In the comment above, which was left on Search Engine Roundtable’s coverage of the announcement, the commenter attributes the performance decline to an error that resulted in removal of H1s from his content.

You should still use proper headings

All John Mueller is saying is that Google can usually figure out what’s important on a page even when you’re not using headings or heading hierarchies. “It’s not a secret ranking push,” Mueller added in a follow-up. “A script sees the page, you’re highlighting some things as ‘important parts,’ so often we can use that a bit more relative to the rest. If you highlight nothing/everything, we’ll try to figure it out.”

As Mueller said at the end of the #AskGoogleWebmasters video, “When thinking about this topic, SEO shouldn’t be your primary objective. Instead, think about your users: if you have ways of making your content accessible to them, be it by using multiple H1 headings or other standard HTML constructs, then that’s not going to get in the way of your SEO efforts.”

About The Author

George Nguyen is an Associate Editor at Third Door Media. His background is in content marketing, journalism, and storytelling.



Revenge of the small business website




For several years, many SEOs have been proclaiming the end of small business (SMB) websites. The theory is that third party destinations (GMB, Facebook, Yelp, etc.) have taken over and SMB sites will rarely see consumer visits if at all. GMB now is so widely used and so complete, the argument goes, that consumers never need to visit the underlying SMB site.

Recent investments and M&A. That description of consumer behavior is partly correct but not entirely. Websites continue to be a critical SMB asset and content anchor. That fact is underscored by WordPress parent Automattic’s most recent funding round of $300 million (at a $3+ billion valuation) and Square’s April 2018 roughly $365 million acquisition of site builder Weebly.

On a smaller scale, ten-year-old web design platform Duda recently raised $25 million (for just under $50 million in total funding). Duda has a network of more than 6,000 third-party resellers and agencies that work with SMBs. It will continue to focus on websites and presence management rather than expand horizontally into other marketing channels.

New Yahoo web design service. In addition, late last week Verizon-owned Yahoo launched a new web design product for SMBs. There are two service tiers ($99 and $299 per month). The offering includes design consultation, ongoing maintenance and content updates (it’s a SaaS product).

Yahoo Small Business was at one time the premier hosting company for SMBs. During a long period of somnambulance, it was surpassed by GoDaddy and others. But following Verizon’s $4+ billion acquisition of Yahoo in 2016, the company has sought to invest in and develop new small business products and services and regain momentum. Its brand has remained relatively strong among SMBs across the U.S. despite the decline of Yahoo itself.

Now, Yahoo is developing a new generation of marketing products and services for SMBs. The web design service is just the first announcement.

SMB sites more trusted, still visited. A May 2019 consumer survey from BrightLocal found nearly twice as many respondents (56%) expected SMB websites to be accurate compared with Google My Business (32%). This was a surprise. However, a 2018 survey from the SEO firm found that the most common consumer action by a fairly significant margin, after reading a positive review, was to visit the SMB’s website.

Why we should care. The Small Business Administration says (.pdf) there are now roughly 30 million SMBs in the U.S. The SBA defines “small business” as having a headcount of up to 499 employees. There’s a massive difference between a firm with three or even 20 employees and one that has 300. Regardless, well over 90% of U.S. SMBs have fewer than 10 employees.

While a majority of SMBs now have websites in theory (64% according to a 2018 Clutch survey), there’s still a significant opportunity for providers of websites. New businesses form and fail every quarter. And even with shrinking reach in organic search and social, websites are likely to remain the anchor of SMB digital marketing into the foreseeable future.

About The Author

Greg Sterling is a Contributing Editor at Search Engine Land. He writes about the connections between digital and offline commerce. He previously held leadership roles at LSA, The Kelsey Group and TechTV. Follow him on Twitter or find him on LinkedIn.



Copyright © 2019 Plolu.