
Crawl data analysis of 2 billion links from 90 million domains offers a glimpse into today’s web



The web is essential not only for people working in digital marketing, but for everyone. Professionals in this field need to understand the big picture of how the web functions for their daily work. We also know that optimizing our customers’ sites is not just about the sites themselves, but also about improving their presence on the web, where they are connected to other sites by links.

To get an overall view of information about the web we need data, lots of data, and we need it on a regular basis. Some organizations provide open data for this purpose, like HTTP Archive, which collects and permanently stores the web’s digitized content and offers it as a public dataset. A second example is Common Crawl, an organization that crawls the web every month. Their web archive has been collecting petabytes of data since 2011. In their own words, “Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”

In this article, a quick data analysis of Common Crawl’s recent public data and metrics will be presented to offer a glimpse into what’s happening on the web today.

This data analysis was performed on almost two billion edges of nearly 90 million hosts. For the purposes of this article, the term “edge” is used to mean a link: an edge from one host (domain) to another is counted only once, no matter how many links exist between the two hosts. Note also that the PageRank of a host depends on the number of links it receives from other hosts, not on the number it gives to others.

There is also a dependency between the number of links a host gives to other hosts and the number of subdomains it has. This is not a great surprise given that, of the nearly 90 million hosts, the one receiving links from the largest number of hosts is “googleapis.com,” the host sending links to the largest number of hosts is “blogspot.com,” and the host with the largest number of hosts (subdomains) is “wordpress.com.”

The public Common Crawl data include crawls from May, June and July 2019.

The main data analysis is performed on the following three compressed Common Crawl files.

These two datasets are used for the additional data analysis concerning the top 50 U.S. sites.

The Common Crawl data provided in these three compressed files belong to their recent domain-level graph. The “domain vertices” file contains 90 million nodes (naked domains). The “domain edges” file contains their two billion edges (links). Lastly, the “domain ranks” file contains the rankings of the naked domains by PageRank and harmonic centrality.

Harmonic centrality is a centrality measure, like PageRank, used to discover the importance of nodes in a graph. Since 2017, Common Crawl has been using harmonic centrality in its crawling strategy for prioritization by link analysis. Additionally, in the “domain ranks” dataset the domains are sorted by their harmonic centrality values, not by their PageRank values. Although harmonic centrality doesn’t correlate with PageRank on the final dataset, it does correlate with PageRank in the top 50 U.S. sites data analysis. There is a compelling video, “A Modern View of Centrality Measures,” in which Paolo Boldi presents a comparison of PageRank and harmonic centrality on the Hollywood graph. He states that harmonic centrality selects top nodes better than PageRank.
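To make the distinction concrete, here is a minimal sketch using the networkx library that computes both measures on a tiny invented graph (the hosts and edges are made up purely for illustration; Common Crawl computes these measures at vastly larger scale):

```python
import networkx as nx

# A toy directed host graph; edges point from the linking host to the
# linked host. Hosts and links here are invented for illustration only.
G = nx.DiGraph([
    ("a.com", "b.com"), ("a.com", "c.com"),
    ("b.com", "c.com"), ("d.com", "c.com"),
    ("c.com", "a.com"),
])

# Harmonic centrality of x sums 1/d(y, x) over all other nodes y,
# where d(y, x) is the shortest-path distance from y to x.
harmonic = nx.harmonic_centrality(G)

# PageRank with the default damping factor (0.85).
pagerank = nx.pagerank(G)

for host in G.nodes:
    print(f"{host}: harmonic={harmonic[host]:.2f}, pagerank={pagerank[host]:.3f}")
```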


Preview of Common Crawl “domain vertices” dataset

Preview of Common Crawl “domain edges” dataset

Preview of Common Crawl “domain ranks” dataset sorted by harmonic centrality  

Preview of the final dataset, obtained by joining the three main Common Crawl datasets (“domain vertices,” “domain edges” and “domain ranks”), sorted by PageRank

Column names:

  • host_rev: Reversed host name, for example ‘google.com’ becomes ‘com.google’ 
  • n_in_hosts: Number of other hosts from which the host receives at least one link
  • n_out_hosts: Number of other hosts to which the host sends at least one link
  • harmonicc_pos: Harmonic centrality position of the host
  • harmonicc_val: Harmonic centrality value of the host
  • pr_pos: PageRank position of the host
  • pr_val: PageRank value of the host
  • n_hosts: Number of hosts (subdomains) belonging to the host
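As a rough sketch of how such a table might be loaded and inspected with pandas (the file name, delimiter and absence of a header row are assumptions; adapt them to the actual Common Crawl artifacts):

```python
import pandas as pd

# Hypothetical file name; Common Crawl distributes these graph files
# as compressed text, which pandas decompresses transparently.
cols = ["host_rev", "n_in_hosts", "n_out_hosts", "harmonicc_pos",
        "harmonicc_val", "pr_pos", "pr_val", "n_hosts"]
df = pd.read_csv("cc_final_dataset.tsv.gz", sep="\t", names=cols)

print(df.head())
print(df[["n_in_hosts", "pr_val", "harmonicc_val", "n_hosts"]].describe())
```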

Statistics of Common Crawl final dataset

*link: Counted as a link if there is at least one link from one host to another.

  • Number of incoming hosts:
    • mean, min, max of n_in_hosts = 21.63548751, 0, 20081619
    • The reversed host receiving links from the maximum number of hosts is ‘com.googleapis’.
  • Number of outgoing hosts:
    • mean, min, max of n_out_hosts = 21.63548751, 0, 7813499
    • The reversed host sending links to the maximum number of hosts is ‘com.blogspot’.
  • PageRank:
    • mean, min, max of pr_val = 1.13303402e-08, 0., 0.02084144
  • Harmonic centrality:
    • mean, min, max of harmonicc_val = 10034682.46655859, 0., 29977668.
  • Number of hosts (subdomains):
    • mean, min, max of n_hosts = 5.04617139, 1, 7034608
    • The reversed host having the maximum number of hosts (subdomains) is ‘com.wordpress’.
  • Correlations:
    • correlation(n_in_hosts, n_out_hosts) = 0.11155189
    • correlation(n_in_hosts, n_hosts) = 0.07653162
    • correlation(n_out_hosts, n_hosts) = 0.60220516
    • correlation(n_in_hosts, pr_val) = 0.96545709
    • correlation(n_out_hosts, pr_val) = 0.08552065
    • correlation(n_in_hosts, harmonicc_val) = 0.00527706
    • correlation(n_out_hosts, harmonicc_val) = 0.00440205
    • correlation(pr_val, harmonicc_val) = 0.00400214
    • correlation(pr_val, n_hosts) = 0.05847027
    • correlation(harmonicc_val, n_hosts) = 0.00042441

The correlation results show that the number of incoming hosts (n_in_hosts) correlates with both the PageRank value (pr_val) and the number of outgoing hosts (n_out_hosts); the former correlation is very strong, while the latter is weak. There is also a dependency between the number of outgoing hosts and the number of hosts (n_hosts), that is, the subdomains of a host.
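These pairwise correlations can be reproduced with a few lines of pandas; the sketch below assumes the merged dataset is already loaded as the df DataFrame from the earlier loading sketch:

```python
# Pearson correlations between the numeric metrics of the final dataset.
metrics = ["n_in_hosts", "n_out_hosts", "n_hosts", "pr_val", "harmonicc_val"]
corr = df[metrics].corr(method="pearson")
print(corr.round(8))

# A single pair, e.g. the strong n_in_hosts vs. pr_val relationship:
print(df["n_in_hosts"].corr(df["pr_val"]))
```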

Data visualization: Distribution of PageRank

The graph below presents the plot of the count of pr_val values. It shows that the distribution of PageRank across almost 90 million hosts is highly right-skewed, meaning the majority of the hosts have very low PageRank.
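A plot like this can be produced with matplotlib; the sketch below (reusing the hypothetical df DataFrame) uses a log-scaled y-axis, since a heavily right-skewed distribution is otherwise invisible beyond its first bins:

```python
import matplotlib.pyplot as plt

# df as loaded earlier; almost all hosts fall into the lowest bins.
fig, ax = plt.subplots()
ax.hist(df["pr_val"], bins=100)
ax.set_yscale("log")
ax.set_xlabel("PageRank value (pr_val)")
ax.set_ylabel("Number of hosts (log scale)")
ax.set_title("Distribution of PageRank")
plt.show()
```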

Distribution of the number of hosts

The following graph presents the plot of the count of n_hosts (subdomains) values. It shows that the distribution of the number of hosts (subdomains) across almost 90 million hosts is highly right-skewed, meaning the majority of the hosts have a low number of subdomains.

Distribution of the number of incoming hosts

The graph below presents the plot of the count of n_in_hosts (number of incoming hosts) values. It shows us that this distribution is right-skewed, too.

Distribution of number of outgoing hosts

The following graph shows the plot of the count of n_out_hosts (number of outgoing hosts) values. Again, this distribution is also right-skewed.

Distribution of harmonic centrality 

The following graph presents the plot of the count of harmonicc_val values. It shows that the distribution of harmonic centrality across almost 90 million hosts is not highly right-skewed like the PageRank or number-of-hosts distributions. It is not a perfect Gaussian distribution, but it is closer to Gaussian than the distributions of PageRank and number of hosts, and it is multimodal.

Scatter plot of number of incoming hosts vs number of outgoing hosts

The graph below presents the scatter plot of n_in_hosts on the x-axis and n_out_hosts on the y-axis. It shows that the numbers of outgoing and incoming hosts are not, overall, directly dependent on each other: when the number of hosts linking to a host increases, its outgoing links to other hosts do not increase. Hosts without a significant number of incoming hosts give links to other hosts freely, but hosts with a large number of incoming hosts are not as generous.

Scatter plot of number of incoming hosts vs. PageRank 

The graph below presents the scatter plot of the n_in_hosts values on the x-axis and the pr_val values of hosts on the y-axis. It shows that there is a correlation between the number of incoming hosts of a host and its PageRank: the more hosts link to a host, the greater its PageRank value.
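A scatter plot of this kind can be drawn directly from the DataFrame; in the sketch below (again assuming df), both axes are log-scaled because the two quantities span many orders of magnitude:

```python
import matplotlib.pyplot as plt

# Drop zeros so that both axes can use a log scale.
d = df[(df["n_in_hosts"] > 0) & (df["pr_val"] > 0)]

fig, ax = plt.subplots()
ax.scatter(d["n_in_hosts"], d["pr_val"], s=2, alpha=0.3)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Number of incoming hosts (n_in_hosts)")
ax.set_ylabel("PageRank value (pr_val)")
plt.show()
```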

Scatter plot of number of outgoing hosts vs. PageRank 

The graph below presents the scatter plot of n_out_hosts on the x-axis and the pr_val values of hosts on the y-axis. It shows that the correlation observed between the number of incoming hosts and PageRank does not hold between the number of outgoing hosts and PageRank.

Scatter plot of PageRank and harmonic centrality 

As the majority of hosts have low PageRank, we see a vertical line when we scatter plot the PageRank and harmonic centrality values of hosts. But we observe that the hosts’ PageRank values begin to detach from the mass when their harmonic centrality value approaches 1.5e7, and the detachment accelerates beyond that value.

Top 50 U.S. sites

Top 50 U.S. sites data are selected from the final Common Crawl dataset obtained at the beginning. Their host names are reversed in order to match the “host_rev” column in the Common Crawl final dataset; for example, “youtube.com” becomes “com.youtube.” Below is a preview of this selection. There are 49 sites instead of 50 because “finance.yahoo.com” doesn’t exist in the Common Crawl dataset, although “com.yahoo” does.
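Reversing a host name to match Common Crawl’s convention is a one-line transformation; a minimal sketch:

```python
def reverse_host(host: str) -> str:
    """Reverse a host name: 'finance.yahoo.com' -> 'com.yahoo.finance'."""
    return ".".join(reversed(host.split(".")))

assert reverse_host("youtube.com") == "com.youtube"
```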

The Majestic Million public dataset is also imported. A preview of this file is below.

These two datasets, the top 50 U.S. sites with their Common Crawl data and metrics, and the Majestic Million data, are merged. The refips and refsubnets values are summed per reversed host.
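In pandas, the aggregation and merge might look like the sketch below, where top50 and majestic are assumed to be DataFrames holding the two datasets, and the Majestic column names are assumptions:

```python
# Aggregate Majestic Million rows per reversed host, summing the
# referring-IP and referring-subnet counts.
majestic_agg = (
    majestic.groupby("host_rev")[["refips", "refsubnets"]]
    .sum()
    .rename(columns={"refips": "refips_sum", "refsubnets": "refsubnets_sum"})
    .reset_index()
)

# Inner-join with the top-50 selection of the Common Crawl dataset.
top50_final = top50.merge(majestic_agg, on="host_rev", how="inner")
```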

A preview of this final dataset is below.

Statistics of top 50 U.S. sites final dataset

  • Number of incoming hosts:
    • mean, min, max of n_in_hosts = 1565724.63265306, 1015, 16537551
  • Number of outgoing hosts:
    • mean, min, max of n_out_hosts = 80812.70833333, 28, 2529655
  • PageRank:
    • mean, min, max of pr_val = 0.00105891, 9.73490741e-07, 0.01285745
  • Harmonic centrality:
    • mean, min, max of harmonicc_val = 18871331.16326531, 14605537, 27867704
  • Number of hosts (subdomains):
    • mean, min, max of n_hosts = 36426.79591837, 22, 1555402

From this dataset, which holds the top 50 U.S. sites’ Common Crawl data and Majestic Million data, a pairwise scatter plot of the metrics pr_val, n_in_hosts, n_out_hosts, harmonicc_val, refips_sum and refsubnets_sum is created and can be seen below.
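With seaborn, such a pairwise scatter plot is a single call; a sketch, assuming the merged top50_final DataFrame from the merge sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

metrics = ["pr_val", "n_in_hosts", "n_out_hosts",
           "harmonicc_val", "refips_sum", "refsubnets_sum"]
sns.pairplot(top50_final[metrics])
plt.show()
```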

This pairwise scatter plot shows that the PageRank of the top 50 U.S. sites is somewhat correlated with all the metrics used in this graph except the number of outgoing hosts (n_out_hosts).

The correlation heatmap of these metrics is shown below.
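The heatmap follows directly from the correlation matrix; a sketch using the same metrics list as the pair plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# metrics is the list defined for the pair plot above.
sns.heatmap(top50_final[metrics].corr(), annot=True, fmt=".2f",
            cmap="coolwarm", square=True)
plt.show()
```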

Conclusion

The data analysis of the top 50 U.S. sites shows a dependency between the number of incoming hosts and the referring IP addresses (refips) and referring subnets (refsubnets, the subdivisions of IP networks pointing to the target domain) metrics. Harmonic centrality is correlated with PageRank, the number of incoming hosts, and the refips and refsubnets of the hosts.

Across the almost 90 million host ranks and their two billion edges (an edge is counted only once even if there are many links from a single host), there is a strong correlation between PageRank and the number of incoming edges of each host. However, we cannot say the same for the number of outgoing edges from hosts.

In this data analysis, we find a correlation between the number of subdomains and the number of outgoing edges from one host to other hosts. The distribution of PageRank on this web graph is highly right-skewed meaning the majority of the hosts have very low PageRank.

Ultimately, the main data analysis tells us that the majority of domains on the web have low PageRank, a low number of incoming and outgoing edges and a low number of host subdomains. We know this because all of these features have the same highly right-skewed type of data distribution.

PageRank is still a popular and well-known centrality measure. One of the reasons for its success is its good performance on the kind of skewed data distribution that the edges of domains follow.

Common Crawl is an invaluable and often neglected public data source for SEO. The data are enormous and not technically easy to access, even though they are public. However, Common Crawl provides a “domain ranks” file every three months that is relatively easy to analyze compared to the raw monthly crawl data. Lacking the resources to crawl the web and calculate the centrality measures ourselves, we can take advantage of this extremely useful resource to analyze our customers’ websites and their competitors’ rankings along with their connections on the web.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


About The Author

Aysun Akarsu is a trilingual data scientist specializing in machine intelligence for digital marketing, who wants to help companies make data-driven decisions to reach a broader, qualified audience. Aysun writes regularly about SEO data analysis on her blog, SearchDatalogy.


WordPress.com Introduces A New Way For Websites to Make Money


Matt Southern


Websites hosted on WordPress.com can now monetize their content with a new recurring payments feature.

Available with any paid plan on WordPress.com, the recurring payment feature lets site owners collect repeat contributions from supporters in exchange for things like exclusive content or a monthly membership.

“Let your followers support you with periodic, scheduled payments. Charge for your weekly newsletter, accept monthly donations, sell yearly access to exclusive content — and do it all with an automated payment system.”

Recurring payments on WordPress.com allow site owners to:

  • Accept regularly scheduled payments directly on their site.
  • Offer ongoing subscriptions, site memberships, monthly donations, and more.
  • Integrate their site with Stripe to process payments and collect funds.

WordPress.com site owners can enable recurring payments by following the steps below:

  • Step 1: Connect (or create) a Stripe account. Visit the Earn page from the WordPress dashboard and click Connect Stripe to Get Started.
  • Step 2: Add a recurring payments button to your site using the block editor.
  • Step 3: Customize details such as payment amounts, frequencies, subscription tiers, and so on.

Websites will pay WordPress.com a percentage of revenue earned through recurring payments, which varies depending on whether it’s a personal plan (8%), premium plan (4%) or business plan (2%). In addition to WordPress.com fees, Stripe collects 2.9% + $0.30 for each payment.
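As a quick illustration of the fee arithmetic (a sketch; the hypothetical net_payout helper simply applies the percentages quoted above plus Stripe’s per-payment fee):

```python
def net_payout(amount: float, plan: str) -> float:
    """Estimated net for one recurring payment after WordPress.com and
    Stripe fees. Plan fees as quoted above: personal 8%, premium 4%,
    business 2%; Stripe charges 2.9% + $0.30 per payment."""
    wp_rates = {"personal": 0.08, "premium": 0.04, "business": 0.02}
    wp_fee = amount * wp_rates[plan]
    stripe_fee = amount * 0.029 + 0.30
    return amount - wp_fee - stripe_fee

# A $10 monthly contribution on a premium plan nets roughly:
print(f"${net_payout(10.00, 'premium'):.2f}")  # $9.01
```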

In order to make a recurring payment to a WordPress.com site, users will also need to have a WordPress.com account. If they don’t already have one, they’ll be prompted to create one when making a recurring payment for the first time.

For users, this will make it easy to subscribe to multiple sites with one account and manage all subscriptions from one place.




Here’s how to set up the Google Site Kit WordPress plugin



On Oct. 31, Google announced the launch of its Site Kit WordPress plugin that “enables you to set up and configure key Google services, get insights on how people find and use your site, learn how to improve, and easily monetize your content.”

This plugin allows you to easily connect the following Google Services in a dashboard format within your WordPress backend:

  • Search Console
  • Analytics
  • PageSpeed Insights
  • AdSense
  • Optimize
  • Tag Manager

It brings the convenience of accessing your site’s performance data while logged into the backend of the site. This is great for webmasters, developers and agencies who are often an admin for their own site or a client’s WordPress site. However, it does not offer the robust and dynamic capabilities of a Google Data Studio report or dashboard to sort data so it may not be ideal for a digital marketing manager or CMO.

With that said, it wouldn’t hurt to implement this plugin, as it’s actually a nifty tool that can help you stay on top of your site’s performance metrics. It’s also another way to give Google more access to your site, which can have some indirect organic benefits.

Here is what the Google Site Kit plugin looks like within the WordPress plugin directory.

Installing and setting up Google Site Kit

To use the plugin, simply click install and activate as you would with any other WordPress plugin. You will then be prompted to complete the setup.

Step 1

Click on the “Start Setup” button.

Step 2

You will be prompted to give access to your site’s Google Search Console profile, which means you need to sign in to the Gmail account that has access to your site’s Search Console profile.

Step 3

Once logged in you need to grant permissions for Google to access the data in your Search Console profile.

Step 4

Once you’ve granted all the respective permissions, you will get a completion notification and can then click on “Go to my Dashboard.”

Step 5

Once you’re in the dashboard you will see options to connect other services such as Analytics, AdSense and PageSpeed Insights. You can choose to connect these services if you like. If you go to the plugin’s settings you will see additional connection options for Optimize and Tag Manager.

Here is what the dashboard looks like with Search Console, Analytics and PageSpeed Insights enabled. You can see a clear breakdown of the respective metrics.

The plugin allows you to dive into each report, with navigation options on the left to drill down into Search Console and Analytics.

There is also an admin bar feature to see individual page stats.

In summary, this is a great plugin by Google but keep in mind it’s just version 1.0. I’m excited to see what features and integrations the later versions will have!


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


About The Author

Tony Edward is a director of SEO at Tinuiti and an adjunct instructor of search marketing at NYU. Tony has been in the online marketing industry for over 10 years. His background stems from affiliate marketing and he has experience in paid search, social media and video marketing. Tony is also the founder of the Thinking Crypto YouTube channel.


Bing Announces Link Penalties – Search Engine Journal


Roger Montti


Bing announced new link penalties focused on taking down private blog networks (PBNs), subdomain leasing and manipulative cross-site linking.

Inorganic Site Structure

An inorganic site structure is a linking pattern that uses internal site-level link signals (with subdomains) or cross-site linking patterns (with external domains) in order to manipulate search engine rankings.

While these spam techniques already existed, Bing introduced the term “inorganic site structure” to describe them.

Bing noted that sites legitimately create subdomains to keep different parts of the site separate, such as support.example.com. These are treated as belonging to the main domain, passing site-level signals to the subdomains.

Bing also said sites like WordPress create standalone sites under subdomains, in which case no site level signals are passed to the subdomains.

Examples of Inorganic Site Structure

One example of an inorganic site structure is when a company leases a subdomain in order to take advantage of the main domain’s site-level signals to rank better.

Private blog networks were also included as a form of inorganic site structure.

Domain Boundaries

Bing also introduced the idea of domain boundaries. The idea is that there are boundaries to a domain. Sometimes, as in the case of legitimate subdomains (ex. support.example.com), those boundaries extend out to the subdomain. In other cases like WordPress.com subdomains the boundaries do not extend to the subdomains.

Private Blog Networks (PBNs)
Bing called out PBNs as a form of spam that abuse website boundaries.

“While not all link networks misrepresent website boundaries, there are many cases where a single website is artificially split across many different domains, all cross-linking to one another, for the obvious purpose of rank boosting. This is particularly true of PBNs (private blog networks).”

Subdomain Leasing Penalties

Bing explained why they consider subdomain leasing a spammy activity:

“…we heard concerns from the SEO community around the growing practice of hosting third-party content or letting a third party operate a designated subdomain or subfolder, generally in exchange for compensation.

…the practice equates to buying ranking signals, which is not much different from buying links.”

At the time of this article, I still see a news site subdomain ranking in Bing (and Google). This page belongs to another company. All the links are redirected affiliate type links with parameters meant for tracking the referrals.

According to Archive.org the subdomain page was credited to an anonymous news staffer. Sometime in the summer the author was switched to someone with a name who is labeled as an expert, although the content is still the same.

So if Bing is already handing out penalties that means Bing (and Google who also ranks this page) still have some catching up to do.

Cross-Site Linking

Bing mentioned sites that are essentially one site broken up into multiple interlinking sites. Curiously, Bing said that these kinds of sites are already in violation of other link spam rules, but that additional penalties will apply.

Here’s the kind of link structure that Bing used as an example:

[Illustration of a spammy link network]

All these sites interlink with one another. All the sites have related content and, according to Bing, are essentially the same site. This kind of linking practice goes back many years; such sites are traditionally known as interlinked websites, and they are generally topically related to each other.

Bing used the above example to illustrate interlinked sites that are really just one site.

That link structure resembles the structure of interlinked websites that belong to the same company. If you’re planning a new web venture, it’s generally a better idea to create one comprehensive site than to create a multitude of sites each focused on just a small part of the niche.

Curiously, in reference to the above illustration, Bing said that kind of link structure was already in violation of link guidelines and that more penalties would be piled on top of those:

“Fig. 3 – All these domains are effectively the same website.
This kind of behavior is already in violation of our link policy.

Going forward, it will be also in violation of our ‘inorganic site structure’ policy and may receive additional penalties.”

Takeaway

It’s good news to hear that Bing is improving. Competition between search engines encourages innovation, and as Bing improves, search traffic may become more diversified as more people switch to Bing and other engines like DuckDuckGo.

Read Bing’s announcement: Some Thoughts on Website Boundaries


