For several years, many SEOs have been proclaiming the end of small business (SMB) websites. The theory is that third-party destinations (GMB, Facebook, Yelp, etc.) have taken over and that SMB sites now rarely see consumer visits, if at all. GMB is now so widely used and so complete, the argument goes, that consumers never need to visit the underlying SMB site.
Recent investments and M&A. That description of consumer behavior is partly, but not entirely, correct. Websites remain a critical SMB asset and content anchor, a fact underscored by WordPress parent Automattic’s most recent $300 million funding round (at a valuation above $3 billion) and Square’s roughly $365 million acquisition of site builder Weebly in April 2018.
On a smaller scale, ten-year-old web design platform Duda recently raised $25 million (bringing its total funding to just under $50 million). Duda has a network of more than 6,000 third-party resellers and agencies that work with SMBs. It plans to continue focusing on websites and presence management rather than expanding horizontally into other marketing channels.
New Yahoo web design service. In addition, late last week Verizon-owned Yahoo launched a new web design product for SMBs. There are two service tiers ($99 and $299 per month). The offering is a SaaS product that includes design consultation, ongoing maintenance and content updates.
Yahoo Small Business was at one time the premier hosting company for SMBs. During a long period of dormancy, it was surpassed by GoDaddy and others. But following Verizon’s $4+ billion acquisition of Yahoo in 2016, the company has sought to invest in new small business products and services and regain momentum. Its brand has remained relatively strong among SMBs across the U.S. despite the decline of Yahoo itself.
Now, Yahoo is developing a new generation of marketing products and services for SMBs. The web design service is just the first announcement.
SMB sites more trusted, still visited. A May 2019 consumer survey from BrightLocal found nearly twice as many respondents expected SMB websites to be accurate (56%) compared with Google My Business (32%). This was a surprise. In addition, a 2018 survey from the same firm found that the most common consumer action after reading a positive review, by a fairly significant margin, was to visit the SMB’s website.
Why we should care. The Small Business Administration says (.pdf) there are now roughly 30 million SMBs in the U.S. The SBA defines “small business” as having a headcount of up to 499 employees. There’s a massive difference between a firm with three or even 20 employees and one that has 300. Regardless, well over 90% of U.S. SMBs have fewer than 10 employees.
While a majority of SMBs now have websites — 64% according to a 2018 Clutch survey — there’s still a significant opportunity for website providers. New businesses form and fail every quarter. And even with shrinking reach in organic search and social, websites are likely to remain the anchor of SMB digital marketing for the foreseeable future.
About The Author
Greg Sterling is a Contributing Editor at Search Engine Land. He writes about the connections between digital and offline commerce. He previously held leadership roles at LSA, The Kelsey Group and TechTV. Follow him on Twitter or find him on LinkedIn.
The web is essential not only for people working in digital marketing, but for everyone. Professionals in this field need to understand the big picture of how the web functions for our daily work. We also know that optimizing our customers’ sites is not just about the sites themselves, but also about improving their presence on the web, which is connected to other sites by links.
To get an overall view of the web we need data, lots of it, and on a regular basis. Some organizations provide open data for this purpose. HTTP Archive collects and permanently stores the web’s digitized content and offers it as a public dataset. A second example is Common Crawl, an organization that crawls the web every month. Its web archive has been collecting petabytes of data since 2011. In its own words, “Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”
In this article, I present a quick analysis of Common Crawl’s recent public data and metrics to offer a glimpse into what’s happening on the web today.
This data analysis was performed on almost two billion edges among nearly 90 million hosts. For the purposes of this article, the term “edge” refers to a link: an edge from one host (domain) to another is counted only once, no matter how many links exist between the two hosts. Note also that a host’s PageRank depends on the number of hosts linking to it, not on the number of hosts it links to.
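The “counted only once” rule above can be sketched in a few lines of Python; the host names here are made up for illustration:

```python
# Collapse raw host-to-host links into unique edges: any number of links
# between the same ordered pair of hosts counts as a single edge.
# Host names below are invented for illustration.
raw_links = [
    ("example.com", "googleapis.com"),
    ("example.com", "googleapis.com"),  # duplicate link: same edge
    ("example.com", "blogspot.com"),
    ("blogspot.com", "example.com"),    # opposite direction: a distinct edge
]

edges = set(raw_links)  # deduplicate ordered (source, target) pairs
print(len(raw_links), "raw links ->", len(edges), "edges")
```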
There is also a dependency between the number of links a host gives and its number of subdomains. This is not a great surprise given that, of the nearly 90 million hosts, the one receiving links from the greatest number of hosts is “googleapis.com,” the one sending links to the greatest number of hosts is “blogspot.com,” and the one with the greatest number of hosts (subdomains) is “wordpress.com.”
The public Common Crawl data include crawls from May, June and July 2019.
The main data analysis is performed on the following three compressed Common Crawl files.
These two datasets are used for the additional data analysis concerning the top 50 U.S. sites.
The Common Crawl data provided in the three compressed files belong to its most recent domain-level graph. The “domain vertices” file contains 90 million nodes (naked domains); the “domain edges” file contains their two billion edges (links); and the “domain ranks” file contains the rankings of naked domains by PageRank and harmonic centrality.
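Reading these files can be sketched with Python’s csv module. The sample below is invented, and the exact column order is an assumption that should be verified against Common Crawl’s release notes; it only illustrates the parsing pattern:

```python
import csv
import io

# Invented tab-separated sample in the spirit of the "domain ranks" file.
# The column order here is an assumption for illustration only.
sample = (
    "com.googleapis\t1\t29977668.0\t1\t0.02084144\n"
    "com.blogspot\t2\t29000000.0\t3\t0.01\n"
)

rows = []
for rec in csv.reader(io.StringIO(sample), delimiter="\t"):
    host_rev, hc_pos, hc_val, pr_pos, pr_val = rec
    rows.append({
        "host_rev": host_rev,          # reversed naked domain
        "harmonicc_pos": int(hc_pos),  # harmonic centrality rank
        "harmonicc_val": float(hc_val),
        "pr_pos": int(pr_pos),         # PageRank rank
        "pr_val": float(pr_val),
    })

print(rows[0]["host_rev"])  # com.googleapis
```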
Harmonic centrality, like PageRank, is a centrality measure used to gauge the importance of nodes in a graph. Since 2017, Common Crawl has used harmonic centrality in its crawling strategy to prioritize by link analysis. In the “domain ranks” dataset, the domains are sorted by their harmonic centrality values, not their PageRank values. Although harmonic centrality doesn’t correlate with PageRank across the full dataset, it does correlate with PageRank among the top 50 U.S. sites. In a compelling video, “A Modern View of Centrality Measures,” Paolo Boldi compares PageRank and harmonic centrality on the Hollywood graph and argues that harmonic centrality selects top nodes better than PageRank.
Preview of Common Crawl “domain vertices” dataset
Preview of Common Crawl “domain edges” dataset
Preview of Common Crawl “domain ranks” dataset sorted by harmonic centrality
Preview of the final dataset obtained from the three main Common Crawl datasets (“domain vertices,” “domain edges” and “domain ranks”), sorted by PageRank
host_rev: Reversed host name, for example ‘google.com’ becomes ‘com.google’
n_in_hosts: Number of other hosts from which the host receives at least one link
n_out_hosts: Number of other hosts to which the host sends at least one link
harmonicc_pos: Harmonic centrality position of the host
harmonicc_val: Harmonic centrality value of the host
pr_pos: PageRank position of the host
pr_val: PageRank value of the host
n_hosts: Number of hosts (subdomains) belonging to the host
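The host_rev convention in the first field can be reproduced with a small helper; a minimal sketch:

```python
def reverse_host(host: str) -> str:
    """Reverse a host name label by label,
    e.g. 'finance.yahoo.com' -> 'com.yahoo.finance'."""
    return ".".join(reversed(host.split(".")))

print(reverse_host("google.com"))  # com.google
```

Applying the function twice returns the original host, which is handy when mapping results back to plain domain names.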
Statistics of Common Crawl final dataset
*link: counted as a link if there is at least one link from one host to another

Number of incoming hosts:
mean, min, max of n_in_hosts = 21.63548751, 0, 20081619
The reversed host receiving links from the maximum number of hosts is ‘com.googleapis’.

Number of outgoing hosts:
mean, min, max of n_out_hosts = 21.63548751, 0, 7813499
The reversed host sending links to the maximum number of hosts is ‘com.blogspot’.

PageRank:
mean, min, max of pr_val = 1.13303402e-08, 0.0, 0.02084144

Harmonic centrality:
mean, min, max of harmonicc_val = 10034682.46655859, 0.0, 29977668.0

Number of hosts (subdomains):
mean, min, max of n_hosts = 5.04617139, 1, 7034608
The reversed host having the maximum number of hosts (subdomains) is ‘com.wordpress’.
The correlation results show that the number of incoming hosts (n_in_hosts) is correlated with both PageRank value (pr_val) and the number of outgoing hosts (n_out_hosts); the former correlation is very strong, while the latter is weak. There is also a dependency between the number of outgoing hosts and the number of hosts (n_hosts), i.e., the subdomains of a host.
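The kind of pairwise correlation reported here can be computed with the plain Pearson formula. The sketch below uses invented toy columns, not the Common Crawl figures:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy columns: pr_val rises roughly in step with n_in_hosts,
# while n_out_hosts moves independently of both.
n_in_hosts = [10, 200, 3000, 45000, 600000]
pr_val = [1e-8, 2e-7, 4e-6, 9e-5, 1.2e-3]
n_out_hosts = [500, 20, 900, 80, 150]

print(pearson(n_in_hosts, pr_val))       # close to 1: very strong
print(pearson(n_in_hosts, n_out_hosts))  # much weaker
```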
Data visualization: Distribution of PageRank
The graph below plots the counts of pr_val values. It shows that the distribution of PageRank across almost 90 million hosts is highly right-skewed, meaning the majority of hosts have very low PageRank.
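A quick way to see what “highly right-skewed” implies: in such a distribution the mean sits well above the median, because a few enormous values pull it up. A sketch on a synthetic power-law sample (not the Common Crawl data):

```python
import random
import statistics

random.seed(42)  # deterministic toy sample

# A Pareto (power-law) draw: many tiny values, a few enormous ones,
# loosely mimicking the shape of the PageRank distribution.
sample = [random.paretovariate(1.5) for _ in range(10_000)]

mean = statistics.mean(sample)
median = statistics.median(sample)
print(mean > median)  # True: the long right tail pulls the mean up
```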
Distribution of the number of hosts
The following graph presents the plot of the count of n_hosts (subdomains) values. It shows that the distribution of the number of hosts (subdomains) across almost 90 million hosts is highly right-skewed, meaning the majority of hosts have few subdomains.
Distribution of the number of incoming hosts
The graph below presents the plot of the count of n_in_hosts (number of incoming hosts) values. It shows us that this distribution is right-skewed, too.
Distribution of number of outgoing hosts
The following graph shows the plot of the count of n_out_hosts (number of outgoing hosts) values. Again, this distribution is also right-skewed.
Distribution of harmonic centrality
The following graph presents the plot of the count of harmonicc_val values. It shows that the distribution of harmonicc_val across almost 90 million hosts is not highly right-skewed like the PageRank or number-of-hosts distributions. It is not perfectly Gaussian, but it is closer to Gaussian than those distributions, and it is multimodal.
Scatter plot of number of incoming hosts vs number of outgoing hosts
The graph below presents the scatter plot of n_in_hosts (x-axis) against n_out_hosts (y-axis). It shows that the numbers of incoming and outgoing hosts are not directly dependent on each other: as the number of hosts linking to a host increases, its outgoing links to other hosts do not increase. Hosts with few incoming hosts link out freely, while hosts with many incoming hosts are not as generous.
Scatter plot of number of incoming hosts vs. PageRank
The graph below presents the scatter plot of n_in_hosts (x-axis) against pr_val (y-axis). It shows a correlation between the number of incoming hosts and a host’s PageRank: the more hosts link to a host, the greater its PageRank value.
Scatter plot of number of outgoing hosts vs. PageRank
The graph below presents the scatter plot of n_out_hosts (x-axis) against pr_val (y-axis). It shows that the correlation observed between incoming hosts and PageRank does not exist between outgoing hosts and PageRank.
Scatter plot of PageRank and harmonic centrality
As the majority of hosts have low PageRank, we see a vertical line when we plot hosts’ PageRank against their harmonic centrality values. But we observe that hosts’ PageRank values begin to detach from the mass when their harmonic centrality approaches 1.5e7, and the detachment accelerates beyond that value.
Top 50 US sites
Top 50 U.S. sites data are selected from the final Common Crawl dataset obtained above. Their hosts are reversed to match the “host_rev” column in the Common Crawl final dataset; for example, “youtube.com” becomes “com.youtube.” Below is a preview of this selection. There are 49 sites instead of 50 because “finance.yahoo.com” doesn’t exist in the Common Crawl dataset, though “com.yahoo” does.
The Majestic Million public dataset is also imported. A preview of this file is below.
These two datasets, the top 50 U.S. sites with their Common Crawl data and metrics and the Majestic Million data, are merged. The refips and refsubnets values are summed by reversed host.
A preview of this final dataset is below.
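The merge-and-sum step can be sketched with plain dictionaries. All hosts and values below are invented; only the column names follow the article:

```python
def reverse_host(host):
    """'m.youtube.com' -> 'com.youtube.m'."""
    return ".".join(reversed(host.split(".")))

# Invented rows standing in for the Common Crawl final dataset,
# keyed by reversed naked domain.
cc_rows = {
    "com.youtube": {"pr_val": 0.0128, "n_in_hosts": 16537551},
    "org.wikipedia": {"pr_val": 0.0105, "n_in_hosts": 12000000},
}

# Invented rows standing in for the Majestic Million file; several hosts
# can map onto one naked domain, so refips/refsubnets get summed.
majestic_rows = [
    {"host": "youtube.com", "refips": 4000000, "refsubnets": 5000000},
    {"host": "m.youtube.com", "refips": 300000, "refsubnets": 350000},
    {"host": "wikipedia.org", "refips": 3500000, "refsubnets": 4200000},
]

merged = {}
for row in majestic_rows:
    # Collapse subdomains onto the reversed naked domain,
    # e.g. com.youtube.m -> com.youtube
    key = ".".join(reverse_host(row["host"]).split(".")[:2])
    if key in cc_rows:
        agg = merged.setdefault(
            key, dict(cc_rows[key], refips_sum=0, refsubnets_sum=0)
        )
        agg["refips_sum"] += row["refips"]
        agg["refsubnets_sum"] += row["refsubnets"]

print(merged["com.youtube"]["refips_sum"])  # 4300000
```

Taking the first two reversed labels as the naked domain is a simplification; real-world data needs a public-suffix-aware split (e.g. for .co.uk domains).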
Statistics of top 50 US sites final dataset
Number of incoming hosts:
mean, min, max of n_in_hosts = 1565724.63265306, 1015, 16537551

Number of outgoing hosts:
mean, min, max of n_out_hosts = 80812.70833333, 28, 2529655

PageRank:
mean, min, max of pr_val = 0.00105891, 9.73490741e-07, 0.01285745

Harmonic centrality:
mean, min, max of harmonicc_val = 18871331.16326531, 14605537.0, 27867704

Number of hosts (subdomains):
mean, min, max of n_hosts = 36426.79591837, 22, 1555402
From this dataset, which contains the top 50 U.S. sites’ Common Crawl data and Majestic Million data, a pairwise scatter plot of the metrics pr_val, n_in_hosts, n_out_hosts, harmonicc_val, refips_sum and refsubnets_sum is created; it can be seen below.
This pairwise scatter plot shows that the PageRank of the top 50 U.S. sites is somewhat correlated with all the metrics in this graph except the number of outgoing hosts (n_out_hosts).
The correlation heatmap of these metrics is also available below.
The data analysis of the top 50 U.S. sites shows a dependency between the number of incoming hosts and the referring IP address (refips) and referring subnet (refsubnets) metrics, where refsubnets counts the subdivisions of IP networks pointing to the target domain. Harmonic centrality is correlated with PageRank, the number of incoming hosts, and the refips and refsubnets of the hosts.
Across the almost 90 million host ranks and their two billion edges (an edge is counted only once, even if there are many links from a single host), there is a strong correlation between PageRank and the number of incoming edges to each host. However, we can’t say the same for the number of outgoing edges from hosts.
In this data analysis, we find a correlation between the number of subdomains and the number of outgoing edges from one host to other hosts. The distribution of PageRank on this web graph is highly right-skewed meaning the majority of the hosts have very low PageRank.
Ultimately, the main data analysis tells us that the majority of domains on the web have low PageRank, a low number of incoming and outgoing edges and a low number of subdomains. We know this because all of these features share the same highly right-skewed distribution.
PageRank is still a popular and well-known centrality measure. One reason for its continued success is that it performs well on exactly this kind of highly skewed distribution of edges across domains.
Common Crawl is an invaluable and neglected public data source for SEO. Its enormous datasets are technically not easy to access even though they are public. However, it provides a quarterly “domain ranks” file that is relatively easy to analyze compared to the raw monthly crawl data. Lacking the resources to crawl the web and calculate the centrality measures ourselves, we can take advantage of this extremely useful resource to analyze our customers’ websites and their competitors’ rankings, along with their connections on the web.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.
About The Author
Aysun Akarsu is a trilingual data scientist specialized in machine intelligence for digital marketing who wants to help companies make data-driven decisions to reach a broader, qualified audience. Aysun writes regularly about SEO data analysis on her blog, SearchDatalogy.
On October 3, Webmaster Trends Analyst John Mueller delivered an edition of #AskGoogleWebmasters describing how Google approaches H1 headings with regard to ranking. The explanation caused a bit of a stir.
What Google said
“Our systems aren’t too picky and we’ll try to work with the HTML as we find it — be it one H1 heading, multiple H1 headings or just styled pieces of text without semantic HTML at all,” Mueller said.
In other words, Mueller is saying Google’s systems don’t have to rely on specific headings structure to indicate the main focus of content on the page.
What’s the fuss?
Mueller’s answer would appear to counter a longstanding “best practice” to use and optimize a single H1 and subsequent headings on a page. This is even reflected in the weighting of +2 that headings were given in our own most recent Periodic Table of SEO Factors.
“This seems to directly contradict years of SEO advice I’ve been given by all the SEO experts,” Dr. John Grohol, founder of PsychCentral.com, tweeted, expressing a reaction shared by many. Others cited their own experiences of seeing how H1 implementations can affect organic visibility.
How headings are designed to be used
The hierarchy of headings communicates what the content on a page is about as well as how ideas are grouped, making it easy for users to navigate the page. Applying multiple H1s or skipping headings altogether can create a muddled page structure and make a page harder to read.
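A muddled structure is easy to detect programmatically. The sketch below uses Python’s stdlib HTML parser to extract a page’s heading outline and flag skipped levels; the markup is invented for illustration:

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for h1-h6 tags."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])

    def handle_data(self, data):
        if self._level is not None:
            self.headings.append((self._level, data.strip()))
            self._level = None

def skipped_levels(headings):
    """True if any heading jumps more than one level deeper than the last."""
    return any(b - a > 1 for (a, _), (b, _) in zip(headings, headings[1:]))

parser = HeadingOutline()
parser.feed("<h1>Title</h1><h2>Section</h2><h4>Oops, skipped h3</h4>")
print(parser.headings)
print(skipped_levels(parser.headings))  # True: h2 -> h4 skips a level
```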
Accessibility is also a significant reason to use headings, a point made even more salient now that the courts have ruled that websites fall under the Americans with Disabilities Act.
“Heading markup will allow assistive technologies to present the heading status of text to a user,” the World Wide Web Consortium’s Web Content Accessibility Guidelines (WCAG) explains. “A screen reader can recognize the code and announce the text as a heading with its level, beep or provide some other auditory indicator. Screen readers are also able to navigate heading markup which can be an effective way for screen reader users to more quickly find the content of interest. Assistive technologies that alter the authored visual display will also be able to provide an appropriate alternate visual display for headings that can be identified by heading markup.”
Joost de Valk, founder of the Yoast SEO WordPress plugin, noted that most WordPress themes are designed to have a single H1 heading just for post titles — “not for SEO (although that won’t hurt) but for decent accessibility.”
SEO consultant Alan Bleiweiss pointed to a WebAIM survey that found 69% of screen reader users use headings to navigate through a page and 52% find heading levels very useful.
Many SEOs are concerned that Google’s lack of emphasis on such standards, including rel=prev/next, may disincentivize site owners from implementing them, potentially making content harder to understand for users who depend on screen-reading technology, such as the visually impaired. Do that at your own risk.
H1s and SEO
“It is naive to think that Google completely ignores the H1 tag,” Hamlet Batista, CEO and founder of RankSense, told Search Engine Land.
“I’ve seen H1s used in place of title tags in the SERPs. So, it is a good idea to make the H1 the key topic of the page; in case this happens, you have a reasonably good headline,” Batista said, adding that having multiple H1s may provide less control of what text could appear in the search results if the H1 is used instead of the title.
Others said headings hiccups have hurt rankings.
In the comment above, which was left on Search Engine Roundtable’s coverage of the announcement, the commenter attributes the performance decline to an error that resulted in removal of H1s from his content.
You should still use proper headings
All John Mueller is saying is that Google can usually figure out what’s important on a page even when you’re not using headings or heading hierarchies. “It’s not a secret ranking push,” Mueller added in a follow-up. “A script sees the page, you’re highlighting some things as ‘important parts,’ so often we can use that a bit more relative to the rest. If you highlight nothing/everything, we’ll try to figure it out.”
As Mueller said at the end of the #AskGoogleWebmasters video, “When thinking about this topic, SEO shouldn’t be your primary objective. Instead, think about your users: if you have ways of making your content accessible to them, be it by using multiple H1 headings or other standard HTML constructs, then that’s not going to get in the way of your SEO efforts.”
About The Author
George Nguyen is an Associate Editor at Third Door Media. His background is in content marketing, journalism, and storytelling.
Google announced that the Chrome browser will begin blocking web pages with mixed content beginning in December 2019. Publishers are urged to check their websites to make sure no resources are being loaded over the insecure HTTP protocol.
What is Mixed Content
Mixed content is when a secure web page (loaded over HTTPS) also contains scripts, styles, images or other linked content that is served over the insecure HTTP protocol.
Mixed content presents a security risk for your site visitor as well as to your website.
According to Google’s developer page on mixed content:
“Mixed content degrades the security and user experience of your HTTPS site.
…Using these resources, an attacker can often take complete control over the page, not just the compromised resource.”
How Google Chrome Will Handle Mixed Content
Currently, Chrome loads pages with mixed content. Beginning in December 2019 with the introduction of Chrome 79, Google will do two things:
Chrome will automatically upgrade HTTP content to HTTPS if the resource exists on HTTPS.
Chrome will introduce a toggle that users can use to unblock insecure resources that Chrome is blocking.
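The automatic upgrade in the first item amounts to rewriting a subresource URL’s scheme when an HTTPS version exists. A simplified Python sketch (it only rewrites the scheme; Chrome also verifies that the resource actually loads over HTTPS):

```python
from urllib.parse import urlsplit, urlunsplit

def upgrade_to_https(url: str) -> str:
    """Rewrite an http:// URL to https://, leaving other schemes untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # Chrome additionally checks the resource exists over HTTPS;
        # this sketch only rewrites the scheme.
        return urlunsplit(("https",) + tuple(parts)[1:])
    return url

print(upgrade_to_https("http://example.com/script.js"))  # https://example.com/script.js
print(upgrade_to_https("https://example.com/app.css"))   # unchanged
```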
Although this isn’t full blocking, it might as well be, because users may back out of a site that displays a security warning.
This will be a bad experience for publishers and may lead to fewer sales, visitors and ad views.
Beginning in January 2020, Google will remove the unblocking option and begin blocking mixed-content web pages.
How to Check Your Site for Mixed Content
There are multiple ways to check your site for mixed content.
Online Mixed Content Scanner
JitBit SSL Checker
JitBit SSL Checker is a free online scanner that will scan up to 400 pages of your site.
Really Simple SSL
Really Simple SSL is a lightweight plugin that can handle the migration to SSL as well as check and fix mixed content.
SSL Insecure Content Fixer
If your WordPress site has already migrated to SSL, you can use the SSL Insecure Content Fixer WordPress plugin to scan your site, alert you to insecure resources and help you fix them.
Screaming Frog Crawl Software
Screaming Frog is a modestly priced crawling tool (£149.00/year) and a good option for crawling large sites. There’s a short learning curve, but it’s quite intuitive. Screaming Frog will find your mixed content but won’t fix it.
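For a scripted spot check, a short Python sketch using only the standard library can flag http:// subresource URLs in a page’s HTML. The markup below is invented, and a production check would also need to cover CSS, inline styles and dynamically loaded resources:

```python
from html.parser import HTMLParser

class MixedContentFinder(HTMLParser):
    """Collect http:// URLs found in attributes that trigger resource loads."""
    URL_ATTRS = {"src", "href", "data", "poster"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value and value.startswith("http://"):
                self.insecure.append((tag, value))

# Invented page: one secure image, two insecure subresources.
page = """
<img src="https://example.com/logo.png">
<script src="http://example.com/legacy.js"></script>
<link rel="stylesheet" href="http://example.com/old.css">
"""

finder = MixedContentFinder()
finder.feed(page)
for tag, url in finder.insecure:
    print(tag, url)
```

Note that an `<a href="http://…">` is a navigation, not mixed content; a production checker would treat anchors separately.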