How to Uncover Powerful Data Stories with Python



There are many emotional and powerful stories hidden in gobs of data just waiting to be found.

When these stories get told, they have the power to change careers, businesses, and whole groups of people.

Take Whirlpool, for example. They discovered a socio-economic problem that they could leverage with their brand.

They mined data to find a social cause to align with and discovered that every day 4,000 students drop out of school because they cannot afford to keep their clothes clean.

Whirlpool donated washers and dryers to the schools with the most at-risk children and tracked attendance.

The brand found that 90% of these students had improved attendance rates, and close to the same share of children had improved class participation. The campaign was so effective that it won a number of awards, including the Cannes Lions Grand Prix for Creative Data Collection and Research.

While big brands can afford to hire award-winning creative agencies that can produce campaigns like this one, for most small businesses, that is out of the question.

One way to get into the spotlight is to find powerful stories that are yet to be discovered because of the gap that exists between marketers and data scientists.

I introduced a simple framework for doing this, built around reframing already popular visualizations. The opportunity to reframe exists because marketers and developers operate in silos.

As a marketer, when you hand off a data project to a developer, the first thing they do is remove the context.

The developer’s job is to generalize. But, when you get their results back, you need to add the context back so you can personalize.

Without the user context, the developer is unable to ask the right questions that can lead to making strong emotional connections.

In this article, I’m going to walk you through one example to show you how you can come up with powerful visualizations and data stories by piggybacking on popular ones.

Here is our plan of action.

  • We are going to rebuild a popular data visualization from the subreddit Data is Beautiful.
  • We will collect data from public web pages (including some of it from moving charts).
  • We will reframe the visualization by asking different questions than the original author.

Our Reframed Visualization

This is what our reframed visualization looks like. It shows the best Disney rides ranked by how much fun they would be for different age groups.

This is the original one shared on Reddit. It shows the best Disney rides compared by how long they last and how long you need to wait in line.

Our Rebuilt Visualization

Our first step is to rebuild the original visualization shared in the subreddit. The data scientist shared the data sources he used, but not the code.

This gives us a great opportunity to learn how to scrape data and visualize it in Python.

I will share some code snippets as usual, but you can find all the code in this Google Colab notebook.

Extracting Our Source Data

The original visualization contains two datasets, one with the duration of the rides and another with their average wait time.

Let’s first collect the ride durations from this page

We are going to complete these steps to extract the ride durations:

  1. Use Google Chrome to get an HTML DOM element selector with the ride durations.
  2. Use requests-html to extract the elements from the source page.
  3. Use a simple regular expression for duration numbers.
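Step 3 above can be sketched with the standard library alone. This is a minimal illustration, not the code from the notebook: the ride names and text snippets are made up, and it assumes the scraped elements contain phrases such as "3 minutes" or "2.5 min".

```python
import re

# Assumption: durations appear as a number followed by "minute(s)" or "min(s)".
DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:minutes?|mins?)\b", re.IGNORECASE)


def parse_duration_minutes(text):
    """Return the first duration found in the text as a float, or None."""
    match = DURATION_RE.search(text)
    return float(match.group(1)) if match else None


# Hypothetical text pulled from the page elements (not real scraped data).
sample_elements = {
    "Ride A": "Duration: About 3 minutes",
    "Ride B": "Ride time 2.5 min",
    "Ride C": "Walk-through attraction",
}

durations = {
    ride: parse_duration_minutes(text) for ride, text in sample_elements.items()
}
```

Rides without a recognizable duration come back as `None`, which makes them easy to drop later when the datasets are combined.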

Next, we need to collect the average wait times from this page

This is a more challenging extraction because the data we want is in the moving charts.

We are going to complete these steps to extract the average wait times:

  1. Use requests-html to extract the JavaScript snippets from the source page.
  2. Use regular expressions to extract the data rows from the JavaScript code and also the ride name/title of the chart.
  3. Use a Jinja2 template to stitch together a custom JavaScript function that returns the values we extracted in step 2.
  4. Use Py_mini_racer to execute the custom JavaScript function and get the data in Python format.

In order to convert the JavaScript data embedded in the charts to Python, we are going to perform a clever trick.

We are going to stitch together JavaScript functions using fragments of the code we are scraping.

We will use delimiters to define which fragments we will extract and use a Jinja2 template to weave them together into a JavaScript function that runs correctly. The function will return a dictionary with the wait times of our rides.

We will execute such functions using an obscure library called Py_mini_racer. That library runs JavaScript code from Python, returning Python objects that we can use.

I tried to use PyV8, a Python wrapper for Google’s V8 engine, but couldn’t get it to work. It seems the project has been abandoned.
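To make the fragment-stitching idea concrete, here is a rough sketch that stays inside the standard library. It swaps `string.Template` in for Jinja2 and, instead of executing the stitched function with Py_mini_racer, reads the rows with `json.loads`, which only works when the scraped fragment happens to be valid JSON. The script snippet, the `addRows` and `title` delimiters, and the `getChartData` name are all assumptions for illustration; the real page may look different.

```python
import json
import re
from string import Template

# Hypothetical chart script, shaped like a typical charting snippet.
script = """
var data = new google.visualization.DataTable();
data.addRows([["10:00", 35], ["10:30", 40], ["11:00", 55]]);
var options = {title: 'Space Mountain', curveType: 'function'};
"""

# Delimiters that mark the fragments we want to lift out of the script.
ROWS_RE = re.compile(r"addRows\((\[.*?\])\);", re.DOTALL)
TITLE_RE = re.compile(r"title:\s*'([^']*)'")

rows_fragment = ROWS_RE.search(script).group(1)
title = TITLE_RE.search(script).group(1)

# Stitch the fragments into a small JavaScript function (as source text).
JS_TEMPLATE = Template(
    "function getChartData() {\n  return {title: $title, rows: $rows};\n}"
)
js_function = JS_TEMPLATE.substitute(title=json.dumps(title), rows=rows_fragment)

# Shortcut for this sample only: the rows fragment is valid JSON.
rows = json.loads(rows_fragment)
```

When the fragment contains JavaScript that is not valid JSON (single quotes, `new Date(...)` calls, trailing commas), the shortcut fails, and running `js_function` in a JavaScript engine, as the article does, becomes the more robust route.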

Now, we have the two datasets we need to produce our chart, but there is some processing we need to do first.

Processing Our Source Data

We need to combine the datasets we scraped, clean them up, calculate averages, and so on.

We are going to complete these steps:

  1. Split the extracted dataset into two Python dictionaries. One with the timestamps and one with the wait times per ride.
  2. Filter out rides with fewer than 64 data points to keep the same number of data rows per ride.
  3. Calculate the average wait time per ride.
  4. Combine average wait time per ride and ride duration into one data frame.
  5. Eliminate rows with empty columns.
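Steps 2 through 5 can be sketched as follows. The article works with a pandas data frame; this version uses plain dictionaries and `statistics.mean` so it runs without extra packages. The sample values and the `summarize` helper are hypothetical, and the demo lowers the threshold to 3 only to keep the data short.

```python
from statistics import mean

MIN_POINTS = 64  # threshold taken from step 2 above


def summarize(wait_times, durations, min_points=MIN_POINTS):
    """Combine average wait time and ride duration into one row per ride."""
    rows = []
    for ride, waits in wait_times.items():
        if len(waits) < min_points:
            continue  # step 2: drop rides with too few data points
        duration = durations.get(ride)
        if duration is None:
            continue  # step 5: drop rides with a missing duration
        rows.append({"ride": ride, "avg_wait": mean(waits), "duration": duration})
    return rows


# Hypothetical sample data, in minutes.
wait_times = {"Ride A": [30, 40, 50], "Ride B": [10, 20], "Ride C": [5, 5, 5]}
durations = {"Ride A": 3.0, "Ride B": 2.0}

table = summarize(wait_times, durations, min_points=3)
```

Here "Ride B" is dropped for having too few data points and "Ride C" for having no duration, leaving a single combined row for "Ride A".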

Here is what the final data frame looks like.

Visualizing Our Data

We are almost at the finish line. In this step, we get to do the fun part: visualizing the data frame we created.

We are going to complete these steps:

  1. Convert pandas data frame to a row-oriented dictionary. The X-axis is the Average Wait Time and the Y-axis is Ride Duration. The label is the Ride name.
  2. Use Plotly to generate a labeled scatter plot.

You need to manually drag the labels around to make them more legible.
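As a sketch of the two steps above: Plotly figures can be described as plain dictionaries, so the example below builds a labeled scatter specification from row-oriented records without importing anything. The sample rows and the `scatter_spec` helper are hypothetical; rendering the result would mean handing the dictionary to Plotly (for example, `plotly.graph_objects.Figure(spec)`), which is not shown here.

```python
# Hypothetical row-oriented records, one dictionary per ride (minutes).
rows = [
    {"ride": "Ride A", "avg_wait": 40.0, "duration": 3.0},
    {"ride": "Ride B", "avg_wait": 15.0, "duration": 2.5},
]


def scatter_spec(rows):
    """Build a Plotly-style figure dictionary for a labeled scatter plot."""
    return {
        "data": [
            {
                "type": "scatter",
                "mode": "markers+text",  # draw a point and its label
                "x": [r["avg_wait"] for r in rows],
                "y": [r["duration"] for r in rows],
                "text": [r["ride"] for r in rows],
                "textposition": "top center",
            }
        ],
        "layout": {
            "xaxis": {"title": {"text": "Average Wait Time (minutes)"}},
            "yaxis": {"title": {"text": "Ride Duration (minutes)"}},
        },
    }


spec = scatter_spec(rows)
```

Keeping the specification as data makes it easy to inspect which label belongs to which point before any manual tidying of overlapping labels.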

We finally have a visualization that closely resembles the original one we found on Reddit.

In our final step, we will produce an original visualization built from the same data we collected for this one.

Reframing Our Data

Rebuilding the original visualization took serious work and we are not producing anything new. We will address that in this final section.

The original visualization lacked an emotional hook. What if the rides are not fun for me?

We will pull an additional dataset: the ratings per ride by different age groups. This will help us visualize not just the best rides with the shortest wait times, but also which ones would be more fun for a particular age group.

We are going to complete these steps to reframe the original visualization:

  1. We want to know which age groups will have the most fun per ride.
  2. We will fetch the average ride ratings per age group from
  3. We will calculate an “Enjoyment Score” per ride and age group, which is the number of minutes per ride divided by average minutes of wait time.
  4. We will use Plotly to display a bar chart with the results.

This is the page with our extra data.

We scrape it just like we pulled the ride durations.

Let’s summarize the original data frame using a new metric: an Enjoyment Score. 🙂

We define it as the ride duration divided by the average wait time. The bigger the number, the more fun we should have, since we wait less in line for each minute on the ride.
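The calculation itself is a single division. Here is a minimal sketch with made-up numbers; the `enjoyment_score` helper and the sample rides are hypothetical, and the age-group ratings are left out because the article does not spell out how they enter the score.

```python
def enjoyment_score(duration_minutes, avg_wait_minutes):
    """Minutes on the ride per minute spent waiting; None if wait is missing or zero."""
    if not avg_wait_minutes:
        return None
    return duration_minutes / avg_wait_minutes


# Hypothetical rows, in minutes.
rides = [
    {"ride": "Ride A", "duration": 3.0, "avg_wait": 40.0},
    {"ride": "Ride B", "duration": 2.5, "avg_wait": 10.0},
]

scored = [
    {**r, "enjoyment": enjoyment_score(r["duration"], r["avg_wait"])} for r in rides
]
# Keep only rides with a usable score, best first.
scored = sorted(
    (r for r in scored if r["enjoyment"] is not None),
    key=lambda r: r["enjoyment"],
    reverse=True,
)
```

With these numbers "Ride B" ranks first: it is the shorter ride, but it returns more ride time per minute spent in line.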

This is what the updated data frame looks like with our new Enjoyment Score metric.

Now, let’s visualize it.

Finally, we get this beautiful and super valuable visualization.

Resources & Community Projects

Last January, I received an email that kickstarted my “Python crusade”. Braintree had rejected RankSense’s application for a merchant account because they saw SEO as a high-risk category.

Right next to fortune tellers, mail-order brides and “get rich quick” schemes!

We had worked on the integration for three weeks. I felt really mad and embarrassed.

I had been enjoying my time in the data science and AI community last year. I was learning a lot of cool stuff and having fun.

I’ve been in the SEO space for probably too long. Sadly, my generation made the big mistake of letting speculation and magic tricks rule the perception of what SEO is about.

As a result of this, too many businesses have fallen prey to charlatans.

I had the choice to leave the SEO community or try to encourage the new generation to drive change so our community could be a fun and proud place to be.

I decided to stay, but I was afraid that trying to drive change by myself with minimal social presence would be impossible.

Fortunately, I watched this powerful video, wrote this sort of manifesto, and put my head down to write practical Python articles every month.

I’m excited to see that in less than six months, Python is everywhere in the SEO community and the momentum keeps growing.

I’m really excited about our community and the brilliant future ahead.

Now, let me continue to bring light to the awesome projects we continue to churn out each month. So exciting to see more people joining the Python bandwagon. 🐍 🔥

Tyler shared a project to auto-generate meta descriptions using a Text Rank summarizer.

Hugo shared his first script that automates exporting SEMrush reports.

Jeffrey is working on an AI tool to break the writer’s block and open-sourced his Python backend.

Charly is working on a URL translator and classifier.

Image Credits

All screenshots taken by author, October 2019
In-post images: Provided by author


Restaurant app Tobiko goes old school by shunning user reviews



You can think of Tobiko as a kind of anti-Yelp. Launched in 2018 by Rich Skrenta, the restaurant app relies on data and expert reviews (rather than user reviews) to deliver a kind of curated, foodie-insider experience.

A new Rich Skrenta project. Skrenta is a search veteran with several startups behind him. He was one of the founders of DMOZ, a pioneering web directory that was widely used. Most recently Skrenta was the CEO of human-aided search engine Blekko, whose technology was sold to IBM Watson in roughly 2015.

At the highest level, both DMOZ and Blekko sought to combine human editors and search technology. Tobiko is similar; it uses machine learning, crawling and third-party editorial content to offer restaurant recommendations.

Tobiko screenshots

Betting on expert opinion. Tobiko is also seeking to build a community, and user input will likely factor into recommendations at some point. However, what’s interesting is that Skrenta has shunned user reviews in favor of “trusted expert reviews” (read: critics).

Those expert reviews are represented by a range of publisher logos on profile pages that, when clicked, take the user to reviews or articles about the particular restaurant on those sites. Where available, users can also book reservations. And the app can be personalized by engaging a menu of preferences. (Yelp recently launched broad, site-wide personalization itself.)

While Skrenta is taking something of a philosophical stand in avoiding user reviews, his approach also made the app easier to launch because expert content on third-party sites already existed. Community content takes much longer to reach critical mass. However, Tobiko also could have presented or “summarized” user reviews from third-party sites as Google does in knowledge panels, with TripAdvisor or Facebook for example.

Tobiko is free and currently appears to have no ads. The company also offers a subscription-based option that has additional features.

Why we should care. It’s too early to tell whether Tobiko will succeed, but it provocatively bucks conventional wisdom about the importance of user reviews in the restaurant vertical (although reading lots of expert reviews can be burdensome). As they have gained importance, reviews have become somewhat less reliable, with review fraud on the rise. Last month, Google disclosed an algorithm change that has resulted in a sharp decrease in rich review results showing in Search.

Putting aside gamesmanship and fraud, reviews have brought transparency to online shopping but can also make purchase decisions more time-consuming. It would be inaccurate to say there’s widespread “review fatigue,” but there’s anecdotal evidence supporting the simplicity of expert reviews in some cases. Influencer marketing can be seen as an interesting hybrid between user and expert reviews, though it’s also susceptible to manipulation.

About The Author

Greg Sterling is a Contributing Editor at Search Engine Land. He writes about the connections between digital and offline commerce. He previously held leadership roles at LSA, The Kelsey Group and TechTV. Follow him on Twitter or find him on LinkedIn.


3 Ways to Use XPaths with Large Site Audits



When used creatively, XPaths can help improve the efficiency of auditing large websites. Consider this another tool in your SEO toolbelt.

There are endless types of information you can unlock with XPaths, which can be used in any category of online business.

Some popular ways to audit large sites with XPaths include:

  • Building redirect maps
  • Auditing ecommerce sites
  • Auditing blogs

In this guide, we’ll cover exactly how to perform these audits in detail.

What Are XPaths?

Simply put, XPath is a syntax that uses path expressions to navigate XML documents and identify specified elements.

This is used to find the exact location of any element on a page using the HTML DOM structure.

We can use XPaths to help extract bits of information such as H1 page titles, product descriptions on ecommerce sites, or really anything that’s available on a page.

While this may sound complex to many people, in practice, it’s actually quite easy!

How to Use XPaths in Screaming Frog

In this guide, we’ll be using Screaming Frog to scrape webpages.

Screaming Frog offers custom extraction methods, such as CSS selectors and XPaths.

It’s entirely possible to use other means to scrape webpages, such as Python. However, the Screaming Frog method requires far less coding knowledge.

(Note: I’m not in any way currently affiliated with Screaming Frog, but I highly recommend their software for web scraping.)

Step 1: Identify Your Data Point

Figure out what data point you want to extract.

For example, let’s pretend Search Engine Journal didn’t have author pages and you wanted to extract the author name for each article.

What you’ll do is:

  • Right-click on the author name.
  • Select Inspect.
  • In the dev tools elements panel, you will see your element already highlighted.
  • Right-click the highlighted HTML element and go to Copy and select Copy XPath.

At this point, your computer’s clipboard will have the desired XPath copied.
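For readers who prefer the Python route mentioned earlier, the same idea can be sketched with the standard library. Note the assumptions: the markup and the `author-name` class are invented for illustration, and `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so real-world pages usually call for a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed article markup (not a real page).
page = """<html><body>
<article>
  <h1>Sample Post</h1>
  <div class="byline"><a class="author-name" href="/author/jane">Jane Doe</a></div>
</article>
</body></html>"""

root = ET.fromstring(page)

# Path expression: any <a> element, at any depth, with class="author-name".
author = root.find(".//a[@class='author-name']")
author_name = author.text if author is not None else None
```

An XPath copied from the browser is usually an absolute position in the page; an attribute-based expression like the one above tends to survive small layout changes better.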

Step 2: Set up Custom Extraction

In this step, you will need to open Screaming Frog and set up the website you want to crawl. In this instance, I would enter the full Search Engine Journal URL.

  • Go to Configuration > Custom > Extraction

  • This will bring up the Custom Extraction configuration window. There are a lot of options here, but if you’re looking to simply extract text, match your configuration to the screenshot below.

Step 3: Run Crawl & Export

At this point, you should be all set to run your crawl. You’ll notice that your custom extraction is the second to last column on the right.

When analyzing crawls in bulk, it makes sense to export your crawl into an Excel format. This will allow you to apply a variety of filters, pivot tables, charts, and anything your heart desires.

3 Creative Ways XPaths Help Scale Your Audits

Now that we know how to run an XPath crawl, the possibilities are endless!

We have access to all of the answers, now we just need to find the right questions.

  • What are some aspects of your audit that could be automated?
  • Are there common elements in your content silos that can be extracted for auditing?
  • What are the most important elements on your pages?

The exact problems you’re trying to solve may vary by industry or site type. Below are some unique situations where XPaths can make your SEO life easier.

1. Using XPaths with Redirect Maps

Recently, I had to redesign a site that required a new URL structure. The former pages all had parameters as the URL slug instead of the page name.

This made creating a redirect map for hundreds of pages a complete nightmare!

So I thought to myself, “How can I easily identify each page at scale?”

After analyzing the various page templates, I came to the conclusion that the actual title of the page looked like an H1 but was actually just large paragraph text. This meant that I couldn’t just get the standard H1 data from Screaming Frog.

However, XPaths would allow me to copy the exact location for each page title and extract it in my web scraping report.

In this case I was able to extract the page title for all of the old URLs and match them with the new URLs through the VLOOKUP function in Excel. This automated most of the redirect map work for me.

With any automated work, you may have to perform some spot checking for accuracy.
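The title-matching step can also be sketched in Python as a rough equivalent of the VLOOKUP approach. The URLs, titles, and helper names below are hypothetical, and normalizing case and whitespace before matching is an added assumption.

```python
def normalize(title):
    """Collapse whitespace and ignore case so near-identical titles match."""
    return " ".join(title.split()).casefold()


def build_redirect_map(old_titles, new_titles):
    """Match old URLs to new URLs by page title; report anything unmatched."""
    new_by_title = {normalize(title): url for url, title in new_titles.items()}
    redirects, unmatched = {}, []
    for old_url, title in old_titles.items():
        new_url = new_by_title.get(normalize(title))
        if new_url is None:
            unmatched.append(old_url)  # needs a manual decision
        else:
            redirects[old_url] = new_url
    return redirects, unmatched


# Hypothetical crawl exports: URL -> extracted page title.
old_titles = {
    "/page?id=12": "About Us",
    "/page?id=13": "Contact  us",
    "/page?id=14": "Legacy Promo",
}
new_titles = {"/about-us": "About Us", "/contact-us": "Contact Us"}

redirects, unmatched = build_redirect_map(old_titles, new_titles)
```

The `unmatched` list is the natural starting point for the spot checking mentioned above.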

2. Auditing Ecommerce Sites with XPaths

Auditing Ecommerce sites can be one of the more challenging types of SEO auditing. There are many more factors to consider, such as JavaScript rendering and other dynamic elements.

Sometimes, stakeholders will need product level audits on an ad hoc basis. Sometimes this covers just categories of products, but sometimes it may be the entire site.

Using the XPath extraction method we learned earlier in this article, we can extract all types of data including:

  • Product name
  • Product description
  • Price
  • Review data
  • Image URLs
  • Product Category
  • And much more

This can help identify products that may be lacking valuable information within your ecommerce site.

The cool thing about Screaming Frog is that you can extract multiple data points to stretch your audits even further.
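Once several data points are extracted per product, spotting the gaps is a simple check over the export. This is a small sketch with invented rows; the field names and the `missing_fields` helper are assumptions rather than Screaming Frog column names.

```python
# Assumed set of fields every product page should have.
REQUIRED_FIELDS = ("name", "description", "price", "image_url")


def missing_fields(row, required=REQUIRED_FIELDS):
    """List the required fields that are absent or blank in a crawl row."""
    return [field for field in required if not (row.get(field) or "").strip()]


# Hypothetical crawl export rows.
crawl_rows = [
    {"url": "/p/1", "name": "Blue Mug", "description": "Ceramic mug",
     "price": "12.00", "image_url": "/img/1.jpg"},
    {"url": "/p/2", "name": "Red Mug", "description": "",
     "price": "12.00", "image_url": None},
]

gaps = {
    row["url"]: missing_fields(row) for row in crawl_rows if missing_fields(row)
}
```

Only products with at least one missing field end up in `gaps`, which keeps the follow-up list short on large catalogs.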

3. Auditing Blogs with XPaths

This is a more common method for using XPaths. Screaming Frog allows you to set parameters to crawl specific subfolders of sites, such as blogs.

However, using XPaths, we can go beyond simple meta data and grab valuable insights to help identify content gap opportunities.

Categories & Tags

One of the most common ways SEO professionals use XPaths for blog auditing is scraping categories and tags.

This is important because it helps us group related blogs together, which can help us identify content cannibalization and gaps.

This is typically the first step in any blog audit.


This step is a bit more Excel-focused and advanced. Here is how it works: you set up an XPath extraction to pull the body copy out of each blog.

Fair warning, this may drastically increase your crawl time.

Whenever you export this crawl into Excel, you will get all of the body text in one cell. I highly recommend that you disable text wrapping, or your spreadsheet will look terrifying.

Next, in the column to the right of your extracted body copy, enter the following formula:


In this formula, A1 equals the cell of the body copy.

To scale your efforts, you can have your “keyword” equal the cell that contains your category or tag. However, you may consider adding multiple columns of keywords to get a more accurate and robust picture of your blogging performance.

This formula will present a TRUE/FALSE Boolean value. You can use this to quickly identify keyword opportunities and cannibalization in your blogs.
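The same kind of TRUE/FALSE flag can be produced in Python with a case-insensitive substring check. This is only a sketch: the posts are made up, and using each post's tag as its keyword follows the scaling suggestion above.

```python
def mentions_keyword(body, keyword):
    """True when the keyword appears anywhere in the body, ignoring case."""
    return keyword.casefold() in body.casefold()


# Hypothetical export: one row per blog post with its tag and body copy.
posts = [
    {"url": "/blog/a", "tag": "Link Building",
     "body": "Our guide to link building for beginners."},
    {"url": "/blog/b", "tag": "Local SEO",
     "body": "Ten tips for writing better title tags."},
]

flags = {post["url"]: mentions_keyword(post["body"], post["tag"]) for post in posts}
```

A post flagged `False` for its own tag is a candidate content gap, while many posts flagged `True` for the same keyword hint at cannibalization.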


We’ve already covered this example, but it’s worth noting that this is still an important element to pull from your articles.

When you blend your blog export data with performance data from Google Analytics and Search Console, you can start to determine which authors generate the best performance.

To do this, sort your blogs by author and start tracking average data sets including:

  • Impressions – Search Console
  • Clicks – Search Console
  • Sessions – Analytics
  • Bounce Rate – Analytics
  • Conversions – Analytics
  • Assisted Conversions – Analytics
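Averaging per author can be sketched like this, assuming the crawl export has already been joined with the performance data. The rows, metric names, and the `average_by_author` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical blended rows: one per post, after joining crawl and analytics data.
posts = [
    {"author": "Author A", "clicks": 120, "sessions": 300},
    {"author": "Author A", "clicks": 80, "sessions": 100},
    {"author": "Author B", "clicks": 50, "sessions": 90},
]


def average_by_author(posts, metrics):
    """Group posts by author and average each metric."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["author"]].append(post)
    return {
        author: {metric: mean(p[metric] for p in items) for metric in metrics}
        for author, items in grouped.items()
    }


author_stats = average_by_author(posts, ["clicks", "sessions"])
```

Averages from a small number of posts can mislead, so it is worth keeping the post count per author alongside these figures.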

Share Your Creative XPath Tips

Do you have some creative auditing methods that involve XPaths? Share this article on Twitter or tag me @seocounseling and let me know what I missed!

Image Credits

All screenshots taken by author, October 2019


When parsing ‘Googlespeak’ is a distraction



Over almost 16 years of covering search, and specifically what Googlers have said about SEO and ranking topics, I have seen my share of contradictory statements. Google’s ranking algorithms are complex, and the way one Googler explains something might sound contradictory to how another Googler talks about it. In reality, they are typically talking about different things or nuances.

Some of it is semantics, some of it is being literal in how one person might explain something while another person speaks figuratively. Some of it is being technically correct versus trying to dumb something down for general practitioners or even non-search marketers to understand. Some of it is that the algorithm can change over the years, so what was true then has evolved.

Does it matter if something is or is not a ranking factor? It can be easy to get wrapped up in details that end up being distractions. Ultimately, SEOs, webmasters, site owners, publishers and those that produce web pages need to care more about providing the best possible web site and web page for the topic. You do not want to chase algorithms and race after what is or is not a ranking factor. Google’s stated aim is to rank the most relevant results to keep users happy and coming back to the search engine. How Google does that changes over time. It releases core updates, smaller algorithm updates, index updates and more all the time.

For SEOs, the goal is to make sure your pages offer the most authoritative and relevant content for the given query and can be accessed by search crawlers.

When it is and is not a ranking factor. An example of Googlers seeming to contradict themselves popped up this week.

Gary Illyes from Google said at Pubcon Thursday that content accuracy is a ranking factor. That raised eyebrows because in the past Google has seemed to say content accuracy is not a ranking factor. Last month Google’s Danny Sullivan said, “Machines can’t tell the ‘accuracy’ of content. Our systems rely instead on signals we find align with relevancy of topic and authority.” One could interpret that to mean that if Google cannot tell the accuracy of content, it would be unable to use accuracy as a ranking factor.

Upon a closer look at the context of Illyes’ comments this week, it’s clear he’s getting at the second part of Sullivan’s comment about using signals to understand “relevancy of topic and authority.” SEO Marie Haynes captured more of the context of Illyes’ comment.

Illyes was talking about YMYL (your money, your life) content. He added that Google goes through “great lengths to surface reputable and trustworthy sources.”

He didn’t outright say Google’s systems are able to tell if a piece of content is factually accurate or not. He implied Google uses multiple signals, like signals that determine reputations and trustworthiness, as a way to infer accuracy.

So is content accuracy a ranking factor? Yes and no. It depends if you are being technical, literal, figurative or explanatory. When I covered the different messaging around content accuracy on my personal site, Sullivan pointed out the difference. He said on Twitter, “We don’t know if content is accurate” but “we do look for signals we believe align with that.”

It’s the same with whether there is an E-A-T score. Illyes said there is no E-A-T score. That is correct, technically. But Google has numerous algorithms and ranking signals it uses to figure out E-A-T as an overall theme. Sullivan said on Twitter, “Is E-A-T a ranking factor? Not if you mean there’s some technical thing like with speed that we can measure directly. We do use a variety of signals as a proxy to tell if content seems to match E-A-T as humans would assess it. In that regard, yeah, it’s a ranking factor.”

You can see the dual point Sullivan is making here.

The minutiae. When you have people like me, who for almost 16 years have analyzed and scrutinized every word, tweet, blog post or video that Google produces, it can be hard for a Google representative to always convey the exact clear message at every point. Sometimes it is important to step back, look at the bigger picture, and ask yourself, Why is this Googler saying this or not saying that?

Why we should care. It is important to look at long term goals, and as I said above, not chase the algorithm or specific ranking factors but focus on the ultimate goals of your business (money). Produce content and web pages that Google would be proud to rank at the top of the results for a given query and other sites will want to source and link to. And above all, do whatever you can to make the best possible site for users — beyond what your competitors produce.

About The Author

Barry Schwartz is Search Engine Land’s News Editor and owns RustyBrick, a NY based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on SEM topics.


Copyright © 2019 Plolu.