My original plan was to cover this topic: “How to Build a Bot to Automate your Mindless Tasks using Python and BigQuery”. I made some slight course changes, but hopefully, the original intention remains the same!
The inspiration for this article comes from this tweet from JR Oakes. 🙂
I think I just have the inspiration I was looking for for my next @sejournal column “How to build a bot to automate your mindless tasks using #python and @bigquery” 🤓 Thanks JR! 🍺 https://t.co/dQqULIH2p2
As Uber released an updated version of Ludwig and Google also announced the ability to execute Tensorflow models in BigQuery, I thought the timing couldn’t be better.
In this article, we will revisit the intent classification problem I addressed before, but we will replace our original encoder with a state-of-the-art one: BERT, which stands for Bidirectional Encoder Representations from Transformers.
This small change will help us improve the model accuracy from a 0.66 combined test accuracy to 0.89 while using the same dataset and no custom coding!
Here is our plan of action:
We will rebuild the intent classification model we built in part one, but we will leverage pre-training data using a BERT encoder.
We will test it again against the questions we pulled from Google Search Console.
We will upload our queries and intent predictions data to BigQuery.
We will connect BigQuery to Google Data Studio to group the questions by their intention and extract actionable insights we can use to prioritize content development efforts.
We will go over the new underlying concepts that help BERT perform significantly better than our previous model.
Setting up Google Colaboratory
As in part one, we will run Ludwig from within Google Colaboratory in order to use their free GPU runtime.
First, run this code to check the Tensorflow version installed.
import tensorflow as tf; print(tf.__version__)
Let’s make sure our notebook uses the right version expected by Ludwig and that it also supports the GPU runtime.
I get 1.14.0, which is great, as Ludwig requires at least 1.14.0.
Under the Runtime menu item, select Python 3 and GPU.
You can confirm you have a GPU by typing:
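One way to run this check from a notebook cell, as a minimal sketch: it assumes the runtime puts NVIDIA's nvidia-smi tool on the PATH when a GPU is attached, and the function name gpu_summary is made up for illustration.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the nvidia-smi report as text, or None if no GPU tooling is found."""
    # Assumption: a GPU runtime ships the nvidia-smi command-line tool.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary else "No GPU detected; check the Runtime settings.")
```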
At the time of this writing, you need to install some system libraries before installing the latest Ludwig (0.2). I got some errors that they later resolved.
When the installation failed for me, I found the solution from this StackOverflow answer, which wasn’t even the accepted one!
!pip install ludwig
You should get:
Successfully installed gmpy-1.17 ludwig-0.2
Prepare the Dataset for Training
We are going to use the same question classification dataset that we used in the first article.
After you log in to Kaggle and download the dataset, you can load it into a dataframe in Colab.
Configuring the BERT Encoder
Instead of using the parallel CNN encoder that we used in the first part, we will use the BERT encoder that was recently added to Ludwig.
This encoder leverages pre-trained data that enables it to perform better than our previous encoder while requiring far less training data. I will explain how it works in simple terms at the end of this article.
Let’s first download a pretrained language model. We will download the files for the model BERT-Base, Uncased.
I tried the bigger models first, but hit some roadblocks due to their memory requirements and the limitations in Google Colab.
Now we can put together the model definition file.
Let’s compare it to the one we created in part one.
I made a number of changes. Let’s review them.
I essentially changed the encoder from parallel_cnn to bert and added extra parameters required by bert: config_path, checkpoint_path, word_tokenizer, word_vocab_file, padding_symbol, and unknown_symbol.
Most of the values come from the language model we downloaded.
I added a few more parameters that I figured out empirically: batch_size, learning_rate and word_sequence_length_limit.
The default values Ludwig uses for these parameters don’t work for the BERT encoder because they are way off compared to the pre-trained data. I found some working values in the BERT documentation.
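To make the list of parameters concrete, here is a rough sketch of what such a model definition could look like, written as a Python dictionary. Everything beyond the parameter names mentioned above is an assumption: the column names (Question, Category0, Category2), the directory and file names of the unpacked BERT-Base, Uncased download, the numeric values, and where each key sits in the definition. Check them against the Ludwig version you install.

```python
import json

# Assumed folder name of the unpacked BERT-Base, Uncased download.
BERT_DIR = "uncased_L-12_H-768_A-12"

model_definition = {
    "input_features": [
        {
            "name": "Question",  # assumed name of the text column
            "type": "text",
            "encoder": "bert",
            "config_path": f"{BERT_DIR}/bert_config.json",
            "checkpoint_path": f"{BERT_DIR}/bert_model.ckpt",
            "word_tokenizer": "bert",
            "word_vocab_file": f"{BERT_DIR}/vocab.txt",
            "padding_symbol": "[PAD]",
            "unknown_symbol": "[UNK]",
        }
    ],
    "output_features": [
        {"name": "Category0", "type": "category"},  # assumed label columns
        {"name": "Category2", "type": "category"},
    ],
    # Illustrative values in the range commonly used for BERT fine tuning.
    "preprocessing": {"text": {"word_sequence_length_limit": 128}},
    "training": {"batch_size": 32, "learning_rate": 0.00002},
}

# JSON is close enough to YAML that the printed text can be saved as a definition file.
print(json.dumps(model_definition, indent=2))
```

Keeping the paths in one BERT_DIR constant makes it easier to swap in a different pretrained model later.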
The training process is the same as we’ve done previously. However, we need to install bert-tensorflow first.
We beat our previous model performance after only two epochs.
The final result was a 0.89 combined test accuracy after 10 epochs. Our previous model took 14 epochs to get to 0.66.
This is pretty remarkable considering we didn’t write any code. We only changed some settings.
It is incredible and exciting how fast deep learning research is improving and how accessible it is now.
Why BERT Performs So Well
There are two primary advantages from using BERT compared to traditional encoders:
The bidirectional word embeddings.
The language model leveraged through transfer learning.
Bidirectional Word Embeddings
When I explained word vectors and embeddings in part one, I was referring to the traditional approach (I used a GPS analogy of coordinates in an imaginary space).
Traditional word embedding approaches assign the equivalent of a GPS coordinate to each word.
Let’s review the different meanings of the word “Washington” to illustrate why this could be a problem in some scenarios.
George Washington (person)
Washington D.C. (City)
George Washington Bridge (bridge)
The word “Washington” above represents completely different things, and a system that assigns the same coordinates regardless of context won’t be very precise.
If we are in Google’s NYC office and we want to visit “Washington”, we need to provide more context.
Are we planning to visit the George Washington memorial?
Do we plan to drive south to visit Washington, D.C.?
Are we planning a cross country trip to Washington State?
As you can see in the text, the surrounding words provide some context that can more clearly define what “Washington” might mean.
If you read from left to right, the word George might indicate you are talking about the person, and if you read from right to left, the word D.C. might indicate you are referring to the city.
But, you need to read from left to right and from right to left to tell you actually want to visit the bridge.
BERT works by encoding different word embeddings for each word usage, and relies on the surrounding words to accomplish this. It reads the context words bidirectionally (from left to right and from right to left).
Back to our GPS analogy, imagine an NYC block with two Starbucks coffee shops on the same street. If you want to get to a specific one, it would be much easier to refer to it by the businesses that come before and/or after it.
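The “Washington” example can be turned into a toy sketch. This is not how BERT works internally (BERT learns context-dependent vectors rather than following hand-written rules); it only illustrates why looking at both the left and the right neighbor matters. The function name and the rules are made up for illustration.

```python
def washington_sense(tokens):
    """Guess what "Washington" refers to from the words on either side of it."""
    i = tokens.index("Washington")
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i + 1 < len(tokens) else None
    # The bridge can only be recognized by reading in both directions.
    if left == "George" and right == "Bridge":
        return "bridge"
    if left == "George":
        return "person"
    if right == "D.C.":
        return "city"
    return "unknown"


for phrase in ("George Washington", "Washington D.C.", "George Washington Bridge"):
    print(phrase, "->", washington_sense(phrase.split()))
```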
Transfer learning is probably one of the most important concepts in deep learning today. It makes many applications practical even when you have very small datasets to train on.
Traditionally, transfer learning was primarily used in computer vision tasks.
You typically have research groups from big companies and universities (Google, Facebook, Stanford, etc.) train an image classification model on a large dataset like ImageNet.
This process would take days and generally be very expensive. But, once the training is done, the final part of the trained model is replaced, and retrained on new data to perform similar but new tasks.
This process is called fine tuning and works extremely well. Fine tuning can take hours or minutes depending on the size of the new data and is accessible to most companies.
Let’s get back to our GPS analogy to understand this.
Say you want to travel from New York City to Washington state and someone you know is going to Michigan.
Instead of renting a car to go all the way, you could hitch that ride, get to Michigan, and then rent a car to drive from Michigan to Washington state, at a much lower cost and driving time.
BERT is one of the first models to successfully apply transfer learning in NLP (Natural Language Processing). There are several pre-trained models that typically take days to train, but that you can fine tune in hours or even minutes if you use Google Cloud TPUs.
Automating Intent Insights with BigQuery & Data Studio
Now that we have a trained model, we can test it on new questions we can grab from Google Search Console using the report I created in part one.
We can run the same code as before to generate the predictions.
This time, I also want to export them to a CSV and import into BigQuery.
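As a minimal sketch of the export step, the snippet below turns a few prediction rows into CSV text using only the standard library. The rows, the column names, and the function name are hypothetical; in practice the rows would come from the prediction step.

```python
import csv
import io

# Hypothetical rows: Search Console metrics plus the predicted intent.
rows = [
    {"query": "what is bert", "clicks": 12, "impressions": 340, "intent": "DESC"},
    {"query": "who created ludwig", "clicks": 3, "impressions": 95, "intent": "HUM"},
]


def predictions_to_csv(rows):
    """Serialize prediction rows to CSV text with a header row."""
    fieldnames = ["query", "clicks", "impressions", "intent"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = predictions_to_csv(rows)
print(csv_text)
```

Saving csv_text to a file gives you something you can load through the BigQuery web UI or the bq command-line tool.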
After we have our predictions in BigQuery, we can connect it to Data Studio and create a super valuable report that helps us visualize which intentions have the greatest opportunity.
After I connected Data Studio to our BigQuery dataset, I created a new field, CTR, by dividing clicks by impressions.
As we are grouping queries by their predicted intentions, we can find content opportunities where we have intentions with high search impressions and low number of clicks. Those are the lighter blue squares.
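The same grouping can be sketched outside Data Studio. The snippet below adds up impressions and clicks per predicted intent and computes CTR as clicks divided by impressions. The rows and intent labels are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (predicted intent, impressions, clicks)
rows = [
    ("DESC", 1000, 10),
    ("DESC", 1000, 30),
    ("HUM", 200, 20),
]


def ctr_by_intent(rows):
    """Return {intent: (ctr, impressions)} aggregated over all queries."""
    totals = defaultdict(lambda: [0, 0])  # intent -> [impressions, clicks]
    for intent, impressions, clicks in rows:
        totals[intent][0] += impressions
        totals[intent][1] += clicks
    return {
        intent: (clicks / impressions if impressions else 0.0, impressions)
        for intent, (impressions, clicks) in totals.items()
    }


report = ctr_by_intent(rows)
# Largest opportunities first: many impressions, then low CTR.
for intent, (ctr, impressions) in sorted(report.items(), key=lambda kv: (-kv[1][1], kv[1][0])):
    print(f"{intent}: {impressions} impressions, CTR {ctr:.1%}")
```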
How the Learning Process Works
I want to cover this last foundational topic to expand the encoder/decoder idea I briefly covered in part one.
Let’s take a look at the charts below that help us visualize the training process.
But, what exactly is happening here? How is the machine learning model able to perform the tasks we are training on?
The first chart shows how the error/loss decreases with each training step (blue line).
But, more importantly, the error also decreases when the model is tested on “unseen” data. Then comes a point where no further improvements take place.
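To illustrate the point where improvements stop, here is a small sketch that finds the last epoch at which the loss on unseen data still improved. The loss values are invented for the example and are not taken from the charts.

```python
def best_epoch(validation_losses, min_delta=0.0):
    """Return the index of the last epoch that lowered the validation loss by more than min_delta."""
    if not validation_losses:
        raise ValueError("need at least one loss value")
    best_index, best_loss = 0, validation_losses[0]
    for index, loss in enumerate(validation_losses[1:], start=1):
        if best_loss - loss > min_delta:
            best_index, best_loss = index, loss
    return best_index


# Invented numbers: the loss falls quickly, then flattens out.
losses = [1.20, 0.80, 0.55, 0.50, 0.51, 0.52]
print("No further improvement after epoch index", best_epoch(losses))
```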
I like to think about this training process as removing noise/errors from the input by trial and error, until you are left with what is essential for the task at hand.
There is some random searching involved to learn what to remove and what to keep, but as the ideal output/behavior is known, the random search can be super selective and efficient.
Let’s say again that you want to drive from NYC to Washington and all the roads are covered with snow. The encoder, in this case, would play the role of a snowblower truck with the task of carving out a road for you.
It has the GPS coordinates of the destination and can use it to tell how far or close it is, but needs to figure out how to get there by intelligent trial and error. The decoder would be our car following the roads created by the snowblower for this trip.
If the snowblower moves too far south, it can tell it is going in the wrong direction because it is getting farther from the final GPS destination.
A Note on Overfitting
After the snowblower is done, it is tempting to just memorize all the turns required to get there, but that would make our trip inflexible in the case we need to take detours and have no roads carved out for that.
So, memorizing is not good and is called overfitting in deep learning terms. Ideally, the snowblower would carve out more than one way to get to our destination.
In other words, we need as generalized routes as possible.
We accomplish this by holding out data during the training process.
We use testing and validation datasets to keep our models as generic as possible.
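Holding out data can be sketched in a few lines. The split fractions, the seed, and the function name below are illustrative assumptions, not the settings Ludwig uses.

```python
import random


def split_dataset(rows, validation_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_validation = int(len(shuffled) * validation_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

The model only learns from the training rows. The validation and test rows are kept aside to check how well it handles questions it has never seen.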
A Note on Tensorflow for BigQuery
I tried to run our predictions directly from BigQuery, but hit a roadblock when I tried to import our trained model.
BigQuery complained that the size of the model exceeded its limit.
Waiting on bqjob_r594b9ea2b1b7fe62_0000016c34e8b072_1 ... (0s) Current status: DONE BigQuery error in query operation: Error processing job 'sturdy-now-248018:bqjob_r594b9ea2b1b7fe62_0000016c34e8b072_1': Error while reading data, error message: Total TensorFlow data size exceeds max allowed size; Total size is at least: 1319235047; Max allowed size is: 268435456
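The two byte counts in the error message are easier to compare after a quick conversion. This is plain arithmetic on the numbers reported above:

```python
MIB = 1024 ** 2  # bytes in one mebibyte

model_bytes = 1319235047  # "Total size is at least" from the error message
limit_bytes = 268435456   # "Max allowed size" from the error message

ratio = model_bytes / limit_bytes
print(f"limit: {limit_bytes / MIB:.0f} MiB")
print(f"model: {model_bytes / MIB:.0f} MiB")
print(f"the model is about {ratio:.1f}x over the limit")
```

In other words, the limit is 256 MiB and the trained model is nearly five times that size.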
I reached out to their support and they offered some suggestions. I’m sharing them here in case someone finds the time to test them out.
Resources to Learn More
When I started taking deep learning classes, I didn’t see BERT or any of the latest state of the art neural network architectures.
However, the foundation I received has helped me pick up new concepts and ideas fairly quickly. One of the articles that I found most useful to learn the new advances was this one: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning).
I also found this one very useful: Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained and this other one from the same publication: Paper Dissected: “XLNet: Generalized Autoregressive Pretraining for Language Understanding” Explained.
BERT has recently been beaten by a new model called XLNet. I am hoping to cover it in a future article when it becomes available in Ludwig.
The Python momentum in the SEO community continues to grow. Here are some examples:
Paul Shapiro brought Python to the MozCon stage earlier this month. He shared the scripts he discussed during his talk.
I was pleasantly surprised when I shared a code snippet on Twitter and Tyler Reardon, a fellow SEO, quickly spotted a bug I missed because he had written similar code independently.
Big shoutout to @TylerReardon who spotted a bug in my code pretty quickly! It is already fixed https://t.co/gvypIOVuBp
I thought I was comparing the IP from the log and the one from the DNS, but I was comparing the log IP twice! 😅
You can think of Tobiko as a kind of anti-Yelp. Launched in 2018 by Rich Skrenta, the restaurant app relies on data and expert reviews (rather than user reviews) to deliver a kind of curated, foodie-insider experience.
A new Rich Skrenta project. Skrenta is a search veteran with several startups behind him. He was one of the founders of DMOZ, a pioneering web directory that was widely used. Most recently Skrenta was the CEO of human-aided search engine Blekko, whose technology was sold to IBM Watson in roughly 2015.
At the highest level, both DMOZ and Blekko sought to combine human editors and search technology. Tobiko is similar; it uses machine learning, crawling and third-party editorial content to offer restaurant recommendations.
Betting on expert opinion. Tobiko is also seeking to build a community, and user input will likely factor into recommendations at some point. However, what’s interesting is that Skrenta has shunned user reviews in favor of “trusted expert reviews” (read: critics).
Those expert reviews are represented by a range of publisher logos on profile pages that, when clicked, take the user to reviews or articles about the particular restaurant on those sites. Where available, users can also book reservations. And the app can be personalized by engaging a menu of preferences. (Yelp recently launched broad, site-wide personalization itself.)
While Skrenta is taking something of a philosophical stand in avoiding user reviews, his approach also made the app easier to launch because expert content on third-party sites already existed. Community content takes much longer to reach critical mass. However, Tobiko also could have presented or “summarized” user reviews from third-party sites as Google does in knowledge panels, with TripAdvisor or Facebook for example.
Tobiko is free and currently appears to have no ads. The company also offers a subscription-based option that has additional features.
Why we should care. It’s too early to tell whether Tobiko will succeed, but it provocatively bucks conventional wisdom about the importance of user reviews in the restaurant vertical (although reading lots of expert reviews can be burdensome). As they have gained importance, reviews have become somewhat less reliable, with review fraud on the rise. Last month, Google disclosed an algorithm change that has resulted in a sharp decrease in rich review results showing in Search.
Putting aside gamesmanship and fraud, reviews have brought transparency to online shopping but can also make purchase decisions more time-consuming. It would be inaccurate to say there’s widespread “review fatigue,” but there’s anecdotal evidence supporting the simplicity of expert reviews in some cases. Influencer marketing can be seen as an interesting hybrid between user and expert reviews, though it’s also susceptible to manipulation.
About The Author
Greg Sterling is a Contributing Editor at Search Engine Land. He writes about the connections between digital and offline commerce. He previously held leadership roles at LSA, The Kelsey Group and TechTV. Follow him on Twitter or find him on LinkedIn.
When used creatively, XPaths can help improve the efficiency of auditing large websites. Consider this another tool in your SEO toolbelt.
There are endless types of information you can unlock with XPaths, which can be used in any category of online business.
Some popular ways to audit large sites with XPaths include:
In this guide, we’ll cover exactly how to perform these audits in detail.
What Are XPaths?
Simply put, XPath is a syntax that uses path expressions to navigate XML documents and identify specified elements.
This is used to find the exact location of any element on a page using the HTML DOM structure.
We can use XPaths to help extract bits of information such as H1 page titles, product descriptions on ecommerce sites, or really anything that’s available on a page.
While this may sound complex to many people, in practice, it’s actually quite easy!
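To give a feel for path expressions, here is a small sketch using Python's built-in xml.etree.ElementTree module. The page fragment, class names, and author name are made up, and ElementTree only supports a subset of XPath on well-formed markup, so real-world HTML usually calls for a dedicated parser.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment.
page = """
<html>
  <body>
    <h1>How to Audit Large Sites</h1>
    <div class="byline"><a class="author">Jane Doe</a></div>
    <p class="description">A short product description.</p>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# Each expression describes where the element lives in the document tree.
title = root.find(".//h1").text
author = root.find(".//div[@class='byline']/a").text
description = root.find(".//p[@class='description']").text

print(title, author, description, sep=" | ")
```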
How to Use XPaths in Screaming Frog
In this guide, we’ll be using Screaming Frog to scrape webpages.
Screaming Frog offers custom extraction methods, such as CSS selectors and XPaths.
It’s entirely possible to use other means to scrape webpages, such as Python. However, the Screaming Frog method requires far less coding knowledge.
(Note: I’m not in any way currently affiliated with Screaming Frog, but I highly recommend their software for web scraping.)
Step 1: Identify Your Data Point
Figure out what data point you want to extract.
For example, let’s pretend Search Engine Journal didn’t have author pages and you wanted to extract the author name for each article.
What you’ll do is:
Right-click on the author name and select Inspect.
In the dev tools elements panel, you will see your element already highlighted.
Right-click the highlighted HTML element and go to Copy and select Copy XPath.
At this point, your computer’s clipboard will have the desired XPath copied.
Step 2: Set up Custom Extraction
In this step, you will need to open Screaming Frog and set up the website you want to crawl. In this instance, I would enter the full Search Engine Journal URL.
Go to Configuration > Custom > Extraction
This will bring up the Custom Extraction configuration window. There are a lot of options here, but if you’re looking to simply extract text, match your configuration to the screenshot below.
Step 3: Run Crawl & Export
At this point, you should be all set to run your crawl. You’ll notice that your custom extraction is the second to last column on the right.
When analyzing crawls in bulk, it makes sense to export your crawl into an Excel format. This will allow you to apply a variety of filters, pivot tables, charts, and anything your heart desires.
3 Creative Ways XPaths Help Scale Your Audits
Now that we know how to run an XPath crawl, the possibilities are endless!
We have access to all of the answers, now we just need to find the right questions.
What are some aspects of your audit that could be automated?
Are there common elements in your content silos that can be extracted for auditing?
What are the most important elements on your pages?
The exact problems you’re trying to solve may vary by industry or site type. Below are some unique situations where XPaths can make your SEO life easier.
1. Using XPaths with Redirect Maps
Recently, I had to redesign a site that required a new URL structure. The former pages all had parameters as the URL slug instead of the page name.
This made creating a redirect map for hundreds of pages a complete nightmare!
So I thought to myself, “How can I easily identify each page at scale?”
After analyzing the various page templates, I came to the conclusion that the actual title of the page looked like an H1 but was actually just large paragraph text. This meant that I couldn’t just get the standard H1 data from Screaming Frog.
However, XPaths would allow me to copy the exact location for each page title and extract it in my web scraping report.
In this case I was able to extract the page title for all of the old URLs and match them with the new URLs through the VLOOKUP function in Excel. This automated most of the redirect map work for me.
With any automated work, you may have to perform some spot checking for accuracy.
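The title-matching step described above works like a lookup table. Here is a minimal sketch of the same idea in Python, with made-up URLs and titles standing in for the crawl exports:

```python
# Hypothetical crawl exports: old URL -> extracted title, and title -> new URL.
old_pages = {
    "/page?id=12": "Blue Widgets",
    "/page?id=13": "red  widgets",
    "/page?id=14": "Discontinued Widgets",
}
new_pages = {
    "Blue Widgets": "/blue-widgets/",
    "Red Widgets": "/red-widgets/",
}


def normalize(title):
    """Lowercase and collapse whitespace so near-identical titles still match."""
    return " ".join(title.lower().split())


def build_redirect_map(old_pages, new_pages):
    """Match old URLs to new URLs by page title; collect anything left unmatched."""
    lookup = {normalize(title): url for title, url in new_pages.items()}
    redirects, unmatched = {}, []
    for old_url, title in old_pages.items():
        target = lookup.get(normalize(title))
        if target is None:
            unmatched.append(old_url)
        else:
            redirects[old_url] = target
    return redirects, unmatched


redirects, unmatched = build_redirect_map(old_pages, new_pages)
print(redirects)
print("Needs a manual check:", unmatched)
```

The unmatched list is where the spot checking starts.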
2. Auditing Ecommerce Sites with XPaths
Sometimes, stakeholders will need product level audits on an ad hoc basis. Sometimes this covers just categories of products, but sometimes it may be the entire site.
Using the XPath extraction method we learned earlier in this article, we can extract all types of data including:
And much more
This can help identify products that may be lacking valuable information within your ecommerce site.
The cool thing about Screaming Frog is that you can extract multiple data points to stretch your audits even further.
3. Auditing Blogs with XPaths
This is a more common method for using XPaths. Screaming Frog allows you to set parameters to crawl specific subfolders of sites, such as blogs.
However, using XPaths, we can go beyond simple meta data and grab valuable insights to help identify content gap opportunities.
Categories & Tags
One of the most common ways SEO professionals use XPaths for blog auditing is scraping categories and tags.
This is important because it helps us group related blogs together, which can help us identify content cannibalization and gaps.
This is typically the first step in any blog audit.
This step is a bit more Excel-focused and advanced. Here is how it works: you set up an XPath extraction to pull the body copy out of each blog post.
Fair warning, this may drastically increase your crawl time.
When you export this crawl into Excel, you will get all of the body text in one cell. I highly recommend that you disable text wrapping, or your spreadsheet will look terrifying.
Next, in the column to the right of your extracted body copy, enter the following formula:
In this formula, A1 equals the cell of the body copy.
To scale your efforts, you can have your “keyword” equal the cell that contains your category or tag. However, you may consider adding multiple columns of keywords to get a more accurate and robust picture of your blogging performance.
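As a rough Python counterpart to this kind of spreadsheet check (not the formula from the article), the sketch below counts whole-word, case-insensitive occurrences of a keyword in the extracted body copy. The function name and sample text are made up for illustration.

```python
import re


def keyword_count(body, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in body."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, body, flags=re.IGNORECASE))


body = "XPath tips for SEO audits. Good seo starts with clean data."
for keyword in ("SEO", "xpath", "python"):
    print(keyword, keyword_count(body, keyword))
```

Running it once per category or tag column gives the same multi-keyword picture described above.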
Over almost 16 years of covering search, specifically what Googlers have said about SEO and ranking topics, I have seen my share of contradictory statements. Google’s ranking algorithms are complex, and the way one Googler explains something might sound contradictory to how another Googler talks about it. In reality, they are typically talking about different things or nuances.
Some of it is semantics, some of it is being literal in how one person might explain something while another person speaks figuratively. Some of it is being technically correct versus trying to dumb something down for general practitioners or even non-search marketers to understand. Some of it is that the algorithm can change over the years, so what was true then has evolved.
Does it matter if something is or is not a ranking factor? It can be easy to get wrapped up in details that end up being distractions. Ultimately, SEOs, webmasters, site owners, publishers and those that produce web pages need to care more about providing the best possible web site and web page for the topic. You do not want to chase algorithms and race after what is or is not a ranking factor. Google’s stated aim is to rank the most relevant results to keep users happy and coming back to the search engine. How Google does that changes over time. It releases core updates, smaller algorithm updates, index updates and more all the time.
For SEOs, the goal is to make sure your pages offer the most authoritative and relevant content for the given query and can be accessed by search crawlers.
When it is and is not a ranking factor. An example of Googlers seeming to contradict themselves popped up this week.
Gary Illyes from Google said at Pubcon Thursday that content accuracy is a ranking factor. That raised eyebrows because in the past Google has seemed to say content accuracy is not a ranking factor. Last month Google’s Danny Sullivan said, “Machines can’t tell the ‘accuracy’ of content. Our systems rely instead on signals we find align with relevancy of topic and authority.” One could interpret that to mean that if Google cannot tell the accuracy of content, it would be unable to use accuracy as a ranking factor.
Upon closer look at the context of Illyes’ comments this week, it’s clear he’s getting at the second part of Sullivan’s comment about using signals to understand “relevancy of topic and authority.” SEO Marie Haynes captured more of the context of Illyes’ comment.
Illyes was talking about YMYL (your money, your life) content. He added that Google goes through “great lengths to surface reputable and trustworthy sources.”
He didn’t outright say Google’s systems are able to tell if a piece of content is factually accurate or not. He implied Google uses multiple signals, like signals that determine reputations and trustworthiness, as a way to infer accuracy.
So is content accuracy a ranking factor? Yes and no. It depends if you are being technical, literal, figurative or explanatory. When I covered the different messaging around content accuracy on my personal site, Sullivan pointed out the difference. He said on Twitter, “We don’t know if content is accurate” but “we do look for signals we believe align with that.”
It’s the same with whether there is an E-A-T score. Illyes said there is no E-A-T score. That is correct, technically. But Google has numerous algorithms and ranking signals it uses to figure out E-A-T as an overall theme. Sullivan said on Twitter, “Is E-A-T a ranking factor? Not if you mean there’s some technical thing like with speed that we can measure directly. We do use a variety of signals as a proxy to tell if content seems to match E-A-T as humans would assess it. In that regard, yeah, it’s a ranking factor.”
You can see the dual point Sullivan is making here.
The minutiae. When you have people like me, who for almost 16 years have analyzed and scrutinized every word, tweet, blog post or video that Google produces, it can be hard for a Google representative to always convey the exact clear message at every point. Sometimes it is important to step back, look at the bigger picture, and ask yourself, Why is this Googler saying this or not saying that?
Why we should care. It is important to look at long term goals, and as I said above, not chase the algorithm or specific ranking factors but focus on the ultimate goals of your business (money). Produce content and web pages that Google would be proud to rank at the top of the results for a given query and other sites will want to source and link to. And above all, do whatever you can to make the best possible site for users — beyond what your competitors produce.
About The Author
Barry Schwartz is Search Engine Land’s News Editor and owns RustyBrick, a NY based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on SEM topics.