
SEO

How to prepare for a JS migration



An 80 percent decrease in organic traffic is the nightmare of every business. Unfortunately, such a nightmarish scenario may become reality if a website migration is done incorrectly; instead of improving the current situation it eventually leads to catastrophe.

Source: http://take.ms/V6aDv

There are many types of migrations, such as changing, merging or splitting the domains, redesigning the website or moving to a new framework.

Web development trends are clearly showing that the use of JavaScript has been growing in recent years and JavaScript frameworks are becoming more and more popular. In the future, we can expect that more and more websites will be using JavaScript.

Source: https://httparchive.org/reports/state-of-javascript

As a consequence, SEOs will be faced with the challenge of migrating to JavaScript frameworks.

In this article, I will show you how to prepare for the migration of a website built with static HTML to a JavaScript framework.

Search engines vs. JavaScript

Google is the only search engine that is able to execute JavaScript and “see” the elements like content and navigation even if they are powered by JavaScript. However, there are two things that you always need to remember when considering changes to a JS framework.

Firstly, Google uses Chrome 41 for rendering pages. This is a three-year-old browser that does not support all the modern features needed for rendering advanced functionality. Even if Google can render JS websites in general, some important parts may not be discovered because they rely on technology that Google can’t process.

Secondly, executing JS is an extremely resource-heavy process, so Google indexes JS websites in two waves. The first wave gets the raw HTML indexed. In the case of JS-powered websites, this translates to an almost empty page. During the second wave, Google executes JavaScript so it can “see” the additional elements loaded by JS. Only then is the full content of the page ready for indexing.

Taken together, these two points mean that if you decide to move your current website to a JavaScript framework, you always need to check if Google can efficiently crawl and index your website.

Migration to a JS framework done right

SEOs may not like JavaScript, but it doesn’t mean that its popularity will stop growing. We should get prepared as much as we can and implement the modern framework correctly.

Below you will find information that will help you navigate through the process of changing the current framework. I do not provide “ready-to-go” solutions because your situation will be the result of different factors and there is no universal recipe. However, I want to stress the elements you need to pay particular attention to.

Cover the basics of standard migration

You can’t count on the miracle that Google will understand the change without your help. The whole process of migration should be planned in detail.

I want to keep the focus on JS migration for this article, so if you need detailed migration guidelines, Bastian Grimm has already covered this.

Source: Twitter

Understand your needs in terms of serving the content to Google

This step should be done before anything else. You need to decide on how Google will receive the content of your website. You have two options:

1. Client-side rendering: This means that you rely entirely on Google for rendering. If you choose this option, you accept some inefficiency. The first important drawback is the deferred indexing of your content due to the two waves of indexing mentioned above. Secondly, some parts of the page may not work properly because Chrome 41 does not support all the modern features. And last, but not least, not all search engines can execute JavaScript, so your JS website will seem empty to Bing and Yahoo, as well as to the crawlers of Twitter and Facebook.

Source: YouTube

2. Server-side rendering: This solution relies on an external service or an additional component that renders your JS website, creates a static snapshot and serves it to the search engine crawlers. At the Google I/O conference, Google announced that serving a separate version of your website only to the crawler is fine. This is called Dynamic Rendering, which means that you can detect the crawler’s User Agent and send the server-side rendered version. This option also has its disadvantages: creating and maintaining additional infrastructure, possible delays if a heavy page is rendered on the server, and possible caching issues (Googlebot may receive a stale version of the page).

Source: Google
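To make the idea of Dynamic Rendering more concrete, here is a minimal sketch of the User Agent check. The token list and the `choose_variant` helper are illustrative assumptions, not an official list of crawlers or part of any framework:

```python
# Hypothetical list of crawler tokens; extend it to match the bots you care about.
CRAWLER_TOKENS = ("googlebot", "bingbot", "twitterbot", "facebookexternalhit")


def is_crawler(user_agent: str) -> bool:
    """Return True when the User-Agent header contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def choose_variant(user_agent: str) -> str:
    """Pick which version of the page to serve for this request."""
    return "server-rendered snapshot" if is_crawler(user_agent) else "client-side app"


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
chrome_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/80.0 Safari/537.36"
print(choose_variant(googlebot_ua))
print(choose_variant(chrome_ua))
```

In a real setup this check would sit in your web server or middleware, and both variants must carry the same content.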

Before migration, you need to decide whether you need option 1 or 2.

If the success of your business is built around fresh content (news, real estate offers, coupons), I can’t imagine relying only on the client-side rendered version. It may result in dramatic delays in indexing so your competitors may gain an advantage.

If you have a small website and the content is not updated very often, you can try to leave it as client-side rendered, but you should test before launching the website if Google really does see the content and navigation. The most useful tools to do so are Fetch as Google in GSC and the Chrome 41 browser.

However, Google officially stated that it’s better to use Dynamic Rendering to make sure they will discover frequently changing content correctly and quickly.

Framework vs. solution

If your choice is to use Dynamic Rendering, it’s time to answer how to serve the content to the crawlers. There is no one universal answer. In general, the solution depends on the technology AND developers AND budget AND your needs.

Below you will find a review of the options you have from a few approaches, but the choice is yours:

  • I need as simple a solution as possible.

Probably I’d go for pre-rendering, for example with prerender.io. It’s an external service that crawls your website, renders your pages and creates static snapshots to serve them if a specific User Agent makes a request. A big advantage of this solution is the fact that you don’t need to create your own infrastructure.

You can schedule recrawling and create fresh snapshots of your pages. However, for bigger and frequently changing websites, it might be difficult to make sure that all the pages are refreshed on time and show the same content both to Googlebot and users.

  • I need a universal solution and I follow the trends.

If you build the website with one of the popular frameworks like React, Vue, or Angular, you can use one of the methods of Server Side Rendering dedicated to a given framework. Here are some popular matches:

Using one of these frameworks installed on the top of React or Vue results in creating a universal application, meaning that the exact same code can be executed both on the server (Server Side rendering) and in the client (Client Side Rendering). It minimizes the issues with a content gap that you could have if you rely on creating snapshots and heavy caching, as with prerender.

  • I need a universal solution and I don’t use a popular framework.

It may happen that you are going to use a framework that does not have a ready-to-use solution for building a universal application. In this case, you can build your own rendering infrastructure. It means that you can install a headless browser on your server that will render all the subpages of your website and create the snapshots that are served to the search engine crawlers. Google provides a solution for that: Puppeteer is a library that can do a job similar to prerender.io. However, everything happens on your infrastructure.

  • I want a long-lasting solution.

For this, I’d use hybrid rendering. It’s said that this solution provides the best experience both to users and the crawlers because users and crawlers receive a server-side rendered version of the page on the initial request. In many cases, serving an SSR page is faster for users than executing all the heavy files in the browser. All subsequent user interactions are served by JavaScript. Crawlers do not interact with the website by clicking or scrolling, so each of their visits is a new request to the server and they always receive an SSR version. Sounds good, but it’s not easy to implement.

Source: YouTube

The option that you choose will depend on many factors like technology, developers and budgets. In some cases, you may have a few options, but in many cases, you may have many restrictions, so picking a solution will be a single-choice process.

Testing the implementation

I can’t imagine a migration without creating a staging environment and testing how everything works. Migration to a JavaScript framework adds complexity and additional traps that you need to watch out for.

There are two scenarios. If for some reason you decided to rely on client-side rendering, you need to install Chrome 41 and check how it renders and works. One of the most important points of an audit is checking errors in the console in Chrome Dev Tools. Remember that even a small error in processing JavaScript may result in issues with rendering.

If you decided to use one of the methods of serving the content to the crawler, you will need to have a staging site with this solution installed. Below, I’ll outline the most important elements that should be checked before going live with the website:

1. Content parity

You should always check if users and crawlers are seeing exactly the same content. To do that, you need to switch the user agents in the browser to see the version sent to the crawlers. You should verify the general discrepancies regarding rendering. However, to see the whole picture you will also need to check the DOM (Document Object Model) of your website. Copy the source code from your browser, then change the User Agent to Googlebot and grab the source code as well. Diffchecker will help you to see the differences between the two files. You should especially look for the differences in the content, navigation and metadata.
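If you prefer to script the comparison instead of pasting into Diffchecker, a minimal sketch with Python’s standard `difflib` module could look like this. The two HTML strings are made-up stand-ins for the source code you copied with each User Agent:

```python
import difflib


def html_diff(user_html: str, bot_html: str) -> list:
    """Return unified-diff lines between the two versions of a page."""
    return list(
        difflib.unified_diff(
            user_html.splitlines(),
            bot_html.splitlines(),
            fromfile="user",
            tofile="googlebot",
            lineterm="",
        )
    )


# Made-up sample sources: the crawler version is missing a pagination link.
user_version = "<h1>Red shoes</h1>\n<p>In stock</p>\n<a href='/shoes?page=2'>Next</a>"
bot_version = "<h1>Red shoes</h1>\n<p>In stock</p>"

changes = html_diff(user_version, bot_version)
for line in changes:
    print(line)
```

Lines starting with `-` exist only in the user version, lines starting with `+` only in the crawler version. An empty result means both sources are identical.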

An extreme situation is when you send an empty HTML file to Googlebot, just as Disqus does.

Source: Google

This is what their SEO Visibility looks like:

Source: http://take.ms/Fu3bL

They’ve seen better days. Now the homepage is not even indexed.

2. Navigation and hyperlinks

To be 100 percent sure that Google sees and crawls your links and passes link juice through them, you should follow the clear recommendation for implementing internal links shared at the Google I/O Conference 2018.

Source: YouTube

If you rely on server-side rendering methods, you need to check if the HTML of a prerendered version of a page contains all the links that you expect, in other words, whether it has the same navigation as your client-side rendered version. Otherwise, Google will not see the internal linking between pages. Critical areas where you may have problems are faceted navigation, pagination, and the main menu.
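One way to check link parity is to extract the `href` values from both versions and compare the two sets. The sketch below uses only the standard library; the HTML snippets are invented examples, not output from a real site:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def extract_links(html: str) -> set:
    """Return the set of link targets found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up navigation: the prerendered version lacks the pagination link.
csr_html = '<nav><a href="/men">Men</a><a href="/women">Women</a><a href="/sale?page=2">2</a></nav>'
ssr_html = '<nav><a href="/men">Men</a><a href="/women">Women</a></nav>'

missing = extract_links(csr_html) - extract_links(ssr_html)
print(sorted(missing))
```

Any URL left in `missing` is a link that users can follow but the crawler never receives.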

3. Metadata

Metadata should not be dependent on JS at all. Google says that if you load the canonical tag with JavaScript they probably will not see this in the first wave of indexing and they will not re-check this element in the second wave. As a result, the canonical signals might be ignored.

While testing the staging site, always check if an SSR version has the canonical tag in the head section. If yes, confirm that the canonical tag is the correct one. A rule of thumb is always sending consistent signals to the search engine whether you use client or server-side rendering.

While checking the website, always verify if both CSR and SSR versions have the same titles, descriptions and robots instructions.
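The same kind of check can be scripted for metadata. This is a rough sketch, assuming you already have the HTML of both versions as strings; the class name and sample markup are hypothetical:

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the title, meta description, meta robots and canonical URL."""

    def __init__(self):
        super().__init__()
        self.signals = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("description", "robots"):
            self.signals[attrs["name"].lower()] = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.signals["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.signals["title"] = self.signals.get("title", "") + data


def head_signals(html: str) -> dict:
    """Return the metadata signals found in the given HTML."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.signals


# Made-up example: the SSR version has lost its canonical tag.
csr = ('<head><title>Shoes</title><link rel="canonical" href="https://example.com/shoes">'
       '<meta name="robots" content="index,follow"></head>')
ssr = '<head><title>Shoes</title><meta name="robots" content="index,follow"></head>'

csr_signals, ssr_signals = head_signals(csr), head_signals(ssr)
mismatches = {
    key for key in csr_signals.keys() | ssr_signals.keys()
    if csr_signals.get(key) != ssr_signals.get(key)
}
print(mismatches)
```

An empty `mismatches` set means both versions send the same title, description, robots and canonical signals.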

4. Structured data

Structured data helps the search engine to better understand the content of your website.

Before launching the new website, make sure that the SSR version of your website displays all the elements that you want to mark up with structured data and that the markup is included in the prerendered version. For example, if you want to add markup to the breadcrumb navigation, first check that the breadcrumbs are displayed in the SSR version. Then run the test in Rich Results Tester to see if the markup is valid.
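A quick presence check can be automated before you run the full validation. The sketch below assumes the markup is delivered as JSON-LD (Microdata or RDFa would need a different approach), and the sample HTML is invented:

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = None


def schema_types(html: str) -> set:
    """Return the @type of every top-level JSON-LD object in the HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    return {item.get("@type") for item in collector.items if isinstance(item, dict)}


# Made-up SSR output with breadcrumb markup.
ssr_html = """
<nav>Home > Shoes</nav>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": []}
</script>
"""
print("BreadcrumbList" in schema_types(ssr_html))
```

This only tells you that the markup reached the prerendered HTML. Whether it is valid still has to be confirmed in the Rich Results Tester.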

5. Lazy loading

My observations show that modern websites love loading images and content (e.g. products) with lazy loading. The additional elements are loaded on a scroll event. It might be a nice feature for users, but Googlebot can’t scroll, so these items will not be discovered.

Seeing that so many webmasters are having problems with lazy loading in an SEO-friendly way, Google published a guideline for the best practices of lazy loading. If you want to load images on a scroll, make sure you support paginated loading. This means that if you scroll, the URLs should change (e.g., by adding the pagination identifiers: ?page=2, ?page=3, etc.) and most importantly, the URLs are updated with the proper content, for example by using History API.

Do not forget about adding rel="prev" and rel="next" markups in the head section to indicate the sequence of the pages.
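As an illustration of paginated loading, here is a small sketch that builds the `?page=N` URLs and the matching prev/next tags. The convention that page 1 keeps the bare URL is my assumption, and the helper names are hypothetical:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def page_url(base_url: str, page: int) -> str:
    """Return base_url with its page parameter set (assumption: page 1 keeps the bare URL)."""
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    query.pop("page", None)
    if page > 1:
        query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def pagination_links(base_url: str, page: int, last_page: int) -> list:
    """Build the rel="prev" and rel="next" link tags for one page of a series."""
    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{page_url(base_url, page - 1)}">')
    if page < last_page:
        tags.append(f'<link rel="next" href="{page_url(base_url, page + 1)}">')
    return tags


for tag in pagination_links("https://example.com/shoes", page=2, last_page=3):
    print(tag)
```

The first page of a series gets only a next tag and the last page only a prev tag.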

Snapshot generation and cache settings

If you decided to create a snapshot for search engine crawlers, you need to monitor a few additional things.

You must check if the snapshot is an exact copy of the client-side rendered version of your website. You can’t load additional content or links that are not visible to a standard user, because it might be assessed as cloaking. If the process of creating snapshots is not efficient e.g. your pages are very heavy and your server is not that fast, it may result in creating broken snapshots. As a result, you will serve e.g. partially rendered pages to the crawler.

There are some situations when the rendering infrastructure must work at high speeds, such as on Black Friday, when you want to update the prices very quickly. You should test the rendering in extreme conditions and see how much time it takes to update a given number of pages.

The last thing is caching. Setting the cache properly is something that will help you to maintain efficiency because many pages might be quickly served directly from the memory. However, if you do not plan the caching correctly, Google may receive stale content.
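To show the trade-off between speed and freshness, here is a minimal sketch of an in-memory snapshot cache with a maximum age. The class and its parameters are hypothetical; a production setup would more likely rely on a CDN or a dedicated cache server:

```python
import time


class SnapshotCache:
    """Keep rendered snapshots in memory and re-render them once they are too old."""

    def __init__(self, render, max_age_seconds, clock=time.monotonic):
        self._render = render            # callable: url -> rendered HTML
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the expiry logic can be tested
        self._store = {}                 # url -> (rendered_at, html)

    def get(self, url):
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]             # still fresh: serve from memory
        html = self._render(url)         # missing or stale: render again
        self._store[url] = (now, html)
        return html


# Made-up price data standing in for a real rendering step.
prices = {"/tv": "499"}
cache = SnapshotCache(lambda url: f"<p>{prices[url]}</p>", max_age_seconds=300)
print(cache.get("/tv"))
```

A short maximum age keeps prices fresh for Googlebot but puts more load on the renderer, so the value should follow how often your content really changes.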

Monitoring

Monitoring post-migration is a natural step. However, in the case of moving to a JS framework, sometimes there is an additional thing to monitor and optimize.

Moving to a JS framework may affect web performance. In many cases, the payload increases, which may result in longer loading times, especially for mobile users. A good practice is to monitor how your users perceive the performance of the website and to compare the data before and after migration. To do so, you can use the Chrome User Experience Report.

Source: Google

It will provide information if the Real User Metrics have changed over time. You should always aim at improving them and loading the website as fast as possible.

Summary

Migration is always a risky process and you can’t be sure of the results. The risks might be mitigated if you plan the whole process in detail. In the case of all types of migrations, planning is as important as the execution. If you take part in the migration to the JS framework, you need to deal with additional complexity. You need to make additional decisions and you need to verify additional things. However, as web development trends continue to head in the direction of using JavaScript more and more, you should be prepared that sooner or later you will need to face a JS migration. Good luck!


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


About The Author

Maria Cieslak is a Senior Technical SEO Consultant at Elephate, the “Best Small SEO Agency” in Europe. Her day to day involves creating and executing SEO strategies for large international structures and pursuing her interest in modern websites built with JavaScript frameworks. Maria has been a guest speaker at SEO conferences in Europe, including 2018’s SMX London, where she has spoken on a wide range of subjects, including technical SEO and JavaScript. If you are interested in more information on this subject, you should check out Elephate’s “Ultimate Guide to JavaScript SEO“.






SEO

Video: Chris Boggs on experience in the SEM industry



Chris Boggs has been doing the SEO and SEM thing since 2000 — yes, for over 20 years. Boggs does both SEO and PPC and has worked at both large agencies and smaller agencies and in-house at both large companies and small companies. He has been on his own, running his own agency named Web Traffic Advisors, since 2014.

Boggs is a former US Marine and credits a lot of his success in the industry to what he learned from his service. He also credits his previous jobs and bosses with his success. We spoke about that and also chatted about some of the earlier days in SEM.

Our conversation went into some technical SEO topics and PPC topics as well. I hope you enjoy learning about Chris Boggs; he is a good man.

I started this vlog series recently, and if you want to sign up to be interviewed, you can fill out this form on Search Engine Roundtable. You can also subscribe to my YouTube channel by clicking here.


About The Author

Barry Schwartz is a Contributing Editor to Search Engine Land and a member of the programming team for SMX events. He owns RustyBrick, a NY based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on very advanced SEM topics. Barry’s personal blog is named Cartoon Barry and he can be followed on Twitter here.




SEO

Leverage Python and Google Cloud to extract meaningful SEO insights from server log data



For my first post on Search Engine Land, I’ll start by quoting Ian Lurie:

Log file analysis is a lost art. But it can save your SEO butt!

Wise words.

However, getting the data we need out of server log files is usually laborious:

  • Gigantic log files require robust data ingestion pipelines, a reliable cloud storage infrastructure, and a solid querying system
  • Meticulous data modeling is also needed in order to convert cryptic, raw logs data into legible bits, suitable for exploratory data analysis and visualization

In the first post of this two-part series, I will show you how to easily scale your analyses to larger datasets, and extract meaningful SEO insights from your server logs.

All of that with just a pinch of Python and a hint of Google Cloud!

Here’s our detailed plan of action:

#1 – I’ll start by giving you a bit of context:

  • What are log files and why they matter for SEO
  • How to get hold of them
  • Why Python alone doesn’t always cut it when it comes to server log analysis

#2 – We’ll then set things up:

  • Create a Google Cloud Platform account
  • Create a Google Cloud Storage bucket to store our log files
  • Use the Command-Line to convert our files to a compliant format for querying
  • Transfer our files to Google Cloud Storage, manually and programmatically

#3 – Lastly, we’ll get into the nitty-gritty of Pythoning – we will:

  • Query our log files with Bigquery, inside Colab!
  • Build a data model that makes our raw logs more legible 
  • Create categorical columns that will enhance our analyses further down the line
  • Filter and export our results to .csv

In part two of this series (available later this year), we’ll discuss more advanced data modeling techniques in Python to assess:

  • Bot crawl volume
  • Crawl budget waste
  • Duplicate URL crawling

I’ll also show you how to aggregate and join log data to Search Console data, and create interactive visualizations with Plotly Dash!

Excited? Let’s get cracking!

System requirements

We will use Google Colab in this article. No specific requirements or backward compatibility issues here, as Google Colab sits in the cloud.

Downloadable files

  • The Colab notebook can be accessed here 
  • The log files can be downloaded on Github – 4 sample files of 20 MB each, spanning 4 days (1 day per file)

Be assured that the notebook has been tested with several million rows at lightning speed and without any hurdles!

Preamble: What are log files?

While I don’t want to babble too much about what log files are, why they can be invaluable for SEO, etc. (heck, there are many great articles on the topic already!), here’s a bit of context.

A server log file records every request made to your web server for content.

Every. Single. One.

In their rawest forms, logs are indecipherable, e.g. here are a few raw lines from an Apache webserver:

Daunting, isn’t it?

Raw logs must be “cleansed” in order to be analyzed; that’s where data modeling kicks in. But more on that later.

While the structure of a log file mainly depends on the server (Apache, Nginx, IIS, etc.), some attributes are almost always present:

  • Server IP
  • Date/Time (also called timestamp)
  • Method (GET or POST)
  • URI
  • HTTP status code
  • User-agent

Additional attributes can usually be included, such as:

  • Referrer: the URL that ‘linked’ the user to your site
  • Redirected URL, when a redirect occurs
  • Size of the file sent (in bytes)
  • Time taken: the time it takes for a request to be processed and its response to be sent
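To see how these attributes map onto a raw line, here is a minimal parsing sketch. It assumes the Apache “combined” log format, and the sample line is made up for illustration, so your own logs may need a different pattern:

```python
import re

# Assumption: lines follow the Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str):
    """Return the named fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample line for illustration only.
sample = (
    '66.249.66.1 - - [15/Feb/2020:10:12:05 +0000] "GET /blog/js-seo HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["timestamp"], entry["method"], entry["uri"], entry["status"])
```

Every named group becomes one key in the resulting dict, which is the same idea as creating one column per attribute later on.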

Why are log files important for SEO?

If you don’t know why they matter, read this. Time spent wisely!

Accessing your log files

If you’re not sure where to start, the best is to ask your (client’s) Web Developer/DevOps if they can grant you access to raw server logs via FTP, ideally without any filtering applied.

Here are the general guidelines to find and manage log data on the three most popular servers:

We’ll use raw Apache files in this project.

Why Pandas alone is not enough when it comes to log analysis

Pandas (an open-source data manipulation tool built with Python) is pretty ubiquitous in data science.

It’s a must to slice and dice tabular data structures, and the mammal works like a charm when the data fits in memory!

That is, a few gigabytes. But not terabytes.

Parallel computing aside (e.g. Dask, PySpark), a database is usually a better solution for big data tasks that do not fit in memory. With a database, we can work with datasets that consume terabytes of disk space. Everything can be queried (via SQL), accessed, and updated in a breeze!

In this post, we’ll query our raw log data programmatically in Python via Google BigQuery. It’s easy to use, affordable and lightning-fast – even on terabytes of data!

The Python/BigQuery combo also allows you to query files stored on Google Cloud Storage. Sweet!

If Google is a nay-nay for you and you wish to try alternatives, Amazon and Microsoft also offer cloud data warehouses. They integrate well with Python too:

Amazon:

Microsoft:

Create a GCP account and set-up Cloud Storage

Both Google Cloud Storage and BigQuery are part of Google Cloud Platform (GCP), Google’s suite of cloud computing services.

GCP is not free, but you can try it for a year with $300 credits, with access to all products. Pretty cool.

Note that once the trial expires, Google Cloud Free Tier will still give you access to most Google Cloud resources, free of charge. With 5 GB of storage per month, it’s usually enough if you want to experiment with small datasets, work on proof of concepts, etc…

Believe me, there are many. Great. Things. To. Try!

You can sign-up for a free trial here.

Once you have completed sign-up, a new project will be automatically created with a random, and rather exotic, name – e.g. mine was “learned-spider-266010“!

Create our first bucket to store our log files

In Google Cloud Storage, files are stored in “buckets”. They will contain our log files.

To create your first bucket, go to storage > browser > create bucket:

The bucket name has to be unique. I’ve aptly named mine ‘seo_server_logs’!

We then need to choose where and how to store our log data:

  • #1 Location type – ‘Region’ is usually good enough.
  • #2 Location – As I’m based in the UK, I’ve selected ‘Europe-West2’. Select your nearest location
  • #3 Click on ‘continue’

Default storage class: I’ve had good results with ‘nearline‘. It is cheaper than standard, and the data is retrieved quickly enough:

Access to objects: “Uniform” is fine:

Finally, in the “advanced settings” block, select:

  • #1 – Google-managed key
  • #2 – No retention policy
  • #3 – No need to add a label for now

When you’re done, click “create.”

You’ve created your first bucket! Time to upload our log data.

Adding log files to your Cloud Storage bucket

You can upload as many files as you wish, whenever you want to!

The simplest way is to drag and drop your files to Cloud Storage’s Web UI, as shown below:

Yet, if you really want to get serious about log analysis, I’d strongly suggest automating the data ingestion process!

Here are a few things you can try:

  • Cron jobs can be set up between FTP servers and Cloud Storage infrastructures: 
  • FTP managers like Cyberduck also offer automatic transfers to storage systems
  • More data ingestion tips here (AppEngine, JSON API etc.)

A quick note on file formats

The sample files uploaded in Github have already been converted to .csv for you.

Bear in mind that you may have to convert your own log files to a compliant file format for SQL querying. Bigquery accepts .csv or .parquet. 

Files can easily be bulk-converted to another format via the command line. You can access the command line as follows on Windows:

  • Open the Windows Start menu
  • Type “command” in the search bar
  • Select “Command Prompt” from the search results
  • I’ve not tried this on a Mac, but I believe the CLI is located in the Utilities folder

Once opened, navigate to the folder containing the files you want to convert via this command:

cd 'path/to/folder'

Simply replace path/to/folder with your path.

Then, type the command below to convert e.g. .log files to .csv:

for file in *.log; do mv "$file" "$(basename "$file" .log).csv"; done

Note that you may need to enable Windows Subsystem for Linux to use this Bash command.
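If you would rather stay in Python, which also avoids the need for a Bash shell on Windows, the same bulk rename can be sketched with `pathlib`. The function name and the demo file are hypothetical. Like the shell command, it only changes the file extension and leaves the contents untouched:

```python
import tempfile
from pathlib import Path


def rename_logs_to_csv(folder: str) -> list:
    """Rename every *.log file in `folder` to *.csv and return the new file names."""
    renamed = []
    for path in sorted(Path(folder).glob("*.log")):
        target = path.with_suffix(".csv")
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demo in a throwaway folder so no real files are touched.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "access_day1.log").write_text("sample line\n")
    print(rename_logs_to_csv(demo_dir))
```

Point the function at your own folder once you are happy with the result of the demo.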

Now that our log files are in, and in the right format, it’s time to start Pythoning!

Unleash the Python

Do I still need to present Python?!

According to Stack Overflow, Python is now the fastest-growing major programming language. It’s also getting incredibly popular in the SEO sphere, thanks to Python preachers like Hamlet or JR.

You can run Python on your local computer via Jupyter notebook or an IDE, or even in the cloud via Google Colab. We’ll use Google Colab in this article.

Remember, the notebook is here, and the code snippets are pasted below, along with explanations.

Import libraries + GCP authentication

We’ll start by running the cell below:

It imports the Python libraries we need and redirects you to an authentication screen.

There you’ll have to choose the Google account linked to your GCP project.

Connect to Google Cloud Storage (GCS) and BigQuery

There’s quite a bit of info to add in order to connect our Python notebook to GCS & BigQuery. Besides, filling in that info manually can be tedious!

Fortunately, Google Colab’s forms make it easy to parameterize our code and save time.

The forms in this notebook have been pre-populated for you. No need to do anything, although I do suggest you amend the code to suit your needs.

Here’s how to create your own form: Go to Insert > add form field > then fill in the details below:

When you change an element in the form, its corresponding values will magically change in the code!

Fill in ‘project ID’ and ‘bucket location’

In our first form, you’ll need to add two variables:

  • Your GCP PROJECT_ID (mine is ‘learned-spider-266010′)
  • Your bucket location:
    • To find it, in GCP go to storage > browser > check location in table
    • Mine is ‘europe-west2′

Here’s the code snippet for that form:

Fill in ‘bucket name’ and ‘file/folder path’:

In the second form, we’ll need to fill in two more variables:

The bucket name:

  • To find it, in GCP go to: storage > browser > then check its ‘name’ in the table
  • I’ve aptly called it ‘apache_seo_logs’!

The file path:

  • You can use a wildcard to query several files – Very nice!
  • E.g. with the wildcarded path ‘Loggy*’, Bigquery would query these three files at once:
    • Loggy01.csv
    • Loggy02.csv
    • Loggy03.csv
  • Bigquery also creates a temporary table for that matter (more on that below)
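To get a feel for which files a wildcarded path covers, you can try the pattern locally first. Note that `fnmatchcase` is only a local stand-in here and not BigQuery’s own matcher, and the file list is made up:

```python
from fnmatch import fnmatchcase

# Made-up bucket listing.
bucket_files = ["Loggy01.csv", "Loggy02.csv", "Loggy03.csv", "other_export.csv"]
pattern = "Loggy*"

matched = [name for name in bucket_files if fnmatchcase(name, pattern)]
print(matched)
```

Only the three Loggy files match, so a stray export in the same bucket would be left out of the query.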

Here’s the code for the form:

Connect Python to Google Cloud Storage and BigQuery

In the third form, you need to give a name to your BigQuery table – I’ve called mine ‘log_sample’. Note that this temporary table won’t be created in your Bigquery account.

Okay, so now things are getting really exciting, as we can start querying our dataset via SQL *without* leaving our notebook – How cool is that?!

As log data is still in its raw form, querying it is somewhat limited. However, we can apply basic SQL filtering that will speed up Pandas operations later on.

I have created 2 SQL queries in this form:

  • “SQL_1st_Filter” to filter any text
  • “SQL_Useragent_Filter” to select your User-Agent, via a drop-down

Feel free to check the underlying code and tweak these two queries to your needs.
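If you want to experiment with this kind of filtering without a GCP account, the idea can be sketched locally. The example below swaps BigQuery for Python’s built-in `sqlite3` module, so the SQL dialect differs slightly, and the log lines are invented. It is not the notebook’s code:

```python
import sqlite3

# Made-up raw log lines, stored one per row like an unparsed log table.
raw_lines = [
    '66.249.66.1 - - [15/Feb/2020:10:12:05 +0000] "GET /blog/js-seo HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [15/Feb/2020:10:13:00 +0000] "GET /contact HTTP/1.1" 200 900 "-" "Googlebot/2.1"',
    '203.0.113.7 - - [15/Feb/2020:10:14:00 +0000] "GET /blog/js-seo HTTP/1.1" 200 5120 "-" "Mozilla/5.0 Chrome/80.0"',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_sample (line TEXT)")
conn.executemany("INSERT INTO log_sample VALUES (?)", [(line,) for line in raw_lines])


def filter_logs(connection, text_filter, useragent_filter):
    """Return raw lines that contain both the text filter and the User-Agent filter."""
    query = "SELECT line FROM log_sample WHERE line LIKE ? AND line LIKE ?"
    rows = connection.execute(query, (f"%{text_filter}%", f"%{useragent_filter}%"))
    return [row[0] for row in rows]


blog_hits_by_googlebot = filter_logs(conn, "/blog", "Googlebot")
print(len(blog_hits_by_googlebot))
```

Filtering at the SQL level means fewer rows reach Pandas, which is exactly why it speeds up the later steps.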

If your SQL trivia is a bit rusty, here’s a good refresher from Kaggle!

Code for that form:

Converting the list output to a Pandas Dataframe

The output generated by BigQuery is a two-dimensional list (also called ‘list of lists’). We’ll need to convert it to a Pandas Dataframe via this code:

Done! We now have a Dataframe that can be wrangled in Pandas!

Data cleansing time, the Pandas way!

Time to make these cryptic logs a bit more presentable by:

  • Splitting each element
  • Creating a column for each element

Split IP addresses

Split dates and times

We now need to convert the date column from string to a “Date time” object, via the Pandas to_datetime() method:

Doing so will allow us to perform time-series operations such as:

  • Slicing specific date ranges 
  • Resampling time series for different time periods (e.g. from day to month)
  • Computing rolling statistics, such as a rolling average
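For a feel of what the conversion does, here is a standard-library sketch that parses Apache-style timestamps and slices a date range. The format string assumes the usual Apache layout and an English locale for month names, and the timestamps are made up. With Pandas, the same format string can be passed to to_datetime() through its format parameter:

```python
from datetime import datetime, timedelta

# Assumption: timestamps look like "15/Feb/2020:10:12:05 +0000".
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_timestamp(raw: str) -> datetime:
    """Convert one raw log timestamp to a timezone-aware datetime."""
    return datetime.strptime(raw, APACHE_TIME_FORMAT)


# Made-up timestamps.
stamps = [
    parse_timestamp(raw)
    for raw in (
        "15/Feb/2020:10:12:05 +0000",
        "16/Feb/2020:08:00:00 +0000",
        "01/Mar/2020:23:59:59 +0000",
    )
]

# Slice a one-week window starting on 15 February.
start = parse_timestamp("15/Feb/2020:00:00:00 +0000")
in_range = [stamp for stamp in stamps if start <= stamp < start + timedelta(days=7)]
print(len(in_range))
```

Once the values are real datetime objects, comparisons and date arithmetic work as expected, which is not the case while they are still strings.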

The Pandas/Numpy combo is really powerful when it comes to time series manipulation, check out all you can do here!

More split operations below:

Split domains

Split methods (Get, Post etc…)

Split URLs

Split HTTP Protocols

Split status codes

Split ‘time taken’

Split referral URLs

Split User Agents

Split redirected URLs (when existing)

Reorder columns

Time to check our masterpiece:

Well done! With just a few lines of code, you converted a set of cryptic logs to a structured Dataframe, ready for exploratory data analysis.

Let’s add a few more extras.

Create categorical columns

These categorical columns will come in handy for data analysis or visualization tasks. We’ll create two, paving the way for your own experiments!

Create an HTTP codes class column

Create a search engine bots category column

As you can see, our new columns httpCodeClass and SEBotClass have been created:
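The logic behind two such columns can be sketched in plain Python. This is an illustration rather than the notebook’s code: the bot token mapping is a hypothetical starting point and the rows are made up:

```python
def http_code_class(status: int) -> str:
    """Map a status code to its class, e.g. 404 -> '4xx'."""
    if 100 <= status <= 599:
        return f"{status // 100}xx"
    return "unknown"


# Hypothetical token-to-label mapping; extend it with the bots you track.
BOT_TOKENS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "yandex": "Yandex",
    "baiduspider": "Baidu",
    "duckduckbot": "DuckDuckGo",
}


def bot_class(user_agent: str) -> str:
    """Return a search engine label for the User-Agent, or 'Other'."""
    ua = user_agent.lower()
    for token, label in BOT_TOKENS.items():
        if token in ua:
            return label
    return "Other"


# Made-up rows standing in for the parsed log table.
rows = [
    {"status": 404, "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"status": 200, "user_agent": "Mozilla/5.0 Chrome/80.0"},
]
for row in rows:
    row["httpCodeClass"] = http_code_class(row["status"])
    row["SEBotClass"] = bot_class(row["user_agent"])
    print(row["httpCodeClass"], row["SEBotClass"])
```

Keep in mind that a User-Agent string can be faked, so this label alone does not prove that a hit came from a real search engine bot.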

Spotting ‘spoofed’ search engine bots

We still need to tackle one crucial step for SEO: verify that IP addresses are genuinely from Googlebots.

All credit due to the great Tyler Reardon for this bit! Tyler has created  searchtools.io, a clever tool that checks IP addresses and returns ‘fake’ Googlebot ones, based on a reverse DNS lookup.

We’ve simply integrated that script into the notebook – code snippet below:

Running the cell above will create a new column called ‘isRealGbot?’:

Note that the script is still in its early days, so please consider the following caveats:

  • You may get errors when checking a huge number of IP addresses. If so, just bypass the cell
  • Only Googlebots are checked currently

Tyler and I are working on the script to improve it, so keep an eye on Twitter for future enhancements!

Filter the Dataframe before final export

If you wish to further refine the table before exporting to .csv, here’s your chance to filter out status codes you don’t need and refine timescales.

Some common use cases:

  • You have 12 months’ worth of log data stored in the cloud, but only want to review the last 2 weeks
  • You’ve had a recent website migration and want to check all the redirects (301s, 302s, etc.) and their redirect locations
  • You want to check all 4XX response codes

Filter by date 

Refine start and end dates via this form:
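
Behind such a form, the filtering itself can be as simple as a boolean mask. A sketch with made-up data, where `start_date` and `end_date` stand in for the values typed into the form (the variable and column names are assumptions):

```python
import pandas as pd

# Hypothetical Dataframe with an already-converted datetime column.
logs = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-01-20", "2020-02-05"]),
    "statusCode": ["200", "301", "404", "200"],
})

# Stand-ins for the values entered in the form.
start_date = pd.Timestamp("2020-01-05")
end_date = pd.Timestamp("2020-01-31")

# Keep only the rows whose date falls within the range (both ends inclusive).
mask = logs["date"].between(start_date, end_date)
logs_filtered = logs[mask]

print(logs_filtered)
```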

Filter by status codes

Check status codes distribution before filtering:

Code:
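
A minimal sketch of such a check, assuming the status codes live in a column named `statusCode` (the data is invented):

```python
import pandas as pd

# Hypothetical status codes.
logs = pd.DataFrame({"statusCode": ["200", "200", "301", "404", "200", "301"]})

# Count how many rows carry each status code, most frequent first.
status_counts = logs["statusCode"].value_counts()

print(status_counts)
```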

Then filter HTTP status codes via this form:

Related code:
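
The filter could be sketched as follows, with `codes_to_keep` standing in for whatever is selected in the form (the variable and column names are assumptions):

```python
import pandas as pd

# Hypothetical Dataframe.
logs = pd.DataFrame({
    "url": ["/", "/old-page/", "/missing/", "/temp/", "/blog/"],
    "statusCode": ["200", "301", "404", "302", "200"],
})

# Stand-in for the form selection: keep redirects only.
codes_to_keep = ["301", "302"]

# isin() builds a boolean mask that is True for rows with a wanted status code.
logs_filtered = logs[logs["statusCode"].isin(codes_to_keep)]

print(logs_filtered)
```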

Export to .csv 

Our last step is to export our Dataframe to a .csv file. Give it a name via the export form:

Code for that last form:
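
A sketch of the export step, assuming the file name comes from the form. The name and the temporary folder used here are placeholders so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical Dataframe to export.
logs = pd.DataFrame({
    "url": ["/", "/old-page/"],
    "statusCode": ["200", "301"],
})

# Stand-in for the name typed into the export form.
file_name = "logs_export.csv"
export_path = Path(tempfile.gettempdir()) / file_name

# index=False keeps the row numbers out of the exported file.
logs.to_csv(export_path, index=False)

print(f"Exported {len(logs)} rows to {export_path}")
```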

Pat on the back if you’ve followed till here! You’ve achieved so much over the course of this article!

I cannot wait to take it to the next level in my next column, with more advanced data modeling/visualization techniques!

I’d like to thank the following people:

  • Tyler Reardon, who’s helped me to integrate his anti-spoofing tool into this notebook!
  • Paul Adams from Octamis and my dear compatriot Olivier Papon for their expert advice
  • Last but not least, kudos to Hamlet Batista and JR Oakes. Thanks, guys, for being so inspirational to the SEO community!

Please reach out to me on Twitter if you have questions or need further assistance. Any feedback (including pull requests! :)) is also greatly appreciated!

Happy Pythoning!


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


About The Author

Charly Wargnier is a seasoned digital marketing consultant based in the UK, leaning on over a decade of in-the-trenches SEO, BI and Data engineering experience. Charly has worked both in-house and agency-side, primarily for large enterprises in Retail and Fashion, and on a wide range of fronts including complex technical SEO issues, site performance, data pipelining and visualization frameworks. When he isn’t working, he enjoys coding for good and spending quality time with his family – cooking, listening to Jazz music and playing chess, in no particular order!



Microsoft Office hits pause on forcing Bing search in Chrome, Firefox


Microsoft recently announced a new “extension” as part of an update to its Office 365 ProPlus software that forcibly and automatically changes the default search engine in Chrome and Firefox, company-wide, to Bing, most likely from Google. After considerable backlash, the company is reversing course, a bit.

In a predatory fashion, the extension automatically seeks out, through the network and local device file systems, installations of independent browsers (Chrome and Firefox were mentioned) in order to edit configuration files outside its own software ecosystem.

A compromise

In a halfhearted reversal, Microsoft will compromise with modifications that comply more with administrators’ wishes to make the extension optional. This will also delay the rollout. Rather than automatically changing default search engines for Chrome and Firefox to Bing, administrators are now required to opt in for it to do so, and actions will initially be limited to Active Directory-joined devices.

This means, at first, the extension won’t act like a worm that traverses the whole network looking for vulnerable computers — until sometime “in the future.”

In the future we will add specific settings to govern the deployment of the extension to unmanaged devices. 

Microsoft

It’s still troubling that Microsoft plans to do this, but it is understandable when considering what is often done in tandem with an organization’s rules. IT infrastructure setup and maintenance require super-user levels of control over software installation and configuration settings.

The problem is when organizations are less restrictive, allowing users to install Chrome and Firefox rather than limit them to using Microsoft Edge or past versions of IE. Browser applications get very personalized when authenticated with Google and/or Firefox Accounts for services such as Google search.

No matter how convenient the ability to search for docs and refs from shared drives and Microsoft applications via Chrome and Firefox default search is, users of those browsers should be able to do that through company resources and manage search defaults on their own.

Security implications

In more restrictive organizations, like those that require secure access to sensitive information by authenticated staff, having “overlord” control over networked machines is a vital component of IT systems operations. In those cases, it is commonplace to disallow software installations in the first place.

It stands to reason that security incidents can increase when browser search with Microsoft Search in Bing accesses network resources. Administrators have to take care when considering such applications. They certainly didn’t ask for the features the new extension provides and rightly view the move as one of pure marketing.

It’s when users are allowed to install programs that policy and operations should be less impinging. Automatically changing default search settings to Bing while only providing last-minute instructions for administrators who must take action to prevent the extension from executing was a very poor way to introduce a controversial procedure in Office 365 setup.

Why we care

Ironically, ink from the press about the backlash gave the search capability of Microsoft Search in Bing a spotlight that the extension may not have received otherwise. Microsoft should not resort to leveraging its Office 365 install base to switch user-defined search defaults from a desired choice to Bing in order to unfairly compete. It demonstrates how much it would like to take search market share away from Google. Bing integrated with Microsoft Search competes fairly well with its unique results from network resources, something Google can only emulate with its own suite of interoperable services appearing in search results.


About The Author

Detlef Johnson is the SEO for Developers Expert for Search Engine Land and SMX. He is also a member of the programming team for SMX events and writes the SEO for Developers series on Search Engine Land. Detlef is one of the original group of pioneering webmasters who established the professional SEO field more than 20 years ago. Since then he has worked for major search engine technology providers, managed programming and marketing teams for Chicago Tribune, and consulted for numerous entities including Fortune 500 companies. Detlef has a strong understanding of Technical SEO and a passion for Web programming.



Copyright © 2019 Plolu.