Posted by RobinRozhon
I use web crawlers on a daily basis. While they are very useful,
they only imitate search engine crawlers’ behavior, which means
you aren’t always getting the full picture.
The only resource that can give you a real overview of how search
engines crawl your site is log files. Despite this, many people
are still obsessed with
crawl budget — the number of URLs Googlebot can and wants to crawl.
Log file analysis may discover URLs on your site that you had no
idea about but that search engines are crawling anyway — a major
waste of Google server resources (Google):
“Wasting server resources on pages like these will drain
crawl activity from pages that do actually have value, which may
cause a significant delay in discovering great content on a
site.”
While it’s a fascinating topic, the fact is that most sites
don’t need to worry that much about crawl budget — an
observation shared by John Mueller (Webmaster Trends Analyst at
Google) quite a few times.
There’s still huge value in analyzing the logs produced from
those crawls, though. They will show which pages Google is crawling
and whether anything needs to be fixed.
When you know exactly what your log files are telling you,
you’ll gain valuable insights about how Google crawls and views
your site, which means you can optimize for this data to increase
traffic. And the bigger the site, the greater
the impact fixing these issues will have.
What are server logs?
A log file is a recording of everything that goes in and out of
a server. Think of it as a ledger of requests made by crawlers and
real users. You can see exactly what resources Google is crawling
on your site.
You can also see what errors need your attention. For instance,
one of the issues we uncovered with our analysis was that our CMS
created two URLs for each page and Google discovered both. This led
to duplicate content issues because two URLs with the same content
were competing against each other.
Analyzing logs is not rocket science — the logic is the same
as when working with tables in Excel or Google Sheets. The hardest
part is getting access to them — exporting and filtering that data.
Looking at a log file for the first time may also feel somewhat
daunting, because when you open one you see a wall of raw text.
Calm down and take a closer look at a single line:
126.96.36.199 - - [08/Dec/2017:04:54:20 -0400] "GET /contact/ HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
You’ll quickly recognize that:
- 126.96.36.199 is the IP address
- [08/Dec/2017:04:54:20 -0400] is the Timestamp
- GET is the Method
- /contact/ is the Requested URL
- 200 is the Status Code (result)
- 11179 is the Bytes Transferred
- “-” is the Referrer URL (source) — it’s empty because this request was made by a crawler
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) is the User Agent (signature) — this is the user agent of Googlebot
Once you know what each line is composed of, it’s not so
scary. It’s just a lot of information. But that’s where the
next step comes in handy.
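To make the breakdown above concrete, here is a minimal sketch of how you might pull those fields out of a log line in Python. It assumes your server writes the common “combined” log format shown in the sample; the field names are my own, matching the breakdown above.

```python
import re

# Regex for the "combined" log format shown above. Each named group
# corresponds to one field from the breakdown: IP, timestamp, method,
# requested URL, status code, bytes transferred, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return the log line's fields as a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('126.96.36.199 - - [08/Dec/2017:04:54:20 -0400] "GET /contact/ HTTP/1.1" '
          '200 11179 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
print(entry["ip"], entry["method"], entry["url"], entry["status"])
```

Run this over every line of an exported log file and you have a table you can filter and aggregate, just as you would in Excel.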
Tools you can use
There are many tools you can choose from that will help you
analyze your log files. I won’t give you a full run-down of
available ones, but it’s important to know the difference between
static and real-time tools.
Static — These tools analyze only a static file.
You can’t extend the time frame. Want to analyze another period?
You need to request a new log file. My favourite tool for analyzing
static log files is Power BI.
Real-time — Gives you direct access to logs.
I really like the open source ELK Stack (Elasticsearch,
Logstash, and Kibana). It takes a moderate effort to implement,
but once the stack is ready, it lets me change the time frame
based on my needs without needing to contact our developers.
Don’t just dive into logs hoping to find something —
start by asking questions. If you don’t formulate your questions at
the beginning, you will end up in a rabbit hole with no direction
and no real insights.
Here are a few sample questions I use at the start of my analysis:
- Which search engines crawl my website?
- Which URLs are crawled most often?
- Which content types are crawled most often?
- Which status codes are returned?
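Once the log lines are parsed into records, each of these questions becomes a simple count. A sketch, using a few hypothetical parsed entries in place of a real log export:

```python
from collections import Counter

# Hypothetical parsed log entries; in practice these would come from
# parsing each line of your exported log file.
entries = [
    {"url": "/contact/", "status": "200", "user_agent": "Googlebot/2.1"},
    {"url": "/old-page/", "status": "404", "user_agent": "Googlebot/2.1"},
    {"url": "/contact/", "status": "200", "user_agent": "bingbot/2.0"},
    {"url": "/old-page/", "status": "404", "user_agent": "Googlebot/2.1"},
]

# Which search engines crawl my website?
crawlers = Counter(e["user_agent"] for e in entries)
# Which URLs are crawled most often?
urls = Counter(e["url"] for e in entries)
# Which status codes are returned?
statuses = Counter(e["status"] for e in entries)

print(crawlers.most_common())
print(urls.most_common())
print(statuses.most_common())
```

The same aggregations work in a pivot table, Power BI, or Kibana; the point is that each starter question maps to one group-and-count.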
If you see that Google is crawling non-existent pages (404), you
can start asking which of those requested URLs return 404 status
codes.
Order the list by the number of requests, evaluate the ones with
the highest number to find the pages with the highest priority (the
more requests, the higher the priority), and consider whether to
redirect that URL or take some other action.
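That prioritization step can be sketched in a few lines of Python. The (URL, status) pairs below are hypothetical stand-ins for parsed log entries:

```python
from collections import Counter

# Hypothetical parsed entries: (requested URL, status code) pairs.
requests = [
    ("/old-page/", "404"),
    ("/old-page/", "404"),
    ("/old-page/", "404"),
    ("/retired-promo/", "404"),
    ("/contact/", "200"),
]

# Keep only the 404s, then order by number of requests, descending:
# the more requests a missing URL gets, the higher its priority for
# a redirect or other fix.
not_found = Counter(url for url, status in requests if status == "404")
for url, hits in not_found.most_common():
    print(url, hits)
```

Here /old-page/ surfaces first because Googlebot requested it three times, so it would be the first candidate for a redirect.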