When doing log analysis it’s common to want to know how major search engines are crawling your site.
Google provides a great overview and a simple tutorial, but we’ll dive in a bit deeper and look at gathering additional information for the cases where the results are ambiguous.
With any data, you want it to be accurate and unpolluted so you can make well-informed decisions.
There are many rogue scrapers out there spoofing their user agent to look like a good search crawler.
We’ll look at how we can verify “good” search engine crawlers.
For any given IP, verification is basically two steps:
- Perform a Reverse DNS Lookup
- Perform a Forward DNS Lookup
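The two steps above can be sketched in Ruby using the standard library’s `Resolv`. This is a minimal sketch, not a definitive implementation: the helper name and the domain list are my assumptions, and the lookups are injectable lambdas so the logic can be exercised without live DNS.

```ruby
require 'resolv'

# Domains a "good" crawler's reverse DNS name should end with
# (an assumed, non-exhaustive list).
CRAWLER_DOMAINS = %w[.googlebot.com .google.com .search.msn.com].freeze

# Step 1: reverse DNS lookup on the IP.
# Step 2: forward DNS lookup on the returned hostname;
#         it must resolve back to the same IP.
def verified_crawler?(ip,
                      reverse: ->(i) { Resolv.getname(i) },
                      forward: ->(h) { Resolv.getaddress(h) })
  host = reverse.call(ip)
  return false unless CRAWLER_DOMAINS.any? { |d| host.end_with?(d) }
  forward.call(host) == ip
rescue Resolv::ResolvError
  false
end
```

With the default lambdas, `verified_crawler?('66.249.66.1')` does both lookups against real DNS; any resolution failure is treated as "not verified".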
If you want more details on how to do this in Ruby, Python, or Bash, you can check out the tutorials.
I typically have a Ruby script that takes a list of IPs, performs the DNS lookups, and also does a GeoIP lookup to see where the bots are coming from.
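A batch driver along those lines might look like the sketch below (the function name and result shape are assumptions, and the lookups are injectable for testing). The GeoIP step would need a third-party gem such as `maxminddb`, so it is only noted in a comment.

```ruby
require 'resolv'

# Take a list of IPs, run reverse then forward DNS on each, and tag
# whether the round trip matches. Failed lookups yield nil.
def classify_ips(ips,
                 reverse: ->(ip) { Resolv.getname(ip) },
                 forward: ->(host) { Resolv.getaddress(host) })
  ips.map do |ip|
    host = (reverse.call(ip) rescue nil)              # reverse DNS, nil on failure
    addr = host ? (forward.call(host) rescue nil) : nil # forward DNS on the hostname
    # A GeoIP lookup (e.g. via the maxminddb gem) could add a :country key here.
    { ip: ip, host: host, verified: !host.nil? && addr == ip }
  end
end

# Usage: classify_ips(File.readlines('ips.txt', chomp: true))
```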
Google is usually very good about making sure their Googlebot servers respond with a valid name for reverse DNS lookups.
Bing, on the other hand, isn’t always as up to date. You might also see some strange behavior from bots that identify themselves as bingbot or msnbot. I’ve seen requests come from Microsoft IPs, but the IPs geolocate to the Philippines or Southeast Asia.
A reverse DNS lookup can be spoofed, so if you want to be sure, do a forward DNS lookup on the returned hostname and verify it resolves to the same IP address.
In cases where there isn’t a DNS response, we can use a whois lookup to see if the IP belongs to a major search engine.
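One simple way to do this is to shell out to the `whois` CLI and scan its output for a known owner. The sketch below shows only the scanning part so it doesn’t depend on the CLI being installed; the helper name and organization strings are my assumptions.

```ruby
# Hypothetical check: scan whois output for a known search engine owner.
# The organization strings below are assumptions; extend as needed.
SEARCH_ENGINE_ORGS = ['Google LLC', 'Microsoft Corporation'].freeze

def search_engine_whois?(whois_text)
  SEARCH_ENGINE_ORGS.any? { |org| whois_text.include?(org) }
end

# In practice, feed it the output of the whois CLI:
#   search_engine_whois?(`whois #{ip}`)
```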
This information can also be spoofed, but it’s usually harder to do.
Now that you can verify the search engine IP ranges, you can use a log analysis tool like Splunk to gain a more accurate picture of the search engines’ crawl behavior on your site.
You can also store these IPs in a database so you can monitor which IPs and regions are most active over time. In addition, it’s helpful to re-check old IPs you haven’t seen in a while, just to make sure they haven’t changed hands to another company, or that Google and Bing aren’t using them for different services. This rarely happens, but it’s good practice.
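As a minimal sketch of that record-and-recheck idea, the example below uses the standard library’s `PStore` as a stand-in for a real database; the helper names, schema, and 30-day threshold are all assumptions.

```ruby
require 'pstore'

# Record each verified crawler IP with its hostname and when it was
# last seen, so stale entries can be flagged for re-verification.
def record_ip(store, ip, host)
  store.transaction do
    store[ip] = { host: host, last_seen: Time.now }
  end
end

# IPs not seen within max_age_days are candidates for re-checking,
# in case they've changed hands or been repurposed.
def stale_ips(store, max_age_days: 30)
  cutoff = Time.now - max_age_days * 86_400
  store.transaction(true) do
    store.roots.select { |ip| store[ip][:last_seen] < cutoff }
  end
end

# Usage:
#   store = PStore.new('crawler_ips.pstore')
#   record_ip(store, '66.249.66.1', 'crawl-66-249-66-1.googlebot.com')
```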