Troubleshooting: Fetch a web page as Googlebot

I’ve had a couple of clients come to me recently after their WordPress sites were pwned. Sometimes you’re able to use a tool like Wordfence to clean things up and secure the site, and all is OK. Sometimes, however, it goes a bit deeper than that.

With one recent hack, even after the WordPress core files were cleaned, all plugins were replaced with fresh versions from the repository, and the miscellaneous PHP files scattered through the wp-content directory were removed, everything looked OK – except when Google indexed the site.

Whilst Google have a utility in their Webmaster Tools to fetch a page as Google, there are a few limitations. First of all, you need to have the website added and verified in Webmaster Tools (or be logged into a Google account that owns the property), and it’s not instant.

I was trying to track down something that didn’t show up on any Wordfence scan and wasn’t a malicious plugin or hacked core file. I was quite sure about this as the spam links on the page persisted even after WordPress was reinstalled and all plugins were disabled.

The one thing that did fix it however was switching the theme.

As it turns out, the hackers had inserted some conditional code into the main theme files so that whenever a regular browser viewed the site, everything looked as it should. When Googlebot viewed the site (actually, any user agent on a long list of those used by search engine spiders), a huge number of spam links were inserted into the page.
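The snippet itself isn’t worth reproducing, but user-agent cloaking of this sort typically looks something like the sketch below. This is a rough, hypothetical reconstruction, not the actual malware: the spider list, the spam markup and the spam.example URL are all placeholders.

<?php
// Hypothetical reconstruction of user-agent cloaking -- the agent list
// and spam markup are illustrative placeholders, not the real snippet.
$spider_agents = array('Googlebot', 'bingbot', 'Slurp', 'Baiduspider');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$is_spider = false;
foreach ($spider_agents as $agent) {
    // Case-insensitive substring match against the visitor's user agent
    if (stripos($ua, $agent) !== false) {
        $is_spider = true;
        break;
    }
}

if ($is_spider) {
    // Regular visitors never reach this branch, so the site looks
    // clean in a browser and dirty to search engine spiders.
    echo '<div style="display:none;"><a href="http://spam.example/">spam link</a></div>';
}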

I was able to find the snippet of code in one of the theme files and remove it easily; however, I found Fetch as Google slow to use in practice. Using a quick trick with curl, I could give the website a user agent that triggered the spam links, like so:

curl -L -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://example.com

The above command runs curl with -A (or --user-agent) to set the user agent string, and -L (or --location) to tell curl that if it receives a redirect response, it should repeat the request against the new location (i.e. it will send the same user agent to the new URL).
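If you’d rather script the check than run it from a terminal, PHP’s curl extension can make an equivalent request. This is just a rough sketch of the same idea, assuming the curl extension is installed; the URL and the string being searched for are placeholders:

<?php
// A PHP equivalent of the curl command above.
$ch = curl_init('http://example.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // same as -L / --location
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
$html = curl_exec($ch);
curl_close($ch);

// Crude check: search the fetched HTML for whatever spam you're hunting.
if ($html !== false && stripos($html, 'spam') !== false) {
    echo "Suspect content served to Googlebot\n";
}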

This quickly showed me the spam content in the page, and gave me instant feedback that the HTML output was clean once I had found and removed the suspect chunk of code.
