In June of 2006, while working to resolve some indexing issues for a client, I discovered a bug in Google’s algorithm that allowed 3rd parties to literally hack a web page out of Google’s index and search results. I notified a contact at Google soon after, once I managed to confirm that what we thought we were seeing was really happening.
The problem still exists today, so I am making this public in the hope that it will spur some action.
I have sat on this information for more than a year now. A good friend has allowed his reputation to suffer, rather than disclose what we knew. I continue to see web sites that are affected by this issue. After giving Google more than a year to resolve the issue, I have decided that the only way to spur them to action is to publish what I know.
Disclaimer: What you’re about to read is as accurate as it can be, given the fact that I do not work at Google, and have no access to inside information. It’s also potentially disruptive to the organic results at Google, until they fix the problem. I hope that publishing this information is for the greater good, but I can’t control what others do with it, or how Google responds.
I am also not the only person who knows about this hack.
- Alan Perkins (who along with many others stayed quiet about the 302 redirect bug for 2 years) knew about it the day after I found it.
- Danny Sullivan has known nearly as long, and I suspect that his behind the scenes efforts are the reason why the major search engines all decided to publish “how to validate our spider” instructions after SES San Jose last year.
- Bill Atchison knows, because he helped me figure out a defensive strategy for my client’s sites… and along with me danced around this issue on the “Bot Obedience” panel at SES last year – trying to warn people without telling them too much.
- My (now former) client Brad Fallon knows… and he’s been subjected to a lot of unfair criticism that he could have easily answered by making this public. It cost him a lot of money, and “a lot of money” to Brad is a lot more than it is for most of us.
- “Someone else” knows, because they were actively exploiting this bug to knock one of Brad’s sites off of Google’s SERPs. I suspect many other “black hats” know about it by now… because other sites are being affected. I can’t believe that they’re all accidents.
This is going to be a long story, I’m afraid… but bear with me, because you need to understand this, and how to defend yourself.
The story begins over a year ago…
My friend Brad Fallon had been having some troubles with Google, and one of his web sites, My Wedding Favors. In June of 2006, after exhausting all of his other options, Brad (who knows his way around SEO) hired me to direct his search marketing efforts and, in simple terms “figure out what the hell is going on with Google.”
The first thing I discovered was that he wasn’t “banned,” but that Google was indexing everything on his site except for the home page. It took about two weeks of research and testing before I developed a working theory. When we searched Google for phrases that should have been completely unique to the My Wedding Favors home page, we kept finding a particular kind of duplicate content: proxies.
For those who don’t know what a proxy is, it’s a web server that’s set up to deliver the content from other web sites. Among other things, proxies have been set up to allow people to surf the internet “anonymously,” since the requests come from the proxy server’s IP address and not their own. Some of them are set up to allow people to get to content that is blocked by firewalls and URL blocking on corporate, educational, and other networks.
The diagram below shows what this looks like, when used innocently by a human being:
Unfortunately for Brad, the proxies weren’t being used innocently, and it wasn’t some kid trying to read his MySpace messages at school; it was Googlebot, fetching his home page’s content under a different URL, and indexing it:
When Google fetched the copy of Brad’s home page through the proxy URL, they were dropping Brad’s (authentic) home page from the index completely, and keeping the (proxy) duplicate instead. Every time it happened, Brad was losing a ton of traffic.
Since at first we thought it could be a “one-off” problem, we just blocked the proxies’ IP addresses from accessing our server. Sure enough, within a week or so, My Wedding Favors’ home page wasn’t just back in Google – it was all the way back on the first page of search results!
All good again? Not so fast…
Another week or so went by, another set of proxies showed up in Google’s index, and the home page was once again completely dropped from Google.
At this point, we realized that this wasn’t an accident. Someone knew exactly what they were doing. Someone was actively seeking out proxies and linking to them, so that Google would pick them up, and drop Brad’s home page.
By now, more technically inclined readers should see where this is going, but keep reading – it gets better (or worse). If you’re just looking for “how to hack Google” instructions, that’s all I’m going to give you, so you can leave now.
Back to the story… It was pretty obvious that we had a fight on our hands… but we had some ideas for a solution. Since all of the Googlebot requests from the proxies so far had come through with Googlebot as the user-agent, we implemented a “quick fix” solution: whenever we got a request with a Googlebot user-agent, we did a lookup on ARIN to see if the IP address was actually assigned to Google, and blocked the request if it wasn’t.
Another week, and Brad’s home page was back in the index, and right back on the first page of SERPs.
Now, I should mention that by this time I had contacted Matt Cutts at Google to let him know what was going on. His response was short, and he told me that he was surprised that this kind of thing could happen, but he did look into it. That was nice of Matt, because it’s not really his department, but he’s a good guy and actually wants to help webmasters out. I spoke about this with Matt and others on several subsequent occasions. They seemed to understand it, but nobody I talked to could do anything more than pass the word along to someone else.
That was more than a year ago.
I enlisted a few trusted folks to help me investigate, and Bill Atchison and I gave presentations at SES San Jose (August 2006) where we tried to warn people about the need to defend themselves without actually telling them “too much.” Since that presentation, other folks have written about the problem of proxies and duplicate content, but fortunately or unfortunately, they didn’t know how bad the problem was.
While I was in San Jose, Brad’s site got hacked out again (server upgrade broke our self-defense script and it had to be rewritten)…
Brad started getting a little sick of people calling him a “fake SEO expert” because his site was showing PR0, and couldn’t be found in SERPs… but he kept quiet and took the abuse, because he understood that this was dangerous information. I kept quiet too, because letting this kind of information out without giving Google a chance to fix the problem would be terribly irresponsible.
Bill kept quiet, as did Alan, Danny, and a few other folks who helped me research the issue. Either the SES San Jose presentations got through to someone, or Danny did something behind the scenes, because shortly after he learned about this, all of the search engines decided to start publishing clear instructions on how to validate their user-agents.
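For Googlebot, the published validation procedure boils down to a reverse DNS lookup followed by a forward confirmation. Here’s a minimal PHP sketch of that check (illustrative only – a production version should cache its results):

    <?php
    // Validate a claimed Googlebot visit the way Google's published
    // instructions describe: reverse DNS, check the domain, then
    // forward-confirm.
    function is_real_googlebot($ip) {
        $host = gethostbyaddr($ip);           // reverse DNS lookup
        if ($host === false || $host === $ip) {
            return false;                     // lookup failed / no PTR record
        }
        // Googlebot hostnames end in googlebot.com (or google.com)
        if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
            return false;
        }
        // Forward-confirm: the hostname must resolve back to the same IP
        return gethostbyname($host) === $ip;
    }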
So, things quieted down for a bit. Google was (I thought) working on the problem, and Brad’s site was doing fine.
I told you it gets worse, though…
Around the first of October, the next wave of proxies hit: a different kind of proxy, one that didn’t pass Googlebot’s user-agent along. There was a whole network of proxies built to avoid detection, because they were built to allow people inside the People’s Republic of China to view censored content without getting blocked by the “great firewall of China.” These proxies not only spoof the user-agent, they come in through many other (intermediate) proxies, so that the IP address of the originating machine cannot be determined.
There was no way to block them by IP address, because even blocking every IP in China didn’t catch them all, and no way to catch them based on the user-agent, because the user-agent was spoofed.
I was expecting this, actually… and we had a solution in the works: reverse cloaking. Every page Brad’s web servers deliver now has “noindex, nofollow” in the robots meta tag, unless the request comes from a validated search engine spider. A “spoofed” proxy visit from Googlebot delivers a page that won’t be indexed. A real visit from Googlebot gets the page with “noindex” removed.
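As a sketch, the idea looks like this, assuming a spider-validation check like the is_real_googlebot() function above (a real implementation would recognize all of the major engines):

    <?php
    // "Reverse cloaking": every page defaults to noindex,nofollow, and
    // only a validated spider gets the indexable version. Human visitors
    // never notice, because browsers ignore the robots meta tag.
    $robots = '<meta name="robots" content="noindex,nofollow">';
    if (is_real_googlebot($_SERVER['REMOTE_ADDR'])) {
        $robots = '';   // verified spider: lift the indexing restriction
    }
    echo "<html><head>{$robots}<title>Page title</title></head>";
    echo "<body>…page content…</body></html>";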
He’s not the only one doing that, either. Matthew Inman at SEOmoz noticed del.icio.us doing the same thing last fall, but none of the commenters could understand why… except Bill Slawski, who had seen the presentation at SES where I mentioned the “reverse cloaking” idea. Bill didn’t say much, but he probably understood the whole picture by then.
Crossing our fingers, but…
So far, this defense has held up, and Brad’s site isn’t just back in Google, he’s back at #1, and no longer has to answer questions about why he’s “banned by Google.”
Unfortunately, Brad was only the first person I know who was affected by this bug in Google’s algorithm. He’s far from the last… and I am sick of seeing people get hurt. After more than a year, Google hasn’t fixed the problem, although it seems that you are now more likely to catch a “-999 penalty” than get dropped completely. In my opinion, that’s not a huge improvement.
So I am going public with this, because we need solutions. We need Google to find solutions, instead of calling it a feature, like the 302 redirect bug (which BTW still exists in some form). Alan and I sat on that 302 stuff for almost 2 years before it got out. The result was no different. They still haven’t completely fixed that one – all it takes is a shorter URL and random things can still happen with a 302.
Google needs to hear it loud and clear from every web site owner – “fix this problem.” For any “Googlers” who may happen by and read this, here’s a suggestion I passed along to Vanessa Fox last December: when you retrieve a web page, and the Sitemaps verification meta tag tells you “hey, this tag is on the wrong domain,” then dump that page because you’ve got a proxy. That would at least help a few folks out… but it’s not a complete solution.
Why Is This Even Possible? Can It Be Done To Anyone?
In simple terms, it appears that the original (authentic) page gets dropped or penalized as duplicate content.
A couple years ago, Google deployed some software & infrastructure changes collectively known as “Big Daddy.” This involved crawling from many different data centers, and changes to the crawler itself. It appears that the changes include moving some of the duplicate content detection down to the crawlers. The bug probably arises from the way the data centers are synchronized. Pure speculation here, but the picture I have of what happens looks like this:
- The original page exists in at least some of the data centers.
- A copy (proxy) gets indexed in one data center, and that gets sync’d across to the others.
- A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
- The original is dropped or penalized.
The thing is, even if Google’s system is 99.9% accurate in selecting the right version as authentic, all you have to do is overwhelm it with large numbers. Large numbers of proxies, and/or a large number of times when a spider has to make the right decision. It’s possible that there’s no way to “fix this” without throwing the whole system away. I don’t know. I’m not an engineer.
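To put rough numbers on that: if each decision is right 99.9% of the time, a thousand independent decisions still leave a 1 − 0.999^1000 ≈ 63% chance that at least one of them goes the wrong way… and the wrong way only has to happen once.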
As far as whether “any site” could get hacked, I don’t know. I’m not a black hat. I don’t have a link farm. I don’t have a botnet to spam blogs with. So I can’t manufacture thousands of links to thousands of proxies in an attempt to knock sites off of SERPs. I wouldn’t do that anyway – it’s evil. So what I know is based mostly on sites reporting a problem, blocking the proxies, and seeing the problem disappear after the proxies are gone. Then repeating the exercise with the same results.
It depends on whether it’s all about confusing the system, or if there are enough other factors involved. It’s quite possible that some sites have so much authority, MojoRank, or whatever, that they simply could never be affected. It’s possible that there are negative trust factors, such as large-scale reciprocal linking, that could make a site more vulnerable.
How To Tell If You’ve Been Proxy Hacked
The simplest test, if you are experiencing a problem, is to examine Google search results for a phrase (search term in quotes) that should be unique to your page. For example, if your home page says “Fred’s Widget Factory sells the best down-home widgets on Earth” then you can search for that phrase.
You want to use a phrase (or combination of phrases) that should only appear on your page, and nowhere else on the web… or very few places at least. Then you do the search – if there’s more than one result (your page), then you need to examine the other URLs that are listed. If some of them are delivering an exact copy of the page, you just may be dealing with a proxy that has hijacked your content.
A typical proxy link looks something like this:
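    http://example.com/proxy/http://www.yoursite.com/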
It’s easy to see what URL that would fetch, if example.com were a real proxy. Other proxy URLs encode the target URL so it’s not always that easy to determine what they’re going to fetch just by looking.
The mere presence of proxies in the index doesn’t necessarily mean you’ll be dropped or penalized. The situation inside Google’s systems is no doubt very complex. I have seen sites with multiple proxies indexed, and no ill effects. It’s possible that there are certain factors (trust, authority, domain age, etc.) that make one site more susceptible than another. I have no idea how they make the decision on which copy of a page to keep.
If you discover that you have a problem (pages knocked out of the index, -999 “penalty”), and you can identify proxies as duplicate content, a reinclusion request is likely to work in the short term, while you implement countermeasures. If you don’t mind sharing your information with me so that I can use it for further research, send an email to email@example.com with the affected URL and the URL of the search result page that shows the proxy duplicates, along with any search terms where your ranking appears to be affected.
Cry Havoc, And Let Slip The Dogs of Spam?
I don’t know if publishing this today was the right decision… but it seems keeping quiet isn’t spurring anyone at Google into action. People are already getting nailed by this. I’ve spent the past month going back and forth, trying to decide what to do. I’m going to hit “publish” now, and hope that any attention we can bring to this situation will spur all those Ph.Ds in Mountain View to focus on this for a little while.
Ultimately, this is the same decision anyone faces when they find a security problem in any software: do you try to work with the developer behind the scenes, or inform the community and hope that the community can respond faster than the hackers? As you’ll see below, the community is already responding, and I’m not publishing this without offering some solutions for those who may be affected.
How To Fight Back
There are basically three main possibilities for your situation:
Situation 1: You are running an Apache server. We have two solutions for this case, developed by Jaimie Sirovich (co-author of Professional Search Engine Optimization with PHP). We’ve worked some late nights on this.
Solution #1 uses mod_rewrite and .htaccess to pass all spider requests through a PHP script that validates the request. This only defends against being hacked via “normal” anonymous proxies that pass along the user-agent – it only inspects visits from the “Big 4” search engines (Ask, Google, MSN, and Yahoo). I call this the “first tier” defense – it won’t stop every proxy that exists, but it will come close, and you can implement it without modifying any of your applications. It will even work if your web site is all static pages. This is what I’m implementing. Jaimie doesn’t like it because it’s kind of a hack – and he would rather you didn’t use it at all.
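The actual code is on Jaimie’s blog (see below); as a rough sketch of the idea, the .htaccess side might look something like this (validate.php is a hypothetical name for the validation script):

    # Route anything claiming to be a major spider through a PHP script
    # that validates the visit before the real page is served.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]
    RewriteCond %{REQUEST_URI} !^/validate\.php
    RewriteRule ^(.*)$ /validate.php?uri=$1 [L,QSA]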
Solution #2 is a PHP script that implements the “reverse cloaking” defense, putting a “noindex, nofollow” robots meta tag into your pages unless the visitor is a spider that you have configured the script to recognize. This requires a site that runs PHP, but it wouldn’t be terribly difficult for a competent PHP user to adapt an all-static site: you’d just need to change .htaccess so that your .html files are parsed as PHP. A WordPress plug-in will follow soon. This is a more robust defense, against more proxies.
How to get the code: An implementation guide is provided on Jaimie’s blog, along with a testing environment that you can use to check spider user agents & IP addresses, and of course the source code for both solutions. No warranty is given. This is hard core code for a hard core situation. Don’t use it if you don’t need it, and all code should really be deployed by professionals who can understand what it does, modify it to suit unique environments, etc.
Situation 2: You are running a Microsoft (IIS) server. Jaimie is working on an IIS/ASP solution similar to the Apache/PHP solution, which should be available soon. Think days, not weeks, in other words. Much sooner than his new book (Professional SEO with ASP), which is also in the pipeline.
I want to thank Jaimie for stepping up to provide these solutions on very short notice. I had some code of my own but he’s a real programmer, and I’m just a guy who hacks scripts together when I need something in a hurry. This isn’t a job that should be trusted to a guy who hacks code part time. You want an expert.
Situation 3: You are on a hosted solution, aren’t running PHP scripts that you can edit, don’t control the web server, etc. This is a more complex situation. I will have another post tomorrow that will offer some possible solutions, including one that involves creating your own caching proxy on a separate server. In this case, I don’t recommend doing anything unless you really believe that you have a problem with proxies.
In fact, I have mixed feelings about recommending any “defensive” measures for anyone who isn’t actually being affected… unless losing your Google traffic for a few weeks is such a daunting prospect that you feel you must put up the walls. Just understand – running extra code before you deliver a page will have a cost, in terms of server load and response times. Personally, I am putting up the walls on all of my sites.
Further disclaimer: these solutions are based, at least in part, on information that the search engines have published regarding the right way to validate spider visits. It would be nice if they would publish the information once and then stick by it, but Yahoo gave us instructions shortly after Google did, then they recently changed the domain they crawl from (was inktomisearch.com, now crawl.yahoo.net). Once you start doing this stuff, you have to keep up with what the search engines are doing. I’ll certainly try to keep my subscribers informed, but not everyone gets my newsletter. Keeping up to speed on this stuff is up to you.
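One way to keep up is to isolate the engine-specific data in a single place you can edit, for example a PHP array of reverse-DNS suffixes. The values below are the ones published as of this writing – treat them as something to verify, not gospel:

    <?php
    // Published reverse-DNS domains for spider validation. Yahoo's has
    // already changed once (inktomisearch.com -> crawl.yahoo.net), so
    // expect to update this list over time.
    $spider_domains = array(
        'Googlebot' => array('.googlebot.com', '.google.com'),
        'Slurp'     => array('.crawl.yahoo.net'),
    );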
There are other solutions available. Bill Atchison’s Crawlwall is a professional (commercial) solution that does a lot more to prevent content theft, etc. If you have the means, you may want to consider this instead, and move the burden of “keeping up with the spiders” onto Bill’s shoulders. Jaimie is working on a more general proxy-blocking solution as well. Ekstreme has the beginnings of a spider validation solution in the PHP Search Engine Bot Authentication code they published.
If You Are Operating A Proxy – Don’t Be Part of the Problem
If you are operating a proxy server, and you don’t want to be part of the problem, you can prevent your server from being used as a tool by adding a robots.txt file that prevents all search engine spiders from indexing proxied content through your server. For example, if all proxy URLs begin with /proxy/ then you can use:
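    User-agent: *
    Disallow: /proxy/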
Of course, not all proxies are being run by innocent people for innocent reasons. Some of them are actually designed to hijack content – to deliver ads, etc. Some people want to steal your content, and they want the search engines to index it. In fact, I would not be surprised if a large part of the overall problem is caused by such people firing links at their own proxies.
Is It Just Google?
You got me… I haven’t seen any cases on other engines that looked like a proxy hack, but I’d be surprised if it only affected Google. Google may simply be the only search engine that shows you enough search results to let you “catch” the proxies. Google may be more susceptible because they crawl more URLs more often, and use multiple data centers.
Assuming I am not completely wrong, it looks less like a design flaw and more like an “emergent property” of the very things that make Google the world’s best search engine (just my opinion – apparently the average consumer no longer agrees). I don’t know that there is an easy solution, especially if the problem arises from their multiple-data-center strategy.
Unfortunately, any countermeasures that we implement could be thwarted by someone willing to copy our content in other ways, or by constructing a proxy that spoofs user agents, uses intermediate proxies to hide its IP address, and strips out meta tags. This has always been possible, BTW. Anyone actually doing these things, of course, would likely be committing a crime… and would be a lot easier to find than some script kiddie using comment spam to fire links at someone else’s proxies.
Is It Possible That You Are Totally Wrong, Dan?
Yes, I suppose it’s possible that there is some other explanation, that everything at Google is perfect, etc. But I’ve spent a lot of time looking at this, and it sure looks to me like this is a real problem.
Defend yourselves, folks. It’s a dangerous world.
P.S. I will be discussing this issue with Jim Hedger on Webmaster Radio’s “The Alternative” today, Thursday August 16 – the show airs at 5pm Eastern.
UPDATE: As of May 1, 2008, I have every reason to believe that Google has solved this problem, at least in the general case. At this point, the only sites I can see getting “duped by proxy” are spammier than the proxies themselves.
Update again: September 2009 – damned if this thing hasn’t cropped up again. Now it looks like Google’s replacing the duped URL with the copy’s URL – and even RANKING the duplicates… (similar to the already-known-and-passed-off-as-a-feature 302 redirect bug).