August 16, 2007

Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs

In June of 2006, while working to resolve some indexing issues for a client, I discovered a bug in Google's algorithm that allowed 3rd parties to literally hack a web page out of Google's index and search results. I notified a contact at Google soon after, once I managed to confirm that what we thought we were seeing was really happening.

The problem still exists today, so I am making this public in the hope that it will spur some action.

I have sat on this information for more than a year now. A good friend has allowed his reputation to suffer, rather than disclose what we knew. I continue to see web sites that are affected by this issue. After giving Google more than a year to resolve the issue, I have decided that the only way to spur them to action is to publish what I know.

Disclaimer: What you're about to read is as accurate as it can be, given the fact that I do not work at Google, and have no access to inside information. It's also potentially disruptive to the organic results at Google, until they fix the problem. I hope that publishing this information is for the greater good, but I can't control what others do with it, or how Google responds.

I am also not the only person who knows about this hack.

 

  • Alan Perkins (who along with many others stayed quiet about the 302 redirect bug for 2 years) knew about it the day after I found it.
  • Danny Sullivan has known nearly as long, and I suspect that his behind the scenes efforts are the reason why the major search engines all decided to publish "how to validate our spider" instructions after SES San Jose last year.
  • Bill Atchison knows, because he helped me figure out a defensive strategy for my client's sites… and along with me danced around this issue on the "Bot Obedience" panel at SES last year - trying to warn people without telling them too much.
  • My (now former) client Brad Fallon knows… and he's been subjected to a lot of unfair criticism that he could have easily answered by making this public. It cost him a lot of money, and "a lot of money" to Brad is a lot more than it is for most of us.
  • "Someone else" knows, because they were actively exploiting this bug to knock one of Brad's sites off of Google's SERPs. I suspect many other "black hats" know about it by now… because other sites are being affected. I can't believe that they're all accidents.

This is going to be a long story, I'm afraid… but bear with me, because you need to understand this, and how to defend yourself.

The story begins over a year ago…

My friend Brad Fallon had been having some troubles with Google, and one of his web sites, My Wedding Favors. In June of 2006, after exhausting all of his other options, Brad (who knows his way around SEO) hired me to direct his search marketing efforts and, in simple terms "figure out what the hell is going on with Google."

The first thing I discovered was that he wasn't "banned," but that Google was indexing everything on his site except for the home page. It took about two weeks of research and testing before I developed a working theory. When we searched Google for phrases that should have been completely unique to the My Wedding Favors home page, we kept finding a particular kind of duplicate content: proxies.

For those who don't know what a proxy is, it's a web server that's set up to deliver the content from other web sites. Among other things, proxies have been set up to allow people to surf the internet "anonymously," since the requests come from the proxy server's IP address and not their own. Some of them are set up to allow people to get to content that is blocked by firewalls and URL blocking on corporate, educational, and other networks.

The diagram below shows what this looks like, when used innocently by a human being:

diagram1.png

Unfortunately for Brad, the proxies weren't being used innocently. It wasn't some kid trying to read his Myspace messages at school; it was Googlebot, fetching his home page's content under a different URL, and indexing it:

diagram2.png

When Google fetched the copy of Brad's home page through the proxy URL, they were dropping Brad's (authentic) home page from the index completely, and keeping the (proxy) duplicate instead. Every time it happened, Brad was losing a ton of traffic.

Since at first we thought it could be a "one-off" problem, we just blocked the proxies' IP addresses from accessing our server. Sure enough, within a week or so, My Wedding Favors' home page wasn't just back in Google - it was all the way back on the first page of search results!

All good again? Not so fast…

Another week or so went by, another set of proxies showed up in Google's index, and the home page was once again completely dropped from Google.

At this point, we realized that this wasn't an accident. Someone knew exactly what they were doing. Someone was actively seeking out proxies and linking to them, so that Google would pick them up, and drop Brad's home page.

By now, more technically inclined readers should see where this is going, but keep reading - it gets better (or worse).  If you're just looking for "how to hack Google" instructions, that's all I'm going to give you, so you can leave now.

Back to the story… It was pretty obvious that we had a fight on our hands… but we had some ideas for a solution. Since all of the Googlebot requests from the proxies so far had come through with Googlebot as the user-agent, we implemented a "quick fix" solution. Whenever we got a request from a Googlebot user-agent, we did a lookup on ARIN to see if the IP address was actually assigned to Google.

Another week, and the home page was back in the index, and right back on the first page of SERPs.

Now, I should mention that by this time I had contacted Matt Cutts at Google to let him know what was going on. His response was short, and he told me that he was surprised that this kind of thing could happen, but he did look into it. That was nice of Matt, because it's not really his department, but he's a good guy and actually wants to help webmasters out. I spoke about this with Matt and others on several subsequent occasions. They seemed to understand it, but nobody I talked to could do anything more than pass the word along to someone else.

That was more than a year ago.

I enlisted a few trusted folks to help me investigate, and Bill Atchison and I gave presentations at SES San Jose (August 2006) where we tried to warn people about the need to defend themselves without actually telling them "too much." Since that presentation, other folks have written about the problem of proxies and duplicate content, but fortunately or unfortunately, they didn't know how bad the problem was.

While I was in San Jose, Brad's site got hacked out again (server upgrade broke our self-defense script and it had to be rewritten)…

Brad started getting a little sick of people calling him a "fake SEO expert" because his site was showing PR0, and couldn't be found in SERPs… but he kept quiet and took the abuse, because he understood that this was dangerous information. I kept quiet too, because letting this kind of information out without giving Google a chance to fix the problem would be terribly irresponsible.

Bill kept quiet, as did Alan, Danny, and a few other folks who helped me research the issue. Either the SES San Jose presentations got through to someone, or Danny did something behind the scenes, because shortly after he learned about this, all of the search engines decided to start publishing clear instructions on how to validate their user-agents.

So, things quieted down for a bit. Google was (I thought) working on the problem, and Brad's site was doing fine.

I told you it gets worse, though…

Around the first of October, the next wave of proxies hit: a different kind of proxy, one that didn't pass Googlebot's user-agent along. There was a whole network of proxies that were built to avoid detection, because they were built to allow people inside the People's Republic of China to view censored content without getting blocked by the "great firewall of China." These proxies not only spoof the user-agent, they come in through many other (intermediate) proxies, so that the IP address of the original server cannot be determined.

There was no way to block them by IP address, because even blocking every IP in China didn't catch them all. There was no way to catch them based on the user-agent, because the user-agent was spoofed.

I was expecting this, actually… and we had a solution in the works: reverse cloaking. Every page Brad's web servers deliver now has "noindex, nofollow" in the robots meta tag, unless the request comes from a validated search engine spider.  A "spoofed" proxy visit from Googlebot delivers a page that won't be indexed. A real visit from Googlebot gets the page with "noindex" removed.
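For readers who think better in code, here is a minimal Python sketch of the reverse cloaking idea. It is not the script running on Brad's servers; the function names are made up for illustration, and the `is_verified_spider` flag is assumed to come from whatever spider validation you already trust.

```python
def robots_meta_tag(is_verified_spider: bool) -> str:
    """Return the robots meta tag to embed in the page head.

    Hypothetical helper: the default is to forbid indexing, and the
    restriction is lifted only for a request already validated as a
    real search engine spider.
    """
    if is_verified_spider:
        # Validated spider: serve the page without the noindex tag.
        return ""
    # Everyone else, including a proxy relaying a spider's request,
    # gets a page that asks not to be indexed.
    return '<meta name="robots" content="noindex, nofollow">'


def render_page(body_html: str, is_verified_spider: bool) -> str:
    """Wrap body_html in a bare-bones HTML document (illustration only)."""
    head = "<head>" + robots_meta_tag(is_verified_spider) + "</head>"
    return "<html>" + head + "<body>" + body_html + "</body></html>"
```

The design choice that matters is the default: a request that fails validation for any reason falls through to the "noindex, nofollow" version, so a mistake costs you a crawl rather than handing a proxy an indexable copy.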

He's not the only one doing that, either. Matthew Inman at SEOmoz noticed del.icio.us doing the same thing last fall but none of the commenters could understand why… except Bill Slawski, who had seen the presentation at SES where I mentioned the "reverse cloaking" idea. Bill didn't say much, but he probably understood the whole picture by then.

Crossing our fingers, but…

So far, this defense has held up, and Brad's site isn't just back in Google, he's back at #1, and no longer has to answer questions about why he's "banned by Google."

Unfortunately, Brad was only the first person I know who was affected by this bug in Google's algorithm. He's far from the last… and I am sick of seeing people get hurt. After more than a year, Google hasn't fixed the problem, although it seems that you are now more likely to catch a "-999 penalty" than get dropped completely. In my opinion, that's not a huge improvement.

So I am going public with this, because we need solutions. We need Google to find solutions, instead of calling it a feature, like the 302 redirect bug (which BTW still exists in some form). Alan and I sat on that 302 stuff for almost 2 years before it got out. The result was no different. They still haven't completely fixed that one - all it takes is a shorter URL and random things can still happen with a 302.

Google needs to hear it loud and clear from every web site owner - "fix this problem." For any "Googlers" who may happen by and read this, here's a suggestion I passed along to Vanessa Fox last December: when you retrieve a web page, and the Sitemaps verification meta tag tells you "hey, this tag is on the wrong domain," then dump that page because you've got a proxy. That would at least help a few folks out… but it's not a complete solution.

Why Is This Even Possible? Can It Be Done To Anyone?

In simple terms, it appears that the original (authentic) page gets dropped or penalized as duplicate content.

A couple years ago, Google deployed some software & infrastructure changes collectively known as "Big Daddy." This involved crawling from many different data centers, and changes to the crawler itself. It appears that the changes include moving some of the duplicate content detection down to the crawlers. The bug probably arises from the way the data centers are synchronized. Pure speculation here, but the picture I have of what happens looks like this:

  1. The original page exists in at least some of the data centers.
  2. A copy (proxy) gets indexed in one data center, and that gets sync'd across to the others.
  3. A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
  4. The original is dropped or penalized.

The thing is, even if Google's system is 99.9% accurate in selecting the right version as authentic, all you have to do is overwhelm it with large numbers. Large numbers of proxies, and/or a large number of times when a spider has to make the right decision. It's possible that there's no way to "fix this" without throwing the whole system away. I don't know. I'm not an engineer.
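To put a rough number on the "large numbers" argument, here is a small Python sketch. It assumes every decision is independent and equally accurate, which is a simplification for the sake of arithmetic and not something anyone outside Google knows about their systems; the function name is made up.

```python
def chance_of_at_least_one_error(accuracy: float, decisions: int) -> float:
    """Probability that at least one of `decisions` choices is wrong.

    Assumes each choice is independent and correct with probability
    `accuracy` (an illustrative simplification).
    """
    # All decisions are right with probability accuracy ** decisions,
    # so at least one is wrong with the complement of that.
    return 1.0 - accuracy ** decisions


if __name__ == "__main__":
    for n in (1, 100, 1000, 5000):
        p = chance_of_at_least_one_error(0.999, n)
        print(f"{n:>5} decisions -> {p:.1%} chance of at least one mistake")
```

Under those assumptions, a system that is right 99.9% of the time still has roughly a 63% chance of getting at least one call wrong over 1,000 decisions.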

As far as whether "any site" could get hacked, I don't know. I'm not a black hat. I don't have a link farm. I don't have a botnet to spam blogs with. So I can't manufacture thousands of links to thousands of proxies, in an attempt to knock sites off of SERPs. I wouldn't do that anyway - it's evil. So what I know is based mostly on sites reporting a problem, blocking the proxies, and seeing the problem disappear after the proxies are gone. Then repeating the exercise with the same results.

It depends on whether it's all about confusing the system, or if there are enough other factors involved. It's quite possible that some sites have so much authority, MojoRank, or whatever, that they simply could never be affected. It's possible that there are negative trust factors, such as large-scale reciprocal linking, that could make a site more vulnerable.

How To Tell If You've Been Proxy Hacked

The simplest test, if you are experiencing a problem, is to examine Google search results for a phrase (search term in quotes) that should be unique to your page. For example, if your home page says "Fred's Widget Factory sells the best down-home widgets on Earth" then you can search for that phrase.

You want to use a phrase (or combination of phrases) that should only appear on your page, and nowhere else on the web… or very few places at least. Then you do the search - if there's more than one result (your page), then you need to examine the other URLs that are listed. If some of them are delivering an exact copy of the page, you just may be dealing with a proxy that has hijacked your content.

A typical proxy link looks something like this:
www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/
It's easy to see what URL that would fetch, if example.com were a real proxy. Other proxy URLs encode the target URL so it's not always that easy to determine what they're going to fetch just by looking.
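If you have a list of suspicious URLs from a search result, a few lines of Python can pull the target out of the readable kind. This sketch assumes the layout seen in the sample above (script name, then a flags segment, then the scheme, then host and path); that layout is inferred from the one example, the function name is made up, and it will not help with proxies that encode the target.

```python
from typing import Optional


def target_from_proxy_url(proxy_url: str,
                          script_name: str = "nph-proxy.pl") -> Optional[str]:
    """Recover the proxied URL from a readable proxy link, or None.

    Assumed layout (from the sample link only):
        <proxy host>/<script_name>/<flags>/<scheme>/<host and path>
    """
    marker = "/" + script_name + "/"
    _, found, rest = proxy_url.partition(marker)
    if not found:
        return None  # Not a link through this proxy script.
    parts = rest.split("/", 2)  # flags, scheme, host-and-path
    if len(parts) < 3:
        return None  # Too short to contain a target.
    _flags, scheme, remainder = parts
    return scheme + "://" + remainder


if __name__ == "__main__":
    sample = "www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/"
    print(target_from_proxy_url(sample))
```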

The mere presence of proxies in the index doesn't necessarily mean you'll be dropped or penalized. The situation inside Google's systems is no doubt very complex. I have seen sites with multiple proxies indexed, and no ill effects. It's possible that there are certain factors (trust, authority, domain age, etc.) that make one site more susceptible than another. I have no idea how they make the decision on which copy of a page to keep.

If you discover that you have a problem (pages knocked out of the index, -999 "penalty"), and you can identify proxies as duplicate content, a reinclusion request is likely to work in the short term, while you implement countermeasures. If you don't mind sharing your information with me so that I can use it for further research, send an email to proxyreports@gmail.com with the affected URL and the URL of the search result page that shows the proxy duplicates, along with any search terms where your ranking appears to be affected.

Cry Havoc, And Let Slip The Dogs of Spam?

I don't know if publishing this today was the right decision… but it seems keeping quiet isn't spurring anyone at Google into action. People are already getting nailed by this. I've spent the past month going back and forth, trying to decide what to do. I'm going to hit "publish" now, and hope that any attention we can bring to this situation will spur all those Ph.Ds in Mountain View to focus on this for a little while.

Ultimately, this decision was the same one that anyone faces, who finds a security problem with any software - do you try to work with the developer behind the scenes, or inform the community and hope that the community can respond faster than the hackers? As you'll see below, the community is already responding, and I'm not publishing this without offering some solutions for those who may be affected.

How To Fight Back

There are basically three main possibilities for your situation:

Situation 1: You are running an Apache server. We have 2 solutions in this case, that were developed by Jaimie Sirovich (co-author of Professional Search Engine Optimization with PHP). We've worked some late nights on this.

Solution #1 uses mod_rewrite and .htaccess to pass all spider requests through a PHP script that validates the request. This will only defend against being hacked via "normal" anonymous proxies that pass along the user agent - it only inspects visits from the "Big 4" search engines (Ask, Google, MSN, and Yahoo). I call this the "first tier" defense - it won't stop every proxy that exists, but it will come close, and you can implement it without modifying any of your applications. It will even work if your web site is all static pages. This is what I'm implementing. Jaimie doesn't like it because it's kind of a hack - and he would rather you didn't use it at all.

Solution #2 is a PHP script that implements the "reverse cloaking" defense, putting a "noindex, nofollow" robots meta tag into your pages unless the request comes from a spider that you have configured the script to recognize. This will only be possible if your site is built on PHP. It wouldn't be terribly difficult for a competent PHP user to implement this in an all-static site; you'd just need to change .htaccess so that your .html files are parsed as PHP. A Wordpress plug-in will follow soon. This is a more robust defense, against more proxies.

How to get the code: An implementation guide is provided on Jaimie's blog, along with a testing environment that you can use to check spider user agents & IP addresses, and of course the source code for both solutions. No warranty is given. This is hard core code for a hard core situation. Don't use it if you don't need it, and all code should really be deployed by professionals who can understand what it does, modify it to suit unique environments, etc. 

Situation 2: You are running a Microsoft (IIS) server. Jaimie is working on an IIS/ASP solution similar to the Apache/PHP solution, which should be available soon. Think days, not weeks, in other words. Much sooner than his new book (Professional SEO with ASP), which is also in the pipeline.

I want to thank Jaimie for stepping up to provide these solutions on very short notice. I had some code of my own but he's a real programmer, and I'm just a guy who hacks scripts together when I need something in a hurry. This isn't a job that should be trusted to a guy who hacks code part time. You want an expert.

Situation 3: You are on a hosted solution, aren't running PHP scripts that you can edit, don't control the web server, etc. This is a more complex situation. I will have another post tomorrow that will offer some possible solutions, including one that involves creating your own caching proxy on a separate server. In this case, I don't recommend doing anything unless you really believe that you have a problem with proxies.

In fact, I have mixed feelings about recommending any "defensive" measures for anyone who isn't actually being affected… unless losing your Google traffic for a few weeks is such a daunting prospect that you feel you must put up the walls. Just understand - running extra code before you deliver a page will have a cost, in terms of server load and response times. Personally, I am putting up the walls on all of my sites.

Further disclaimer: these solutions are based, at least in part, on information that the search engines have published regarding the right way to validate spider visits. It would be nice if they would publish the information once and then stick by it, but Yahoo gave us instructions shortly after Google did, then they recently changed the domain they crawl from (was inktomisearch.com, now crawl.yahoo.net). Once you start doing this stuff, you have to keep up with what the search engines are doing. I'll certainly try to keep my subscribers informed, but not everyone gets my newsletter. Keeping up to speed on this stuff is up to you.
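To give a feel for what "validating a spider visit" involves, here is a rough Python sketch of one common form of that advice: do a reverse DNS lookup on the visiting IP, check that the hostname belongs to the engine's crawl domain, then do a forward lookup to confirm the hostname really maps back to that IP. This is not Jaimie's code. The suffix list is an assumption (crawl.yahoo.net is the domain mentioned above; googlebot.com is added for illustration) and the function names are made up, so check each engine's current instructions before relying on any of it.

```python
import socket

# Assumed crawl-domain suffixes, for illustration only. Engines change
# these, so treat the list as something you must maintain yourself.
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net")


def reverse_lookup(ip):
    """Return the reverse-DNS hostname for ip, or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses for hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_spider(ip, reverse=reverse_lookup, forward=forward_lookup,
                       suffixes=TRUSTED_SUFFIXES):
    """True only if ip reverse-resolves to a trusted crawl domain AND
    that hostname forward-resolves back to the same ip."""
    host = reverse(ip)
    if not host:
        return False
    host = host.lower().rstrip(".")
    if not host.endswith(suffixes):
        return False
    # The forward check matters: whoever controls an IP block can set
    # its reverse DNS to any name they like.
    return ip in forward(host)
```

The lookups are passed in as parameters so the logic can be tested without touching the network. In production you would also want to cache results, since two DNS lookups on every spider request is exactly the kind of overhead that hurts response times.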

There are other solutions available. Bill Atchison's Crawlwall is a professional (commercial) solution, that does a lot more to prevent content theft, etc. If you have the means, you may want to consider this instead, and move the burden of "keeping up with the spiders" onto Bill's shoulders. Jaimie is working on a more general proxy-blocking solution as well. Ekstreme has the beginnings of a spider validation solution in the PHP Search Engine Bot Authentication code they published.

If You Are Operating A Proxy - Don't Be Part of the Problem

If you are operating a proxy server, and you don't want to be part of the problem, you can prevent your server from being used as a tool by adding a robots.txt file that prevents all search engine spiders from indexing proxied content through your server. For example, if all proxy URLs begin with /proxy/ then you can use:

User-agent: *
Disallow: /proxy/
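If you want to confirm that a rule like this does what you expect before deploying it, Python's standard `urllib.robotparser` module can check it offline. A minimal sketch; the host name `proxy.example` and the helper name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /proxy/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    proxied = "http://proxy.example/proxy/http/www.example.org/"
    normal = "http://proxy.example/about.html"
    print("proxied page crawlable:", is_crawlable(ROBOTS_TXT, "Googlebot", proxied))
    print("normal page crawlable:", is_crawlable(ROBOTS_TXT, "Googlebot", normal))
```

Keep in mind that robots.txt only restrains spiders that choose to obey it, which includes the major search engines but not scrapers.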

Of course, not all proxies are being run by innocent people for innocent reasons. Some of them are actually designed to hijack content - to deliver ads, etc. Some people want to steal your content, and they want the search engines to index it. In fact, I would not be surprised if a large part of the overall problem is caused by such people firing links at their own proxies.

Is It Just Google?

You got me… I haven't seen any cases on other engines that looked like a proxy hack, but I'd be surprised if it only affected Google. Google may simply be the only search engine that shows you enough search results to let you "catch" the proxies. Google may be more susceptible because they crawl more URLs more often, and use multiple data centers.

Assuming I am not completely wrong, it sure looks like less of a design flaw, and more of an "emergent property" of the very things that make Google the world's best search engine (just my opinion, apparently the average consumer no longer agrees). I don't know that there is an easy solution, especially if the problem arises because of their multiple-data-center strategy.

Unfortunately, any countermeasures that we implement could be thwarted by someone willing to copy our content in other ways, or by constructing a proxy that spoofs user agents, uses intermediate proxies to hide its IP address, and strips out meta tags. This has always been possible, BTW. Anyone actually doing these things, of course, would likely be committing a crime… and would be a lot easier to find than some script kiddie using comment spam to fire links at someone else's proxies.

Is It Possible That You Are Totally Wrong, Dan?

Yes, I suppose it's possible that there is some other explanation, that everything at Google is perfect, etc. But I've spent a lot of time looking at this, and it sure looks to me like this is a real problem.

Defend yourselves, folks. It's a dangerous world.

P.S. I will be discussing this issue with Jim Hedger on Webmaster Radio's "The Alternative" today, Thursday August 16 - the show airs at 5pm Eastern.

UPDATE: As of May 1, 2008, I have every reason to believe that Google has solved this problem, at least in the general case. At this point, the only sites I can see getting "duped by proxy" are spammier than the proxies themselves.

Filed under Blog by Dan Thies


Comments on Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs »

How Proxy Hacking Can Hurt Your Rankings & What To Do About It…

Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs by Dan Thies gives us a detailed look at the serious dangers of proxy hacking. Dan's detailed article shows the history on how he discovered the issue. He then goes into wh…

[…] what's going on? Dan Thies has a summary up about it, and he'll probably do a much better job explaining it since he's not a programmer at […]

Hamlet Batista @ 12:14 pm

Dan - I've blogged about this issue a couple of times. Please read negative SEO counter measure #8 here http://hamletbatista.com/2007/07/16/you%e2%80%99ve-won-the-battle-but-not-the-war-10-ways-to-protect-your-site-from-negative-seo/

I also expose one critical weakness of the bot validation method and propose a stronger solution here http://hamletbatista.com/2007/07/03/the-never-ending-serps-hijacking-problem-is-there-a-definite-solution/

Both posts are very technical but I think they are not difficult to follow.

Brad Masterson @ 12:19 pm

Wow, Dan, you're really brave to put up this article! I can imagine all the conflicting interests that would want you to keep this issue quiet. In the end, you're the guy the webmasters trust to keep the SEO world honest!

Jonathan Hochman @ 12:20 pm

If Google implements "trusted webmasters," they can generate a tag that we apply to every page on our site. That tag would encode the domain name, allowing Google to validate that our copy of a site is legit, and that all other copies are bogus. Once a domain is registered to a particular webmaster, no other can register it. This could become an additional feature of Google Webmaster Tools. The main challenge is deciding who can be trusted, but that's not insurmountable.

Patricia Skinner @ 12:26 pm

Thanks for sharing this Dan. I think it's better out in the open than simmering away in the background. Kudos to you for having the courage to inform us all.

Hey Dan,

great post and thanks for the summary of the initial problem. I appreciate this as I'm actually actively fighting the "google bowlers via proxy site" hackers for the last 2 weeks already..

What your post fails to EMPHASIZE is that both of your solutions will only get rid of those proxies that ID as googlebot & co.

But as you said for yourself - the big wave bowling sites out of the SERPs come from proxy sites that "cloak" the user agent … or better - they pass on the user agent from their visitor to your site…

and NO, they are NOT all located in China (what makes you think that?)

I actually discussed this with Bill (IncrediBill) last week on his blog where he mentioned that the only way would be to block all hosting centers in the world from crawling… something pretty aggressive to do…

(see

http://incredibill.blogspot.com/2007/07/google-proxy-hijacking-myths-urban.html

)

So, apart from blocking ALL datacenters in the world and going thru the SERPs looking for proxies, what do you suggest to cure this mess?

best regards
Christoph C. Cemper
- the marketingfan

Hamlet Batista @ 12:43 pm

Why Is This Even Possible? Can It Be Done To Anyone?

The main problem is the way Google chooses the authentic page among duplicates. From Webmaster Central Blog: http://googlewebmastercentral.blogspot.com/2007/06/duplicate-content-summit-at-smx.html

Providing a way to authenticate ownership of content…We currently rely on a number of factors such as the site's authority and the number of links to the page.

According to this, hijackers only need to install their cgi-proxies on domains with more inbound links and/or authority than ours.

Dennis @ 12:43 pm

Dan,

Just reading your post gives me a massive headache!

Seriously though, even if I don't understand all the nitty-gritty details, this kind of straight-talk earns you a few more notches on the respect scale.

The lesson for me here is: never to put all your eggs in one basket. In this case, Google's search engine. This is not a good 80/20 scenario.

Now I'm beginning to appreciate all the buzz about Web 2.0, which are real people's votes vs. Google's massive algorithm.

Great work!

Hamlet,

exactly… this is what happened to one of my sites,
and that damn proxy site has a WIKIPEDIA link,
while mine does not… this shows that even those wikipedia links which are nofollowed now seem to transfer trust

christoph

TheMadHat @ 1:04 pm

Dan,

I've personally been nailed by this tactic and have had pages replaced by proxy pages. I have implemented similar patches but that seems to be all they are because they're usually only temporary. I've reported to Google multiple times on this issue as well.

One thing I've noticed is it normally only happens to pages that have fewer inbound links and lower PR, but your example seems to void that theory as I'm familiar with Brad and know his wedding site has a very strong link profile.

Aaron

Terry Alexander @ 1:10 pm

Nice post. Sucks to be honest and try to work within the system and be nice about things. My hat's off to you for your patience. Me, I'd have posted it long ago.

I just don't have that much patience. I know it's also very hard to hint to people about a problem fix while keeping the actual problem hidden. I don't have that much patience either.

Good job.

randfish @ 1:14 pm

That's a problem that needs addressing - Sphinn it here - http://sphinn.com/story/3092 - I'll try to mention it on the blog in the next day or two.

(Trackback)

SiteProNews Blog @ 1:15 pm

Method for Getting a Site Banned Using Google’s Own Index…

Earlier today, legendary SEO Dan Thies released a potentially deadly Google Hack he claims to have shown Google engineers one year ago this week. By using cached documents found in Google’s own index, Dan discovered a method to fool Googlebot into th…

Dan Thies @ 1:15 pm

Let's see, where to start… how about Christoph:

Actually, the reverse cloaking solution does deal with proxies that don't identify the user agent, because the only requests that get a page without the "noindex, nofollow" are those that:
1) Identify as spiders
2) Pass a "valid IP address" test

And I never said they were all in China - just that the set of proxies that got Brad were.

Hamlet, I don't see how what you're suggesting is stronger, but it looks like maybe you didn't read everything either.

Dan Thies @ 1:18 pm

Aaron (TheMadHat) - yeah… we're talking about a site that went all the way (back) to the top of the SERPs when we implemented the reverse cloaking solution.

Jonathan, I think even starting with the domain registration that already exists in Webmaster Tools would be better than nothing.

Sam Haynes @ 1:19 pm

As a newbie trying to make enough bucks to augment Social Insecurity checks I can appreciate the efforts of people like Dan who try to keep the playing field level, even if it does belong to Goooogle. I give complete moral support to anyone who works to thwart "Proxy Hackers".

My area to exploit is the myriad small, no-account, puny commission paying small businesses. They, like me, aren't trying to get rich quick; quite the contrary, we will work hard for small checks received thousands of times, simply by linking unique searchers to unique providers. We do this by using non hostile SEO tactics…. The cheap ones, of course.

To quote my hero and mentor, Forrest Gump, "That's all I got to say about that."

Sam
PATHFINDERS 2007

Shimo @ 1:25 pm

Arghh,

One of my websites dropped drastically in traffic a couple of months ago. I have not touched the site for quite some time (no improving link popularity, etc), so this came as a surprise to me. I read this article and did not find any proxy URLs. However, I did notice that there are many spam websites that contain my content. Could this be the reason for the drop in Google's ranks?

Dave @ 1:43 pm

This is worrying information indeed.

I am lucky enough to be in a market where the vast majority of the competition are not very technically inclined - I would be surprised if they know what a nofollow tag is.

For everyone in a semi competitive market - fireproof yourself asap.

We are only going to see more of these cases before Google decides to fix the problem.

(Trackback)

Marketingfan.com @ 1:54 pm

Say goodbye to your traffic - Google Bowling via Proxy - current blackhat tactic…

Wave Shoppe @ 1:57 pm

I will assume that Matt and company will address this issue. Could be a victory for the little guys with white hats?

“If Google implements "trusted webmasters," they can generate a tag that we apply to every page on our site.” Jonathan, that concept certainly has my meager vote.

Dan,

thanks for getting back to me on my assumption your solution wouldn't work generally… and mea culpa - I'm sorry.

I now agree that this method should work for most proxy sites (some will get around this even though, so they need to be blacklisted in another strategy)

I thought I'd add to this great post by showing my own company site, which is currently suffering from this attack method - and had been for 6+ weeks before I found out about it…

I put together all the facts, keywords and SERP screenshots here

http://www.marketingfan.com/search-engines/google-proxy-bowling

and would appreciate your comments or other ways to cure it faster - because obviously a spam report to Google 2 weeks ago or even blocking all those scumbags' IPs that we found were used for scraping us didn't show any effect.

I'll move forward to implement your solution#2 ASAP

Thanks again and cheers
Christoph
- the marketingfan.com

Lee @ 1:59 pm

I've been waiting to hear back from Matt and/or Adam to see what they say about this. Hopefully one of them will post a response here.

Dan Thies @ 2:05 pm

Thanks for updating that, Christoph. So far, the reverse cloaking method has held up on every site we've implemented it with. Jaimie's code is (IMO) actually a lot more reliable than what we've been using.

He's actually suggested another method: since the reverse cloaking script itself proxies the page, it would be possible to implement that using the .htaccess method and insert the robots meta tag into any content without having to modify existing code. For the same reasons (it's kind of a hack) he doesn't like it much. It would also be a heck of a lot of server overhead.

Technically speaking it would be possible to check every IP, but as Hamlet pointed out, that would be a major bit of overhead to add to every single request.

Dan Thies @ 2:07 pm

Wave, as I suggested to Google last year… they already have a meta tag that they use to verify the ownership of a domain. They use it to give you access to reports in Webmaster Tools.

Wouldn't be too hard to put that on every page, if Google would use it.

Hey Dan!

I think I know why I was misled to think that this wouldn't work for the proxies that pass the normal user agents…

Egghead's description to implement this in htaccess says

RewriteCond %{HTTP_USER_AGENT} yahoo|slurp|msn|ask|google|gsa [NC]
RewriteRule (^.*$) proxy.php?orig_url=$1

Well, and that's the flaw … this mod_rewrite condition only accepts the major 4 SEs as you said and I thought he would pass this into the "solution 2" script…

but he omitted the proxy.php completely from the post, which made me think his simple_cloak_v2.inc was meant to be the proxy.php to pass all requests into

I guess I'll just mod this to pass ALL requests thru the simple_cloak and make sure I don't have to tweak any pages..

Though one concern I've got is about the output buffering… I think this might cause problems on sites that ALREADY USE output buffering though… but I'm sure Jamie can comment more on that…

cheers,
Christoph
- the marketingfan.com

Dan Thies @ 2:25 pm

Yeah, I get ya. They're two completely different approaches. When he wakes up we'll ask him to address the confusion on his post. He was up until 5am running test cases on the reverse cloaking script.

Like I said, Jaimie doesn't like the .htaccess method at all. I only asked him to do it because it provides at least partial protection to static sites.

The reverse cloaking method doesn't use .htaccess at all.

Dan Thies @ 2:29 pm

To clarify I hope:

Solution #1 says "if you claim to be a spider, we force you to prove it before we give you content."

Solution #2 says, "unless you claim to be a spider and can prove it, we insert a robots meta tag with noindex."

Solution #1 will effectively deal with "normal" proxies that pass along the user agent.

Solution #2 deals with proxies that spoof the user agent.
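
A minimal Python sketch of the two rules as summarized above. This is not Jaimie's PHP code: the function names, the hostname suffixes, and the user agent token are assumptions for illustration, and each search engine's own "how to validate our spider" instructions should be checked for the hostnames it really uses.

```python
import socket

# Assumed values for illustration only; extend per engine documentation.
SPIDER_SUFFIXES = (".googlebot.com", ".google.com")
SPIDER_TOKENS = ("googlebot",)

NOINDEX_TAG = '<meta name="robots" content="noindex,nofollow">'


def claims_spider(user_agent):
    """True if the user agent string claims to be a known spider."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SPIDER_TOKENS)


def is_verified_spider(ip, reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Reverse lookup, hostname suffix check, then forward lookup back to the IP."""
    try:
        host = reverse(ip)[0].lower()
        if not host.endswith(SPIDER_SUFFIXES):
            return False
        return ip in forward(host)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def solution_1(ip, user_agent, verify=is_verified_spider):
    """Claimed spiders must prove it: return 403 if they cannot, else 200."""
    if claims_spider(user_agent) and not verify(ip):
        return 403
    return 200


def solution_2(html, ip, user_agent, verify=is_verified_spider):
    """Insert a noindex robots meta tag unless the client is a proven spider.

    Naive insertion: only a literal lowercase <head> tag is handled.
    """
    if claims_spider(user_agent) and verify(ip):
        return html
    return html.replace("<head>", "<head>" + NOINDEX_TAG, 1)
```

The `reverse` and `forward` parameters exist only so the DNS calls can be swapped out in tests; in normal use the defaults from the standard `socket` module apply.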

[…] Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs by Dan Thies gives us a detailed look at the serious dangers of proxy hacking. […]

Dan,

I believe either I'm too tired or you are still wrong about solution #1

IF sol#1 would deal with "proxies that pass along the user agent" it would do the same as sol#2 …

in fact

Solution #1 will effectively deal with proxies that PRETEND TO BE A BOT (but cannot prove it)

I meanwhile went ahead and wanted to implement Jamie's piece of code, but these implementation pieces are also missing:

- a config.inc.php (which could be reconstructed by a coder after doing a review)

- a cron'ed call to update the spider list

WARNING:

the CURRENT IMPLEMENTATION as listed on Jamie Egghead's site ACTIVELY causes all SITES TO GET DEINDEXED because those mysql tables are empty if nobody calls updateAll() in his code..

So beware fellow webmasters and wait until Jamie had his coffee … but obviously a mistake I can fully understand if you hang in testing mode until 5am and then still need to post all that stuff you made on your blog… :-)

cheers
Christoph
- the MarketingFan.com

Hamlet Batista @ 2:40 pm

"Hamlet, I don't see how what you're suggesting is stronger, but it looks like maybe you didn't read everything either."

Dan - Thanks for responding. As you correctly mentioned in your post, there are obvious weaknesses for a programmer to exploit your proposed solutions.

1) reverse-forward dns for bot detection. As you expressed in your post the proxy can be modified to provide another user agent instead of Googlebot.
2) reverse cloaking. I like the idea and I can share another alternative implementation, but the proxy can be modified to strip the robots meta tag or the X-Robots-Tag header.

Alternate reverse cloaking: Setting the new Googlebot supported header X-Robots-Tag: noindex,nofollow if the requesting IP fails the bot validation.

I explained an easy way to implement this with mod_headers and mod_setenvif here: http://hamletbatista.com/2007/08/01/controlling-your-robots-using-the-x-robots-tag-http-header-with-googlebot/

This solution has the advantage that it works for any type of file, not just html ones.
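
The linked write-up uses Apache's mod_headers and mod_setenvif. As a rough illustration of the same idea outside Apache, here is a small Python WSGI middleware sketch. The name `noindex_unless_verified` and the `is_verified` callback are hypothetical; the callback stands in for whatever bot validation the site already performs.

```python
def noindex_unless_verified(app, is_verified):
    """Wrap a WSGI app so unverified clients get an X-Robots-Tag header.

    is_verified: hypothetical callback taking the client IP and returning
    True only when the IP passed bot validation.
    """
    def wrapped(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")

        def start(status, headers, exc_info=None):
            if not is_verified(ip):
                # Copy the header list and append the noindex instruction.
                headers = list(headers) + [("X-Robots-Tag", "noindex,nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, start)

    return wrapped
```

Because the header travels with the HTTP response rather than inside the markup, it applies equally to PDFs, images, and other non-HTML files.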

What is the solution I am proposing and why is it stronger?

I'm proposing we use the same techniques that have been in use for some time in the email anti-spam industry: Identifying and blocking IPs that have been successfully tagged as a source of spam. We need to identify and block cgi hijackers.

1. We can identify cgi-proxies by inserting a unique fingerprint+requesting IP into all pages when the IP is not from a search engine bot. We can later do a search for the fingerprint to find the cgi proxy IPs. I know this works because I use this technique to track people scraping my RSS feed.
2. We can then publish those IPs in The HoneyPot Project DNS database http://www.projecthoneypot.org/httpbl.php
3. We then block access to our web pages to any IPs listed there.

It is definitely not trivial (there is some programming involved), and is a little bit reactive, but I think this is a good starting point.
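
A minimal Python sketch of step 1, assuming IPv4 addresses only. The token format, the `SECRET` value, and both function names are made up for illustration. Signing the IP with a site secret is an addition not described above; it is there so that a third party cannot plant fake fingerprints that blame someone else's IP.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical per-site secret, never published


def fingerprint(ip, secret=SECRET):
    """Return a searchable token that encodes the requesting IPv4 address."""
    tag = hmac.new(secret, ip.encode("ascii"), hashlib.sha256).hexdigest()[:16]
    return "fp-%s-%s" % (ip.replace(".", "-"), tag)


def parse_fingerprint(token, secret=SECRET):
    """Return the IP encoded in a token found on a proxy page, or None if invalid."""
    parts = token.split("-")
    if len(parts) != 6 or parts[0] != "fp":
        return None
    ip = ".".join(parts[1:5])
    # Recompute the signature and compare in constant time.
    if hmac.compare_digest(fingerprint(ip, secret), token):
        return ip
    return None
```

The token would be written somewhere in the page body for non-spider visitors; searching for the `fp-` prefix later turns up the proxy copies, and `parse_fingerprint` recovers the IP that fetched each one.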

mind @ 2:52 pm

sitting on a problem for a year or two is irresponsible. the proper thing to do would have been to re-notify google after a month or so of hearing nothing, that you intend to release the vulnerability, with full instructions on how to do it, in two months. i'm multiplying these times by 4 for you, so you don't see so harsh, and because perhaps a conceptual vulnerability like this is a bit harder to fix than a simple buffer overflow. often times companies don't fix something that isn't high priority. by giving them a bit of time, you're being nice and giving them a shot at fixing before public release. but after 3 months (which would normally be weeks), if they haven't done anything about it, they obviously aren't working on it, they just don't care. by keeping this thing private (and you somewhat have, given that you don't say exactly how to do it), you're only helping the people who know how to do it, and are using it to actively hurt competitors. the longer it is secret knowledge, the longer it will work, and the better off they will be

Dan Thies @ 2:52 pm

Thanks for the clarity, Hamlet.

What you propose does need an implementation, but it is stronger. Jaimie is actually working on a "block all proxies" method. I believe Bill Atchison is already doing the same thing with Crawlwall.

A published database of proxies (assuming it can't be maliciously polluted) would add a stronger layer of defense.

It's a given that one could construct a proxy that would spoof user agents, strip headers and meta tags, etc. - so stronger solutions are needed.

Steve @ 2:59 pm

Solution #2 has a flaw - a malicious proxy could simply remove the robots meta tag before it sends the page back through to Google.

Solution #1 also has the problem that you need to keep a very good track of which IP addresses are valid, otherwise you may unintentionally block legitimate searchbot spidering.

Dan Thies @ 2:59 pm

I appreciate your feelings on this, Mind… but that's easy to say when you don't have to make the decision.

Right now, at this very moment, there are people out there trying to make use of this exploit, who didn't know it existed yesterday. Implementing defenses takes time. Starting up a comment-spamming script takes seconds.

After seeing this exploited once, I didn't actually see it again for some time. I believed that Google was working on it… and as I said, I have spoken with several people more than once during the interval.

Trust me though, next time, I will just publish what I know as soon as I know it. Lesson learned.

Steve @ 3:00 pm

(Sorry, I didn't realise that had already been mentioned in these comments)

Dan Thies @ 3:04 pm

Steve,

1) We KNOW that a malicious proxy could be constructed. I've said as much. :D But I haven't seen one in the wild yet… and solving THAT problem is a lot bigger. One of the reasons why this exploit is so "nice" for black hats is that it's totally hands off - they don't have to do anything but point links at other people's proxies. If they build one that strips meta tags and headers, they have to host it somewhere, and then there's at least a chance of tracking them down.

2) The forward and reverse DNS lookup is what the search engines have given us to use.

[…] Google proxy hacking can get you dropped from the index. A what, what in the wha..? (SEO Fast Start) […]

fthead9 @ 3:42 pm

Dan and all the comment writers, you've left my head spinning but I'm very thankful nonetheless for your insights. Certainly something I'll keep a vigilant eye on from now on.

Dan Thies @ 3:45 pm

Thanks, Toren…

I see that Jamie has updated the instructions so that the .htaccess implementation makes sense.

Shimo @ 3:47 pm

I browsed my tracking system logs (01-16 August), and found them:
69.89.21.71
72.232.150.250
208.110.218.138
208.110.218.139
208.110.218.201

Crawled my pages a lot, all have useragent:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My solution for now is making list in my .htaccess

Deny from 69.89.21.71
Deny from 72.232.150.250
Deny from 208.110.218.138
Deny from 208.110.218.139
Deny from 208.110.218.201
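
A list like this could also be generated from the logs instead of by hand. A minimal Python sketch, assuming the log has already been parsed into (ip, user_agent) pairs; `deny_lines` and the `is_verified` callback are hypothetical names, with the callback standing in for a forward and reverse DNS check.

```python
def deny_lines(hits, is_verified):
    """Build .htaccess 'Deny from' lines for IPs that fake a Googlebot user agent.

    hits: iterable of (ip, user_agent) pairs already parsed from a log.
    is_verified: hypothetical callback, True if the IP passes the DNS check.
    """
    fake = {
        ip
        for ip, user_agent in hits
        if "googlebot" in user_agent.lower() and not is_verified(ip)
    }
    return ["Deny from " + ip for ip in sorted(fake)]
```

Only IPs that claim to be Googlebot and fail verification are listed, so ordinary visitors and the real spider are left alone.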

Hamlet Batista @ 3:48 pm

Dan - Blocking all proxies has many potential side effects (there are legitimate uses for cgi proxies, such as Anonymizer, etc.). On the other hand, there are many advantages if we use http:BL for sharing the cgi proxy IPs. The main one is that it has already been proven and has an active community behind it. We can probably approach them and they might be willing to help.

Jaimie and Bill can get in touch with me if they want to. I will contribute in any way that I can.

Dan Thies @ 4:00 pm

Yeah, blocking them all is extreme, and that's why I asked Jaimie to wait on that one in favor of reverse cloaking. Reverse cloaking has held up so far…

But I agree that we need to have several options, because giving up anonymous visitors might be worth it, if you get hit and Google's still struggling with it.

Anyone who wants to implement a header & meta stripping proxy risks being found, because they have to host it somewhere.

Bill Atchison @ 4:33 pm

Dan,

As I mentioned in a private email I think you're running into 2 distinctly different problems.

a) Sites that crawl and cache your content that are then indexed in Google and,

b) Google crawling through a proxy

Unfortunately the results are the same if Google is allowed to crawl that proxy cache.

The reason I say this is that I have many high speed crawl attempts from China and HK all the time for many thousands of pages that they cache locally on their servers.

I'm positive this activity isn't Google via a proxy just because of the sheer speed alone, which can be 100s of pages in just a few seconds.

A couple of other things…

I don't use the lists of known proxies anymore as they vanish quickly and can be gamed. Instead I use other techniques that can usually identify most open proxies before more than a couple of pages are stolen. This can be done with post page processing by opening a direct socket to the most common proxy ports for that IP to see if it's an open proxy. By doing it post page processing there's no page latency noticed and, assuming you find an open proxy, you block it within a few page accesses.
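
The port probe described here might look roughly like the following Python sketch. The port list and the function name are assumptions, not Bill's actual setup. An accepted connection only shows that the port is open; confirming a real open proxy would need a test request relayed through it.

```python
import socket

# Assumed list of commonly used proxy ports, for illustration only.
COMMON_PROXY_PORTS = (80, 3128, 8000, 8080)


def open_ports(ip, ports=COMMON_PROXY_PORTS, timeout=1.0):
    """Return the subset of ports on ip that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                found.append(port)
        except OSError:  # refused, unreachable, or timed out
            pass
    return found
```

Run after the page has been sent, as described above, the probe adds no latency for the visitor; the short timeout keeps a filtered port from stalling the check.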

@ Wave:
"I will assume that Matt and company will address this issue."

Never assume as this problem has been going on for years and I've even discussed it with a couple of the Googlers personally.

Either a) they don't care which I find hard to believe or b) they don't think it's that big of a problem for most sites as it typically isn't or c) something is wrong at the core of Google that makes this very difficult, if not impossible, to fix.

@ Mind:
"sitting on a problem for a year or two is irresponsible"

You need to address that comment to Google, not Dan, because nobody has been sitting on it. It's been blogged about, discussed in forums that Googlers are known to read repeatedly (and recently) and they just keep letting it happen.

@ Steve:
"otherwise you may unintentionally block legitimate searchbot spidering"

Define legitimate searchbot. I get the bulk of my traffic from 4 major search engines and almost nothing from the rest, therefore I don't consider them a legitimate waste of my bandwidth and blocked all the rest.

If one of the other bots suddenly becomes a big player in the market I'll open the doors and let it in.

Until then… 403 forbidden.

[…] might look into to help explain the problem is Proxy Hacking. Check out a post by Dan Thies called: Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs […]

Dan Thies @ 4:41 pm

I love you, Bill… seriously. You've been carrying the flag on content theft for so long, and not just complaining - showing people how to deal with it.

It's funny, you mention how long this has been going on… and how Google knew perfectly well that they had a problem. Between us, we explained all of these defenses on a big stage more than a year ago. Go look at SER's reports from SES last year, folks. "Reverse cloaking" was part of it. I guess unless you give the bad guys instructions on how to exploit it, then you aren't doing your job as a responsible citizen.

Bill Atchison @ 4:54 pm

Yes, not only did we discuss it on a panel in SES San Jose '06, it was discussed again in SES Chicago in Dec '06 and I did it again at PubCon in Nov '06 and there was a Google representative, Vanessa Fox if I'm not mistaken, on each of those panels.

There was also someone from Yahoo on 2 of those panels but the name escapes me at the moment.

YES, Yahoo has some proxy indexing issues as well but I'm not sure if Yahoo penalizes the original site or not.

Dan,

Jamie's update clears some things up…

I just figured that you cannot copy/paste the code from his blog since all those quote characters are replaced by non-code quotes … i.e. they don't work in PHP…

any clue on how to copy the code to a php source?

christoph
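
One generic way to deal with blog-mangled quotes is to map the typographic characters back to plain ASCII before saving the file. A small Python sketch; the function name is made up, and it would also flatten any curly quotes that legitimately appear inside string literals.

```python
# Map typographic quotes and primes back to the ASCII quotes PHP expects.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2032": "'",   # prime
    "\u2033": '"',   # double prime
})


def straighten_quotes(source):
    """Return source with curly quotes replaced by straight ASCII quotes."""
    return source.translate(QUOTE_MAP)
```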

Dan Thies @ 5:09 pm

Christoph, can you post that question over there?

David @ 5:11 pm

Thanks for this little day brightener Dan!

It seems to me since Google is a publicly-owned company, they and their shareholders might be responsive to negative publicity.

Anyone have any media contacts?

Song @ 5:19 pm

Wow, lots of conversations going on.
I had once lost all my traffic, too and took awhile to make my site visible… could not do much but wondering why that had happened.

Thanks for the posting.
I will share this with my friends and co-workers.

Bill Atchison @ 5:41 pm

@Hamlet, do you have any evidence that passing an X-Robots-Tag wouldn't be stripped by the proxy server?

Remember, unless it's just a clean pass-thru proxy your header would most likely be lost. Most of the CGI-based proxies scrub the HTML, strip out javascript, and probably return their own HTML header.

If your technique does work it probably won't for long.

[…] in this interesting post, Dan Thies tells us the details of a bug in Google's web search engine that makes it possible to get […]

Mark @ 5:54 pm

Nice one, Dan. He's definitely not wrong, I've seen people talking about this technique on a few of the murkier blackhat private forums.

[…] it dirty, both ways will always exist. And just learn how to defend yourself. Read more about the Google SERP kickout by proxy details by Dan Thies. […]

SJ @ 6:53 pm

Dan,

Solution to Situation #1 is lame.

Talking of Situation #2, don't you think a malicious hijacker can strip out your nofollow and noindex meta tags using simple regexp / string stuff while serving them to the bots? And it will be of no use. ;)

Dan Thies @ 7:00 pm

SJ, you may want to read through the comments that have already been posted.

If the people exploiting this were deploying their own proxies, we would be able to find them. They don't do that - they use existing proxies, and I have yet to see one in the wild that actually strips out the meta tags.

The whole "advantage" to using this exploit, is that it's "hands off" - it's easy to create links to URLs on other people's proxies, and you can do so anonymously, by doing comment spam on blogs, etc. So, there's no way to catch those doing it.

Hamlet Batista @ 7:30 pm

"Hamlet, do you have any evidence that passing an X-Robots-Tag wouldn't be stripped by the proxy server?"

Bill - I'm sorry if I was not clear, but I said the opposite.

"2) reverse cloaking. I like the idea and I can share another alternative implementation, but the proxy can be modified to strip the robot's meta tag or the X-Robots-Tag header."

Any information that passes through a proxy can be altered.

[…] Brad Fallon's My Wedding Favors Site Hacked??? Filed under: Uncategorized — jamesdeannash @ 12:45 am I was reading a post that Dan Theis wrote about a Google proxy hack and man is it just unbelievable that this sorda thing could even happen.  And you know what Brad Fallon has definitely paid the price with one of his sites Myweddingfavors.  Read the full story here […]

Dewald @ 9:18 pm

I'm wondering how reverse cloaking will work on a WordPress blog that has WP Cache enabled.

Logically, the Google bot and/or the proxy would get whatever version ("index" or "noindex") of the page in the cache, wouldn't it?

Dan, you mentioned that a WordPress plugin is in the works. Hopefully the developer will have some workaround for this.
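
One possible workaround, sketched in Python purely to illustrate the idea, is to make the verified-spider flag part of the cache key so the two versions never overwrite each other. This is not how WP Cache is known to work; the class and method names are hypothetical.

```python
class VariantCache:
    """Toy page cache that stores the index and noindex versions separately."""

    def __init__(self):
        self._store = {}

    def get_or_render(self, url, verified_spider, render):
        """Return the cached page for this URL and variant, rendering it on a miss.

        render: callback taking the verified_spider flag and returning HTML.
        """
        key = (url, "index" if verified_spider else "noindex")
        if key not in self._store:
            self._store[key] = render(verified_spider)
        return self._store[key]
```

With a key like this, a verified spider only ever receives the "index" copy and everyone else the "noindex" copy, at the cost of caching each page twice.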

jsnx @ 10:11 pm

You don't need PHP to implement the reverse cloaking, or anything else, for that matter. Ruby, Perl and Python would all work. Why introduce language partisanship into this issue?

Dan Thies @ 10:11 pm

Yep, that's the main reason why I'm going to use the .htaccess solution.

On the other hand, the risks with a blog are smaller, assuming you post regularly, because the home page will change with every post, and even the posts themselves will have some changes if you run a recent posts/comments section and allow comments. I haven't seen a blog get caught up in this yet.

Dan Thies @ 10:22 pm

Language Partisanship???

Dude, you could probably do it with a bash script too… are you saying that Jaimie should have used his spare time to write solutions in every possible language?

If I hear about a solution that's implemented with Ruby, Perl, Python, Java, Ada, Forth, Prolog, LISP, C++, C#, C-, D, Amiga E, or whatever language, I'll be happy to link to them.

Not that I'd be able to make head or tail of the code. OK, I could probably sort out the Ada, Forth, and Java OK… and I used to use Amiga E every day.

For now, because a PHP solution already exists, I'm linking to it.

Thomas @ 10:28 pm

Hey Dan,

This is a great write up. Really appreciate you taking the time to write this so that we can be better prepared for such a circumstance in the future.

Dewald @ 10:36 pm

Your observation about blogs not yet being caught up in this leads me to believe that the motivation of the folks exploiting this Google weakness is primarily financial.

In other words, they are probably hired to knock a competitor down in the SERPs.

Folks have to have a lot of time on their hands if they did it "just because they can" with no direct financial gain.

There's probably also an element of "I'll show that so-and-so who thinks he's this-and-that a thing or two."

Dan Thies @ 10:39 pm

If you blog regularly, it's pretty hard for anyone to hang a "duplicate content" label on your home page, because the content changes all the time. If you look at a site like SearchEngineLand - the home page changes several times a day. Good luck proxy hacking that. :D

Manmohan @ 10:52 pm

Dan, This is great information. Keep Going.

We all need to think seriously about this before we get thrown out through proxies, or until Google provides an "early" fix.

Dewald @ 10:55 pm

So, wouldn't it be far more effective and simpler to have a section with rotating text content on something like an otherwise static storefront?

Dan Thies @ 10:56 pm

My Wedding Favors actually ran for two weeks, rewriting the home page copy daily, and it worked. Hard to recommend that as a solution though. :D

August 17, 2007

Dan Thies @ 12:03 am

OK, folks… I've been sitting here for 12 hours approving new commenters and my nerves are shot. Time to sleep. New comments will get approved in the morning after I get back from my Genius Bar appt. at the Apple store.

I'd like to thank everyone who has contributed to the discussion so far. I was really dreading this, but you folks made it worthwhile.

Dan,

FYI we will build a DRUPAL plugin for jamie's method.

cheers
christoph

Bill Atchison @ 6:46 am

@Dan "Language Partisanship???"

Yah, that one blew my mind too…

Has Hillary Clinton started programming?

(Pingback)

PHP Random Text Selection @ 7:44 am

[…] possible way to prevent Google proxy hacks from hurting your search engine ranking is to dynamically serve different versions of the same page […]

Dewald @ 7:46 am

You don't have to rewrite the copy daily. You only need to rewrite it in a one-time effort.

As an example:

1) Create three distinct blocks of text on the page,
2) Rewrite the copy in each block five times (this rewrite is a one-time effort, conveying the same message in five different ways),
3) Then, when you dynamically construct the visitor's page, randomly select one of the five copies in each text block.

Googlebot (direct) could get and cache a copy of the page with, for example, Copy Version 2a in Block A, Copy Version 4b in Block B, and Copy Version 1c in Block C. Googlebot (via the proxy) could get and cache Copy Version 5a in Block A, Copy Version 5b in Block B, and Copy Version 4c in Block C.

If my math is correct, you dynamically serve 125 different, unique, and random versions of the page. Good luck proxy hacking that. :)

I posted a PHP function on my blog (http://www.dewaldblog.com/seo/php-random-text-50/) that will help folks do this. There should also be a trackback to this post of yours.
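
The linked function is PHP. As a language-neutral illustration of the same three-block scheme, here is a short Python sketch; the block names and placeholder copy labels are invented, and real copy would replace them.

```python
import random

# Three blocks, five hand-written copies each (placeholders for illustration).
BLOCKS = {
    "A": ["Copy Version %da" % n for n in range(1, 6)],
    "B": ["Copy Version %db" % n for n in range(1, 6)],
    "C": ["Copy Version %dc" % n for n in range(1, 6)],
}


def build_page(blocks=BLOCKS, rng=random):
    """Pick one copy at random for every text block."""
    return {name: rng.choice(copies) for name, copies in blocks.items()}


def version_count(blocks=BLOCKS):
    """Number of distinct pages: the product of the copies per block."""
    total = 1
    for copies in blocks.values():
        total *= len(copies)
    return total
```

With five copies in each of three blocks, `version_count()` gives 5 × 5 × 5 = 125, matching the figure above.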

[…] Blog August 17th, 2007 submit_url = 'http://seoaware.com/2007/08/17/seo-articles-you-need-to-read-aug-17-07/';   Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs […]

So you decided to put an all out on this one while many are standing by. You're such a great SEO, more so, a great person. It must have taken you all guts to do this and finally decide to tell all about this bug that you found.

I have heard and seen somebody from the Philippines talking about the 302 redirect also and since I am not a pro when it comes to programming and even SEO, I just stood aside.

I know somebody is doing a search on this and I found your post about this one in the most prestigious forum here in the Philippines (SEOPH forum) from a friend.

Thank you very much, as someday, when i come to understand stuff that you are talking about here. I would thank you even more for selflessly helping others in the SEO arena.

sam casuncad

Hamlet Batista @ 11:13 am

Dan - this has been a great discussion on how we can fix this problem, I think we both agree that it is Google who needs to fix it. Their method of detecting authentic pages is obviously flawed.

Last month I posted an idea on how they could fix this. http://hamletbatista.com/2007/07/19/content-is-king-but-duplicate-content-is-a-royal-pain/

I'd appreciate the community's input on this, as well as any other ideas that may arise.

SJ @ 11:27 am

It is Google who needs to fix it.

Completely agree!!!

Hobo @ 11:29 am

One of the best, more sincere Google articles I have ever read, if this is indeed a true problem.

Possibly the best real story (not made for sphinn) that's been featured there. Excellent analysis and coverage.

Shaun

[…] Dan Thies on proxy hacking taking people out of the SERP's. It's happened to me folks and it's easy to do so […]

[…] Thies published a post about how people have been hacking Google's search results using proxies to get the original sites nuked as duplicate content. He also explained how to defend sites against […]

[…] Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs This looks like some bad stuff. […]

[…] Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs "I continue to see web sites that are affected by this issue. After giving Google more than a year to resolve the issue, I have decided that the only way to spur them to action is to publish what I know." (tags: google seo webdevelopment) […]

Japanese SEO @ 4:03 pm

Hi Dan,

Your article surprised me very much.

It was a little bit difficult for me to understand fully since I'm a Japanese.
But I have introduced your experience and thoughts about this matter to my Japanese visitors on my blog.

Thank you.

P.S.
I've joined Stomper SIMPLE before as well as SEOFS. :-)

[…] Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs - Google’s algorithm allows 3rd parties to literally hack a web page out of Google’s index and search results …    […]

Dan Thies @ 7:40 pm

Welcome, Japanese SEO! Let me walk you through what happens:

  1. Some jerk links to www .proxyserver. com/proxy/www .mysite. com
  2. Googlebot finds that link and fetches that URL
  3. My server gets a request from the proxy and returns the page
  4. Google indexes the contents of my page, under www .proxyserver .com/proxy/www .mysite. com
  5. If I'm unlucky, Google drops or penalizes my page as duplicate content

dp @ 7:48 pm

Thanks for publishing this, disclosure is always best. Now you will have thousands of smart people working on a solution instead of a handful.

[…] a GREAT post by Dan Thies on Google Proxy Hacking. The post explains how proxy hacking can hurt your rankings and what you can do about […]

SEO Web Consulting @ 11:26 pm

Thank you Dan,

Courageous, honest and ethical… hats off to you for setting this good example.

My best to you,
Valerie DiCarlo

August 18, 2007

Mick @ 1:16 am

Kudos mate..good karma..thank you and thank you again
Keep up the fight.