Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs

In June of 2006, while working to resolve some indexing issues for a client, I discovered a bug in Google’s algorithm that allowed 3rd parties to literally hack a web page out of Google’s index and search results. I notified a contact at Google soon after, once I managed to confirm that what we thought we were seeing was really happening.

The problem still exists today, so I am making this public in the hope that it will spur some action.

I have sat on this information for more than a year now. A good friend has allowed his reputation to suffer, rather than disclose what we knew. I continue to see web sites that are affected by this issue. After giving Google more than a year to resolve the issue, I have decided that the only way to spur them to action is to publish what I know.

Disclaimer: What you’re about to read is as accurate as it can be, given the fact that I do not work at Google, and have no access to inside information. It’s also potentially disruptive to the organic results at Google, until they fix the problem. I hope that publishing this information is for the greater good, but I can’t control what others do with it, or how Google responds.

I am also not the only person who knows about this hack.

  • Alan Perkins (who along with many others stayed quiet about the 302 redirect bug for 2 years) knew about it the day after I found it.
  • Danny Sullivan has known nearly as long, and I suspect that his behind-the-scenes efforts are the reason why the major search engines all decided to publish “how to validate our spider” instructions after SES San Jose last year.
  • Bill Atchison knows, because he helped me figure out a defensive strategy for my client’s sites… and along with me danced around this issue on the “Bot Obedience” panel at SES last year – trying to warn people without telling them too much.
  • My (now former) client Brad Fallon knows… and he’s been subjected to a lot of unfair criticism that he could have easily answered by making this public. It cost him a lot of money, and “a lot of money” to Brad is a lot more than it is for most of us.
  • “Someone else” knows, because they were actively exploiting this bug to knock one of Brad’s sites off of Google’s SERPs. I suspect many other “black hats” know about it by now… because other sites are being affected. I can’t believe that they’re all accidents.

This is going to be a long story, I’m afraid… but bear with me, because you need to understand this, and how to defend yourself.

The story begins over a year ago…

My friend Brad Fallon had been having some trouble with Google on one of his web sites, My Wedding Favors. In June of 2006, after exhausting all of his other options, Brad (who knows his way around SEO) hired me to direct his search marketing efforts and, in simple terms, “figure out what the hell is going on with Google.”

The first thing I discovered was that he wasn’t “banned,” but that Google was indexing everything on his site except for the home page. It took about two weeks of research and testing before I developed a working theory. When we searched Google for phrases that should have been completely unique to the My Wedding Favors home page, we kept finding a particular kind of duplicate content: proxies.

For those who don’t know what a proxy is, it’s a web server that’s set up to deliver the content from other web sites. Among other things, proxies have been set up to allow people to surf the internet “anonymously,” since the requests come from the proxy server’s IP address and not their own. Some of them are set up to allow people to get to content that is blocked by firewalls and URL blocking on corporate, educational, and other networks.

The diagram below shows what this looks like, when used innocently by a human being:

diagram1.png

Unfortunately for Brad, the proxies weren’t being used innocently, and it wasn’t some kid trying to read his Myspace messages at school. It was Googlebot, fetching his home page’s content under a different URL and indexing it:

diagram2.png

When Google fetched the copy of Brad’s home page through the proxy URL, they were dropping Brad’s (authentic) home page from the index completely, and keeping the (proxy) duplicate instead. Every time it happened, Brad was losing a ton of traffic.

Since at first we thought it could be a “one-off” problem, we just blocked the proxies’ IP addresses from accessing our server. Sure enough, within a week or so, My Wedding Favors’ home page wasn’t just back in Google – it was all the way back on the first page of search results!

All good again? Not so fast…

Another week or so went by, another set of proxies showed up in Google’s index, and the home page was once again completely dropped from Google.

At this point, we realized that this wasn’t an accident. Someone knew exactly what they were doing. Someone was actively seeking out proxies and linking to them, so that Google would pick them up, and drop Brad’s home page.

By now, more technically inclined readers should see where this is going, but keep reading – it gets better (or worse).  If you’re just looking for “how to hack Google” instructions, that’s all I’m going to give you, so you can leave now.

Back to the story… It was pretty obvious that we had a fight on our hands… but we had some ideas for a solution. Since all of the proxied requests so far had come through with Googlebot as the user-agent, we implemented a “quick fix”: whenever we got a request with a Googlebot user-agent, we did a lookup on ARIN to see if the IP address was actually assigned to Google.
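
If you want to picture what that kind of check looks like, here is a minimal sketch in PHP. It uses the forward/reverse DNS test that the search engines later published rather than an ARIN lookup, and it is not the script we actually ran – treat it as an illustration only:

<?php
// If the visitor claims to be Googlebot, make sure the IP really belongs to Google.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                        // reverse DNS lookup
    $valid = preg_match('/\.googlebot\.com$/i', $host)
          && gethostbyname($host) === $ip;             // forward-confirm the hostname
    if (!$valid) {
        header('HTTP/1.1 403 Forbidden');              // a proxy pretending to be Googlebot
        exit;
    }
}
?>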

Another week, and Brad’s pages were back in, right back on the first page of the SERPs.

Now, I should mention that by this time I had contacted Matt Cutts at Google to let him know what was going on. His response was short, and he told me that he was surprised that this kind of thing could happen, but he did look into it. That was nice of Matt, because it’s not really his department, but he’s a good guy and actually wants to help webmasters out. I spoke about this with Matt and others on several subsequent occasions. They seemed to understand it, but nobody I talked to could do anything more than pass the word along to someone else.

That was more than a year ago.

I enlisted a few trusted folks to help me investigate, and Bill Atchison and I gave presentations at SES San Jose (August 2006) where we tried to warn people about the need to defend themselves without actually telling them “too much.” Since that presentation, other folks have written about the problem of proxies and duplicate content, but fortunately or unfortunately, they didn’t know how bad the problem was.

While I was in San Jose, Brad’s site got hacked out again (server upgrade broke our self-defense script and it had to be rewritten)…

Brad started getting a little sick of people calling him a “fake SEO expert” because his site was showing PR0, and couldn’t be found in SERPs… but he kept quiet and took the abuse, because he understood that this was dangerous information. I kept quiet too, because letting this kind of information out without giving Google a chance to fix the problem would be terribly irresponsible.

Bill kept quiet, as did Alan, Danny, and a few other folks who helped me research the issue. Either the SES San Jose presentations got through to someone, or Danny did something behind the scenes, because shortly after he learned about this, all of the search engines decided to start publishing clear instructions on how to validate their user-agents.

So, things quieted down for a bit. Google was (I thought) working on the problem, and Brad’s site was doing fine.

I told you it gets worse, though…

Around the first of October, the next wave of proxies hit – a different kind of proxy, one that didn’t pass Googlebot’s user-agent along. This was a whole network of proxies built to avoid detection: they were designed to let people inside the People’s Republic of China view censored content without being blocked by the “great firewall of China.” These proxies not only spoof the user-agent, they come in through many other (intermediate) proxies, so that the IP address of the originating server cannot be determined.

There was no way to block them by IP address, because even blocking every IP in China didn’t catch them all. There was no way to catch them based on the user-agent, because the user-agent was spoofed.

I was expecting this, actually… and we had a solution in the works: reverse cloaking. Every page Brad’s web servers deliver now has “noindex, nofollow” in the robots meta tag, unless the request comes from a validated search engine spider.  A “spoofed” proxy visit from Googlebot delivers a page that won’t be indexed. A real visit from Googlebot gets the page with “noindex” removed.

He’s not the only one doing that, either. Matthew Inman at SEOmoz noticed del.icio.us doing the same thing last fall, but none of the commenters could understand why… except Bill Slawski, who had seen the presentation at SES where I mentioned the “reverse cloaking” idea. Bill didn’t say much, but he probably understood the whole picture by then.

Crossing our fingers, but…

So far, this defense has held up, and Brad’s site isn’t just back in Google, he’s back at #1, and no longer has to answer questions about why he’s “banned by Google.”

Unfortunately, Brad was only the first person I know who was affected by this bug in Google’s algorithm. He’s far from the last… and I am sick of seeing people get hurt. After more than a year, Google hasn’t fixed the problem, although it seems that you are now more likely to catch a “-999 penalty” than get dropped completely. In my opinion, that’s not a huge improvement.

So I am going public with this, because we need solutions. We need Google to find solutions, instead of calling it a feature, like the 302 redirect bug (which BTW still exists in some form). Alan and I sat on that 302 stuff for almost 2 years before it got out. The result was no different. They still haven’t completely fixed that one – all it takes is a shorter URL and random things can still happen with a 302.

Google needs to hear it loud and clear from every web site owner – “fix this problem.” For any “Googlers” who may happen by and read this, here’s a suggestion I passed along to Vanessa Fox last December: when you retrieve a web page, and the Sitemaps verification meta tag tells you “hey, this tag is on the wrong domain,” then dump that page because you’ve got a proxy. That would at least help a few folks out… but it’s not a complete solution.

Why Is This Even Possible? Can It Be Done To Anyone?

In simple terms, it appears that the original (authentic) page gets dropped or penalized as duplicate content.

A couple years ago, Google deployed some software & infrastructure changes collectively known as “Big Daddy.” This involved crawling from many different data centers, and changes to the crawler itself. It appears that the changes include moving some of the duplicate content detection down to the crawlers. The bug probably arises from the way the data centers are synchronized. Pure speculation here, but the picture I have of what happens looks like this:

  1. The original page exists in at least some of the data centers.
  2. A copy (proxy) gets indexed in one data center, and that gets sync’d across to the others.
  3. A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
  4. The original is dropped or penalized.

The thing is, even if Google’s system is 99.9% accurate in selecting the right version as authentic, all you have to do is overwhelm it with large numbers. Large numbers of proxies, and/or a large number of times when a spider has to make the right decision. It’s possible that there’s no way to “fix this” without throwing the whole system away. I don’t know. I’m not an engineer.

As far as whether “any site” could get hacked, I don’t know. I’m not a black hat. I don’t have a link farm. I don’t have a botnet to spam blogs with. So I can’t manufacture thousands of links to thousands of proxies, in an attempt to knock sites off of SERPs. I wouldn’t do that anyway – it’s evil. So what I know is based mostly on sites reporting a problem, blocking the proxies, and seeing the problem disappear after the proxies are gone. Then repeating the exercise with the same results.

It depends on whether it’s all about confusing the system, or if there are enough other factors involved. It’s quite possible that some sites have so much authority, MojoRank, or whatever, that they simply could never be affected. It’s possible that there are negative trust factors, such as large-scale reciprocal linking, that could make a site more vulnerable.

How To Tell If You’ve Been Proxy Hacked

The simplest test, if you are experiencing a problem, is to examine Google search results for a phrase (search term in quotes) that should be unique to your page. For example, if your home page says “Fred’s Widget Factory sells the best down-home widgets on Earth” then you can search for that phrase.

You want to use a phrase (or combination of phrases) that should only appear on your page, and nowhere else on the web… or very few places at least. Then you do the search – if there’s more than one result (your page), then you need to examine the other URLs that are listed. If some of them are delivering an exact copy of the page, you just may be dealing with a proxy that has hijacked your content.
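
For example, something like the query below – the domain is made up, and the -site: operator simply filters out your own pages so that only copies show up:

"Fred's Widget Factory sells the best down-home widgets on Earth" -site:fredswidgetfactory.com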

A typical proxy link looks something like this:
www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/
It’s easy to see what URL that would fetch, if example.com were a real proxy. Other proxy URLs encode the target URL so it’s not always that easy to determine what they’re going to fetch just by looking.

The mere presence of proxies in the index doesn’t necessarily mean you’ll be dropped or penalized. The situation inside Google’s systems is no doubt very complex. I have seen sites with multiple proxies indexed, and no ill effects. It’s possible that there are certain factors (trust, authority, domain age, etc.) that make one site more susceptible than another. I have no idea how they make the decision on which copy of a page to keep.

If you discover that you have a problem (pages knocked out of the index, -999 “penalty”), and you can identify proxies as duplicate content, a reinclusion request is likely to work in the short term, while you implement countermeasures. If you don’t mind sharing your information with me so that I can use it for further research, send an email to proxyreports@gmail.com with the affected URL and the URL of the search result page that shows the proxy duplicates, along with any search terms where your ranking appears to be affected.

Cry Havoc, And Let Slip The Dogs of Spam?

I don’t know if publishing this today was the right decision… but it seems keeping quiet isn’t spurring anyone at Google into action. People are already getting nailed by this. I’ve spent the past month going back and forth, trying to decide what to do. I’m going to hit “publish” now, and hope that any attention we can bring to this situation will spur all those Ph.Ds in Mountain View to focus on this for a little while.

Ultimately, this was the same decision anyone faces when they find a security problem in any software – do you try to work with the developer behind the scenes, or inform the community and hope that the community can respond faster than the hackers? As you’ll see below, the community is already responding, and I’m not publishing this without offering some solutions for those who may be affected.

How To Fight Back

There are basically three main possibilities for your situation:

Situation 1: You are running an Apache server. We have two solutions in this case, developed by Jaimie Sirovich (co-author of Professional Search Engine Optimization with PHP). We’ve worked some late nights on this.

Solution #1 uses mod_rewrite and .htaccess to pass all spider requests through a PHP script that validates the request. This only defends against being hacked via “normal” anonymous proxies that pass along the user agent – it only inspects visits from the “Big 4” search engines (Ask, Google, MSN, and Yahoo). I call this the “first tier” defense – it won’t stop every proxy that exists, but it will come close, and you can implement it without modifying any of your applications. It will even work if your web site is all static pages. This is what I’m implementing. Jaimie doesn’t like it because it’s kind of a hack – and he would rather you didn’t use it at all.
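
To give you a rough idea of the shape of that first solution, the .htaccess side looks something like the sketch below. This is not Jaimie’s code – proxy.php stands in for his validating script, and the extra condition just keeps the rule from rewriting the script itself – so get the real code and instructions from his blog:

RewriteEngine On
# Only requests that claim to be one of the "Big 4" spiders get routed through the validator
RewriteCond %{REQUEST_URI} !proxy\.php
RewriteCond %{HTTP_USER_AGENT} yahoo|slurp|msn|ask|google|gsa [NC]
RewriteRule (^.*$) proxy.php?orig_url=$1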

Solution #2 is a PHP script that implements the “reverse cloaking” defense, putting a “noindex, nofollow” robots meta tag into your pages unless the visitor is a spider that you have configured the script to recognize. This will only be possible if your site is built on PHP. It wouldn’t be terribly difficult for a competent PHP user to implement this on an all-static site; you’d just need to change .htaccess so that your .html files are parsed as PHP. A WordPress plug-in will follow soon. This is a more robust defense, against more proxies.
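
Again, just to make the concept concrete, a bare-bones version of the reverse cloaking idea looks something like this. It is not Jaimie’s script (his handles spider lists, caching, and plenty of edge cases), and the hostname suffixes below are examples only – check each engine’s published validation instructions before trusting anything like this:

<?php
// Reverse cloaking: default to "noindex, nofollow" unless the request comes from
// a spider that passes a forward/reverse DNS check.
function is_validated_spider() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!preg_match('/googlebot|slurp|msnbot|teoma/i', $ua)) {
        return false;                                   // doesn't even claim to be a major spider
    }
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                         // reverse DNS
    if (!preg_match('/\.(googlebot\.com|crawl\.yahoo\.net|search\.msn\.com|ask\.com)$/i', $host)) {
        return false;                                   // hostname isn't on a known crawler domain
    }
    return gethostbyname($host) === $ip;                // forward-confirm the reverse DNS
}

$robots = is_validated_spider() ? 'index, follow' : 'noindex, nofollow';
echo '<meta name="robots" content="' . $robots . '" />';
?>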

How to get the code: An implementation guide is provided on Jaimie’s blog, along with a testing environment that you can use to check spider user agents & IP addresses, and of course the source code for both solutions. No warranty is given. This is hard core code for a hard core situation. Don’t use it if you don’t need it, and all code should really be deployed by professionals who can understand what it does, modify it to suit unique environments, etc.

Situation 2: You are running a Microsoft (IIS) server. Jaimie is working on an IIS/ASP solution similar to the Apache/PHP solution, which should be available soon. Think days, not weeks, in other words. Much sooner than his new book (Professional SEO with ASP), which is also in the pipeline.

I want to thank Jaimie for stepping up to provide these solutions on very short notice. I had some code of my own but he’s a real programmer, and I’m just a guy who hacks scripts together when I need something in a hurry. This isn’t a job that should be trusted to a guy who hacks code part time. You want an expert.

Situation 3: You are on a hosted solution, aren’t running PHP scripts that you can edit, don’t control the web server, etc. This is a more complex situation. I will have another post tomorrow that will offer some possible solutions, including one that involves creating your own caching proxy on a separate server. In this case, I don’t recommend doing anything unless you really believe that you have a problem with proxies.

In fact, I have mixed feelings about recommending any “defensive” measures for anyone who isn’t actually being affected… unless losing your Google traffic for a few weeks is such a daunting prospect that you feel you must put up the walls. Just understand – running extra code before you deliver a page will have a cost, in terms of server load and response times. Personally, I am putting up the walls on all of my sites.

Further disclaimer: these solutions are based, at least in part, on information that the search engines have published regarding the right way to validate spider visits. It would be nice if they would publish the information once and then stick by it, but Yahoo gave us instructions shortly after Google did, and then recently changed the domain they crawl from (it was inktomisearch.com, now crawl.yahoo.net). Once you start doing this stuff, you have to keep up with what the search engines are doing. I’ll certainly try to keep my subscribers informed, but not everyone gets my newsletter. Keeping up to speed on this stuff is up to you.

There are other solutions available. Bill Atchison’s Crawlwall is a professional (commercial) solution that does a lot more to prevent content theft, etc. If you have the means, you may want to consider this instead, and move the burden of “keeping up with the spiders” onto Bill’s shoulders. Jaimie is working on a more general proxy-blocking solution as well. Ekstreme has the beginnings of a spider validation solution in the PHP Search Engine Bot Authentication code they published.

If You Are Operating A Proxy – Don’t Be Part of the Problem

If you are operating a proxy server, and you don’t want to be part of the problem, you can prevent your server from being used as a tool by adding a robots.txt file that prevents all search engine spiders from indexing proxied content through your server. For example, if all proxy URLs begin with /proxy/ then you can use:

User-agent: *
Disallow: /proxy/

Of course, not all proxies are being run by innocent people for innocent reasons. Some of them are actually designed to hijack content – to deliver ads, etc. Some people want to steal your content, and they want the search engines to index it. In fact, I would not be surprised if a large part of the overall problem is caused by such people firing links at their own proxies.

Is It Just Google?

You got me… I haven’t seen any cases on other engines that looked like a proxy hack, but I’d be surprised if it only affected Google. Google may simply be the only search engine that shows you enough search results to let you “catch” the proxies. Google may be more susceptible because they crawl more URLs more often, and use multiple data centers.

Assuming I am not completely wrong, it sure looks like less of a design flaw, and more of an “emergent property” of the very things that make Google the world’s best search engine (just my opinion, apparently the average consumer no longer agrees). I don’t know that there is an easy solution, especially if the problem arises because of their multiple-data-center strategy.

Unfortunately, any countermeasures that we implement could be thwarted by someone willing to copy our content in other ways, or by constructing a proxy that spoofs user agents, uses intermediate proxies to hide its IP address, and strips out meta tags. This has always been possible, BTW. Anyone actually doing these things, of course, would likely be committing a crime… and would be a lot easier to find than some script kiddie using comment spam to fire links at someone else’s proxies.

Is It Possible That You Are Totally Wrong, Dan?

Yes, I suppose it’s possible that there is some other explanation, that everything at Google is perfect, etc. But I’ve spent a lot of time looking at this, and it sure looks to me like this is a real problem.

Defend yourselves, folks. It’s a dangerous world.

P.S. I will be discussing this issue with Jim Hedger on Webmaster Radio’s “The Alternative” today, Thursday August 16 – the show airs at 5pm Eastern.

UPDATE: As of May 1, 2008, I have every reason to believe that Google has solved this problem, at least in the general case. At this point, the only sites I can see getting “duped by proxy” are spammier than the proxies themselves.

Update again: September 2009 - damned if this thing hasn’t cropped up again – now it looks like Google’s replacing the duped URL with the copy’s URL – and even RANKING the duplicates… (similar to the already-known-and-passed-off-as-a-feature 302 redirect bug).

300 thoughts on “Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs”

  1. Pingback: Search Engine Land: News About Search Engines & Search Marketing

  2. Pingback: SEO Egghead by Jaimie Sirovich » How To Guide: Prevent Google Proxy Hacking

  3. Dan – I’ve blogged about this issue a couple of times. Please read negative SEO counter measure #8 here http://hamletbatista.com/2007/07/16/you%e2%80%99ve-won-the-battle-but-not-the-war-10-ways-to-protect-your-site-from-negative-seo/

    I also expose one critical weakness of the bot validation method and propose a stronger solution here http://hamletbatista.com/2007/07/03/the-never-ending-serps-hijacking-problem-is-there-a-definite-solution/

    Both posts are very technical, but I think they are not difficult to follow.

  4. Wow, Dan, you’re really brave to put up this article! I can imagine all the conflicting interests that would want you to keep this issue quiet. In the end, you’re the guy the webmasters trust to keep the SEO world honest!

  5. If Google implements “trusted webmasters,” they can generate a tag that we apply to every page on our site. That tag would encode the domain name, allowing Google to validate that our copy of a site is legit, and that all other copies are bogus. Once a domain is registered to a particular webmaster, no other can register it. This could become an additional feature of Google Webmaster Tools. The main challenge is deciding who can be trusted, but that’s not insurmountable.

  6. Thanks for sharing this Dan. I think it’s better out in the open than simmering away in the background. Kudos to you for having the courage to inform us all.

  7. Hey Dan,

    great post and thanks for the summary of the initial problem. I appreciate this, as I’ve actually been actively fighting the “google bowlers via proxy site” hackers for the last 2 weeks already…

    What your post fails to EMPHASIZE is that both of your solutions will only get rid of those proxies that ID as googlebot & co.

    But as you said for yourself – the big wave bowling sites out of the SERPs come from proxy sites that “cloak” the user agent … or better – they pass on the user agent from their visitor to your site…

    and NO, they are NOT all located in China (what makes you think that?)

    I actually discussed this with Bill (IncrediBill) last week on his blog where he mentioned that the only way would be to block all hosting centers in the world from crawling… something pretty aggressive to do…

    (see

    http://incredibill.blogspot.com/2007/07/google-proxy-hijacking-myths-urban.html

    )

    So, apart from blocking ALL datacenters in the world and going thru the SERPs looking for proxies, what do you suggest to cure this mess?

    best regards
    Christoph C. Cemper
    - the marketingfan

  8. Why Is This Even Possible? Can It Be Done To Anyone?

    The main problem is the way Google chooses the authentic page among duplicates. From Webmaster Central Blog: http://googlewebmastercentral.blogspot.com/2007/06/duplicate-content-summit-at-smx.html

    Providing a way to authenticate ownership of content…We currently rely on a number of factors such as the site’s authority and the number of links to the page.

    According to this, hijackers only need to install their cgi-proxies on domains with more inbound links and/or authority than ours.

  9. Dan,

    Just reading your post gives me a massive headache!

    Seriously though, even if I don’t understand all the nitty-gritty details, this kind of straight talk earns you a few more notches on the respect scale.

    The lesson for me here is: never put all your eggs in one basket. In this case, Google’s search engine. This is not a good 80/20 scenario.

    Now I’m beginning to appreciate all the buzz about Web 2.0, which are real people’s votes vs. Google’s massive algorithm.

    Great work!

  10. Hamlet,

    exactly… this is what happened to one of my sites,
    and that damn proxy site has a WIKIPEDIA link,
    while mine does not… this shows that even those wikipedia links which are nofollowed now seem to transfer trust

    christoph

  11. Dan,

    I’ve personally been nailed by this tactic and have had pages replaced by proxy pages. I have implemented similar patches but that seems to be all they are because they’re usually only temporary. I’ve reported to Google multiple times on this issue as well.

    One thing I’ve noticed is it normally only happens to pages that have fewer inbound links and lower PR, but your example seems to void that theory as I’m familiar with Brad and know his wedding site has a very strong link profile.

    Aaron

  12. Nice post. Sucks to be honest and try to work within the system and be nice about things. My hat’s off to you for your patience. Me, I’d have posted it long ago.

    I just don’t have that much patience. I know it’s also very hard to hint to people about a problem fix while keeping the actual problem hidden. I don’t have that much patience either.

    Good job.

  13. Pingback: SiteProNews Blog

  14. Let’s see, where to start… how about Christoph:

    Actually, the reverse cloaking solution does deal with proxies that don’t identify the user agent, because the only user-agents that get a page without the “noindex, nofollow” are those that:
    1) Identify as spiders
    2) Pass a “valid IP address” test

    And I never said they were all in China – just that the set of proxies that got Brad were.

    Hamlet, I don’t see how what you’re suggesting is stronger, but it looks like maybe you didn’t read everything either.

  15. Aaron (TheMadHat) – yeah… we’re talking about a site that went all the way (back) to the top of the SERPs when we implemented the reverse cloaking solution.

    Jonathan, I think even starting with the domain registration that already exists in Webmaster Tools would be better than nothing.

  16. As a newbie trying to make enough bucks to augment Social Insecurity checks I can appreciate the efforts of people like Dan who try to keep the playing field level, even if it does belong to Goooogle. I give complete moral support to anyone who works to thwart “Proxy Hackers”.

    My area to exploit is the myriad of small, no-account, puny-commission-paying small businesses. They, like me, aren’t trying to get rich quick; quite the contrary, we will work hard for small checks received thousands of times, simply by linking unique searchers to unique providers. We do this by using non-hostile SEO tactics…. The cheap ones, of course.

    To quote my hero and mentor, Forest Gump, “That’s all I got to say about that.”

    Sam
    PATHFINDERS 2007

  17. Arghh,

    One of my websites drastically dropped in traffic a couple of months ago. I have not touched the site for quite some time (no improving link popularity, etc.), so this came as a surprise to me. I read this article and did not find any proxy URLs. However, I did notice that there are many spam websites that contain my content. Could this be the reason for the drop in Google’s ranks?

  18. This is worrying information indeed.

    I am lucky enough to be in a market where the vast majority of the competition are not very technically inclined – I would be surprised if they know what a nofollow tag is.

    For everyone in a semi competitive market – fireproof yourself asap.

    We are only going to see more of these cases before Google decides to fix the problem.

  19. Pingback: Marketingfan.com

  20. I will assume that Matt and company will address this issue. Could be a victory for the little guys with white hats?

    “If Google implements “trusted webmasters,” they can generate a tag that we apply to every page on our site.” Jonathan, that concept certainly has my meager vote.

  21. Dan,

    thanks for getting back to me on my assumption your solution wouldn’t work generally… and mea culpa – I’m sorry.

    I now agree that this method should work for most proxy sites (some will get around it even so, and will need to be blacklisted via another strategy)

    I thought I’d add to this great post by showing off my own company site, which is currently suffering from this attack method – and that had been going on for 6+ weeks before I found out about it…

    I put together all the facts, keywords and SERP screenshots here

    http://www.marketingfan.com/search-engines/google-proxy-bowling

    and would appreciate your comments or other ways to faster cure it – because obviously a spam report to Google 2 weeks ago or even blocking all those scumbag’s IPs that we found were used for scraping us didn’t show any effect.

    I’ll move forward to implement your solution#2 ASAP

    Thanks again and cheers
    Christoph
    - the marketingfan.com

  22. I’ve been waiting to hear back from Matt and/or Adam to see what they say about this. Hopefully one of them will post a response here.

  23. Thanks for updating that, Christoph. So far, the reverse cloaking method has held up on every site we’ve implemented it with. Jaimie’s code is (IMO) actually a lot more reliable than what we’ve been using.

    He’s actually suggested another method: since the reverse cloaking script itself proxies the page, it would be possible to implement it using the .htaccess method and insert the robots meta tag into any content without having to modify existing code. For the same reasons (it’s kind of a hack) he doesn’t like it much. It would also be a heck of a lot of server overhead.

    Technically speaking it would be possible to check every IP, but as Hamlet pointed out, that would be a major bit of overhead to add to every single request.

  24. Wave, as I suggested to Google last year… they already have a meta tag that they use to verify the ownership of a domain. They use it to give you access to reports in Webmaster Tools.

    Wouldn’t be too hard to put that on every page, if Google would use it.

  25. Hey Dan!

    I think I know why I was misled into thinking that this wouldn’t work for the proxies that pass the normal user agents…

    Egghead’s description to implement this in htaccess says

    RewriteCond %{HTTP_USER_AGENT} yahoo|slurp|msn|ask|google|gsa [NC]
    RewriteRule (^.*$) proxy.php?orig_url=$1

    Well, and that’s the flaw … this mod_rewrite condition only accepts the major 4 SEs, as you said, and I thought he would pass this into the “solution 2” script…

    but he omitted the proxy.php completely from the post, which made me think his simple_cloak_v2.inc was meant to be the proxy.php to pass all requests into

    I guess I’ll just mod this to pass ALL requests thru the simple_cloak and make sure I don’t have to tweak any pages..

    Though one concern I have is about the output buffering… I think this might cause problems on sites that ALREADY USE output buffering, though… but I’m sure Jaimie can comment more on that…

    cheers,
    Christoph
    - the marketingfan.com

  26. Yeah, I get ya. They’re two completely different approaches. When he wakes up we’ll ask him to address the confusion on his post. He was up until 5am running test cases on the reverse cloaking script.

    Like I said, Jaimie doesn’t like the .htaccess method at all. I only asked him to do it because it provides at least partial protection to static sites.

    The reverse cloaking method doesn’t use .htaccess at all.

  27. To clarify I hope:

    Solution #1 says “if you claim to be a spider, we force you to prove it before we give you content.”

    Solution #2 says, “unless you claim to be a spider and can prove it, we insert a robots meta tag with noindex.”

    Solution #1 will effectively deal with “normal” proxies that pass along the user agent.

    Solution #2 deals with proxies that spoof the user agent.

  28. Pingback: How Proxy Hacking Can Hurt Your Rankings & What To Do About It | Seo Alchemist

  29. Dan,

    I believe either I’m too tired or you are still wrong about solution #1

    IF sol#1 would deal with “proxies that pass along the user agent” it would do the same as sol#2 …

    in fact

    Solution #1 will effectively deal with proxies that PRETEND TO BE A BOT (but cannot prove it)

    I meanwhile went ahead and wanted to implement Jaimie’s piece of code, but the implementation is also missing

    - a config.inc.php (which could be reconstructed by a coder after doing a review)

    - a cron’ed call to update the spider list

    WARNING:

    the CURRENT IMPLEMENTATION as listed on Jaimie’s SEO Egghead site ACTIVELY causes all SITES TO GET DEINDEXED, because those mysql tables are empty if nobody calls updateAll() in his code…

    So beware, fellow webmasters, and wait until Jaimie has had his coffee … but obviously a mistake I can fully understand if you hang in testing mode until 5am and then still need to post all that stuff you made on your blog… :-)

    cheers
    Christoph
    - the MarketingFan.com

  30. Hamlet, I don’t see how what you’re suggesting is stronger, but it looks like maybe you didn’t read everything either.

    Dan – Thanks for responding. As you correctly mentioned in your post, there are obvious weaknesses in your proposed solutions that a programmer could exploit.

    1) reverse-forward DNS for bot detection. As you expressed in your post, the proxy can be modified to provide another user agent instead of Googlebot.
    2) reverse cloaking. I like the idea and I can share another alternative implementation, but the proxy can be modified to strip the robot’s meta tag or the X-Robots-Tag header.

    Alternate reverse cloaking: Setting the new Googlebot-supported header X-Robots-Tag: noindex,nofollow if the requesting IP fails the bot validation.

    I explained an easy way to implement this with mod_headers and mod_setenvif here http://hamletbatista.com/2007/08/01/controlling-your-robots-using-the-x-robots-tag-http-header-with-googlebot/ (see the sketch at the end of this comment)

    This solution has the advantage that it works for any type of file, not just HTML ones.

    What is the solution I am proposing and why is it stronger?

    I’m proposing we use the same techniques that have been in use for some time in the email anti-spam industry: identifying and blocking IPs that have been successfully tagged as sources of spam. We need to identify and block CGI hijackers.

    1. We can identify cgi-proxies by inserting a unique fingerprint + requesting IP into all pages when the IP is not from a search engine bot. We can later do a search for the fingerprint to find the cgi proxy IPs. I know this works because I use this technique to track people scraping my RSS feed.
    2. We can then publish those IPs in The HoneyPot Project DNS database http://www.projecthoneypot.org/httpbl.php
    3. We then block access to our web pages to any IPs listed there.

    It is definitely not trivial (there is some programming involved), and is a little bit reactive, but I think this is a good starting point.
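
    A minimal Apache sketch of the mod_setenvif/mod_headers approach described above – the 66.249. range is just an example of a whitelisted crawler block, and a real setup would need the full, current list of validated IP ranges:

    # Mark requests from a known crawler range, then noindex everything else via the header
    SetEnvIf Remote_Addr "^66\.249\." validated_bot
    Header set X-Robots-Tag "noindex, nofollow" env=!validated_bot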

  31. sitting on a problem for a year or two is irresponsible. the proper thing to do would have been to re-notify google after a month or so of hearing nothing, that you intend to release the vulnerability, with full instructions on how to do it, in two months. i’m multiplying these times by 4 for you, so it doesn’t seem so harsh, and because perhaps a conceptual vulnerability like this is a bit harder to fix than a simple buffer overflow. often times companies don’t fix something that isn’t high priority. by giving them a bit of time, you’re being nice and giving them a shot at fixing it before public release. but after 3 months (which would normally be weeks), if they haven’t done anything about it, they obviously aren’t working on it, they just don’t care. by keeping this thing private (and you somewhat have, given that you don’t say exactly how to do it), you’re only helping the people who know how to do it, and are using it to actively hurt competitors. the longer it is secret knowledge, the longer it will work, and the better off they will be

  32. Thanks for the clarity, Hamlet.

    What you propose does need an implementation, but it is stronger. Jaimie is actually working on a “block all proxies” method. I believe Bill Atchison is already doing the same thing with Crawlwall.

    A published database of proxies (assuming it can’t be maliciously polluted) would add a stronger layer of defense.

    It’s a given that one could construct a proxy that would spoof user agents, strip headers and meta tags, etc. – so stronger solutions are needed.

  33. Solution #2 has a flaw – a malicious proxy could simply remove the robots meta tag before it sends the page back through to Google.

    Solution #1 also has the problem that you need to keep a very good track of which IP addresses are valid, otherwise you may unintentionally block legitimate searchbot spidering.

    I appreciate your feelings on this, Mind… but that’s easy to say when you don’t have to make the decision.

    Right now, at this very moment, there are people out there trying to make use of this exploit, who didn’t know it existed yesterday. Implementing defenses takes time. Starting up a comment-spamming script takes seconds.

    After seeing this exploited once, I didn’t actually see it again for some time. I believed that Google was working on it… and as I said, I have spoken with several people more than once during the interval.

    Trust me though, next time, I will just publish what I know as soon as I know it. Lesson learned.

  35. Steve,

    1) We KNOW that a malicious proxy could be constructed. I’ve said as much. :D But I haven’t seen one in the wild yet… and solving THAT problem is a lot bigger. One of the reasons why this exploit is so “nice” for black hats is that it’s totally hands off – they don’t have to do anything but point links at other people’s proxies. If they build one that strips meta tags and headers, they have to host it somewhere, and then there’s at least a chance of tracking them down.

    2) The forward and reverse DNS lookup is what the search engines have given us to use.

  36. Pingback: Search Matters Weekly Link Love at Catalyst Search Matters Search Marketing Blog

  37. Dan and all the comment writers, you’ve left my head spinning but I’m very thankful nonetheless for your insights. Certainly something I’ll keep a vigilant eye on from now on.

  38. I browsed my tracking system logs (01-16 August), and found them:
    69.89.21.71
    72.232.150.250
    208.110.218.138
    208.110.218.139
    208.110.218.201

    Crawled my pages a lot, all have useragent:
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    My solution for now is making list in my .htaccess

    Deny from 69.89.21.71
    Deny from 72.232.150.250
    Deny from 208.110.218.138
    Deny from 208.110.218.139
    Deny from 208.110.218.201

  39. Dan – Blocking all proxies has many potential side effects (there are legitimate uses for cgi proxies, such as Anonymizer, etc.). On the other hand, there are many advantages if we use http:BL for sharing the cgi proxy IPs. The main one is that it has already been proven and has an active community behind it. We can probably approach them and they might be willing to help.

    Jaimie and Bill can get in touch with me if they want to. I will contribute in any way that I can.

  40. Yeah, blocking them all is extreme, and that’s why I asked Jaimie to wait on that one in favor of reverse cloaking. Reverse cloaking has held up so far…

    But I agree that we need to have several options, because giving up anonymous visitors might be worth it, if you get hit and Google’s still struggling with it.

    Anyone who wants to implement a header & meta stripping proxy risks being found, because they have to host it somewhere.

  41. Dan,

    As I mentioned in a private email I think you’re running into 2 distinctly different problems.

    a) Sites that crawl and cache your content that are then indexed in Google and,

    b) Google crawling through a proxy

    Unfortunately the results are the same if Google is allowed to crawl that proxy cache.

    The reason I say this is that I have many high speed crawl attempts from China and HK all the time for many thousands of pages that they cache locally on their servers.

    I’m positive this activity isn’t Google via a proxy just because of the sheer speed alone, which can be 100s of pages in just a few seconds.

    A couple of other things…

    I don’t use the lists of known proxies anymore as they vanish quickly and can be gamed. Instead I use other techniques that can usually identify most open proxies before more than a couple of pages are stolen. This can be done with post-page processing, by opening a direct socket to the most common proxy ports for that IP to see if it’s an open proxy. By doing it post-page processing there’s no page latency noticed and, assuming you find an open proxy, you block it within a few page accesses. (See the sketch at the end of this comment.)

    @ Wave:
    “I will assume that Matt and company will address this issue.”

    Never assume as this problem has been going on for years and I’ve even discussed it with a couple of the Googlers personally.

    Either a) they don’t care which I find hard to believe or b) they don’t think it’s that big of a problem for most sites as it typically isn’t or c) something is wrong at the core of Google that makes this very difficult, if not impossible, to fix.

    @ Mind:
    “sitting on a problem for a year or two is irresponsible”

    You need to address that comment to Google, not Dan, because nobody has been sitting on it. It’s been blogged about, discussed in forums that Googlers are known to read repeatedly (and recently) and they just keep letting it happen.

    @ Steve:
    “otherwise you may unintentionally block legitimate searchbot spidering”

    Define legitimate searchbot. I get the bulk of my traffic from 4 major search engines and almost nothing from the rest, therefore I don’t consider them a legitimate waste of my bandwidth and blocked all the rest.

    If one of the other bots suddenly becomes a big player in the market I’ll open the doors and let it in.

    Until then… 403 forbidden.
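
    A rough sketch of the post-page-processing port check described above, in PHP. The port list and timeout are just examples, and a positive hit only means something is listening on a typical proxy port – a fuller check would try to fetch a URL through it before blocking:

    <?php
    // Probe the visitor's IP on common proxy ports after the page has been served.
    function looks_like_open_proxy($ip, $ports = array(3128, 8080, 8000, 8888, 1080), $timeout = 2) {
        foreach ($ports as $port) {
            $sock = @fsockopen($ip, $port, $errno, $errstr, $timeout);
            if ($sock) {
                fclose($sock);
                return true;    // something answered on a typical proxy port
            }
        }
        return false;
    }
    ?>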

  42. Pingback: Has Your Home Page Dropped Off The Face of Google?

  43. I love you, Bill… seriously. You’ve been carrying the flag on content theft for so long, and not just complaining – showing people how to deal with it.

    It’s funny, you mention how long this has been going on… and how Google knew perfectly well that they had a problem. Between us, we explained all of these defenses on a big stage more than a year ago. Go look at SER’s reports from SES last year, folks. “Reverse cloaking” was part of it. I guess unless you give the bad guys instructions on how to exploit it, then you aren’t doing your job as a responsible citizen.

  44. Yes, not only did we discuss it on a panel in SES San Jose ’06, it was discussed again in SES Chicago in Dec ’06 and I did it again at PubCon in Nov ’06 and there was a Google representative, Vanessa Fox if I’m not mistaken, on each of those panels.

    There was also someone from Yahoo on 2 of those panels but the name escapes me at the moment.

    YES, Yahoo has some proxy indexing issues as well but I’m not sure if Yahoo penalizes the original site or not.

  45. Dan,

    Jamie’s update clears some things up…

    I just figured out that you cannot copy/paste the code from his blog, since all those quote characters are replaced by non-code (curly) quotes … i.e. they don’t work in PHP…

    any clue on how to copy the code to a php source?

    christoph

  46. Thanks for this little day brightener Dan!

    It seems to me since Google is a publicly-owned company, they and their shareholders might be responsive to negative publicity.

    Anyone have any media contacts?

  47. Wow, lots of conversations going on.
    I once lost all my traffic, too, and it took a while to make my site visible again… I could not do much but wonder why that had happened.

    Thanks for the posting.
    I will share this with my friends and co-workers.

  48. @Hamlet, do you have any evidence that passing a X-Robots-Tag wouldn’t be stripped by the proxy server?

    Remember, unless it’s just a clean pass-thru proxy, your header would most likely be lost. Most of the CGI-based proxies scrub the HTML, strip out javascript, and probably return their own HTML header.

    If your technique does work it probably won’t for long.

  49. Pingback: refugeenet.org Blog » Bug en Google: tu sitio web puede ser penalizado en el buscador mediante un ‘ataque proxy’

  50. Nice one, Dan. He’s definitely not wrong, I’ve seen people talking about this technique on a few of the murkier blackhat private forums.

  51. Pingback: SEO Religion » So no wonder I get all those proxy results...

  52. Dan,

    Solution to Situation #1 is lame.

    Talking of Situation #2, don’t you think a malicious hijacker can strip out your nofollow and noindex meta tags using simple regexp / string stuff while serving them to the bots? And it will be of no use. ;)

  53. SJ, you may want to read through the comments that have already been posted.

    If the people exploiting this were deploying their own proxies, we would be able to find them. They don’t do that – they use existing proxies, and I have yet to see one in the wild that actually strips out the meta tags.

    The whole “advantage” to using this exploit, is that it’s “hands off” – it’s easy to create links to URLs on other people’s proxies, and you can do so anonymously, by doing comment spam on blogs, etc. So, there’s no way to catch those doing it.

  54. Hamlet, do you have any evidence that passing a X-Robots-Tag wouldn’t be stripped by the proxy server?

    Bill – I’m sorry if I was not clear, but I said the opposite.

    2) reverse cloaking. I like the idea and I can share another alternative implementation, but the proxy can be modified to strip the robot’s meta tag or the X-Robots-Tag header.

    Any information that passes through a proxy can be altered.

  55. Pingback: Brad Fallon’s My Wedding Favors Site Hacked??? « James Dean Nash - SEO Blog and Internet Marketing

  56. I’m wondering how reverse cloaking will work on a WordPress blog that has WP Cache enabled.

    Logically, the Google bot and/or the proxy would get whatever version (“index” or “noindex”) of the page in the cache, wouldn’t it?

    Dan, you mentioned that a WordPress plugin is in the works. Hopefully the developer will have some workaround for this.

  57. You don’t need PHP to implement the reverse cloaking, or anything else, for that matter. Ruby, Perl and Python would all work. Why introduce language partisanship into this issue?

  58. Yep, that’s the main reason why I’m going to use the .htaccess solution.

    On the other hand, the risks with a blog are smaller, assuming you post regularly, because the home page will change with every post, and even the posts themselves will have some changes if you run a recent posts/comments section and allow comments. I haven’t seen a blog get caught up in this yet.

  59. Language Partisanship???

    Dude, you could probably do it with a bash script too… are you saying that Jaimie should have used his spare time to write solutions in every possible language?

    If I hear about a solution that’s implemented with Ruby, Perl, Python, Java, Ada, Forth, Prolog, LISP, C++, C#, C-, D, Amiga E, or whatever language, I’ll be happy to link to them.

    Not that I’d be able to make head or tail of the code. OK, I could probably sort out the Ada, Forth, and Java OK… and I used to use Amiga E every day.

    For now, because a PHP solution already exists, I’m linking to it.

  60. Hey Dan,

    This is a great write up. Really appreciate you taking the time to write this so that we can be better prepared for such a circumstance in the future.

  61. Your observation about blogs not yet being caught up in this leads me to believe that the motivation of the folks exploiting this Google weakness is primarily financial.

    In other words, they are probably hired to knock a competitor down in the SERPs.

    Folks have to have a lot of time on their hands if they did it “just because they can” with no direct financial gain.

    There’s probably also an element of “I’ll show that so-and-so who thinks he’s this-and-that a thing or two.”

  62. If you blog regularly, it’s pretty hard for anyone to hang a “duplicate content” label on your home page, because the content changes all the time. If you look at a site like SearchEngineLand – the home page changes several times a day. Good luck proxy hacking that. :D

  63. Dan, This is great information. Keep Going.

    We all need to seriously think about this before we get thrown out through proxies, or until Google provides an “early” fix.

  64. So, wouldn’t it be far more effective and simpler to have a section with rotating text content on something like an otherwise static storefront?

  65. My Wedding Favors actually ran for two weeks, rewriting the home page copy daily, and it worked. Hard to recommend that as a solution though. :D

  66. OK, folks… I’ve been sitting here for 12 hours approving new commenters and my nerves are shot. Time to sleep. New comments will get approved in the morning after I get back from my Genius Bar appt. at the Apple store.

    I’d like to thank everyone who has contributed to the discussion so far. I was really dreading this, but you folks made it worthwhile.

  67. Pingback: PHP Random Text Selection

  68. You don’t have to rewrite the copy daily. You only need to rewrite it in a one-time effort.

    As an example:

    1) Create three distinct blocks of text on the page,
    2) Rewrite the copy in each block five times (this rewrite is a one-time effort, conveying the same message in five different ways),
    3) Then, when you dynamically construct the visitor’s page, randomly select one of the five copies in each text block.

    Googlebot (direct) could get and cache a copy of the page with, for example, Copy Version 2a in Block A, Copy Version 4b in Block B, and Copy Version 1c in Block C. Googlebot (via the proxy) could get and cache Copy Version 5a in Block A, Copy Version 5b in Block B, and Copy Version 4c in Block C.

    If my math is correct, you dynamically serve 125 different, unique, and random versions of the page. Good luck proxy hacking that. :)

    I posted a PHP function on my blog (http://www.dewaldblog.com/seo/php-random-text-50/) that will help folks do this. There should also be a trackback to this post of yours.
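
    A stripped-down sketch of the same idea – the copy strings are obviously placeholders, and the function at the link above is the better starting point:

    <?php
    // Three content blocks, five rewrites each: 5 x 5 x 5 = 125 possible page versions.
    $block_a = array('Copy 1a', 'Copy 2a', 'Copy 3a', 'Copy 4a', 'Copy 5a');
    $block_b = array('Copy 1b', 'Copy 2b', 'Copy 3b', 'Copy 4b', 'Copy 5b');
    $block_c = array('Copy 1c', 'Copy 2c', 'Copy 3c', 'Copy 4c', 'Copy 5c');

    // Pick one variant per block at random each time the page is built.
    echo '<p>' . $block_a[array_rand($block_a)] . '</p>';
    echo '<p>' . $block_b[array_rand($block_b)] . '</p>';
    echo '<p>' . $block_c[array_rand($block_c)] . '</p>';
    ?>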

  69. Pingback: SEO Articles You Need to Read- Aug, 17-07

  70. So you decided to put it all out there on this one while many are standing by. You’re such a great SEO, and more than that, a great person. It must have taken a lot of guts to do this and finally decide to tell all about this bug that you found.

    I have heard and seen somebody from the Philippines talking about the 302 redirect also, and since I am not a pro when it comes to programming or even SEO, I just stood aside.

    I know somebody is doing a search on this, and I found your post about this one in the most prestigious forum here in the Philippines (SEOPH forum) through a friend.

    Thank you very much. Someday, when I come to understand the stuff that you are talking about here, I will thank you even more for selflessly helping others in the SEO arena.

    sam casuncad

  71. One of the best, most sincere Google articles I have ever read, if this is indeed a true problem.

    Possibly the best real story (not made for sphinn) that’s been featured there. Excellent analysis and coverage.

    Shaun

  72. Pingback: Friday Tea Time - 8/17/07 » TheMadHat

  73. Pingback: Asheville SEO & Web Design| Search Engine Optimzation News and Strategy » Defending Your Site Against a Google Proxy Hack

  74. Pingback: Jimmy Daniels » More (Mostly) Bad News for/from Google

  75. Pingback: links for 2007-08-17 : Christopher Schmitt

  76. Hi Dan,

    Your article surprised me very much.

    It was a little bit difficult for me to understand fully since I’m a Japanese.
    But I have introduced your experience and thoughts about this matter to my Japanese visitors on my blog.

    Thank you.

    P.S.
    I’ve joined Stomper SIMPLE before as well as SEOFS. :-)

  77. Pingback: Interesting websites for SEO, Web Marketing and everday work from Sante - August 17th

  78. Welcome, Japanese SEO! Let me walk you through what happens:

    1. Some jerk links to www .proxyserver. com/proxy/www .mysite. com
    2. Googlebot finds that link and fetches that URL
    3. My server gets a request from the proxy and returns the page
    4. Google indexes the contents of my page, under www .proxyserver .com/proxy/www .mysite. com
    5. If I’m unlucky, Google drops or penalizes my page as duplicate content

  79. Thanks for publishing this, disclosure is always best. Now you will have thousands of smart people working on a solution instead of a handful.

  80. Pingback: Google Proxy Hacking | SEO Blog: SEO-Web-Consulting.com

  81. Pingback: Google proxy hacking « Searchability 2.0

  82. Hi Dan, what you have published here is indeed very disturbing. I got in touch with support of my Canadian host "Sitebuildit SBI!"

    My case, together with your URL, has been passed on to their tech team. The following is what happened to me, and it is what caused a friend of mine to do the research that found your SEO publication.

    A homepage called [deleted] is using my indexpage of my humble homepage by quoting to it for their own means. What means? They are mere speculation of mine based on your pointing out the google bug.

    I found lots of entries including my indexpage in the index page of [deleted].

    None of the links work (I suppose they all are no follow links which is nasty enough) I fear that the owner of [deleted] might have a marchiavelli mind and, could really mean business.

    His motives "exploiting" other webpages that offer rental like [deleted] might be different than in my case.

    Don’t ask what business. I leave it to your better understanding.

    Anyway, I now leave things to my support sitesell.com team as everything technical goes well over my head.

    Gabriele

  83. Wow, guess it’s time to dig through my log files again. I have noticed a site – univeristySOMEthing – that is duplicating my content with a link back to the original publisher/me.

  84. Pingback: The WIzard, fkap

  85. Gabriele, I’ve edited your comments because this isn’t the place for naming names and making accusations, especially when you haven’t given your own URL so that I can investigate.

    If someone has copied your entire page, a DMCA notice to their hosting provider (use the ISP template) should be enough to get the site taken offline.

  86. Dan,

    I am curious how long it took for the site to come back into Google after implementing the fix, and whether a re-inclusion request was a necessity.

    Also, this looks a bit advanced to implement; is there a way to test it to ensure the site is not de-indexed?

  87. Guys, I have just sent an email to Google about this very thread, which helped me JUST TODAY to find my website hijacked and framed (all thanks to me coming here from http://incredibill.blogspot.com/). Now it seems all OK with Google search, but when I went to MSN Live search and just typed my domain as-is, without the www etc., lo, I was surprised to find a particular site that seemed to hog the first and second pages with links pointing to my site. So I clicked the link and sure enough it went to my site. All looked good, and seemed to be in place. UNTIL I checked the source, and the source code of my very own site just had this below. NOTHING ELSE. Rush and check every engine, guys. ————- Please visit the———————- Thanks Dan & Bill. I hope I have helped some wake up, in my own bumbling way.

  88. Sorry guys, the board has deleted the AdSense framed script from the above post... but it was just an AdSense-code framed script, leading to some thief's AdSense code to earn money off my effort and my back.

  89. Well, good on you Dan – It is about time this was brought out into the open.

    I had exactly the same thing happen to my main website back in June. Checked the site one morning and it was gone from Google – all 8000 odd pages. The site has been around for 10+ years with a PR5 on the main page through to numerous inner pages.

    Went over to Google Webmaster forum to try and find out what the cause could have been and after many postings and checks by other webmasters, found that the site had been proxy hijacked and my pages indexed under the proxy address. The one proxy was then being redirected through to another proxy and then on to another IP address.
    Quick fix was to 403 block the two proxies and the IP address that was accessing my server and also where Googlebot was indexing my pages.

    Filed a spam report through webmaster tools against the proxy and also filed a reinclusion request. Nothing happened at all, so back to GWMF and another post which got the attention of Adam Lasnik. Never really got a straight answer to why my site was deindexed but within a few hours the site was crawled and put back in the index. However, couldn’t find anything before page 4 or 5 despite having page one rankings for numerous keyword phrases prior to being deindexed.

    I have since filed two more reinclusion requests, but to no avail. On top of that, most of the site has lost its PageRank and I am also convinced that it is now being manually suppressed or penalised. The site doesn't come up in a search for the domain name either.

    The original proxy that had my pages indexed has disappeared from Google but the intermediate one is still there with some 3250 pages indexed (not mine). Both proxies were registered to the same person as well.

    This problem is very real and only seems to be applicable to Google as I haven’t seen any evidence of the same thing happening in Yahoo or Live.

  90. Thanks Dan for your work and openness in revealing this issue!

    I don’t do SEO any more and am not super technical so forgive me if I’m on the wrong track. I think this may help or at least shed light on part of the problem.

    My rankings are fine but I did the proxy test just to see.
    FYI as I test I’m copying the links to Netscape where I don’t have a G toolbar so I don’t encourage G to follow these links.

    So I found a Proxy and according to Whois it’s in CA. They have exact copies of 3 of my pages. (please don’t research and click with your G toolbar on!)

    http://nickoli.net/ looks like a normal site except there is a proxy link on their homepage. Go to the proxy page and check out what it says to do. It pulls up an exact dup of GOOGLE, and so I think when you search through their Google page and click on a site, that's when it makes copies. http://nickoli.net/superfun/fun.pl

    So a couple of questions. Is this what a normal proxy does (as opposed to a nefarious Chinese bot ring), just helping people surf anonymously? Isn't there a place at Google to request a page be dropped from the index? Is there a way to just report this site to them so they stop indexing this one? Looks like he has about 22,000 pages of other people's content indexed. I'm not personally too worried about this one. He doesn't have much PR or traffic. Just asking in case this example helps others with proxies they find.

  91. I'm tired of the problems caused by proxies; ban them. Linda, in your case use:
    deny from 67.159.0.0/18 (the whole data center). I spend about 1 hour a week looking for proxies and just ban the whole data center.
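
    If you can't touch the server configuration, a rough application-level sketch of the same idea in PHP (the CIDR-matching helper is just illustrative) might be:

    <?php
    // Ban a whole data-center range at the application level.
    // The .htaccess equivalent is simply: deny from 67.159.0.0/18
    function ip_in_cidr($ip, $cidr) {
        list($subnet, $bits) = explode('/', $cidr);
        $mask = -1 << (32 - (int) $bits);
        return (ip2long($ip) & $mask) == (ip2long($subnet) & $mask);
    }

    $bannedRanges = array('67.159.0.0/18'); // the range from the comment above

    foreach ($bannedRanges as $range) {
        if (ip_in_cidr($_SERVER['REMOTE_ADDR'], $range)) {
            header('HTTP/1.1 403 Forbidden');
            exit('Access denied.');
        }
    }
    ?>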

  92. This article was like a cold finger up my spine. The problem has existed for a year now and nothing has been done to fix it? So one day my site could be humming along nicely and the next…BAM…dropped thanks to some scumbag black hat proxy crap. Google needs to take this seriously and come up with a solution. If it means using website verification then fine. I (and other legitimate sites) have no reason not to provide verification to the Googlebot.

  93. By the way, the only reason I suggest going to Google is because it's the biggest search engine and people know what it is. I could easily suggest going to puppies.com first; the point is to initialize the proxy before going to a website, and I happened to use Google, even though, like I said, I could easily change it to puppies. The above post mentioning my website, I find kind of rude, just letting you know. :)

  94. Pingback: SiteMost’s Weekly Blog Recap 20/08/07 at Brisbane SEO Blog

  95. Well, Nick, it’s great to hear that your proxy isn’t “intentionally” hijacking content @ URLs like:
    http://www.nickoli.net/superfun/fun.pl/111100A/http/mixtapetorrent.com

    That’s great to hear. It’s good to hear that you aren’t just a thief. I’m glad to know it’s all accidental. Since that’s the case, I’m sure you’ll do the right thing and put a real robots.txt file on your server, to block spiders from accessing the content that you’re proxying.

    Sorry for being rude, dude… but your site is part of the problem.

  96. Nick, I find it kind of rude as well considering your site was one of the ones I keep having problems with. You aren’t gaining in any way from people exploiting your site and the end result is that you’re going to get completely banned from Google after too many people file spam reports. It takes 10 seconds to put a proper robots.txt file in place to fix the issue.

  97. I just checked My Wedding Favors and the home page PR is 0. Does this mean those dirty dog hackers found a way around the defenses??

  98. I’m telling you guys this is insane…

    Someone needs to start an unhackable search engine where we move all our sites on an exclusive basis and take them down from all other engines, leaving a copyright notice threatening action if they are found elsewhere.

    Then the majors might wake up and do something.

  99. @ Mark… actually, the toolbar PR data is just lagging very far behind, because that’s the way Google designed it. So the score you’re seeing does relate back to the time when the proxy hack was working.

  100. Mick, I don’t know how you’d build a foolproof search engine.

    I’m not an attorney, but I suspect that Google may have a “duty of care” in this instance, once informed of the potential harm… as would those operating proxy servers.

  101. This would appear to help if the attack is solely proxy-related. I have a feeling it will make little difference if the attack is more content-based, using static copies and free-blog copies, which may not even be associated with the proxies.

    Is there any evidence that the site had static clones aside from proxy sites? Is there any way to find out where the sites are that are linking to the proxies, and see if it's possible to trace the hands doing the attacks?

    Google needs a reasonable solution if, in fact, site owners are being penalized and/or losing content from multiple infringements via any application.

  102. Dan, thanks for the info, a little bit late though… I suppose you had your reasons. Hope all other SEO gurus start telling us their secrets too, Ha!

  103. I kinda feel like we did talk about it as much as we could – not just me, but Bill Atchison, at both SES and Pubcon. But nobody really reacted to "someone could steal your content and Google might see it as duplicate."

    SEO is a noisy space. There are still a bunch of people running around saying that duplicate content isn’t important. Just like there are people who say canonicalization can be left up to the search engines, so don’t bother with basic redirects.

    I would hope everyone who reads this post would understand how serious the problem can get.

  104. Dan,

    Here’s my $0.02 to help balance out a wee bit of hysteria going on around this topic and other forms of scraping at the moment.

    The flip side of this coin is what I quite often see, where someone finds they have a single page hijacked or scraped and thinks it's to blame for their entire site going down the toilet.

    If it’s the home page and all your inbound links are only to your home page, no deep links, then it’s entirely possible that hijacking the home page will tank the entire site.

    However, quite often I see a lack of inbound links, bad/incomplete/non-existent SEO, and other things causing the problems in the search engines other than that one copied page.

    People need to think before they jump to a conclusion that a single hijacked page nuked their site, unless it’s the home page.

    Doesn’t mean the proxy sites or scrapers aren’t causing you trouble, but there may be other issues at work to cause your site to be lower in the rankings.

    Last but not least, people need to be VERY CAREFUL about filing a false DMCA notice against a proxy site because the proxy site DOES NOT hold your content, unless they cache it, so the only place your content is actually duplicated is in Google itself.

    I mention this only because you can be sued for filing a false DMCA request so be VERY careful as I’m not sure which way the copyright law would fall on a pass-thru proxy service as they only pass the data through their server at the request of a 3rd party, which in this case, is Google.

  105. Very important points, Bill.

    I have no idea what the legal status of a proxy would be WRT safe harbor provisions either, but it hardly seems worth the risk to pursue a pass-thru proxy service over DMCA.

    On the other hand, if Google is holding a duplicate of your content, it’s probably reasonable to send them a DMCA notice. But as I’ve said, I am not an attorney and this is not legal advice. :D

    You’re absolutely correct in your assessment of where this hack has an impact – for most sites, if they can take out your home page, you’re going to have big problems. If your home page isn’t hijacked, it’s more likely to be just a nuisance.

  106. Being proactive is much better than waiting to see whether it "doesn't happen to you". Devote 1 hour a week to locating proxies and ban the whole data center where the proxy is located. Pacific Rim telcos are another problem in themselves; ban them.

  107. Hey, don't blame Nick for Google's bug, as Nick stole NOTHING, unless he cloaked a list of sites for Google to crawl via his proxy, which some of them do.

    Blame Nick for not installing a robots.txt that stops bots from crawling via his proxy.

    Blame Google for having a bug that attributes your pages to the wrong site.

    Lots of blame to go around, just make sure you blame the right people for the right thing.

  108. Absolutely, Bill. If Nick doesn’t have a robots.txt on his site in a few days… I’m gonna have to withdraw the benefit of the doubt though.

  109. Guys, does anyone have a good one-size-fits-all script that will break my site out of frames?

    I have just been searching for hours on the engines; the dogs have simply gone to Yahoo and MSN, as Google “seems” to be making a feeble attempt to do something, with a half-hearted clean-up here and there.

    I am serious here: someone should put this thread out as a press release to the wider media, stating that advertisers could be losing money on hijacked websites with no real value or traffic for their dollar.

    Let's see the shareholders and bean counters move into the programming section of the search engines then, eh?

  110. P.S. I have not even checked the other engines like Ask, etc.
    You will never see them come up in your stats (i.e. on your sites), because the dogs simply use your good name and keywords to point to their sites.

    Sorry for babbling on, guys; right now I am upset, with my head spinning.

  111. Sorry guys, I admit I'm not that technical. Now I'm stumped. Bill said: “Hey, don't blame Nick for Google's bug, as Nick stole NOTHING”

    Some of my pages are completely duplicated on his site.
    They aren't framed; the HTML source is all there too.
    And they are sitting on his server, so he copied them, no?

    I don't want to use my site as an example, so let's check out
    someone's site we all know (I chopped off the http://www):

    nickoli.net/superfun/fun.pl/111100A/http/searchengineland.com

    Bill also said: “Lots of blame to go around, just make sure you blame the right people for the right thing.” Maybe I'm missing the whole point, but this isn't just a page that a Google bot manufactured and listed. It's an exact copy of my home page sitting right on his domain.

  112. Pingback: What happen if you are dropped by the Search Engine? - 5 Star Affiliate Marketing Forums

  113. Pingback: And I though everyone knew! | Optimize Your Web

  114. Great post, thank you. I actually thought that everyone knew this. At my last SES this year I sat down with fellow speakers and if memory does not betray me this was one of the topics. Well then again I ended up armwrestling a Russian, but that is another story. Again great post, thanks!

  115. Linda, I think you’re confused about what a proxy site does.

    Your pages don’t sit on their server, they are passed through their server in real time.

    It works like this…

    1. BROWSER REQUEST –> 2. PROXY SERVER –> 3. WEB SERVER –> 2. PROXY SERVER –> 1. BROWSER RESPONSE (WEB PAGE)

    Does that make sense?

    Proxy servers are like a phone line, they only contain the content of the conversation while it’s in progress.

    If you see frames and such around your content, that's injected during the reply to the browser, much like music inserted into a phone call on hold.

    Does that help?

  116. Dan/Jamie,

    I have noticed that the traffic to one of my websites (built primarily in HTML) has dropped by 50%. I have done a few Google searches for specific keywords/phrases that would be used for my own site only, and have noticed a fair few web directories with my details that are achieving higher listings than my own site.

    While I appreciate this probably isn't the same thing as the Google hack you mentioned, it is still having a serious adverse effect on my revenue and status.
    Is there anything I can do to block these sites from indexing my website OR to push my site above theirs?

    If I block their indexing, would this inadvertently affect my Google ranking also?

    Thanks

    Kirk

  117. Thank you. This makes the life of a website owner so much more complicated and, in a strange way, interesting.

  118. Pingback: Google Proxy Issue - Any Third Party Can De-Index you! | Reviewer of Sites

  119. Pingback:   How A Third Party Can Remove Your Site From Google SERPs

  120. Take a close look at what “Nick” is doing. That is not a true pass-through proxy. I use JavaScript on parts of my pages.

    Nick, your proxy is removing my JS content. Your script is defacing my site. Once you start removing content from my pages, I don't believe you can claim to be innocent of Google having an incorrect “cached” copy of my work.

    Proxies that “deface” the original work are, I believe, on very shaky legal ground.

    Ban Them!

  121. Pingback: Google Bug Allows Third Party to Remove Your Site « Internet Marketing Blog by Noon-an-Night

  122. Pingback: ilovecode » Blog Archive » Proxies causing your site to get dropped from Google…say it ain’t so!

  123. Jim, in that case, they could and should. And don’t.

    That’s a simplified example, though – most proxies I’ve seen encode the target URL.

  124. If http://www.mattcutts.com/blog/ has 10,000 inbound links and http://www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/ has 2, then it's an open-and-shut case. However, if a trackback/comment spammer throws 100k links at the proxied URL (which is easily done), then it becomes more of an issue and can knock your page out of the results. When the engine finally figures it out, all the spammer has to do is repeat the easily automated process and do it again. There needs to be a check in place for this to be easily determined, yet there isn't. This falls to the search engines to correct, because they can't expect billions of fairly low-tech webmasters to understand how to stop it.

    “they could and should. And don’t.”

    My sentiments exactly.

  125. Let's say http://www.example.com is a proxy. Many web pages include a “copyright Nameofthesite” link in the footer.
    Google could check for these kinds of links, and if the URL differs from http://www.example.com in most cases, then Googlebot should know that www.example.com has the duplicate content and not the other way around.

    This could work only if the proxy has good traffic, because that would increase the chance of fetching pages with a copyright link in the footer.

  126. When you create an account with Google, you are required to upload a specific file that lets Google know you are the webmaster before Google will allow you access to sensitive information specific to your website.

    Surely this could be used as a method to decide which copy is the legitimate one and which should be dropped from the index?

    This should be quick for Google to implement and fix – I can't see why they would sit on this for so long.

  127. Some of the cgiproxy/phproxy/Zelune proxy sites are just set up by users trying to find their way around content-blocking firewalls, and they happen to end up in the search results.

  128. Dan, I don’t think everyone is realising that Google knows about the problem and hasn’t done anything. If Google was taking a proactive approach to the problem there wouldn’t have been any need for you to go public with the “script” and the reason for it.

    Protect your property, don’t expect an SE to do it for you.

    The proxy examples used here are very minor when you think about using a proxy for scraping. Google needs to be able to crawl the proxy “cache” to get your page/s. That cache is on a server somewhere. Find the server and ban the data center.

    Side thought: do you think that if AOL's proxies removed AdSense from all the pages AOL users got, we would be having this discussion? :)

  129. Very interesting. I knew, as many people do, of the inherent danger of duplicate content, but this shows it in a whole new light. Looking back, it does look blatantly obvious.

    You could, of course, always enlist the help of sites such as copyscape.com, although such a service does not check only for exact duplicates but also for partial duplicates, so you would receive a lot of unhelpful matches as well.

    Why not build an automated tool that uses Google's search API to ensure that your unique term does not show up on another website? This would at least alert you to a potential problem if it does happen, and then you could take steps to rectify it. This would save having to constantly check Google results manually.

  130. Peter, you can actually run Alerts @ Google (www.google.com/alerts) for the “signature SERP” that should be unique to your own content.

  131. Copyscape and alert @ Google are both useless because by the time you get those alerts the damage is already done and you’re taking reactive measures to fight the problem.

    The only way to mostly solve the issue is with proactive measures such as whitelisted bot blocking, blocking data centers and proxies, and anti-scraper tools to stop stealth crawlers.

    Sure, it’s draconian, but I sleep better at night knowing I’m not subject to dupe content problems unless someone does it manually.

  132. I completely agree with Bill. There’s no point shutting the gate after the horse has bolted.

    Proactive use of .htaccess and setting up a site just to catch all the nasties and then use that data on your revenue generating sites is the way forward.

  133. I have a website with Smarty cache and a Squid-based reverse proxy (HTTP accelerator). I'm afraid the solution provided above (the one with the nofollow/noindex meta tag) is useless for me. Depending on what's in the cache, it may fool the real Googlebot or let the evil content-stealing proxies through.
    Is it possible to move the mod_rewrite conditions and the reverse DNS matching against an IP list to a Squid custom redirector or authenticator (ACL)?
    How do I do it right?

  134. Thanks for the heads up, Dan. And the instructions on how to fight back are great, too.

    Hopefully by you coming out publicly on this they might actually do something about it.

  135. The Google Alerts service is useful for a lot of things, Peter. It can be tremendously useful to help identify link building and promotion opportunities, for example.

  136. I lost a $6,000/mo website and another $5,000 network of sites to this. I implemented something similar to solution #2 in September 2006, but I never was able to get back into Google…

    In other words, implement it NOW, before you're attacked and dropped from Google, not after!

  137. Pingback: Interesting Blog Post By Dan Thies… : newbie-network.com blog

  138. Pingback: Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SE » Niche Marketing Success

  139. WOW! I had actually wondered what had happened to the My Wedding Favors site, as I was watching it for a few weeks and I did notice when it disappeared. Then about a month later one of my sites totally disappeared from Google. I thought I had broken any number of Google's rules, but I couldn't figure out which ones. I had all original content, not too many links on each page, and I wasn't doing any keyword stuffing or any sort of black hat stuff at all, and yet I vanished from the first pages. I thought I just had to build more and more backlinks and redo my whole site, so I did, and now it is just barely coming back after about 10 months! I really wish I had known about this, because it sounds exactly like what was happening to one of my main sites.

    Dan, I really appreciate you spending the time and energy to sort this all out. You are certainly a credit to the web community. Hats off to you… uh, white hats, that is :)

  140. Bravo Dan. I wish we had more people like you on the net, providing good solutions while exposing a flaw in public. So many people have developed the habit of loudly pointing out flaws without really contributing to the solution of the problem. You have done both. Splendid job! Keep up the good work!

    I really hope that this will finally spark some reaction from Google.

  141. Dan,
    My hat’s off to you for this piece of work. I am a Stomper and wondered why myweddingfavors.com was no longer on the SERPs. You have answered that question nicely.

    Further, it is a tribute to you and the whole Stomper organization in doing the “responsible thing” in situations like this.

    Unfortunately this is not the first such problem. We have all suffered with black hats trying to knock off the biggest gorilla in the forest. That's why there are so many patches to Microsoft products fending off viruses and spammers. While Bill Gates and company are not my favorites, it is still their customers who suffer just as much, if not more.

    For Google to get the message, it has to hit home that someone is out to ruin a multi-billion dollar business that is founded on delivering the “most relevant” information. Once they understand that their business model is compromised by excluding the “most relevant” content then, and only then, will they take action.

    For instance, as soon as YouTube.com drops from the SERPS because of this type of tactic, you can bet it will get fixed.

    If you want to get Google’s attention, show them how and why it will negatively impact their bottom line, if this continues. Self-serving interests are usually the most powerful arguments.

    Again, my thanks for clearing up a mystery…or at least offering up a more than plausible explanation, and what to do about it.

    Good luck and again, Thanks for sharing.

    Regards,
    Carl Carlson

  142. Pingback: Can A Third Party Really Hack Your Site Out of Google? | Improve Your Internet Marketing

  143. Pingback: coffeeblack.co.uk » The Proxy Danger

  144. Todd: You should be able to contact Google and get reinstated in your original places. Contact Matt Cutts.

  145. Hey Dan, so you finally said in public what you have been telling Google in private (and not so private). It is such a shame that THEY have allowed innocent people to suffer HUGE loss of revenue through adsense etc (Google OTOH do NOT lose as they still get their slice).

    Do no evil, eh? Ignoring a known problem is evil; it makes one evil by default. To those who are pointing a finger at Dan and accusing him of ‘sitting on this’ I say: GOOGLE are the ones who sat on it. THEY KNEW!

  146. Guys, I have good news... Google apparently does listen, and help when they can. I sent them an email, as stated above, regarding this thread and my site being hijacked in frames on MSN well over 15 times, last I checked. Well, this morning I clicked the links that are hijacked and they lead to blank pages; perhaps they got in touch with MSN or something. I don't know how their engineers did it, but kudos.

    Here is their response email.
    ===================
    Hello,

    Thanks for letting us know about this issue. Our specialists are
    investigating and will take action as necessary. Please feel free to let
    us know if you continue to experience this problem.

    Sincerely,
    ***********
    The Google AdSense Team
    ========================
    It does help to send them screenshots and ask for help wherever you find the problem.
    Remember, a baby that does not cry gets no MILK :)

    And at the same time, by taking these guys down slowly, you could be helping others they have affected without realising it... let's pull together.

  147. I must also add to the above post that I had also gotten in contact with MSN, and only received a reply back from Google, so I don't know which party came to the Alamo.

    In any case, here is a link which helped me with a "captured in a frame" script, which I did not use until one of the engines had seen the problem.
    Hopefully it can be of help to some:
    http://www.loriswebs.com/hijacking_web_pages.html

  148. Thanks Dan for the very insightful information.

    I'd been wondering why Brad's site was knocked off the search results and sent to PR0 last year; now I know the reason. Although the wedding site still has a PR0, I see rankings are back up in Google for its keywords.

    Since all our sites run on PHP, we'll be looking into the reverse cloaking solution as an option in case this ever happens to us.

    Thanks again for sharing this SEO info.

  149. Pingback: WealthMountains News » Is your site dropped from Google?

  150. Pingback: Third Level Push (modified Siloing) For Deeper Index Penetration » Half’s SEO Notebook

  151. Dan,

    You said “If you blog regularly, it’s pretty hard for anyone to hang a ‘duplicate content’ label on your home page, because the content changes all the time.”

    If a webmaster posts a short RSS feed to a static, non-blog site to create this effect, would it create enough of a content change to fix the proxy hacking problem?

    Just curious, as I have multiple clients’ sites that I need to upload a solution to quickly, even if it’s a temporary fix.

    Thanks,

    Derrick

  152. Thank you. My site was dumped from Google 6 months ago for no reason, when it was #2 and heading for #1. I have implemented your solution and hopefully it goes back up the rankings.

  153. Hi Dan, I had a similar thing happen to me about 12 months ago, with just one website copying my content and presenting it as their own. Because their site was a 10,000-page portal – unlike my guitar mini-site – my PR dropped, and a direct search for my site found them at the top for all my unique keywords.

    Luckily I found out who owned the site, sent a cease-and-desist email about copying my content, and the site owner actually called me on the phone to say sorry – it was a PHP hack that was caching the page (with an affiliate link embedded). Google wasn't recognising it as such and dropped me for duplicate content. It took about 3 weeks to fix it.

    The big problem is this isn't just proxy servers. Anyone with a bit of PHP knowledge can download a page on the fly – even sending a browser user agent – modify it slightly, and then serve it back up. Blocking the IP doesn't help, because they may grab it through a proxy network.

    If they do it with enough sites from their own site, then they'll start rising to the top of the heap. Nasty. I hope you keep at this and find a solution, otherwise sites with affiliate programs may get dropped out.

  154. Thanks for such valuable information. Why, in spite of knowing about this for more than a year, is Google not taking any action? Are we supposed to take all these steps to keep ourselves safe?

  155. Mick,

    I would not be so sure Google will do anything serious about it. I got this answer:

    “Thank you for bringing this to our attention. Please be assured that your feedback will be forwarded to our search quality team who will use the information to improve the quality of our search results.”

    And then a few words about Google Webmaster Tools.

    It looks rather like an automatic response. Nothing has changed in the Google results either. I am losing hope.

    BTW: any ideas how to deal with this when using Squid as an HTTP accelerator?

  156. Yeah Al, it did cross my mind of course... but the more stink we put up about it, the more likely it might finally register that something has to be done. This is not sustainable for doing business on the search engines. The big knowledgeable boys know what to do in defence and deterrence, but the rest of us, the 98%, need help and guidance.

    As for "Squid as an HTTP accelerator", I am not sure, mate, as I was blissfully making websites and promoting them for someone else to profit until I discovered IncrediBill's blog about a year ago and started getting into why my sites seemed to lose links when I knew they were growing, and hence came here from Bill's blog... to continue the crusade and learn defence, as I am a noob in these matters.
    I hope one of our more techie friends chimes in for you regarding
    "Squid as an HTTP accelerator".

    my best.

  157. Pingback: Web Affiliate Creators

  158. Thank you very much for the article. There have been several times when my SERPs have dropped and I am left wondering why. Now I at least know one more reason why it might happen and how to fight back.

  159. I wonder how the first solution works, given that web proxies can strip meta tags.

    Regards.

  160. Hi there,

    Great article, and a great read. It’s certainly interesting to hear about things like this.

    I guess you're only likely to be affected if you have a “middle-ground” site. Small websites probably don't earn enough traffic for black hats to bother with them, and major sites are likely to be unaffected simply because of the other factors that contribute to their PageRank.

    It’s a scary thought though, and definitely something that needs to be looked at.

  161. @Derrick:

    The reason why blogs are hard to “duplicate” is the frequency of the content changes. If you post every day, your home page changes substantially every day.

    So if Google picks up a proxy copy, your home page is probably already changed when they come back to your site.

  162. “proxies can strip meta tags”. That is why they need to be banned and I do mean the whole data center where the proxy is located. What you are describing is a form of scraping, content theft.

    The script provided by Dan is only part of the answer to a much bigger problem. Use it while you continue to locate proxies and scrapers. Pacific Rim telcos are another major problem.

    “Small websites probably don't earn enough traffic for black hats to bother with them…” Think about 1 million pages each getting one hit a month. If your small site gets one hit a month, you are a potential target.

  163. @ Koz:

    The first solution only works with proxies that pass along the user agent. PHP-Proxy is fairly popular (zillions of sites running it), and it can strip meta tags, but it doesn’t spoof the user agent.

  164. Pingback: The Binary Cult Blog » Blog Archive » Google SEO DoS Exploit

  165. Pingback: Pushing WordPress SEO Boundaries | Andy Beard - Niche Marketing

  166. Dan – one aspect of the equation that I haven't seen addressed here is "why" a particular site gets targeted in the first place.

    Why Brad’s site? Are any sites safer from potential proxy hacking than others? Which sites are most likely to be affected?

    My blog buddy and I were talking about this article during our weekly coffee this morning. I personally applaud you for releasing this information.

  167. This is what I get from your post for us simpletons who find our collective eyeballs popping out of their sockets: People can steal your URL through some techie machination and when you try to reclaim your position you – not they – will be determined to be duplicate content by Google.

    Now, that being so, the entire worldwide web can become a hacker’s paradise. Search engine results which were painstakingly garnered to high levels of hits and targeted responses can be stolen again and again. The incentive to build websites is thus diminished, especially among those of you able to make dynamic, revenue-generating ones. The Internet moves into the Ice Age. And amateurs like me haven’t a fighting chance without hiring techies like you to oversee the evildoers.

    It seems to me that spammers and hackers have got a grip on our collective balls and the small guy without funds or technical savvy has no reason or motivation to compete.

    In my pea-brain analysis, the Internet may be becoming a haven for scavengers, fraudsters and charlatans alone. Honest business people are being marginalized. That does not mean that the good marketing guys and gals do not exist; it does mean that the evildoers are winning, and unless you can make advocates out of technical dummies such as myself, you haven't a Palestinian's chance in Jerusalem to turn the tide.

    You are ill-advised to talk techie to techies; speak to the common folks!

  168. Pingback: Where Is Their Self Interest?-Make Money Online With Snowboardjohn

  169. @Sparky – my best guess would be that it was a competitor, either from the retail or wholesale side.

    @Richard – wow, a lot of gloom and doom there. It’s gonna be OK. The good guys will win this one. I’m not trying to sell you anything, I’m trying to get the problem solved. It’s a technical issue. It needs a technical discussion.

  170. @ Mick
    Well, it looks like you are right. It is possible to get back to the “old” positions; at least it's been getting better for the last couple of minutes. I do not know what helped: mail to Google, changes in robots.txt, or the thousands of other things I did before and after I realised that an evil proxy took my places. But many days of hard work helped, and it is good that we only went down a little bit and are back in business. I hope for a longer time now.

    Keep fighting guys!

  171. Pingback: Extreme Web Traffic blog » Blog Archive » Can your competition kick you out of Google?

  172. Here is a possible simple solution to this and to the underlying Google duplicate content penalty, if the search engines will agree. Why not use the Content-Location HTTP header? And if the header is not enough, then a META tag version of it. That header seems to be intended to specify the URI of entities that are available from multiple locations. This should handle both cases: when you have duplicate content on your own sites, and when your pages get proxied or scraped.

    Of course a smart proxy could replace that header and meta tag, but then search engines should investigate further in the rare cases when two different content locations are claimed for duplicate pages.
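
    Purely as an illustration of the proposal (the search engines would have to agree to honor it, and the URL here is hypothetical), a page could declare its real location both ways like this:

    <?php
    // Declare the canonical location of this page in the HTTP response...
    $canonical = 'http://www.example.com/page.html'; // hypothetical URL
    header('Content-Location: ' . $canonical);
    ?>
    <html>
    <head>
    <!-- ...and as a META "version" of the same header, in case the header is stripped -->
    <meta http-equiv="Content-Location" content="http://www.example.com/page.html">
    <title>Example page</title>
    </head>
    <body>Page content here.</body>
    </html>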

  173. Proxies normally don't pass any HTTP headers along. An http-equiv meta tag could be applied, if the search engines supported it. But do browsers consistently support it? Are there potential issues there? (I don't know the answer, but according to this test page @ W3, my browser (Firefox version brand.new) isn't handling it right.)

  174. Pingback: News in 2007 » Why good content websites are penalised by Google

  175. I like the idea of using the domain verification tag to detect proxies.

    It does, however, only work as long as the proxies keep passing the tag along; they might start stripping the tag out, as many of them probably count on their proxied content getting indexed.

    In the end, Google needs to fix its algorithm before we end up with yet another round of nothing but junk results once all the real results have been dropped.

    Simon

  176. This issue came up for me last year when I began noticing a lot of proxy results, which I didn't understand at that time, except that I knew it was not a good thing. Unfortunately, I did not receive good advice; I was essentially told not to worry about it. Well, my primary site took a major nosedive and I realized I had been right, even though I was quite a newbie in the webmaster realm.

    Dan, kudos to you for shining a bright beacon on this issue and providing solutions.

  177. Pingback: New Design and Some Weekly Links << Vandelay Website Design

  178. Pingback: Google Local Listings Hijacked? » SEO Image Blog: Stardate

  179. It was great reading this article. You've provided quite a lot of info. Hats off to you. Combining everything I have known for some time with what you have written gives me an idea of how these things are done.

    Bad is bad and wrong is wrong, so I never support it. But one should know how this works in order to defend oneself.

    Regards,
    Addies.

  180. Pingback: Using Proxies to De-list Competition in Google | Blue Lotus Project Security Blog

  181. @Koz I am happy for you, mate.
    But I am starting to think the search engines overall don't give a damn. Yes, if you cry and complain and have someone on the inside helping you figure out what's going on, so you can rectify your problem, OK... but then it seems to be forgotten. That is the impression I am getting.

    In other words, as you said, we have to be proactive and act as if no one will or can defend your site but yourself.

    Did anyone read the link posted by Ban Proxies?
    http://www.forbes.com/2007/06/28/negative-search-google-tech-ebiz-cx_ag_0628seo.html

    What we need is more headlines with the situation explained correctly, for the big engines to make a move in protecting their shareholders and ultimately their bottom line.

    People don't understand that it is more serious than first thought. The more money your site turns over, the bigger a target you are, such as the My Wedding Favors site. And most of the time the motive seems not to be traffic hijacking exclusively, but actually phishing, as IncrediBill alluded to in his blog most recently.
    How many credit cards can these guys swipe and skim?
    No wonder Scott and Duke (from the Forbes article) charge $6,000 for hourly work, etc., because the rewards are huge for the payer.
    Yes, small-time guys hijack your traffic and sites for the traffic, and they can be dealt with, but as Bill alluded to in his blog, there are more sinister forces at work here.

    Very soon someone is going to start a lawsuit against one of the engines if they are not proactive in dealing with the concerns raised here in this blog.
    Because we submit in good faith, spend money on advertising in good faith, adhere to terms in good faith, etc., only to have the ball dropped by them for some thief in the night.
    Something has to be said or done by the big three.

    Or we are all going to join the black hat side out of frustration… If no one keeps up the rage, as usual everything will die off and pot luck will continue to be our theme on the search engines.

  182. Plenty of Web 2.0 user-generated-content sites are making money by ripping off our content (through their users), and I've been growing more concerned about how this washes out in the SERPs. Users of online clip storage services (as in multiple users) are doing a full copy-and-paste of my posts onto these kinds of websites (like Clipmarks and ClipClip), which are getting spidered and listed in the SERPs. I'm getting pretty nervous about this.

    IMO the real problem isn't proxy hacking; it's how Google and the other engines are determining who the original author is.

  183. Pingback: Third-party Google proxy hacking | Ranking on Google | SEO Handholding.com

  184. Terry, a lot of those sites confirm the URL actually exists; you can ban those IPs.

    Some of what you are referring to is automated, e.g.:
    – RSS feeds
    – Scraping

    When you find some site using a spider/scraper to access your content, locate the data center and ban it.

    I know what I'm advocating seems drastic to many, but it's now 2007; get used to it or perish.

  185. Chuck, I wouldn’t grant permission for a web reproduction, but we’re working on some guidelines for those who want to reproduce the content in other ways, such as email newsletters.

  186. Hello, Dan. Nice article.

    First of all, I suppose that stripping meta tags is not a problem for proxies. If they don't do it yet, I suppose that is a temporary phenomenon. If things are as serious as you present them, an arms race will follow, no more, no less.

    The only thing that could help at the next stage is a cloaking technique based on deliberately damaging the text content. The proxied text should be totally unfriendly to the search engines.

    Unfortunately I have no time to check all the comments carefully to be sure that the things I've mentioned haven't been mentioned earlier.

  187. Pingback: Google Proxy Hacking

  188. Pingback: SEO Buzz Blog » Blog Archive » Google Proxy Hacking

  189. Hi Dan, you mention in your article that Jamie is working on a solution for those of us who are on a Windows server and do not use PHP.

    Personally, I use basic HTML pages and some ASP pages and operate on a Windows server (for all of my sites). I would really appreciate some help with this, to avoid this trap and ensure that I protect the SERPs for my sites.

    Any news on the solutions from Jamie yet?

    Kirk

  190. Dan, thanks for pointing me in the right direction. Much appreciated. And well done on bringing this to people’s attention.

    Regards

    Kirk

  191. Pingback: Google hacking, did anyone experience this? - vBulletin SEO Forums

  192. Pingback: August ‘07: Best Search/Marketing Posts » Small Business SEM

  193. The question – should we be including the full post in our RSS feeds, or just the opening teaser paragraph?

    Yes, Google is trying to determine who the original author of information on the web is, but even Google says that you should consider asking people who republish it to block such pages with robots.txt – rather impossible.

    I have an article about the issue on my site – I added FeedBurner to my site, and had some comments about RSS feeds.
    http://www.searchmasters.co.nz/articles/89/rss-feed-statistics/

    Google’s guidelines make interesting reading:
    http://googlewebmastercentral.blogspot.com/2007/06/duplicate-content-summit-at-smx.html

  194. Hi Folks,

    As written earlier, we built a Drupal module for the “anti proxy hack”, which is located here:

    http://drupal.org/project/antiproxyhack

    For those interested in bleeding-edge alpha testing,
    you can already get the first version directly from Drupal's CVS server here:

    http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/antiproxyhack/

    All your feedback (as Drupal issues) is appreciated!

    thanks & all the best
    Christoph
    - the marketingfan

  195. Nice post, and well done for bringing quite a niche issue out into the open.

    I hope this problem will be noticed by many others and thanks for the resources on improvement and fixes.

  196. @Michael, anyone who is publishing your RSS feed really should be getting permission anyway, but we know that doesn’t happen. Publishing a short teaser/description in your feed makes good sense if you want to avoid having your content duplicated.

    @Christoph – thanks for that contribution.

  197. Pingback: A Never-ending Battle — Protecting your content from CGI hijackers † Hamlet Batista dot Com

  198. Excellent and informative article you have written here.
    Thanks.
    I have now undertaken the prevention steps (.htaccess) you outlined, in the hope that I can prevent this from happening.

    Really good info, and it explains what I have been looking into recently with a friend's site that is being proxy-copied.

    thanks

  199. Excellent post, this no doubt is shaking things up a bit across the web, but I think you did the right thing in exposing the proxy hacking loophole.

    As we have all noticed, there is an increasing number of scraper sites snagging and exploiting content from legitimate sources and posting it for monetary gain.

    Furthermore, providing the steps to protect yourself from this was a godsend. Something has to be done to protect your site after the countless hours spent promoting, writing world-class content and link building, only to have some jealous chump come along and throw in the monkey wrench.

    It’s a shame that websites that took years to build are falling prey to unethical crash and burn webmasters targeting their sites just to get a jump in the SERPS and of course a few bucks from the traffic.

  200. Pingback: By Design: Building Trust, Security, Links | Logo Design Works

  201. Dan, thanks a ton for this. I emailed you regarding what I believe to be an example of Google proxy hacking occurring on our site. Adding you to my “must read weekly” bookmarks.

    Thanks again!

  202. I wish I had found this sooner. This has been happening to us for almost a year or more. I tried to review the server logs, but it was hitting us so hard we couldn't figure out what was going on, so we assumed it was some sort of spam referral…

    It was always the same thing too…

    http://www.ourdomain.com/someone_elses-domain.com/our-content/ / / etc etc….

    Or

    http://someone_elses-domain.com/www.ourdomain.com/our-content/ / / etc etc….

    I have been banning the URLs using a shell on our server and it has helped a little, but I will definitely implement the first option now!

    Thank you for the information.

    Shawn DesRochers

  203. I put together a short article on this matter and how to protect against it.

    I coded my own bot verifier, and it works with Google, Yahoo, MSN, Ask and the Archiver.
    It caches DNS queries to MySQL, and you can configure the lifetime of the cached data.

  204. Hi Guys.
    I'm still struggling to find any info on protection when your website is not PHP. Mine is built in ASP and plain HTML. It operates on a Windows server.
    If anyone can help, please drop me a message through the contact feature on my website.

    Thanks.
    Kirk

  205. Kirk,

    I think ASP can do forward and reverse DNS. I don't even remember (6 years have passed since my last ASP interaction) whether it has it built in, but a component can do the trick, and there are many out there.

    To understand the process, read this basic and lame write-up: http://www.tellinya.com/read/2007/09/09/forward-and-reverse-dns-lookup/ .

    And as a last resort, consider PHP for the future! I started in ASP but… I hardly remember those days! ASP.NET has much more integrated (you no longer rely on components for everything) thanks to the .NET platform, and can do this easily.
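
    To make the process concrete, here is a bare-bones PHP sketch of the forward/reverse DNS check (no caching at all, which you would definitely want in production, since DNS lookups are slow):

    <?php
    // Verify that a visitor claiming to be Googlebot really is Googlebot:
    // reverse DNS must point to googlebot.com or google.com, and the
    // forward lookup of that hostname must resolve back to the same IP.
    function is_real_googlebot($ip) {
        $host = gethostbyaddr($ip);
        if (!$host || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
            return false;
        }
        return gethostbyname($host) === $ip;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'Googlebot') !== false
        && !is_real_googlebot($_SERVER['REMOTE_ADDR'])) {
        // Claims to be Googlebot but does not verify: treat it as an impostor.
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
    ?>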

  206. I have a question about solution 2 as it applies to a high-traffic site that generates web pages and caches them for a period of time.

    A PHP implementation of the “reverse cloaking” defense, putting a “noindex, nofollow” robots meta tag into all my web pages except when it is a LEGITIMATE robot crawling, makes total sense… but if I am CACHING the generated pages, all with noindex, nofollow, I think I have one of two problems: 1) either none of my pages will get indexed, or 2) my live pages (actively published by PHP), being generated and served with index, follow to legitimate bots, will cause my server load to skyrocket and my web pages to be served slowly, which also affects my site's search engine rankings because my pages are being served slower.

    How do you recommend a website that caches its web pages efficiently and effectively protect itself from this proxy hack and the content duplication penalty?

    One more thing… how can we estimate when Google, Yahoo! and MSN will switch over to considering our content as the ORIGINAL content? What makes the content on the proxy obsolete? Must all our proxy-duplicated content be updated with new material to help this process along?

    Thanks!

    JT McNaught
    http://www.IsPopularOnline.com

  207. JT, for starters I don’t know that anyone needs to implement anything unless they believe they have a problem. This isn’t an everyday thing or you’d see all kinds of site owners screaming about it every day.

    If you use a firewall or load-balancing proxy you may be able to inspect the user-agent before deciding how to handle the request. The only special case is one where the user-agent actually matches a known spider – for all other cases you could serve a cached page with noindex.

    If you have content that’s been duplicated and it’s causing an issue, then modifying the content, at least slightly, would be a prudent step.

  208. JT, consider caching only the contents of the page but putting the robots rules in the header. Check the Google blog to see that they have a new X-Robots-Tag, so you can send the robots directives in the HTTP header. So cache the HTML and serve the robots rules in the header based on the outcome of your verification.

    http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html

    If robots fail verification, I directly push them a 500 HTTP error. And no real bot has failed yet. MSN, Google, Yahoo and Ask all resolve well, except for that phx.gbl domain from MSN, which I have not figured out yet and have blocked!
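
    A minimal sketch of that approach, where is_verified_spider() is a hypothetical stub standing in for a real forward/reverse DNS check:

    <?php
    // Serve cached HTML, but decide the robots rules per request via the
    // X-Robots-Tag header instead of baking a meta tag into the cache.
    function is_verified_spider($ip, $ua) {
        return false; // hypothetical stub: replace with real DNS verification
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $cachedHtml = file_get_contents('/path/to/cache/page.html'); // hypothetical cache file

    if (!is_verified_spider($_SERVER['REMOTE_ADDR'], $ua)) {
        // Anything that isn't a verified spider gets the noindex header;
        // ordinary browsers simply ignore it.
        header('X-Robots-Tag: noindex, nofollow');
    }
    echo $cachedHtml;
    ?>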

  209. Hey Dan,

    Your advice has been right on. Thanks so much. I have been having all kinds of problems with rogue robots, and your insight has solved many of them.

    I never thought of caching the content separately from the headers. This makes total sense! I will read up on the X-Robots-Tag.

    Regards,

    JT

  210. Noobliminal… Seriously… that page was not intended to offend. I appreciated your solution and I thought it was very deserving of a mention and of traffic from my website to yours. That page was in no way intended to be even a partial duplicate of your script… rather, it is a traffic portal dedicated to promoting your awesome STRIP TAGS script! A frame of your website is a real visit to your website and your content. Our featured sites appear in a frame so that you benefit not only from a link… but also from the traffic of those currently browsing my website.

    Again, thanks!

  211. JT, dude, you copied the text from my page into yours. Oh come on… do I look that dumb? A frame is a valid visit? What about bookmarks or links? Remove the inner body text from my page. Keep only the title, a short description and the frame…
    I wouldn't have said this if your site had a bit of design in it! Nothing… Nada! It's a shame for me to be framed in that major HTML mess!

    PS: Noobliminal = Noob(beginner)+Subliminal. I’m an eternal beginner.

  212. Pingback: Is Google Going on a Banning Spree?

  213. Thanks Bestoptimized, it's worth a shot contacting Matt Cutts I guess. I tried contacting Google multiple times; they basically ignored me. I'm sure they didn't want to admit to this major issue a year ago. It's a good thing I'm a nice guy; it would have been way too easy to knock out the competition.

  214. Todd,

    One of the most recent proxies I've banned came from a public school in Illinois. The proxy was being used by a scraper. I just ban the garbage and keep going. Knocking out the competition would probably cause me more trouble than it's worth :)

  215. No offense, m8, but your solution is suicidal! Without caching, the server gets killed, as DNS queries are slow!!! SLOW!

    Cache the replies or delete your comment.

  216. Pingback: What a To Do » Blog Archive » Another Way Of Segmenting Your Visitors

  217. Pingback: Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem at Hamlet Batista dot Com

  218. It’s so brave of you Dan to spill this info out to the public. Thanks for the concern …

  219. I have been fighting this issue and I would like to support the “ban proxies” idea. I can provide examples of proxy sites that remove the metas, so any “noindex” or similar measures are helpless if they already outrank you. So the only thing you can do is figure out a way to remove your content from their website and wait for Google to reindex the stupid link that affects you. Just take a look at all those proxy engines and notice the “remove metas” check box below their search box. There are a lot of them, I guarantee.

    As many posts here mentioned, they do not cache the content, so displaying something else to their IP would solve the problem.

    A couple of issues:

    1. What if they do cache the content? Well, those need to be fought by the community: DMCA requests, requests for Google bans, and so on.

    2. How do you kill them all if you're affected by thousands of proxies? You probably can't on your own. It seems that once you've got yourself a nice duplicate content penalty, other proxied links may start to step on your tail, so you'll end up with a nice long list of proxy sites that come up in a Google search for some phrase of your unique content.

    What I would like to stress here is the idea that the community needs to fight it. If we put up a team that manages a list of banned proxies, I'm sure lots of people would be willing to report the proxies that are affecting them. So part of the work would mean testing and accepting true reports. Another part would be to protect the harmless proxies: for example, automatically testing their robots.txt to see if they “noindex” their proxy results, and coming up with some way to confirm that Google sees the same robots file as well. There may be other ideas too.

    So we could maintain and publish a list of malware proxies and their IPs that people can use in a database as a proactive defense. Even if Google comes up with a solution, I don't think our project would be in vain, even if it only means pressuring Google to build the solution.

  220. Yes, there are proxies that have a remove metas flag – PHP Proxy for one.

    There are two layers to the defense we’re using:
    1) Verify IP addresses against claimed user agents (this layer doesn’t rely on meta tags at all)
    2) Insert a robots meta tag for all but known, verified spiders

    Which, as I’ve mentioned, has held up so far. A rough sketch of both layers follows at the end of this comment.

    The “beauty” of the proxy hack is that the attacker uses third party web servers, not their own.

    It’s possible to construct a proxy that will thwart any defensive scheme (including IP lists) but it’s a lot harder to convince thousands of people to deploy it for you.
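
    To make that concrete, here is a minimal sketch of the kind of two-layer check described above – not the exact code we run, and the helper name and the Google-only host pattern are purely illustrative. Other engines would need their own verified crawler hostnames, and as an earlier commenter pointed out, the DNS results should be cached so you are not doing two lookups on every request.

    <?php
    // Layer 1: a visitor claiming to be Googlebot must pass a reverse DNS
    // lookup that resolves to a Google crawler hostname, and a forward lookup
    // of that hostname must point back to the same IP address.
    function is_verified_googlebot($ip, $ua)
    {
        if (stripos($ua, 'Googlebot') === false) {
            return false;                    // does not even claim to be Googlebot
        }
        $host = gethostbyaddr($ip);          // reverse lookup (cache this in production)
        if (!preg_match('/\.googlebot\.com$|\.google\.com$/i', $host)) {
            return false;                    // hostname is outside Google's crawler domains
        }
        return gethostbyname($host) === $ip; // forward lookup must match the caller
    }

    // Layer 2: everyone except a verified spider gets a robots noindex tag
    // (printed inside the page's head section). A request relayed through a
    // proxy reaches us from the proxy's IP, so the copy the proxy serves back
    // to Googlebot carries the noindex even though our own pages do not.
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!is_verified_googlebot($ip, $ua)) {
        echo '<meta name="robots" content="noindex,nofollow">' . "\n";
    }
    ?>

    Keep in mind this is only a sketch: a proxy that strips meta tags will remove the inserted tag too, so the meta layer on its own is never enough.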

  221. Hey guys, here is a fix of sorts that may help some of the people affected by the proxy hijacking technique. Depending on how you got the duplicate content penalty, it may or may not help you, temporarily or permanently.

    Here it is: feed fresh content to Google first. Basically, what you want to do is prepare a fresh page with revised content and keep it offline until the next Googlebot visit. When Googlebot arrives, switch the new page online and serve it to that same visit. What you want to change is the title, description, and page content; rearranging and altering a few bits here and there should do the trick. This may sound crazy to some people, but, as I said, some can afford this kind of change to their pages, and it might help them.

    I have successfully used this on a 150k-page website and got rid of the duplicate content penalty on the first Googlebot visit.

    If anybody has more results from using this technique, please post your info, as it may help others as well.

    This is in no way a replacement for Dan’s solutions presented in his post. You should implement those first and then try to get out of the hole with this content change idea. Normally a high-PR website should not be affected by these proxies, but as we all see, it happens in mysterious ways. In our case, the penalty came when we revamped our site: we sat at the top for two years, and then suddenly a proxy ranked better than us once we changed our content.

    So feed Google fresh content first when you revamp your site, if you end up in such a hole.

    Hopefully this will help others as well.

  222. I run Google Alerts on key phrases from my pages. If a page of mine were indexed via a proxy, I would get an alert for the phrase. At least I find out immediately, rather than not knowing at all.

  223. Pingback: How To Beat Proxy Hi-Jackers, and Have Fun While Doing It : Slightly Shady SEO

  224. Hi Dan and other people,

    I’ve just had a site disappear from Google and Yahoo, and the only reference I can find is my domain with this ref parameter on the end of it: mydomain.com/?ref=pislikcs.com

    Now, the ?ref=pislikcs.com has nothing to do with me, and I have no idea how to get rid of it.

    Does anyone know anything about it or what to do about it? How do I get it un-indexed and my old domain back in Google?

  225. Pingback: Google - All 2008 Nominees » SEMMYS.org

  226. Also, just as a side note: remember that it is extremely important to set up a proper robots.txt for your proxy. If you don’t, search engine spiders will crawl the web through it, and Google will index other people’s websites under your domain. That may seem great, but it’s pretty unethical if you ask me, and it causes a lot more problems than the benefit of “fresh content” is worth.

    Here is the proper format for PHProxy 5:

    User-agent: *
    Disallow: /index.php?

    Just put that in your robots.txt file and upload it to the root of your server.

    There are equivalent rules for CGIProxy and Glype (a rough example follows below).

    If you’re worried that you can’t do well in the search engines without stealing content, check out my site at http://www.proxybolt.com – I guarantee you I am doing very well in the SEs and have a ton of visitors without having to steal.
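
    By way of illustration, the Glype and CGIProxy rules might look something like the lines below, assuming the stock entry scripts (browse.php for Glype, nph-proxy.cgi for CGIProxy); check the URL format of your own installation before relying on them.

    User-agent: *
    Disallow: /browse.php
    Disallow: /nph-proxy.cgi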

  227. A Story.

    We read this blog entry, and we sat on the information, thinking it was too risky to install the script. What if the Google recognition check failed for a moment? Then a client’s website got hit by a proxy. We installed the anti-proxy script on just the index page, Google recached the page, and our rankings stayed for about six weeks. Then all of a sudden the client’s index page was decached from Google. The anti-proxy script was the only thing we could blame. We uninstalled the script, Google recached the index page, and rankings returned.

    I consider the script very risky to install. I don’t know why it failed; the logic seemed foolproof, but losing rankings was not cheap.

    At the moment, it’s not just proxy scraping. I am seeing websites disappear left, right and centre. All it takes is for another site to copy the meta description, or two words either side of a search phrase, and if you happen to be ranking for those phrases, you are dropped from Google. Sometimes your rankings stay; if your website is powerful enough you might just drop from first to 10th. But most clients that have had text copied are disappearing from Google. Then I change the text, and the sites come back the moment they are recached.

    I was seeking links from a top-ranking website in the morning. By the afternoon the site had disappeared.

    One idea is to be proactive about the issue: subscribe to the likes of http://www.copyscape.com, so you know when you need to change the content of your pages.

    I’ve just done a Copyscape search and found some Google-cached proxy copies of some of my websites… I wish the proxy protection script were trustworthy.

    This is very scary stuff.

  228. Michael,

    You may be right, or you may be jumping to conclusions on this. I’m not sure exactly what you implemented, so it’s hard to say.

  229. Pingback: Kakkoi

  230. This is something we should all be concerned about. I hope others will also provide more information about this issue, and Google should do something about it.

  231. Hi Dan, I hope you’re still reading these comments. I did as you suggested and Googled a phrase from my homepage: 16 words, same punctuation, and in quotes. Google shows 24 results. I checked each page one by one; many of them are from hackers offering my software for free. But there are four results from the same domain, tutorials-win dot com, and none of those have the phrase visible. I checked the code and it is not there either…

    Is this the same case that you mentioned, or something else?

  232. Hi Dan, it would be nice to have a general update on where we are: are the scripts still working, are the defences holding up, are there new and better ways, or are the old scripts still good? This thing we are fighting keeps changing.
    Regards.

  233. I used the script on just the index page, and it worked for several months; then all of a sudden the index page was decached and not included in the index. Somehow Google found the noindex and so decached the page. We removed the script, and the cache and SERPs came back.

    The problem seems more extensive now. All that needs to happen is for a search phrase snippet to be copied on a number of sites. This is enough to drop a high-ranking page from the SERPs. Change the content, and the rankings come back. But changing the entire content of a page, including the meta description, once every week is rather tiresome.

    How could Google let this go on for so long!!!!

    I have posted a comment on Matt Cutts’ duplicate content thread http://www.mattcutts.com/blog/duplicate-content-question/#comment-122817 and got no reply.

    Dan – are you able to raise this issue again with Matt? It has been months since you wrote this article, and almost two years since June 2006, and yet the issue still remains and has gotten worse.

  234. Dan, I agree that Google is working on that crap. My site has been affected for a couple of months now, and this month I see much more traffic coming back to the pages that were kicked out.

  235. Pingback: How Many Legitimate Business Did Google Kill? - Page 2 - WebProWorld

  236. SOLUTION:

    It would seem to me that if the SEs were to keep a chronological record of when duplicate content first surfaces, it would be a relatively simple matter to determine which one is the real McCoy and which one is the copycat. Of course, if an IP changes due to a hosting change, the original should have a redirect and/or an updated DNS profile.

    It’s one thing to complain about stuff; it’s another to also offer up a solution when you complain. I wonder if Google is listening to this blog?

    I doubt it, but you never know. Here, let me dangle a banana in front of the Google Gorilla and see if that gets his attention.

    One question that comes to mind, however: when proxy dupes get in the way like this, do they affect search results on other SEs as well, or just Google?

  237. Trusted webmasters? That might have been feasible in 1992, but you don’t realize the scale we’re dealing with here – though few people are able to grasp that.

  238. Pingback: Banned By Google | Andy Beard - Niche Marketing

  239. Using curl_init() you can claim to be Googlebot and to have the same IP as Googlebot, so an .htaccess or PHP $_SERVER environment-variable check will not help.

    As you said, if your website has authority, then a proxy duplicating you will not outrank you.

    How do you want Google to fight this? Who owns the original content? Authority is king; the rest are peasants!

  240. Igor, always a pleasure to be TROLLED by a professional.

    You can not change your own IP address with curl or anything else.

  241. Dan, I am not a hacker but a programmer. I read on php.net that one can make curl_init() send a different IP and host in the header.

    I need to research that in the curl library.

    But another way to do it is in the virtual host configuration. So if you are running a virtual host server, you can be Donald Duck if you want.

    Again, I am not sure how this is done, but it is all hypothetically possible.

    You may ask Matt Cutts – he is the Father of Black Hat SEO. ;-)

    Or one of his disciples would know as well.

    I do not want to drop any names, because I do not want to incriminate myself by association – and I also do not want the SEO Cosa Nostra after me!

    Thanks for the Troll remark.

    Love to Troll OPBs

  242. You are truly an Excellent Troll, Igor. I am truly honored to have you Troll my blog.

    If my server gets a request that appears to come from 11.22.33.44, my server will send the response there. If that’s not your actual IP, you won’t get the response, will you?

    If your server claims to be crawl55.googlebot.com (or donald.duck.googlebot.com), this is easily proven false by looking up the correct IP for that host.

  243. “If my server gets a request that appears to come from 11.22.33.44, my server will send the response there. If that’s not your actual IP, you won’t get the response, will you?”

    Very good and logical question. Not sure about this. I would imagine not, but who knows! Is the response based on the TCP/IP handshake or on browser sessions?

    Need to consult Dark Vader on this and will get back to you as more information gets illuminated from the Dark Side.

    May the force be with us to defeat Evil Google!

  244. There is a much easier way to detect bots: use a spider trap. You simply hide a link somewhere on your site that is invisible to normal users, e.g. /trap/trap.php

    That script then takes note of the IP address and adds it to your .htaccess file, blocking the IP after validating (using ARIN lookups) that it is a bot you actually want to block. You can have other scripts that expire the blocks or count the number of repeat offenses. A rough sketch follows below.

    Drak
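
    For anyone who wants to experiment with that idea, here is a rough, hypothetical sketch. The file paths, the whitelist pattern, and the log location are all assumptions; the trap URL should be disallowed in robots.txt so well-behaved crawlers never reach it; and letting a web script write to .htaccess carries its own risks.

    <?php
    // trap.php - reached only through a link hidden from human visitors and
    // disallowed in robots.txt, so anything requesting it is suspect.
    $htaccess = dirname(__FILE__) . '/../.htaccess';   // assumed location of the site's .htaccess
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                        // reverse lookup for a quick sanity check

    // Never ban the major engines even if they stumble in: the hostname must
    // belong to a known crawler domain and a forward lookup must match the IP.
    $goodbots = '/\.(googlebot\.com|google\.com|search\.msn\.com|crawl\.yahoo\.net)$/i';
    if (preg_match($goodbots, $host) && gethostbyname($host) === $ip) {
        exit;
    }

    // Append an Apache 2.2-style deny rule and log the offender.
    file_put_contents($htaccess, "Deny from $ip\n", FILE_APPEND | LOCK_EX);
    error_log(date('c') . " trapped $ip ($host)\n", 3, dirname(__FILE__) . '/trap.log');

    header('HTTP/1.1 403 Forbidden');
    echo 'Forbidden';
    ?>

    As the reply below points out, this catches misbehaving crawlers hitting your own URLs, but it does nothing about a legitimate spider being fed your pages through a proxy.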

  245. Drak,

    What you describe is a pretty standard step in preventing content theft – this particular issue is a bit more subtle.

    That’s a nice way to identify bad bots if you disallow /trap in robots.txt – since the bad guys rarely bother to read that file, and if they do it’s usually the first place they look. :D

    But it wouldn’t do much good when you’re trying to identify good bots coming in by proxy, unless you plan to allow them to spider past a hidden link.

    It also would not prevent them from grabbing other content (like your home page) at a proxy URL while you’re waiting for them to request the trap page.

  246. One way, if it is possible, would be to check on each of your pages whether the request comes from curl_init(). If it does, cross-reference the IP against your list of allowed bot IPs, and if it is on the list, allow it.

    This would at least protect you from the home-made proxies.

    But I searched for a way to check whether a request comes from curl_init() and found nothing.

    As for someone pretending to be an IP that they are not: that is very easy to do! Piggyback on another IP! Have that IP make the request. You can also use IP spoofing to make the handshake.

    Read this:
    http://www.computing.net/answers/programming/php-remoteaddr-integrity/9140.html

    Tell us if you find a way to check for a curl_init() request.

  247. A webmaster I am not, but I am fascinated by your article. This morning my website, which has been up for years, disappeared from Google’s index. For months it was #1 when I searched for many variations of juice plus john or juice plus canada. I have spent the day trying to find out where it went and how to get it back. Your article opened up a whole new possibility. I’d appreciate any tips.

  248. Igor, buddy… it’s been fun but I’m gonna have to shut you down now.

    There’s no difference between a spoofed header from curl and any other spoofed header.

    Even if it were possible to fake both ends of the forward and reverse lookup, it’s hardly necessary, if your only goal is to construct a proxy that will fetch a page and return it to a search engine’s bot.

    All you have to do is look like a normal user, strip out robot meta tags and X-Robots headers, and pass the result back.

    If this problem reappears (Google seems to have solved it for now), and a bunch of idiots start doing that, we have a number of additional countermeasures which would thwart the attempt.

    I will not discuss the nature of those countermeasures unless it becomes necessary to engage the issue again.

  249. Wow! Great article… my site was in Google for 7 months, but it suddenly disappeared two weeks ago…

  250. This is very distressing. I have often wondered whether this applies to articles when we distribute them. I’ve placed articles on my site and sent them out through Ezine or Isnare; would that be considered duplicate content? And does just the homepage get affected, or the entire site?

    Thanks for sharing this info.
    WS

  251. Some pages of my site were briefly replaced by pages from a proxy.

    I complained to Google (with an online spam form, not a DMCA letter or any formal avenue) and they have been removed from the index. I also emailed their hosting company and the site is now down.

    They were definitely doing it deliberately. Their URL resembled that of a legitimate site, and their front page redirected to proxy that site.