Why Google Can't Just "Dump" PageRank

Update: This post has been nominated for a 2008 SEMMY award – vote today!

When it comes to innovations in information retrieval, and there have been many over the years, none has achieved the legendary status of Page & Brin’s “PageRank” algorithm. Its elegance, its simplicity, its wicked subtlety… the sheer underlying truth of the thing. It’s beautiful.

Unfortunately, Google decided several years ago to release a search toolbar for web browsers, and they included a visual representation of PageRank. The green (or gray, or white) bar. The green pixels, as I’ve been calling them since it came out…  The obsession with those little green dots has loomed large in SEO ever since.

Now forget about the green pixels for a few minutes if you can. This is about the real PageRank score, not what they show you on the toolbar. It’s about why Google can’t just get rid of PageRank, why the Supplemental Index exists, and just for fun, why Robert Scoble shouldn’t talk about this stuff.

This may be a little bit heavy for some readers. Take your time. Read it twice if you have to.

Query-Dependent Ranking Factors: The Search Engines’ Secret Sauce

It’s probably safe to say that the vast majority of the factors search engines use to rank web pages in search results are query-dependent. That is to say, the search query itself affects which factors are involved, the weight given to different factors, and so on.

For example, when I search for “purple widgets,” the weight given to occurrences of those terms in a given document depends on the other documents in the index. The importance of word proximity and word order likewise depends on what else is out there.
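To make that concrete, here is a toy sketch of a query-dependent weight (classic inverse document frequency). The numbers and the formula are my own illustration, not Google's; the point is simply that a term's weight depends on the rest of the index, so it can't be attached to a document ahead of time the way PageRank can.

```python
import math

# Toy index statistics (hypothetical numbers, just for illustration).
docs_in_index = 1_000_000
doc_frequency = {"purple": 40_000, "widgets": 2_500}

def idf(term):
    """Inverse document frequency: the rarer a term, the more a match counts."""
    return math.log(docs_in_index / (1 + doc_frequency.get(term, 0)))

# "widgets" is rarer than "purple" in this toy index, so it carries more
# weight -- and both weights change whenever the index itself changes.
for term in ("purple", "widgets"):
    print(term, round(idf(term), 2))
```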

I have discussed query-dependent factors before, in the context of explaining why statistical analysis doesn’t give you the “magic number.”

PageRank is different, because it is Query-Independent.

The PageRank score for a document exists independently of the search query – it is a property of the document (URL) itself. The algorithm is patented, and public. In theory, if you had a list of every URL in Google’s index, you could determine the PageRank score of every document (URL).

However, you would not be able to arrive at the same score that Google uses. Your number would be different, because you don’t know which links Google is ignoring due to paid-link filtering and other link spam detection processes.
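For the curious, the published algorithm boils down to a simple power iteration. Here's a minimal sketch over a toy link graph; this is the textbook formulation from the original paper, not Google's production system, which (as noted above) quietly drops links before any of this runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {url: [outbound urls]}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its score evenly
                share = damping * pr[page] / n
                for p in pages:
                    new_pr[p] += share
            else:
                share = damping * pr[page] / len(outlinks)
                for target in outlinks:
                    new_pr[target] += share
        pr = new_pr
    return pr

# Toy graph: A and C link to B, B links back to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))
```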

Side note: before Google announced nofollow, they had to have a working and thoroughly tested implementation. How would they test it? By using an “invisible nofollow” internally, to tag untrusted links on untrusted pages on untrusted sites.

They’ve been filtering links (e.g. paid links) for years, folks. As I have discussed and illustrated in prior posts, nofollow is also a useful tool for SEO.

Some might argue with me, but I believe that for most search results, query-dependent factors are more important than PageRank.

This is somewhat obscured by Google’s heavy reliance on anchor text (they are not alone in this), which can make it appear that PageRank is more important than it actually is. Since the same links that pass anchor text also pass PageRank, you do tend to see higher-PR pages towards the top of search results.

Now some news that the SEO world hasn’t gotten yet…

PageRank’s True Purpose Is Not Ranking Documents

I was talking with Andy Edmonds one day, and he was sort of wondering why nobody really talks about the real benefit of PageRank to someone trying to operate a large-scale hypertext search engine. In Andy’s world (he studies this stuff a lot more than I do), PageRank is partly a performance hack.

Quick side trip – there are at least four steps that a search engine must go through to deliver search results in response to a query: query analysis (examining the user input), document retrieval (fetching 1,000 documents from the index), ranking, and presentation (output to the user).

The ranking stage may consist of an initial ranking and then a reordering (Google’s -30 and -999 penalties, for example) before presenting search results. Universal search may add specialized search results (news, images, maps, local, etc.) depending on the query.

The biggest benefit of using PageRank doesn’t come at the ranking step; it comes at the document retrieval stage, when you’re trying to decide which of the 27,438,902 documents that matched the query text are actually worth ranking. The search engine’s job at the retrieval stage is to pick 1,000 documents to rank, without losing any really important documents.

PageRank excels at this. In fact, PageRank’s role in the retrieval process is why Google’s “Supplemental Index” exists. When they created the SI, they recognized that some documents had such low PageRank that they were unlikely to make the cut for most search queries. Those documents live in the supplemental index.

Maintaining a smaller main index improved the speed of document retrieval, and if they couldn’t get 1,000 results from the main index, they could always dip into the supplemental index to get more documents. Google didn’t run out of “document ID numbers” for the main index (sorry, Daniel Brandt); they did it to improve performance.
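Here's a rough sketch of that retrieval logic, in Python for concreteness. It's my own illustration of the idea, not Google's code; the inverted-index structures and the hard 1,000-document cutoff are stand-ins.

```python
def retrieve_candidates(query_terms, main_index, supplemental_index,
                        pagerank, limit=1000):
    """Pick documents worth ranking, preferring the main index.

    main_index / supplemental_index: {term: set(doc_ids)} inverted indexes.
    pagerank: {doc_id: score}, computed offline and query-independent.
    """
    def match(index):
        postings = [index.get(t, set()) for t in query_terms]
        return set.intersection(*postings) if postings else set()

    candidates = match(main_index)
    if len(candidates) < limit:  # not enough? dip into the supplemental index
        candidates = candidates | match(supplemental_index)

    # The query-independent PageRank score decides who makes the cut;
    # query-dependent factors then rank the survivors.
    return sorted(candidates, key=lambda d: pagerank.get(d, 0.0),
                  reverse=True)[:limit]
```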

Google has made some changes to how they manage the supplemental index recently, and in many cases they may dip into the supplemental index when there are more than 1000 matching documents in the main index. Obviously, for some queries, the “best results” from the SI (based on query-dependent factors) are often better than results 900-1000 from the main index. Google’s on top of that problem. They’re working on it. They’re doing stuff.

So… if you think that Google is going to just dump PageRank, now you know why they can’t. It’s probably the greatest performance hack in the history of information retrieval.

How PageRank Can Become (Somewhat) Query-Dependent

I wrote a “somewhat speculative” report on Topic-Sensitive PageRank (PDF) a few years back, after Google’s “Florida Update” shook up the SEO world. Now, it’s just possible (OK, probable) that I was wrong about that. It’s also possible that I was right, and I’ll cling to that hope until someone at Google actually denies it.

However, we do know that Google implemented something like Topic-Sensitive PageRank, because they offered a couple topical search products (personalized search and site-flavored search), where you picked from a list of approximately 50 topics to skew your search results. In order to offer a topical search product, they had to have something in place to help deliver those results.

As I described in that paper, with a topical PageRank score for each document, Google could map some search queries to topics at the query analysis stage. They could use that information to retrieve a more topically relevant set of documents at the retrieval stage, and of course, apply those topical scores in the same way that PageRank is used in the ranking of the final result set.
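Here's a sketch of how that could work, following Haveliwala's published Topic-Sensitive PageRank idea rather than anything Google has confirmed: precompute one PageRank vector per topic offline, estimate a topic distribution for the query during query analysis, and blend. The topics, URLs, and numbers below are hypothetical.

```python
def topic_sensitive_score(doc_id, query_topic_probs, topic_pagerank):
    """Blend per-topic PageRank vectors by the query's topic distribution.

    query_topic_probs: {topic: P(topic | query)}, estimated at query analysis.
    topic_pagerank: {topic: {doc_id: score}}, precomputed offline per topic.
    """
    return sum(prob * topic_pagerank[topic].get(doc_id, 0.0)
               for topic, prob in query_topic_probs.items())

# Hypothetical example: a query that looks 80% "sports", 20% "business".
topic_pr = {"sports": {"espn.com/page": 0.9, "wsj.com/page": 0.2},
            "business": {"espn.com/page": 0.1, "wsj.com/page": 0.8}}
query_topics = {"sports": 0.8, "business": 0.2}
print(topic_sensitive_score("espn.com/page", query_topics, topic_pr))  # roughly 0.74
```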

Could they do this? Absolutely, given a small enough set of topics, it’s not out of the question for Google to use Topic-Sensitive PageRank. Can they do it with enough topics to make it worth the effort? Well, I think that’s more difficult. 50 topics wasn’t enough to make their site-flavored and personalized search products terribly useful.

To really do more with topical analysis, search engines would need to understand more, and that means some other kind of analysis. It also means that you’d have a hell of a time scaling it up to work on 20 billion or so documents.

Why You Can’t Have A Different Score For Every Keyword

Scale is a huge problem for search engines. The web is really big, for one thing. For another, there are like a billion people searching for stuff all the time. There are a lot of things you might like to do, that you can even do in a lab, that just don’t work in a large-scale public search engine like Google.

So, when I heard that Robert Scoble was telling people that PageRank is different for every keyword, I laughed at the sheer stupidity. Then I went and read what he actually said. He didn’t say that at all, in fact, he didn’t say enough for me to even figure out what he was saying. This left it up to others in the blogosphere to interpret his words for him.

That, in a nutshell, is why Scoble just shouldn’t talk about this stuff, unless he wants to take the time to explain himself clearly – it’s too easy for the first misinterpretation to become the standard interpretation… and then everyone thinks you’re an idiot. On the other hand, maybe I should be mindful of my own glass house.

Anyway, Robert… in case you really did mean what they said you meant, here’s why that’s dumb: it won’t scale.

“That’s no moon… it’s a gigantic RAID array!” (Chewie, get us out of here…)

To store a PageRank score for every keyword for every document on the web, I calculated that it would take a disk array the size of a small moon. I did the math in my head, of course, but you get the idea. It’s 20 billion pages times 200-300 unique words per page, and you don’t just have to store it; you have to calculate it.
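Here's the back-of-the-envelope version, with my own assumed midpoint of 250 unique words per page.

```python
pages = 20_000_000_000            # ~20 billion documents
unique_words_per_page = 250       # midpoint of the 200-300 estimate

scores_to_store = pages * unique_words_per_page
print(f"{scores_to_store:,} (keyword, page) scores")   # 5,000,000,000,000

# And storage is the easy half: every keyword would also need its own full
# PageRank-style computation over the whole link graph, re-run every time
# the web changes.
```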

Technically speaking, search engines don’t operate on keywords; they operate on queries, which may be up to 10 words long at Google. That makes the problem even harder. Let’s see, 300 factorial, squared, to the tenth power, draw the implied hypotenuse, carry the one… yep. The size of a small moon, requiring a Dyson Sphere to power it. At least.

They already have problems with supplying electricity and keeping the damned thing cooled down. No way.

Can Anchor Text Be Weighted On a Sliding Scale?

I get this question all the time, and I started to answer it in my last (SEO) post, but it really needs a longer explanation. Today, I’ll answer it, but I won’t explain the answer until next time. Actually, there are several variations of the question, but it goes something like this:

Is anchor text weighted differently based on the PageRank of the linking document, the amount of PageRank flowing through the link, etc.?

The short answer is no, but that might not be the right answer. The slightly longer answer is “maybe, but not the way you think.”

I will explain in great detail in my next post. I promise.

43 thoughts on “Why Google Can't Just "Dump" PageRank”

  1. For the record, I know Scoble isn’t that dumb… and he was probably alluding to the primary role that query-dependent ranking factors play in search engines.

  2. Dan – Another great and insightful article. Thanks for the food for thought.

    Some might argue with me, but I believe that for most search results, query-dependent factors are more important than PageRank.

    You were right on this assumption. :-) I agree with you partially on this.

    It is true that thanks to PageRank’s query independence and off-line calculation, Google can answer most queries in less than a second; but I don’t think we can say with confidence that retrieval performance is the most important benefit of PageRank.

    For example, if the formula fails to identify the important pages correctly, the search results will be of very poor quality. Proof of this is Google’s relentless fight against link spamming in order to keep their search results ‘clean’.

    PageRank excels at this. In fact, PageRank’s role in the retrieval process is why Google’s “Supplemental Index” exists. When they created the SI, they recognized that some documents had such low PageRank that they were unlikely to make the cut for most search queries. Those documents live in the supplemental index.

    I personally think that the main reason for creating the supplemental index was to improve Google’s recall, not necessarily to improve the performance of the queries. Why? Consider obscure searches: they would return very few results, and those SERPs can easily be enriched with results from the supplemental index (some ‘bad’ results are better than none ;-) ).

    Cheers

  3. Pingback: Search Engine Land: News About Search Engines & Search Marketing

  4. If Google are storing a page, along with which valuable links point at it with various anchor text, then to assign a value to the links doesn’t necessarily add too many binary digits of storage, and practically none in retrieval.

    In simplified terms take 8 links containing one particular keyword.

    That could be counted as
    1+1+1+1+1+1+1+1 = 8 or 00001000

    If you assign some juice to them

    2+3+1+1+2+2+1+1 = 13 or 00001101

  5. @Hamlet, why not have just one big index, if not for performance reasons?

    @ Andy, I’m gonna save it for the next post, but I’ll answer that. :D

  6. @Hamlet, why not have just one big index, if not for performance reasons?

    Good question. Please note that I am not saying that performance is not a consideration. What I am saying is that it is not necessarily the most important or the only one.

    Why have two indexes and not just one? For several reasons:

    1. The main index does excessive filtering. The majority of the filtering is done to thwart spammers. They also do filtering in order to prioritize indexing. Unfortunately, that filtering comes at the expense of a less comprehensive index (poor recall for obscure queries).

    2. The less important pages (the ones that are filtered out) are less likely to show up in regular search results. They will eventually show up for invisible long tail queries, and hence the need for another index to take care of those exceptions. The retrieval module only needs to query the secondary index when there are not enough results in the first.

    3. Here is where I agree with you. Searching through a trimmed-down main index improves precision and query performance. It is faster and more efficient to search through an index of the higher-quality pages first.

    In summary, having two indexes helps them improve precision by only consulting a list of pages that are very likely to rank (because they are important as measured by PageRank), and also improve recall, because they can consult the supplemental index when they can’t pull enough matches from the main index (as in the case of very obscure queries).

    Again, I think performance plays a role, but I don’t think it is necessarily the most important one. Precision and recall are two of the most important measures of the quality of a search engine.

    Cheers

    PS: Here I discuss some of my research on the supplemental index http://hamletbatista.com/2007/07/11/out-of-the-supplemental-index-and-into-the-fire/

  7. I got the basic stuff out of the way by giving away my book, man. For those who want more, I think the best next step is to understand how these systems work.

  8. Hehe, I am too lazy for that, I just tell people to download your book, plus Michael’s Revenge of the Mininet and Leslie’s dynamic linking

  9. Nice. I noticed that all of those are free.

    I should say, it’s one of the next logical steps. Another logical next step is to get really good at marketing your business.

  10. Very good read Dan.

    One point I’d like to make: if the number hasn’t changed, the initial number of documents fetched to build a data set for a query is 40,000, from which they’ll extract up to 1,000 to display.

    http://infolab.stanford.edu/~backrub/google.html

    To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

    I wholeheartedly agree with you in regards to Google’s “ability” to add the “invisible nofollow” to links they “don’t like”. The problem I see is with their ability to do so effectively and efficiently on a link-by-link basis, excluding only particular links on a page. The sheer amount of resources it would take not only to implement this across the web, but to maintain it, simply because of the fluidity of the links themselves, precludes them from doing so IMO. If this were the case, there would be no need for them to insist that links they don’t want to pass PR be tagged as such. They’d simply do it themselves and be done with it.

    PR is here to stay. You are spot on. While the “external opinion” (TBPR) they express may well be changing, their view on “importance” (the internal metric) could easily remain true to the original form, with its influence altered/adjusted. As you aptly put it…

    Its elegance, its simplicity, its wicked subtlety… the sheer underlying truth of the thing. It’s beautiful.

    Dave

  11. Pingback: Microsoft on Index Partitioning -SEO by the SEA

  12. Welcome, Cranky Dave! Nice first post. :D

    Once you recognize that *every* link they index could be nofollowed, it’s easy enough to see how filtering can be done. The problem is that the type of links that they want to filter can’t always be identified easily. Sebastian has a nice post on how search engines can walk link networks, which is something I discuss in my linking clinic.

    It’s a tough balancing act between being too technical and too simplistic. 7,000 word blog posts don’t get read, and they don’t get the kind of comments I get.

    This is what I like about using a blog, though – really smart people come in with a lot of great feedback.

  13. Thank you Dan, and thank you for the link to Sebastian.

    I’ve no doubt in Google’s “ability” to nofollow links. Just their ability to do so effectively and efficiently for the entire web. There’d be no need for them to insist that paid links or advertisements be tagged nofollow if they were.

    Anytime there’s a “network” or “scheme” there will be patterns. How easily they are recognized as such is another question altogether. But as you point out, the links they want to filter are not necessarily easily identifiable.

    IMO, that may be the key. What they are not able to easily identify may be the majority. “Easily” being the operative word.

    Dave

  14. Pingback: (EMP) E-Marketing Performance » : » Team Reading List 11.12.07

  15. Dan,

    Can you comment on the effect MSN/Live has had on PageRank? We have had several clients go AWOL in the Microsoft index. The result was a 2+ point drop in silly green pixels, resulting in frantic website owners / clients.

    When working with MSN on the issue, I got Vivian, the near-live automated response. After I proved to them that I did in fact know what to do and how to check the MSN indices, they replied with a strange message: they have the submission tool back up (was it down?), they have made a correction, and they suggested a resubmit and a wait time of 2 weeks. In the middle of the resubmit-and-wait sentence they stuffed in a “BUT”.

    Here is the message:
    URL submission page is back up and running, and you can re-submit your URL by following this link http://search.msn.com/docs/submit.aspx. After you re-submit your URL it will take a couple weeks for our robots to fully crawl and index your domain, but if it passes our spam and other filters your site will be back in our index soon.

    The “but” part seemed not to flow with the prewritten auto-responder, so I assume this was their hint that something had been, or possibly still is, setting off spam filters.

    Have you seen PR dropping from the loss of MSN/Live backlinks too? [Please move or save for later if this does not fit here. Thx Dan – keep us informed!!]

  16. First time I’ve heard anyone make a connection between MSN/Live and toolbar PR, Jason. A lot of sites lost a few green pixels in the last update; I doubt that it’s related to anything that’s happening @ MSN.

    Speaking of MSN, they do have some pretty bad indexing problems. Reinclusion does work, but it’s slow.

  17. Pingback: PageRank, worth it or not?

  18. Pingback: November ‘07: Best Search/Marketing Posts » Small Business SEM

  19. Google just likes making life hard, hands down. It seems their life goal is to make webmasters’ lives a little bit harder every few months.

  20. Pingback: Google announce the end of the supplemental index

  21. Pingback: Ebook Update: Suggested SEO Resources

  22. Pingback: Google - All 2008 Nominees » SEMMYS.org

  23. Pingback: Google - 2008 Finalists » SEMMYS.org

  24. Pingback: Google - 2008 Winner » SEMMYS.org

  25. Pingback: TheVanBlog Wins A SEMMY - TheVanBlog

  26. Pingback: The SEMMYS - Hobo SEO UK

  27. PageRank is one of the most important features that Google has introduced. It has gained considerable importance. Totally agree with your viewpoint.

  28. Hi Dan and all,
    Has anyone noticed a change in their PR lately?
    I’ve got 2 sites and both had an increase in PR.
    One from 3 to 4, while another from 0 to 3.

    Been hearing a lot that Google is updating some stuff… anybody know anything here?

    cheers

  29. Hi Crane Man,

    I have seen some changes too — 2 news sites from PR0 to PR4.

    Google has changed the way they are looking at backlinks again too. Check your sites. You will likely see a significant decrease in backlinks.

    Jeff

  30. Just checked my backlinks… no significant decrease though…
    I’ve heard a lot of news from other forums that lots of websites got their PR increased. Some went from N/A to 4 almost immediately… something is definitely going on…

    BTW… mine changed from 0 to 3! Should I rejoice? I am still waiting for some real answer to this.

  31. Unless you can trade those green pixels for hard currency, I think “rejoice” might be a little strong, but it’s always nice to see any sign of progress.

  32. A little green on the toolbar is fun to look at too. (8

    And if you are selling your sites as a package deal, adding a PR4 site makes the deal a little more enticing to a buyer.

  33. Hi Dan,
    I think what you are saying is valid. Although my sites have increased in PR, 2 of my keyphrases have dropped in rankings – actually they’re gone :(

    This is a little bit strange.

  34. What I think is that people can do whatever they want to you, and they often will. You, however, have the power to determine how you want to react to their actions. Google will adjust its PageRank scores in a way they deem fit (it’s their toy, after all), and ultimately you must decide how you will cope with the possible results.