Update: This post has been nominated for a SEMMY award – vote today!
When it comes to innovations in information retrieval, and there have been many over the years, none has achieved the legendary status of Page & Brin’s “PageRank” algorithm. Its elegance, its simplicity, its wicked subtlety… the sheer underlying truth of the thing. It’s beautiful.
Unfortunately, Google decided several years ago to release a search toolbar for web browsers, and they included a visual representation of PageRank. The green (or gray, or white) bar. The green pixels, as I’ve been calling them since it came out… The obsession with those little green dots has loomed large in SEO ever since.
Now forget about the green pixels for a few minutes if you can. This is about the real PageRank score, not what they show you on the toolbar. It’s about why Google can’t just get rid of PageRank, why the Supplemental Index exists, and just for fun, why Robert Scoble shouldn’t talk about this stuff.
This may be a little bit heavy for some readers. Take your time. Read it twice if you have to.
Query-Dependent Ranking Factors: The Search Engines’ Secret Sauce
It’s probably safe to say that the vast majority of the factors used by search engines to rank web pages in search results are query-dependent. That is to say, the search query itself affects which factors are involved, how heavily each one is weighted, and so on.
For example, when I search for “purple widgets,” the weight given to any occurrence of those terms in a given document depends on the other documents in the index. The importance of word proximity and word order likewise depends on other documents.
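To make that concrete, here is a minimal sketch of one textbook query-dependent signal, inverse document frequency (IDF). It is only an illustration of how a term's weight can depend on the rest of the index, not a claim about what any search engine actually computes. The two tiny indexes and the `idf` helper are made up for this example.

```python
import math

def idf(term, documents):
    """Smoothed inverse document frequency: rarer terms get more weight."""
    containing = sum(1 for doc in documents if term in doc.split())
    # The +1 terms keep the value finite even if no document has the term.
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0

# Hypothetical index where "purple" is rare.
index_a = [
    "purple widgets for sale",
    "cheap widgets and gadgets",
    "widgets widgets widgets",
]
# Same first document, but "purple" is common in the rest of the index.
index_b = [
    "purple widgets for sale",
    "purple gadgets",
    "purple paint",
]

weight_in_a = idf("purple", index_a)
weight_in_b = idf("purple", index_b)
print(weight_in_a, weight_in_b)
```

The first document is identical in both indexes, yet "purple" carries more weight in the index where fewer other documents contain it.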
I have discussed query-dependent factors before, in the context of explaining why statistical analysis doesn’t give you the “magic number.”
PageRank is different, because it is Query-Independent.
The PageRank score for a document exists independently of the search query – it is a property of the document (URL) itself. The algorithm is patented, and public. In theory, if you had a list of every URL in Google’s index and every link between them, you could determine the PageRank score of every document (URL).
However, you would not be able to arrive at the same score that Google uses. Your number will be different, because you don’t know which links Google is ignoring due to paid link filtering and other link spam detection processes.
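A minimal sketch of the published idea, using plain power iteration, shows both points: the score needs only the link graph, and dropping a single link changes the result. The four-page graph, the `pagerank` function name, the damping value of 0.85, and the handling of pages with no outlinks are all assumptions for illustration; Google's production implementation is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph as an outsider might crawl it.
full_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["b"]}
# The same graph with the spam -> b link ignored by a link filter.
filtered_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": []}

full = pagerank(full_graph)
filtered = pagerank(filtered_graph)
print(full["b"], filtered["b"])
```

Page "b" scores lower once the link from "spam" is ignored, so anyone computing scores from the unfiltered graph ends up with different numbers.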
Side note: before Google announced nofollow, they had to have a working and thoroughly tested implementation. How would they test it? By using an “invisible nofollow” internally, to tag untrusted links on untrusted pages on untrusted sites.
Some might argue with me, but I believe that for most search results, query-dependent factors are more important than PageRank.
This is somewhat obscured by Google’s heavy reliance on anchor text (they are not alone in this), which can make it appear that PageRank is more important than it actually is. Since the same links that pass anchor text also pass PageRank, you do tend to see higher-PR pages towards the top of search results.
Now some news that the SEO world hasn’t gotten yet…
PageRank’s True Purpose Is Not Ranking Documents
I was talking with Andy Edmonds one day, and he was sort of wondering why nobody really talks about the real benefit of PageRank to someone trying to operate a large-scale hypertext search engine. In Andy’s world (he studies this stuff a lot more than I do), PageRank is partly a performance hack.
Quick side trip – there are at least four steps that a search engine must go through to deliver search results in response to a query: query analysis (examining the user input), document retrieval (fetching 1,000 documents from the index), ranking, and presentation (output to the user).
The ranking stage may consist of an initial ranking and then a reordering (Google’s -30 and -999 penalties, for example) before presenting search results. Universal search may add specialized search results (news, images, maps, local, etc.) depending on the query.
The biggest benefit of using PageRank doesn’t come at the ranking step. It comes at the document retrieval stage, when you’re trying to decide which of 27,438,902 documents that matched the query text are actually worth ranking. The search engine’s job at the retrieval stage is to pick 1,000 documents to rank, without losing any really important documents.
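As a rough sketch of that retrieval step: given the IDs of every document that matched the query text, keep only the ones with the highest query-independent score. The function name, the `pagerank_of` lookup table, and the sample scores are hypothetical; only the "pick 1,000 by a precomputed score" idea comes from the text.

```python
import heapq

def retrieve_candidates(matching_doc_ids, pagerank_of, k=1000):
    """Keep the k matching documents with the highest precomputed score."""
    return heapq.nlargest(k, matching_doc_ids, key=lambda doc_id: pagerank_of[doc_id])

# Hypothetical precomputed scores for six matching documents.
pagerank_of = {101: 0.02, 102: 0.40, 103: 0.01, 104: 0.25, 105: 0.07, 106: 0.03}
candidates = retrieve_candidates(pagerank_of.keys(), pagerank_of, k=3)
print(candidates)
```

Because the score is computed ahead of time, the cut needs no per-query analysis of the documents themselves, which is what makes it cheap.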
PageRank excels at this. In fact, PageRank’s role in the retrieval process is why Google’s “Supplemental Index” exists. When they created the SI, they recognized that some documents had such low PageRank that they were unlikely to make the cut for most search queries. Those documents live in the supplemental index.
Maintaining a smaller main index improved the speed of document retrieval, and if they couldn’t get 1,000 results from the main index, they could always dip into the supplemental index to get more documents. Google didn’t run out of “document ID numbers” for the main index (sorry, Daniel Brandt); they did it to improve performance.
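The two-tier lookup described here can be sketched in a few lines. Everything in this example is a stand-in: the indexes are plain dicts of document ID to text, matching is a naive "all terms present" check, and the `retrieve` function is hypothetical.

```python
def retrieve(query_terms, main_index, supplemental_index, wanted=1000):
    """Search the main index first; dip into the supplemental one only if short."""
    def matches(index):
        return [doc_id for doc_id, text in index.items()
                if all(term in text.split() for term in query_terms)]

    results = matches(main_index)
    if len(results) < wanted:
        # Only pay for the second, larger lookup when the first came up short.
        shortfall = wanted - len(results)
        results += matches(supplemental_index)[:shortfall]
    return results

main_index = {1: "purple widgets", 2: "blue widgets"}
supplemental_index = {3: "cheap purple widgets", 4: "old purple widgets"}

short_query = retrieve(["purple", "widgets"], main_index, supplemental_index, wanted=2)
satisfied_query = retrieve(["widgets"], main_index, supplemental_index, wanted=2)
print(short_query, satisfied_query)
```

When the main index alone supplies enough matches, the supplemental index is never touched, which is where the speed-up comes from.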
Google has made some changes to how they manage the supplemental index recently, and in many cases they may dip into the supplemental index even when there are more than 1,000 matching documents in the main index. Obviously, for some queries, the “best results” from the SI (based on query-dependent factors) are better than results 900-1000 from the main index. Google’s on top of that problem. They’re working on it. They’re doing stuff.
So… if you think that Google is going to just dump PageRank, now you know why they can’t. It’s probably the greatest performance hack in the history of information retrieval.
How PageRank Can Become (Somewhat) Query-Dependent
I wrote a “somewhat speculative” report on Topic-Sensitive PageRank (PDF) a few years back, after Google’s “Florida Update” shook up the SEO world. Now, it’s just possible (OK, probable) that I was wrong about that. It’s also possible that I was right, and I’ll cling to that hope until someone at Google actually denies it.
However, we do know that Google implemented something like Topic-Sensitive PageRank, because they offered a couple of topical search products (personalized search and site-flavored search), where you picked from a list of approximately 50 topics to skew your search results. In order to offer a topical search product, they had to have something in place to help deliver those results.
As I described in that paper, with a topical PageRank score for each document, Google could map some search queries to topics at the query analysis stage. They could use that information to retrieve a more topically relevant set of documents at the retrieval stage, and of course, apply those topical scores in the same way that PageRank is used in the ranking of the final result set.
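One way to picture the blending step is a weighted sum: the query is mapped to a set of topic weights, and each document's per-topic scores are combined using those weights. This is a sketch of the general topic-sensitive idea, not of anything Google has confirmed. The topic names, document names, scores, and the `topic_sensitive_score` function are all invented for illustration.

```python
def topic_sensitive_score(doc_id, query_topic_weights, topical_pagerank):
    """Blend per-topic scores using how strongly the query matches each topic."""
    return sum(weight * topical_pagerank[topic].get(doc_id, 0.0)
               for topic, weight in query_topic_weights.items())

# Hypothetical precomputed scores, one table per topic.
topical_pagerank = {
    "sports": {"sports-site": 0.60, "health-site": 0.05},
    "health": {"sports-site": 0.10, "health-site": 0.70},
}
# Hypothetical output of query analysis for a query like "knee injury".
query_topic_weights = {"sports": 0.3, "health": 0.7}

sports_score = topic_sensitive_score("sports-site", query_topic_weights, topical_pagerank)
health_score = topic_sensitive_score("health-site", query_topic_weights, topical_pagerank)
print(sports_score, health_score)
```

The per-topic tables are still query-independent and can be computed offline; only the cheap weighted sum happens at query time, and the cost grows with the number of topics.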
Could they do this? Absolutely. Given a small enough set of topics, it’s not out of the question for Google to use Topic-Sensitive PageRank. Can they do it with enough topics to make it worth the effort? Well, I think that’s more difficult. Fifty topics weren’t enough to make their site-flavored and personalized search products terribly useful.
To really do more with topical analysis, search engines would need to understand more, and that means some other kind of analysis. It also means that you’d have a hell of a time scaling it up to work on 20 billion or so documents.
Why You Can’t Have A Different Score For Every Keyword
Scale is a huge problem for search engines. The web is really big, for one thing. For another, there are like a billion people searching for stuff all the time. There are a lot of things you might like to do, that you can even do in a lab, that just don’t work in a large-scale public search engine like Google.
So, when I heard that Robert Scoble was telling people that PageRank is different for every keyword, I laughed at the sheer stupidity. Then I went and read what he actually said. He didn’t say that at all. In fact, he didn’t say enough for me to even figure out what he was saying. This left it up to others in the blogosphere to interpret his words for him.
That, in a nutshell, is why Scoble just shouldn’t talk about this stuff, unless he wants to take the time to explain himself clearly – it’s too easy for the first misinterpretation to become the standard interpretation… and then everyone thinks you’re an idiot. On the other hand, maybe I should be mindful of my own glass house.
Anyway, Robert… in case you really did mean what they said you meant, here’s why that’s dumb: it won’t scale.
“That’s no moon… it’s a gigantic RAID array!” (Chewie, get us out of here…)
I calculated that storing a PageRank score for every keyword for every document on the web would take a disk array the size of a small moon. I did the math in my head, of course, but you get the idea. It’s 20 billion pages times 200-300 unique words per page, and you don’t just have to store it, you have to calculate it.
Technically speaking, search engines don’t operate on keywords, they operate on queries, which may be up to 10 words long at Google. Which makes the problem even harder. Let’s see, 300 factorial, squared, to the tenth power, draw the implied hypotenuse, carry the one… yep. The size of a small moon, requiring a Dyson Sphere to power it. At least.
They already have problems with supplying electricity and keeping the damned thing cooled down. No way.
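For anyone who would rather see the counting written down, here is a back-of-the-envelope sketch using the figures quoted above. The 4 bytes per score and the decision to count queries as ordered sequences of distinct words from a single 300-word page are assumptions made purely for illustration.

```python
import math

PAGES = 20_000_000_000   # "20 billion pages", from the post
WORDS_PER_PAGE = 250     # midpoint of "200-300 unique words per page"
BYTES_PER_SCORE = 4      # assumption: one 32-bit float per score
MAX_QUERY_WORDS = 10     # "up to 10 words long"

# One score per (page, keyword) pair.
keyword_scores = PAGES * WORDS_PER_PAGE
keyword_bytes = keyword_scores * BYTES_PER_SCORE

# One score per (page, query) pair, counting ordered queries of 1 to 10
# distinct words drawn from a page with 300 unique words.
queries_per_page = sum(math.perm(300, k) for k in range(1, MAX_QUERY_WORDS + 1))
query_scores = PAGES * queries_per_page

print(f"per-keyword scores: {keyword_scores:.2e} ({keyword_bytes / 1e12:.0f} TB)")
print(f"per-query scores:   {query_scores:.2e}")
```

Moving from single keywords to multi-word queries is what makes the count explode, and neither figure includes the cost of actually computing each score.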
Can Anchor Text Be Weighted On a Sliding Scale?
I get this question all the time, and I started to answer it in my last (SEO) post, but it really needs a longer explanation. Today, I’ll answer it, but I won’t explain the answer until next time. Actually, there are several variations of the question, but it goes something like this:
Is anchor text weighted differently based on the PageRank of the linking document, the amount of PageRank flowing through the link, etc.?
The short answer is no, but that might not be the right answer. The slightly longer answer is “maybe, but not the way you think.”
I will explain in great detail in my next post. I promise.