It sounds so seductive… by using advanced statistical methods, you can determine the best mix of on page factors for SEO. Wow, imagine the incredible competitive edge that you’d have. You could use just the right number of bold tags, figure out whether to use bold or strong, and you’d be an unstoppable ranking machine.
The only problem with this approach is that it’s complete bunk. Let’s look at a couple of examples…
A Statistical Lie I Kind Of Liked: MSN "prefers" sites that run on Microsoft’s own IIS
A while back, someone published a statistical study that appeared to show that MSN’s search results were far more likely than Google’s or Yahoo’s to contain pages from sites running IIS. Did people take this as a sign that they should move their web sites onto IIS? No, of course not… because Google has a much bigger market share, people actually thought maybe they should switch away from IIS in order to do better on Google!
So… now you wonder: is MSN rewarding you for using IIS while Google doesn’t care, or is Google rewarding you for using Apache while MSN doesn’t care? If Google doesn’t care and MSN does, then you rush to IIS. If Google cares and MSN doesn’t… enough! Spare yourself the circular reasoning before you go mad, and let’s consider some possible root causes.
At the time this study was published, I pointed out that there are many differences between IIS and Apache, aside from the names.
- Apache is the majority choice across the web as a whole, but as you move to larger sites, and in particular the corporate world, IIS has a much stronger position. So if Google crawls more of the web’s smaller sites than MSN does, it’s going to have a higher percentage of Apache-delivered pages in its index, which means, statistically speaking, you’re likely to see a higher percentage of pages on Google SERPs being served up by Apache.
- ASP.NET, whatever else it does, can come with a lot of extra baggage, such as the "viewstate" form fields that tend to get inserted, carrying 10-50k of utter gibberish text. So if MSN taught their bot to ignore this junk and Google didn’t… well, this alone might account for the statistical variation. Since it’s relatively easy to build a site on IIS without adding all that dead weight, it’s hard to blame the search engines either way.
The bottom line: search engines don’t care what kind of server you run. They might care how it behaves, but not about the name.
The Original Statistical Sin: Keyword Density
If you’ve never used "search engine optimization" software to tell you how to optimize your web pages, good for you. If you run keyword density analyzers to do anything other than extract search terms from web pages… stop. You don’t need to. Keyword density isn’t a factor – search engines just don’t work that way.
Keyword density is loosely defined as "the percentage of the words on the page that are your keywords." I can remember endless debates back in the late ’90s about the "right" way to measure it – did you count all the words, did you only count exact phrases? There was only one problem with those debates – we were all wrong. Search engines do not measure the "keyword density" of a web page.
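For the record, here’s a minimal sketch (in Python, with made-up page text) of the kind of arithmetic those old analyzers argued about – counting every word, and counting exact phrase matches as the keyword hits. It’s just the metric itself; nothing here reflects what a search engine does.

```python
import re

def keyword_density(page_text, keyword):
    """Naive 'keyword density': words belonging to exact keyword-phrase matches,
    as a percentage of the total word count. This is the metric the late-'90s
    tools argued about -- not something a search engine actually computes."""
    words = re.findall(r"[a-z0-9']+", page_text.lower())
    phrase = keyword.lower().split()
    # Count every position where the exact phrase appears
    hits = sum(
        words[i:i + len(phrase)] == phrase
        for i in range(len(words) - len(phrase) + 1)
    )
    return 100.0 * hits * len(phrase) / max(len(words), 1)

print(keyword_density("Blue widgets are great. Buy blue widgets here.", "blue widgets"))
# 50.0 -- and even deciding what to count was half the argument back then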
What they do is, in fact, immensely more complicated. Don’t follow that link unless you want to get hit with a firehose full of math, BTW… it talks about the vector space model, information retrieval theory, linearization, TF/IDF (term frequency / inverse document frequency), and other stuff that can give you tired head real fast. I’ll summarize what it all means in a minute.
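To give you a taste of what that firehose contains, here’s a toy illustration of the TF/IDF idea from the vector space model. This is my own simplified sketch with invented example documents – not Dr. Garcia’s formulation, and certainly not any engine’s actual scoring. The idea: a term counts for more when it’s frequent in your document, but for less when it’s common across every document in the index.

```python
import math
from collections import Counter

def tf_idf_weights(doc, collection):
    """Toy TF/IDF: weight(term) = tf(term in doc) * log(N / df(term)).
    'doc' is a list of words and is assumed to be one of the documents
    in 'collection'. Real engines layer far more on top of this."""
    df = Counter()                    # document frequency of each term
    for d in collection:
        df.update(set(d))
    tf = Counter(doc)                 # term frequency within this document
    return {
        term: (count / len(doc)) * math.log(len(collection) / df[term])
        for term, count in tf.items()
    }

docs = [
    "blue widgets for sale".split(),
    "buy blue widgets and blue gadgets".split(),
    "gardening tips for spring".split(),
]
print(tf_idf_weights(docs[1], docs))
```

The takeaway: the very same on-page repetition can be worth a lot or almost nothing depending on what the rest of the index looks like, which is one reason a universal "ideal density" can’t exist.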
So there’s no such thing as keyword density… at least to the search engines. However, the "fact" that keyword density isn’t even measured by search engines hasn’t stopped people from peddling their latest "statistical analysis" of the optimal keyword density.
There are two main approaches that are used to push keyword density:
- Take the top 10 pages for a particular search query, measure their keyword density (yes, I know nobody can agree on how), and then take the average score as the "ideal" keyword density. This is the approach most optimization software uses. Never mind that the #1 result may be 2%, the #2 result 41%, etc. – if the average comes out to precisely 4.67291%, then that’s what you shoot for… and while you’re at it, make sure your page matches the average number of words used by the top 10. (There’s a quick sketch of this arithmetic after the list.)
- Dive deeper: categorize the pages into buckets based on their keyword density, then analyze a whole bunch of search results and determine that, statistically speaking, the pages that fall in a certain range are more likely to rank higher. Depending on the search terms you use, the numbers will vary a bit, but generally the "magic number" is discovered to lie somewhere in the 1-4% range. Which, as it turns out, is pretty much what happens when you just write naturally.
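Here’s the arithmetic sketch promised above – the densities are hypothetical, but the math is the whole "methodology." One keyword-stuffed outlier in the top 10 drags the "ideal" average to a number that doesn’t describe any page that’s actually ranking.

```python
# Hypothetical keyword densities (%) measured for a query's top 10 results.
top_10 = [2.0, 41.0, 1.5, 3.0, 0.8, 2.2, 1.1, 4.0, 2.5, 1.3]

average = sum(top_10) / len(top_10)
ranked = sorted(top_10)
median = (ranked[4] + ranked[5]) / 2

print(f"'ideal' density by averaging: {average:.5f}%")  # 5.94000% -- courtesy of one outlier
print(f"median density:               {median:.2f}%")   # 2.10% -- what most ranking pages look like
```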
As you already know, there are other factors in play when it comes to rankings. In fact, "on page text" is probably nowhere close to the most important factor in SEO. What this statistical data should be telling you is that you are wasting your time by worrying about keyword density. If you translate all the technical stuff in Dr. Garcia’s paper on keyword density into English and then summarize, it says "use relevant keywords, but write naturally."
So yes, my friends, there is a magic number for exactly how many times and in what places you want to place your keywords. Unfortunately, without access to the entire search engine index and their ranking algorithm, you don’t stand a snowball’s chance in Hades of discovering what that actually is… and it’s different for every search query.
Even if you could measure the same things the search engines were measuring, you’d still be unable to get there with statistical analysis, because there are a few things that will tend to skew the statistical averages higher or lower than what’s optimal from a pure "vector space search" perspective:
- People who are doing SEO work to improve their rankings will probably tend to repeat their keywords a little more than the average writer… which might actually be WORSE for their rankings than writing naturally. But they also tend to do other things, like building links to their sites and using anchor text, to boost their rankings. This drives the numbers up.
- People in general are more likely to enjoy reading pages that are well written, with natural word use… and they tend not to enjoy reading keyword stuffed garbage. This leads to a general trend where "over optimized" pages and sites full of keyword stuffed jibba-jabba receive fewer links, which tends to reward sites that don’t have an extremely high "keyword density" and drives the numbers down.
- Blogs make a big difference in the math, because blog posts tend to gather more links over time (increasing rankings) and collect comments as they collect traffic (pushing keyword density down toward the average of natural language). This drives the numbers down, or at least toward the averages.
The bottom line on keyword density is that there is no such thing. Write naturally, write persuasively, write to communicate… because no matter what your keyword density is, you can always fire more links at the page to improve its ranking, but the only way to make your copy do its job is to write well, and forget about the damned search engines.
We’ll talk more soon.