July 25, 2007

Crawling Out Of The SI (For Large Sites)

In my last post, I explained a simple method for working on your site's indexing in Google… and I promised to give some more information for folks with large web sites (1000+ pages).

Unfortunately for those with large sites, the process can be "just a bit" more involved… and Google seems to have taken away a few of the tools we would have used.

So if we want to do this, we'll have to improvise a bit… Let's begin by reviewing some key ideas from SEO Fast Start.

The pages on your site can be divided into "tiers" based on how far they are from the home page. If the home page is the first tier, then any page that has a crawlable link from the home page is in the second tier. The first step is to make sure that your second tier pages are not in the SI.
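If you'd rather compute the tiers than eyeball them, here's a minimal Python sketch. It assumes you already have a map of which pages link to which; the `links` dictionary and its URLs are made up for illustration. It walks outward from the home page breadth-first, so each page gets the tier of its shortest click path.

```python
from collections import deque

def assign_tiers(links, home):
    """Return {url: tier}. The home page is tier 1 and every
    crawlable link away from it adds one tier (breadth-first search)."""
    tiers = {home: 1}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in tiers:  # first visit is the shortest path
                tiers[target] = tiers[page] + 1
                queue.append(target)
    return tiers

# Hypothetical link map: page -> pages it links to.
links = {
    "/": ["/widgets/", "/gadgets/"],
    "/widgets/": ["/widgets/blue.html", "/"],
    "/gadgets/": ["/widgets/blue.html"],
}

tiers = assign_tiers(links, "/")
second_tier = sorted(url for url, tier in tiers.items() if tier == 2)
print(second_tier)
```

The `second_tier` list is the set of pages to check first.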

There are plenty of tools that will give you an outbound link report for a web page. I like this one:
http://www.webconfs.com/search-engine-spider-simulator.php
Because it gives me a list of links that I can copy and paste into Microsoft Excel.
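If you want the same kind of list without a web tool, a rough sketch using only Python's standard library is below. This is not how the tool above works internally, just one way to pull the links out of a page you've already saved. The sample HTML and the `example.com` base URL are placeholders, and it doesn't look at nofollow or robots rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip repeated links
                self.links.append(url)

def outbound_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links

# Placeholder page source; in practice, read your saved home page here.
sample = (
    '<a href="/widgets/">Widgets</a> '
    '<a href="about.html">About</a> '
    '<a href="/widgets/">Widgets again</a>'
)
found = outbound_links(sample, "http://www.example.com/")
for url in found:
    print(url)  # one URL per line, ready to paste into a spreadsheet column
```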

Now, Google has thrown a bit of a monkey wrench into the works recently, because the info: search (which we would have used) currently isn't showing whether pages are in the Supplemental Index or not. Craig posted a search hack to see "just the supplemental results" last week… and that hack isn't working right now either.

Like I said, we gotta improvise. We gotta think outside of the box. We gotta revive an old hack that we used to use way back in the day… page tagging. Back in the day, when we wanted to see whether groups of pages were getting indexed, we'd just tag those pages with some kind of unique text.

So if I wanted to check on a set of pages (maybe my category pages), I'd add some text to the bottom of these pages, like "Page Code: Zebra" – this will get indexed, and then I can do a site: search for that text. This limits the # of pages that show up, and allows me to measure my indexing. If I have 55 pages in my "Zebra" group, I can determine how many are indexed, etc.
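Here's a small sketch of the bookkeeping, assuming you count the results by hand after running the search. The domain and the count of 41 are invented for the example, and putting the tag text in quotes is my own choice so that Google treats it as a phrase.

```python
def tag_query(domain, tag_text):
    """Build a site: search restricted to pages carrying the tag text."""
    return f'site:{domain} "{tag_text}"'

def indexed_share(indexed_count, group_size):
    """Fraction of a tagged group that shows up in the search results."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return indexed_count / group_size

# Hypothetical numbers: 55 pages carry the tag, 41 show up in the search.
query = tag_query("www.example.com", "Page Code: Zebra")
share = indexed_share(41, 55)

print(query)
print(f"{share:.0%} of the Zebra group is indexed")
```

Use a different tag for each group of pages (category pages, product pages, and so on) so each search measures just one group.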

This isn't the post I wanted to make this week… but it's all we've got to work with right now.

If anyone out there can come up with a search hack that will let us check the status of an individual web page, please post it here in the comments. At the very least, I'll send you a signed copy of the SEO Fast Start print edition. :D

Comments on Crawling Out Of The SI (For Large Sites) »

July 25, 2007

Suzanne @ 4:51 pm

Hi Dan, I'm a newbie at this so I may have gotten this wrong. When I use site:www.myshrink.com I get all my indexed pages and the ones that are in supplemental. Are you looking for more than that? Otherwise, it's been working for me for a long time.
Suzanne

James @ 4:57 pm

Sounds like after using this we can't tell if the pages are in the SI, just if they are indexed. I have a set of pages that are indexed and were in the SI last month. Now they are still indexed and don't show the SI label, although other pages do.

I did hear about Google blending the SI back in, but it still could be the case that they think those pages have thin content (which they do, it's on my to-do list to improve).

So I should assume that I don't know if they are actually in the SI or not?

Dan Buglio @ 5:01 pm

Dan:

I'm not sure I see the same thing as you. When I initially searched Google for:

site: http://www.mydomain.com (Probably the wrong syntax, but I included a space)

I got the results that were in the primary index only….NO Supplemental results. Kinda freaked me out. But when I eliminated the space between the : and the http, I got a full listing of all pages INCLUDING those in the supplemental index.

site:http://www.mydomain.com (No space after site: seems to work still.)

So at least for right now, I can still print out a list of all my pages in AND out of the S.I. I just printed a full listing just in case things change again with Google. Hope this helps!

Jason T Chandler @ 5:59 pm

Does this relate to the cache dynamically updating to my brand new content just FTP'd with a cache date that is a week old?

not sure if I can post a URL here so I will leave only part:
/google-services/google-trustrank.htm. I JUST made the page PINK in the body. Pushed the page live with DW. the cache shows the PINK page with a date from:
google-services/google-trustrank.htm as retrieved on 9 Jul 2007 05:46:14 GMT.

So the cached HIGHLIGHTED text is gone. Meaning that G may not be ranking keywords as much as passion and knowledge. Or in other words "socially accepted".

Good luck Dan, I have been following you since 2003. Good stuff!!

Jason T Chandler @ 6:00 pm

PS – both browsers (FF /IE) demonstrate this cache change.

Dan Thies @ 6:11 pm

Suzanne & Dan, doing a site: search will still show a mix of supplemental and main index results. The "hack" Craig posted last week was:
site:www.yoursite.com *** -sjpked
Which was showing *only* supplemental results… which would be very helpful for a large site since it would let us find more of the SI pages.

Jason, link away, you're good for it. If we have to edit a posted comment we will. For those who would like to see the page Jason's talking about:
http://www.jasontchandler.com/google-services/google-trustrank.htm

That's one ugly looking page, Jason… White on pink, it burns my rods and cones!!! The reason Google is showing the pink background on the cached page is that you defined that color in your stylesheet (CSS), which isn't cached, so it has to be loaded when we view the (cached) page.

Dan Thies @ 6:27 pm

James, right now Google displays "Supplemental Result" next to pages that are found in the SI.

The rumors about it basically run along the lines of:
- Google engineers are tired of answering questions about the SI…
- They are not just a little tired, they are very very tired of it…
- So they will solve their problem by removing the label from SERPs…
- Which would let them keep the SI, without having to explain it…
- Because nobody can see it any more.

Dan

Jason T Chandler @ 7:37 pm

hey Dan – since I have your ear… Is it true that Google can identify the content based on the GUI that produced it? We are getting 90% errors on some sites due to the server looking for massive amounts of files with the Dreamweaver lock on the end. LIKE this:

/images/blank.gif.LCK

Essentially G just added another factor to combat astro-turfers?

Dan Thies @ 8:02 pm

You got me on that one, Jason. Are you saying you've got spiders requesting those files?

Manny @ 8:26 pm

Dan, Jerry West already posted a hack on the stompernet forums, have you seen it?

Dan Thies @ 9:06 pm

Yes, Manny. It's the same one Craig posted. It was working last week. It's not working now.

David Leonhardt @ 9:19 pm

It seems to me that people spend a lot of time worrying about this whole SI thing that needn't be. As PageRank rises, fewer pages will be in the SI.

The best test for a given page really is to see where it ranks for its main search term, which is all that counts for that page anyway.

And the best way to get the page out of the SI is to build links to it or to its parent.

In my view, any page can be a first tier page if you build enough inbound links to it from other websites. So a page two levels deep from the home page can in fact be like a first tier page, and all the pages linking from it are much less likely to be lost in SI. But again, the key test is how each deep page ranks for its main search term.

The bigger the site, the more work this is (building deep links), but I am willing to bet that websites with a good array of linked-to pages just do not have the same SI concerns as sites with a high concentration of inbound links focused on the home page (of course, "I am willing to bet" is pretty poor evidence, but I'm not the stats cruncher to prove or disprove my own wild assertions).

Jason T Chandler @ 9:34 pm

domain.com site:domain.com

But I do not know how you are applying it. I have not been doing hacks as I do not do rank checks… (or maybe it is I am just too G'd out by the end of the day to think about rock pigeons flying backwards). Good Luck Dan!

Dan Thies @ 9:57 pm

Hi, David. Nice to see you here! Have you read the book yet?

I doubt you've watched my link building course, but it does cover the need for deeper linking beyond the home page.

Low PageRank for a deep page, especially as the site grows and obtains more inbound links, is still more likely to be a structural issue. As you move deeper, and the site gets larger, the exponential explosion of pages makes it exceptionally difficult to attract external links to every page.

(Assuming you don't have a giant link farm to work with… because I sure don't!)

Blogs are, or can be, an exception… if your blog has a sufficient presence and following you can expect every post to have some inlinks.

Dan Thies @ 10:17 pm

Jason, what we're after is some way to craft a search query on Google that will list ONLY the Supplemental pages from a site. I suspect we're up against Google's intentions on this one, though.

Jeff Knize @ 11:53 pm

Here is a search query that checks all C-Class datacenters and displays the number of URLs in the supplemental index at Google: http://oyoy.eu/google/supplemental/

For those interested, here is more information on the topic: http://oyoy.eu/huh/supplemental-tool/

Jeff

July 26, 2007

Dan Thies @ 11:05 am

Jeff, it doesn't work any more. It's based on the same broken hack. Stuff like this "tool" (which massively abuses Google's services) is what forces Google to change stuff up on us.

Jeff Knize @ 11:32 am

Okay Dan. I hope it comes back.

Thanks,
Jeff

Jason T Chandler @ 11:45 am

Dan – that is exactly what we have in our awstats when viewed with a log analyzer.

July 30, 2007

Mike Belasco @ 11:43 am

Try this
site:www.mydomain.com/&

Jeff Knize @ 11:56 am

Thanks Mike. It works!

Jeff

Dan Thies @ 2:34 pm

Mike, you are my new hero.

Mike Belasco @ 11:13 pm

no problem Dan, with all the help you've given me over the last couple years, it really is nothing

July 31, 2007

Gail Mills @ 11:05 am

Thanks Dan- I followed your recommendations re: tools for site testing re: Google SI etc. After testing and checking for content duplication, the only thing I could come up with is the duplication of my left hand index linking column on the site. The column is duplicated on every page. There isn't anything to keep the spider away from the content. What is usually done in this situation? I studied Leslie's 330 video- no follow etc., but I am still confused as to how to handle the situation. What do you recommend?
Gail

Gail Mills @ 11:13 am

Mike- I did a copy and paste and received a 404 error. What are the symbols after .com?
ie: site:www.mydomain.com/&- This is what I got???

Thanks,

Gail

Gail Mills @ 11:16 am

Mike-

When it went up on the portal board it was converted differently. I get the ampersand. (and) sign. I have an apple.
Thanks again.

Gail

Dan Thies @ 11:17 am

Gail, type that into a Google search box.

For your site, two things I'd work on:
1) Redirects so that you don't have so many URLs for the home page and stuff. You have http://www.q—.com, q—.com, http://www.q—.com/index.html, etc. and that should be reduced to one variation. You also want to make sure that you only link to one version. Jerry West's 301 redirect & htaccess tutorials in the Stompernet forums are great at explaining how to do all kinds of redirects.
2) More incoming links to get the site itself more PageRank – your internal structure can only get you so far. You need more linking & promotion.
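To show what "one variation" means in point 1, here's a rough Python sketch that maps the common home page variants onto a single URL. It only illustrates the mapping; the actual fix is a server-side 301 redirect, which this code does not perform. The `example.com` host and the choice of the www version as the keeper are assumptions for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the www version is the one you decide to keep.
CANONICAL_HOST = "www.example.com"
BARE_HOST = "example.com"

def canonical_url(url):
    """Map common variants of a page URL onto one canonical form."""
    # Allow bare inputs like "example.com" by supplying a scheme.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # keep the trailing slash
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

variants = [
    "http://www.example.com",
    "example.com",
    "http://www.example.com/index.html",
]
canonical = {canonical_url(v) for v in variants}
print(canonical)
```

If the set printed at the end holds more than one URL, you still have variants that need a redirect rule.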

August 1, 2007

bart van der Velden @ 9:43 am

Dan, I saw a hack on http://andrescholten.nl/index.php/bekijk-de-supplemental-index-van-google/
He says with "*-view" after the url in Google you get the supplemental results for that url. Is this correct?

Dan Thies @ 9:56 am

Bart, that's the hack that used to work. Mike Belasco posted a new hack here yesterday, that does work:
site:www.mydomain.com/& (add slash/ampersand after the domain)

Never mind, it's the Stomper Jerry West hack but with one instead of 3 asterisks, so the same as Craig's.

Gail Mills @ 6:21 pm

Thanks Dan- Followed your advice using the tool 'site://mydomain.com/&' for SI listings and they all disappeared. I have another question: The left hand column on my site is like a vertical nav bar, with links to different pages. However this vertical nav bar is duplicated on every page. Does this count as duplicate content? Is there a way to use some coding so it will be human friendly and not read by spiders? What does the expert recommend?

I just spoke with one of the people I recommended you to and he agrees you are the best.
Please explain exactly what you would like us to do as we 'spread the word'.

August 2, 2007

Michael VanDeMar @ 12:33 am

And with one fell swoop, they make it almost impossible to diagnose the problem, without actually doing anything to help it:

Death of the supplementals label.

Nice, Google, nice.

thanos @ 12:47 am

Hi Guys, this new hack doesn't work for me, or my websites are not in the SI any more :) . Can you give me a test domain name?

Thanks

Michael VanDeMar @ 12:50 am

thanos, are you saying it doesn't work because when you search you don't see the supplemental label next to each of the results…?

thanos @ 9:00 am

Hi Michael…Yes…
A week ago I checked with the old method to see my SR, and I saw some pages. Now I check with the new method and I don't see anything.
Or is it because I used the Google sitemap protocol on my websites?

February 29, 2008

Patrick Ryall @ 8:44 am

God, you're behind the times, aren't you? The supplementary index has been gone for a while and you're advising people how to index spam better. Well, all I can say is why not try some good content for a change and simple SEO, and maybe all would be sound.
Ranking and traffic is so simple: it is called value content, original content and a conscience. Give it a try, you could be surprised.

Dan Thies @ 10:26 am

Patrick, apparently you can't read the dates on old posts…

As for the rest of your little rant… LOL.
