Craig Francis


Search Engine CSS

Unfortunately this document is no longer relevant.

After a discussion on the W3C mailing lists, it was brought to my attention that we will soon see new tags like <nl> and <nav>, along with the role attribute. These come from XHTML 2 and the WHATWG, and will allow us to add more semantic meaning to our documents.

For example, by using a navigation list <nl>, search engines could ignore its content when it comes to indexing that page, but could still use the text found in the links to help index the linked-to document.
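
Taken roughly from the XHTML 2 working drafts of the time (so the exact syntax may yet change, and the page names here are made up purely for illustration), a navigation list might look like:

<nl>
    <label>Site navigation</label>
    <li><a href="/">Home</a></li>
    <li><a href="/history/">Company history</a></li>
    <li><a href="/contact/">Contact us</a></li>
</nl>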

Original document

I am sure most website developers have heard about CSS and what it can do to make websites look beautiful; there is nothing new there... it has been around since 1996 and has made quite an impact on how we work and build websites.

One aspect of CSS that I find interesting is being able to take a single HTML document and apply different style sheets in a way that benefits how each visitor views the website.

For example, most websites will design for the "screen" media type... but you could also make a style sheet for handheld devices (narrow display), for the TV (big fonts), and so on.
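
For instance, the same document could link to a separate style sheet per media type (the file names here are just for illustration):

<link rel="stylesheet" media="screen" href="screen.css" />
<link rel="stylesheet" media="handheld" href="handheld.css" />
<link rel="stylesheet" media="tv" href="tv.css" />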

How about adding a new one to the arsenal: one for search engines?

Let's call this media type "spider", in honour of those little robots which crawl through your website looking for all kinds of information. They are visitors to the website, and they are the ones we are targeting.
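
Adding it would then only take one more line... bearing in mind that "spider" is not a recognised media type, this is simply what the proposal might look like:

<link rel="stylesheet" media="spider" href="spider.css" />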

At the moment, search engine spiders need to guess what is important on your page, where the navigation bar is (if there is one), and where to find the real title of the page (supposedly the most important thing on the page).

All those spiders realistically have is the <h1> to <h6> tags and guesswork... lucky them!

So this is my proposal, which is up for debate... let's see where it takes us.

My problem

I have recently been working on a website where I structured my page in the following way:

<h1>Site Name</h1>
    <h2>Page Title</h2>
        <h3>Sub Section Title</h3>
            <p>...</p>
        <h3>Sub Section Title</h3>
            <p>...</p>
        <h3>Sub Section Title</h3>
            <p>...</p>
    <h2>Page Navigation</h2>
        <ul>...</ul>

NOTE: the <div>s have been removed to keep the example simple and, hopefully, easier to understand.

I was fairly pleased with this structure, as in my opinion this reflected how the document was laid out structurally, with the 2 main sections of the page being titled using the relevant <h2> tags. In effect, these become the children of the main <h1> tag, which is the grand master for the page.

Unfortunately on this website we managed to inherit an SEO expert, who insisted that the page title be put into the <h1> tag. I can see why though, as it does need to be made more important for search engines, but this changed the document structure to:

<h1>Page Title</h1>
    <h2>Sub Section Title</h2>
        <p>...</p>
    <h2>Sub Section Title</h2>
        <p>...</p>
    <h2>Sub Section Title</h2>
        <p>...</p>
        <h3>Page Navigation</h3>
            <ul>...</ul>

The indenting has been added to show how I think the data can be viewed (like XML), where this new structure seems to indicate that the Page Navigation is now a child of the last "Sub Section Title".

If it ain't broke...

So, we do not want to reinvent the wheel; the CSS structure already works perfectly for us (well, ignoring a particular browser's very poor support), and we can still target elements on the page in the traditional way.

And this is in keeping with the W3C spec, "separating the presentation style of documents from the content of documents", which is what we are trying to achieve... we do not want to alter the document's content or semantics, but present it differently for search engine spiders (one of the most important website visitors).

Some people may still argue that CSS is purely for design, but look at it this way... there is already the "speech" media type designed for screen readers. Rules like "volume" and "speech-rate" were not built for traditional design, but as a way to present the information to the visitor in the way the website author intended (with no guesswork on the screen reader's part).
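
As a rough sketch of what those existing rules look like (note that CSS 2 actually names this media type "aural", with "speech" reserved for CSS 2.1):

@media aural {
    h1 {
        volume: loud; /* read headings louder... */
        speech-rate: slow; /* ...and more slowly */
    }
}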

That's not navigation!

How do search engines know where the navigation bar is?

I don't know the finer points of how the spiders work, but I am fairly sure it is a guessing game for them, even at the best of times. For example, in the following HTML, what does the <ul> tag represent... a list of references for the article which should be indexed, or the website navigation bar?

<h1>Page Title</h1>
<h2>Sub Section Title</h2>
<p>...</p>
<h2>Sub Section Title</h2>
<p>...</p>
<ul>...</ul>

Normally we do not want the text found in the navigation bar to be indexed with our page. As an example, we might not want the text "company history", found on a link in the navigation bar, to be indexed with the "contact us" page. However, we do want the search engine to follow that link, in the knowledge that the link destination is about the company history.

But turn that the other way around. If I am writing a paragraph about ambivalence (with the word itself being a link), I really do want that text to be indexed with the page, as I have just been talking about it. Without the word "ambivalence", the paragraph might not mean anything.

So to give the spiders a clue on content you could use:

#mainNav {
    content: navigation;
}
#pageContent p:first-child {
    content: description; /* Description meta tag alternative */
}
#pageContent p strong {
    content: keywords; /* Keywords meta tag alternative */
}

Unfortunately the content property is already in use, so a new name will have to be assigned to this behaviour.
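
For the sake of argument, say the new name was "se-content" (an entirely made-up property, purely for illustration), the first rule above would become:

#mainNav {
    se-content: navigation; /* hypothetical replacement for the taken "content" name */
}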

Pesky comment spam

We have to thank Google for the ability to add the "nofollow" value to the "rel" attribute. This is an attempt to protect us from people entering unrelated links (viagra anyone?) into forums, blogs, etc in order to boost their search engine ranking.

But isn't this as annoying as having to write target="_blank" for popup links?

We need to take this further, strip it back, and simply add the following to the spider style sheet:

#userComments {
    links: ignore;
}

I do have some reservations about this approach though (as does Blair Millen from GAWDS): if it does work against comment spam, it should eventually become redundant.

Although this is only one example usage of the rule, it has the same effect as the "index,nofollow" value on the "robots" <meta> tag, but gives you more control over which links on the page should not be followed.
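
For comparison, this is the page-wide version that search engines already understand, which cannot tell the navigation bar apart from the user comments:

<meta name="robots" content="index,nofollow" />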

What is important in this world?

Moving on, we might need to tell the search engine what we think is important on a page. This is normally the title and its content, ignoring all the cruft around it (like the footer links).

This could be a simple "importance" rule, with values starting at 1 for most important and going up to any number the author decides is right for their website. It will also need to handle -1, which is the flip side of the coin... telling spiders to ignore the content.

So, for my website, I might have the rules:

#pageContent {
    importance: 5; /* General rule, to handle <p>, <ul>, etc */
}
#pageContent h1 {
    importance: 6; /* Site title, not relevant to this page */
}
#pageContent h2 {
    importance: 1;
}
#pageContent p#intro {
    importance: 2;
}
#legal {
    importance: -1; /* Do not index */
}

Some people might think this is open to abuse if a website author sets "importance: 1" on the body.

Fortunately this is not the case. You should not see it as an importance rating for this page in relation to other pages on the website or internet... but as highlighting the important parts of that page as an isolated unit.

For example, using the Google PageRank system: if you had a page which has been assigned a 3/10 importance rating (where 9/10 is for pages like the BBC homepage), that 3 rating would be divided between the elements on your page. This means that if you were to say that everything on the page has "importance: 1", the 3 rating would be distributed evenly between all the words on the page. On the other hand, if you set "importance: 1" on the title "earth sandwich" while the rest of the body had "importance: 3", then someone searching for "earth sandwich" would now be more likely to find that page.
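
To put that "earth sandwich" example into the proposed syntax (remembering that "importance" is only a suggested rule, nothing implements it yet):

h1 {
    importance: 1; /* the "earth sandwich" title gets the biggest share */
}
#pageContent {
    importance: 3; /* the rest of the body shares what remains */
}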

Testing

How do we know if our "spider" style sheets are working? It is not as if a standard web browser should implement this functionality.

Really it is going to be down to either the good old W3C or companies who run search engines.

For example, Yahoo could try to win over developers' hearts by implementing these changes into their spider, then providing an interface to display what their spider "sees".

The output of the test should be fairly simple though; for example, just 3 boxes could include:

  1. Multiple paragraphs, the first one holding the most important words, and the other paragraphs following suit. The order of the words might not matter, as they just make up the index. What the website author needs to do is simply ensure that the most important words (top) are not being polluted with pointless information, like "click here".
  2. A simple list of links which the spider will follow, with the text that will be indexed with the destination page.
  3. Text which is being ignored... for example, why pollute your page's content in the search engine's index with a privacy policy found at the bottom of every page?

Support me baby

Like traditional CSS, you do not lose anything fundamental if a search engine spider does not support this implementation. It is there to act as a guide, a way to help the spiders out.

Look at it this way... the spider gets an HTML document and pulls it apart. Now with the current batch of spiders, they will just ignore the rules... it doesn't matter, as it is not stopping them from doing their job. But if any spiders do look for these rules, they get a basic understanding of the document, and some of the guessing game disappears - theoretically no mistakes!

Now fly my pretties

This is still only theory, but it could potentially come into play very quickly, because each search engine spider only needs to implement it once and that's it. It is not like waiting for a browser to be released with the required features, then waiting for all of your customers to upgrade.

Any feedback would be greatly appreciated. I don't include comments due to the admin time required, but if you email me, I will reply and make appropriate updates. Also, if you would like to take a copy of this article, please read the terms this article is released under. This article was originally written Tuesday 6th June 2006 and was updated on Tuesday 17th October 2006.