Google and RSS

Posted January 29, 2003

Doc asks whether Google takes advantage of RSS to aggregate news and, if not, why not.

The answer is no, and for two reasons:

1. Few RSS feeds contain the full text of the article, and a search engine needs the full text to index. Sure, you could harvest the headlines and then index the site URL for the full text, and some sites do this, but the approach is not reliable. Headlines are very dynamic, and you increase the risk of sync problems if you pull headlines from one URL and content from another. Google scrapes headlines and indexes articles from the same URL. With RSS 1.0 and 2.0 it is possible to put the full text in an item, but commercial publishers are loath to do this, and even webloggers like myself like the traffic to go to the original article, where I can deliver the message as I want it to be read, with graphics etc. One possible solution is an RSS tag that carries a tokenized version of the full text of an article, so that it can be indexed by a search engine but is not human readable. I proposed this a while back, but there doesn’t seem to be much interest: a tokenizer would need to be built into the publishing tools, and RSS readers would need to be able to switch off the display of the tokenized text.

2. On a more conceptual level, Google uses fuzzy full text search as opposed to parametric search. What do I mean by this? Full text search engines differ from databases in that they are not optimized for structured queries such as ‘select from where’; instead they use relevance algorithms to rank results. A new breed of XML databases offers the power of structured, parametric searching combined with the retrieval performance of full text indexing. To take full advantage of metadata such as that contained in RSS, Google is not the right animal.
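The tokenized-text idea from the first point could look something like the sketch below. It is only an illustration under stated assumptions: the `indexTokens` tag name, the `tokenize` and `build_item` helpers, and the example.com URL are all invented here and are not part of any RSS specification or of the original proposal.

```python
import re
import xml.etree.ElementTree as ET


def tokenize(text):
    """Reduce article text to a sorted list of unique lowercase words."""
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))


def build_item(title, link, full_text):
    """Build an RSS <item> carrying a hypothetical <indexTokens> element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # 'indexTokens' is an invented tag name, used only for illustration.
    ET.SubElement(item, "indexTokens").text = " ".join(tokenize(full_text))
    return item


item = build_item(
    "Google and RSS",
    "http://example.com/google-and-rss",  # placeholder URL
    "Google scrapes headlines and indexes articles from the same URL.",
)
print(ET.tostring(item, encoding="unicode"))
```

A sorted bag of words like this is enough for keyword matching and is unpleasant to read as prose, but it throws away word order, so it could not support phrase searches. A real scheme would have to decide how much of the original text it is willing to give up.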

Of course, at the moment there is almost no metadata in RSS: a headline, a URL for the headline, and a description tag into which a bunch of other stuff that often varies in meaning is thrown. But RSS 1.0 and 2.0 are modular, and as adoption grows new modules will mean more metadata and a greater ability to do powerful ‘select from where’ searches across multiple tags.
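To make ‘select from where’ across multiple tags concrete, here is a rough sketch. It swaps in an in-memory SQLite table for the native XML database discussed above, purely because SQLite ships with Python. The sample feed, its authors and URLs, and the `load_items` helper are made up for illustration; the `dc:` elements come from the Dublin Core module.

```python
import sqlite3
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

# Invented sample feed: three items carrying Dublin Core metadata.
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Google and RSS</title>
      <link>http://example.com/google-and-rss</link>
      <dc:creator>alice</dc:creator>
      <dc:date>2003-01-29</dc:date>
      <dc:subject>search</dc:subject>
    </item>
    <item>
      <title>Weekend photos</title>
      <link>http://example.com/weekend-photos</link>
      <dc:creator>bob</dc:creator>
      <dc:date>2003-01-25</dc:date>
      <dc:subject>photography</dc:subject>
    </item>
    <item>
      <title>Old search notes</title>
      <link>http://example.com/old-search-notes</link>
      <dc:creator>alice</dc:creator>
      <dc:date>2002-11-02</dc:date>
      <dc:subject>search</dc:subject>
    </item>
  </channel>
</rss>"""


def load_items(feed_xml):
    """Copy each <item> and its metadata tags into a queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (title TEXT, link TEXT, creator TEXT, date TEXT, subject TEXT)"
    )
    for item in ET.fromstring(feed_xml).iter("item"):
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?, ?, ?)",
            (
                item.findtext("title"),
                item.findtext("link"),
                item.findtext(DC + "creator"),
                item.findtext(DC + "date"),
                item.findtext(DC + "subject"),
            ),
        )
    return conn


conn = load_items(FEED)
# A parametric query: constrain two metadata fields at once.
rows = conn.execute(
    "SELECT title FROM items WHERE subject = ? AND date >= ? ORDER BY date",
    ("search", "2003-01-01"),
).fetchall()
print(rows)
```

The query returns only the items whose subject and date both satisfy the conditions, with no relevance ranking involved. That is the kind of precise answer a pure full text engine is not built to give.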

This does not mean that Google absolutely can’t make use of RSS, just that a search engine built on top of a native XML database will be able to do things that Google can’t. Think about Google and exact phrase matching without removing stop words like ‘the’. This very simple type of query is fudged by Google, which doesn’t really allow precise structured queries.

Let’s take a search for the band ‘The The’.

On Google

On Altavista

See, Altavista wins: Google returns many items that don’t contain the search string.
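A toy illustration of why dropping stop words fudges a query like ‘The The’ is sketched below. This is not how Google or Altavista actually work; the stop word list, the three sample documents, and both matching functions are assumptions made up for the example.

```python
# Hypothetical stop word list and sample documents, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and"}

DOCS = {
    1: "The The played a show last night",
    2: "The best albums of the year",
    3: "Interview with the singer of The The",
}


def words(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def exact_phrase_match(phrase, text):
    """True if the words of the phrase appear consecutively in the text."""
    p, t = words(phrase), words(text)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


def stopword_match(phrase, text):
    """Match on whatever query terms survive stop word removal."""
    terms = [w for w in words(phrase) if w not in STOP_WORDS]
    tokens = set(words(text))
    # With no surviving terms there is nothing left to constrain the match.
    return all(term in tokens for term in terms)


exact = [doc_id for doc_id, text in DOCS.items() if exact_phrase_match("The The", text)]
fuzzy = [doc_id for doc_id, text in DOCS.items() if stopword_match("The The", text)]
print(exact)  # [1, 3]
print(fuzzy)  # [1, 2, 3]
```

Once ‘the’ is discarded, the query has no terms left, so in this simplified model every document qualifies, including the one that never mentions the band. The exact matcher keeps the stop words and returns only the two documents that contain the phrase.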