Archive for the ‘search engines’ Category

Is Google Beatable by Re-inventing Search?

Monday, May 12th, 2008

Powerset have a demo out and it’s interesting, technically proficient and built by a solid team, but winning requires questioning the premise: is better search a problem, and is it solved by changing the way people are currently used to searching to the way people naturally speak?

Google is a long-term threat to Microsoft’s hegemony not by having built a better OS, but by owning Search. The web shifted the landscape of technology, and a once niche application dominated by companies like Verity, full text search, became the ‘command line of the web’. Since Microsoft had always owned the command line, this made web search a strategic threat.

Powerset has some very bright people like Barney Pell behind it, and who am I to challenge it, but I have a nagging doubt, which is to do with my years spent in architecture rather than technology. In architecture the first thing you do is question the brief: if someone asks you for a building with a sloping facade, you ask why and you may have a good reason for doing something differently. If someone asks you for a better search engine, you would ask why. Here is my asking why.

If the value in building a better search engine is to beat Google, perhaps Google can only be beaten when something other than a search engine becomes a starting point for the web. It doesn’t take a stretch of the imagination to see that if Facebook became a truly monopolistic social network it would be a strategic threat to Google. If building a better search engine is the way to beat Google then Powerset is on the right track.

Is the way to build a better search engine based on the ability to answer questions the way they are spoken? If so, then natural language technology is the right approach and Powerset is on the right track. A few years ago this would definitely have been the case, but these days the ergonomics of the web have evolved in tandem with Google. People don’t tend to type questions into search engines, but type a few salient words. This may not be the most elegant practice, but it is the de facto standard behavior, and trying to change it might be like trying to change the QWERTY keyboard for a more rational one.

Assuming that there is a better search practice than the one currently used, how does Powerset stack up when natural language queries are typed into it? This would require very thorough testing, but I’ll give one example: ‘who was churchills father’ [sic]. Both sites return the correct answer, but Powerset requires adding an apostrophe: churchill’s. It is not a big deal for them to fix, but it is a perfect example of how a simple grammatical rule, dealt with by query parsing, can sometimes get forgotten in the attempt to index perfectly.
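The kind of query parsing rule meant here can be sketched in a few lines. This is only an illustration, not how Powerset or Google handle it: the function name `possessive_variants` and the rule itself (any purely alphabetic word of four or more letters ending in "s" might be a possessive missing its apostrophe) are my own assumptions.

```python
import re

def possessive_variants(query):
    """Return the query plus variants with a possessive apostrophe restored.

    Assumed rule (hypothetical, deliberately crude): an alphabetic token of
    four or more letters ending in "s" may be a possessive typed without
    its apostrophe, so we also try the "'s" form.
    """
    tokens = query.split()
    variants = [query]
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"[A-Za-z]{3,}s", tok):
            restored = tok[:-1] + "'s"
            variants.append(" ".join(tokens[:i] + [restored] + tokens[i + 1:]))
    return variants

print(possessive_variants("who was churchills father"))
```

A search front end could run each variant and merge the results, so the user never has to type the apostrophe. The rule will also fire on ordinary plurals, which is harmless if the extra variants are only used to widen recall.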

Lastly, intelligent indexing comes at a cost - it may be slower to query, and it is definitely slower to index. Quick response time has always been a priority for search - and Powerset can possibly match it. But the biggest change in search in this second phase of the web has been the rise of ubiquitous, news style (e.g. weblog) publishing systems and the importance of search by date. AltaVista’s last throw of PR success against Google was their news search, which was pounded after 9/11, before Google News, let alone weblog search, existed. Fast updates require fast indexing.

I wish Powerset every success, and think that this will come when something else is thrown into their mix.

In defense of Technorati

Wednesday, May 16th, 2007

After 9/11 AltaVista scored some rare Brownie Points against Google in the press, because Google didn’t have news search but AV did, via Moreover. Google News was built largely as a result of 9/11.

It proved that Moreover was a news search engine, but it was too late.

Our PR company had told us that ‘search is dead’ and without a revenue model for search engines, there was pressure from all sides to make Moreover something else - which resulted in all sorts of convoluted bullshit and meant that Moreover never had decent technology for full-text search.

Eppur Si Muove.

Technorati is in the same boat. There is probably a great deal of pressure not to call it a blog search engine, perhaps for a different reason than at Moreover - that Google is too difficult to take head on.

I may sound arrogant, but I can’t help feeling that I’ve seen this train wreck before, and I am also a particular fan of Dave Sifry, who I think has had a rough deal lately.

Technorati is a Blog Search Engine, period. It has some peripheral features that help differentiate, but they are peripheral. Anyone on the board or within the company who thinks differently is a liability.

I say all this because Google have finally got their act together with Blog Search, and the window of opportunity for Technorati to be what they are, and what they created, is closing.

At last - Dapper

Thursday, August 17th, 2006

Dapper fills a perfect niche.

People forget that before RSS there was screenscraping. And that after RSS there is still screenscraping. Most of Google News is scraped and does not come from RSS.

Amazingly, because nobody really puts any useful metadata in RSS, you still need to screenscrape to produce useful aggregation services.

Other than enterprise companies such as WebMethods, which had a scraping tool as part of a web services builder, or the innovative Junglee, which was snapped up by Amazon before the last .com boom got underway, nobody has built an online screen scraping tool, despite the fact that it’s actually a massive gaping hole in the fundamental services of the web.

At Moreover.com, RSS was largely useless to us, because you can’t build a news search engine without full text, and the bigger news sources don’t want to output full text RSS without prior negotiation. So, like Google News, we were managing tens of thousands of scrapers for search engines like MSN and Yahoo - which is a pain in the ass.

Because this is a pain in the ass, Dapper is a damn good idea, but because people imagine that RSS is something it’s not, they may not realize it.

If the right people get to using it, Dapper could become a prime mover in making RSS be what people think it is, allowing people to build good vertical search services such as real estate where you want to search by number of rooms etc.

Waiting around for people to create a real estate module for RSS may not be practical. It would be better to scrape and then make the module yourself, using Dapper.

For Dapper to succeed I’d guess that they need to focus on a community of content aggregators, rather than be purely a software service.

Dapper: The Data Mapper

Google’s Gmail adds Map This links to addresses mentioned within emails.

Thursday, January 12th, 2006

I just noticed that Google add automatic Map Links when something that looks like an address appears within a message in Gmail.

This kind of on-the-fly detection of metadata to create searches could be used for auto-dialing phone numbers or adding appointments to a calendar - but I guess we’ll have to wait for a Google Calendar product for that.
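As a rough illustration of this kind of on-the-fly detection, here is a sketch that picks phone-number-like strings out of message text. It is not how Gmail does it; the pattern only covers two common US-style formats, and the name `find_phone_numbers` is my own.

```python
import re

# Assumed formats (illustrative only): "(415) 555-0123" and "212-555-0199",
# with "-", "." or a space as the separator. Real detection would need far
# more formats and some surrounding context to avoid false positives.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-. ])\d{3}[-. ]\d{4}(?!\d)"
)

def find_phone_numbers(message):
    """Return every substring of the message that looks like a phone number."""
    return PHONE_RE.findall(message)

print(find_phone_numbers("Call me at (415) 555-0123 or 212-555-0199 tomorrow."))
```

Each match could then be wrapped in a link that dials the number, in the same way an address gets a map link.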

“Gmail makes it easy for you to keep track of your packages, and map out directions to your destinations; when you open a message that lists an address or package tracking number, Gmail shows you handy links to maps and directions, or your package’s delivery status.”

Adwords, Adsense now Adballoons - Google is stealth testing Yellow Pages killer, ad network for maps

Wednesday, January 11th, 2006

Although unannounced publicly, Google appears to be testing its Yellow Pages killer, maps based advertising.

If you do a search for ‘hotels in New York’ on Google Local, you get something that you don’t get for a search for ‘hotels in San Francisco’ - ads. They appear right there as little blue map balloons, rather than the red, algorithmic, local search results.

Not only are the ads local, but they are contextual i.e. hotel searches bring up sponsored results for local hotels.

In some ways this is a relatively obvious move; however, it’s big news considering that:

1. The Yellow Pages advertising market is bigger than the entire existing online search advertising market.

2. Offline Yellow Pages directories will clearly be replaced, over time, by online products, and it looks like maps are how this plays out.

3. Ad products are where Google makes the money that justifies its gargantuan market cap, so a new ad product is a big deal. Now, alongside Adwords and Adsense, it has a third revenue source that is in a bigger marketplace.

With ads - Google Local - hotels loc: New York, NY

Without ads - Google Local - hotels loc: San Francisco, CA

why is weblog search so hard?

Monday, January 9th, 2006

Buried within the comments of Jeremy Zawodny’s post about Feedster is this comment:

“I don’t recall Feedster ever being all that useful. But I also don’t find Technorati particularly useful. Why can’t someone just create a simple search engine for feeds/blogs?”

The truth is that it is very difficult to build a search engine with real-time updates, since search engines are optimized for retrieval and usually use batch indexing. In addition, the majority of weblogs are spam, further compounding the problem.
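One common way to square batch indexing with real-time updates is to keep a small, immediately updated index alongside the big batch-built one and query both. The sketch below is a toy version of that idea under my own assumptions: the class name `TwoTierIndex` and its methods are hypothetical, tokenizing is just whitespace splitting, and nothing here describes how any particular search engine works.

```python
from collections import defaultdict

class TwoTierIndex:
    """Toy inverted index: a batch-built main tier plus a real-time delta tier."""

    def __init__(self):
        self.docs = {}                  # doc id -> text, everything seen so far
        self.main = {}                  # term -> doc ids, rebuilt in batch
        self.delta = defaultdict(set)   # term -> doc ids added since last batch

    def add(self, doc_id, text):
        """Make a new document searchable immediately via the delta tier."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.delta[term].add(doc_id)

    def rebuild(self):
        """Batch step: rebuild the main tier from all documents, empty the delta."""
        main = defaultdict(set)
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                main[term].add(doc_id)
        self.main = dict(main)
        self.delta.clear()

    def search(self, term):
        """Query both tiers and merge, so fresh documents show up at once."""
        term = term.lower()
        return self.main.get(term, set()) | self.delta.get(term, set())

index = TwoTierIndex()
index.add(1, "blog search is hard")
index.rebuild()
index.add(2, "fresh search results")
print(sorted(index.search("search")))
```

The delta tier stays cheap to update only while it is small, so the hard part in practice is how often the batch step can run. Spam filtering would also have to happen before `add`, which this sketch ignores.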

Blog search, which may once have seemed niche, will eventually be a standard part of search engines. At the moment nobody, including Google, has a weblog search product that works. If they did, it would be very useful.

The real reason this is important has nothing to do with weblogs, long term. There are only two things that matter in search - freshness and relevancy.

At the moment search engines like Google do not have a button that says order results by date - they will, eventually, and from that comes blog search or from blog search comes that.

Feedster Will Die in 2006 (by Jeremy Zawodny)

What the Moreover, Weblogs.com, Verisign deal means.

Tuesday, October 18th, 2005

This is my personal opinion and does not reflect any company policy.

Most web content is published and then indexed when a search engine finds it, which can take up to 30 days. In the past, submitting your site to a search engine was the done thing - now it’s coming back, only better.

Search engines have completely different indexes for news and weblog search, because those indexes need to be updated more quickly. To do this they cannot search the entire web every few minutes; they need to be alerted - or pinged. Currently, ‘pings’ to sites like weblogs.com or ping-o-matic or blo.gs say that SOMETHING has been updated on a weblog or news site. Specs such as RSSPing change this to a ping that says WHAT has been updated. If all pages being published on the web did this (and there is no technical reason why they couldn’t), then search engines would not need to crawl websites and search engines would be updated instantly.
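To make the SOMETHING versus WHAT distinction concrete, here is a sketch that builds both kinds of ping as XML-RPC payloads without sending them. As I understand it, the classic call is `weblogUpdates.ping` with a site name and URL, and `weblogUpdates.extendedPing` adds the URL of the changed page and of the feed. This is not the RSSPing format mentioned above, which I am not reproducing. The function name `build_ping` and the example.com URLs are made up for illustration.

```python
import xmlrpc.client

def build_ping(site_name, site_url, changed_url=None, feed_url=None):
    """Build an XML-RPC ping payload as a string (nothing is sent).

    With only a name and URL this is the 'something changed' ping.
    With changed_url and feed_url it becomes the extended form, which
    tells the receiver which page changed and where the feed is.
    """
    if changed_url and feed_url:
        method = "weblogUpdates.extendedPing"
        params = (site_name, site_url, changed_url, feed_url)
    else:
        method = "weblogUpdates.ping"
        params = (site_name, site_url)
    return xmlrpc.client.dumps(params, methodname=method)

# Hypothetical blog used purely as an example.
print(build_ping(
    "Example Blog",
    "http://example.com/",
    changed_url="http://example.com/2005/10/new-post.html",
    feed_url="http://example.com/index.xml",
))
```

A search engine receiving the second form can fetch one page or one feed rather than recrawling the whole site, which is the whole point of saying WHAT has changed.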

Search engines are measured on how much, how relevant and how fresh. Pings are the answer to the fresh bit.

Mike Graves points out that Verisign plan to build value add services on top of pings, but acknowledges that pings themselves should be free:

“Ping services are not a profitable business, in and of themselves. Pings are free by tradition and by necessity. Attempts to introduce cost or latency into the ping layer would be self-defeating; the network simply routes around such problems. A free, open, scalable service fabric for pings is a powerful base for us to build value-added services on, however.”

This is good news because a single vendor ‘owning’ pings would really mess things up for publishers and Internet users in the long term.

In order to maintain innovation and development in weblog style publishing, RSS syndication and possibly even, in the long run, search, publishers such as SixApart should now bake the default ping to a (soon to be) non-profit service such as Ping-O-Matic (unless Feedmesh gets its act together) who would then pass it on to weblogs.com, blo.gs etc.

Alternatively, Verisign could keep Weblogs.com in a non-profit entity and develop premium services within Verisign itself. This was pretty much how Dave Winer had things, separating church and state between his own publishing engine and Weblogs.com, so people trusted him to keep it neutral. The benefit of this option is that there needs to be money from somewhere to make pings reliable and filter out the spam. The amount of money or infrastructure needed is not that great. I would argue that despamming, if it is by authentication, isn’t part of the value add but that custom subscriptions to alerts on topics are, but that’s debatable. In addition, despamming pings doesn’t need heavyweight authentication like certificates, because the publisher to ping server ratio is not many to many. The problem with a Verisign controlled root ping server (even a non-profit one) is that there are other large companies with ping server aspirations, such as Yahoo, who own blo.gs. There may need to be a truly neutral ping service for there to be a central one.

If this does not happen, ‘pinging’ will disappear as it either Balkanizes, with companies who have both publishing and search products, such as Google or Yahoo, refusing to ping Verisign or each other, or stagnates, with a single vendor having a lock on the whole thing, stopping competition and therefore evolution.

The Internet works because nobody owns the roads. Keeping the infrastructure free and making money at the edges is what preserves the marketplace.

Verisign outline what they are up to with weblogs.com

Thursday, October 6th, 2005

Welcome to the Infrablog: Weblogs 2.0

Is Yahoo more Web 2.0 than Google?

Wednesday, October 5th, 2005

Whatever Web 2.0 really is, and in some ways it’s an empty ‘container meme’ for a meme that will morph into whatever is most convenient and successful, Yahoo are looking pretty well equipped to give Google a run for their money in the more media centric worlds of social applications and publishing. When did you last use Orkut? When did you last use Flickr?

With a media savvy exec team and some small but smart acquisitions (Oddpost, Flickr and now Upcoming), Yahoo have the people, the components and the technical approach to create a synergy of social applications with next generation UI.

It used to be that using online apps was a trade-off of functionality and performance vs not having to worry about maintenance, upgrades, backups or the ability to move from one machine to another. With Gmail or Oddpost, there is no trade-off: my desktop email client crashed when I had a GB of email and took minutes to search for anything. Web email is now better all round than desktop versions.

As Anil Dash pointed out, the genealogy of AJAX traces back to the Oddpost founders’ previous project, the Blox online spreadsheet application.

I wouldn’t be surprised if, five years from now, the advantages of AJAX style UI, trivially simple RSS driven web services and Weblog driven approaches to online publishing have nibbled away at Microsoft Office and we end up coming full circle with a Blox like replacement for Excel.

Google Maps and Gmail show that their in-house development is strong, but the innovation happens at the edges and with billions in the bank and a competitor that has done some smart acquisitions lately, Google could perhaps do with spending some of that cash on acquiring.

The spread between Google’s and Yahoo’s share price/earnings multiples may narrow.

Yahoo Inc. Acquires Upcoming.org

Google and NASA to join forces in breast implants.

Thursday, September 29th, 2005


Google is to build a gigantic campus in The Silicone Valley

‘Silicone’ Valley - that would be the San Fernando Valley where all the porn stars hang out, I guess.