SM: How successful is your architecture when it comes to indexing?
TC: We now index a lot more pages than other people. Microsoft and Yahoo! have fallen behind in terms of indexing and keeping up with Google. Google will get better with more competition. Google only increased their index size when someone else got bigger. The glory of the web is not the big sites, it is the long tail. If those long-tail pages are not indexed, then they are not accessible.
SM: How many subjects do you cover in your index? How do you manage your taxonomy?
TC: It used to be that there were a lot of ideas in the world, at least far more than there were web pages. However, the number of ideas has stayed pretty much constant while the number of web pages has kept growing. It turns out there are now far fewer ideas than there are web pages. We have about 16 million ideas that we index.
SM: That is an awful lot, although it is still finite.
TC: That will probably grow slowly. Things happen. Sarah Palin is not a new concept, as she was governor of Alaska and was definitely a known entity on the web two years ago. However, there are a lot of things like her which were known but have now exploded. ‘Saturday Night Live’ has new ideas there. Languages, however, change slowly. You have to be able to explain to somebody something which they have not heard before. You have to use words and ideas they already know.
SM: Let’s examine those 16 million ideas. How many ideas do you let a single machine handle?
TC: We have about 150 machines we are serving, and we then have a distribution of ideas to machines. When you get a query in, you identify it as, say, a query about chess and Capablanca. Those two ideas might be on two different machines, and if they are, they pair up as a team. One machine will send its information to the other, which will do the computation, come up with a result and send it back.
Because each of those two machines is only handling one 150th of the space of concepts, it can know a lot more about those concepts. In terms of chess, they can know that it is a game with pieces.
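As an illustration only, the routing described above can be sketched in a few lines of Python. The interview does not say how ideas are assigned to machines or which machine in a pair does the computation, so the stable hash in `machine_for` and the "lowest machine number coordinates" rule in `plan_query` are assumptions, and both function names are hypothetical.

```python
import hashlib

NUM_MACHINES = 150  # "about 150 machines", per the interview


def machine_for(concept: str) -> int:
    """Map a concept to a machine number.

    Assumption: a stable hash modulo the machine count. The interview
    only says there is "a distribution of ideas to machines".
    """
    digest = hashlib.sha256(concept.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_MACHINES


def plan_query(concepts: list[str]) -> dict:
    """Decide which machines team up to answer a query.

    Every machine that owns one of the query's concepts joins the team.
    One of them (here, arbitrarily, the lowest-numbered) receives the
    others' information, computes the result and sends it back.
    """
    owners = {c: machine_for(c) for c in concepts}
    team = sorted(set(owners.values()))
    return {"owners": owners, "coordinator": team[0], "helpers": team[1:]}


if __name__ == "__main__":
    # The interview's example: a query about chess and Capablanca.
    print(plan_query(["chess", "Capablanca"]))
```

If both concepts happen to land on the same machine, `helpers` is empty and that machine answers alone, which matches the "might be on two different machines" wording.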
SM: So you do have relevancy and correlation?
TC: We do know what things are related to other things and how they are related. We are not exposing a lot of that right now, but my background is knowledge representation and that is what I used to teach at Stanford. That is one of the things we are very excited about and working on. One of the problems is how to show people the kind of knowledge we are trying to do.
Because we know how things are related, our ranking is primarily based on the page talking about the right things and listing the right relationships. Google’s ranking is a really good idea, which is ranking by popularity. We have a different feel because we are more about content than popularity. That is one of the things that both Microsoft and Yahoo! did. They came along trying to do what Google had already done. I don’t think you are going to be successful trying to copy someone else. You have to come at it from a different angle.
If you go to Google and do not find what you want, why would you try out another search engine that does the same thing as Google? You are going to get the same results. You really need to go somewhere that has an alternate way of thinking about things. That is one of the reasons we have tried so hard to be different.
This segment is part 3 in the series: The Audacity of Tom Costello