Deima Elnatour: The Future of Indexing

Tuesday, August 08, 2006

The Future of Indexing

Document indexing is key to good search. The better the index is, the more relevant the search results are likely to be. However, indices are traditionally viewed as a static vocabulary structure that represents a document collection. There used to be human involvement in indexing, but that approach had many limitations and did not work at scale. With massive amounts of data finding their way to the web, the need for automatic indexing became pressing. So nowadays, most indices are generated automatically by machines. However, the IR community still views the indices of a collection as a static set. I believe that indices are properties that evolve and grow over time due to the social construct of collection usage. If you are not sure what I mean, then do this: go to Google, type "miserable failure", and see what comes up. Then ask yourself how that actually happened.

I believe that indices must be obtained in real time and that they need to be built with a dynamic notion that allows them to grow and change over time. This is indeed the reason why my research is focused on discovering human indices, which you know as tags; the resulting vocabulary is also called a folksonomy (a taxonomy by folks, or people). I believe that this new construct is the foundation for quality search in the future.
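To make the idea of an index that grows with usage concrete, here is a minimal sketch in Python. The class name `TagIndex`, its methods, and the document ids are all hypothetical, invented for illustration; this is not the system used in my research, only one simple way such a structure could look.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical inverted index from user tags to documents.

    The index is never "finished": every tagging event updates it,
    so the vocabulary evolves with how people use the collection.
    """

    def __init__(self):
        # tag -> {doc_id -> number of times that tag was applied}
        self._postings = defaultdict(lambda: defaultdict(int))

    def add_tag(self, doc_id, tag):
        """Record that someone applied `tag` to `doc_id`."""
        self._postings[tag.strip().lower()][doc_id] += 1

    def search(self, tag):
        """Return doc ids for `tag`, most frequently tagged first."""
        docs = self._postings.get(tag.strip().lower(), {})
        # Ties are broken by doc id so the ordering is deterministic.
        return sorted(docs, key=lambda d: (-docs[d], d))


index = TagIndex()
index.add_tag("doc-1", "python")
index.add_tag("doc-2", "python")
index.add_tag("doc-2", "Python")  # same tag after normalization
print(index.search("python"))
```

A document becomes findable under a term as soon as people start describing it with that term, whether or not the term appears in its text.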

4 Comments:

Deima, you are quite right in pointing out that indices need to be obtained in real time so as to capture the current social context of a given phrase. However, don't we already capture that by continuously updating indices and using purely statistical techniques to decide the current social meaning of a phrase? Maybe I am getting you wrong here, but how would human-generated indices help over automatically generated ones, except for the obvious benefits in semantic accuracy?

By Anonymous, at 10:17 AM
You are right. Semantic accuracy is one main benefit, and that is what I am generally after when I talk about capturing social context. In other words, what words would people use to describe a doc or search for it? I have done a quick study on del.icio.us data and found that people usually use "how to" when they are looking for tutorials. Traditionally, since the doc does not contain the phrase "how to", it would not be returned despite its relevance. Having uncovered the semantic pairing between "how to" and "tutorial", I can now expand the query to enhance retrieval.

By Deima Elnatour, at 1:31 AM
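The query expansion described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `cooccurrence` and `expand_query`, the `min_count` threshold, and the sample bookmarks are all made up here, and the data is not the del.icio.us data from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(bookmarks):
    """Count how often two tags are applied to the same bookmark."""
    pairs = Counter()
    for tags in bookmarks:
        # Sort so each unordered pair is always stored as (a, b) with a < b.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


def expand_query(term, pairs, min_count=2):
    """Return `term` plus every tag paired with it at least `min_count` times."""
    related = set()
    for (a, b), n in pairs.items():
        if n >= min_count:
            if a == term:
                related.add(b)
            elif b == term:
                related.add(a)
    return [term] + sorted(related)


# Invented sample: each inner list is the tag set of one bookmark.
bookmarks = [
    ["how to", "tutorial", "python"],
    ["how to", "tutorial"],
    ["tutorial", "css"],
    ["how to", "recipe"],
]
pairs = cooccurrence(bookmarks)
print(expand_query("how to", pairs))
```

With this toy data, only "how to" and "tutorial" co-occur often enough to pass the threshold, so a search for either term would be expanded to include the other.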
I guess we have different opinions on achieving the same ends. I have always talked about the importance of automatic semantic analysis in solving such problems and boosting the semantic accuracy of text IR solutions. We currently do have systems which can easily map "how to" => "tutorial" and lots of such semantically close pairs. I know current statistical systems are no match for human judgement, but trust me, we are getting there. Pretty fast.
A major problem with human-based approaches is that 90% of the time humans talk about only 10% of the content, which leaves the remaining 90% of the content slightly less reliable and open to spam. Moreover, it is really tough for companies to roll out services for new languages in the absence of, for example, ample data on del.icio.us in Korean.

A recent talk on Google Video, however, changed my views on human computation by a huge degree. I am now completely sold on the importance of this approach. I am not sure if you have seen this before: http://video.google.com/videoplay?docid=-8246463980976635143

So I guess what you are focusing on is a really powerful method which can be very useful if done right.

By Anonymous, at 2:21 PM
Abhinai,

Thanks much for the video link. I know about the ESP Game. Luis von Ahn came to our school about a year ago and presented his game.

I meant to ask: what do you do? Are you a PhD student or a practitioner? You can email me so I can have your email. How did you come across my blog?

By Deima Elnatour, at 6:46 PM
