Prepared Seminars: Document Clustering

Clustering partitions a set of data so that objects in the same cluster are more similar to one another than they are to objects in other clusters. In the field of information retrieval (IR), document clustering is used to automatically organize large collections of retrieval results, grouping together documents that belong to the same topic in order to facilitate users' browsing of those results [1]. It is also used in word sense disambiguation, as well as many other applications.
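To make the idea of grouping similar documents concrete, the following is a minimal sketch, not the method evaluated in this report. It assumes a bag-of-words representation, cosine similarity, and a greedy single-pass grouping rule; the function names (`tf_vector`, `cosine`, `cluster`), the threshold of 0.3, and the sample documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering (illustrative assumption).

    Each document joins the first cluster whose seed document is at
    least `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = []  # the first index in each cluster is its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical results for the ambiguous query "jaguar".
docs = [
    "jaguar car dealer price",
    "jaguar cat habitat jungle",
    "used jaguar car price list",
    "jungle cat jaguar prey",
]

if __name__ == "__main__":
    for members in cluster(docs):
        print([docs[i] for i in members])
```

On these sample documents the two car-related results end up in one group and the two animal-related results in another, which also hints at why clustering is useful for word sense disambiguation. A practical system would likely add stop-word removal, term weighting, and a less order-dependent clustering algorithm.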

Conventional document retrieval systems return long ranked lists of documents that users are forced to sift through to find the relevant ones. The majority of today's Web search engines (e.g., Excite, AltaVista) follow this paradigm. Web search engines are also characterized by extremely low precision.

The low precision of Web search engines, coupled with the ranked-list presentation, makes it hard for users to find the information they are looking for. Instead of attempting to increase precision (e.g., by filtering methods) [2], we attempt to make search engine results easy to browse. This report considers whether document clustering is a feasible method of presenting the results of Web search engines.
