Effective document clustering system for search engines
People use web search engines to fill a wide variety of navigational, informational and transactional needs. However, current major search engines on the web retrieve a large number of documents of which only a small fraction are relevant to the user query. The user then has to manually search for relevant documents by traversing a topic hierarchy, into which a collection is categorised. As more information becomes available, it becomes a time consuming task to search for required relevant information.
This research develops an effective tool, the web document clustering (WDC) system, to cluster, and then rank, the output data obtained from queries submitted to a search engine, into three pre-defined fuzzy clusters. Namely closely related, related and not related. Documents in closely related and related documents are ranked based on their context.
The WDC output has been compared against document clustering results from the Google, Vivisimo and Dogpile systems as these where considered the best at the fourth Search Engine Awards [24]. Test data was from standard document sets, such as the TREC-8 [118] data files and the Iris database [38], or 3 from test text retrieval tasks, "Latex", "Genetic Algorithms" and "Evolutionary Algorithms". Our proposed system had as good as, or better results, than that obtained by these other systems. We have shown that the proposed system can effectively and efficiently locate closely related, related and not related, documents among the retrieved document set for queries submitted to a search engine.
We developed a methodology to supply the user with a list of keywords filtered from the initial search result set to further refine the search. Again we tested our clustering results against the Google, Vivisimo and Dogpile systems. In all cases we have found that our WDC performs as well as, or better than these systems.
The contributions of this research are:
- A post-retrieval fuzzy document clustering algorithm that groups documents into closely related, related and not related clusters. This algorithm uses modified fuzzy c-means (FCM) algorithm to cluter documents into predefined intelligent fuzzy clusters and this approach has not been used before.
- The fuzzy WDC system satisfies the user's information need as far as possible by allowing the user to reformulate the initial query. The system prepares an initial word list by selecting a few characteristics terms of high frequency from the first twenty documents in the initial search engine output. The user is then able to use these terms to input a secondary query. The WDC system then creates a second word list, or the context of the user query (COQ), from the closely related documents to provide training data to refine the search. Documents containing words with high frequency from the training list, based on a pre-defined threshold value, are then presented to the user to refine the search by reformulating the query. In this way the context of the user query is built, enabling the user to learn from the keyword list. This approach is not available in current search engine technology.
- A number of modifications were made to the FCM algorithm to improve its performance in web document clustering. A factor swkq is introduced into the membership function as a measure of the amount of overlaping between the components of the feature vector and the cluster prototype. As the FCM algorithm is greatly affected by the values used to initialise the components of cluster prototypes a machine learning approach, using an Evolutionary Algorithm, was used to resolve the initialisation problem.
- Experimental results indicate that the WDC system outperformed Google, Dogpile and the Vivisimo search engines. The post-retrieval fuzzy web document clustering algorithm designed in this research improves the precision of web searches and it also contributes to the knowledge of document retrieval using fuzzy logic.
- A relational data model was used to automatically store data output from the search engine off-line. This takes the processing of data of the Internet off-line, saving resources and making better use of the local CPU.
- This algorithm uses Latent Semantic Indexing (LSI) to rank documents in the closely related and related clusters. Using LSI to rank document is wellknown, however, we are the first to apply it in the context of ranking closely related documents by using COQ to form the term x document matrix in LSI, to obtain better ranking results.
- Adjustments based on document size are proposed for dealing with problems associated with varying document size in the retrieved documents and the effect this has on cluster analysis.
History
Start Page
1End Page
362Number of Pages
362Publisher
Central Queensland UniversityPlace of Publication
Rockhampton, QueenslandOpen Access
- Yes
Era Eligible
- No
Supervisor
Professor R J Stonier ; Dr Ron BalsysThesis Type
- Doctoral Thesis
Thesis Format
- By publication