You will be provided with a zip file that contains 91 HTML documents collected from Wikipedia.
- Pre-process the documents by removing all HTML tags and converting everything to lowercase.
- Implement a stop list and a stemmer to pre-process the documents (for the stop list and stemmer, you are allowed to use third-party open source code).
- Build an inverted index (including dictionary and posting lists) for the documents. Please make sure to keep all the frequency information.
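
The pre-processing and indexing steps above can be sketched as follows. This is a minimal illustration, not a required design: `STOP_WORDS` is a deliberately tiny hypothetical stop list, `toy_stem` is a placeholder suffix stripper standing in for a real third-party stemmer (such as a Porter stemmer), and the index layout (term mapped to a `{doc_id: term frequency}` posting dictionary) is one assumption among several workable ones.

```python
import re
from collections import Counter, defaultdict
from html import unescape

# Hypothetical, deliberately tiny stop list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def strip_html(html: str) -> str:
    """Drop script/style bodies and all tags, decode entities, lowercase."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(text).lower()


def toy_stem(token: str) -> str:
    """Placeholder suffix stripper. NOT a real stemmer; swap in a library."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, remove stop words, and stem. Also usable for queries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Build the dictionary and posting lists: term -> {doc_id: tf}."""
    index = defaultdict(dict)
    for doc_id, html in docs.items():
        for term, tf in Counter(preprocess(strip_html(html))).items():
            index[term][doc_id] = tf
    return dict(index)
```

The document frequency of a term is simply the length of its posting dictionary, so all the frequency information needed later is kept in one structure.
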
The goal is to provide a TF-IDF-based ranking for the documents.
- Since you have already collected frequency information in step 1, please further compute IDF for each term.
- For each document, calculate the length of the corresponding document vector.
- For each incoming query, pre-process the query with the stop list and stemmer. Identify candidate documents that contain at least one query term.
- Meanwhile, compute the length of the query vector.
- Finally, compute the TF-IDF similarity score between the query and each candidate document (there is no need to construct the complete document vector, or loop through all dimensions in the vector space).
- Sort the documents by the score.
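
The ranking steps above can be sketched as follows. The weighting scheme here (raw term frequency times `log10(N / df)`) and the function names are assumptions chosen for illustration; other TF-IDF variants are equally valid. The index is assumed to map each term to a `{doc_id: term frequency}` dictionary. Note that scoring walks only the posting lists of the query terms, so no full document vector is ever built.

```python
import math
from collections import Counter, defaultdict


def compute_idf(index, n_docs):
    """IDF per term, assuming idf = log10(N / df)."""
    return {term: math.log10(n_docs / len(postings))
            for term, postings in index.items()}


def doc_lengths(index, idf):
    """Euclidean length of each document's TF-IDF vector."""
    squares = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            squares[doc_id] += (tf * idf[term]) ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in squares.items()}


def rank(query_terms, index, idf, lengths):
    """Cosine similarity between the query and each candidate document.

    query_terms must already be stop-listed and stemmed.
    """
    q_weights = {t: tf * idf[t]
                 for t, tf in Counter(query_terms).items() if t in index}
    q_len = math.sqrt(sum(w * w for w in q_weights.values()))

    # Accumulate dot products over the query terms' posting lists only.
    dots = defaultdict(float)
    for term, wq in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += wq * tf * idf[term]

    scores = {d: dot / (q_len * lengths[d])
              for d, dot in dots.items() if q_len > 0 and lengths[d] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The document lengths depend only on the collection, so they can be computed once at indexing time and reused for every query.
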
- Identify a domain of interest (e.g., Wikipedia or the NFL). Ideally, the domain should be of manageable size, with a link structure that is not too complicated to follow. The crawler should contain at least three components:
- (1) a multi-threaded spider that fetches and parses webpages,
- (2) the URL frontier, which stores to-be-crawled URLs, and
- (3) the URL repository, which stores crawled URLs. Please be polite to the site. Please collect a few hundred to a few thousand pages.
- Feed the collected documents to the search engine that you implemented in step 2.
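
The three crawler components can be sketched as follows. This is only one possible arrangement under stated assumptions: the `fetch` callable is hypothetical and supplied by the caller (for example, a thin wrapper around `urllib.request.urlopen`), the frontier is a thread-safe `queue.Queue`, and the repository is a lock-protected set that records every URL already discovered so that nothing is queued twice. Politeness is reduced here to a fixed per-thread delay; a real crawler should also honor `robots.txt`, for which the standard library offers `urllib.robotparser`.

```python
import threading
import time
from html.parser import HTMLParser
from queue import Queue
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_pages=100, n_threads=4, delay=1.0):
    """Crawl pages on the seed's host. fetch(url) -> HTML string (assumed)."""
    frontier = Queue()        # URL frontier: to-be-crawled URLs
    seen = {seed}             # URL repository: URLs already discovered
    pages = {}                # url -> raw HTML, to feed the indexer
    lock = threading.Lock()
    domain = urlparse(seed).netloc
    frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            try:
                html = fetch(url)
                time.sleep(delay)  # crude politeness: pause after each fetch
                parser = LinkParser()
                parser.feed(html)
                with lock:
                    pages[url] = html
                    for href in parser.links:
                        link = urljoin(url, href).split("#")[0]
                        if (urlparse(link).netloc == domain
                                and link not in seen
                                and len(seen) < max_pages):
                            seen.add(link)
                            frontier.put(link)
            except Exception:
                pass  # a sketch: failed pages are silently skipped
            finally:
                frontier.task_done()

    for _ in range(n_threads):
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()  # returns once every queued URL has been processed
    return pages
```

Because the delay applies per thread, the effective request rate is roughly `n_threads / delay` requests per second; lowering `n_threads` or raising `delay` makes the crawl gentler on the site.
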
- Implement a Web-based interface that takes user queries and returns answers (document names, a snapshot with the search term(s) highlighted, and the URL) to the user. You only need to provide a reasonable (not so fancy) interface; you can use WYSIWYG editors to generate the HTML. Keep this version of your search engine, since it will be compared with two future versions.
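
Producing the highlighted snapshot can be sketched as follows. The function name, the `<mark>` tag, and the prefix-matching rule are all assumptions made for illustration: since query terms are stemmed but the displayed text is not, the sketch highlights any word that begins with a query stem. The text is HTML-escaped piece by piece so that page content cannot inject markup into the results page.

```python
import html
import re


def make_snippet(text, stems, width=60):
    """Return an HTML-safe excerpt of text with matching words in <mark>."""
    if not stems:
        return html.escape(text[:width])
    # Match any word that starts with one of the (already stemmed) terms.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in stems) + r")\w*",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    start = max(0, first.start() - width // 2) if first else 0
    window = text[start:start + width]

    parts, last = [], 0
    for match in pattern.finditer(window):
        parts.append(html.escape(window[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group()) + "</mark>")
        last = match.end()
    parts.append(html.escape(window[last:]))
    return "".join(parts)
```

For example, `make_snippet("Cats & dogs are running fast", ["cat", "run"])` wraps "Cats" and "running" in `<mark>` tags and escapes the ampersand.
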
- Define your own score that reflects the proximity of search terms in each document.
- Define your own algorithm to integrate the term proximity score with the TF-IDF score from step 2.
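
Since both the proximity score and its integration are left open, the following is only one possible sketch, not a prescribed answer. It assumes a positional index that can supply, per document, the token positions of each query term. The proximity score is the number of query terms divided by the smallest token window containing all of them, which lies in (0, 1] when every term occurs and is 0 otherwise. The integration is a plain linear blend whose weight `alpha` is a hypothetical tuning parameter.

```python
from collections import Counter
from typing import Optional


def min_window(positions: dict) -> Optional[int]:
    """Smallest token span containing every term at least once.

    positions maps each query term to its token positions in one document.
    Returns None if any term is absent from the document.
    """
    if not positions or any(not plist for plist in positions.values()):
        return None
    events = sorted((pos, term)
                    for term, plist in positions.items() for pos in plist)
    need = len(positions)
    counts = Counter()
    covered, left, best = 0, 0, None
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == need:
            span = pos - events[left][0] + 1
            if best is None or span < best:
                best = span
            left_term = events[left][1]
            counts[left_term] -= 1
            if counts[left_term] == 0:
                covered -= 1
            left += 1
    return best


def proximity_score(positions: dict) -> float:
    """1.0 when the terms are adjacent; falls toward 0 as they spread out."""
    span = min_window(positions)
    return 0.0 if span is None else len(positions) / span


def combined_score(tfidf: float, proximity: float, alpha: float = 0.7) -> float:
    """Linear blend; alpha is an assumed weight to be tuned empirically."""
    return alpha * tfidf + (1 - alpha) * proximity
```

A linear blend keeps the two signals on comparable scales, since cosine similarity and this proximity score both fall between 0 and 1; setting `alpha` to 1 recovers the pure TF-IDF ranking.
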
- Search personalization: use cookies to track users. Record each search and each click-through. For a new query, add a small component of the "search history" as query expansion.
- Relevance feedback: for each query, allow the user to identify a set of "positive" and "negative" results. Use user feedback to update the query and return new (refined) results to the user.
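
One common way to update a query from positive and negative results is a Rocchio-style update, sketched below. The assignment does not prescribe this method, so treat it as an assumption. Vectors are represented as sparse `{term: weight}` dictionaries, and the default values of `alpha`, `beta`, and `gamma` are only starting points to tune.

```python
from collections import defaultdict


def rocchio(query_vec, positive, negative,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward positive documents and away from negative ones.

    query_vec: {term: weight}; positive/negative: lists of {term: weight}.
    """
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight

    # Add the centroid of positives, subtract the centroid of negatives.
    for docs, coeff in ((positive, beta), (negative, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)

    # Terms pushed to zero or below are dropped from the refined query.
    return {t: w for t, w in updated.items() if w > 0}
```

The refined dictionary can be scored against the index in the same way as the original query. Terms drawn from a user's recorded search history could be folded in similarly, with a small weight so that they nudge the query without dominating it.
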