Information_Retrieval_System

Problem Statement

This project builds a user-friendly search application around the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. TF-IDF scores are used to measure how relevant each document is to a user-entered search query.

Pipeline Architecture

Tools

AWS Services: Leveraged Amazon S3 for data storage, AWS EMR for distributed data processing, and DynamoDB for efficient data retrieval.
Apache Spark: Utilized Spark RDDs to perform TF-IDF calculations, resulting in a significant reduction in processing time.
TF-IDF (Term Frequency-Inverse Document Frequency): Calculated and analyzed TF-IDF scores to quantify the relevance of documents to search queries.
Amazon S3: Used for data ingestion and for intermediate storage of the TF-IDF results and the document titles.
DynamoDB Tables: Created two DynamoDB tables, 'tfidf' and 'doctitle', to store the processed data, with careful consideration of partition and sort keys.
AWS Lambda: Packaged the search logic as an AWS Lambda function, so that it runs on demand and scales with request volume.
HTML and CSS: Developed an HTML-based user interface, 'search.html,' which communicates with the Lambda function to display search results.

Data Collection and Preprocessing:

Curate a diverse collection of text files from different authors along with a file containing their titles.
Perform initial data cleaning and transformation to ensure uniformity and consistency.
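
The cleaning step could look roughly like the sketch below. The project's actual rules are not documented here, so the `tokenize` helper and its choices (lowercasing, keeping only alphanumeric runs) are assumptions made for illustration.

```python
import re

# Hypothetical cleaning rule: keep runs of lowercase letters and digits.
_TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return _TOKEN.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("Call me Ishmael. Some years ago..."))
```

Applying the same function to every file, and later to the search query, keeps the terms stored in the index and the terms looked up at query time consistent.
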

TF-IDF Calculation with Spark:

TF-IDF calculation steps:

Formula:
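
A common definition of the score is given below. It is assumed here for illustration, since the exact variant the project uses (logarithm base, smoothing) is not stated. Here f(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents that contain t.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```
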
Run Apache Spark on AWS EMR (Elastic MapReduce) to calculate the TF-IDF scores in a distributed fashion.
Use Spark RDDs to process the text data and produce a TF-IDF value for each term-document pair.
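
The calculation can be sketched in plain Python as a sequence of map and reduce steps. This is not the project's Spark job: it assumes tf is a term's count divided by the document length and idf is ln(N / df), and the function name `tf_idf` is hypothetical. On EMR, each step would correspond to RDD transformations such as `flatMap`, `map`, and `reduceByKey`.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Return {(term, doc_id): score} for already-tokenized documents."""
    n_docs = len(docs)

    # Step 1: count each term per document (flatMap + reduceByKey in Spark).
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Step 2: document frequency, i.e. how many documents contain each term.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)

    # Step 3: combine term frequency and inverse document frequency.
    scores = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])  # assumed variant: natural log
            scores[(term, doc_id)] = tf * idf
    return scores


if __name__ == "__main__":
    sample = {"d1": ["whale", "sea", "whale"], "d2": ["sea", "ship"]}
    for key, value in sorted(tf_idf(sample).items()):
        print(key, round(value, 4))
```

With this variant, a term that appears in every document gets a score of zero, so it does not influence the ranking.
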

DynamoDB Table Design:

Create two DynamoDB tables, 'tfidf' and 'doctitle', with careful consideration of partition keys and sort keys.
Efficiently store the processed TF-IDF data and document titles for seamless data retrieval.
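
One possible key design is sketched below. The attribute names `term` and `docid`, the on-demand billing mode, and the `table_spec` helper are all assumptions, not the project's actual schema. The helper only builds the arguments for DynamoDB's CreateTable operation and does not call AWS.

```python
def table_spec(name, partition_key, sort_key=None):
    """Build keyword arguments for a DynamoDB CreateTable request."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]
    if sort_key:
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})
    return {
        "TableName": name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",  # assumed; provisioned capacity also works
    }


# Assumed layout: one item per (term, document) pair, keyed by term.
TFIDF_TABLE = table_spec("tfidf", partition_key="term", sort_key="docid")
# Assumed layout: one item per document, keyed by document id.
DOCTITLE_TABLE = table_spec("doctitle", partition_key="docid")
```

Each dictionary could be passed to `boto3.client("dynamodb").create_table(**TFIDF_TABLE)`. Partitioning 'tfidf' by term means a single Query per search term returns every document that contains it.
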

Relevance Ranking and Search:

Develop Python code that uses the TF-IDF values to assess the relevance of documents to a given search query.
Define the scoring mechanism: the relevance of a document is the sum of its TF-IDF values for the terms in the query, normalized by the number of query terms. If Q is the set of query terms, the relevance of document d is defined as follows:

relevance(d, Q) = (1 / |Q|) * sum over t in Q of tfidf(t, d)
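
The scoring rule described above, the sum of a document's TF-IDF values over the query terms divided by the number of query terms, can be sketched as follows. The `rank` function and the in-memory `tfidf` dictionary are hypothetical stand-ins for the project's code and its DynamoDB lookups, and the sample scores are made up for illustration.

```python
from collections import defaultdict


def rank(query_terms, tfidf):
    """Rank documents by their mean TF-IDF over the distinct query terms."""
    terms = set(query_terms)
    if not terms:
        return []
    totals = defaultdict(float)
    for (term, doc_id), score in tfidf.items():
        if term in terms:
            totals[doc_id] += score
    # Normalize by the number of query terms, then sort best-first.
    ranked = [(doc_id, total / len(terms)) for doc_id, total in totals.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative scores only, not project data.
    tfidf = {("whale", "d1"): 0.4, ("sea", "d1"): 0.1, ("sea", "d2"): 0.3}
    print(rank(["whale", "sea"], tfidf))
```

A document that contains none of the query terms never appears in the result, and a term missing from a document simply contributes zero to that document's sum.
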

Lambda Function Integration:

Design and configure an AWS Lambda function that encapsulates the relevance ranking logic.
Set up the Lambda function to accept user input (search queries) and return the search results.
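
A handler along these lines could serve the search requests. This is a sketch under several assumptions: the query arrives as a `q` query-string parameter (the event shape used by API Gateway proxy integrations and Lambda function URLs), and the in-memory `TFIDF` and `TITLES` dictionaries, filled with made-up sample data, stand in for the 'tfidf' and 'doctitle' tables.

```python
import json

# Illustrative stand-ins for the DynamoDB tables; not project data.
TFIDF = {"whale": {"d1": 0.4}, "sea": {"d1": 0.1, "d2": 0.3}}
TITLES = {"d1": "Moby-Dick", "d2": "Treasure Island"}


def lambda_handler(event, context):
    """Score documents for the query in the request and return JSON."""
    params = event.get("queryStringParameters") or {}
    terms = set(params.get("q", "").lower().split())

    # Mean TF-IDF over the query terms, accumulated per document.
    scores = {}
    for term in terms:
        for doc_id, value in TFIDF.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + value / len(terms)

    ordered = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    results = [
        {"title": TITLES.get(doc_id, doc_id), "score": score}
        for doc_id, score in ordered
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"results": results}),
    }
```

Calling `lambda_handler({"queryStringParameters": {"q": "whale sea"}}, None)` locally is a quick way to check the response shape before deploying.
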

User-Friendly Interface:

Create an interactive HTML-based user interface ('search.html') that allows users to enter search queries.
Connect the HTML interface to the Lambda function for querying and result retrieval.

Output screenshot

Search Page

Search for relevant documents
