madhuroopa/Information_Retrieval_System

Information_Retrieval_System

Problem Statement

This project builds a user-friendly search application that ranks documents against user-entered search queries using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. TF-IDF scores are the basis for measuring how relevant each document is to a query.

Pipeline Architecture

[image: pipeline architecture diagram]

Tools

  • AWS Services: Amazon S3 for data storage, AWS EMR for distributed data processing, and DynamoDB for data retrieval.
  • Apache Spark: Spark RDDs perform the TF-IDF calculations in a distributed fashion, which reduced processing time.
  • TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF scores quantify the relevance of documents to search queries.
  • S3: Data ingestion and intermediate storage of the TF-IDF results and the document titles.
  • DynamoDB Tables: Two DynamoDB tables, 'tfidf' and 'doctitle', store the processed data, with partition and sort keys chosen to suit the lookups the search performs.
  • AWS Lambda: The search logic is packaged as an AWS Lambda function, which handles execution and scaling.
  • HTML and CSS: An HTML-based user interface, 'search.html', communicates with the Lambda function to display search results.

Data Collection and Preprocessing:

  • Curate a diverse collection of text files from different authors along with a file containing their titles.
  • Perform initial data cleaning and transformation to ensure uniformity and consistency.

TF-IDF Calculation with Spark:

  • TF-IDF calculation steps: [image: TF-IDF calculation steps]
  • Formula: [image: TF-IDF formula]
  • Leverage the distributed computing capabilities of Apache Spark running on AWS EMR (Elastic MapReduce) to calculate TF-IDF scores.
  • Utilize Spark RDDs to process the text data in parallel and produce a TF-IDF value for each term-document pair.
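
The formula image is not reproduced here, so the following is only a minimal sketch under a common definition: term frequency is a term's count divided by the document's length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. It uses plain Python rather than Spark so that it runs anywhere; each step corresponds roughly to a map or reduceByKey over an RDD. The function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def compute_tfidf(docs):
    """Return {(term, doc_id): tfidf} for a {doc_id: text} mapping."""
    # Step 1: count each term per document (a map + reduceByKey in Spark).
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Step 2: document frequency, the number of documents containing each term.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    # Step 3: combine TF and IDF for every term-document pair.
    n_docs = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())
        for term, count in term_counts.items():
            tf = count / total                 # assumed TF definition
            idf = math.log(n_docs / df[term])  # assumed IDF definition
            scores[(term, doc_id)] = tf * idf
    return scores


# Hypothetical sample corpus.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats",
}
scores = compute_tfidf(docs)
```

Under these definitions, a term that appears in every document gets a score of zero, which is why common words contribute little to ranking.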

DynamoDB Table Design:

  • Create two DynamoDB tables, 'tfidf' and 'doctitle', with careful consideration of partition keys and sort keys.
  • Efficiently store the processed TF-IDF data and document titles for seamless data retrieval.
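
The README does not state the actual key schema, so the sketch below is only one plausible layout. It assumes the 'tfidf' table uses the term as partition key and the document id as sort key, so that all documents for a query term can be fetched with one query, and that 'doctitle' is keyed by document id. The attribute names `term`, `docid`, `tfidf`, and `title` are hypothetical. Scores are converted to `Decimal` because boto3 does not accept Python floats for DynamoDB number attributes.

```python
from decimal import Decimal


def to_tfidf_items(scores):
    """Turn {(term, doc_id): score} into item dicts for the 'tfidf' table."""
    return [
        # Assumed schema: 'term' = partition key, 'docid' = sort key.
        {"term": term, "docid": doc_id, "tfidf": Decimal(str(score))}
        for (term, doc_id), score in sorted(scores.items())
    ]


def to_title_items(titles):
    """Turn {doc_id: title} into item dicts for the 'doctitle' table."""
    # Assumed schema: 'docid' = partition key, no sort key.
    return [
        {"docid": doc_id, "title": title}
        for doc_id, title in sorted(titles.items())
    ]


# Hypothetical sample data.
tfidf_items = to_tfidf_items({("cat", "d1"): 0.25, ("cat", "d2"): 0.5})
title_items = to_title_items({"d1": "First Document", "d2": "Second Document"})
```

Each resulting dict has the shape that a boto3 `Table.put_item(Item=...)` call expects.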

Relevance Ranking and Search:

  • Develop Python code that implements the TF-IDF algorithm to assess the relevance of documents to a given search query.
  • Formulate a scoring mechanism: the relevance of a document is the sum of its TF-IDF values for each term in the query, normalized by the number of query terms. If Q is the set of query terms, the relevance of document d is defined as follows:

    $$\text{relevance}(d, Q) = \frac{1}{|Q|} \sum_{t \in Q} \text{tfidf}(t, d)$$
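
The scoring rule described above (sum the TF-IDF values of the query terms for each document, then divide by the number of query terms) can be sketched as follows. This is an illustration, not the project's actual code: the function name `rank_documents` and the in-memory dictionary standing in for the 'tfidf' table are hypothetical, and splitting the query on whitespace is an assumption.

```python
from collections import defaultdict


def rank_documents(query, tfidf):
    """Rank documents for a query given {(term, doc_id): score}."""
    terms = set(query.lower().split())  # Q, the set of query terms
    if not terms:
        return []

    # Sum the TF-IDF values of the query terms for each document.
    totals = defaultdict(float)
    for (term, doc_id), score in tfidf.items():
        if term in terms:
            totals[doc_id] += score

    # Normalize by the number of query terms and sort best-first.
    ranked = [(doc_id, total / len(terms)) for doc_id, total in totals.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores standing in for the 'tfidf' table.
tfidf = {("cat", "d1"): 0.25, ("mat", "d1"): 0.5, ("cat", "d2"): 0.5}
print(rank_documents("cat mat", tfidf))  # [('d1', 0.375), ('d2', 0.25)]
```

A document that lacks one of the query terms simply contributes nothing for that term, so documents matching more of the query tend to rank higher.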

Lambda Function Integration:

  • Design and configure an AWS Lambda function that encapsulates the relevance ranking logic.
  • Set up the Lambda function to accept user input (search queries) and return the search results.
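
A handler along these lines could wrap the ranking logic. The README does not say how the function is invoked, so this sketch assumes an API Gateway proxy-style event with the query in a hypothetical `q` query-string parameter, and it uses in-memory dictionaries in place of the two DynamoDB tables so that it runs without AWS access.

```python
import json

# Hypothetical in-memory stand-ins for the 'tfidf' and 'doctitle' tables.
TFIDF = {("cat", "d1"): 0.25, ("cat", "d2"): 0.5, ("dog", "d2"): 0.5}
TITLES = {"d1": "First Document", "d2": "Second Document"}


def search(query):
    """Return titles and relevance scores, best match first."""
    terms = set(query.lower().split())
    totals = {}
    for (term, doc_id), score in TFIDF.items():
        if term in terms:
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return [
        {"title": TITLES.get(doc_id, doc_id), "score": total / len(terms)}
        for doc_id, total in ranked
    ]


def lambda_handler(event, context):
    """Entry point: read the query, rank documents, return JSON."""
    # Assumed event shape: API Gateway proxy integration.
    params = event.get("queryStringParameters") or {}
    query = params.get("q", "").strip()
    if not query:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "missing query parameter 'q'"}),
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"results": search(query)}),
    }
```

In a deployed version, `search` would read from DynamoDB instead of the module-level dictionaries; the handler itself would not need to change.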

User-Friendly Interface:

  • Create an interactive HTML-based user interface ('search.html') that allows users to enter search queries.
  • Connect the HTML interface to the Lambda function for querying and result retrieval.

Output screenshots

Search Page

[image: search page screenshot]

Search for relevant documents

[image: search results screenshot]

About

Information_Retrieval_System - Data Engineer Project
