Product Recommendation System using Collaborative Filtering

The code in scraping.py scrapes product information and reviews from Flipkart.com using the Scrapy framework and Selenium with PhantomJS. Here's a breakdown of its main components:

  1. PhantomSource: This is a class that uses Selenium with PhantomJS to scrape web pages. The getSource method takes a URL as input and returns the HTML source code of the page.
  2. toDB: This class inherits from Django's BaseCommand class and inserts product information and reviews into a database. The handle method takes product data as input and stores it; the insertSpecs, insertImages, and insertReviews methods insert the specifications, images, and reviews respectively (see the sketch after this list).
  3. ScrapeFL: This class also inherits from Django's BaseCommand class and scrapes product information and reviews from Flipkart.com. The getPID method extracts the product ID from a URL. The scrapeProduct method scrapes a product page and extracts the product name, specifications, images, and reviews. The extractProductNames method extracts product names from a search results page. The scrapeIndividual method scrapes an individual product page for its name, specifications, and images. The getReviews method scrapes the reviews pages, and the getReviewPerPage method extracts the review heading, review text, and ratings from each page of reviews. The class's handle method reads the product name from the command-line arguments and calls scrapeProduct to scrape the product information and reviews.
  4. The use of Selenium with PhantomJS allows pages that require JavaScript rendering to be scraped.
  5. The use of Scrapy allows for efficient and structured scraping of web pages, and the use of Django's BaseCommand class allows for easy integration with a Django project.

Overall, scraping.py automates the process of scraping product information and reviews from Flipkart.com and storing them in a database.
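As a rough illustration, the toDB command might look something like the sketch below. The class and method names follow the description above, but the model fields (pid, name, specs, url, heading, text, rating), the relative import path, and the way the scraped dictionary reaches handle are assumptions, not the repository's exact code.

```python
from django.core.management.base import BaseCommand

from .models import Product, Review, Images  # exact relative path depends on where the command lives


class toDB(BaseCommand):
    """Insert a scraped product, its specs, images, and reviews into the database."""

    def insertSpecs(self, product, specs):
        # Assumes the specifications are stored on a field of the Product record.
        product.specs = specs
        product.save()

    def insertImages(self, product, image_urls):
        for url in image_urls:
            Images.objects.create(product=product, url=url)

    def insertReviews(self, product, reviews):
        for r in reviews:
            Review.objects.create(
                product=product,
                heading=r.get("heading", ""),
                text=r.get("text", ""),
                rating=r.get("rating", 0),
            )

    def handle(self, *args, **options):
        # ScrapeFL hands this command the scraped product data as a dictionary.
        data = options["data"]
        product, _ = Product.objects.get_or_create(
            pid=data["pid"], defaults={"name": data["name"]}
        )
        self.insertSpecs(product, data.get("specs", {}))
        self.insertImages(product, data.get("images", []))
        self.insertReviews(product, data.get("reviews", []))
```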

About PhantomSource, PhantomJS, and the Selenium WebDriver

The PhantomSource class has a single method, getSource, which takes a URL as input and returns the HTML source code of the page. It works by creating a new instance of the PhantomJS browser, navigating to the specified URL, and retrieving the page's HTML source; the browser is then closed to free up resources. The ScrapeFL class uses PhantomSource to scrape product information and reviews from Flipkart.com. Because PhantomJS executes JavaScript, ScrapeFL can scrape pages that require JavaScript to be rendered, which is necessary to extract all the relevant information from Flipkart product pages. Overall, PhantomSource provides a convenient way to scrape web pages with the PhantomJS headless browser within the Neorvex Django project.
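A minimal sketch of what getSource does, assuming the Selenium 3.x API (webdriver.PhantomJS was removed in Selenium 4, so the exact construction in the repository may differ):

```python
from selenium import webdriver


class PhantomSource:
    """Fetch the JavaScript-rendered HTML of a page with the PhantomJS headless browser."""

    def getSource(self, url):
        driver = webdriver.PhantomJS()   # start a headless PhantomJS instance
        try:
            driver.get(url)              # navigate and let JavaScript render the page
            return driver.page_source    # rendered HTML source of the page
        finally:
            driver.quit()                # close the browser to free resources
```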

ScrapeFL class function descriptions

The ScrapeFL class in the scraping.py file is a Django management command used to scrape product information and reviews from Flipkart.com. Here's a breakdown of what each function in the class does:

getPID: This function takes a string as input and extracts the product ID from it. It is used to extract the product ID from the URL of a product page.
scrapeProduct: This function takes a product name as input and searches for the product on Flipkart.com. It then displays the search results and prompts the user to select a product. Once a product is selected, it scrapes the product page and extracts information such as the product name, specifications, images, and reviews. It then saves this information to the database using the toDB class.
extractProductNames: This function takes a search URL as input and extracts the names, links, and product IDs of the products in the search results. It is used by the scrapeProduct function to display the search results to the user.
scrapeIndividual: This function takes a product object as input and scrapes the product page to extract more detailed information such as the product specifications and images. It returns a dictionary containing this information.
getReviews: This function takes a product ID and a review URL as input and determines the number of review pages to scrape.
getReviewPerPage: This function takes a review URL and a product ID as input and extracts the reviews for the product from each review page. It returns a list of dictionaries containing the review information.
add_arguments: This function adds command-line arguments to the ScrapeFL command; here it adds a required argument for the product name.
handle: This function is called when the ScrapeFL command is run. It extracts the product name from the command-line arguments and calls the scrapeProduct function to scrape the product information.

Overall, the ScrapeFL class scrapes product information and reviews from Flipkart.com and saves them to the database. The PhantomSource class is used to scrape web pages using the PhantomJS headless browser, and the toDB class is used to save the scraped data to the database.
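A skeleton of the ScrapeFL command along the lines of those descriptions; the search URL, the query-string format of the product ID, and the stubbed-out method bodies are illustrative assumptions rather than the repository's exact code:

```python
from urllib.parse import urlparse, parse_qs

from django.core.management.base import BaseCommand


class ScrapeFL(BaseCommand):
    help = "Scrape product information and reviews for a product from Flipkart.com"

    def add_arguments(self, parser):
        # Required positional argument: the product name to search for.
        parser.add_argument("product_name", type=str)

    def getPID(self, url):
        # Assumes the product ID appears as a ?pid=... query parameter in the URL.
        return parse_qs(urlparse(url).query).get("pid", [""])[0]

    def extractProductNames(self, search_url):
        # Parse the search results page and return (name, link, pid) entries.
        ...

    def scrapeIndividual(self, product):
        # Scrape one product page for its name, specifications, and images.
        ...

    def getReviews(self, pid, review_url):
        # Work out how many review pages there are, then collect reviews per page.
        ...

    def getReviewPerPage(self, review_url, pid):
        # Extract heading, text, and rating for every review on one page.
        ...

    def scrapeProduct(self, product_name):
        search_url = "https://www.flipkart.com/search?q=" + product_name
        results = self.extractProductNames(search_url)
        # ...display the results, let the user pick one, scrape it, fetch its
        # reviews, and hand everything to the toDB command.

    def handle(self, *args, **options):
        self.scrapeProduct(options["product_name"])
```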

Modules used in scraping.py

The scraping.py script uses several modules to perform web scraping and data manipulation:

django.core.management.base: This module provides a base class for creating custom management commands in Django. The BaseCommand class is used to define the ScrapeFL and toDB commands.
selenium: This module is used for web scraping and automation. The webdriver.PhantomJS class is used to create a headless browser instance for scraping web pages.
.models: This module contains the data models for the Django application. The Product, Review, and Images classes are used to interact with the database.
requests: This module is used for making HTTP requests to web pages. The get function is used to retrieve the content of web pages.
bs4: This module is used for parsing HTML and XML documents. The BeautifulSoup class is used to create a parser object that can extract data from web pages.
pprint: This module is used for pretty-printing Python data structures. The pprint function is used to print the contents of dictionaries in a readable format.
.sentiment: This module contains the Analyse class, which is used to perform sentiment analysis on text data.
scrapy: This module is used for web scraping and crawling. It provides a framework for creating and running web scraping spiders.

These modules are used together to scrape product information and reviews from Flipkart.com, perform sentiment analysis on the reviews, and store the data in a database.
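Put together, the top of scraping.py would look roughly like this (a sketch; the exact relative import paths and any aliases used in the repository may differ):

```python
from django.core.management.base import BaseCommand   # base class for management commands
from selenium import webdriver                        # webdriver.PhantomJS for headless rendering
import requests                                       # plain HTTP requests for static pages
from bs4 import BeautifulSoup                         # HTML parsing
from pprint import pprint                             # readable printing of scraped dictionaries
import scrapy                                         # scraping/crawling framework

from .models import Product, Review, Images           # application data models
from .sentiment import Analyse                        # sentiment analysis of reviews
```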

Sentiment Analysis of the reviews

The sentiment.py script is a Django management command that performs sentiment analysis on product reviews and predicts a rating based on the sentiment. Here's a breakdown of the code:

The script imports several modules, including BaseCommand from Django's management command module, TextBlob for sentiment analysis, log and exp from Python's math module, and Product and Review models from the local models.py module.
The Analyse class is defined as a subclass of BaseCommand. It defines a handle method that is called when the command is executed.
The handle method takes any number of arguments and keyword arguments, but in this case, it expects a single argument PID (product ID) as a string.
The getRating method retrieves the rating of the product with the given PID from the Product model. If the product is not found, it returns 0.
The _sigmoidal method calculates the sigmoid function of a given value.
The _sentimentFactor method calculates a sentiment factor based on the number of upvotes and downvotes of a review.
The averageSentiment method calculates the average sentiment of all reviews for the product with the given PID. It does this by iterating over all reviews for the product, calculating the sentiment of each review using TextBlob, and then calculating the weighted average sentiment using the sentiment factor.
The predictRating method predicts the product's rating by combining the Flipkart rating with an NLP rating derived from the average sentiment: if the Flipkart rating is 0 it returns the NLP rating, if the NLP rating is 0 it returns the Flipkart rating, and otherwise it returns the average of the two.

Overall, the sentiment.py script uses sentiment analysis to predict a rating for a product based on the sentiment of its reviews. It calculates the sentiment of each review using TextBlob and then calculates a weighted average sentiment based on the number of upvotes and downvotes of each review. The predicted rating is then calculated as the average of the Flipkart rating and the NLP rating.
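A condensed sketch of the Analyse command along those lines; the vote-weighting formula, the polarity-to-rating scaling, and the model field names (rating, text, upvotes, downvotes, product, pid) are assumptions, not the repository's exact code:

```python
from math import exp

from django.core.management.base import BaseCommand
from textblob import TextBlob

from .models import Product, Review


class Analyse(BaseCommand):
    def getRating(self, pid):
        try:
            return Product.objects.get(pid=pid).rating   # Flipkart rating
        except Product.DoesNotExist:
            return 0

    def _sigmoidal(self, x):
        # Standard logistic function, squashing x into (0, 1).
        return 1.0 / (1.0 + exp(-x))

    def _sentimentFactor(self, upvotes, downvotes):
        # Weight a review by how the community voted on it (assumed form).
        return self._sigmoidal(upvotes - downvotes)

    def averageSentiment(self, pid):
        reviews = Review.objects.filter(product__pid=pid)
        weighted, total_weight = 0.0, 0.0
        for r in reviews:
            polarity = TextBlob(r.text).sentiment.polarity        # in [-1, 1]
            weight = self._sentimentFactor(r.upvotes, r.downvotes)
            weighted += polarity * weight
            total_weight += weight
        return weighted / total_weight if total_weight else 0.0

    def predictRating(self, pid):
        flipkart = self.getRating(pid)
        # Map the average polarity in [-1, 1] to a 1-5 star "NLP rating" (assumed scaling).
        nlp = 3 + 2 * self.averageSentiment(pid)
        if flipkart == 0:
            return nlp
        if nlp == 0:
            return flipkart
        return (flipkart + nlp) / 2

    def handle(self, *args, **options):
        pid = args[0] if args else options.get("pid")
        self.stdout.write(str(self.predictRating(pid)))
```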

Modules used in sentiment.py

django.core.management.base: This module provides a base class for creating custom management commands in Django. The BaseCommand class is used to define the Analyse command.
textblob: This is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. In this script, it is used to calculate the sentiment of the reviews.
math: This is a built-in Python module that provides mathematical functions. The log and exp functions are used in the _sigmoidal method to calculate the sigmoid function.
.models: This is a module that contains the data models for the Django application. The Product and Review models are used to retrieve the product and review data from the database.

In summary, the sentiment.py script uses the django.core.management.base module to define a custom management command, the textblob library to perform sentiment analysis on the reviews, the math module to perform mathematical calculations, and the .models module to retrieve data from the database.
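For reference, the part of TextBlob's sentiment API that this script relies on is as simple as the following standalone example:

```python
from textblob import TextBlob

review = "The battery life is great, but the camera is disappointing."
blob = TextBlob(review)
print(blob.sentiment.polarity)      # roughly in [-1.0, 1.0]; positive values mean positive sentiment
print(blob.sentiment.subjectivity)  # in [0.0, 1.0]; higher means more opinionated
```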