Minor code changes and build files added
ishan-surana committed Jul 2, 2024
1 parent 7b4298b commit 57f73f4
Showing 9 changed files with 162 additions and 60 deletions.
89 changes: 89 additions & 0 deletions MetaDataScraper.egg-info/PKG-INFO
@@ -0,0 +1,89 @@
Metadata-Version: 2.1
Name: MetaDataScraper
Version: 1.0.2
Summary: A module designed to automate the extraction of follower counts and post details from a public Facebook page.
Author-email: Ishan Surana <ishansurana1234@gmail.com>
Project-URL: Homepage, https://metadatascraper.readthedocs.io/en/latest/
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: selenium==4.1.0
Requires-Dist: webdriver-manager==4.0.1

[![Licence](https://badgen.net/github/license/ishan-surana/MetaDataScraper?color=DC143C)](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) [![Python](https://img.shields.io/badge/python-%3E=3.10-slateblue.svg)](https://www.python.org/downloads/release/python-3119/) [![Wheel](https://img.shields.io/badge/wheel-yes-FF00C9.svg)](https://files.pythonhosted.org/packages/02/80/c53d5e8439361c913e23b6345e85e748a7ac7e82e22cb9f7cd9ec77d5d52/MetaDataScraper-1.0.0-py3-none-any.whl) [![Latest](https://badgen.net/github/release/ishan-surana/MetaDataScraper?label=latest+release&color=green)](https://pypi.org/project/MetaDataScraper/1.0.0/) [![Releases](https://badgen.net/github/releases/ishan-surana/MetaDataScraper?color=orange)](https://github.com/ishan-surana/MetaDataScraper/releases) [![Stars](https://badgen.net/github/stars/ishan-surana/MetaDataScraper?color=yellow)](https://github.com/ishan-surana/MetaDataScraper/stargazers) [![Forks](https://badgen.net/github/forks/ishan-surana/MetaDataScraper?color=dark)](https://github.com/ishan-surana/MetaDataScraper/forks) [![Issues](https://badgen.net/github/issues/ishan-surana/MetaDataScraper?color=800000)](https://github.com/ishan-surana/MetaDataScraper/issues) [![PRs](https://badgen.net/github/prs/ishan-surana/MetaDataScraper?color=C71585)](https://github.com/ishan-surana/MetaDataScraper/pulls) [![Last commit](https://badgen.net/github/last-commit/ishan-surana/MetaDataScraper?color=blue)](https://github.com/ishan-surana/MetaDataScraper/commits/main/) ![Downloads](https://img.shields.io/github/downloads/ishan-surana/MetaDataScraper/total) [![Workflow](https://github.com/ishan-surana/MetaDataScraper/actions/workflows/python-publish.yml/badge.svg)](https://github.com/ishan-surana/MetaDataScraper/blob/main/.github/workflows/python-publish.yml) [![PyPI](https://d25lcipzij17d.cloudfront.net/badge.svg?id=py&r=r&ts=1683906897&type=6e&v=1.0.0&x2=0)](https://pypi.org/project/MetaDataScraper/) 
[![Maintained](https://img.shields.io/badge/maintained-yes-cyan)](https://github.com/ishan-surana/MetaDataScraper/pulse) [![OS](https://img.shields.io/badge/OS-Windows-FF0000)](https://www.microsoft.com/software-download/windows11) [![Documentation Status](https://readthedocs.org/projects/metadatascraper/badge/?version=latest)](https://metadatascraper.readthedocs.io/en/latest/?badge=latest)

# MetaDataScraper

MetaDataScraper is a Python package that automates the extraction of follower counts and post details (text, likes, shares, and video links) from a public Facebook page, returned as lists. It uses Selenium WebDriver for browser automation and scraping.
The module provides two classes: `LoginlessScraper` and `LoggedInScraper`. `LoginlessScraper` requires no authentication or API keys, but it cannot access some Facebook pages.
`LoggedInScraper` overcomes this limitation by logging in with the user's Facebook account credentials before scraping.

## Installation

You can install MetaDataScraper using pip:

```
pip install MetaDataScraper
```

Make sure you have Python 3.10 or later and pip installed.

## Usage

To use MetaDataScraper, follow these steps:

1. Import the `LoginlessScraper` or the `LoggedInScraper` class:

```python
from MetaDataScraper import LoginlessScraper, LoggedInScraper
```

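2. Initialize the scraper with the Facebook page ID. Use `LoginlessScraper` for anonymous access, or `LoggedInScraper` with your Facebook credentials:

```python
page_id = "your_target_page_id"

# Option A: no login required
scraper = LoginlessScraper(page_id)

# Option B: log in with a Facebook account (use instead of Option A)
email = "your_facebook_email"
password = "your_facebook_password"
scraper = LoggedInScraper(page_id, email, password)
```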

3. Scrape the Facebook page to retrieve information:

```python
result = scraper.scrape()
```

4. Access the scraped data from the result dictionary:

```python
print(f"Followers: {result['followers']}")
print(f"Post Texts: {result['post_texts']}")
print(f"Post Likes: {result['post_likes']}")
print(f"Post Shares: {result['post_shares']}")
print(f"Is Video: {result['is_video']}")
print(f"Video Links: {result['video_links']}")
```
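
The keys printed above appear to hold one entry per post, in the same order. As a minimal sketch under that assumption, the parallel lists can be combined into one record per post. The helper `to_records` and the sample `result` values below are hypothetical and are not part of the package:

```python
# Hypothetical helper: combine the parallel per-post lists into one dict per post.
# Assumes every list in `result` is ordered the same way, one entry per post.
def to_records(result):
    source_keys = ("post_texts", "post_likes", "post_shares", "is_video", "video_links")
    record_keys = ("text", "likes", "shares", "is_video", "video_link")
    rows = zip(*(result[key] for key in source_keys))  # stops at the shortest list
    return [dict(zip(record_keys, row)) for row in rows]


# Made-up data shaped like the result dictionary described above.
sample_result = {
    "followers": "1234",
    "post_texts": ["First post", ""],
    "post_likes": [10, 3],
    "post_shares": [2, 0],
    "is_video": [False, True],
    "video_links": ["", "https://www.facebook.com/reel/example"],
}

records = to_records(sample_result)
for record in records:
    print(record)
```

Because `zip` stops at the shortest list, a record is only produced when all five values are available for a post.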

## Features

- **Automated Extraction**: Automatically fetches follower counts, post texts, likes, shares, and video links from Facebook pages.
- **Comprehensive Data Retrieval**: Retrieves detailed information about each post, including text content, interaction metrics (likes, shares), and multimedia (e.g., video links).
- **Flexible Handling**: Adapts to diverse post structures and various types of multimedia content present on Facebook pages, like post texts or reels.
- **Enhanced Access with Logged-In Scraper**: Overcomes limitations faced by anonymous scraping (loginless) by utilizing Facebook account credentials for broader page access.
- **Headless Operation**: Executes scraping tasks in headless mode, ensuring seamless and non-intrusive data collection without displaying a browser interface.
- **Scalability**: Supports scaling to handle large volumes of data extraction efficiently, suitable for monitoring multiple Facebook pages simultaneously.
- **Dependency Management**: Utilizes Selenium WebDriver for robust web automation and scraping capabilities, compatible with Python 3.10+ environments.
- **Ease of Use**: Simplifies the process with straightforward initialization and method calls, facilitating quick integration into existing workflows.
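
The scalability point in the list above could be exercised along the following lines. This is only a sketch: `scrape_many` is a hypothetical helper, and running several scrapers in threads assumes each scraper instance manages its own browser session, which the README does not confirm. The scraping function is passed in, so the helper itself does not depend on Selenium:

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical helper: run one scrape per page ID in a small thread pool.
# `scrape_one` is any callable that takes a page ID and returns its result.
def scrape_many(page_ids, scrape_one, max_workers=2):
    page_ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input IDs.
        results = list(pool.map(scrape_one, page_ids))
    return dict(zip(page_ids, results))


# With the package installed, the callable could wrap the documented API:
#     scrape_one = lambda page_id: LoginlessScraper(page_id).scrape()
# A stand-in is used here so the sketch runs without a browser.
def fake_scrape(page_id):
    return {"followers": len(page_id)}


all_results = scrape_many(["page_a", "page_bb"], fake_scrape)
print(all_results)
```

Keeping `max_workers` small is a deliberate choice, since each real scrape would drive a separate browser.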

## Dependencies

- selenium
- webdriver_manager

## License

This project is licensed under the Apache License, Version 2.0; see the [LICENCE](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) file for details.
10 changes: 10 additions & 0 deletions MetaDataScraper.egg-info/SOURCES.txt
@@ -0,0 +1,10 @@
LICENCE
README.md
pyproject.toml
MetaDataScraper/FacebookScraper.py
MetaDataScraper/__init__.py
MetaDataScraper.egg-info/PKG-INFO
MetaDataScraper.egg-info/SOURCES.txt
MetaDataScraper.egg-info/dependency_links.txt
MetaDataScraper.egg-info/requires.txt
MetaDataScraper.egg-info/top_level.txt
1 change: 1 addition & 0 deletions MetaDataScraper.egg-info/dependency_links.txt
@@ -0,0 +1 @@

2 changes: 2 additions & 0 deletions MetaDataScraper.egg-info/requires.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
selenium==4.1.0
webdriver-manager==4.0.1
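
Both requirements are pinned to exact versions. A small sketch of how such pins could be checked against an environment follows; `parse_pins` and `find_mismatches` are hypothetical helpers, and the version lookup is injected so the example runs without the packages installed (`importlib.metadata.version` from the standard library could be passed in instead):

```python
# Hypothetical helpers for checking exact "name==version" pins.
def parse_pins(requires_text):
    """Return {package_name: pinned_version} for every 'name==version' line."""
    pins = {}
    for line in requires_text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version):
    """Return {name: (pinned, installed)} where the installed version differs."""
    mismatches = {}
    for name, pinned in pins.items():
        found = installed_version(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


pins = parse_pins("selenium==4.1.0\nwebdriver-manager==4.0.1\n")

# Stand-in for a real lookup such as importlib.metadata.version.
fake_installed = {"selenium": "4.1.0", "webdriver-manager": "4.0.2"}
print(find_mismatches(pins, fake_installed.get))
```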
1 change: 1 addition & 0 deletions MetaDataScraper.egg-info/top_level.txt
@@ -0,0 +1 @@
MetaDataScraper
117 changes: 58 additions & 59 deletions MetaDataScraper/FacebookScraper.py
@@ -1,13 +1,12 @@
import time
import logging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import logging
from webdriver_manager.chrome import ChromeDriverManager
logging.getLogger().setLevel(logging.CRITICAL)

class LoginlessScraper:
@@ -471,7 +470,7 @@ def __scroll_to_top(self):

def __get_xpath_constructor(self):
"""Constructs the XPath for locating posts on the Facebook page."""
xpath_return_script = r"""
_xpath_return_script = r"""
var iterator = document.evaluate('.//*[@aria-label="Like"]', document);
var firstelement = iterator.iterateNext()
var firstpost = firstelement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement
@@ -509,79 +508,79 @@ def __get_xpath_constructor(self):
}
return xpath_first
"""
xpath_constructor = self.driver.execute_script(xpath_return_script)
split_xpath = xpath_constructor.split('[')
split_index = split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div')
self.xpath_first = '['.join(split_xpath[:split_index])+'['
self.xpath_last = '['+'['.join(split_xpath[split_index+1:])
self.xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div'
if len(self.driver.find_element(By.XPATH, xpath_constructor).find_elements(By.TAG_NAME, 'video')):
self.xpath_last = '/'.join(self.xpath_last.split('/')[:3])
_xpath_constructor = self.driver.execute_script(_xpath_return_script)
_split_xpath = _xpath_constructor.split('[')
_split_index = _split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div')
self._xpath_first = '['.join(_split_xpath[:_split_index])+'['
self._xpath_last = '['+'['.join(_split_xpath[_split_index+1:])
self._xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div'
if len(self.driver.find_element(By.XPATH, _xpath_constructor).find_elements(By.TAG_NAME, 'video')):
self._xpath_last = '/'.join(self._xpath_last.split('/')[:3])
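
The string handling in `__get_xpath_constructor` can be illustrated in isolation. The sketch below reproduces only the split-and-rejoin step on a made-up XPath; the sample path and the helper names `split_post_xpath` and `xpath_for_post` are assumptions for illustration, while the suffix string is copied from the diff:

```python
# Suffix that follows the post index in the template XPath (copied from the diff).
IDENTIFIER_ADDUM = "]/div/div/div/div/div/div/div/div/div/div/div"


def split_post_xpath(template):
    """Split the first post's XPath around its index, as the method above does."""
    pieces = template.split("[")
    # The piece that starts with the first post's index "1" marks the split point.
    index = pieces.index("1" + IDENTIFIER_ADDUM)
    first = "[".join(pieces[:index]) + "["
    last = "[" + "[".join(pieces[index + 1:])
    return first, last


def xpath_for_post(first, last, n):
    """Rebuild the XPath for the n-th post by substituting its index."""
    return first + str(n) + IDENTIFIER_ADDUM + last


# Hypothetical XPath; a real one comes from the injected JavaScript.
sample = "/html/body/div[2]/main/div[1" + IDENTIFIER_ADDUM + "[3]/div/span"
first, last = split_post_xpath(sample)
print(xpath_for_post(first, last, 5))
```

Rebuilding with `n = 1` returns the original template, which is a quick sanity check on the split.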

def __extract_post_details(self):
"""Extracts details of posts including text, likes, shares, and video links."""
c = 1
error_count = 0
_c = 1
_error_count = 0
while True:
xpath = self.xpath_first + str(c) + self.xpath_identifier_addum + self.xpath_last
if not self.driver.find_elements(By.XPATH, xpath):
error_count += 1
if error_count < 3:
print('Error extracting post', c, '\b. Count', error_count,'Retrying extraction...', end='\r')
_xpath = self._xpath_first + str(_c) + self._xpath_identifier_addum + self._xpath_last
if not self.driver.find_elements(By.XPATH, _xpath):
_error_count += 1
if _error_count < 3:
print('Error extracting post', _c, '\b. Count', _error_count,'Retrying extraction...', end='\r')
time.sleep(5)
self.driver.execute_script("window.scrollBy(0, +40);")
continue
break
error_count = 0
_error_count = 0
print(" "*100, end='\r')
print("Extracting data of post", c, end='\r')
self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, xpath)[0])
post_components = self.driver.find_element(By.XPATH, xpath).find_elements(By.XPATH, './*')
if len(post_components) > 2:
post_text = '\n'.join(post_components[2].text.split('\n'))
if post_components[3].text.split('\n')[0] == 'All reactions:':
post_likes = post_components[3].text.split('\n')[1]
if len(post_components[3].text.split('\n')) > 4:
post_shares = post_components[3].text.split('\n')[4].split(' ')[0]
elif len(post_components) > 4 and post_components[4].text.split('\n')[0] == 'All reactions:':
post_likes = post_components[4].text.split('\n')[1]
if len(post_components[4].text.split('\n')) > 4:
post_shares = post_components[4].text.split('\n')[4].split(' ')[0]
print("Extracting data of post", _c, end='\r')
self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, _xpath)[0])
_post_components = self.driver.find_element(By.XPATH, _xpath).find_elements(By.XPATH, './*')
if len(_post_components) > 2:
_post_text = '\n'.join(_post_components[2].text.split('\n'))
if _post_components[3].text.split('\n')[0] == 'All reactions:':
_post_like = _post_components[3].text.split('\n')[1]
if len(_post_components[3].text.split('\n')) > 4:
_post_share = _post_components[3].text.split('\n')[4].split(' ')[0]
elif len(_post_components) > 4 and _post_components[4].text.split('\n')[0] == 'All reactions:':
_post_like = _post_components[4].text.split('\n')[1]
if len(_post_components[4].text.split('\n')) > 4:
_post_share = _post_components[4].text.split('\n')[4].split(' ')[0]
else:
post_likes = 0
post_shares = 0
self.post_texts.append(post_text)
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
elif len(post_components) == 2:
_post_like = 0
_post_share = 0
self.post_texts.append(_post_text)
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
elif len(_post_components) == 2:
try:
post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
_post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
except:
print("Some error occurred while extracting post", c, ". Skipping post...", end='\r')
c += 1
print("Some error occurred while extracting post", _c, ". Skipping post...", end='\r')
_c += 1
continue
post_likes = post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text
post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
_post_like = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text
_post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append('')
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
elif len(post_components) == 1:
post_text = post_components[0].text.split('\n')[0]
post_likes = post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text
post_shares = post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append(post_text)
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
if len(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0:
if 'reel' in self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'):
self.video_links.append('https://www.facebook.com' + self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'))
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
elif len(_post_components) == 1:
_post_text = _post_components[0].text.split('\n')[0]
_post_like = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text
_post_share = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append(_post_text)
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
if len(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0:
if 'reel' in self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'):
self.video_links.append('https://www.facebook.com' + self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'))
else:
self.video_links.append(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href'))
self.video_links.append(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href'))
self.is_video.append(True)
else:
self.is_video.append(False)
self.video_links.append('')
c += 1
_c += 1

self.post_likes = [int(i) if str(i).isdigit() else 0 for i in self.post_likes]
self.post_shares = [int(i) if str(i).isdigit() else 0 for i in self.post_shares]
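
The final normalization keeps a value only when it consists entirely of digits, so a count rendered as `1.2K` or `1,234` would be stored as 0. The following sketch shows one possible alternative; `parse_count` is a hypothetical helper, and the assumption that counts may carry `K` or `M` suffixes or thousands separators is not confirmed by this commit:

```python
# Hypothetical alternative to the digit-only check above.
# Assumes counts may look like "12", "1,234", "1.2K", or "3M".
def parse_count(raw):
    text = str(raw).strip().replace(",", "")
    if not text:
        return 0
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    try:
        if suffix in multipliers:
            return int(round(float(text[:-1]) * multipliers[suffix]))
        return int(text)
    except ValueError:
        # Anything that is not a recognizable number falls back to 0,
        # matching the behavior of the original list comprehension.
        return 0


raw_likes = ["12", "1.2K", "", 0, "Like", "1,234"]
print([parse_count(value) for value in raw_likes])
```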
Binary file added dist/MetaDataScraper-1.0.2-py3-none-any.whl
Binary file not shown.
Binary file added dist/metadatascraper-1.0.2.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "MetaDataScraper"
version = "1.0.1"
version = "1.0.2"
authors = [
{ name="Ishan Surana", email="ishansurana1234@gmail.com" },
]