Telegram Posts Scraper is a standalone application and script designed to scrape posts from a specified Telegram channel. The output is saved in compressed .parquet.gzip format.
- Standalone application with a graphical user interface (GUI)
- Script-based operation for command-line users
- Outputs data in .parquet.gzip format
- Optional content masking for post data
Important: most of the GUI code was generated using ChatGPT.
The standalone application provides an easy-to-use GUI for scraping posts from a Telegram channel. It was built on top of this script and converted into an executable file using auto-py-to-exe.
- Download: Telegram Posts Scraper.exe
- Run the Application: Double-click the downloaded executable file.
- Wait: It might take a few seconds to start up.
- Input Details: Enter the channel details and date range as prompted.
- Scrape: Click the "Start Scraping" button to begin.
The tg-scraper.py script is designed for users who prefer the command-line interface. It scrapes posts from a specified Telegram channel using the snscrape module and writes the output into a compressed parquet file.
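The scraping step can be sketched roughly as follows. This is a minimal illustration, not the actual contents of tg-scraper.py: the function names posts_in_range and scrape_channel are hypothetical, the post_id is assumed to be the last segment of the post URL, and the early exit assumes snscrape yields posts newest first. The small demo at the end uses fake posts so it runs without network access.

```python
from datetime import date, datetime, timezone
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def posts_in_range(posts: Iterable[Any], start: date, finish: date) -> Iterator[dict]:
    """Yield one row per post dated between start and finish (inclusive).

    Assumption: posts arrive newest first, so iteration stops at the
    first post that is older than the start date.
    """
    for post in posts:
        day = post.date.date()
        if day > finish:
            continue  # too recent, keep walking back in time
        if day < start:
            break  # older than the requested period, nothing more to collect
        yield {
            # Assumption: the post id is the last path segment of the URL.
            "post_id": post.url.rstrip("/").rsplit("/", 1)[-1],
            "post_url": post.url,
            "date": post.date,
            "content": post.content,
        }


def scrape_channel(channel: str, start: date, finish: date) -> list[dict]:
    """Scrape a channel with snscrape and keep posts inside the period."""
    # Imported lazily so the filtering helper can be tried without snscrape.
    from snscrape.modules.telegram import TelegramChannelScraper

    items = TelegramChannelScraper(channel).get_items()
    return list(posts_in_range(items, start, finish))


# Offline demo with fake posts (newest first), no network needed.
fake_posts = [
    SimpleNamespace(
        url=f"https://t.me/s/example_channel/{number}",
        date=datetime(2023, 1, day, 12, 0, tzinfo=timezone.utc),
        content=f"post {number}",
    )
    for number, day in [(3, 20), (2, 10), (1, 1)]
]
demo_rows = list(posts_in_range(fake_posts, date(2023, 1, 5), date(2023, 1, 15)))
```

Keeping the date filter separate from the snscrape call makes it possible to check the filtering logic without contacting Telegram.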
- Run the Script: python tg-scraper.py
- Provide Inputs: When prompted, enter:
  - The XXXX part of the Telegram channel URL https://web.telegram.org/k/#@XXXX
  - The start and end dates to define the scraping period
- Optional Content Masking: Choose whether to mask post contents in the output file with #####.
python tg-scraper.py
# Example prompts
Enter channel name (XXXX): example_channel
Enter start date (YYYY-MM-DD): 2023-01-01
Enter finish date (YYYY-MM-DD): 2023-12-31
Include post contents? (y/n): y
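One way the answers to these prompts might be validated is sketched below. The helper names (parse_channel, parse_day, parse_period, parse_yes_no) are hypothetical and not taken from tg-scraper.py; accepting a full URL in place of the bare XXXX part is an added convenience, also an assumption.

```python
from datetime import date, datetime


def parse_channel(value: str) -> str:
    """Return the XXXX part, given either XXXX or a URL ending in @XXXX."""
    return value.strip().rsplit("@", 1)[-1].strip("/")


def parse_day(value: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on any other format."""
    return datetime.strptime(value.strip(), "%Y-%m-%d").date()


def parse_period(start_text: str, finish_text: str) -> tuple[date, date]:
    """Parse both dates and reject a period that ends before it starts."""
    start, finish = parse_day(start_text), parse_day(finish_text)
    if start > finish:
        raise ValueError("start date must not be later than finish date")
    return start, finish


def parse_yes_no(value: str) -> bool:
    """Map a y/n answer to True/False; anything else is an error."""
    answer = value.strip().lower()
    if answer not in {"y", "n"}:
        raise ValueError("please answer y or n")
    return answer == "y"


# Values from the example prompts.
channel = parse_channel("example_channel")
start, finish = parse_period("2023-01-01", "2023-12-31")
include_contents = parse_yes_no("y")  # "n" would mean the contents get masked
```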
The output file name is generated automatically: tg-posts + [channel name] + [start date] + [end date] + .parquet.gzip.
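The naming scheme above could be assembled along these lines. The function name build_file_name is hypothetical, and the hyphen separator between the parts is an assumption, since the text does not say how the parts are joined.

```python
from datetime import date


def build_file_name(channel: str, start: date, finish: date, sep: str = "-") -> str:
    """Join the name parts; the separator is an assumed default."""
    parts = ["tg-posts", channel, start.isoformat(), finish.isoformat()]
    return sep.join(parts) + ".parquet.gzip"


example_name = build_file_name("example_channel", date(2023, 1, 1), date(2023, 12, 31))
```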
It is saved in the data folder (in the current directory) as a compressed parquet file with the following structure:
post_id | post_url | date | content |
---|---|---|---|
12345 | https://t.me/12345 | 2023-01-01 12:00:00 | Lorem Ipsum |
If content masking is chosen, the post contents are replaced with #####:
post_id | post_url | date | content |
---|---|---|---|
12345 | https://t.me/12345 | 2023-01-01 12:00:00 | ##### |
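Masking and writing the rows might look like the sketch below. It assumes each row is a plain dict with the four columns shown above and that pyarrow is used directly for the write; the helper names mask_contents and write_rows are hypothetical, and the real script may take a different route.

```python
from pathlib import Path

MASK = "#####"


def mask_contents(rows: list[dict]) -> list[dict]:
    """Return copies of the rows with the content column replaced by the mask."""
    return [{**row, "content": MASK} for row in rows]


def write_rows(rows: list[dict], file_name: str, folder: str = "data") -> Path:
    """Write rows to a gzip-compressed parquet file inside the given folder."""
    # Imported lazily so the masking helper can be tried without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)  # create the data folder if missing
    path = target / file_name
    pq.write_table(pa.Table.from_pylist(rows), str(path), compression="gzip")
    return path


# Row taken from the example table.
rows = [
    {
        "post_id": 12345,
        "post_url": "https://t.me/12345",
        "date": "2023-01-01 12:00:00",
        "content": "Lorem Ipsum",
    }
]
masked_rows = mask_contents(rows)
```

Because mask_contents returns copies, the original rows stay unchanged and can still be written unmasked if needed.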
- Python 3.10
- Required Python libraries: snscrape, pyarrow, PyQt5
Install dependencies using:
pip install -r requirements.txt
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit CC BY-NC 4.0.