- Asynchronous Processing: Utilizes `asyncio` and `aiohttp` to handle asynchronous HTTP requests to the YouTube Data API, enabling efficient retrieval of video metadata.
- Concurrent Worker Tasks: Implements concurrent worker tasks and queues for parallel processing of video metadata, improving performance and scalability.
- Modularity and Extensibility: Organized into classes and functions, making it easy to extend and maintain.
- External Libraries: Employs `aiohttp` for asynchronous HTTP requests and `fire` for the command-line interface.
- Database Integration: Interacts with `DatabaseOperations` to insert and update video metadata in the database.
- Logging and Error Handling: Uses `loguru` for comprehensive logging and includes robust error handling to ensure reliable operation during API calls and data processing.
The `video_metadata.py` module is designed to fetch video metadata from the YouTube Data API using video IDs. It processes inputs from command-line arguments or files, retrieves metadata for the associated video IDs, and stores it in a database. The module leverages asynchronous programming and queues to manage and process metadata concurrently, making it suitable for handling large datasets efficiently.
- Google API Key: This module requires a Google API key with access to the YouTube Data API v3. You can obtain an API key and ensure appropriate access by following Google's official instructions.
- `asyncio`: Asynchronous I/O framework used for managing asynchronous tasks and queues.
- `aiohttp`: Asynchronous HTTP client/server framework for Python, used to make HTTP requests to the YouTube Data API.
- `fire`: A library for automatically generating command-line interfaces.
- `loguru`: A modern logging library that simplifies logging in Python.
- `dotenv`: Loads environment variables from a `.env` file.
- `dataclasses`: A module that provides a decorator and functions for automatically adding special methods to classes.
- First Party Libraries:
  - `Utils`: Contains utility functions for data preparation and reading IDs from files.
  - `DatabaseOperations`: Manages database operations for inserting and updating video metadata.
  - `LoggerConfig`: Configures logging settings using `loguru`.
  - `EventFetcher`: Fetches event metadata related to video metadata processing.
The `QueueManager` class manages queues for video metadata and event metadata, along with tracking the number of active tasks for each operation.
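As a rough illustration of what such a manager might look like, the sketch below pairs two `asyncio.Queue` objects with per-operation counters. The class name `QueueManagerSketch`, its attribute names, and the operation keys are assumptions made for this example; the real `QueueManager` may be organized differently.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class QueueManagerSketch:
    """Hypothetical stand-in for QueueManager; all names here are assumptions."""

    metadata_queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    event_metadata_queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    # Number of in-flight tasks, keyed by an (assumed) operation name.
    active_tasks: Dict[str, int] = field(
        default_factory=lambda: {"retrieve": 0, "save": 0, "event": 0}
    )

    def task_started(self, operation: str) -> None:
        self.active_tasks[operation] += 1

    def task_finished(self, operation: str) -> None:
        self.active_tasks[operation] -= 1

    def is_idle(self) -> bool:
        """True when both queues are empty and no task is active."""
        return (
            self.metadata_queue.empty()
            and self.event_metadata_queue.empty()
            and all(count == 0 for count in self.active_tasks.values())
        )
```

Tracking active tasks alongside the queues lets a coordinator tell "queue momentarily empty" apart from "all work finished", which is one way to decide when to signal shutdown.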
The `VideoIdOperations` class handles fetching video metadata from the YouTube Data API.
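When many video IDs are requested, they are typically sent in batches. The `videos.list` endpoint takes a comma-separated `id` parameter and is generally documented as accepting up to 50 IDs per call. The sketch below shows one way to split IDs and build request parameters. The helper names `chunk_ids` and `build_params` and the chosen `part` values are hypothetical; the module's actual batching logic is not shown in this document.

```python
from typing import Dict, Iterator, List

# Assumed per-request limit for the `id` parameter of videos.list.
MAX_IDS_PER_REQUEST = 50
VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def chunk_ids(video_ids: List[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` video IDs."""
    for start in range(0, len(video_ids), size):
        yield video_ids[start:start + size]


def build_params(
    batch: List[str],
    api_key: str,
    parts: str = "snippet,contentDetails,statistics",  # assumed selection of parts
) -> Dict[str, str]:
    """Build the query parameters for one videos.list request."""
    return {"part": parts, "id": ",".join(batch), "key": api_key}
```

Each dictionary returned by `build_params` could then be passed as the `params` argument of an `aiohttp` GET request to `VIDEOS_ENDPOINT`.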
Fetches video metadata for the given list of video IDs using the YouTube Data API.
@staticmethod
async def fetch_video_metadata(video_ids: List[str], worker_id: str) -> List[Dict[str, Any]]:
    """
    Fetches video metadata from the YouTube Data API.
    Args:
        video_ids (List[str]): List of video IDs to fetch metadata for.
        worker_id (str): ID of the worker executing the function.
    Returns:
        List[Dict[str, Any]]: List of dictionaries containing the fetched metadata.
    """
Populates the event metadata queue with dates that require event metadata processing.
async def populate_event_metadata_queue(queue_manager: QueueManager) -> None:
    """
    Populates the event metadata queue with dates that require event metadata processing.
    Args:
        queue_manager (QueueManager): The queue manager instance.
    Returns:
        None
    """
The `Fetcher` class coordinates the fetching and processing of video metadata. It initializes the queues and starts the worker tasks.
Initializes the `Fetcher` with a `QueueManager` instance and a shutdown event.
@dataclass(slots=True)
class Fetcher:
    queue_manager: QueueManager
    shutdown_event: asyncio.Event

    def __init__(self):
        """
        Initializes the Fetcher with a QueueManager instance.
        Args:
            None
        Returns:
            None
        """
Method: fetch(self, video_ids: Optional[List[str]] = None, video_id_files: Optional[List[str]] = None, num_workers: int = 2) -> None
Fetches video metadata from the YouTube Data API and starts worker tasks.
async def fetch(self, video_ids: Optional[List[str]] = None, video_id_files: Optional[List[str]] = None, num_workers: int = 2) -> None:
    """
    Fetches video metadata from the YouTube Data API and starts worker tasks.
    Args:
        video_ids (Optional[List[str]]): A list of video IDs.
        video_id_files (Optional[List[str]]): A list of file paths containing video IDs.
        num_workers (int): The number of worker tasks to create.
    Returns:
        None
    """
These asynchronous functions are used as worker tasks to process video metadata from the queues concurrently.
Function: worker_retrieve_metadata(queue_manager: QueueManager, shutdown_event: asyncio.Event, worker_id: str) -> None
Retrieves video metadata from the YouTube Data API and puts it into the metadata queue.
async def worker_retrieve_metadata(queue_manager: QueueManager, shutdown_event: asyncio.Event, worker_id: str) -> None:
    """
    Worker function that retrieves video metadata from the YouTube Data API.
    Args:
        queue_manager (QueueManager): Manages the queues and active tasks.
        shutdown_event (asyncio.Event): Event to signal when to shut down the worker.
        worker_id (str): ID of the worker executing the function.
    Returns:
        None
    """
Function: worker_save_metadata(queue_manager: QueueManager, shutdown_event: asyncio.Event, worker_id: str) -> None
Saves video metadata from the metadata queue into the database.
async def worker_save_metadata(queue_manager: QueueManager, shutdown_event: asyncio.Event, worker_id: str) -> None:
    """
    Worker function that saves video metadata to the database.
    Args:
        queue_manager (QueueManager): Manages the queues and active tasks.
        shutdown_event (asyncio.Event): Event to signal when to shut down the worker.
        worker_id (str): ID of the worker executing the function.
    Returns:
        None
    """
Processes event metadata by calling an instance of the `EventFetcher` class.
async def worker_event_metadata(queue_manager: QueueManager, worker_id: str) -> None:
    """
    Worker function that processes event metadata.
    Args:
        queue_manager (QueueManager): Manages the queues and active tasks.
        worker_id (str): ID of the worker executing the function.
    Returns:
        None
    """
Provides a command-line interface for running the `Fetcher` class.
def cmd() -> None:
    """
    Command-line interface for running the Fetcher.
    Returns:
        None
    """
- Python Environment: Ensure you have Python 3.10 or higher installed; the `@dataclass(slots=True)` decorator used by `Fetcher` requires Python 3.10.
- Dependencies: Install the required packages by running:
  pip install aiohttp fire loguru python-dotenv
- YouTube Data API Key: Obtain an API key from Google Cloud Console and set it in your environment variables or a `.env` file as `YT_API_KEY`.
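To fail early when the key is missing, a startup check along these lines can help. This is a minimal sketch: the function name `get_api_key` is hypothetical, and it reads only from the process environment. If a `.env` file is used, `load_dotenv()` from `python-dotenv` would need to be called first so the variable is visible in `os.environ`.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return YT_API_KEY from the given mapping (defaults to os.environ).

    Hypothetical helper: raises RuntimeError if the key is missing or blank.
    """
    source = os.environ if env is None else env
    key = source.get("YT_API_KEY", "").strip()
    if not key:
        raise RuntimeError("YT_API_KEY is not set; export it or add it to a .env file")
    return key
```

Accepting an optional mapping keeps the check easy to exercise in tests without modifying the real environment.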
The `video_metadata.py` module provides a command-line interface through the `cmd()` function. You can use this interface to fetch video metadata from various sources and insert it into the database for further processing.
To use the module from the command line, navigate to the directory containing `video_metadata.py` and run it using Python. Below are examples of how to use it.
Suppose you have a list of video IDs that you want to process.
python video_metadata.py fetch --video_ids "dQw4w9WgXcQ,9bZkp7q19f0"
- Parameters:
  - `--video_ids`: A comma-separated list of YouTube video IDs.
Explanation:
- The `fetch` command invokes the `fetch` method of the `Fetcher` class.
- The `--video_ids` argument accepts a list of video IDs.
If you have a text file containing video IDs (one per line), you can specify the file path.
python video_metadata.py fetch --video_id_files "/path/to/video_ids.txt"
- Parameters:
  - `--video_id_files`: A comma-separated list of file paths containing video IDs.
Explanation:
- The `fetch` method reads video IDs from the specified file(s) and processes them.
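Reading one ID per line from several files might look like the sketch below. The function name `read_ids_from_files` is hypothetical, and skipping blank lines and duplicates is an assumption; the actual helper in `Utils` may behave differently.

```python
from pathlib import Path
from typing import Iterable, List


def read_ids_from_files(paths: Iterable[str]) -> List[str]:
    """Collect video IDs (one per line) from the given files.

    Blank lines are skipped and duplicates are dropped, keeping first-seen order.
    """
    seen = set()
    ids: List[str] = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            video_id = line.strip()
            if video_id and video_id not in seen:
                seen.add(video_id)
                ids.append(video_id)
    return ids
```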
If you have video IDs stored in the database without metadata, you can simply run the fetch command without any parameters.
python video_metadata.py fetch
Explanation:
- The module will fetch video IDs that are in the database but lack metadata and attempt to retrieve and store their metadata.
If you prefer to use the module within another Python script, you can invoke the `Fetcher` class directly.
import asyncio

from video_metadata import Fetcher


async def main():
    fetcher = Fetcher()
    await fetcher.fetch(
        video_ids=["dQw4w9WgXcQ"],
        video_id_files=["/path/to/video_ids.txt"],
        num_workers=5,
    )


if __name__ == "__main__":
    asyncio.run(main())
Explanation:
- This script creates an instance of `Fetcher` and calls the `fetch` method with the desired parameters.
- It uses `asyncio.run` to execute the asynchronous `fetch` method.
- Database Operations: The module assumes that `DatabaseOperations` is properly configured to handle database interactions, including methods like `insert_update_video_metadata`, `set_video_id_failed_metadata_true`, and `get_video_ids_without_metadata`.
- YouTube Data API Quota: The module handles the `quotaExceeded` error from the YouTube Data API and stops making requests when the quota is exceeded.
- Logging: The module uses `loguru` for logging. Logs will be output to the console and can be configured via the `LoggerConfig` class.
- Error Handling: The module includes robust error handling to manage network issues or API errors when invoking the YouTube Data API.
- Concurrency: Adjust the `--num_workers` parameter according to your system's capabilities and API quota limits to optimize performance.
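For the quota note above, detection could be sketched as follows. Google API error responses usually carry an `error` object whose `errors` list holds entries with a `reason` field; the exact payload shape used here is an assumption, and `is_quota_exceeded` is a hypothetical helper rather than the module's own function.

```python
from typing import Any, Dict


def is_quota_exceeded(payload: Dict[str, Any]) -> bool:
    """Return True if a decoded API error response reports quotaExceeded.

    Assumes the shape {"error": {"errors": [{"reason": ...}, ...]}}.
    """
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    errors = error.get("errors") or []
    return any(
        isinstance(entry, dict) and entry.get("reason") == "quotaExceeded"
        for entry in errors
    )
```

A worker that sees `True` here could set the shutdown event so that the remaining workers stop issuing requests.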
- Initialization:
  - The `Fetcher` class initializes a `QueueManager` instance to manage the queues for metadata and event metadata.
  - A shutdown event is also initialized to signal when workers should stop.
- Adding Video IDs:
  - Video IDs provided via command-line arguments or files are added to the database if not already present.
  - The module then retrieves video IDs from the database that lack metadata.
- Worker Tasks:
  - Retrieve Metadata Workers:
    - Fetch video IDs without metadata from the database.
    - Use the YouTube Data API to fetch metadata for these video IDs.
    - Put the retrieved metadata into the metadata queue.
  - Save Metadata Workers:
    - Retrieve metadata from the metadata queue.
    - Save the metadata to the database using `DatabaseOperations`.
  - Event Metadata Workers:
    - Fetch dates that require event metadata processing.
    - Use `EventFetcher` to process event metadata for these dates.
- Asynchronous Processing:
  - All workers run asynchronously, enabling concurrent processing and efficient utilization of resources.
  - The shutdown event ensures that workers stop gracefully when there is no more work to do.
- Database Insertion:
  - Video metadata collected from the YouTube Data API is inserted or updated in the database for downstream tasks.
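The retrieve/save flow above can be sketched as a small producer-consumer pipeline. This is a simplified illustration under stated assumptions, not the module's implementation: `fake_fetch` stands in for the API call, a plain dictionary stands in for the database, and all function names are hypothetical.

```python
import asyncio
from typing import Any, Dict, List


async def fake_fetch(batch: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the YouTube Data API call."""
    await asyncio.sleep(0)  # yield control, as a real HTTP request would
    return [{"id": video_id, "title": f"title-{video_id}"} for video_id in batch]


async def retrieve_worker(id_queue: asyncio.Queue, metadata_queue: asyncio.Queue) -> None:
    """Take ID batches until none are left and queue the fetched metadata."""
    while True:
        try:
            batch = id_queue.get_nowait()
        except asyncio.QueueEmpty:
            return  # no more batches to fetch
        for item in await fake_fetch(batch):
            await metadata_queue.put(item)


async def save_worker(
    metadata_queue: asyncio.Queue,
    store: Dict[str, Dict[str, Any]],
    shutdown_event: asyncio.Event,
) -> None:
    """Drain the metadata queue into the store; exit once shutdown is signalled."""
    while True:
        try:
            item = metadata_queue.get_nowait()
        except asyncio.QueueEmpty:
            if shutdown_event.is_set():
                return  # queue is empty and no producer is left
            await asyncio.sleep(0.01)  # wait briefly for more metadata
            continue
        store[item["id"]] = item  # stands in for a database upsert


async def run_pipeline(batches: List[List[str]], num_workers: int = 2) -> Dict[str, Dict[str, Any]]:
    """Run retrieve and save workers concurrently and return the saved items."""
    id_queue: asyncio.Queue = asyncio.Queue()
    metadata_queue: asyncio.Queue = asyncio.Queue()
    shutdown_event = asyncio.Event()
    store: Dict[str, Dict[str, Any]] = {}
    for batch in batches:
        id_queue.put_nowait(batch)
    retrievers = [
        asyncio.create_task(retrieve_worker(id_queue, metadata_queue))
        for _ in range(num_workers)
    ]
    savers = [
        asyncio.create_task(save_worker(metadata_queue, store, shutdown_event))
        for _ in range(num_workers)
    ]
    await asyncio.gather(*retrievers)  # all metadata has been queued
    shutdown_event.set()               # savers may stop once the queue is empty
    await asyncio.gather(*savers)
    return store
```

Setting the shutdown event only after every retrieve worker has finished means the save workers keep draining the queue until nothing more can arrive, which is one way to get the graceful stop described above.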
The `video_metadata.py` module provides a flexible and efficient way to collect video metadata from the YouTube Data API. By utilizing asynchronous programming and concurrent workers, it can handle large volumes of data effectively. The module is designed to be both a standalone command-line tool and a module that can be integrated into other Python applications.
Adjust the parameters and extend the functionality as needed to suit your specific use case.