Skip to content

PoopLord1/centipede

Repository files navigation

Centipede

Centipede is a data ingestion framework. For the time being, it is mostly used to acquire and process data scraped from the Internet.

Centipede uses a series a of "limbs", or standardized classes, to represent a step in the ingestion process. Each limb has a series of configuration options to handle incoming data. One example is the SendText limb, which uses Twilio to send text notifications according to its configuration. Developers provide a function that accepts the current data package and returns True if it should send a text. This configuration is defined in a module-level dictionary:

SendText = {"get_text_flag": lambda thread: thread.is_malicious, 
            "message_template": "The thread was found to be malicious!"}

This config module is passed into the onstructor of the Centipede object. Afterwards, specify the pipeline with references to each of the classes you would like to invoke. Then "walk" along the data to process.

import centipede_config

def main():
    cent = Centipede(config=centipede_config)
    cent.define_limbs([FourChanScraper, DetectMaliceInText, DeepCopyPage, SendText])
    cent.walk()

This example code was taken from "FourChanMaliceDetection", found on my GitHub.

Limbs

Centipede allows the developer to architect a data ingestion pipeline using a system of "limbs", with each limb acting as a step in the pipeline. These limbs are designed to be generic enough to be repurposed for multiple projects.

  • DeepCopyPage - Deeply saves an offline copy of the provided URL. Can do it under certain conditions if desired.
  • DetectMaliceInText - Given certain pieces of text, determines if there is a reference to a real location (specifically, a school or town)
  • FourChanScraper - Scrapes 4Chan to gather thread contents and other data
  • SendText - Conditionally sends a text to a given phone number
  • SqlManager - Saves Youtube-specific data to an SQL database
  • YoutubeDownloader - Given a Youtube video, downloads the video and saves it to a given directory
  • YoutubeScraper - Given the URL for a Youtube video, saves metadata for that Youtube video

About

A soon-to-be-scalable data ingestion framework

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages