SpiffyChatterbox edited this page Jul 21, 2024 · 12 revisions

Writing your own site support for Gallery‐dl

Gallery-dl is an excellent tool for a datahoarder. It lets you download images, videos, comics, and more, and it has several features that a run-of-the-mill web scraper lacks:

  • It keeps track of what you have downloaded, so you don't download it again
  • It has configuration options galore
  • It supports authentication

However, gallery-dl can only download from sites it supports. To add support for a new site, someone must write a module called an extractor.

Prerequisites:

None of these is strictly required, but if you don't know at least the basics, read up on them before going much further. I'll skim over each in case you're unfamiliar, but if you have trouble following, treat that as a sign to study the topic in more depth.

  1. Basic GitHub terms: forking, branching, merging, and pushing.
  2. Regex. You don't need to be an expert, but you do need to write an accurate URL filter.
  3. Python classes and class inheritance.
  4. How websites work, well enough to decide how you want to handle downloads.
    e.g. If you don't know what an API is, you probably aren't ready to decide whether an API is available.
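
For item 3, the kind of class inheritance an extractor relies on can be sketched in a few lines. This is a standalone illustration, not gallery-dl code: BaseScraper and ExampleScraper are hypothetical names invented here to show a subclass overriding an attribute and a method while inheriting the rest.

```python
class BaseScraper:
    """Hypothetical base class; stands in for a framework-provided parent."""

    category = "base"

    def items(self):
        # Subclasses are expected to override this.
        raise NotImplementedError

    def run(self):
        # Shared behavior that every subclass inherits unchanged.
        return [f"{self.category}: {item}" for item in self.items()]


class ExampleScraper(BaseScraper):
    """Hypothetical subclass: overrides one attribute and one method."""

    category = "example"

    def items(self):
        yield "first"
        yield "second"


results = ExampleScraper().run()
print(results)
```

The subclass never defines run(), yet it can call it, and run() picks up the subclass's own category and items().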

Setting up your code in GitHub:

Fork the repository so you have a copy to work from.

Log in to GitHub. Go to https://github.com/mikf/gallery-dl/. Click the "Fork" button in the top right. This adds a copy of gallery-dl to your GitHub repositories. Clone your fork locally, and set up a new branch for your extractor.

Test your setup to make sure it works. At the command line, go to your gallery-dl top-level directory and run:

python -m gallery_dl --verbose https://files.catbox.moe/94gjye.txt

It should print a few debug lines showing the versions of the Python modules in use, and then show that it downloaded the file.

Creating your extractor framework:

Now let's set up the framework for your code. We'll need a couple of names to refer to:

  1. A human-readable name for your site. For this example I'll use "Contoso Widgets".
  2. A system-readable name for your site, like an abbreviation. Here I'll use "cw".

In scripts/supportedsites.py, add a row for your site in dictionary format: "cw": "Contoso Widgets",
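
As a rough sketch of what "a row in dictionary format" means, the snippet below uses a stand-in dictionary rather than the real contents of the script. The name CATEGORY_MAP and the "example" entry are assumptions for illustration only; add your line to whichever mapping of system-readable to human-readable names the script actually defines.

```python
# Stand-in for the name mapping in scripts/supportedsites.py.
# The variable name and the "example" entry are assumptions,
# not copied from the script.
CATEGORY_MAP = {
    "example": "Example Site",
    "cw": "Contoso Widgets",  # the new row: system-readable -> human-readable
}

print(CATEGORY_MAP["cw"])
```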

In gallery_dl/extractor/__init__.py, add a row to the modules list with the system-readable name for your site: "cw",
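
A minimal sketch of the idea, assuming modules is a plain list of module-name strings and that each name corresponds to a file of the same name in the extractor package directory. The neighbouring entries and the helper module_filename are hypothetical, shown only to make the naming relationship concrete.

```python
# Stand-in for the `modules` list; the other entries are hypothetical.
modules = [
    "alpha",
    "cw",  # the new row: must match the file name cw.py
    "zeta",
]


def module_filename(name):
    """Hypothetical helper: the file a module name is assumed to map to."""
    return f"{name}.py"


print([module_filename(name) for name in modules])
```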

TODO: <Placeholder for unit test>

Create a file at gallery_dl/extractor/cw.py. The filename must match the abbreviation for your site. Now populate that file with this bare-bones example to get started:

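from .common import Extractor, Message
from .. import text


class CWExampleExtractor(Extractor):
    category = "cw"  # system-readable site name
    subcategory = "test"
    # URL filter: which URLs this extractor should handle
    pattern = r"(?:https?://)?contosoweb\.com"

    def items(self):
        # Hard-coded file URL, just to prove the framework works
        url = "https://www.contosoweb.com/_img/gallery/2022/fancy-image.jpg"
        # Derive filename and extension metadata from the URL
        data = text.nameext_from_url(url)
        yield Message.Directory, data
        yield Message.Url, url, data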
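
To check a URL filter like this one before running gallery-dl, you can try it directly with Python's re module. This sketch assumes the pattern is tested from the start of the URL, the way re.match does; the sample URLs are made up for illustration.

```python
import re

# Same pattern as in the example extractor.
pattern = re.compile(r"(?:https?://)?contosoweb\.com")

samples = [
    "https://contosoweb.com/gallery/1",
    "contosoweb.com",
    "https://www.contosoweb.com/gallery/1",
    "https://example.org/",
]

# re.match anchors at the start of the string, so anything before
# "contosoweb.com" other than the optional scheme prevents a match.
matches = {url: bool(pattern.match(url)) for url in samples}
for url, ok in matches.items():
    print(f"{ok!s:5}  {url}")
```

The first two samples match and the last two do not. In particular, the www. form is rejected by this pattern as written; if your site uses that host name, adding an optional (?:www\.)? after the scheme is one way to allow it.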

Now your framework is all set up. In part 2 we'll look at that last file, the extractor file, and how it works.