
The priority parameter is not reflected when Scrapyd picks up jobs from the "pending" queue #533

Closed
aaronm137 opened this issue Nov 4, 2024 · 5 comments

Comments

aaronm137 commented Nov 4, 2024

I have 3 projects, and I created:

50 jobs for project_a like this:
curl http://localhost:6800/schedule.json -d project=project_a -d spider=spider_name -d priority=0

Then 50 jobs for project_b like this:
curl http://localhost:6800/schedule.json -d project=project_b -d spider=spider_name -d priority=0

At this point, I had 100 pending jobs with priority=0 and these jobs were gradually picked up for processing.

Then, I added 1 new job from project_c and the priority was set as:
curl http://localhost:6800/schedule.json -d project=project_c -d spider=spider_name -d priority=1

What happened was that this job from project_c with priority 1 was placed at the end of all 100 jobs from project_a and project_b, all of which had priority=0. My expectation was that if a new job is added to the pending queue with a higher priority (1 vs. 0) than the existing jobs, it will be pushed to the front of the queue and processed immediately once capacity is released. Instead, the job with priority=1 was placed at the end of the pending queue.

So either my expectation was wrong, or I am doing something wrong and the priority parameter is being ignored.

The documentation states:

priority - the job’s priority in the project’s spider queue (0 by default, higher number, higher priority)

How do I properly prioritize jobs in the queue?

jpmckinney commented Nov 4, 2024
I think you are observing this issue: #187

Basically, right now, each project has its own queue, and jobs are prioritized within that queue. But, what we really want is one queue for all projects, so that jobs are prioritized across all queues.
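
To make the difference concrete, here is a toy sketch (plain Python with made-up job tuples, not Scrapyd's actual data structures) of what a single cross-project priority queue would do:

import heapq

# Toy model, not Scrapyd's internals: one shared min-heap keyed on
# -priority, so the highest-priority job pops first across all projects.
jobs = []  # entries are (negated priority, project, spider)
for project, priority in [("project_a", 0), ("project_b", 0), ("project_c", 1)]:
    heapq.heappush(jobs, (-priority, project, f"{project}_spider"))

print(heapq.heappop(jobs))
# -> (-1, 'project_c', 'project_c_spider'): the priority=1 job wins
# regardless of project, which is the behavior requested in #187.

Today, each project's queue is a separate structure like this, and the poller only compares priorities within one queue at a time.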

I'll close this issue as a duplicate.

jpmckinney closed this as not planned (duplicate) Nov 4, 2024
aaronm137 commented Nov 4, 2024

Understood, thanks for shedding some light on it.

I am currently dealing with a situation where I have 4 Scrapy projects. 3 of them complete pretty fast, but the 4th is a long-running task (even when I break it into smaller chunks), and it blocks the other 3. The idea was to use the priority parameter to give the first 3 projects higher priority, but priority does not work across multiple projects.

Does Scrapyd have any feature or workaround that could prevent the one project from blocking the other 3? (The 4th project can run for ~48-72 hours; the first 3 take ~3 hours each.) I was thinking of dedicating one CPU core to that project, or somehow "isolating" it (ideally, I would not want to move this 4th project to a separate server).


jpmckinney commented Nov 4, 2024

The Scrapyd poller (the IPoller interface) calls pop on the spider queue (the ISpiderQueue interface) to get the next job to run.

It is possible to provide your own poller class or spiderqueue class.

So, the way to fix it (other than fixing #187) is to implement your own poller and/or queue, and then update your configuration to use those new classes.

Edit: For example, maybe you'd want to change the poller to "peek" at the next job across all queues, and then take the highest priority among them. (Or have it do a round-robin, or some other strategy.)
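
As a rough illustration, something along these lines (untested sketch, not working code: attribute names like self.queues, self.dq, and _message() follow Scrapyd's QueuePoller and may differ between versions, and the peek() helper is assumed to exist on your custom queue, since the stock SQLite queue only exposes pop()):

from twisted.internet.defer import inlineCallbacks, maybeDeferred, returnValue

from scrapyd.poller import QueuePoller

class CrossProjectPoller(QueuePoller):
    # Sketch: pop the highest-priority job across ALL project queues,
    # instead of serving each project's queue independently.

    @inlineCallbacks
    def poll(self):
        best_project, best_priority = None, None
        for project, queue in self.queues.items():
            # peek() is an assumed custom helper that returns the next
            # message (including its 'priority') without removing it.
            msg = yield maybeDeferred(queue.peek)
            if msg is not None and (best_priority is None or msg["priority"] > best_priority):
                best_project, best_priority = project, msg["priority"]
        if best_project is not None:
            msg = yield maybeDeferred(self.queues[best_project].pop)
            if msg is not None:
                returnValue(self.dq.put(self._message(msg, best_project)))

# scrapyd.conf would then point at the new classes, e.g.
#   [scrapyd]
#   poller = myproject.poller.CrossProjectPoller
#   spiderqueue = myproject.spiderqueue.PeekableSpiderQueue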


aaronm137 commented Nov 7, 2024

I am new to Python, so building this extension myself might be overwhelming. However, I am trying to figure it out, and while debugging I noticed that I cannot confirm whether the priority parameter has been properly accepted by Scrapyd.

This is how I schedule jobs:
curl http://localhost:6800/schedule.json -d project=project_b -d spider=project_b -d priority=2

And in the Scrapyd terminal, I can see the following output when Scrapyd accepted the newly incoming job:

2024-11-07T11:39:59+0100 [scrapyd.launcher#info] Process started: project='project_b' spider='project_b' job='a285e9ac9cf411efbc951a1d2b761a4c' pid=93748 args=['/Users/aaronm/.venv/bin/python3.13', '-m', 'scrapyd.runner', 'crawl', 'project_b', '-s', 'LOG_FILE=/Users/aaronm/pythondev/scrapyd_test/logs/project_b/project_b/a285e9ac9cf411efbc951a1d2b761a4c.log', '-a', '_job=a285e9ac9cf411efbc951a1d2b761a4c']

In the terminal output, I can see that Scrapyd received the call through the API and processed these parameters:

  • project (project_b)
  • spider (project_b)
  • job (a285e9ac9cf411efbc951a1d2b761a4c)
  • pid (93748)

But the priority parameter is missing, although it is included in the API call. Am I attaching this parameter incorrectly?


jpmckinney commented Nov 7, 2024

Yes, you are setting the priority correctly. Using the default configuration, this priority is set in a SQLite database in the dbs/ directory. The database file is named after your project, e.g. myproject.db. If you open that database with the sqlite3 command, you can run SELECT * FROM spider_queue; and you'll see the priority stored.
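
For example, a quick check from Python (stdlib only; the path assumes the default dbs_dir setting and your project name, so adjust both as needed):

import sqlite3

# Inspect the pending queue for one project. Each row holds the job's
# priority alongside the serialized job message.
con = sqlite3.connect("dbs/project_b.db")
for row in con.execute("SELECT * FROM spider_queue"):
    print(row)
con.close()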

The "Process started" log message is printed later, after the poller takes the highest priority pending job from the queue, and starts the process to run the job.
