
The priority parameter is not reflected when Scrapyd picks up jobs from the "pending" queue #533

Closed
aaronm137 opened this issue Nov 4, 2024 · 5 comments

Comments

aaronm137 commented Nov 4, 2024

I have 3 projects, and I created:

50 jobs for project_a like this:
curl http://localhost:6800/schedule.json -d project=project_a -d spider=spider_name -d priority=0

Then 50 jobs for project_b like this:
curl http://localhost:6800/schedule.json -d project=project_b -d spider=spider_name -d priority=0

At this point, I had 100 pending jobs with priority=0 and these jobs were gradually picked up for processing.

Then, I added 1 new job from project_c and the priority was set as:
curl http://localhost:6800/schedule.json -d project=project_c -d spider=spider_name -d priority=1

What happened was that this job from project_c with priority 1 was placed at the end of all 100 jobs from project_a and project_b, all of which had priority=0. My expectation was that if a new job is added to the pending queue with a higher priority (1 vs. 0) than the existing jobs, it will be pushed to the front of the queue and processed immediately once capacity is released. Instead, the job with priority=1 was placed at the end of the pending queue.

So either my expectation was wrong, or I am doing something wrong and the priority parameter is being ignored.

The documentation states:

priority - the job’s priority in the project’s spider queue (0 by default, higher number, higher priority)

How do I properly prioritize jobs in the queue?

jpmckinney commented Nov 4, 2024
I think you are observing this issue: #187

Basically, right now, each project has its own queue, and jobs are prioritized within that queue. But, what we really want is one queue for all projects, so that jobs are prioritized across all queues.
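
To make the difference concrete, here is a toy sketch (plain Python with made-up job tuples, not Scrapyd's actual data structures) of what a single cross-project priority queue would do:

import heapq

# Toy model, not Scrapyd's internals: one shared min-heap keyed on
# -priority, so the highest-priority job pops first across all projects.
jobs = []  # entries are (negated priority, project, spider)
for project, priority in [("project_a", 0), ("project_b", 0), ("project_c", 1)]:
    heapq.heappush(jobs, (-priority, project, f"{project}_spider"))

print(heapq.heappop(jobs))
# -> (-1, 'project_c', 'project_c_spider'): the priority=1 job wins
# regardless of project, which is the behavior requested in #187.

Today, each project's queue is a separate structure like this, and the poller only compares priorities within one queue at a time.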

I'll close this issue as a duplicate.

jpmckinney closed this as not planned (duplicate) Nov 4, 2024
aaronm137 commented Nov 4, 2024

Understood, thanks for shedding some light on it.

I am currently dealing with a situation where I have 4 Scrapy projects. 3 of them complete pretty fast, but the 4th is a long-running task (even when I break it into smaller chunks), and it blocks the other 3. The idea was to use the priority parameter to give the first 3 projects higher priority, but priority does not work across multiple projects.

Does Scrapyd have any feature or workaround that could prevent the one project from blocking the other 3? (The 4th project can run for ~48-72 hours; the first 3 take ~3 hours each.) I was thinking of dedicating one CPU core to that project, or somehow "isolating" it (ideally, I would not want to move this 4th project to a separate server).


jpmckinney commented Nov 4, 2024

The Scrapyd poller (the IPoller interface) calls pop on the spider queue (the ISpiderQueue interface) to get the next job to run.

It is possible to provide your own poller class or spiderqueue class.

So, the way to fix it (other than fixing #187) is to implement your own poller and/or queue, and then update your configuration to use those new classes.

Edit: For example, maybe you'd want to change the poller to "peek" at the next job across all queues, and then take the highest priority among them. (Or have it do a round-robin, or some other strategy.)
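
As a rough illustration, something along these lines (untested sketch, not working code: attribute names like self.queues, self.dq, and _message() follow Scrapyd's QueuePoller and may differ between versions, and the peek() helper is assumed to exist on your custom queue, since the stock SQLite queue only exposes pop()):

from twisted.internet.defer import inlineCallbacks, maybeDeferred, returnValue

from scrapyd.poller import QueuePoller

class CrossProjectPoller(QueuePoller):
    # Sketch: pop the highest-priority job across ALL project queues,
    # instead of serving each project's queue independently.

    @inlineCallbacks
    def poll(self):
        best_project, best_priority = None, None
        for project, queue in self.queues.items():
            # peek() is an assumed custom helper that returns the next
            # message (including its 'priority') without removing it.
            msg = yield maybeDeferred(queue.peek)
            if msg is not None and (best_priority is None or msg["priority"] > best_priority):
                best_project, best_priority = project, msg["priority"]
        if best_project is not None:
            msg = yield maybeDeferred(self.queues[best_project].pop)
            if msg is not None:
                returnValue(self.dq.put(self._message(msg, best_project)))

# scrapyd.conf would then point at the new classes, e.g.
#   [scrapyd]
#   poller = myproject.poller.CrossProjectPoller
#   spiderqueue = myproject.spiderqueue.PeekableSpiderQueue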


aaronm137 commented Nov 7, 2024

I am new to Python, so building this extension myself might be overwhelming. However, I am trying to figure it out, and while debugging I noticed that I cannot confirm whether the priority parameter has been properly accepted by Scrapyd.

This is how I schedule jobs:
curl http://localhost:6800/schedule.json -d project=project_b -d spider=project_b -d priority=2

And in the Scrapyd terminal, I can see the following output when Scrapyd accepted the newly incoming job:

2024-11-07T11:39:59+0100 [scrapyd.launcher#info] Process started: project='project_b' spider='project_b' job='a285e9ac9cf411efbc951a1d2b761a4c' pid=93748 args=['/Users/aaronm/.venv/bin/python3.13', '-m', 'scrapyd.runner', 'crawl', 'project_b', '-s', 'LOG_FILE=/Users/aaronm/pythondev/scrapyd_test/logs/project_b/project_b/a285e9ac9cf411efbc951a1d2b761a4c.log', '-a', '_job=a285e9ac9cf411efbc951a1d2b761a4c']

In the terminal output, I can see that Scrapyd received the call through the API and processed these parameters:

  • project (project_b)
  • spider (project_b)
  • job (a285e9ac9cf411efbc951a1d2b761a4c)
  • pid (93748)

But the priority parameter is missing, although it is included in the API call. Am I attaching this parameter incorrectly?


jpmckinney commented Nov 7, 2024

Yes, you are setting the priority correctly. Using the default configuration, this priority is set in a SQLite database in the dbs/ directory. The database file is named after your project, e.g. myproject.db. If you open that database with the sqlite3 command, you can run SELECT * FROM spider_queue; and you'll see the priority stored.
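
For example, a quick check from Python (stdlib only; the path assumes the default dbs_dir setting and your project name, so adjust both as needed):

import sqlite3

# Inspect the pending queue for one project. Each row holds the job's
# priority alongside the serialized job message.
con = sqlite3.connect("dbs/project_b.db")
for row in con.execute("SELECT * FROM spider_queue"):
    print(row)
con.close()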

The "Process started" log message is printed later, after the poller takes the highest priority pending job from the queue, and starts the process to run the job.
