The `priority` parameter is not reflected when Scrapyd picks up jobs from the "pending" queue #533
I think you are observing this issue: #187. Basically, right now, each project has its own queue, and jobs are prioritized within that queue. But what we really want is one queue for all projects, so that jobs are prioritized across all projects. I'll close this issue as a duplicate.
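To make that concrete, here is a toy sketch (not Scrapyd's actual implementation; all names are made up) showing why a high-priority job in one project can still wait behind lower-priority jobs in other projects:

```python
import heapq

# Toy model (NOT Scrapyd code): one priority queue per project, and a
# poller that scans projects one at a time.
queues = {"project_a": [], "project_b": [], "project_c": []}

def schedule(project, job, priority):
    # heapq is a min-heap, so negate the priority to pop the highest first.
    heapq.heappush(queues[project], (-priority, job))

def poll():
    # Scan projects in a fixed order and take a job from the first
    # non-empty queue; priority only matters WITHIN that queue.
    for project, queue in queues.items():
        if queue:
            neg_priority, job = heapq.heappop(queue)
            return project, job, -neg_priority
    return None

schedule("project_a", "job_a1", priority=0)
schedule("project_b", "job_b1", priority=0)
schedule("project_c", "job_c1", priority=1)  # highest priority overall

# project_a's job is picked first, even though project_c's job has the
# highest priority across all projects.
print(poll())  # -> ('project_a', 'job_a1', 0)
```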
Understood, thanks for shedding some light on it. I am currently dealing with a situation where I have 4 Scrapy projects. 3 projects complete pretty fast, but the 4th one is a long-running task (even if I break it into smaller chunks), and this 4th one is blocking the other 3 projects. The idea was to use the `priority` parameter to work around this.

Does Scrapyd have any feature or workaround by which I could prevent that one project from blocking the other 3? (The 4th project can run for ~48-72 hours, the first 3 for ~3 hours each.) I was wondering about things like dedicating one CPU core to a project, or somehow "isolating" it (ideally, I would not want to move this 4th project to a separate server).
The Scrapyd poller (interface here) calls each project's spider queue in turn and pops a job from the first non-empty queue. It is possible to provide your own poller class or spiderqueue class. So the way to fix it (other than fixing #187) is to implement your own poller and/or queue, and then update your configuration to use those new classes.

Edit: For example, maybe you'd want to change the poller to "peek" at the next job across all queues, and then take the highest priority among them. (Or have it do a round-robin, or some other strategy.)
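As a rough illustration of that "peek" strategy, here is a minimal sketch. The `peek()` method is hypothetical: the stock spiderqueue interface only exposes `pop()`, so a custom queue class would have to add it:

```python
# Sketch of a cross-project poller (hypothetical, not working Scrapyd code).
# It assumes each queue exposes peek() -> (priority, message) or None,
# a method a custom queue class would have to provide.

class GlobalPriorityPoller:
    def __init__(self, queues):
        self.queues = queues  # mapping: project name -> spider queue

    def next_message(self):
        best_project = None
        best_priority = None
        for project, queue in self.queues.items():
            peeked = queue.peek()  # hypothetical method
            if peeked is None:
                continue
            priority, _message = peeked
            if best_priority is None or priority > best_priority:
                best_project, best_priority = project, priority
        if best_project is None:
            return None  # nothing pending in any project
        # Pop from the winning project's queue and tag the message.
        message = self.queues[best_project].pop()
        message["_project"] = best_project
        return message
```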
I am new to Python, so building this extension myself might be overwhelming. However, I am trying to figure it out, and while debugging I noticed that I cannot confirm whether the `priority` parameter is applied at all. I schedule jobs via `schedule.json` with a `priority` argument (the same `curl` calls shown in my next comment).

In the Scrapyd terminal, I can see output when Scrapyd accepts the newly incoming job: it shows that Scrapyd received the API call and processed its parameters, but the `priority` parameter is missing from that output.
Yes, you are setting the priority correctly. Using the default configuration, this priority is stored in a SQLite database in the `dbs` directory. The "Process started" log message is printed later, after the poller takes the highest-priority pending job from the queue and starts the process to run the job.
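If you want to verify the stored priority yourself, one hedged way is to open the queue database directly. The `dbs/project_a.db` path is an assumption based on the default `dbs_dir` setting, and the script prints whatever tables it finds rather than assuming a schema:

```python
import sqlite3

# Inspect a project's SQLite queue file to confirm that priorities are
# stored. The path "dbs/project_a.db" is an assumption based on the
# default dbs_dir setting; adjust it to your configuration.
conn = sqlite3.connect("dbs/project_a.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print("tables:", tables)
for table in tables:
    # Each row should contain a priority and a JSON-encoded message.
    for row in conn.execute(f"SELECT * FROM {table} LIMIT 5"):
        print(table, row)
conn.close()
```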
I have 3 projects, and I have created 50 jobs for `project_a` like this:

```
curl http://localhost:6800/schedule.json -d project=project_a -d spider=spider_name -d priority=0
```

Then 50 jobs for `project_b` like this:

```
curl http://localhost:6800/schedule.json -d project=project_b -d spider=spider_name -d priority=0
```

At this point, I had 100 pending jobs with priority=0, and these jobs were gradually picked up for processing.

Then I added 1 new job from `project_c`, with the priority set like this:

```
curl http://localhost:6800/schedule.json -d project=project_c -d spider=spider_name -d priority=1
```

What happened was that this job from `project_c` with priority 1 was put at the end of all the (100) jobs of `project_a` and `project_b`, all of which had priority=0. My expectation was that if a new job is added to the `pending` queue and its priority is higher (1 vs. 0) than that of the existing jobs, the priority-1 job would be pushed to the front of the queue and processed as soon as capacity is freed; instead, it was put at the end of all jobs in the `pending` queue.

Either my expectation was wrong or I am doing something wrong, because the `priority` parameter appears to be ignored. The documentation states:

How do I properly prioritize jobs in the queue?
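As an aside, one way to observe the pending jobs from outside (rather than reading the SQLite files) is Scrapyd's `listjobs.json` endpoint; a small stdlib-only sketch, assuming the default host and port:

```python
import json
from urllib.request import urlopen

# Query Scrapyd's listjobs.json API for one project (default host/port
# assumed). The response contains "pending", "running" and "finished".
url = "http://localhost:6800/listjobs.json?project=project_a"
with urlopen(url) as response:
    data = json.load(response)

for job in data["pending"]:
    print(job)  # pending jobs for project_a
```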