cfq-iosched: temporarily boost queue priority for idle classes
commit 26feb9bc3c5322d331d0b713106ac7631cbab938
Author: Ritesh Harjani <riteshh@codeaurora.org>
Date:   Wed Aug 9 18:28:32 2017 +0530

    cfq: Give a chance for arming slice idle timer in case of group_idle

    In the following scenario, blkio cgroups do not get service in
    proportion to their assigned weights:
    1. The underlying device is non-rotational, with a single HW queue
    of depth >= CFQ_HW_QUEUE_MIN.
    2. Two blkio cgroups, cg1 (weight 1000) and cg2 (weight 100), each
    run one process (file1 and file2) doing sync IO.

    fio results for the above use case (without this patch):
    file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
      write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
    <...>
    file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
      write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
    <...>
    // Both processes get equal BW even though they belong to different
    // cgroups with weights of 1000 (cg1) and 100 (cg2)

    In the above case (non-rotational NCQ devices), as soon as a request
    from cg1 completes, the CFQ algorithm expires the group without
    granting it any idle time or weight priority, even though cg1 has
    the higher slice (set_slice=10), and schedules another cfq group
    (here cg2). The two cfq groups (cg1 and cg2) thus keep alternating
    for disk time, and weight-based cgroup scheduling is lost.

    This patch gives the CFQ algorithm (cfq_arm_slice_timer) a chance
    to arm the slice idle timer when group_idle is enabled. If
    group_idle is not wanted either (e.g. for non-rotational NCQ
    drives), it must be disabled explicitly by setting group_idle = 0
    via sysfs.

    fio results for the above use case with this patch:
    file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
      write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
    <..>
    file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
      write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
    <..>
    // Here each process's BW is in proportion to its cgroup's weight.

    Change-Id: Ia68ba16e76fdcc60d66e256f94290c871c5b2a73
    Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit 2133aba5a779fa206289ea3e1ebbc5ef67be7d2c
Author: Hou Tao <houtao1@huawei.com>
Date:   Wed Mar 1 09:02:33 2017 +0800

    cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode

    When adding a cfq_group to the cfq service tree, we use
    CFQ_IDLE_DELAY as the delay of the cfq_group's vdisktime if other
    cfq_groups are already present.

    When cfq is in iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") can result in an excessively large
    delay and an abnormal IO scheduling delay for the newly added
    cfq_group. To fix it, simply revert to the old CFQ_IDLE_DELAY value,
    HZ / 5, when iops mode is enabled.

    Although they happen to share the same value, the delay of an
    idle-class cfq_queue and the delay of a cfq_group are different
    things, so define two new macros for the cfq_group delay under
    time-slice mode and iops mode.

    Change-Id: I6d3a048e063b2826136e5416c28c1a7b471f5d18
    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Cc: <stable@vger.kernel.org> # 4.8+
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit 9b5738156b8f9da812966499f8982b6cfd25e144
Author: Matthias Kaehlcke <mka@chromium.org>
Date:   Fri May 26 14:22:37 2017 -0700

    cfq-iosched: Delete unused function min_vdisktime()

    This fixes the following warning when building with clang:

        block/cfq-iosched.c:970:19: error: unused function 'min_vdisktime'
            [-Werror,-Wunused-function]

    Change-Id: I74bcf4a5407496a627e6ac41b274c7a3cc887cc3
    Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit 962bbb5e92abc53a11042a8d992057b6f1e7505a
Author: Markus Elfring <elfring@users.sourceforge.net>
Date:   Sat Jan 21 22:44:07 2017 +0100

    cfq-iosched: Adjust one function call together with a variable assignment

    The script checkpatch.pl reported the following:

    ERROR: do not use assignment in if condition

    Fix the affected place in the source code.

    Change-Id: I54338d5e9de5d3b8b593b63a535c98ec9a4a0f05
    Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit a846d0ed19f1b51314a27a01810922643b457ec8
Author: Jens Axboe <axboe@fb.com>
Date:   Thu Jun 9 15:47:29 2016 -0600

    cfq-iosched: temporarily boost queue priority for idle classes

    If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
    [478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
    [478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
    [478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
    [478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
    [478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
    [478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
    [478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
    [478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
    [478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
    [478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
    [478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
    [478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
    [478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
    [478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
    [478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
    [478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
    [478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
    [478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
    [478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
    [478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
    [478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
    [478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
    [478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
    [478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
    [478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
    [478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
    [478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
    [478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
    [478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
    [478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
    [478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
    [478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
    [478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
    [478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
    [478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
    [478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
    [478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
    [478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
    [478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
    [478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
    [478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
    [478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
    [478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
    [478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
    [478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
    [478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Change-Id: I8e7e1dd2f9bbdf4b42954260ecde1a4ea53ad1dd
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit 05190a47bb024b6a03ce480054b27d1ea6fe315a
Author: Jens Axboe <axboe@fb.com>
Date:   Wed Jun 10 08:01:20 2015 -0600

    cfq-iosched: fix the setting of IOPS mode on SSDs

    A previous commit wanted to make CFQ default to IOPS mode on
    non-rotational storage; however, it did so when the queue was
    initialized, and the non-rotational flag is only set later on
    during probing.

    Add an elevator hook that gets called off the add_disk() path. At
    that point we know that feature probing has finished, and we can
    reliably check the various flags that drivers can set.

    Change-Id: I854ce21ffae67cb7d9b4164dc77ad4bc57731ebb
    Fixes: 41c0126 ("block: Make CFQ default to IOPS mode on SSDs")
    Tested-by: Romain Francoise <romain@orebokech.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

commit 6bd1f2f55e64640caf1a2def7dea38cab3dc3c36
Author: Jan Kara <jack@suse.com>
Date:   Tue Jan 12 16:24:15 2016 +0100

    cfq-iosched: Don't group_idle if cfqq has big thinktime

    There is no point in idling on a cfq group if its only cfq queue
    has too big a thinktime.

    Change-Id: I988a6f6681410cdcc9b1f95e0e0a40ed60f60c31
    Signed-off-by: Jan Kara <jack@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>

Change-Id: Idaaa4f30cd537fa785a53775b75589df178aedd3
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
axboe authored and DennySPB committed Mar 5, 2018
1 parent 6fb839a commit 23d3a68
Showing 2 changed files with 55 additions and 17 deletions.
5 changes: 5 additions & 0 deletions block/blk-core.c
@@ -1728,6 +1728,11 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
if (sync)
rw_flags |= REQ_SYNC;

/*
* Add in META/PRIO flags, if set, before we get to the IO scheduler
*/
rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));

/*
* Grab a free request. This is might sleep but can not fail.
* Returns with the queue unlocked.
67 changes: 50 additions & 17 deletions block/cfq-iosched.c
@@ -36,9 +36,13 @@ static const u64 cfq_target_latency = (u64)NSEC_PER_SEC * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;

/*
* offset from end of service tree
* offset from end of queue service tree for idle class
*/
#define CFQ_IDLE_DELAY (NSEC_PER_SEC / 5)
/* offset from end of group service tree under time slice mode */
#define CFQ_SLICE_MODE_GROUP_DELAY (NSEC_PER_SEC / 5)
/* offset from end of group service under IOPS mode */
#define CFQ_IOPS_MODE_GROUP_DELAY (HZ / 5)

/*
* below this threshold, we consider thinktime immediate
@@ -136,7 +140,7 @@ struct cfq_queue {

/* io prio of this group */
unsigned short ioprio, org_ioprio;
unsigned short ioprio_class;
unsigned short ioprio_class, org_ioprio_class;

pid_t pid;

@@ -977,15 +981,6 @@ static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
return min_vdisktime;
}

static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
{
s64 delta = (s64)(vdisktime - min_vdisktime);
if (delta < 0)
min_vdisktime = vdisktime;

return min_vdisktime;
}

static void update_min_vdisktime(struct cfq_rb_root *st)
{
struct cfq_group *cfqg;
@@ -1361,6 +1356,14 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
cfqg->vfraction = max_t(unsigned, vfr, 1);
}

static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
{
if (!iops_mode(cfqd))
return CFQ_SLICE_MODE_GROUP_DELAY;
else
return CFQ_IOPS_MODE_GROUP_DELAY;
}

static void
cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
@@ -1380,7 +1383,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
n = rb_last(&st->rb);
if (n) {
__cfqg = rb_entry_cfqg(n);
cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
cfqg->vdisktime = __cfqg->vdisktime +
cfq_get_cfqg_vdisktime_delay(cfqd);
} else
cfqg->vdisktime = st->min_vdisktime;
cfq_group_service_tree_add(st, cfqg);
@@ -2566,9 +2570,11 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
if (!cfqg)
return NULL;

for_each_cfqg_st(cfqg, i, j, st)
if ((cfqq = cfq_rb_first(st)) != NULL)
for_each_cfqg_st(cfqg, i, j, st) {
cfqq = cfq_rb_first(st);
if (cfqq)
return cfqq;
}
return NULL;
}

@@ -2737,6 +2743,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_rb_root *st = cfqq->service_tree;
struct cfq_io_cq *cic;
u64 sl, group_idle = 0;
u64 now = ktime_get_ns();
@@ -2746,7 +2753,8 @@
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag &&
!cfqd->cfq_group_idle)
return;

WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
@@ -2788,8 +2796,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}

/* There are other queues in the group, don't do group idle */
if (group_idle && cfqq->cfqg->nr_cfqq > 1)
/*
* There are other queues in the group or this is the only group and
* it has too big thinktime, don't do group idle.
*/
if (group_idle &&
(cfqq->cfqg->nr_cfqq > 1 ||
cfq_io_thinktime_big(cfqd, &st->ttime, true)))
return;

cfq_mark_cfqq_wait_request(cfqq);
@@ -3518,6 +3531,7 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct cfq_io_cq *cic)
* elevate the priority of this queue
*/
cfqq->org_ioprio = cfqq->ioprio;
cfqq->org_ioprio_class = cfqq->ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}

@@ -4159,6 +4173,24 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
cfq_schedule_dispatch(cfqd);
}

static void cfqq_boost_on_prio(struct cfq_queue *cfqq, int rw)
{
/*
* If REQ_PRIO is set, boost class and prio level, if it's below
* BE/NORM. If prio is not set, restore the potentially boosted
* class/prio level.
*/
if (!(rw & REQ_PRIO)) {
cfqq->ioprio_class = cfqq->org_ioprio_class;
cfqq->ioprio = cfqq->org_ioprio;
} else {
if (cfq_class_idle(cfqq))
cfqq->ioprio_class = IOPRIO_CLASS_BE;
if (cfqq->ioprio > IOPRIO_NORM)
cfqq->ioprio = IOPRIO_NORM;
}
}

static inline int __cfq_may_queue(struct cfq_queue *cfqq)
{
if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
@@ -4189,6 +4221,7 @@ static int cfq_may_queue(struct request_queue *q, int rw)
cfqq = cic_to_cfqq(cic, rw_is_sync(rw));
if (cfqq) {
cfq_init_prio_data(cfqq, cic);
cfqq_boost_on_prio(cfqq, rw);

return __cfq_may_queue(cfqq);
}
