Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pagination using Range header #234

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Pagination using Range header #234

wants to merge 13 commits into from

Conversation

gudmundur
Copy link
Member

Note: 🐝 This is still work in progress. 🐝

This code is based on how the Heroku Platform API deals with pagination and ranges, both in terms of code and design. Credit for that goes to @brandur and @geemus.

end

context 'with an unbound end range' do
let(:range_header) { 'objects 0-' }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geemus Do you think unbound ranges should be allowed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gudmundur I think they are in the API already!

$ curl -H "Range: id ]006e2c53-e3a9-4152-851c-abf6e9991c63..; max=1" -H "Accept: application/vnd.heroku+json; version=3" -n https://api.heroku.com/apps
[
  {
    "archived_at":null,
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not absolutely required, but certainly nice to have.

@brandur
Copy link
Member

brandur commented Dec 14, 2015

@gudmundur Looks good so far! Let me know when you want a more detailed review.

One tricky piece that we may still need are some kind of response helpers that are used to generate a next range.

BTW, not sure if I missed something, but did we change the range from the two dots to a hyphen? Might make ranging via UUID a little tricky ;)

@gudmundur
Copy link
Member Author

@brandur Yeah, there is an internal RFC at Heroku that changes a few things. The major difference is that it's using slicing (limit & offset in SQL), rather than >= and <=. The benefit there is that if the key being sorting over is not unique, then there is no need to make it unique to sort out boundary issues between ranges. Then a minor difference is a hyphen, rather than two dots.

@brandur
Copy link
Member

brandur commented Dec 15, 2015

Ah, I see! Maybe a few considerations for you and @geemus to consider:

  1. This is a pretty large breaking change in that it's not compatible with the old/current ranging interface. It might be worthwhile considering just reverting to a more traditional style of pagination (i.e. via query parameter) if it has to change anyway.
  2. Postgres is bad at limit/offset because the entire dataset up to the offset must still be computed. Unlike the where predicate, its performance doesn't scale to large data sets/offsets. Here's some token information from the manual (see the last paragraph) and here's a more detailed explanation.
  3. Querying by offset instead of ID will result in less stable results because indices mutate (especially if using UUIDs). It's always hard to paginate over a large data set in a stable way, but offset by ID makes this even harder.

I don't have too much skin in the game anymore, but it might be wise to at least make the pagination mechanism modular so that users can vary their implementation if necessary.

@geemus
Copy link
Member

geemus commented Dec 15, 2015

@brandur thanks for the feedback, some thoughts:

  1. I would be fine maintaining double-dots instead of dash (or accepting either). I was a bit torn between backwards compat here and sticking more closely to the spec. Double-dot does seem more universal though and only one extra character.
  2. We ran into the postgres limitation and discussed some. It seems not great, but the ease of implementation is nice. Do you have a sense of what scale of results starts to be a problem? ie for smallish sets is it also problematic or only in the tens/hundreds of thousands or something?
  3. We had discussed order by created-at or similar, in the hopes that it would preserve order.

Also, happy to share the RFC itself if you are curious, will probably think about trying to move it into the HTTP API Guide in some capacity once we settle on some things.

@brandur
Copy link
Member

brandur commented Dec 15, 2015

@geemus

I would be fine maintaining double-dots instead of dash (or accepting either). I was a bit torn between backwards compat here and sticking more closely to the spec. Double-dot does seem more universal though and only one extra character.

Yeah, that's a tough one. I suppose that if you do adopt offset-based pagination, it might make sense to go back to hyphen if most client libraries will have to change anyway.

We ran into the postgres limitation and discussed some. It seems not great, but the ease of implementation is nice. Do you have a sense of what scale of results starts to be a problem? ie for smallish sets is it also problematic or only in the tens/hundreds of thousands or something?

Check out this graph from the article above. It's far from a comprehensive analysis, but it shows the performance drop-off very well. The discrepancy starts to show up right away, but it took 600 pages to see a very serious cliff.

The add-on API endpoint to retrieve lists of apps is probably large enough for at least some add-on providers for the difference to show (if you ever get that moved over to V3 ;). Obviously since it's offset-based today it's not going to take down the database or anything, but it's less efficient.

And one more data point: Stripe started with offset-based pagination but moved off of it (somewhat anecdotal ;). This wasn't so much for performance reasons, but more because offset-based pagination's inferior characteristics around stability.

@geemus
Copy link
Member

geemus commented Dec 22, 2015

@brandur I guess maybe we end up stuck with both, at least potentially. Offset is simpler to implement and sometimes desired (when you want to range over non-unique and/or not ordered/incrementing values), but comes with tradeoffs. Otherwise using our original style (or something like it) over incrementing id would be more stable and more performant, but if I'm not mistaken it tends to need auto-incrementing id to work well in terms of stability (and I'm not sure we want to reveal those and/or depend upon their presence). Definitely something I'll have to ponder more before we settle on something here.

@bjeanes
Copy link

bjeanes commented Dec 23, 2015

Joining in here because part of the discussion for changing this was prompted by a few situations my team ran into. Content-based pagination is superior in all the ways you've mentioned @brandur but it still fundamentally relies on paginating over unique values. Some workarounds might be conceivable based on faux-composite indexes, but these have other trade-offs. Cleverer but inconceivable to implement well alternatives could also include returning smaller or larger ranges than requested when the page boundary lands in the middle of a set of same-value fields. But... :/

Have you thought much about a way to handle non-unique sort fields? In practice, these might be dates which rarely collide but in theory one might be sorting by something like a state field with lots of collisions.

@bjeanes
Copy link

bjeanes commented Dec 23, 2015

I don't think I like what I'm about to propose but I'll do so anyway in case it can be a strawman which prompts other thoughts...

If, like Heroku, the sortable fields are pre-declared and the range headers are treated opaquely, we could create extra columns which hold unique values corresponding in order to the non-unique column, but which arbitrarily orders same-value records uniquely. E.g. the created_at column might have a corresponding created_at_sort which is unique.

The trade-off here might be not working too well for mutable fields, but even this could be accommodated, to an extend, by spacing out the sort values. This might mean a header could be something like Content-Range: created_at ]2015-12-22T1500Z-002.., where the -002 is acting as the tiebreaker.

@bjeanes
Copy link

bjeanes commented Dec 23, 2015

To take this further, Postgres is capable enough that these fields could actually be a custom type or domain which is a tuple of the original value and an auto-incrementing number.

@gudmundur
Copy link
Member Author

@bjeanes Wouldn't the tiebreaker have to be based on something unique in the content?

@tmaier tmaier mentioned this pull request Jan 5, 2016
8 tasks
@bjeanes
Copy link

bjeanes commented Jan 6, 2016

@gudmundur yes, whether a sequence or some other field of the record, like id.

@brandur
Copy link
Member

brandur commented Jan 7, 2016

@bjeanes

Have you thought much about a way to handle non-unique sort fields? In practice, these might be dates which rarely collide but in theory one might be sorting by something like a state field with lots of collisions.

Not really :/ I don't know of any non-unique examples in the Stripe API off-hand (which isn't to say there aren't; I still don't know it all that well), but I'll dig in if one comes up.

The trade-off here might be not working too well for mutable fields, but even this could be accommodated, to an extend, by spacing out the sort values. This might mean a header could be something like Content-Range: created_at ]2015-12-22T1500Z-002.., where the -002 is acting as the tiebreaker.

Good idea! That certainly seems worthy of investigation at the very least. There are of course that some possible downsides like the ranges looking a tad unwieldy and possibly a more difficult implementation, but I think you probably considered all of that already.

A variation of the same idea might also be to allow expand the spec to allow ranging on composite fields and then just make sure that composite fields are always used in place of non-unique fields by themselves. So for example, instead of saying that it's possible to range on Range: created_at, you'd have something like Range: created_at/id and all range values would be compounds like 2015-12-22T1500Z/001.

I was thinking originally that you might need some composite indexes on (field, id), but thinking it about it more, it should be fine without in most cases by just having Postgres scan inside the smaller data set filtered on field. You'd obviously want to aim to only index on non-unique fields where you have some reasonable guarantees that field values won't be hugely duplicated.

Anyway, I'm looking forward to see where you guys go with this!

@brandur
Copy link
Member

brandur commented Jan 7, 2016

And sorry about the delay in reply here! I let my inbox runaway on me over the holidays :/

@bjeanes
Copy link

bjeanes commented Jan 8, 2016

There are of course that some possible downsides like the ranges looking a tad unwieldy and possibly a more difficult implementation

Potentially, but it should be pretty opaque. Since it's a field > value, you could still do Accept-Range: created_at ]2015-12-22T1500Z (i.e. drop the -002) for manually crafted requests. After that, you're likely piping through whatever you get back in Next-Range or whatever that header is.

@brandur
Copy link
Member

brandur commented Jan 8, 2016

Since it's a field > value, you could still do Accept-Range: created_at ]2015-12-22T1500Z (i.e. drop the -002) for manually crafted requests. After that, you're likely piping through whatever you get back in Next-Range or whatever that header is.

+1. That would be a nice feature for convenience.

@gudmundur
Copy link
Member Author

For the case of ranging over unique values, I think our approach is fairly nailed down.

I've mulled non-unique values for a while now, then finally spent time on Googling this and then stumbled on this http://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way. It seems that we can use row value expressions in Postgres to solve this.

In conjunction with a non-unique field, we mix in a unique value, uuid for example. That would give us a header in the form of something like Next-Range: created_at ]2016-02-09T0730Z/facf751c-860e-4004-a206-6a0899b89597... When we detect a Range header with that format, we construct a query

SELECT
  *
FROM
  model
WHERE 
  ROW(created_at, id) <  ROW('2016-02-09T0730Z', 'facf751c-860e-4004-a206-6a0899b89597')
ORDER BY
  created_at, id
LIMIT 
  200;

Using this requires that we have a composite index on (created_at, id), and I'm assuming that if things are keyed off a user, for example, it'll have to be (user_id, created_at, id). According to the slide deck, this should perform like the wind.

@brandur @bjeanes What do you guys think?

@bjeanes
Copy link

bjeanes commented Feb 9, 2016

@gudmundur oh my god that is so elegant. I love it!

I'm assuming that if things are keyed off a user, for example, it'll have to be (user_id, created_at, id)

I'm thinking that in most cases your access isn't based on ownership so much as other access mechanisms (permissions, sharing, etc) which means we lose out on that extra tuple value. Nonetheless, that should do the trick in some cases. Really nice find @gudmundur!

attn @geemus too

@bjeanes
Copy link

bjeanes commented Feb 9, 2016

(Obviously, I say that without having tested this yet, but in theory it seems solid)

@bjeanes
Copy link

bjeanes commented Feb 9, 2016

It looks like the ROW keyword is optional for these compound cases anyway, so simply doing the filter as: (field1, field2) < ('val1', 'val2') gets us the same neat effect, I think!

Edit: which, incidentally, seems to be what slide 28 of the presentation you linked is doing

@brandur
Copy link
Member

brandur commented Feb 9, 2016

Thanks for the write-up Gudmundur! This looks good to me as well!

On Tuesday, February 9, 2016, Bo Jeanes notifications@github.com wrote:

It looks like the ROW keyword is optional for these compound cases
anyway, so simply doing the filter as: (field1, field2) < ('val1', 'val2')
gets us the same neat effect, I think!


Reply to this email directly or view it on GitHub
#234 (comment).

@geemus
Copy link
Member

geemus commented Feb 18, 2016

Sounds great, thanks for researching.

My only pause might be around the separator. I worry that forward slash might be part of a legit value (though it isn't in either of those) and/or if there is something clearer. Comma seems nice in some ways, for instance, though I'm not sure it helps at all wrt avoiding conflicts.

@ransombriggs
Copy link

@gudmundur @geemus what is the current status / blockers for this? I would like to be able to use this in my pliny app and if there is anything I can do to help, please let me know.

@gudmundur
Copy link
Member Author

@ransombriggs Implementing this. What specific part do you need? Pagination over uniques or over non-uniques? We'll be needing this on DevEx soon, so I'll allocate time to do this in the next couple of days.

@ransombriggs
Copy link

@gudmundur over uniques, I am just working on a pliny app and did not want to duplicate work if I could avoid it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants