
async calls #105

Open
merlinm opened this issue Aug 22, 2019 · 14 comments

Comments

@merlinm

merlinm commented Aug 22, 2019

Hi. Right now I'm working on an entirely stored-procedure-based coordination framework called pgtask. Think of it as a centralized cron for controlling database task execution in complex environments; I believe there is a technical niche there. Anyway, I'm interested in being able to tap web services in addition to targeting remote databases. For this to work in my architecture, I'd need some kind of asynchronous API, something along the lines of dblink_send_query for asynchronous querying (which I use to send queries, since the coordinator is single-threaded).

An ideal API would be able to do something like the following (a usage sketch follows the list):
*) http_send(name, ...): send a named HTTP request
*) http_request_complete(name): is the HTTP request resolved?
*) http_get_request(name, timeout, ...): get a response, blocking until the timeout if it isn't done
*) http_wait_for_request(timeout): wait for the first outstanding request to complete, or until the timeout (an epoll-like interface to avoid polling/looping in SQL); dblink really needs this
*) http_cancel_request(name, ...): cancel an outstanding request
*) http_list_requests(): get a list of outstanding HTTP requests
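
Here's roughly how that could look from SQL; every function and signature below is hypothetical:

-- Hypothetical async API; none of these functions exist in pgsql-http today.
SELECT http_send('req_a', 'GET', 'http://example.com/a');
SELECT http_send('req_b', 'GET', 'http://example.com/b');

-- Block until the first outstanding request finishes or 5000 ms pass;
-- say it returns the name of the completed request (NULL on timeout).
SELECT http_wait_for_request(5000);

-- Fetch the finished response without blocking (timeout of zero).
SELECT status, content FROM http_get_request('req_a', 0);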

Food for thought. I don't know if this API (especially the wait_for_request call) is technically feasible, but it would make this a much more usable extension, since database queries essentially run in a single-threaded environment. Anyway, thanks for your work!

@pramsey
Owner

pramsey commented Aug 22, 2019

MM, yep, that's been a gleam in my eye for a long while, and thanks for the API sketch! There's some facility for this in libcurl, in terms of an async API, but I haven't fully figured out how to tie it into the PgSQL scheme nicely, since once you start up an async request, the whole who-owns-it question comes up...

I was thinking about an API based on callbacks rather than a queue, so your idea bears some thought. Something for me to ponder on my ride tomorrow...

@merlinm
Author

merlinm commented Aug 22, 2019

Great -- thanks for the response. The challenge with callbacks is that there is no way (that I know of) to handle callbacks at the SQL level. The idea here is to implement a wrapper around something like epoll_wait() (http_wait_for_request) with a timeout. Upon receiving an interesting event, control would be returned to the SQL level with enough information that you wouldn't have to loop over the outstanding connections to determine which one is ready to return data.
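
Sketching that pattern in PL/pgSQL, again with entirely hypothetical functions:

-- Hypothetical: drain responses as they complete, without polling
-- each request individually.
DO $$
DECLARE
  finished text;
BEGIN
  PERFORM http_send('a', 'GET', 'http://example.com/a');
  PERFORM http_send('b', 'GET', 'http://example.com/b');
  LOOP
    finished := http_wait_for_request(1000);  -- NULL on timeout
    EXIT WHEN finished IS NULL;
    RAISE NOTICE 'request % completed', finished;
  END LOOP;
END $$;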

@pramsey
Owner

pramsey commented Aug 22, 2019

Which I guess would be OK, but users would have to two-step all their HTTP work... Since you are building a scheduler, that makes sense for you :) but other folks might find it a little fiddly to make use of, no?

@merlinm
Author

merlinm commented Aug 23, 2019

Well, it's useful in any single-threaded application. If you look at asynchronous stacks like Node, you can see that most I/O works this way. This type of work is not super common in database coding, but since you made this project I think we can both agree that this is an underutilized frontier of development.

In general, robust programming of I/O requests outside the system requires either threads or asynchronous behavior. Threads are right out in stored procedures, so that leaves async. Synchronous requests are easier to code and many people prefer them, but they are dangerous; you're blocking the thread, and you have no way to deal with the situation other than a crude timeout. So even in scenarios where there would only be one outstanding query at a time, you'd still be better off using an asynchronous call if you wanted precise error handling.

The perfect analogy here is dblink; I use it a lot to transfer data between databases. If the operation needs to meet certain guarantees in terms of robustness, it usually ends up being run over the asynchronous calls (mainly so I can exploit cancellation, but there are other reasons).

dblink really ought to have a timeout on the is_busy call and something like dblink_wait_for_query to push the polling out of the SQL level, but the API is good enough to work. The synchronous model you've built is fine and will certainly be good enough for most users, but really strong I/O libraries will almost always have some kind of mechanism so the calling code can maintain control of the execution state.
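
For reference, the asynchronous dblink pattern described above (connection string and query are illustrative):

-- Kick off a query on a remote connection without blocking.
SELECT dblink_connect('conn', 'dbname=otherdb');
SELECT dblink_send_query('conn', 'SELECT 42');

-- Poll from SQL: 1 = still running, 0 = done. This polling step is
-- exactly what a wait-with-timeout call would eliminate.
SELECT dblink_is_busy('conn');

-- Block and collect the result; dblink_cancel_query('conn') could be
-- used instead to abandon the query.
SELECT * FROM dblink_get_result('conn') AS t(answer int);
SELECT dblink_disconnect('conn');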

@merlinm
Author

merlinm commented Aug 23, 2019

Reading up on the libcurl API, the library supports async requests without threading (which would be required for this approach to be technically feasible). Here's an example I found:
https://curl.haxx.se/libcurl/c/multi-app.html

In the API sketch above, if you were to do this, http_wait_for_request would probe into the 'select' and continue to load up the result until the SQL-level timeout was hit or the still_running flag was cleared, indicating the request was complete. 'select' is of course polling underneath the hood, so we would not want to scale to a huge number of concurrent requests, but this ought to be much more efficient than polling at the SQL level. Thanks for listening; curious about your thoughts.

@slavanap

slavanap commented Jun 25, 2020

Just wanted to ask why the http_get SQL function does not use the PARALLEL keyword in its definition, but found this.

@merlinm
Author

merlinm commented Jun 25, 2020

Yeah, PARALLEL != async. PARALLEL means the planner can spawn multiple similar operations within a query, in situations where it thinks that's a good idea. Async means you kick the operation off in the background and do other things while the operation is in progress. Workarounds today would be to use dblink or pg_background.

@deem0n

deem0n commented Jul 7, 2020

Hi all,
I just want to share my experience with PARALLEL execution. You can try to do

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
explain select * from http(...) UNION ALL select * from http(...);

It should show something like (look for Parallel Append)

 Gather  (cost=0.25..26.96 rows=4000 width=104)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.25..26.96 rows=1292 width=104)
         ->  Function Scan on mltv  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_1  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_2  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_3  (cost=0.25..10.25 rows=1000 width=104)

But when http() is called from a parallel plan, you might get an error:

DEBUG:  pgsql-http: queried 'http://127.0.0.1:8200/my/url/path'
DEBUG:  pgsql-http: http_return '56'
ERROR:  Failure when receiving data from the peer

It works fine in 10% of cases and I see this error in 90% of cases.

Is it possible to make http PARALLEL SAFE?

@pramsey
Owner

pramsey commented Jul 7, 2020

Two confusing things. None of the functions are marked PARALLEL SAFE, so I wonder why you get a parallel plan at all. And the http calls happen in distinct worker processes, so I wonder why the calls fail; it's not like they are running in the same memory space on different threads or something. It should be no different from firing off multiple copies of curl at once.
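
For anyone experimenting, the parallel markings can be inspected and changed from SQL; the http_get signature may vary by version, and flipping a function to SAFE this way is untested here:

-- Inspect how the extension's functions are marked:
-- 'u' = UNSAFE (the default), 'r' = RESTRICTED, 's' = SAFE.
SELECT proname, proparallel
FROM pg_proc
WHERE proname LIKE 'http%';

-- Experimentally mark one signature as safe (at your own risk).
ALTER FUNCTION http_get(varchar) PARALLEL SAFE;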

@deem0n

deem0n commented Jul 8, 2020

Hi, I am trying to reproduce my prod env and isolate the problem. Please check the gist on how to make parallel HTTP requests in PostgreSQL; it seems to work fine for GET requests! https://gist.github.com/deem0n/37f63978773d12a8de61289900c5cc03

Basically, I found a glitch when:

  1. we make POST requests,
  2. curl is called from a parallel plan by the PostgreSQL engine, and
  3. curl gets a 500 status and error 56 (when normally we should be getting 200 OK).

Instead of returning a normal response with status 500, it produces an exception:

DEBUG:  pgsql-http: http_return '56'
ERROR:  Failure when receiving data from the peer

And finally, if we disable the parallel plan, then we get 200 OK responses.

@pramsey
Owner

pramsey commented Oct 27, 2020

So, thank you for that gist. I tried it out, and it worked fine... parallel plan, and the 7-wide parallel query ran and returned 7 results nice and fast. Could there be something else going on in your system?

@deem0n

deem0n commented Oct 28, 2020

Thanks for verification!

Actually, it could be something with the network settings; it is very hard to debug. I'll assume that pgsql-http works fine, and if not I will post a detailed report on why.

@pramsey
Owner

pramsey commented Apr 15, 2022

For people interested in this topic, it's possible to get async HTTP queries by combining this extension with the pg_background extension.
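
A minimal sketch of the combination (assumes pg_background is installed; the URL and result column are illustrative):

-- Launch the HTTP call in a background worker; returns the worker's pid.
SELECT pg_background_launch(
  $$SELECT status FROM http_get('http://example.com')$$
);

-- ... do other work while the request runs ...

-- Collect the result (blocks until the worker finishes); replace 12345
-- with the pid returned by the launch call.
SELECT * FROM pg_background_result(12345) AS t(status integer);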

@riderx

riderx commented Jul 17, 2022

@pramsey do we have an example somewhere? I struggle to understand how to mix them.
