Clarify input/output field modifiers of Part 3 collections #426
Comments
I fully agree this needs (a lot) more examples. Adding to the complexity, in scenarios where the APIs involved support those parameters at both the input or output level, the extra processing could be performed at either the "output" or at the "input" level.
We should review that, currently it's defined directly within the top-level "process" object, or a nested "process" or "collection" object. We need to work out examples with processes returning multiple outputs and how that would work. We had already discussed how that works generally with Part 3 - nested processes in one or more of the existing issues... In the coastal erosion workflow example, the use of The "properties" at the same level as the "Slope" and "Aspect" processes could be considered as either "Input" or "Output" field modifiers.
The When they are at the top-level process, they are necessarily "output field modifiers". When they are within inputs to another process, they can be supported in the workflow either as input and/or as output field modifiers. For a Collection, they would always be considered "Input Field Modifiers". The remote API supports resolving them (e.g., Otherwise, the Processes - Part 3 implementation for the local process invoking those remote collections / processes can perform the extra processing after retrieving the collection or process output data and before passing it to that local process -- here we could think of it either as modifying the output of the nested process, or modifying the input to the parent process. The general idea is that the extra processing can be performed at the end that supports it and is most optimal. |
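To make the placement rules above a bit more concrete, here is a minimal Python sketch that walks an execution request and labels each modifier by where it appears. It is only an illustration: the function name `classify_modifiers`, the role labels, and the input names "data" and "source" are hypothetical, and it assumes the compact form where "filter", "properties" and "sortBy" sit next to "process" or "collection".

```python
MODIFIER_KEYS = ("filter", "properties", "sortBy")

def classify_modifiers(node, path="$", top_level=True):
    """Return (path, key, role) tuples for every field modifier found."""
    found = []
    if isinstance(node, list):
        for i, item in enumerate(node):
            found += classify_modifiers(item, f"{path}[{i}]", top_level)
        return found
    if not isinstance(node, dict):
        return found
    if "collection" in node:
        role = "input"  # modifiers on a collection are always input modifiers
    elif "process" in node:
        # top-level process: necessarily output; nested process: either end
        role = "output" if top_level else "input or output"
    else:
        role = None
    if role is not None:
        for key in MODIFIER_KEYS:
            if key in node:
                found.append((path, key, role))
        # Only descend into the inputs of a process, never into filter bodies.
        found += classify_modifiers(node.get("inputs", {}), f"{path}.inputs", False)
    else:
        for name, value in node.items():
            found += classify_modifiers(value, f"{path}.{name}", top_level)
    return found

workflow = {
    "process": "https://server.com/processes/MainProcess",
    "inputs": {
        "data": {  # hypothetical input name
            "process": "https://server.com/processes/NestedProcess",
            "inputs": {
                "source": {  # hypothetical input name
                    "collection": "https://server.com/collections/TheCollection",
                    "filter": {"op": "lte", "args": [{"property": "eo:cloud_cover"}, 10]},
                }
            },
            "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))",
        }
    },
    "sortBy": "eo:cloud_cover",
}

for entry in classify_modifiers(workflow):
    print(entry)
```

The "input or output" label is deliberate: for a nested process the sketch does not pick a side, since the extra processing can happen at whichever end supports it.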
I think that might be where most confusion comes from. To my understanding (correct me if I'm wrong), the following example is what would be expected with Part 3 to filter the Collection Input by cloud cover, filter the Collection Output of
{
  "process": "https://server.com/processes/MainProcess",
  "inputs": {
    "process": "https://server.com/processes/NestedProcess",
    "inputs": {
      "collection": "https://server.com/collections/TheCollection",
      "filter": {
        "op": "lte",
        "args": [{"property": "eo:cloud_cover"}, 10]
      }
    },
    "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))"
  },
  "sortBy": "eo:cloud_cover"
}
It somewhat makes sense for processes that have only 1 output, since we can infer that the
However, I think embedding the filters in
The above workflow would become the following with the verbose form, where
{
  "process": "https://server.com/processes/MainProcess",
  "inputs": {
    "process": "https://server.com/processes/NestedProcess",
    "inputs": {
      "collection": "https://server.com/collections/TheCollection",
      "filter": {
        "op": "lte",
        "args": [{"property": "eo:cloud_cover"}, 10]
      }
    },
    "outputs": {
      "output": {
        "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))",
        "response": "collection"
      }
    }
  },
  "outputs": {
    "result": {
      "sortBy": "eo:cloud_cover",
      "response": "collection"
    }
  }
}
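The relationship between the two forms above can be sketched as a small Python transformation. This is only an illustration under stated assumptions: each process has a single output whose name is known in advance (the names "output" and "result" come from the example, while the `output_names` mapping and the `to_verbose` function are hypothetical), and "response": "collection" is wanted on every hop. Neither assumption comes from the Part 3 draft itself.

```python
import copy

MODIFIER_KEYS = ("filter", "properties", "sortBy")

def to_verbose(node, output_names):
    """Move modifiers found next to "process" under an explicit "outputs" entry."""
    if isinstance(node, list):
        return [to_verbose(item, output_names) for item in node]
    if not isinstance(node, dict):
        return node
    if "process" in node:
        result, modifiers = {}, {}
        for key, value in node.items():
            if key in MODIFIER_KEYS:
                modifiers[key] = copy.deepcopy(value)
            elif key == "inputs":
                result["inputs"] = to_verbose(value, output_names)
            else:
                result[key] = copy.deepcopy(value)
        if modifiers:
            # Assumption: the single output name is looked up per process URL.
            name = output_names[node["process"]]
            modifiers["response"] = "collection"
            result.setdefault("outputs", {})[name] = modifiers
        return result
    if "collection" in node:
        # Modifiers on a collection stay where they are (input modifiers).
        return copy.deepcopy(node)
    return {key: to_verbose(value, output_names) for key, value in node.items()}

compact = {
    "process": "https://server.com/processes/MainProcess",
    "inputs": {
        "process": "https://server.com/processes/NestedProcess",
        "inputs": {
            "collection": "https://server.com/collections/TheCollection",
            "filter": {"op": "lte", "args": [{"property": "eo:cloud_cover"}, 10]},
        },
        "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))",
    },
    "sortBy": "eo:cloud_cover",
}
output_names = {
    "https://server.com/processes/MainProcess": "result",
    "https://server.com/processes/NestedProcess": "output",
}
verbose = to_verbose(compact, output_names)
```

With more than one output per process the lookup would be ambiguous, which is the case where the verbose form seems necessary rather than optional.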
The example looks good. We should clarify requirements regarding CQL2-JSON vs. CQL2-Text since you're using both there -- I would assume that both should be allowed if the server declares support for both.
This is really intentional, allowing the implementation of MainProcess to decide how to invoke the nested process (using Part 1: Core, or Part 3: Workflows collection output if it's available, and using Tiles, DGGS, Coverages as supported by the client/server on each end...). While I understand that from a "reproducibility" perspective the more explicit the workflow is, the more likely the results are going to be identical, from a "reusability" perspective the simpler the workflow is expressed, the more likely it is to be re-usable on another set of deployments which may support a slightly different set of API / requirement classes, without requiring any modification. And if everything is done right (important if), the results should still be largely reproducible (within a small, very acceptable threshold). It's also about the endpoints along the workflow knowing better than the client building the workflow what will be optimal, and it allows the immediate server that the client submits the workflow to to act as an orchestrator and/or optimize the workflow. |
Indeed. A
I'm not sure I like this idea (for that specific case). You nailed exactly my concern. From a reproducibility perspective, we have basically no idea what is going to happen. It could even be an issue if the referenced process lives in a remote OAP that does not support Part 3. Execution would "work" on the end of that remote process execution, but the received "output" would not be a collection, causing the |
I don't think that's absolutely necessary, unless we want to consider alternative languages, since it could easily be distinguished by being a JSON object vs. a JSON string.
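As a quick illustration of that point, here is a minimal Python sketch that tells the two encodings apart purely by JSON type. The function name `cql2_encoding` is hypothetical, and the returned labels are just illustrative strings.

```python
def cql2_encoding(filter_value):
    """Guess the CQL2 encoding of a "filter" value from its JSON type."""
    if isinstance(filter_value, dict):
        return "cql2-json"  # a JSON object is taken to be CQL2-JSON
    if isinstance(filter_value, str):
        return "cql2-text"  # a JSON string is taken to be CQL2-Text
    raise TypeError("filter must be a JSON object or a JSON string")

print(cql2_encoding({"op": "lte", "args": [{"property": "eo:cloud_cover"}, 10]}))
print(cql2_encoding("T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))"))
```

This only works while CQL2 is the sole filter language; an alternative text-based language would need an explicit indicator.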
That is actually already considered in the specification and is exactly what the "Remote Core Processes" requirement class is all about. The Part 3 implementation supporting "Collection Output" is also able to make the overall output work like a collection, even though the Part 1 process part of the workflow does not support that, as long as the process has a way to somehow inject the bounding box (and ideally resolution and time as well) filters for the Part 1 process. This is the functionality which we currently partially support with our MOAWAdapter, though in theory this MOAWAdapter process should not be necessary and it should "just work". It is tricky mostly because there is no clear concept of identifying parameters for bounding box / time / resolution of interest in Part 1, except for the special "bbox" input type which could be assumed to serve that purpose. Even if the process has no "bbox", the server could still present it as a collection by processing the whole thing first, but this would not scale with a large dataset, and registering that workflow would either fail or need to wait until the whole thing is processed to know whether the processing succeeds or not, and to obtain the extent information for the output of that Part 1 process, etc. For implementations using nested remote processes that already support "Collection Output", it is not necessary that the Part 3 implementation supports "Remote Core Processes" if it supports "Remote Collection Input" instead. That is because it can treat the remote process just like a remote collection, with the exception of submitting the partial ad-hoc workflow to initially activate that remote virtual collection. This is the ideal way to chain these remote collections, which is also great for caching in combination with OGC API - Tiles or OGC API - DGGS.
The key thing with ad-hoc workflows is that when the client initially submits the ad-hoc workflow (and in all hops within the workflow there is a client and a server, so it is a recursive process), the workflow is immediately validated based on the available capabilities, and a validation result is returned indicating whether the workflow will succeed or not. If there is a missing capability, e.g., no support for "Remote Core Process" on the client side and no "Collection Output" on the server side of a particular hop, then that workflow might fail. However, an interesting thing is that if any hop higher up in the workflow chain has a little bit of orchestration functionality and itself does support "Remote Core Process", it could detect such a mismatch ahead of time, by querying the conformance declaration of the services involved deeper in the chain, and could re-organize the workflow to itself act as a client for that Part 1 process execution, and submit the input to the parent process either through a virtual input collection, or by executing the parent process in the regular sync/async way with an "href" or embedded "value" within an execution request. So that orchestrator process up above would save the day and the workflow could still validate successfully.
I really think of this kind of flexibility, which again you might well point out as introducing reproducibility issues, as a feature rather than a bug! :) I strongly believe that the kind of increased interoperability and re-usability that will emerge out of this vastly outweighs the reproducibility concerns, which I think can easily be addressed on a case-by-case basis to ensure that regardless of which path is taken, the results are within a very small margin of difference, if not identical.
An example of reproducibility difference is the use of data tiles or DGGS zone data queries (and the use of a particular 2D Tile Matrix Set or Discrete Global Grid Reference System). This involves a particular way to partition and up/down sample data, so some small differences are to be expected. But both of these approaches bring significant performance/caching advantages which are well worth these issues, and if the data is sampled right, always ensuring that the up/down sampling does not significantly degrade the correctness of the data, the final results should be well within the acceptable margins compared to using e.g., a Coverages subsetting request for the entire area. Being able to use almost identical workflows with different combinations of deployments and servers, even if they support different OGC API data access mechanisms, 2DTMS, DGGRS, encoding formats etc., will actually help to validate and compare outputs of the same workflow with more implementations, datasets, AoIs, etc., which I believe in the end will actually help reproducibility. |
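The per-hop capability check described in this comment could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the capability labels are hypothetical stand-ins for the requirement classes named in the discussion (not real conformance URIs), and `plan_hop` and its return values are invented names.

```python
# Hypothetical labels for the requirement classes named in the discussion.
REMOTE_CORE = "remote-core-processes"
REMOTE_COLLECTION_INPUT = "remote-collection-input"
COLLECTION_OUTPUT = "collection-output"

def plan_hop(client_caps, server_caps, upstream_caps=()):
    """Pick how one client/server hop of a workflow could be satisfied.

    client_caps / server_caps: sets of labels declared by each end of the hop.
    upstream_caps: one set of labels per server higher up in the chain.
    """
    # Treat the remote process like a remote (virtual) collection.
    if COLLECTION_OUTPUT in server_caps and REMOTE_COLLECTION_INPUT in client_caps:
        return "collection"
    # The client end can run the Part 1 process itself (sync/async).
    if REMOTE_CORE in client_caps:
        return "core-execution"
    # A server higher up could step in and act as the Part 1 client.
    if any(REMOTE_CORE in caps for caps in upstream_caps):
        return "orchestrated"
    return "fail"

print(plan_hop({REMOTE_COLLECTION_INPUT}, {COLLECTION_OUTPUT}))
print(plan_hop(set(), set(), upstream_caps=[{REMOTE_CORE}]))
```

The ordering of the checks (collection first, then core execution, then orchestration) is a choice made for the sketch; an implementation is free to prefer whichever path it considers optimal.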
Indeed. This is exactly what I'm doing ;) (https://github.com/crim-ca/weaver/pull/685/files#diff-d25d3121a794cd4fb10b0d700f8df011035c957d4a19ef79d051b3c70bdefbc3R1501-R1502)
To my understanding, Remote Core Processes only indicates that the Nested I can see that adding Another issue I just realized is that Part 3 adds
I think this is a strong assumption that it can be accomplished. Because there is a chance of ambiguity (and for which the API could refuse to execute "just in case"), Part 3 must allow parameters such as |
That is correct.
That is also correct.
Why do you arrive at that conclusion? After invoking the remote Part 1 process, the Part 3 implementation supporting Input/Output field modifiers can perform the additional filtering/deriving/sorting operations itself.
The idea was to not specify the execution mode (collection output, sync, async) in the execution request, for the same reason that a Part 1 execution request can be executed sync or async with the I believe "response" (raw/document) is gone from the execution body in 1.1/2.0.
When a Part 3 / Collection Output implementation handles an initial workflow registration, it needs to validate that the nested processes will work as expected, and this is largely why the validation capability would be very helpful for this purpose. In our MOAWAdapter implementation, what we do at the moment is submit a small portion of the BBOX to do a quick processing test to know whether the execution will succeed and what it will return before we successfully return a collection with a level of confidence that things will work. But in general, it is the Part 3 Collection Output implementation that creates a collection. Regardless of what is returned by the processes underneath, it presents the final output as a collection. |
You're right. It simply requires the server to return "something" that is filterable. I can see an issue however regarding that situation. If a server wants to support
I think this is mixing things here. The execution mode sync/async is irrelevant IMO. The server interrogates the remote process however needed, and obtains the value directly or monitors the job and retrieves it after. This is not important for chaining the steps. However, the response structure (ie I must say I find it extremely ironic that we went through all this issue about I also strongly believe that it is not enough to simply pass
I'm not sure how to explain this differently, but I don't think it is always possible once a certain number of workflow steps is reached, especially if some steps imply data manipulations such as merging bands, aggregating items or conditional logic based on inputs, which can change the outcome of how the workflow should execute at runtime. If those steps are not very limited in the workflow, you simply have no idea what the actual result will look like much later on until it is executed, because they depend on what will be produced on the previous steps. Therefore, you cannot "validate" the workflow as a whole. You can end up in situations where two processes that would seemingly be chain-able with a matching subset of media-type/format I/O do not work anymore once reaching their actual step execution because I/Os to chain were modified by previous logic conditions. Even when explicitly indicating the desired |
Correct, and feature collections and coverages are both filterable and derivable (properties of features correspond to the range fields in coverages, where individual cells can be filtered out).
As it stands, support for input / output field modifiers is always (not only for Collection Output -- it is an orthogonal capability), regardless of how the input was received. If the remote process or collection has some matching "filter" or "properties" parameter to do the work, the implementation is encouraged to use it as it may speed things up by transferring less data overall, but there is no expectation that it needs to do so. It can be thought of as an optimization compared to always doing the work itself, as if executing a remote process or fetching a file from an href the Part 1 way.
For me "collection output" is a third execution mode just like sync and async, and is also irrelevant (except it does make a lot of things easier when using it, such as filling in the AoI/ToI/RoI on-demand).
Why does that matter? As long as you have a clear way how to retrieve the output once things are ready...
For me this goes against the design. There is no reason why the clients/servers along each hop couldn't negotiate among themselves the best way to do things. The principle here (which you might disagree with) is that the implementations know best -- not the user. All the user should be doing is expressing their workflow in the simplest and most natural way possible, and leaving it up to the implementations to figure out the best way to do it (at each hop).
I am really not following how minimal vs. representation (raw vs. document) matters at all here. These things conceptually do not modify at all the information being returned. They only change how a Processes client goes about retrieving the information. If the client gets a link back, it has to perform an extra GET operation to get to the actual data. This really has no impact at all on
For 2+ outputs we have Regarding using Returning a single landing page
The Part 3 workflows are really rooted in the concept of emergence:
where I really strongly believe that we will be able to achieve very powerful things with it, as long as we keep to the concept of each individual part having a well defined interface which are assembled together in a simplistic way. The OGC APIs provide the connectivity between distributed data and processes, and we will have as a result a system-of-system where all deployed OGC API implementations can work together as a single global distributed geospatial computing powerhouse. The goal is instant integration, analytics and visualization of data and processes available from anywhere. Each process has a process description, including the supported output formats, and each deployment declares its conformance to the different supported OGC APIs and requirement classes, so it should be possible for each hop for the implementation acting as a client to know exactly what it can get back from the server it will invoke. There is always a "simple" way to execute a Part 3 workflow, where you simply pass along the nested process object to the remote process, and only do your part. Alternatively, a server higher up in the chain could decide to take on an orchestration role and re-organize things to improve efficiency. In any case, validating should be possible:
The use of GeoDataClass would also greatly help in terms of knowing exactly what to expect in the response (a GeoDataClass implies a particular schema i.e., which fields you will get back that you will be able to filter or derive from, including the semantic definition information), which are things that the process descriptions might not otherwise cover in terms of which bands you will find in a GeoTIFF etc. |
The issue with this is that using Collection Inputs/Outputs without
For this I 100% agree, it is very easy to "handoff" the work to the remote location, but here we were referring to the case where the remote process or collection does not support it, and therefore it must be handled by the local process instead. Because that case is a possibility, it places a bigger implementation burden on the local process, since it must try to handle all combinations. And no server is ever going to handle all combinations.
That doesn't make much sense to me. Since the process can run either in
If a JSON representation If the workflow was executed using
The thing is that not all processes return links that make sense as a "Collection".
{
  "outputs": {
    "result": {"response": "collection", "filter": "<points I care about>"},
    "report": {"transmissionMode": "reference"}
  }
}
And now, I can get both HTTP 303
Maybe, but in the end, I just want the process to succeed execution 😅 I reiterate, parameters like
It is a noble goal, but I highly doubt that is a fact. |
There is really only one combination, which is the ability to filter (or derive) data (of all the data types supported by the processes). There could potentially be some exceptions allowing to return a not implemented HTTP code for specific cases, if you only support filtering on Feature collections and gridded Coverages, but not point clouds for example. It also means to support this for all input/output formats that are supported by the engine, but if some format conversion engine is in place, that should not be a big burden. The ability to pass along a "filter=" or "properties=" to a remote end-point to handle the filtering is there if the remote server supports it, but not all Processes implementation will support input/output field modifiers, and not all Features and Coverages implementations will support it. Therefore implementations will need to support doing the filtering on their own anyways, so that it can be applied to local processes and collections, and to remote collections and processes that do not implement this filtering / deriving. So this local modifiers capability is necessary anyways, and can be used for any fallback scenario where it can't be done on the remote side.
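A minimal Python sketch of that fallback, assuming an in-memory stand-in for the remote end: the `Source` class, its `supports_filter` flag and the `fetch_with_filter` helper are all hypothetical, and the filter is modelled as a plain Python predicate rather than a CQL2 expression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Feature = Dict[str, object]
Predicate = Callable[[Feature], bool]

@dataclass
class Source:
    """In-memory stand-in for a remote collection or process output."""
    features: List[Feature]
    supports_filter: bool

    def fetch(self, predicate: Optional[Predicate] = None) -> List[Feature]:
        if predicate is None:
            return list(self.features)
        if not self.supports_filter:
            raise ValueError("remote end does not support filtering")
        return [f for f in self.features if predicate(f)]

def fetch_with_filter(source: Source, predicate: Predicate) -> List[Feature]:
    """Delegate the filter when the remote end supports it, else filter locally."""
    if source.supports_filter:
        return source.fetch(predicate)  # less data transferred
    return [f for f in source.fetch() if predicate(f)]  # local fallback

scenes = [{"id": "a", "eo:cloud_cover": 5}, {"id": "b", "eo:cloud_cover": 40}]

def low_cloud(feature: Feature) -> bool:
    return feature["eo:cloud_cover"] <= 10

remote_capable = Source(scenes, supports_filter=True)
remote_plain = Source(scenes, supports_filter=False)
print(fetch_with_filter(remote_capable, low_cloud))
print(fetch_with_filter(remote_plain, low_cloud))
```

Both calls return the same features; only the place where the filtering happens differs, which is the point being made about the local capability being needed anyway.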
The most important distinction with "Collection Output" is that it can support "on-demand" processing of a particular Area/Time/Resolution of interest. So in the majority of cases where I expect this to be used, where localized processing for an ATRoI is possible, it would not be the same as a sync/async execution that produces the whole collection.
An "href" to some data automatically implies a specific area / time / resolution of interest. The Collection concept leaves that open: "here's the input I want to use", implying to use the relevant parts based on what is currently being processed.
I'm confused about what you're saying here. As we said earlier, the processing mechanics is aware of these distinctions and knows at which point it has the actual data. The filtering on the features is always applied to the actual data, whether it's in GeoJSON, Shapefile, GeoPackage... It would of course never be applied on the JSON "document" (results.yaml) response which is just links to the process results. Is that what you are concerned about?
If it makes sense to execute this process in a localized manner (which really is what makes for the perfect scenario for collection output), we should consider that the actual processing would be done several times for different ATRoIs. I imagine the issue here is that it is a "summary" report only for that particular subset? Would that really be useful? If it contained per-point information, then this information could be embedded as properties of the point features. If useful, summary information could also be added as additional metadata in the feature collection subsets, but that would not fit so well across different formats. The report would be different for every If this is really not a localized process, and the purpose of the collection output is not so much the on-demand processing, but just the convenience of having an OGC API collection supporting OGC API - Features as a result, then this is slightly different than the main use case for "Collection Output". We have a similar scenario with our OSMERE routing engine, where the route is not an on-demand thing but must be fully calculated before we can return the resulting feature collection for the calculated route. If we still want to return this summary report and execute the process using Collection Output, one solution might be to provide the summary report as metadata linked from the collection description, rather than as a separate output.
I understand that this may be the case, and I admit that I might be overly optimistic and idealistic.
When you request a collection output from a process, it validates the execution request, and if everything is good it also submits the immediate sub-workflow to any external process, which will return it a collection description if it itself validates, and sees if there is a match in terms of OGC API data access mechanisms and supported formats, and make sure that the returned collection descriptions are a match for the inputs they are used for in terms of GeoDataClasses / schemas of the content. This all happens before any actual processing is done (at least for processes which can be localized to an ATRoi). If no match is found anywhere within the workflow, the validation of the problematic hop(s) will immediately fail, causing the validation of the whole workflow to fail. If the workflow fails given all the flexibility of using any supported formats and OGC API data access mechanisms to satisfy the request, there is nothing the user could do to make it work. The main reason this would fail is because the servers are not compliant, or the user picked incompatible processes and/or data sources. When using a workflow editor aware of part 3 and certified implementations, this should be a very rare occurrence due to resources temporarily unavailable or the occasional bug to be filed and fixed. Of course this is still quite theoretical at this point given the limited experimentation done with Part 3 so far, but I hope we can prove this year that it can, in fact, work like a charm at least most of the time! ;) Thanks a lot for the deep dive into Part 3 and helping validate all this! |
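The "validate every hop before processing anything" idea can be sketched in Python as a recursive walk over the workflow. This is a rough illustration only: `failing_hops` is a hypothetical helper, and the `hop_ok` callback stands in for whatever real check an implementation would do at each hop (conformance classes, access mechanisms, formats, GeoDataClass match). The input names "data" and "source" are also made up.

```python
def failing_hops(node, hop_ok):
    """Return the URLs of nested processes/collections that fail validation.

    An empty list means every hop validated, so the whole workflow validates.
    """
    failures = []
    if isinstance(node, list):
        for item in node:
            failures += failing_hops(item, hop_ok)
    elif isinstance(node, dict):
        target = node.get("process") or node.get("collection")
        if isinstance(target, str):
            if not hop_ok(target):
                failures.append(target)
            # Only the inputs of a process can contain further hops.
            failures += failing_hops(node.get("inputs", {}), hop_ok)
        else:
            for value in node.values():
                failures += failing_hops(value, hop_ok)
    return failures

workflow = {
    "process": "https://server.com/processes/MainProcess",
    "inputs": {
        "data": {  # hypothetical input name
            "process": "https://server.com/processes/NestedProcess",
            "inputs": {
                "source": {"collection": "https://server.com/collections/TheCollection"}
            },
        }
    },
}

unavailable = {"https://server.com/collections/TheCollection"}
print(failing_hops(workflow, lambda url: url not in unavailable))
```

In a real deployment each hop would validate its own sub-workflow remotely rather than one process walking the whole tree, so this flattening is a simplification.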
There are different ways to execute each process, and what would be the most appropriate/efficient way to represent their outputs. What is considered the best for one, might be the worst for another. The question is not really whether it is ideal or not, but rather that there is sometimes a need to handle some atypical cases. When the process is "some unknown docker/script" that someone developed and just wants to run to test something out, sometimes the auto-resolution of the server is not what the user wants. Sometimes, the expected resolution by the user cannot be predicted correctly by the server. In some situations, the reality of the workflow design procedure, such as the rapid development for AI to publish papers, makes it such that, if the user was told they need to redesign everything because it does not fit the auto-resolution pattern, they would simply give up and move on elsewhere, as they cannot be bothered or don't have the time. They might not care about portability or reuse, they just want the raw data out. This is why I'm pushing strongly to have options like One example I can think of is some processing workflow that needs STAC Item representations to exact specific assets. Because STAC is also OGC Features compliant, a remote collection could be resolved as either case. Depending on which server I send this workflow, some implementations could prefer the STAC resolution, while others could prefer OGC Features. Neither is "wrong", it's just a matter of preference/design for each server. Similarly, depending on their capabilities, any In 99% of cases, I would expect the pre-filtering and auto-resolution to be the desired and most useful way to go. But, for specific edge case where it doesn't do what is desired, hacking workflows to make them "behave" is tedious, or needs to introduce special logic in my generic processing engine to handle these uncommon cases. |
When looking at the definitions (and subsections) of:
All the same terminology, field names (filter, properties, sortBy), and intention for each are reused. This makes it very hard to understand "where" each of those fields are expected.
Furthermore, the only available example (https://docs.ogc.org/DRAFTS/21-009.html#_coastal_erosion_susceptibility_example_workflow) mentions that the input/output modifiers are both used, which doesn't help disambiguate them.
Adding to the ambiguity, the Input Field Modifiers are used to request/filter a certain collection (possibly remote), for which the (output) resulting items/features/tiles/coverages from the search are used as input for the process. On the other hand (to my understanding), Output Field Modifiers would be used to perform further filtering/sorting/derived-values from a resulting collection from the processing, to be made available for another step, or as the final workflow result. Since each of these pre/post-filters could be interchanged in some cases, or can be seen (implemented) as sub-processes themselves with inputs/outputs, the operations and applicable requirement classes rapidly become (in their current state) confusing and indistinguishable.
Explicit examples (on their own) demonstrating how Input Field Modifiers and Output Field Modifiers must be submitted in an execution request would help understand the intention.
Something to validate (I'm assuming here):
Are the Output Field Modifiers supposed to be provided under the "outputs" of the execution request?