
Improve predictability of DataQuery, DataID, and dependency tree #3018

Draft · wants to merge 22 commits into base: main
Conversation

djhoese (Member) commented Dec 13, 2024

Disclaimer: This PR's implementation is very rough at the time of writing.

This PR includes major changes to key parts of Satpy in order to resolve some inconsistencies noticed by users over the years. The high-level concepts that have been changed or updated are:

  1. DataQuery objects are now only equal to DataID objects that match all of the query's keys. Previously only the shared keys were compared, which meant a DataQuery could match a DataID that didn't contain all of the information requested by the query.
  2. The transitivity of the "resolution" DataID key was inconsistent across ID key sets: for the "default" ID keys it was False, while for the coordinate and minimal sets it was transitive. It made the most sense to set it to False everywhere. That is, a modified dataset is not required to have all of its dependencies at the same resolution.
  3. Add and refactor a lot of tests regarding DataQuery and DataID comparisons.
  4. It should be possible to load a composite with two different sets of inputs (ex. DataQuery(name="comp", resolution=500), DataQuery(name="comp", resolution=1000)).
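As a rough sketch of change 1 (illustrative only, not Satpy's actual implementation): the new strict rule requires every key in the query to be present in the ID, while the old rule only compared keys both sides had. "*" acts as a wildcard in the query.

```python
def strict_match(query: dict, data_id: dict) -> bool:
    """New behavior: a key requested by the query but missing from the ID means no match."""
    for key, wanted in query.items():
        if wanted == "*":
            continue  # wildcard matches anything, including a missing key
        if key not in data_id or data_id[key] != wanted:
            return False
    return True


def shared_key_match(query: dict, data_id: dict) -> bool:
    """Old behavior: only keys present in both the query and the ID are compared."""
    for key in query.keys() & data_id.keys():
        if query[key] != "*" and query[key] != data_id[key]:
            return False
    return True
```

For example, a query with `resolution=500` no longer matches an ID that has no "resolution" key at all: `strict_match` returns False where `shared_key_match` returned True.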

Remaining work

  • There are a lot of edge cases that need to be worked out. The biggest one is what happens when a DataQuery has a list of possible options. That is not handled in a lot of my dependency tree stuff.
  • See "Hindsight" below.
  • Refactoring
  • More explicit tests

Hindsight

For high-level change 1 above, I'm starting to think this was the wrong change, or at least that the previous behavior had a point: if a user creates a query with many keys to apply to DataIDs from different sources, then not all of those DataIDs should be required to have all of those keys. For example, if I specify a polarization in my query, not all composites (or rather composite dependencies) would be able to match it. There are currently no tests to verify this.

  • Closes #xxxx
  • Tests added
  • Fully documented
  • Add your name to AUTHORS.md if not there already

djhoese added the bug, component:compositors, and component:dep_tree (Dependency tree and dataset loading) labels Dec 13, 2024
codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 96.96312% with 14 lines in your changes missing coverage. Please review.

Project coverage is 96.08%. Comparing base (5984c29) to head (c45ed8d).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
satpy/dataset/id_keys.py 96.15% 5 Missing ⚠️
satpy/dependency_tree.py 83.33% 5 Missing ⚠️
satpy/dataset/dataid.py 97.50% 2 Missing ⚠️
satpy/tests/test_dependency_tree.py 98.19% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3018      +/-   ##
==========================================
- Coverage   96.10%   96.08%   -0.02%     
==========================================
  Files         377      378       +1     
  Lines       55163    55213      +50     
==========================================
+ Hits        53012    53050      +38     
- Misses       2151     2163      +12     
Flag Coverage Δ
behaviourtests 3.98% <24.72%> (+0.04%) ⬆️
unittests 96.17% <96.96%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


coveralls commented Dec 14, 2024

Pull Request Test Coverage Report for Build 12381834226

Details

  • 447 of 461 (96.96%) changed or added relevant lines in 25 files are covered.
  • 9 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.02%) to 96.188%

Changes Missing Coverage Covered Lines Changed/Added Lines %
satpy/dataset/dataid.py 78 80 97.5%
satpy/tests/test_dependency_tree.py 109 111 98.2%
satpy/dataset/id_keys.py 125 130 96.15%
satpy/dependency_tree.py 25 30 83.33%
Files with Coverage Reduction New Missed Lines %
satpy/dependency_tree.py 1 95.77%
satpy/tests/utils.py 2 93.16%
satpy/tests/reader_tests/gms/test_gms5_vissr_l1b.py 3 98.67%
satpy/tests/reader_tests/gms/test_gms5_vissr_navigation.py 3 97.18%
Totals Coverage Status
Change from base Build 12299617024: -0.02%
Covered Lines: 53294
Relevant Lines: 55406


djhoese force-pushed the bugfix-greedy-dataid branch from 46eafc6 to de36fb8 on December 17, 2024
new_id_dict = orig_id.to_dict()
orig_id_keys = orig_id.id_keys
for query_key, query_val in query_dict.items():
# XXX: What if the query_val is a list?
Member Author:

This is my main task remaining. I'm really not sure how I want to treat this. If you ask for a composite as DataQuery(name="some_comp", resolution=[500, 1000]), what do you set in the DataID? This is before we know what dependencies were found. So does the composite DataID become DataID(name="some_comp") with no resolution, resolution 500, or resolution 1000?

Member:

imo it should get the final resolution. So if the generated composite has a 500m resolution, the dataid should carry that.

Member Author:

And that may happen eventually, but we don't know that when this function is called. It is called here:

new_id = update_id_with_query(compositor.id, dataset_key)

There are two separate processes: getting the compositor (the class instance that will be called) and getting the composite DataArray (generated by calling the compositor). This function is called to give the compositor an identity so it can be referred to during generation, but also in case it is needed by some other compositor's configuration (YAML prerequisites) or was requested by the user. So overall the DataID generated by this function is "temporary" in that, once composites are generated, it may be overwritten by the compositor's __call__.

Additionally, at least with the current calling order of things, we don't even know which prerequisites are going to be used when this function is called. And it seems the CompositorNode where this DataID is used is needed before prerequisites and optionals are determined, because we add the Nodes of those prereqs/optionals to the CompositorNode's children. There might be a way to get or hack around that, but I'm not sure.

So back to the original question:

If you ask for a composite as DataQuery(name="some_comp", resolution=[500, 1000]), what do you set in the DataID?

We don't know the prereqs or optionals and we're just updating the original request of "some_comp" with some resolution, but what one?
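
One hypothetical option for this open question (not what the PR currently does, and all names here are illustrative): when building the compositor's temporary DataID from the query, skip any key whose value is a list (or "*"), leaving it to be filled in by the compositor's __call__ once the real inputs are known.

```python
def update_id_with_query_sketch(orig_id: dict, query: dict) -> dict:
    """Build a temporary ID dict, deferring ambiguous (list or wildcard) keys."""
    new_id = dict(orig_id)
    for key, value in query.items():
        if isinstance(value, list) or value == "*":
            # ambiguous at this point in loading: leave the key out and let
            # the compositor's __call__ set the concrete value later
            new_id.pop(key, None)
            continue
        new_id[key] = value
    return new_id
```

Under this sketch, DataQuery(name="some_comp", resolution=[500, 1000]) would yield a temporary ID of just DataID(name="some_comp"), with "resolution" only appearing after generation.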

mraspaud (Member) left a comment:

Ok, I made a first pass. Overall, thanks a lot for the clarifications and refactorings, the code reads better now.
I have comments inline, or rather mostly questions because I’m not following everything :)

"""
return self.equal(other, shared_keys=False)

def equal(self, other: Any, shared_keys: bool = False) -> bool:
Member:

I’m not fond of passing a bool here. Could we pass something like keys_to_match instead, i.e. the list of keys to use for the matching?

Member Author:

Not really: the shared keys (the keys to match) are only available after we've extracted the dict version of each object. Additionally, this is now a top-level public method (I don't remember if it needs to be), so doing low-level work with the keys seems like too much for a high-level user-facing method and would have to be repeated everywhere shared_keys=True is used. I could split it into two methods that call one shared method, but that is a lot of extra code for a single use.
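
For illustration, the split mentioned above could look like this (a hypothetical standalone sketch, not Satpy's DataQuery class): two thin public methods over one private helper, replacing the bool flag.

```python
class QueryLike:
    """Toy stand-in for a query object with dict-like fields."""

    def __init__(self, fields: dict):
        self._fields = fields

    def equal(self, other: "QueryLike") -> bool:
        """Strict equality: every one of our keys must match."""
        return self._equal(other, shared_keys=False)

    def equal_on_shared_keys(self, other: "QueryLike") -> bool:
        """Lenient equality: only keys present on both sides are compared."""
        return self._equal(other, shared_keys=True)

    def _equal(self, other: "QueryLike", shared_keys: bool) -> bool:
        keys = self._fields.keys()
        if shared_keys:
            keys = keys & other._fields.keys()
        return all(self._fields.get(k) == other._fields.get(k) for k in keys)
```

The trade-off the author mentions is visible here: the public surface doubles for a single internal toggle.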

def _create_id_dict_from_any_key(dataset_key):
try:
def _create_id_dict_from_any_key(dataset_key: DataQuery | DataID | str | numbers.Number) -> dict[str, Any]:
if hasattr(dataset_key, "to_dict"):
Member:

"Easier to ask for forgiveness than permission"?

Member Author:

If I remember correctly, I changed this because mypy was getting mad about the typing and the try/except being hard to parse. It might have also been CodeScene. I knew you would prefer try/except here, but the linters didn't like it, so I thought this was OK. I can add some type ignores if you'd still prefer the try/except.

mraspaud (Member) commented Jan 15, 2025:

that’s fine…
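
For reference, the two idioms under discussion side by side ("look before you leap" with hasattr, which mypy handles more easily, versus the "easier to ask forgiveness than permission" try/except). Function names and the string fallback are illustrative, not the exact Satpy code.

```python
from typing import Any


def id_dict_lbyl(dataset_key: Any) -> dict[str, Any]:
    # "Look before you leap": check for the method before calling it
    if hasattr(dataset_key, "to_dict"):
        return dataset_key.to_dict()
    return {"name": dataset_key}


def id_dict_eafp(dataset_key: Any) -> dict[str, Any]:
    # "Easier to ask forgiveness than permission": call and catch the failure
    try:
        return dataset_key.to_dict()
    except AttributeError:
        return {"name": dataset_key}
```

Both behave identically for objects with and without a to_dict method; the difference is purely about static-analysis friendliness versus Python convention.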

return new_id


def _keys_to_compare(sdict: dict, odict: dict, o_is_id: bool, shared_keys: bool) -> set:
Member:

what does o_is_id stand for?

Member Author:

other is DataID or at least looks like one. I could give the variable a longer name for sure. I went back and forth on how to handle the different cases, but at the end of the day my sanity was preserved by doing simple if/else statements on the DataID-ness of "other". Now that most of the logic is stabilizing I could revisit this. Ideas welcome.

Comment on lines +532 to +565
def _compare_key_equality(sdict: dict, odict: dict, key: str, o_is_id: bool) -> bool:
if key not in sdict:
return False
sval = sdict[key]
if sval == "*":
return True

if key not in odict:
return False
oval = odict[key]
if oval == "*":
# Gotcha: if a DataID contains a "*" this could cause
# unexpected matches. A DataID is not expected to use "*"
return True

return _compare_values(sval, oval, o_is_id)


def _compare_values(sval: Any, oval: Any, o_is_id: bool) -> bool:
if isinstance(sval, list) or isinstance(oval, list):
# multiple options to match
if not isinstance(sval, list):
# query to query comparison, make a list to iterate over
sval = [sval]
if o_is_id:
return oval in sval

# we're matching against a DataQuery who could have its own list
if not isinstance(oval, list):
oval = [oval]
s_in_o = any(_sval in oval for _sval in sval)
o_in_s = any(_oval in sval for _oval in oval)
return s_in_o or o_in_s
return oval == sval
Member:

It’s probably my doing, but could we take the opportunity to find better/longer names for sval, oval, sdict, odict, etc?

Member Author:

I really thought I answered a bunch of these comments already, but apparently not...

Yes, I can try to find better names. I think s is self or maybe source? And o is other. I'm not entirely upset with them as they are now, but yeah not super clear.
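
Along the lines the review suggests, here is one possible renaming of _compare_values with query_value/other_value in place of sval/oval. The logic mirrors the diff above: against a DataID, the ID's single value must be one of the query's options, while two queries match if their option lists overlap.

```python
from typing import Any


def values_match(query_value: Any, other_value: Any, other_is_id: bool) -> bool:
    if isinstance(query_value, list) or isinstance(other_value, list):
        # normalize the query side to a list of acceptable options
        query_options = query_value if isinstance(query_value, list) else [query_value]
        if other_is_id:
            # a DataID carries exactly one value; it must be among the options
            return other_value in query_options
        # query-to-query: match if either side's options overlap the other's
        other_options = other_value if isinstance(other_value, list) else [other_value]
        return any(opt in other_options for opt in query_options) or any(
            opt in query_options for opt in other_options)
    return query_value == other_value
```

Only the names changed; the behavior is intended to be identical to the _compare_values shown in the diff.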

@@ -424,6 +430,7 @@ def _find_compositor(self, dataset_key, query):
# one or more modifications if it has modifiers see if we can find
# the unmodified version first

orig_query = dataset_key
Member:

  1. are we sure this is a query, not an id?
  2. should we rename the function argument instead?

Member Author:

I'm honestly not sure, but I think so. This is reminding me that I wanted to add more type annotations to this stuff for this reason, but forgot after needing some last minute changes in other parts of the code. If I remember correctly the information passed by the user gets turned into a DataQuery in one of the top-level tree methods and gets passed around from there.

Comment on lines +507 to +514
if key.get("name", default="*") != "*" and len(key.to_dict()) == 1:
# the query key is just the name and still couldn't be found
raise KeyError("Could not find compositor '{}'".format(key))

# Get the generic version of the compositor (by name only)
# then save our new version under the new name

return self._get_compositor_by_name(key)
Member:

What use cases is this covering?

Member Author:

Ah the name of this method is incorrect after more refactoring was done. So the code above this is when the compositor exactly matches what the user asked for. This is usually the common case of just a compositor by name but could also include other filtering parameters like resolution or calibration or whatever else was asked for.

The method _get_compositor_by_name first gets the base compositor by just the name and then if more than one compositor matches it filters down using the other properties of the provided DataQuery. The cases where a compositor name returns more than one compositor would be if there are varying resolutions or calibrations for a single compositor. So again, not a common use case, but it was in the tests already.
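
An illustrative reconstruction of that fallback path (hypothetical names and data, not the actual Satpy method): select candidates by name first, then, if more than one remains, narrow by the query's other properties.

```python
def get_compositor_by_name(compositor_ids: list[dict], query: dict) -> dict:
    """Pick a compositor ID: by name first, then by remaining query keys."""
    candidates = [cid for cid in compositor_ids
                  if cid.get("name") == query.get("name")]
    if len(candidates) > 1:
        # multiple variants (e.g. differing resolutions): filter by the rest
        candidates = [cid for cid in candidates
                      if all(cid.get(key) == value
                             for key, value in query.items() if key != "name")]
    if not candidates:
        raise KeyError(f"Could not find compositor {query!r}")
    return candidates[0]
```

This reflects the described behavior where a name-only lookup suffices in the common case, and extra query properties only matter when a name maps to several variants.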

Comment on lines +671 to +673
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=[500, 750])), # opposite order
(DataQuery(name="1", resolution=500), DataQuery(name="1", resolution=[500, 750])), # opposite order
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=500)), # opposite order
Member:

These look surprising?

Member Author:

This is a case of the code being written first and the tests second. More importantly, this is testing code that I didn't write, but I could tell this functionality was intended: I believe we support users specifying a list of things in their DataQuery (and in the filter kwargs passed to Scene.load), and the results should/could match any of the things in the list.

Or is there something else surprising about these? I'm mostly making sure that regardless of which DataQuery object's __eq__ is called (the first or the second), things are still matched. I'm not sure the # opposite order comment at the end of these 3 lines is valid; most likely it is an artifact of copy/pasting the earlier test. Or maybe the comment doesn't apply to the 3rd-to-last case but does apply to the last 2.

(DataQuery(name="1", resolution="*"), dict(name="1")),
(DataQuery(name="1", resolution="*"), dict(name="1", resolution=500)),
# DataID shouldn't use * but we still test it:
(DataQuery(name="1", resolution=500), dict(name="1", resolution="*")),
Member:

is * valid for dataids?

Member Author:

It is technically allowed since it is a string and DataID does not do any validation against it; that's why the comment is there. There is also a comment about this somewhere in DataQuery's equality method. Nothing in the code stops a "*" from appearing in a DataID, but it isn't expected, wanted, or fully supported.

djhoese (Member, Author) commented Jan 23, 2025

Ok, I've been spending some time trying to get a better understanding of composite and modifier ID handling. Basically I wanted to be more confident in my fixes, add tests for edge cases, and resolve the TODO I mentioned in this comment. It comes down to compositors (and the Node objects wrapping them) and composites (the loaded DataArray) having separate identities while being treated as the same thing.

Series of events

  1. User calls Scene.load and the Scene determines what from the request is still needed. I'm still not sure how this code works between strings/DataIDs/DataQuerys and the DataIDs in self._datasets, but I'll figure that out.

  2. The Scene updates the dependency tree so it knows where and how to create the datasets requested by the user.

  3. The dependency tree, in the case of a compositor, will find the compositor usually by name, and let the compositor say what DataID it has by accessing the .id property of the Compositor object (compositor.id - see usage here). The dependency tree now stores the compositor by that DataID.

  4. The Scene gets the compositor nodes to be loaded (trunk nodes):

    satpy/satpy/scene.py

    Lines 1541 to 1542 in 3990f37

    trunk_nodes = self._dependency_tree.trunk(limit_nodes_to=self.missing_datasets,
                                              limit_children_to=self._datasets.keys())

  5. The Scene checks if the dataset in the DependencyTree already exists in the Scene's container of created datasets:

    satpy/satpy/scene.py

    Lines 1533 to 1537 in 3990f37

    loaded_data_ids = self._datasets.keys()
    for trunk_node in trunk_nodes:
        if trunk_node.name in loaded_data_ids:
            continue
        yield trunk_node

  6. Any compositor nodes from the dependency tree still needing loading is generated, but we first check if it is contained in the Scene's container...again:

    satpy/satpy/scene.py

    Lines 1564 to 1566 in 3990f37

    if self._datasets.contains(comp_node.name):
        # already loaded
        return

    Same check as step 5 above, but stricter, as .contains must match the DataID exactly while step 5 is a "best match".

  7. After the DataArray is generated from the compositor we create a DataID from its .attrs and update the Scene's wishlist and the dependency tree's nodes with this new ID:

    satpy/satpy/scene.py

    Lines 1611 to 1618 in 3990f37

    cid = DataID.new_id_from_dataarray(composite)
    self._datasets[cid] = composite
    # update the node with the computed DataID
    if comp_node.name in self._wishlist:
        self._wishlist.remove(comp_node.name)
        self._wishlist.add(cid)
    self._dependency_tree.update_node_name(comp_node, cid)
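
A toy condensation of steps 5-7 (all names hypothetical; the real code lives in satpy/scene.py around the lines quoted above): skip nodes whose current ID is already loaded, generate the rest, then re-key the results and the wishlist under the ID derived from the generated result's attrs.

```python
def id_from_attrs(attrs: dict) -> tuple:
    """Derive a hashable ID from a result's attrs (toy stand-in for DataID)."""
    return tuple(sorted(attrs.items()))


def generate_composites(nodes, datasets: dict, wishlist: set) -> None:
    # nodes: list of (node_id, compositor) where compositor() returns an attrs dict
    for node_id, compositor in nodes:
        if node_id in datasets:
            continue  # steps 5-6: already loaded under the temporary ID
        attrs = compositor()
        final_id = id_from_attrs(attrs)  # step 7: final ID from the result
        datasets[final_id] = attrs
        if node_id in wishlist:
            # swap the temporary ID for the final one
            wishlist.discard(node_id)
            wishlist.add(final_id)
```

The re-keying at the end is exactly the step where the temporary and final IDs can diverge, which is what makes the early "already loaded" checks unreliable.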

Problems and Difficulties

  1. In step 3 above, the compositor's DataID and the resulting DataArray's might not be the same. The compositor object's .id cannot know which version of its prerequisites will be used (the dependency tree hasn't determined that yet) and therefore can't use them to influence the DataID. Historically we've allowed this so the compositor developer can do at-runtime updates to the DataID parameters in the compositor's __call__ to make them more accurate (ex. update "resolution").
  2. Due to problem 1, the checks in steps 1, 5, and 6 may not be accurate, depending on how the compositor updates the DataID for the resulting DataArray.
  3. In main, the compositor's ID does not take into account the extra query parameters passed by the user (either from the DataQuery from the user, or the **kwargs passed like resolution=[500, 1000]). This means a user could request the same composite multiple ways that should result in different inputs being used, but the DependencyTree would use the same compositor for all of them. In this PR I try to solve that, but there are problems with that as well.
  4. The whole compositor ID re-assigning is very unsettling.

Solutions?

🤷‍♂️

Overall (I think in both main and this PR), if a compositor's __call__ modifies the resulting DataID in a way that conflicts with the previous compositor ID (changes a value that already existed) or makes it no longer match the user's original request, then the resulting behavior is not well defined. This might just be something we have to deal with.
