-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve predictability of DataQuery, DataID, and dependency tree #3018
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3018 +/- ##
==========================================
- Coverage 96.10% 96.08% -0.02%
==========================================
Files 377 378 +1
Lines 55163 55213 +50
==========================================
+ Hits 53012 53050 +38
- Misses 2151 2163 +12
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Pull Request Test Coverage Report for Build 12381834226Details
💛 - Coveralls |
Includes changes to loading compositors to a DataID with all query parameters
46eafc6
to
de36fb8
Compare
new_id_dict = orig_id.to_dict() | ||
orig_id_keys = orig_id.id_keys | ||
for query_key, query_val in query_dict.items(): | ||
# XXX: What if the query_val is a list? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my main task remaining. I'm really not sure how I want to treat this. If you ask for a composite as DataQuery(name="some_comp", resolution=[500, 1000])
, what do you set in the DataID
? This is before we know what dependencies were found. So does the composite DataID
become DataID(name="some_comp")
with no resolution, resolution 500, or resolution 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo it should get the final resolution. So if the generated composite has a 500m resolution, the dataid should carry that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And that may happen eventually, but we don't know that when this function is called. It is called here:
satpy/satpy/dependency_tree.py
Line 448 in c45ed8d
new_id = update_id_with_query(compositor.id, dataset_key) |
There are the two separate processes of getting the compositor (the class instance that will be called) and the composite DataArray (generated by calling the compositor). This function is called to give the compositor an identity so it can be referred to for generating, but also if it is needed by some other compositor configuration (YAML prerequisites) or the user requested it. So overall the DataID generated by this function is "temporary" in that once composites are generated this DataID may be overwritten by the __call__
of the compositor.
Additionally, at least with the current calling order of things, we don't even know which prerequisites are going to be used when this function is called. And it seems the CompositorNode where this is DataID is used is needed before prerequisites and optionals are determined because we add the Nodes of those prereqs/optionals to the CompositorNode's children. There might be a way to get or hack around that, but I'm not sure.
So back to the original question:
If you ask for a composite as DataQuery(name="some_comp", resolution=[500, 1000]), what do you set in the DataID?
We don't know the prereqs or optionals and we're just updating the original request of "some_comp" with some resolution, but what one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I made a first pass. Overall, thanks a lot for the clarifications and refactorings, the code reads better now.
I have comments inline, or rather mostly questions because I’m not following everything :)
""" | ||
return self.equal(other, shared_keys=False) | ||
|
||
def equal(self, other: Any, shared_keys: bool = False) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not found of passing a bool here. Could it be passed something like keys_to_match
instead, i.e. the list of keys to use for the matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really as the ability to get shared keys (keys to match) only happens after we've extracted the dict version of each object. Additionally, this is a top-level public method now (I don't remember if it needs to be) so doing low-level stuff with the keys seems like too much work for a high-level user-facing method and would have to be repeated anywhere shared_keys=True
is used. I could split it I suppose into two methods that call one shared method, but then again it is a lot of extra code for a single use.
def _create_id_dict_from_any_key(dataset_key): | ||
try: | ||
def _create_id_dict_from_any_key(dataset_key: DataQuery | DataID | str | numbers.Number) -> dict[str, Any]: | ||
if hasattr(dataset_key, "to_dict"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Easier to ask for forgiveness than permission"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly I changed this because mypy was getting mad about the typing and the try/except
being hard to parse. It might have also been CodeScene. I knew you would prefer try/except here, but the linter's didn't like it so I thought it was OK. I can do some type ignoring if you'd still prefer the try/except.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that’s fine…
new_id_dict = orig_id.to_dict() | ||
orig_id_keys = orig_id.id_keys | ||
for query_key, query_val in query_dict.items(): | ||
# XXX: What if the query_val is a list? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo it should get the final resolution. So if the generated composite has a 500m resolution, the dataid should carry that.
return new_id | ||
|
||
|
||
def _keys_to_compare(sdict: dict, odict: dict, o_is_id: bool, shared_keys: bool) -> set: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does o_is_id
stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other is DataID or at least looks like one. I could give the variable a longer name for sure. I went back and forth on how to handle the different cases, but at the end of the day my sanity was preserved by doing simple if/else statements on the DataID-ness of "other". Now that most of the logic is stabilizing I could revisit this. Ideas welcome.
def _compare_key_equality(sdict: dict, odict: dict, key: str, o_is_id: bool) -> bool: | ||
if key not in sdict: | ||
return False | ||
sval = sdict[key] | ||
if sval == "*": | ||
return True | ||
|
||
if key not in odict: | ||
return False | ||
oval = odict[key] | ||
if oval == "*": | ||
# Gotcha: if a DataID contains a "*" this could cause | ||
# unexpected matches. A DataID is not expected to use "*" | ||
return True | ||
|
||
return _compare_values(sval, oval, o_is_id) | ||
|
||
|
||
def _compare_values(sval: Any, oval: Any, o_is_id: bool) -> bool: | ||
if isinstance(sval, list) or isinstance(oval, list): | ||
# multiple options to match | ||
if not isinstance(sval, list): | ||
# query to query comparison, make a list to iterate over | ||
sval = [sval] | ||
if o_is_id: | ||
return oval in sval | ||
|
||
# we're matching against a DataQuery who could have its own list | ||
if not isinstance(oval, list): | ||
oval = [oval] | ||
s_in_o = any(_sval in oval for _sval in sval) | ||
o_in_s = any(_oval in sval for _oval in oval) | ||
return s_in_o or o_in_s | ||
return oval == sval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s probably my doing, but could we take the opportunity to find better/longer names for sval, oval, sdict, odict, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really thought I answered a bunch of these comments already, but apparently not...
Yes, I can try to find better names. I think s
is self or maybe source? And o
is other. I'm not entirely upset with them as they are now, but yeah not super clear.
@@ -424,6 +430,7 @@ def _find_compositor(self, dataset_key, query): | |||
# one or more modifications if it has modifiers see if we can find | |||
# the unmodified version first | |||
|
|||
orig_query = dataset_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- are we sure this is a query, not an id?
- should we rename the function argument instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure, but I think so. This is reminding me that I wanted to add more type annotations to this stuff for this reason, but forgot after needing some last minute changes in other parts of the code. If I remember correctly the information passed by the user gets turned into a DataQuery in one of the top-level tree methods and gets passed around from there.
if key.get("name", default="*") != "*" and len(key.to_dict()) == 1: | ||
# the query key is just the name and still couldn't be found | ||
raise KeyError("Could not find compositor '{}'".format(key)) | ||
|
||
# Get the generic version of the compositor (by name only) | ||
# then save our new version under the new name | ||
|
||
return self._get_compositor_by_name(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What use cases in this covering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah the name of this method is incorrect after more refactoring was done. So the code above this is when the compositor exactly matches what the user asked for. This is usually the common case of just a compositor by name but could also include other filtering parameters like resolution
or calibration
or whatever else was asked for.
The method _get_compositor_by_name
first gets the base compositor by just the name and then if more than one compositor matches it filters down using the other properties of the provided DataQuery
. The cases where a compositor name returns more than one compositor would be if there are varying resolutions or calibrations for a single compositor. So again, not a common use case, but it was in the tests already.
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=[500, 750])), # opposite order | ||
(DataQuery(name="1", resolution=500), DataQuery(name="1", resolution=[500, 750])), # opposite order | ||
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=500)), # opposite order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look surprising?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a case of the code written first then the tests second. More importantly, this is testing code that I didn't write, but I could tell this functionality was intended. That is, I believe we support users specifying a list of things in their DataQuery (and in the filter kwargs passed to Scene.load
) and the results should/could match any of the things in the list.
Or is there something else surprising about these. I'm mostly making sure that regardless of which DataQuery object's __eq__
is called (the first or the second) that things are still matched. I'm not sure if the # opposite order
comment at the end of these 3 lines is valid. Most likely it is an artifact of copy/pasting the earlier test. Or maybe the comment doesn't apply to the 3rd to last case, but does for the last 2 cases.
(DataQuery(name="1", resolution="*"), dict(name="1")), | ||
(DataQuery(name="1", resolution="*"), dict(name="1", resolution=500)), | ||
# DataID shouldn't use * but we still test it: | ||
(DataQuery(name="1", resolution=500), dict(name="1", resolution="*")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is *
valid for dataids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is technically allowed since it is a string and DataID does not do any validation against it. That's why the comment is there. There is also a comment somewhere in the equality method of the DataQuery about it. There is nothing stopping it in the code from happening (a "*" in a DataID), but it also isn't expected or wanted or supported entirely.
Ok I've been spending some time trying to get a better understanding of composite and modifier ID handling. Basically I wanted to be more confident in my fixes, add tests for edge cases, and resolve my TODO I mentioned in this comment. It comes down to compositors (and the Node objects wrapping them) and composites (the loaded DataArray) being separate identifiers, but treated as the same thing. Series of events
Problems and Difficulties
Solutions?🤷♂️ Overall (I think in |
Disclaimer: This PR's implementation is very rough at the time of writing.
This PR includes major changes to key parts of Satpy in order to resolve some inconsistencies noticed by users over the years. The high-level concepts that have been changed or updated are:
DataQuery(name="comp", resolution=500), DataQuery(name="comp", resolution=1000)
).Remaining work
Hindsight
For high-level change 1 above, I'm starting to think this was the wrong change or at least that the previous behavior had a good point. That is, if a user creates a query with a lot of keys to apply to many DataIDs from different sources, then not all DataIDs should be required to have all those keys. For example, if I specify a polarization in my query, then I don't think all composites or rather composite dependencies would be able to match that. There are currently no tests to verify this.
AUTHORS.md
if not there already