Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305
To further clarify: the issue above matters because it dictates whether metadata found at the set level should be propagated down to the mapping level, or not. If setting On the contrary, if |
@gouttegd I am glad you raised this discussion; it has irked me since the inception of SSSOM. I have no good idea how to solve this. Let's start with the two most obvious ways: meta-modelling and default range assumption.
Any other ideas? |
Let’s get an answer to the following question then, about the intentions when the model was first designed:
If that is the case, then your second proposition is fine, it just needs to be explicitly stated somewhere in the spec. On the contrary, if at least some of the slots that exist both at the mapping level and at the mapping set level are intended to have a different meaning depending on the level they are used (for example, |
Independently of the question of which slots should be propagated down to the mapping level (all of them or only some of them), the spec also needs to specify how values should be propagated. In particular, we must know what to do when an individual mapping already has a different value for a slot that is also present at the set level. For example:
What is the desired behaviour here? I can see several options:
|
I intuitively prefer 1, but I am not certain. The advantage of 3 is that it is simpler to maintain in cases where, for example, a global field only makes sense in conjunction with certain justifications. It does not make sense to say: I would be clearly against 2. |
I see two problems with option 1, which is why I tend to prefer option 3. The first problem is: what about multi-valued fields? For example, if we have this:
Under option 1), it is clear that the value of The second problem is about mappings with empty values. In the example above, what if the EXA:0003 has an empty value because we don’t actually know who created it? Under option 1), we can’t distinguish between a field that is empty because it is supposed to be empty and a field that is empty because it is supposed to take the default value propagated from the set level. Option 3) solves those two problems by basically disallowing the mixing of a default value set at the set level and explicit values set at the individual mapping level. To put it differently, option 3) means that setting a value at the set level as a “shortcut” to avoid setting it on each mapping is only allowed in the case where all mappings in a set have the same value, and the field is not even present in the TSV columns. So even if you have 99 mappings with the same value I believe this is the safest and least confusing option. |
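A minimal sketch of what option 3 could look like in practice, assuming mappings are held as plain dictionaries and that "the field is not present in the TSV columns" translates to "no mapping carries that key". The function name `propagate_option3`, the `propagatable` parameter, and the sample values are hypothetical; none of them come from the spec or from an existing SSSOM library.

```python
import copy
from typing import Any


def propagate_option3(
    set_metadata: dict[str, Any],
    mappings: list[dict[str, Any]],
    propagatable: set[str],
) -> list[dict[str, Any]]:
    """Return copies of the mappings with set-level values propagated.

    A set-level value is copied down only when no mapping has the slot at
    all, so set-level defaults and per-mapping values are never mixed.
    """
    result = [dict(m) for m in mappings]
    for slot in propagatable:
        if slot not in set_metadata:
            continue
        if any(slot in m for m in mappings):
            # The slot is used on at least one mapping (it is a "column"),
            # so the set-level value is not treated as a default.
            continue
        for m in result:
            # Copy so that multi-valued slots are not shared between mappings.
            m[slot] = copy.deepcopy(set_metadata[slot])
    return result


# Hypothetical sample data, for illustration only.
set_metadata = {
    "mapping_set_id": "https://example.org/sets/demo",
    "creator_id": ["EXA:alice"],
}
mappings = [{"subject_id": "EXA:0001"}, {"subject_id": "EXA:0002"}]
propagated = propagate_option3(set_metadata, mappings, {"creator_id"})
```

With this reading, an empty value on one mapping is never ambiguous: if any mapping has the slot, nothing is propagated, so an absent value simply means "no value".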
I think you are right - this is what we should do! Any other opinions? |
To be clear, I slightly prefer option 3) but I think all three options are more or less equally reasonable, so I don’t mind which one is chosen in the end. What matters to me is that one option is decided and that the corresponding expected behaviour is described in the spec, so that the behaviour is not left to the implementations to decide. |
We can ask others for their opinion, but let's move on to the second issue: deciding which slots should be propagated at all. I made a quick analysis:
Slots currently not globally specifiable, but which maybe should be:
Slots that exist on both levels and should be propagated:
Slots that only exist on the mapping level and should not exist globally:
Slots that exist on both levels and should not be propagated:
Slots that exist on both levels, but should probably only exist on the mapping set level:
EDITS after @gouttegd comments below. |
How likely is it that all mappings in a set will have the same semantic similarity score, so that the value could be set only once at the set level? I would have no objection to that, but it seems of dubious value. (Contrary to
Makes sense. I note you did not mention the My suggestion would be to make So we would have
Unless I’m missing something those slots do not exist on both levels, they are on the mapping level only. And I think they should stay like that. Even if a mapping set only contains mappings that have the same predicate, I’d argue that the predicate is much too important to be set once and for all at the set level, and should always be set explicitly for each mapping.
This one is tricky. Propagating it would mean that the set-level In fact, if we follow the logic above about No opinion for all the other slots. |
I updated all the suggestions @gouttegd according to your comments! I agree with all of them, thanks. |
It just occurred to me that there would be a third way, though it would involve a breaking change 😱 with the current version of the spec, so I don’t expect you to see it favourably. I mention it for completeness, though. Add a new slot at the set level only called something like That is, instead of this:
we would have this:
This would have two benefits:
Which is something that would not be possible with the current system. The obvious, if minor, inconvenience of this approach is, as noted above, that it is completely different from the current version of the spec and would therefore make several people unhappy (myself included, since the change would obviously affect SSSOM-Java!), at least in the beginning. |
This is not a bad idea! I am assuming it will cause some upheaval. I will run it by Chris in my next call with him and see what he thinks. |
Certainly. And there is a reasonable argument to be made that if it has taken so much effort to convince people to adopt SSSOM, pushing an incompatible change now would not be wise and could lead to those people deciding to drop the standard. |
Any update on this? If there is no objection to this comment, I’d like to start turning it into actual changes both in the specification (e.g., removing the |
I think the following we can enact right now as suggested by the comment you linked:
The question of how to deal with default values is still open. I have now added it to my agenda with @cmungall to contemplate (forgot it last time), and see what we agree on. EDIT: I actually did discuss this with him, and he said he had a "51% towards metamodel solution but open to other". The metamodel solution would be to add a flag like "propagates: true" to all the slots that can propagate and define a default behaviour for the possible conflicts you outlined above. |
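To make the metamodel idea a little more concrete, here is a rough sketch, assuming slot definitions are available as plain dictionaries and that the flag is literally named `propagates` as in the comment above. How such a flag would actually be expressed in LinkML is not settled at this point in the thread, so the data layout is purely illustrative, as is the choice of which slots are flagged.

```python
# Hypothetical, simplified slot definitions; only the `propagates` key matters.
# Which slots are flagged here is illustrative, not a statement about the spec.
SLOT_DEFINITIONS = {
    "mapping_set_id": {"propagates": False},
    "mapping_set_title": {},  # flag absent: treated as not propagatable
    "creator_id": {"propagates": True},
}


def propagatable_slots(slot_definitions: dict[str, dict]) -> set[str]:
    """Return the names of the slots explicitly flagged as propagatable."""
    return {
        name
        for name, definition in slot_definitions.items()
        if definition.get("propagates", False)
    }


print(sorted(propagatable_slots(SLOT_DEFINITIONS)))
```

Deriving the list from the model, rather than hard-coding it in each implementation, would keep implementations in step with the spec if the set of propagatable slots changes.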
OK, I can prepare a PR to update the model to move slots around and update their descriptions as needed.
I am fine with that solution, but I could use some help with the implementation, because it’s unclear to me how you can add an arbitrary flag to a slot in LinkML. |
You can probably reach out to @sierra-moxon (also on OBO slack) who is one of our LinkML master wizards. |
Resolves [#305]
- [x] `docs/` have been added/updated if necessary
- [x] `make test` has been run locally
- [ ] tests have been added/updated (not applicable)
- [x] [CHANGELOG.md](https://github.com/mapping-commons/sssom/blob/master/CHANGELOG.md) has been updated.
If you are proposing a change to the SSSOM metadata model, you must
- [ ] provide a full, working and valid example in `examples/` (**not applicable**: no new example needed as the change only affects how some slots should be interpreted; it does not add or remove slots, nor does it change how the propagated slots are used)
- [x] provide a link to the related GitHub issue in the `see_also` field of the linkml model
- [ ] provide a link to a valid example in the `see_also` field of the linkml model (**not applicable**, same reason as above)
This PR finalises the fix to #305, by explicitly specifying, directly within the LinkML model, which slots are considered “propagatable” (previously this was only informally described in the spec, since #368). This is done by:
* adding a “metamodel extension class” (`sssom:Propagatable`) with a single boolean-ranged attribute `propagated`;
* amending the slots that must be considered propagatable by making them instantiate the `sssom:Propagatable` extension.
Done with #368 (in the spec/doc) and #371 (in the model). The one thing that remains to be decided is whether propagation/condensation is allowed when using formats other than SSSOM/TSV (e.g. JSON-LD), but that will need to be done as part of the efforts to actually specify those formats. For what it’s worth, I believe condensation (and therefore propagation) does not make much sense for either the RDF or the JSON-LD serialisation. Those serialisations are very “verbose” anyway; trying to save space by condensing slots with the same value to the level of the set would be pointless. That being said, I am not (strongly) opposed to the idea. |
Regarding this:
Propagation and condensation are characteristics of model interpretation (driven by settings within the mapping metadata, or maybe I should say metamodel), are they not? There should not be any difference in how the model is interpreted whether it is TSV, JSON-LD, or RDF, should there? Under the metamodel solution, the interpretation is explicitly defined for SSSOM content. I can't see any obvious issue in representing the same characteristics in those other formats. The top-level metadata will define what propagates and can be expressed in any of the formats. And the biggest value of this capability is not really about saving space; even in TSV you're not saving much space. It's about cognitive load: you don't want someone to have to manually review every mapping to make sure the metadata attribute has been added to that particular mapping and is exactly the same. Just scanning (or searching) for exceptions is way easier, in any format. |
There could be, if we decided to.
I don’t see any issue either. I just also don’t see the real value in doing so. But at the same time, I do see the value of not having to treat the different serialisation formats differently, which is why I said that I am not strongly opposed to it. If we decide to treat condensation/propagation as a general characteristic of the model that is independent of the serialisation, then all that needs to happen is to move the description of the condensation/propagation operations from the section about the TSV format (where it currently resides) to the general section about the data model. |
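As a rough illustration of how the operation can be described independently of the serialisation, here is a sketch of condensation working on in-memory dictionaries rather than on TSV rows. It assumes condensation means hoisting a slot to the set level when every mapping carries the same value, as described earlier in the thread; the function name `condense` and the sample data are hypothetical.

```python
import copy
from typing import Any


def condense(
    set_metadata: dict[str, Any],
    mappings: list[dict[str, Any]],
    propagatable: set[str],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Hoist slots shared by all mappings to the set level.

    Returns new objects; the inputs are left untouched.
    """
    new_set = dict(set_metadata)
    new_mappings = [dict(m) for m in mappings]
    for slot in sorted(propagatable):
        if slot in new_set or not new_mappings:
            continue  # never overwrite an existing set-level value
        if not all(slot in m for m in new_mappings):
            continue  # some mapping has no value: nothing to condense
        first = new_mappings[0][slot]
        if all(m[slot] == first for m in new_mappings):
            new_set[slot] = copy.deepcopy(first)
            for m in new_mappings:
                del m[slot]
    return new_set, new_mappings


# Hypothetical sample data, for illustration only.
same = [
    {"subject_id": "EXA:0001", "creator_id": ["EXA:alice"]},
    {"subject_id": "EXA:0002", "creator_id": ["EXA:alice"]},
]
condensed_set, condensed_mappings = condense({}, same, {"creator_id"})
```

Nothing in the function depends on the data having been read from TSV, so the same logic could sit in the general data model section if that is what gets decided.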
The mapping set class seems to have two very different types of metadata slots:
It is obvious that the metadata slots that are only allowed at the set level belong to the first category (e.g., `mapping_set_id`, `mapping_set_title`).
But for the slots that are allowed both at the set level and at the individual mapping level, it is not obvious how they should be interpreted when they are found at the set level.
If a set has a `creator_id` metadata for example, should that be understood to mean that the referenced creator is the creator of the set (which may not be the same as the creator(s) of the individual mappings that make up the set), or that it is the default creator of all the mappings of the set (at least all those that do not have an explicit value for that slot)?
The documentation of `creator_id` seems to suggest the latter because it refers to “mapping” in singular (“Identifies the persons or groups responsible for the creation of the mapping”), but I’d be wary of using English grammar to infer whether a metadata slot is intended to refer to an individual mapping or to a set of mappings… especially since the documentation of `author_id`, which is a slot that only exists for individual mappings (and so cannot refer to the “author” of a mapping set) uses “mappings” in plural!
For now, my assumption is that all the slots that exist both at the mapping level and at the mapping set level are all about the individual mapping, and that when they are used at the set level, it is only as a “shortcut” to avoid repeating the same metadata for all individual mappings (said otherwise: a slot never has a different meaning depending on whether it is used on an individual mapping or on a mapping set). But this is something that should be explicitly specified. All the spec says is:
which in my opinion is wholly inadequate.