Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

gouttegd · 2023-07-25T17:17:21Z

The mapping set class seems to have two very different types of metadata slots:

Slots that are really for metadata of the set itself, not of the mappings;
Slots that are really for metadata of the individual mappings, but that are allowed to be used once and for all at the level of the mapping set – presumably to avoid repeating the same metadata for all mappings, if said metadata is the same for all individual mappings in a set.

It is obvious that the metadata slots that are only allowed at the set level belong to the first category (e.g., mapping_set_id, mapping_set_title).

But for the slots that are allowed both at the set level and at the individual mapping level, it is not obvious how they should be interpreted when they are found at the set level.

If a set has a creator_id metadata for example, should that be understood to mean that the referenced creator is the creator of the set (which may not be the same as the creator(s) of the individual mappings that make up the set), or that it is the default creator of all the mappings of the set (at least all those that do not have an explicit value for that slot)?

The documentation of creator_id seems to suggest the latter because it refers to “mapping” in singular (“Identifies the persons or groups responsible for the creation of the mapping“), but I’d be wary of using English grammar to infer whether a metadata slot is intended to refer to an individual mapping or to a set of mappings… especially since the documentation of author_id, which is a slot that only exists for individual mappings (and so cannot refer to the “author” of a mapping set) uses “mappings” in plural!

For now, my assumption is that all the slots that exist both at the mapping level and at the mapping set level are all about the individual mapping, and that when they are used at the set level, it is only as a “shortcut” to avoid repeating the same metadata for all individual mappings (said otherwise: a slot never has a different meaning depending on whether it is used on an individual mapping or on a mapping set). But this is something that should be explicitly specified. All the spec says is:

Some SSSOM metadata elements apply only to one element or the other, but many can be applied to both.

which in my opinion is wholly inadequate.

The text was updated successfully, but these errors were encountered:

gouttegd · 2023-07-25T17:48:46Z

To further clarify: the issue above matters because it dictates whether metadata found at the set level should be propagated down to the mapping level, or not.

If setting creator_id at the set level is merely a shortcut to avoid explicitly setting creator_id on each mapping (as I’m currently assuming it is), then it is fine to propagate the value of that slot to all the mappings of the set when de-serializing a set – in fact, not only it is fine, but arguably that is what should be done.

On the contrary, if creator_id at the set level is intended to represent the creator(s) of the set itself, then its value should not be propagated down to the individual mappings.

matentzn · 2023-07-27T09:04:52Z

@gouttegd I am glad you raise this discussion, it irked me since the inception of SSSOM.

I have no good idea how to solve this. Lets start with the two most obvious ways: meta-modelling and default range assumption.

meta-modelling. We introduce a meta property like propagates_to_individual_mappings: true that can be added constraint on the mapping set slot using the slot_usage section in LinkML. This only need to be specified for slots at mapping set level.
Default range assumption: if the exact same slot exists in both mapping set and mapping set level, the above feature (propagates_to_individual_mappings: true) is assumed by default. This is documented as a global feature of the model.

Any other ideas?

gouttegd · 2023-07-27T09:27:14Z

Let’s get an answer to the following question then, about the intentions when the model was first designed:

Is my current assumption (stated in the first message) correct? Are all the slots that exist both at the mapping level and the mapping set level intended to be really about individual mappings, and never about the mapping set?

If that is the case, then your second proposition is fine, it just needs to be explicitly stated somewhere in the spec.

On the contrary, if at least some of the slots that exist both at the mapping level and at the mapping set level are intended to have a different meaning depending on the level they are used (for example, creator_id on a mapping refers to the creator of that particular mapping, but creator_id on a set refers to the creator of the entire set – with the understanding that “creator of the set” is not the same thing as “creator of all the mappings in the set”), then obviously your second proposition is not applicable. In that case the specification needs to state explicitly what is the meaning of such slots for each level.

gouttegd · 2023-07-27T10:01:15Z

Independently of the question of which slots should be propagated down to the mapping level (all of them or only some of them), the spec also needs to specify how values should be propagated.

In particular, we must know what to do when an individual mapping already has a different value for a slot that is also present at the set level.

For example:

#curie_map:
#  EXA: http://example.org/EXA_
#  EXB: http://example.org/EXB_
#mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id  mapping_tool
EXA:0001    skos:exactMatch  EXB:0001   
EXA:0002    skos:exactMatch  EXB:0002   Oracle Mapping Generator 3.87
EXA:0003    skos:exactMatch  EXB:0003

What is the desired behaviour here? I can see several options:

Probably the most obvious one is, propagate the value of mapping_tool to the mappings where the slot is empty, but for the mappings that already have a value there, the value at the mapping level takes precedence. In this example, the mappings for EXA:0001 and EXA:0003 would get the value “Microsoft MappingMaker 2000”, but the mapping for EXA:0002 would keep its own value “Oracle Mapping Generator 3.87”.
The value at the mapping set always takes precedence, regardless of what individual mappings say. In this case, all three mappings would get the value “Microsoft MappingMaker 2000”.
The value at the mapping level always takes precedence, even when it is empty. That would mean that propagation should only happen if the slot is completely absent from the mapping table (no column in the TSV file for that slot). In this example, the mappings for EXA:0001 and EXA:0003 would have no value set for mapping_tool, while the mapping for EXA:0002 would keep its own value.

matentzn · 2023-07-29T14:17:56Z

I intuitively prefer 1, but I am not certain. The advantage of 3 is that it is simpler to maintain in cases where, for example, a global field only makes sense in conjunction with certain justifications. It does not make sense to say: mapping_tool for mapping_justification: semapv:ManualMappingCuration.

I would be clearly against 2.

gouttegd · 2023-07-29T18:11:14Z

I intuitively prefer 1, but I am not certain.

I see two problems with option 1, which is why I tend to prefer option 3.

The first problem is: what about multi-valued fields? For example, if we have this:

#creator_id:
#  - ORCID:1
#  - ORCID:2
subject_id  predicate_id     object_id  creator_id
EXA:0001    skos:exactMatch  EXB:0001   
EXA:0002    skos:exactMatch  EXB:0002   ORCID:3|ORCID:4
EXA:0003    skos:exactMatch  EXB:0003

Under option 1), it is clear that the value of creator_id for the EXA:0001 and EXA:0003 mappings should be ORCID:1|ORCID:2, but what about the EXA:0002 mapping? It is almost certain that in this situation, some people would expect that the final creator_id value for EXA:0002 should be the union of the two lists (ORCID:1|ORCID:2|ORCID:3|ORCID:4) – and I don’t think that is an unreasonable expectation, so that’s another detail for which we would need to make a clear decision and to explain in the spec.

The second problem is about mappings with empty values. In the example above, what if the EXA:0003 has an empty value because we don’t actually know who created it? Under option 1), we can’t distinguish between a field that is empty because it is supposed to be empty and a field that is empty because it is supposed to take the default value propagated from the set level.

Option 3) solves those two problems by basically disallowing the mixing of a default value set at the set level and explicit values set at the individual mapping level.

To put it differently, option 3) means that setting a value at the set level as a “shortcut” to avoid setting it on each mapping is only allowed in the case where all mappings in a set have the same value, and the field is not even present in the TSV columns.

So even if you have 99 mappings with the same value A and just one mapping with a different value B, you MUST explicitly set the values on each individual mapping. You CANNOT set the value A at the set level and leave it empty on the 99 corresponding mappings – well, you can, but if you do that the value from the set level will not be taken into account: all 99 mappings will be treated as if they had no value.

I believe this is the safest and least confusing option.

matentzn · 2023-07-30T14:19:33Z

I think you are right - this is what we should do! Any other opinions?

gouttegd · 2023-07-30T14:26:35Z

To be clear, I slightly prefer option 3) but I think all three options are more or less equally reasonable, so I don’t mind which one is chosen in the end.

What matters to me is that one option is decided and that the corresponding expected behaviour is described in the spec, so that the behaviour is not left to the implementations to decide.

matentzn · 2023-07-30T14:44:59Z

We can ask other for their opinion, but lets move to the second issue: how to solve which slots should be propagated at all. I made a quick analysis:

Slot currently not globally specifiable but maybe should be?

mapping_justification
object_category
subject_category
semantic_similarity_measure
author_id
author_label

Slots exist on both levels, and should be propagated

mapping_date
mapping_provider
mapping_tool
mapping_tool_version
object_match_field
object_preprocessing
object_source
object_source_version
object_type
subject_match_field
subject_preprocessing
subject_source
subject_source_version
subject_type

Slot only exist on mapping level, and should not exist globally:

confidence
issue_tracker_item
mapping_cardinality
match_string
object_id
object_label
reviewer_id
reviewer_label
subject_id
subject_label
semantic_similarity_score (fixed after @gouttegd comment, was oversight)
predicate_id (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)
predicate_label (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)
predicate_modifier (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)

Slot exists on both levels and should not be propagated

comment
see_also
other
creator_id
creator_label

Slot exists on both levels, but should probably only exist on mapping set level

publication_date (thanks @gouttegd for suggestion)
license (@gouttegd suggestion)

EDITS after @gouttegd comments below.

gouttegd · 2023-07-30T15:13:40Z

Slot currently not globally specifiable but maybe should be?

semantic_similarity_score

How likely it is that all mappings in a set will have the same semantic similarity score, so that the value could be set only once at the set level?

I would have no objection to that, but it seems of dubious value. (Contrary to semantic_similarity_measure: it makes sense – though it would not always be the case – that the same measure has been used for all mappings in a set.)

Slots exist on both levels, and should be propagated

mapping_date

Makes sense. I note you did not mention the publication_date field, which also exists on both level.

My suggestion would be to make publication_date a mapping set value only. After all, individual mappings are never published on their own, only the mapping set they belong to is.

So we would have mapping_date which says when an individual mapping was asserted (and which MAY be set at the global level if all mappings in a set were asserted at the same time, and propagation will occur), and publication_date which says when a mapping set was published.

predicate_id

predicate_label

predicate_modifier

Unless I’m missing something those slots do not exist on both levels, they are on the mapping level only. And I think they should stay like that. Even if a mapping set only contains mappings that have the same predicate, I’d argue that the predicate is much too important to be set once and for all at the set level, and should always be set explicitly for each mapping.

license

This one is tricky. Propagating it would mean that the set-level license is not the license of the set, it is the license of all the mappings in the set (which is not the same thing).

In fact, if we follow the logic above about publication_date, where only a mapping set can ever said to be published (and thereby needs a publication date), then similarly only a mapping set needs a license.

No opinion for all the other slots.

matentzn · 2023-07-30T15:29:54Z

I updated all the suggestions @gouttegd according to your comments! I agree with all of them, thanks.

gouttegd · 2023-07-30T15:35:58Z

I have no good idea how to solve this. Lets start with the two most obvious ways: meta-modelling and default range assumption. […] Any other ideas?

It just occurred to me that there would be a third way, though it would involve a breaking change 😱 with the current version of the spec, so I don’t expect you to see it favourably. I mention it for completeness, though.

Add a new slot at the set level only called something like defaults (or default_values), where the mapping-level values that are the same for all the mappings in the set can be specified.

That is, instead of this:

#mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id
EXA:0001    skos:exactMatch  EXB:0001
EXA:0002    skos:exactMatch  EXB:0002
EXA:0003    skos:exactMatch  EXB:0003

we would have this:

#defaults:
#  mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id
EXA:0001    skos:exactMatch  EXB:0001
EXA:0002    skos:exactMatch  EXB:0002
EXA:0003    skos:exactMatch  EXB:0003

This would have two benefits:

It makes immediately obvious which values in the metadata block are really metadata about the set, and which ones are (default) metadata for the individual mappings, without having to refer to the spec to find the propagation rules. Everything that is under the defaults slot are individual mapping metadata and should be propagated, everything that is outside of defaults are mapping set metadata and should never be propagated.
It allows to use the same slot name for a metadata that we want to exist both at the set level and at the individual mapping level. For example, if we decide that the mapping set should have its own creation_date (independently of the creation date of the mappings), we could have

#creation_date: ... (date when the mapping SET was created)
#defaults:
#  creation_date: ... (default creation date for all individual mappings)

Which is something that would not be possible with the current system.

The obvious slightly minor inconvenient of this approach is, at noted above, that it is completely different from the current version of the spec and therefore would make several people unhappy (myself included, since the change would obviously affect SSSOM-Java!), at least for the beginning.

matentzn · 2023-07-31T14:31:34Z

This is not a bad idea! I am assuming it will cause some upheaval. I will run it by Chris in my next call with him and see what he thinks.

gouttegd · 2023-07-31T14:53:01Z

I am assuming it will cause some upheaval.

Certainly. And there is a reasonable argument to be made that if it has taken so much efforts to convince people to adopt SSSOM, pushing an incompatible change now would not be wise and could lead to those people deciding to drop the standard.

gouttegd · 2023-09-15T22:21:30Z

Any update on this?

If there is no objection to this comment, I’d like to start turning it into actual changes both in the specification (e.g., removing the publication_date slot from the Mapping class to keep it only in the MappingSet class) and in the documentation (to explain which slots should be propagated).

matentzn · 2023-09-20T09:10:10Z

I think the following we can enact right now as suggested by the comment you linked:

Slot currently not globally specifiable but maybe should be?
Slot only exist on mapping level, and should not exist globally:
Slot exists on both levels, but should probably only exist on mapping set level

The question of how to deal with default values is still open. I have now added it to my agenda with @cmungall to contemplate (forgot it last time), and see what we agree on.

EDIT: I actually did discuss this with him, and he said he had a "51% towards metamodel solution but open to other". The metamodel solution would be to add a flag like "propagates: true" to all the slots that can propagate and define a default behaviour for the possible conflicts you outlined above.

gouttegd · 2023-09-20T14:52:25Z

I think the following we can enact right now as suggested by the comment you linked:

OK, can prepare a PR to update the model to move slots around and update their description as needed.

The question of how to deal with default values is still open. […] he had a "51% towards metamodel solution but open to other". The metamodel solution would be to add a flag like "propagates: true" to all the slots that can propagate

I am fine with that solution, but I could use some help with the implementation, because it’s unclear to me how you can add an arbitrary flag to a slot in LinkML.

matentzn · 2023-09-21T17:00:42Z

You can probably reach out to @sierra-moxon (also on OBO slack) who is one of our LinkML master wizards.

Resolves [#305] - [x] `docs/` have been added/updated if necessary - [x] `make test` has been run locally - [ ] tests have been added/updated (not applicable) - [x] [CHANGELOG.md](https://github.com/mapping-commons/sssom/blob/master/CHANGELOG.md) has been updated. If you are proposing a change to the SSSOM metadata model, you must - [ ] provide a full, working and valid example in `examples/` (**not applicable**: no new example needed as the change only affects how some slots should be interpreted; it does not add or remove slots, nor does it change how the propagated slots are used) - [x] provide a link to the related GitHub issue in the `see_also` field of the linkml model - [ ] provide a link to a valid example in the `see_also` field of the linkml model (**not applicable**, same reason as above) This PR finalises the fix to #305, by explicitly specifying, directly within the LinkML model, which slots are considered “propagatable” (previously this was only informally described in the spec, since #368). This is done by: * adding a “metamodel extension class“ (`sssom:Propagatable`) with a single boolean-ranged attributed `propagated`; * amending the slots that must be considered propagatable by making them instantiate the `sssom:Propagatable` extension.

gouttegd · 2024-07-19T16:00:11Z

Done with #368 (in the spec/doc) and #371 (in the model).

The one thing that would remain to be decided is whether propagation/condensation is allowed when using other formats than SSSOM/TSV (e.g. JSON-LD), but that will need to be done as part of the efforts to actually specify those formats.

For what it’s worth, I believe condensation (and therefore propagation) does not make much sense for both the RDF and the JSON-LD serialisations. Those serialisations are very ”verbose” anyway, trying to save space by condensing slots with the same value to the level of the set would be pointless. That being said, I am not (strongly) opposed to the idea.

graybeal · 2024-07-19T19:41:33Z

Regarding this:

The one thing that would remain to be decided is whether propagation/condensation is allowed when using other formats than SSSOM/TSV (e.g. JSON-LD), but that will need to be done as part of the efforts to actually specify those formats.

Propagation and condensation are characteristics of model interpretation (driven by settings within the mapping metadata, or maybe I should say metamodel), are they not? Their should not be any difference in how the model is interpreted whether it is TSV, JSON-LD, or RDF, should there? Under the metamodel solution, the interpretation is explicitly defined for SSSOM content.

I can't see any obvious issue in representing the same characteristics in those other formats. The top-level metadata will define what propagates and can be expressed in any of the formats.

And the biggest value of this capability is not really about saving space, even in TSV you're not saving much space. It's about cognitive load—you don't want someone to have to manually review every mapping to make sure the metadata attribute has been added to that particular mapping and is exactly the same. Just scanning (or searching) for exceptions is way easier, in any format.

gouttegd · 2024-07-19T20:50:35Z

Their should not be any difference in how the model is interpreted whether it is TSV, JSON-LD, or RDF, should there?

There could be, if we decided to.

I can't see any obvious issue in representing the same characteristics in those other formats.

I don’t see any issue either. I just also don’t see the real value in doing so.

But at the same time, I do see the value of not having to treat the different serialisation formats differently, which is why I said that I am not strongly opposed to it.

If we decide to treat condensation/propagation as a general characteristic of the model that is independent of the serialisation, then all that needs to happen is to move the description of the condensation/propagation operations from the section about the TSV format (where it currently resides) to the general section about the data model.

gouttegd mentioned this issue Nov 9, 2023

The spec/doc should be reorganised #330

Closed

gouttegd mentioned this issue Jul 2, 2024

Reorganise docs #368

Merged

4 tasks

gouttegd mentioned this issue Jul 11, 2024

Model propagatable slots #371

Merged

7 tasks

gouttegd closed this as completed Jul 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

gouttegd commented Jul 25, 2023

gouttegd commented Jul 25, 2023

matentzn commented Jul 27, 2023

gouttegd commented Jul 27, 2023

gouttegd commented Jul 27, 2023

matentzn commented Jul 29, 2023

gouttegd commented Jul 29, 2023

matentzn commented Jul 30, 2023

gouttegd commented Jul 30, 2023

matentzn commented Jul 30, 2023 •

edited

Loading

gouttegd commented Jul 30, 2023

matentzn commented Jul 30, 2023

gouttegd commented Jul 30, 2023

matentzn commented Jul 31, 2023

gouttegd commented Jul 31, 2023

gouttegd commented Sep 15, 2023

matentzn commented Sep 20, 2023 •

edited

Loading

gouttegd commented Sep 20, 2023 •

edited

Loading

matentzn commented Sep 21, 2023

gouttegd commented Jul 19, 2024 •

edited

Loading

graybeal commented Jul 19, 2024

gouttegd commented Jul 19, 2024

Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

Comments

gouttegd commented Jul 25, 2023

gouttegd commented Jul 25, 2023

matentzn commented Jul 27, 2023

gouttegd commented Jul 27, 2023

gouttegd commented Jul 27, 2023

matentzn commented Jul 29, 2023

gouttegd commented Jul 29, 2023

matentzn commented Jul 30, 2023

gouttegd commented Jul 30, 2023

matentzn commented Jul 30, 2023 • edited Loading

Slot currently not globally specifiable but maybe should be?

Slots exist on both levels, and should be propagated

Slot only exist on mapping level, and should not exist globally:

Slot exists on both levels and should not be propagated

Slot exists on both levels, but should probably only exist on mapping set level

gouttegd commented Jul 30, 2023

matentzn commented Jul 30, 2023

gouttegd commented Jul 30, 2023

matentzn commented Jul 31, 2023

gouttegd commented Jul 31, 2023

gouttegd commented Sep 15, 2023

matentzn commented Sep 20, 2023 • edited Loading

gouttegd commented Sep 20, 2023 • edited Loading

matentzn commented Sep 21, 2023

gouttegd commented Jul 19, 2024 • edited Loading

graybeal commented Jul 19, 2024

gouttegd commented Jul 19, 2024

matentzn commented Jul 30, 2023 •

edited

Loading

matentzn commented Sep 20, 2023 •

edited

Loading

gouttegd commented Sep 20, 2023 •

edited

Loading

gouttegd commented Jul 19, 2024 •

edited

Loading