Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify meaning of slots that exist both at mapping-level and at mapping set-level #305

Closed
gouttegd opened this issue Jul 25, 2023 · 21 comments

Comments

@gouttegd
Copy link
Contributor

The mapping set class seems to have two very different types of metadata slots:

  • Slots that are really for metadata of the set itself, not of the mappings;
  • Slots that are really for metadata of the individual mappings, but that are allowed to be used once and for all at the level of the mapping set – presumably to avoid repeating the same metadata for all mappings, if said metadata is the same for all individual mappings in a set.

It is obvious that the metadata slots that are only allowed at the set level belong to the first category (e.g., mapping_set_id, mapping_set_title).

But for the slots that are allowed both at the set level and at the individual mapping level, it is not obvious how they should be interpreted when they are found at the set level.

If a set has a creator_id metadata for example, should that be understood to mean that the referenced creator is the creator of the set (which may not be the same as the creator(s) of the individual mappings that make up the set), or that it is the default creator of all the mappings of the set (at least all those that do not have an explicit value for that slot)?

The documentation of creator_id seems to suggest the latter because it refers to “mapping” in singular (“Identifies the persons or groups responsible for the creation of the mapping“), but I’d be wary of using English grammar to infer whether a metadata slot is intended to refer to an individual mapping or to a set of mappings… especially since the documentation of author_id, which is a slot that only exists for individual mappings (and so cannot refer to the “author” of a mapping set) uses “mappings” in plural!

For now, my assumption is that all the slots that exist both at the mapping level and at the mapping set level are all about the individual mapping, and that when they are used at the set level, it is only as a “shortcut” to avoid repeating the same metadata for all individual mappings (said otherwise: a slot never has a different meaning depending on whether it is used on an individual mapping or on a mapping set). But this is something that should be explicitly specified. All the spec says is:

Some SSSOM metadata elements apply only to one element or the other, but many can be applied to both.

which in my opinion is wholly inadequate.

@gouttegd
Copy link
Contributor Author

To further clarify: the issue above matters because it dictates whether metadata found at the set level should be propagated down to the mapping level, or not.

If setting creator_id at the set level is merely a shortcut to avoid explicitly setting creator_id on each mapping (as I’m currently assuming it is), then it is fine to propagate the value of that slot to all the mappings of the set when de-serializing a set – in fact, not only it is fine, but arguably that is what should be done.

On the contrary, if creator_id at the set level is intended to represent the creator(s) of the set itself, then its value should not be propagated down to the individual mappings.

@matentzn
Copy link
Collaborator

@gouttegd I am glad you raise this discussion, it irked me since the inception of SSSOM.

I have no good idea how to solve this. Lets start with the two most obvious ways: meta-modelling and default range assumption.

  1. meta-modelling. We introduce a meta property like propagates_to_individual_mappings: true that can be added constraint on the mapping set slot using the slot_usage section in LinkML. This only need to be specified for slots at mapping set level.
  2. Default range assumption: if the exact same slot exists in both mapping set and mapping set level, the above feature (propagates_to_individual_mappings: true) is assumed by default. This is documented as a global feature of the model.

Any other ideas?

@gouttegd
Copy link
Contributor Author

Let’s get an answer to the following question then, about the intentions when the model was first designed:

Is my current assumption (stated in the first message) correct? Are all the slots that exist both at the mapping level and the mapping set level intended to be really about individual mappings, and never about the mapping set?

If that is the case, then your second proposition is fine, it just needs to be explicitly stated somewhere in the spec.

On the contrary, if at least some of the slots that exist both at the mapping level and at the mapping set level are intended to have a different meaning depending on the level they are used (for example, creator_id on a mapping refers to the creator of that particular mapping, but creator_id on a set refers to the creator of the entire set – with the understanding that “creator of the set” is not the same thing as “creator of all the mappings in the set”), then obviously your second proposition is not applicable. In that case the specification needs to state explicitly what is the meaning of such slots for each level.

@gouttegd
Copy link
Contributor Author

Independently of the question of which slots should be propagated down to the mapping level (all of them or only some of them), the spec also needs to specify how values should be propagated.

In particular, we must know what to do when an individual mapping already has a different value for a slot that is also present at the set level.

For example:

#curie_map:
#  EXA: http://example.org/EXA_
#  EXB: http://example.org/EXB_
#mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id  mapping_tool
EXA:0001    skos:exactMatch  EXB:0001   
EXA:0002    skos:exactMatch  EXB:0002   Oracle Mapping Generator 3.87
EXA:0003    skos:exactMatch  EXB:0003

What is the desired behaviour here? I can see several options:

  1. Probably the most obvious one is, propagate the value of mapping_tool to the mappings where the slot is empty, but for the mappings that already have a value there, the value at the mapping level takes precedence. In this example, the mappings for EXA:0001 and EXA:0003 would get the value “Microsoft MappingMaker 2000”, but the mapping for EXA:0002 would keep its own value “Oracle Mapping Generator 3.87”.

  2. The value at the mapping set always takes precedence, regardless of what individual mappings say. In this case, all three mappings would get the value “Microsoft MappingMaker 2000”.

  3. The value at the mapping level always takes precedence, even when it is empty. That would mean that propagation should only happen if the slot is completely absent from the mapping table (no column in the TSV file for that slot). In this example, the mappings for EXA:0001 and EXA:0003 would have no value set for mapping_tool, while the mapping for EXA:0002 would keep its own value.

@matentzn
Copy link
Collaborator

I intuitively prefer 1, but I am not certain. The advantage of 3 is that it is simpler to maintain in cases where, for example, a global field only makes sense in conjunction with certain justifications. It does not make sense to say: mapping_tool for mapping_justification: semapv:ManualMappingCuration.

I would be clearly against 2.

@gouttegd
Copy link
Contributor Author

I intuitively prefer 1, but I am not certain.

I see two problems with option 1, which is why I tend to prefer option 3.

The first problem is: what about multi-valued fields? For example, if we have this:

#creator_id:
#  - ORCID:1
#  - ORCID:2
subject_id  predicate_id     object_id  creator_id
EXA:0001    skos:exactMatch  EXB:0001   
EXA:0002    skos:exactMatch  EXB:0002   ORCID:3|ORCID:4
EXA:0003    skos:exactMatch  EXB:0003

Under option 1), it is clear that the value of creator_id for the EXA:0001 and EXA:0003 mappings should be ORCID:1|ORCID:2, but what about the EXA:0002 mapping? It is almost certain that in this situation, some people would expect that the final creator_id value for EXA:0002 should be the union of the two lists (ORCID:1|ORCID:2|ORCID:3|ORCID:4) – and I don’t think that is an unreasonable expectation, so that’s another detail for which we would need to make a clear decision and to explain in the spec.

The second problem is about mappings with empty values. In the example above, what if the EXA:0003 has an empty value because we don’t actually know who created it? Under option 1), we can’t distinguish between a field that is empty because it is supposed to be empty and a field that is empty because it is supposed to take the default value propagated from the set level.

Option 3) solves those two problems by basically disallowing the mixing of a default value set at the set level and explicit values set at the individual mapping level.

To put it differently, option 3) means that setting a value at the set level as a “shortcut” to avoid setting it on each mapping is only allowed in the case where all mappings in a set have the same value, and the field is not even present in the TSV columns.

So even if you have 99 mappings with the same value A and just one mapping with a different value B, you MUST explicitly set the values on each individual mapping. You CANNOT set the value A at the set level and leave it empty on the 99 corresponding mappings ­– well, you can, but if you do that the value from the set level will not be taken into account: all 99 mappings will be treated as if they had no value.

I believe this is the safest and least confusing option.

@matentzn
Copy link
Collaborator

I think you are right - this is what we should do! Any other opinions?

@gouttegd
Copy link
Contributor Author

To be clear, I slightly prefer option 3) but I think all three options are more or less equally reasonable, so I don’t mind which one is chosen in the end.

What matters to me is that one option is decided and that the corresponding expected behaviour is described in the spec, so that the behaviour is not left to the implementations to decide.

@matentzn
Copy link
Collaborator

matentzn commented Jul 30, 2023

We can ask other for their opinion, but lets move to the second issue: how to solve which slots should be propagated at all. I made a quick analysis:

Slot currently not globally specifiable but maybe should be?

  • mapping_justification
  • object_category
  • subject_category
  • semantic_similarity_measure
  • author_id
  • author_label

Slots exist on both levels, and should be propagated

  • mapping_date
  • mapping_provider
  • mapping_tool
  • mapping_tool_version
  • object_match_field
  • object_preprocessing
  • object_source
  • object_source_version
  • object_type
  • subject_match_field
  • subject_preprocessing
  • subject_source
  • subject_source_version
  • subject_type

Slot only exist on mapping level, and should not exist globally:

  • confidence
  • issue_tracker_item
  • mapping_cardinality
  • match_string
  • object_id
  • object_label
  • reviewer_id
  • reviewer_label
  • subject_id
  • subject_label
  • semantic_similarity_score (fixed after @gouttegd comment, was oversight)
  • predicate_id (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)
  • predicate_label (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)
  • predicate_modifier (originally suggested by @matentzn as candidate for globally specifyable, but contested by @gouttegd as too important; @matentzn concurs after review)

Slot exists on both levels and should not be propagated

  • comment
  • see_also
  • other
  • creator_id
  • creator_label

Slot exists on both levels, but should probably only exist on mapping set level

EDITS after @gouttegd comments below.

@gouttegd
Copy link
Contributor Author

Slot currently not globally specifiable but maybe should be?

  • semantic_similarity_score

How likely it is that all mappings in a set will have the same semantic similarity score, so that the value could be set only once at the set level?

I would have no objection to that, but it seems of dubious value. (Contrary to semantic_similarity_measure: it makes sense – though it would not always be the case – that the same measure has been used for all mappings in a set.)

Slots exist on both levels, and should be propagated

  • mapping_date

Makes sense. I note you did not mention the publication_date field, which also exists on both level.

My suggestion would be to make publication_date a mapping set value only. After all, individual mappings are never published on their own, only the mapping set they belong to is.

So we would have mapping_date which says when an individual mapping was asserted (and which MAY be set at the global level if all mappings in a set were asserted at the same time, and propagation will occur), and publication_date which says when a mapping set was published.

  • predicate_id
  • predicate_label
  • predicate_modifier

Unless I’m missing something those slots do not exist on both levels, they are on the mapping level only. And I think they should stay like that. Even if a mapping set only contains mappings that have the same predicate, I’d argue that the predicate is much too important to be set once and for all at the set level, and should always be set explicitly for each mapping.

  • license

This one is tricky. Propagating it would mean that the set-level license is not the license of the set, it is the license of all the mappings in the set (which is not the same thing).

In fact, if we follow the logic above about publication_date, where only a mapping set can ever said to be published (and thereby needs a publication date), then similarly only a mapping set needs a license.

No opinion for all the other slots.

@matentzn
Copy link
Collaborator

I updated all the suggestions @gouttegd according to your comments! I agree with all of them, thanks.

@gouttegd
Copy link
Contributor Author

I have no good idea how to solve this. Lets start with the two most obvious ways: meta-modelling and default range assumption. […] Any other ideas?

It just occurred to me that there would be a third way, though it would involve a breaking change 😱 with the current version of the spec, so I don’t expect you to see it favourably. I mention it for completeness, though.

Add a new slot at the set level only called something like defaults (or default_values), where the mapping-level values that are the same for all the mappings in the set can be specified.

That is, instead of this:

#mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id
EXA:0001    skos:exactMatch  EXB:0001
EXA:0002    skos:exactMatch  EXB:0002
EXA:0003    skos:exactMatch  EXB:0003

we would have this:

#defaults:
#  mapping_tool: "Microsoft MappingMaker 2000"
subject_id  predicate_id     object_id
EXA:0001    skos:exactMatch  EXB:0001
EXA:0002    skos:exactMatch  EXB:0002
EXA:0003    skos:exactMatch  EXB:0003

This would have two benefits:

  1. It makes immediately obvious which values in the metadata block are really metadata about the set, and which ones are (default) metadata for the individual mappings, without having to refer to the spec to find the propagation rules. Everything that is under the defaults slot are individual mapping metadata and should be propagated, everything that is outside of defaults are mapping set metadata and should never be propagated.

  2. It allows to use the same slot name for a metadata that we want to exist both at the set level and at the individual mapping level. For example, if we decide that the mapping set should have its own creation_date (independently of the creation date of the mappings), we could have

#creation_date: ... (date when the mapping SET was created)
#defaults:
#  creation_date: ... (default creation date for all individual mappings)

Which is something that would not be possible with the current system.

The obvious slightly minor inconvenient of this approach is, at noted above, that it is completely different from the current version of the spec and therefore would make several people unhappy (myself included, since the change would obviously affect SSSOM-Java!), at least for the beginning.

@matentzn
Copy link
Collaborator

This is not a bad idea! I am assuming it will cause some upheaval. I will run it by Chris in my next call with him and see what he thinks.

@gouttegd
Copy link
Contributor Author

I am assuming it will cause some upheaval.

Certainly. And there is a reasonable argument to be made that if it has taken so much efforts to convince people to adopt SSSOM, pushing an incompatible change now would not be wise and could lead to those people deciding to drop the standard.

@gouttegd
Copy link
Contributor Author

Any update on this?

If there is no objection to this comment, I’d like to start turning it into actual changes both in the specification (e.g., removing the publication_date slot from the Mapping class to keep it only in the MappingSet class) and in the documentation (to explain which slots should be propagated).

@matentzn
Copy link
Collaborator

matentzn commented Sep 20, 2023

I think the following we can enact right now as suggested by the comment you linked:

  1. Slot currently not globally specifiable but maybe should be?
  2. Slot only exist on mapping level, and should not exist globally:
  3. Slot exists on both levels, but should probably only exist on mapping set level

The question of how to deal with default values is still open. I have now added it to my agenda with @cmungall to contemplate (forgot it last time), and see what we agree on.

EDIT: I actually did discuss this with him, and he said he had a "51% towards metamodel solution but open to other". The metamodel solution would be to add a flag like "propagates: true" to all the slots that can propagate and define a default behaviour for the possible conflicts you outlined above.

@gouttegd
Copy link
Contributor Author

gouttegd commented Sep 20, 2023

I think the following we can enact right now as suggested by the comment you linked:

OK, can prepare a PR to update the model to move slots around and update their description as needed.

The question of how to deal with default values is still open. […] he had a "51% towards metamodel solution but open to other". The metamodel solution would be to add a flag like "propagates: true" to all the slots that can propagate

I am fine with that solution, but I could use some help with the implementation, because it’s unclear to me how you can add an arbitrary flag to a slot in LinkML.

@matentzn
Copy link
Collaborator

You can probably reach out to @sierra-moxon (also on OBO slack) who is one of our LinkML master wizards.

@gouttegd gouttegd mentioned this issue Jul 2, 2024
4 tasks
@gouttegd gouttegd mentioned this issue Jul 11, 2024
7 tasks
gouttegd added a commit that referenced this issue Jul 19, 2024
Resolves [#305]

- [x] `docs/` have been added/updated if necessary
- [x] `make test` has been run locally
- [ ] tests have been added/updated (not applicable)
- [x]
[CHANGELOG.md](https://github.com/mapping-commons/sssom/blob/master/CHANGELOG.md)
has been updated.

If you are proposing a change to the SSSOM metadata model, you must 

- [ ] provide a full, working and valid example in `examples/` (**not
applicable**: no new example needed as the change only affects how some
slots should be interpreted; it does not add or remove slots, nor does
it change how the propagated slots are used)
- [x] provide a link to the related GitHub issue in the `see_also` field
of the linkml model
- [ ] provide a link to a valid example in the `see_also` field of the
linkml model (**not applicable**, same reason as above)

This PR finalises the fix to #305, by explicitly specifying, directly
within the LinkML model, which slots are considered “propagatable”
(previously this was only informally described in the spec, since #368).
This is done by:

* adding a “metamodel extension class“ (`sssom:Propagatable`) with a
single boolean-ranged attributed `propagated`;
* amending the slots that must be considered propagatable by making them
instantiate the `sssom:Propagatable` extension.
@gouttegd
Copy link
Contributor Author

gouttegd commented Jul 19, 2024

Done with #368 (in the spec/doc) and #371 (in the model).

The one thing that would remain to be decided is whether propagation/condensation is allowed when using other formats than SSSOM/TSV (e.g. JSON-LD), but that will need to be done as part of the efforts to actually specify those formats.

For what it’s worth, I believe condensation (and therefore propagation) does not make much sense for both the RDF and the JSON-LD serialisations. Those serialisations are very ”verbose” anyway, trying to save space by condensing slots with the same value to the level of the set would be pointless. That being said, I am not (strongly) opposed to the idea.

@graybeal
Copy link

Regarding this:

The one thing that would remain to be decided is whether propagation/condensation is allowed when using other formats than SSSOM/TSV (e.g. JSON-LD), but that will need to be done as part of the efforts to actually specify those formats.

Propagation and condensation are characteristics of model interpretation (driven by settings within the mapping metadata, or maybe I should say metamodel), are they not? Their should not be any difference in how the model is interpreted whether it is TSV, JSON-LD, or RDF, should there? Under the metamodel solution, the interpretation is explicitly defined for SSSOM content.

I can't see any obvious issue in representing the same characteristics in those other formats. The top-level metadata will define what propagates and can be expressed in any of the formats.

And the biggest value of this capability is not really about saving space, even in TSV you're not saving much space. It's about cognitive load—you don't want someone to have to manually review every mapping to make sure the metadata attribute has been added to that particular mapping and is exactly the same. Just scanning (or searching) for exceptions is way easier, in any format.

@gouttegd
Copy link
Contributor Author

Their should not be any difference in how the model is interpreted whether it is TSV, JSON-LD, or RDF, should there?

There could be, if we decided to.

I can't see any obvious issue in representing the same characteristics in those other formats.

I don’t see any issue either. I just also don’t see the real value in doing so.

But at the same time, I do see the value of not having to treat the different serialisation formats differently, which is why I said that I am not strongly opposed to it.

If we decide to treat condensation/propagation as a general characteristic of the model that is independent of the serialisation, then all that needs to happen is to move the description of the condensation/propagation operations from the section about the TSV format (where it currently resides) to the general section about the data model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants