Add score for how close two terms are when using broad/narrow/close predicates #36

Open
sbello opened this issue Aug 4, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@sbello
Collaborator

sbello commented Aug 4, 2023

At the last meeting (8/3/23), @matentzn proposed adding a score to the manual mappings indicating how close two terms are when mapped using broad/narrow/close. I'm using this ticket to write up initial thoughts and track proposals for implementing this.

My initial proposal is to estimate how often a given match would result in the inclusion of unwanted data when traversing from the narrower term to the broader term; basically, how much noise is inherent in the match. The consideration has to be from narrow to broad, since all annotations to the narrower term are, or should be, applicable to the broader term. If this is not the case, you should use related instead.

The proposed scale is 0-1, where 1 is an exact match; you should never actually use 1, since exact matches should use the skos:exactMatch predicate.

I've essentially been treating close as almost, but not quite, an exact match, so these should have a high score on the scale.

Given that this is at best a rough estimate, I'm going to stick with 1 decimal place for now. So a score of:
0.9 = little noise; almost everything should be useful. I think these should mostly be skos:closeMatch.
0.5 = moderately noisy; should be broad/narrow/related.
0.1 = very noisy; still broad/narrow, but the terms are several steps away from each other in the hierarchies of the ontologies. I would often question the value of even making the mapping, and I would not make 'related' mappings that were this noisy.
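The scale above could be sketched as a simple lookup from score to suggested predicate. This is a hypothetical illustration; the thresholds and the function name are assumptions, not part of any specification:

```python
# Hypothetical sketch: map a subjective 0-1 closeness score to a suggested
# SKOS mapping predicate, following the scale proposed above.
# Thresholds are illustrative assumptions only.

def suggest_predicate(score: float) -> str:
    """Return a suggested SKOS mapping predicate for a closeness score.

    A score of 1.0 means an exact match, which should use skos:exactMatch
    rather than a scored broad/narrow/close mapping.
    """
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    if score == 1.0:
        return "skos:exactMatch"
    if score >= 0.9:
        return "skos:closeMatch"  # little noise
    if score >= 0.5:
        return "skos:broadMatch/narrowMatch/relatedMatch"  # moderately noisy
    return "skos:broadMatch/narrowMatch"  # very noisy; consider not mapping

print(suggest_predicate(0.9))  # skos:closeMatch
```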

@sbello sbello self-assigned this Aug 4, 2023
@sbello sbello added the enhancement New feature or request label Aug 4, 2023
@matentzn
Contributor

matentzn commented Aug 5, 2023

I think this could be a very valuable addition. Thank you @sbello for taking the time to write this up.

There are a few alternatives we could use to handle this:

  • S1: predicate_id + semantic_similarity_score - basically the semantic similarity score expresses the "distance", while the predicate_id expresses the "direction".
  • S2: predicate_id + distance (or something similar in numeric scale), same as above, but using a different property
  • S3: predicate_id + distance (or something similar on a categorical scale, e.g. semantic_similarity_category). Basically, due to the subjectivity in determining the "distance", you could translate the scale you propose into "low", "moderate", "high" and use these as categorical variables rather than arbitrary numeric ones.
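To make the three alternatives concrete, here is a sketch of how each might look as an SSSOM-style mapping record. The term IDs are invented, and the field names other than predicate_id and semantic_similarity_score (i.e. distance and semantic_similarity_category) are assumptions from this discussion, not existing metadata elements:

```python
# Illustrative sketch of the three alternatives as mapping records.
# Term IDs are made up; "distance" and "semantic_similarity_category"
# are hypothetical field names proposed in this thread.

s1 = {  # S1: reuse semantic_similarity_score for the subjective distance
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "semantic_similarity_score": 0.5,
}

s2 = {  # S2: a new numeric field
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "distance": 0.5,
}

s3 = {  # S3: a new categorical field
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "semantic_similarity_category": "moderate",
}
```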

I think I like S3 and S1.

  • S1 has the advantage that we don't need a new metadata field, and the disadvantage that there will be a lot of "noise" in the subjective numeric field, which will make it harder for data scientists to process the mapping.
  • S3 has the advantage that you can process it much more easily, and there is going to be much less noise, but.. you need a new metadata field.

I think after some contemplation I am tending towards S3, but I can be convinced to do S1. I am a bit less enthusiastic about S2 because it has the disadvantages of both, and its only advantage is that the semantic_similarity_score metadata element is not repurposed slightly.. idiosyncratically.

@joeflack4

joeflack4 commented Aug 5, 2023

I wonder if there's any relevant literature on this topic for precisely this case. I feel there must be some existing thinking on this topic, and I feel like there are a variety of different algorithms you could use to derive such a distance score.

Algorithms

S1. Semantic similarity

I think that S1 certainly sounds the easiest, but semantic similarity alone definitely doesn't really get at the extent of broadness/narrowness.

S2-3.a. Distance as a flattening of existing narrow/broad hierarchies

Using existing ontologizations
I'm thinking that sometimes you could get at how broad/narrow something is based on whether there is any existing ontologization. E.g. if you have an existing ontology where A narrower B narrower C, you could say A narrower C w/ 'distance 2' in this respect. Maybe A, B, and C are in the same ontology, or maybe they are in different ontologies but have these connections; I suppose the distance would still be the same.
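The "distance as hops in an existing narrower-than hierarchy" idea can be sketched as a breadth-first search over parent edges. The edge data and term names below are invented for illustration:

```python
from collections import deque
from typing import Optional

# Toy hierarchy for illustration: A narrower B narrower C.
narrower_than = {  # child -> list of direct broader parents
    "A": ["B"],
    "B": ["C"],
}

def hierarchy_distance(narrow: str, broad: str) -> Optional[int]:
    """Number of narrower-than hops from `narrow` up to `broad` (BFS)."""
    queue = deque([(narrow, 0)])
    seen = {narrow}
    while queue:
        term, dist = queue.popleft()
        if term == broad:
            return dist
        for parent in narrower_than.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None  # no path: terms are not related via this hierarchy

print(hierarchy_distance("A", "C"))  # 2
```

This treats cross-ontology connections the same as intra-ontology ones, matching the observation above that the distance would be the same either way.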

On the fly ontologization
Maybe there are existing terms A and D in some ontology(s). Term A is narrower than term D, and there are no existing terms B or C in between, but we know that they exist. Perhaps in this process, rather than saying there is a distance of 1 simply because only A and D currently exist, we say there is a distance of 3 because we know B and C do exist and should ideally be part of the ontology.

S2-3b. Distance as a composite of other properties

As in 'depth of blueness' or 'condition of vehicle' in example below.

Multi-branch averaging; car example

Maybe terms are connected in more than one respect / via more than one branch, and maybe the branches have different lengths. Like A is narrower than B in 2 different respects, and 1 of those respects is very narrow while the other is narrow but not as narrow. I don't know how often this actually happens; just thinking of an example. Let's say we have 2 cars. A is 'average blue' and in generally poor condition but functional. B is dark blue and perhaps has a flat tire. Via the 'blueness' branch/vector, maybe we could say there is a short narrow distance. But on the condition branch/vector, maybe the narrowness is farther. For example, 'flat tire' narrower ('damaged tires' or 'low tire pressure') narrower 'poor condition'. Maybe not a perfect example, but it's just to illustrate that things can be broad/narrow in more than one way, and an ideal distance metric would take that into account.

Something that would make this even more complicated is if the 'weights' of the branches differ. Like, if we're comparing the two cars above, maybe for some reason in our classification system, the color of the car is more important, so we would care about the narrowness/broadness along the branch of color as more important than the branch of condition.
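Combining per-branch distances with unequal weights, as in the car example, could be sketched as a weighted average. All numbers, branch names, and the function itself are invented for illustration:

```python
# Sketch of combining per-branch broad/narrow distances with weights.
# Here the colour branch is weighted more heavily than the condition
# branch, as in the car example above. Values are made up.

def weighted_distance(branch_distances: dict, weights: dict) -> float:
    """Weighted average of per-branch distances (weights need not sum to 1)."""
    total_weight = sum(weights[b] for b in branch_distances)
    weighted_sum = sum(branch_distances[b] * weights[b] for b in branch_distances)
    return weighted_sum / total_weight

branches = {"colour": 1.0, "condition": 2.0}  # hops along each branch
weights = {"colour": 0.7, "condition": 0.3}   # colour matters more here

print(round(weighted_distance(branches, weights), 2))  # 1.3
```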


I think 0-1 is good. I think that the most robust distance metric would be S2 (continuous) rather than S3 (discrete), though S3 could be "quantized", e.g. low=0.33, moderate=0.5, high=0.66.
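The quantization idea is just a lookup from category to numeric value; the specific numbers below come from this comment and are examples, not a standard:

```python
# Hypothetical quantization of the S3 categorical scale into S2-style
# numbers, using the example values from this comment.

CATEGORY_TO_SCORE = {"low": 0.33, "moderate": 0.5, "high": 0.66}

def quantize(category: str) -> float:
    """Translate an S3 category into an S2-style numeric distance."""
    return CATEGORY_TO_SCORE[category]

print(quantize("moderate"))  # 0.5
```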

After thinking about this, I think semantic similarity is really just a proxy for distance. If we used semantic similarity as distance, I would recommend 2 fields: 'distance' and 'distance_algorithm', where distance_algorithm would be 'semantic similarity'. Or we could simply have 2 different fields: 1 for 'distance' (using some other algorithm), plus a 2nd 'semantic similarity' field.

And I can't think of too many other ways to think of distance. I can think of it as discrete: representative of explicit relationships between entities, as in S2-3a/b above. And I can think of distance on a continuous spectrum, like in the 'average' vs 'dark blue' car example; the real color spectrum is not discrete like that. But I think for our use cases, things are likely to be discrete.

I would be interested to know if anyone has any other ideas as to what 'distance' really can represent here.

@matentzn
Contributor

matentzn commented Aug 6, 2023

Thank you @joeflack4 for your thoughts! I think this is all reasonable thinking; however, the main problem is that we are not that interested in "technical" distance as such, we are interested in real-world distance. This means that even if the two aligned ontologies contain only a single term each, those two terms could still be more or less distant from each other, and there is no semantic similarity to compute at all. Indeed, when aligning two semantic spaces, there is not really a proper semantic similarity score a la Jaccard in most cases. There is only the conceptual model in the heads of the experts (the embedding space in the brain), and how distant the two terms are according to that. This is what the issuer @sbello is trying to capture here.
