Add score for how close two terms are when using broad/narrow/close predicates #36

Open
sbello opened this issue Aug 4, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@sbello
Collaborator

sbello commented Aug 4, 2023

At the last meeting (8/3/23), @matentzn proposed adding a score to the manual mappings indicating how close two terms are when mapped using broad/narrow/close. I'm using this ticket to write up initial thoughts and track proposals for implementing this.

My initial proposal is to estimate how often a given match would result in the inclusion of unwanted data when traversing from the narrower term to the broader term; basically, how much noise is inherent in the match. The consideration has to be from narrow to broad, since all annotations to the narrower term are, or should be, applicable to the broader term. If this is not the case, you should use related instead.

The proposed scale is 0-1, where 1 is an exact match; you should never actually use 1, since exact matches should use the skos:exactMatch predicate.

I've essentially been treating close as almost, but not quite, an exact match, so these should have a high score on the scale.

Given that this is at best a rough estimate, I'm going to stick with 1 decimal place for now. So a score of:
0.9 = little noise; almost everything should be useful. I think these should mostly be skos:closeMatch.
0.5 = moderately noisy; should be broad/narrow/related.
0.1 = very noisy; still broad/narrow, but the terms are several steps away from each other in the hierarchies of the ontologies. I would often question the value of even making the mapping, and I would not make 'related' mappings that were this noisy.
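The scale above could be sketched as a simple lookup from score to suggested predicate. This is a hypothetical illustration; the thresholds and the function name are assumptions, not part of any specification:

```python
# Hypothetical sketch: map a subjective 0-1 closeness score to a suggested
# SKOS mapping predicate, following the scale proposed above.
# Thresholds are illustrative assumptions only.

def suggest_predicate(score: float) -> str:
    """Return a suggested SKOS mapping predicate for a closeness score.

    A score of 1.0 means an exact match, which should use skos:exactMatch
    rather than a scored broad/narrow/close mapping.
    """
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    if score == 1.0:
        return "skos:exactMatch"
    if score >= 0.9:
        return "skos:closeMatch"  # little noise
    if score >= 0.5:
        return "skos:broadMatch/narrowMatch/relatedMatch"  # moderately noisy
    return "skos:broadMatch/narrowMatch"  # very noisy; consider not mapping

print(suggest_predicate(0.9))  # skos:closeMatch
```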

@sbello sbello self-assigned this Aug 4, 2023
@sbello sbello added the enhancement New feature or request label Aug 4, 2023
@matentzn
Contributor

matentzn commented Aug 5, 2023

I think this could be a very valuable addition. Thank you @sbello for taking the time to write this up.

There are a few alternatives we could use to handle this:

  • S1: predicate_id + semantic_similarity_score - basically the semantic similarity score expresses the "distance", while the predicate_id expresses the "direction".
  • S2: predicate_id + distance (or something similar in numeric scale), same as above, but using a different property
  • S3: predicate_id + distance (or something similar on a categorical scale, e.g. semantic_similarity_category). Basically, due to the subjectivity in determining the "distance", you could translate the scale you propose into "low", "moderate", "high" and use these as categorical variables rather than arbitrary numeric ones.
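To make the three alternatives concrete, here is a sketch of how each might look as an SSSOM-style mapping record. The term IDs are invented, and the field names other than predicate_id and semantic_similarity_score (i.e. distance and semantic_similarity_category) are assumptions from this discussion, not existing metadata elements:

```python
# Illustrative sketch of the three alternatives as mapping records.
# Term IDs are made up; "distance" and "semantic_similarity_category"
# are hypothetical field names proposed in this thread.

s1 = {  # S1: reuse semantic_similarity_score for the subjective distance
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "semantic_similarity_score": 0.5,
}

s2 = {  # S2: a new numeric field
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "distance": 0.5,
}

s3 = {  # S3: a new categorical field
    "subject_id": "A:0001",
    "predicate_id": "skos:narrowMatch",
    "object_id": "B:0002",
    "semantic_similarity_category": "moderate",
}
```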

I think I like S3 and S1.

  • S1 has the advantage that we don't need a new metadata field, and the disadvantage that there will be a lot of "noise" in the subjective numeric field, which will make it harder for data scientists to process the mapping.
  • S3 has the advantage that you can process it much more easily, and there is going to be much less noise, but.. you need a new metadata field.

I think after some contemplation I am tending towards S3, but I can be convinced to do S1. I am a bit less enthusiastic about S2 because it has the disadvantages of both, and its only advantage is that the semantic_similarity_score metadata element is not repurposed slightly.. idiosyncratically.

@joeflack4

joeflack4 commented Aug 5, 2023

I wonder if there's any relevant literature on this topic for precisely this case. I feel there must be some existing thinking on this topic, and I feel like there are a variety of different algorithms you could use to derive such a distance score.

Algorithms

S1. Semantic similarity

I think that S1 certainly sounds the easiest, but semantic similarity alone definitely doesn't really get at the extent of broadness/narrowness.

S2-3.a. Distance as a flattening of existing narrow/broad hierarchies

Using existing ontologizations
I'm thinking that sometimes you could get at how broad/narrow something is based on whether there is any existing ontologization. E.g. if you have an existing ontology where A narrower B narrower C, you could say A narrower C w/ 'distance 2' in this respect. Maybe A, B, and C are in the same ontology, or maybe they are in different ontologies but have these connections; I suppose the distance would still be the same.
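The "distance as hops in an existing narrower-than hierarchy" idea can be sketched as a breadth-first search over parent edges. The edge data and term names below are invented for illustration:

```python
from collections import deque
from typing import Optional

# Toy hierarchy for illustration: A narrower B narrower C.
narrower_than = {  # child -> list of direct broader parents
    "A": ["B"],
    "B": ["C"],
}

def hierarchy_distance(narrow: str, broad: str) -> Optional[int]:
    """Number of narrower-than hops from `narrow` up to `broad` (BFS)."""
    queue = deque([(narrow, 0)])
    seen = {narrow}
    while queue:
        term, dist = queue.popleft()
        if term == broad:
            return dist
        for parent in narrower_than.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None  # no path: terms are not related via this hierarchy

print(hierarchy_distance("A", "C"))  # 2
```

This treats cross-ontology connections the same as intra-ontology ones, matching the observation above that the distance would be the same either way.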

On the fly ontologization
Maybe there are existing terms A and D in some ontology(s). Term A is narrower than term D, and there are no existing terms B or C in between, but we know that they exist. Perhaps in this process, rather than saying there is a distance of 1 simply because only A and D currently exist, we say there is a distance of 3 because we know B and C do exist and should ideally be part of the ontology.

S2-3b. Distance as a composite of other properties

As in 'depth of blueness' or 'condition of vehicle' in example below.

Multi-branch averaging; car example

Maybe terms are connected in more than one respect / via more than one branch, and maybe the branches have different lengths. Like A is narrower than B in 2 different respects, and 1 of those respects is very narrow while the other is narrow but not as narrow. I don't know how often this actually happens; just thinking of an example. Let's say we have 2 cars. A is 'average blue' and in generally poor condition but functional. B is dark blue and perhaps has a flat tire. Via the 'blueness' branch/vector, maybe we could say there is a short narrow distance. But on the condition branch/vector, maybe the narrowness is farther. For example, 'flat tire' narrower ('damaged tires' or 'low tire pressure') narrower 'poor condition'. Maybe not a perfect example, but it's just to illustrate that things can be broad/narrow in more than one way, and an ideal distance metric would take that into account.

Something that would make this even more complicated is if the 'weights' of the branches differ. Like, if we're comparing the two cars above, maybe for some reason in our classification system, the color of the car is more important, so we would care about the narrowness/broadness along the branch of color as more important than the branch of condition.
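Combining per-branch distances with unequal weights, as in the car example, could be sketched as a weighted average. All numbers, branch names, and the function itself are invented for illustration:

```python
# Sketch of combining per-branch broad/narrow distances with weights.
# Here the colour branch is weighted more heavily than the condition
# branch, as in the car example above. Values are made up.

def weighted_distance(branch_distances: dict, weights: dict) -> float:
    """Weighted average of per-branch distances (weights need not sum to 1)."""
    total_weight = sum(weights[b] for b in branch_distances)
    weighted_sum = sum(branch_distances[b] * weights[b] for b in branch_distances)
    return weighted_sum / total_weight

branches = {"colour": 1.0, "condition": 2.0}  # hops along each branch
weights = {"colour": 0.7, "condition": 0.3}   # colour matters more here

print(round(weighted_distance(branches, weights), 2))  # 1.3
```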


I think 0-1 is good. I think that the most robust distance metric would be S2 (continuous) rather than S3 (discrete), though S3 could be "quantized", e.g. low=0.33, moderate=0.5, high=0.66.
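The quantization idea is just a lookup from category to numeric value; the specific numbers below come from this comment and are examples, not a standard:

```python
# Hypothetical quantization of the S3 categorical scale into S2-style
# numbers, using the example values from this comment.

CATEGORY_TO_SCORE = {"low": 0.33, "moderate": 0.5, "high": 0.66}

def quantize(category: str) -> float:
    """Translate an S3 category into an S2-style numeric distance."""
    return CATEGORY_TO_SCORE[category]

print(quantize("moderate"))  # 0.5
```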

After thinking about this, I think semantic similarity is really just a proxy for distance. If we used semantic similarity as distance, I would recommend 2 fields: 'distance' and 'distance_algorithm', where distance_algorithm would be 'semantic similarity'. Or we could simply have 2 different fields: 1 for 'distance' (using some other algorithm), plus a 2nd 'semantic similarity' field.

And I can't think of too many other ways to think of distance. I can think of it as discrete: representative of explicit relationships between entities, as in S2-3a/b above. And I can think of distance on a continuous spectrum, like in the 'average' vs 'dark blue' car example; the real color spectrum is not discrete like that. But I think for our use cases, things are likely to be discrete.

I would be interested to know if anyone has any other ideas as to what 'distance' really can represent here.

@matentzn
Contributor

matentzn commented Aug 6, 2023

Thank you @joeflack4 for your thoughts! I think this is all reasonable thinking; however, the main problem is that we are not that interested in "technical" distance as such, we are interested in real-world distance. This means that even if the two aligned ontologies contain only a single term each, those two terms could still be more or less distant from each other, and there is no semantic similarity to compute at all. Indeed, when aligning two semantic spaces, there is not really a proper semantic similarity score a la Jaccard in most cases. There is only the conceptual model in the heads of the experts (the embedding space in the brain), and how distant the two terms are according to that. This is what the issuer @sbello is trying to capture here.
