Addressed situation when assign_default_confidence() returns only dataframe with all NaN confidence values #548

hrshdhgd · 2024-06-25T02:02:15Z

Ok, so here was the problem:

When the dataframe whose redundant rows had to be filtered out had all NaN values for confidence, the line

sssom-py/src/sssom/util.py

Line 441 in 5502067

df, nan_df = assign_default_confidence(df)

returned an empty dataframe as df and the entire source dataframe as nan_df.

Due to this, the following line:

sssom-py/src/sssom/util.py

Line 447 in 5502067

    dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()

resulted in dfmax = {}, which is of type pandas.Series rather than a dataframe. Hence the confusion.
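To make the failure condition concrete, here is a minimal sketch of how an all-NaN confidence column leaves nothing behind for the groupby step. The helper `split_by_confidence` and the lowercase column names are hypothetical stand-ins chosen for illustration; this is not the actual `assign_default_confidence()` from sssom-py, only an approximation of the split described above.

```python
import numpy as np
import pandas as pd


def split_by_confidence(df: pd.DataFrame):
    """Hypothetical stand-in: separate rows with and without a confidence value."""
    nan_mask = df["confidence"].isna()
    return df[~nan_mask], df[nan_mask]


# Every row lacks a confidence value (assumed example data).
source = pd.DataFrame(
    {
        "subject_id": ["A:1", "A:1"],
        "predicate_id": ["skos:exactMatch", "skos:exactMatch"],
        "object_id": ["B:1", "B:1"],
        "confidence": [np.nan, np.nan],
    }
)

df, nan_df = split_by_confidence(source)

# Nothing is left in df, so any later groupby on it operates on an empty frame.
print(df.empty)
print(len(nan_df))
```

With input like this, everything ends up in the NaN half, which is the situation the downstream code has to guard against.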

The correct way to handle this is simply to add an if statement:

sssom-py/src/sssom/util.py

Lines 447 to 469 in ffa2109

    if not df.empty:
        dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
        max_conf: Dict[Tuple[str, ...], float] = {}
        for _, row in dfmax.iterrows():
            if ignore_predicate:
                max_conf[(row[SUBJECT_ID], row[OBJECT_ID])] = row[CONFIDENCE]
            else:
                max_conf[(row[SUBJECT_ID], row[OBJECT_ID], row[PREDICATE_ID])] = row[CONFIDENCE]
        if ignore_predicate:
            df = df[
                df.apply(
                    lambda x: x[CONFIDENCE] >= max_conf[(x[SUBJECT_ID], x[OBJECT_ID])],
                    axis=1,
                )
            ]
        else:
            df = df[
                df.apply(
                    lambda x: x[CONFIDENCE]
                    >= max_conf[(x[SUBJECT_ID], x[OBJECT_ID], x[PREDICATE_ID])],
                    axis=1,
                )
            ]
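For comparison, the same guard-then-filter idea can be sketched in a more compact form using `groupby(...).transform("max")`. This is an illustrative alternative, not the code merged in this PR; the function name `keep_max_confidence` and the lowercase column names are assumptions made for the example.

```python
import pandas as pd


def keep_max_confidence(df: pd.DataFrame, key: list) -> pd.DataFrame:
    """Keep only rows whose confidence equals the maximum within their key group."""
    if df.empty:
        # Mirrors the `if not df.empty` guard: nothing to group, return as-is.
        return df
    # transform("max") broadcasts each group's maximum back onto its rows.
    group_max = df.groupby(key)["confidence"].transform("max")
    return df[df["confidence"] >= group_max]


key = ["subject_id", "object_id", "predicate_id"]
mappings = pd.DataFrame(
    {
        "subject_id": ["A:1", "A:1", "A:2"],
        "predicate_id": ["skos:exactMatch"] * 3,
        "object_id": ["B:1", "B:1", "B:2"],
        "confidence": [0.9, 0.5, 0.7],
    }
)

filtered = keep_max_confidence(mappings, key)
empty_result = keep_max_confidence(mappings.iloc[0:0], key)
```

In this sketch the 0.5 row is dropped because another row with the same key has confidence 0.9, and the empty input is returned unchanged instead of reaching the groupby.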

I've added an explicit test and it passes. Fixes #546

src/sssom/util.py

matentzn

LGTM, thank you!

Addressed situation when assign_default_confidence() returns only dat…

ffa2109

…aframe with all NaN confidence values

hrshdhgd requested a review from matentzn June 25, 2024 02:02

hrshdhgd added 2 commits June 24, 2024 21:05

fixed test

7b95ba6

cleanup

78eacb3

hrshdhgd mentioned this pull request Jun 25, 2024

Add confidence fill conditional in lexmatch compare monarch-initiative/mondo-ingest#581

Closed

9 tasks

twhetzel reviewed Jun 25, 2024

src/sssom/util.py

matentzn reviewed Jun 26, 2024

src/sssom/util.py

matentzn approved these changes Jun 26, 2024

hrshdhgd merged commit e0dfcb3 into master Jun 26, 2024
6 checks passed

hrshdhgd deleted the issue-546 branch June 26, 2024 21:52
