
cocoeval GT-DT matching implementation is wrong #564

Open
bertsky opened this issue Dec 8, 2021 · 9 comments

bertsky commented Dec 8, 2021

I don't know how this could go undetected for so long (cannot find any mention in the issues), but I am certain there is a bug in the algorithm that is used to match DT (detection) and GT (ground-truth) objects. It gets more severe when you use a lower iouThr, but you can always have pathological cases even at high degrees of overlap. Both the Python and the Matlab implementation are affected.

The basic idea of the evaluateImg(imgId, catId) algorithm is to (for each iouThr value independently; a sketch follows the list):

  • iterate through all DT objects
    • iterate through all GT objects
      • if that pair's IoU is at least iouThr and
      • if that pair's IoU is larger than that of all the other GT objects (within the inner loop)
      • then mark/keep that as a match
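
In pseudo-Python, a minimal sketch of that loop structure (hypothetical names; `ious` is assumed to be a precomputed D×G array of pairwise IoUs, not the actual pycocotools code):

```python
import numpy as np

def match_basic(ious, iou_thr):
    """Greedy best-GT-per-DT matching (sketch, without the 'already
    matched' check discussed below).  ious: (D, G) array of pairwise
    IoUs.  Returns dt_match[d] = best GT index for DT d, or -1."""
    D, G = ious.shape
    dt_match = np.full(D, -1, dtype=int)
    for d in range(D):                  # iterate through all DT objects
        best_iou = iou_thr              # pair must reach at least iouThr
        for g in range(G):              # iterate through all GT objects
            if ious[d, g] < best_iou:   # must beat all other GTs so far
                continue
            best_iou = ious[d, g]
            dt_match[d] = g             # keep the best GT match so far
    return dt_match
```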

Now, the outer-inner loop structure already dictates asymmetry: if there are multiple potential matches, then we obviously cannot get the best DT match per GT object – only the best GT match per DT object. That is already a problem in itself (e.g. 2 DTs could "share" 1 GT, but the GT would have to "choose"; ideally, the data structure and the algorithm filling it should be rich enough to map multiple matches completely and symmetrically). But one could live with that asymmetry (and, if need be for the other preference, just call COCOeval in the reverse direction).

However, it gets worse by an additional criterion that seemingly addresses this – but in the wrong way:

  • iterate through all DT objects
    • iterate through all GT objects
      • if that GT object does not already have some DT match (within the outer loop) and
      • if that pair's IoU is at least iouThr and
      • if that pair's IoU is larger than that of all the other GT objects (within the inner loop)
      • then mark/keep that as a match

In the code, this is the offending check (Python and Matlab implementations respectively):

```python
# if this gt already matched, and not a crowd, continue
if gtm[tind,gind]>0 and not iscrowd[gind]:
    continue
```

```matlab
% if this gt already matched, and not a crowd, continue
if( gtm(t,g)>0 && ~iscrowd(g) ), continue; end
```

That effectively ruins even the other criterion (of selecting the best GT match per DT object), replacing it with an arbitrary choice (of whichever DT came first to "reserve" the GT). We end up with up to 1 GT per DT (the "best free") and up to 1 DT per GT (the "first best"). It is only "up to" 1, because obviously we can now miss the only match for a pair (even if it is actually better IoU-wise) when its GT is already reserved by some earlier (i.e. higher-scoring) DT.
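
To make that concrete, here is a hypothetical two-DT, one-GT scenario (all numbers invented for illustration):

```python
import numpy as np

# Two DTs competing for one GT at iouThr = 0.5.  DTs are assumed already
# sorted by descending confidence, so DT0 is processed first.
ious = np.array([[0.55],   # IoU(DT0, GT0): higher score, mediocre overlap
                 [0.95]])  # IoU(DT1, GT0): lower score, near-perfect fit

gt_match = np.full(1, -1, dtype=int)
for d in range(2):
    for g in range(1):
        if gt_match[g] >= 0:   # the "already matched" reservation
            continue
        if ious[d, g] >= 0.5:
            gt_match[g] = d

print(gt_match)  # [0] -- GT0 is kept by DT0 (IoU 0.55); DT1 ends up an FP
```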

I wonder what motivation the authors had for that additional criterion. Could it be speed (i.e. shortcutting the combinatorial explosion of comparisons)? Or was it consistency (i.e. ensuring choices in gtMatches always mirror those in dtMatches)?

IMO one should modify that criterion to at least compare the IoU of the GT object's previous DT match with the current pair's IoU: iff the former is actually larger than the latter, then keep it, but otherwise drop it (i.e. reassign the GT) or add it (i.e. allow dtMatches to contain the same GT id multiple times, while gtMatches only contains the largest DT id). A sketch of the latter variant follows.
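
For illustration, a minimal sketch of the "add it" variant under the same assumptions as above (hypothetical names, not actual pycocotools code):

```python
import numpy as np

def match_with_reassignment(ious, iou_thr):
    """Best-GT-per-DT matching that lets a better-fitting DT take over
    a GT.  Returns dt_match (best GT per DT, may repeat a GT id) and
    gt_match (the single DT whose IoU with that GT is largest)."""
    D, G = ious.shape
    dt_match = np.full(D, -1, dtype=int)
    gt_match = np.full(G, -1, dtype=int)
    for d in range(D):
        best_iou, best_g = iou_thr, -1
        for g in range(G):
            if ious[d, g] < best_iou:
                continue
            best_iou, best_g = ious[d, g], g
        if best_g < 0:
            continue                      # no GT reaches iouThr for this DT
        dt_match[d] = best_g              # dtMatches may repeat a GT id
        prev = gt_match[best_g]
        # reassign the GT iff the new pair overlaps better than the old one
        if prev < 0 or ious[prev, best_g] < best_iou:
            gt_match[best_g] = d
    return dt_match, gt_match
```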

Please correct me if I am getting it all wrong! (Also, if you have better ideas, or different perspectives.)

@andreaceruti

@bertsky I do not think this repo is maintained by the authors anymore. Anyway, can you give an implementation of your solution to the problem? I think you are right on this point.


bertsky commented Mar 9, 2022

@andreaceruti, no, I did not bother with pycocotools for that purpose (but based on the above analysis, I believe it would not be difficult to do). I went for my own algorithm here ff. to get truly n:m matches (plus FP and FN, pixel-wise measures, and over-/undersegmentation measures).

@andreaceruti

@bertsky Wow, really nice work! If I wanted to evaluate my custom dataset on instance segmentation and object detection tasks using your implementation, could I just use the evaluate_coco function? Or do you think I should make some other changes? Anyway, I will go more in depth into your code in the coming days, since it is documented better than this repo :)


bertsky commented Mar 9, 2022

@andreaceruti Yes, evaluate_coco if you already have COCO datasets – but there is no CLI for that (yet). If you have PAGE-XML, you can use standalone_cli (which gets installed as page-segment-evaluate via entry_points).

@andreaceruti

@bertsky I am almost there. Yes, I have 2 JSON files representing coco_gt and coco_dt. I have applied your method, and now I am just looking at the dict it constructs.
The first part summarizes the whole dataset, and then there is one dict per image. For example, this is the dict for the entire dataset:

" 'by-category': {'grape bunch': {'IoDT': 0.8433501965488758,
'IoGT': 0.8414118188031777,
'IoU': 0.7247137091559454,
'oversegmentation': 0.00469693580854395,
'pixel-iou': 0.9503520788133711,
'pixel-precision': 0.9917968516288882,
'pixel-recall': 0.9578813700668292,
'segment-precision': 0.9752851711026616,
'segment-recall': 0.8496732026143791,
'undersegmentation': 0.007393324883819181}}"

And this is the dict for one sample image:

```python
'CDY_2043.jpg': {'oversegmentation': {'grape bunch': 0.018181818181818184},
                 'pixel_iou': {'grape bunch': 1.0},
                 'pixel_precision': {'grape bunch': 1.0},
                 'pixel_recall': {'grape bunch': 1.0},
                 'precision': {'grape bunch': 1.0},
                 'recall': {'grape bunch': 1.0},
                 'true_positives': {'grape bunch': [{'DT.ID': 344,
                                                     'DT.area': 37258,
                                                     'GT.ID': 2031,
                                                     'GT.area': 47830,
                                                     'I.area': 36497,
                                                     'IoDT': 0.9795748564066777,
                                                     'IoGT': 0.7630566590006272,
                                                     'IoU': 0.7511061719248421},
                                                    ...
                                                    {'DT.ID': 354,
                                                     'DT.area': 4067,
                                                     'GT.ID': 2036,
                                                     'GT.area': 11346,
                                                     'I.area': 3794,
                                                     'IoDT': 0.9328743545611016,
                                                     'IoGT': 0.33439097479287855,
                                                     'IoU': 0.32653412513985713}]},
                 'undersegmentation': {'grape bunch': 0.0}}
```

If I understand correctly, the ideas behind these metrics are taken from the "Rethinking Semantic Segmentation Evaluation" paper, but could you explain to me how I could obtain AP, TPs, FPs and FNs for the instance segmentation task?

@andreaceruti

@bertsky this is the image I have used as an example. On the left you can see the ground truth, and on the right the detections:

[image: example]


bertsky commented Mar 10, 2022

@andreaceruti this is off topic here and too verbose – let's discuss in ocrd_segment

volcanolee4 commented Mar 12, 2022

@bertsky Hi! I don't think it is an arbitrary choice in the cocoeval.evaluate() function (as you say, the arbitrary choice means "whichever DT came first to 'reserve' the GT"). Before a DT finds its matching GT, the list of DTs has already been sorted by class confidence score, so if a DT can "come first", that means this DT has a higher class confidence score. In conclusion, if a GT meets more than one DT, the DT with the highest class confidence score matches this GT.
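
For reference, that is this step near the top of evaluateImg in pycocotools/cocoeval.py (quoted from memory, so double-check against the source):

```python
# detections are sorted by descending confidence before matching, so the
# "first" DT to reach a GT is always the highest-scoring candidate
dtind = np.argsort([-d['score'] for d in dt], kind='mergesort')
dt = [dt[i] for i in dtind[0:maxDet]]
```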


bertsky commented Mar 14, 2022

@volcanolee4 I did not say the choice was random, though. (In fact, I did allude to the fact that the candidates are sorted by confidence with my formulation "GT is already reserved by some earlier – i.e. higher-scoring – DT".)

But (given the actual problem to solve here – how to adequately compare a given prediction with the true segmentation) I insist that choice is arbitrary: any DT candidate with just slightly higher confidence but much lower relative or absolute overlap can replace the natural best fit for a GT region. (And that assignment then not only becomes exclusive for the GT, it can also prevent the DT from matching at all.) It is one thing to factor in the confidence of a predictor, but another to properly assess its accuracy. The current implementation not only conflates the two, it makes both impossible.
