Eval means inputparams #26

Open · wants to merge 3 commits into master
Conversation

@bluque (Contributor) commented on Mar 24, 2017

This script outputs the metrics for each batch of 128 images; I have added the computation of the mean of these metrics in order to have a global evaluation of the model.

I also propose an improvement over #25 where the model and dataset names are passed as arguments when calling the function. This way, we don't need to modify the script for every new evaluation.
python eval_detection_fscore.py model_name dataset_name weights_file path_to_images

The only reason I changed the way the model is built when distinguishing between yolo and tiny yolo is to make it easier to add more models in the future (just add another elif model_name == ...); it's not a significant change, and the result is the same.
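For reference, a minimal sketch of how the proposed interface could look, assuming plain sys.argv parsing; build_yolo, build_tiny_yolo, and the 'tiny-yolo' value are hypothetical names and may not match what eval_detection_fscore.py actually uses:

```python
import sys

# Hypothetical sketch of the proposed interface; the real constructor names
# and argument values in eval_detection_fscore.py may differ.
if len(sys.argv) != 5:
    sys.exit('usage: eval_detection_fscore.py model_name dataset_name weights_file path_to_images')

model_name, dataset_name, weights_file, path_to_images = sys.argv[1:5]

if model_name == 'yolo':
    model = build_yolo(dataset_name)        # hypothetical helper
elif model_name == 'tiny-yolo':
    model = build_tiny_yolo(dataset_name)   # hypothetical helper
# elif model_name == 'ssd': ...             # adding a new model is just another branch
else:
    sys.exit('Unknown model name: ' + model_name)

model.load_weights(weights_file)
```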

@lluisgomez (Collaborator) commented:
@bluque Thanks for the pull request!

Overall I like the changes you propose for the input arguments of model and dataset names.

However, I do not see the point of the "averaged metrics". In the original code, what is printed on lines 125 to 128 is the "running" precision, recall, and F-score, not the metrics for each batch.

For example, the variable "ok" is defined and set to zero on line 66, outside the loop, and is never reset; we only increment it every time we find a correct detection.

The same goes for the variables "total_pred" and "total_true".

So the metrics that are shown are the metrics for all the images evaluated so far; when the script finishes, they are the metrics for the whole dataset, right?
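To make the distinction concrete, here is a minimal sketch of the running-metric pattern being described; batches, evaluate, and count_matches are hypothetical placeholders, and the counters mirror ok, total_pred, and total_true from the script:

```python
# Sketch of the running-metrics pattern; the counters live outside the loop
# and are never reset, so they accumulate over all batches.
ok = 0           # correct detections so far
total_pred = 0   # predicted boxes so far
total_true = 0   # ground-truth boxes so far

for batch in batches:                              # hypothetical iterable of image batches
    detections, ground_truth = evaluate(batch)     # hypothetical evaluation step
    ok += count_matches(detections, ground_truth)  # hypothetical matching step
    total_pred += len(detections)
    total_true += len(ground_truth)

    # Printed every iteration, but always computed from the cumulative counters,
    # so after the last batch these are the metrics for the whole dataset.
    precision = ok / total_pred if total_pred else 0.0
    recall = ok / total_true if total_true else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
    print('Precision: %.4f  Recall: %.4f  F-score: %.4f' % (precision, recall, fscore))
```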

On the other hand, be careful with one thing: an averaged metric (per batch), as you propose, is not always meaningful. Imagine we evaluate only two "batches" of 128 images each. In the first batch there is only one object in one of the images (all the other images contain no objects) and the model we are evaluating misses it, so the recall for this batch is zero. Then imagine in the second batch there are 200 objects and the model detects all of them correctly, so the recall for the second batch is 100%. If you take the mean of these two recall values you get a final recall of 50%, while the model correctly detected 200 objects out of 201 :) so the final recall should be 99.5%. Do you see the point? We must compute the average over the total objects in the ground truth, not over the total number of images or batches.
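Worked out numerically, the two-batch example looks like this:

```python
# Recall of each batch in the example above.
recall_batch1 = 0.0 / 1       # one object, missed          -> 0.0
recall_batch2 = 200.0 / 200   # 200 objects, all detected   -> 1.0

# Mean of the per-batch recalls: misleading.
mean_per_batch = (recall_batch1 + recall_batch2) / 2       # 0.5

# Recall pooled over all ground-truth objects: what we actually want.
pooled_recall = (0.0 + 200.0) / (1 + 200)                  # ~0.995
```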

Please let me know if this is clear... I've double-checked the code and I think it's correct as it is. Anyway, it's always good to question things that are not clear and make sure they are correct.

Also, I'm open to changing the code, for example to print "Running precision" instead of "Precision", etc., and then print the final precision at the end, once the main loop is finished. Maybe this helps to avoid confusion.

@bluque (Contributor, Author) commented on Mar 25, 2017

Yes, you are right! I didn't check in detail how the metrics were computed because I thought they were per batch. In any case, it is true that the average of these metrics wouldn't be accurate anyway. I will make the modifications you propose :)

I removed the means of the precision, recall, and F-score, but I kept the mean of the FPS, as that one is computed independently for each batch.
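A rough sketch of why the per-batch FPS mean is still fine: each batch gives an independent timing measurement, so averaging them is meaningful, unlike the per-batch recall. The names here (batches, run_detection) are illustrative, not the script's actual variables:

```python
import time

# Illustrative sketch: FPS is measured independently for each batch,
# so the per-batch mean is a meaningful summary.
fps_per_batch = []
for batch in batches:                 # hypothetical iterable of 128-image batches
    start = time.time()
    run_detection(batch)              # hypothetical inference call
    elapsed = time.time() - start
    fps_per_batch.append(len(batch) / elapsed)

mean_fps = sum(fps_per_batch) / len(fps_per_batch)
print('Mean FPS: %.1f' % mean_fps)
```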