Use memory bank for images not belonging to the same video? #352
There was a related discussion in issue #210, though that was for using an image as a starting prompt for a video. In general it can work, but it's probably not capable of exact matching (i.e. I wouldn't expect it to pick out a specific person from a crowd, though I haven't tried). It seems to match based on multiple factors: not just the 'semantic' label we would assign, but also color and positioning (of the original prompt), for example.

Here's a simple example of prompting on a picture of a bird and applying that to another similar picture along with an unrelated picture (these photos are all from the BIG dataset):

Transfer image (i.e. no prompt used; instead it uses the memory bank from the result above). It may prefer the central bird because the original prompt is centered, even though the left bird seems like a better match.

Two more transfer images, but unrelated to the original. Here it is picking stuff on the left, maybe because of the sizing of the objects (not really sure).
Just as a follow-up for anyone interested, I've set up an interactive script to make this easier to try. Here's a quick example video: bird_cross_seg_example.mp4

On the left is an image being prompted with a single foreground point (the green dot), while the image on the right is segmented based on the 'video memory bank' from the first image result. The model seems to intuitively segment things like the eye, beak and feet, as well as the branch/platform the birds stand on. It gives lots of weird and often bad results too, and is very sensitive to which mask is being selected.

For anyone trying to use the model this way, it seems important to encode the second image in the memory bank several times (i.e. as if it appears repeatedly as a short video) for the segmentation to come out cleanly.
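The "encode the second image several times" trick above can be set up by laying the two images out on disk as a fake video. Here's a minimal sketch, assuming SAM2's video predictor reads a folder of sequentially numbered JPEG frames (the function and path names here are my own, not part of the sam2 package):

```python
import os
import shutil

def build_fake_video(prompt_image: str, target_image: str, repeats: int, out_dir: str) -> int:
    """Lay out a 'video' as numbered JPEG frames: the prompted image as
    frame 0, then the target image repeated several times so its
    memory-bank encoding can stabilize. Returns the number of frames."""
    os.makedirs(out_dir, exist_ok=True)
    frames = [prompt_image] + [target_image] * repeats
    for i, src in enumerate(frames):
        # SAM2's frame loader expects zero-padded numeric filenames
        shutil.copy(src, os.path.join(out_dir, f"{i:05d}.jpg"))
    return len(frames)
```

From there, the usual video workflow should apply: initialize the video predictor on `out_dir`, add a point prompt on frame 0, and propagate through the repeated frames, keeping the mask from the last one (I'm assuming the standard init/prompt/propagate flow of the sam2 video predictor here; check the repo's video example for the exact calls).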
That is interesting. Do you mean that the segmentation on the right becomes better (e.g., less fuzzy) the more often the segmentation is repeated? Would you suspect this to be some kind of averaging/smoothing effect? I'd be glad if you could share some example images.
It does frequently have a smoothing effect. It often seems to converge on a better mask of the area that was initially (uncleanly) masked. This clean-up trick also works when cross-prompting an image with itself (though it's maybe simpler to just adjust the prompt). Here's an example where the initial (bird) prompt causes both sheep to get messy masks, but repeating the frames cleans up the mask on the left sheep and removes the right sheep: repeat_encoding_example.mp4

Interestingly, it also improves the object score, which is normally an indicator for the video tracking that it 'found' the object in the next frame. It also converges on the legs of the sheep in the second mask output (far right, second from the top) and sort of converges on the sheep's face on the last mask output, so the behavior varies based on the initial mask.
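One rough way to decide how many repeats are enough is to compare the masks produced on successive repeated frames and stop once they agree. This is a generic convergence check of my own, not anything built into SAM2 (the function names are made up):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def repeats_until_stable(masks, tol: float = 0.99) -> int:
    """Given masks from successive repeated frames, return the index of the
    first frame whose mask agrees with the previous one within tol IoU
    (i.e. the point where repetition has stopped changing the result)."""
    for i in range(1, len(masks)):
        if mask_iou(masks[i - 1], masks[i]) >= tol:
            return i
    return len(masks) - 1
```

A high IoU between consecutive masks doesn't guarantee the mask is *good*, only that further repetition isn't changing it; combining it with the object score mentioned above would probably give a better stopping signal.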
Thanks @heyoeyo. Very interesting insight! Do you have any clue whether this works similarly for SAMv1?
Unfortunately, the SAMv1 models can't be used in the same way since they don't have the video capability/memory bank of the v2 models. |
Yeah, I know. I rather meant just repeatedly feeding in a mask to improve it. I was hoping to use it to improve solutions from the mask input prompt, which to my knowledge gives worse results compared to other prompt types. But that is slightly off topic here.
Oh I see what you mean. I haven't experimented with the mask prompts too much, since (as you say) they tend to give disappointing results compared to other prompts. From what little I have seen, the mask prompts (v1 or v2) don't seem to show the same convergence behavior as using the v2 memory bank. Surprisingly, the v2.1 mask prompts do show some cross-image prompting capability though (could be a coincidence; I haven't tried it on many images). That being said, in the SAMv1 paper they mention (on page 17, under the Training Algorithm section) feeding the output of the model back in as a mask prompt as if it helps, so maybe there is a way to make it work. The automatic mask generator even includes this sort of processing as an option.
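The re-prompting idea from the SAMv1 paper can be sketched as a loop that feeds the predictor's low-res logits back in through the `mask_input` argument of `SamPredictor.predict` in the segment-anything repo. The loop itself is my own sketch of the trick, not the paper's exact procedure:

```python
def iterative_mask_refine(predictor, point_coords, point_labels, rounds: int = 3):
    """Repeatedly re-prompt a SamPredictor with the same points, feeding the
    previous round's low-res logits back in as a mask prompt.
    Assumes predictor.set_image(...) has already been called."""
    mask_input = None
    masks = scores = None
    for _ in range(rounds):
        masks, scores, logits = predictor.predict(
            point_coords=point_coords,
            point_labels=point_labels,
            mask_input=mask_input,   # None on the first round
            multimask_output=False,  # keep a single mask so feedback is stable
        )
        # predict() returns low-res logits shaped (1, 256, 256), which is
        # exactly what mask_input expects on the next round
        mask_input = logits
    return masks, scores
```

Whether this actually converges the way the memory-bank trick does is an open question per the discussion above; in my reading of the paper it mainly helped during training, so results at inference time may vary.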
Has anyone tried to leverage the SAMv2 memory bank for purposes other than video segmentation? For instance, if I have 10 images of a given object (like a table, chair, etc.), can I click on the object in image 1 only and have SAM automatically segment similar objects in all the other images?