
Use memory bank for images not belonging to the same video ? #352

Open

tcourat opened this issue Oct 3, 2024 · 8 comments

Comments


tcourat commented Oct 3, 2024

Has anyone tried to leverage the SAMv2 memory bank for purposes other than video segmentation? For instance, if I have 10 images of a given object (like a table, chair, etc.), can I click on the object in image 1 only and have SAM automatically segment similar objects in all the other images?


heyoeyo commented Oct 3, 2024

There was a related discussion in issue #210, though that was for using an image as a starting prompt for a video. In general it can work, but it's probably not capable of exact matching (i.e. I wouldn't expect it to pick out a specific person from a crowd, though I haven't tried). It seems to match based on multiple factors, not just the 'semantic' label we would assign but also color and positioning (of the original prompt) for example.

Here's a simple example of prompting on a picture of a bird and applying that to another similar picture, along with a couple of unrelated pictures (these photos are all from the BIG dataset):

Prompt image:
Screenshot from 2024-10-03 09-49-03

Transfer image (i.e. no prompt used, instead it uses the memory bank from the result above). It may prefer the central bird because the original prompt is centered, even though the left bird seems like a better match.
Screenshot from 2024-10-03 09-49-34

Two more transfer images, but unrelated to the original. Here it is picking stuff on the left, maybe because of the sizing of the objects (not really sure).
Screenshot from 2024-10-03 09-49-08
Screenshot from 2024-10-03 10-04-28


heyoeyo commented Oct 7, 2024

Just as a follow-up for anyone interested, I've set up an interactive script to make this easier to try. Here's a quick example video:

bird_cross_seg_example.mp4

On the left is an image being prompted with a single foreground point (the green dot), while the image on the right is segmented based on the 'video memory bank' from the first image result. The model seems to intuitively segment things like the eye, beak and feet, as well as the branch/platform the birds stand on. It gives lots of weird and often bad results too, and is very sensitive to which mask is being selected.

For anyone trying to use the model this way, it seems important to encode the second image in the memory bank several times (i.e. as if it appears repeatedly as a short video) for the segmentation to come out cleanly.
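In case it helps, here's a rough sketch of how that repeated-frame trick might look with the standard video predictor API. Method names follow the official video example notebook and may differ between releases; the file paths, repeat count and click coordinates below are just placeholders:

```python
# Sketch: cross-image "prompt transfer" by treating [prompt image, target image x N]
# as a short JPEG-frame "video", so the target image gets encoded into the
# memory bank several times. Paths, repeat count and the click point are placeholders.
import os
import shutil
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Frame 0 = the image that receives the point prompt, frames 1..N = copies of the target.
video_dir = "cross_seg_frames"
os.makedirs(video_dir, exist_ok=True)
shutil.copy("prompt_image.jpg", os.path.join(video_dir, "00000.jpg"))
num_repeats = 5
for i in range(1, num_repeats + 1):
    shutil.copy("target_image.jpg", os.path.join(video_dir, f"{i:05d}.jpg"))

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path=video_dir)

    # Single foreground click on the object in the prompt image (frame 0).
    points = np.array([[300, 250]], dtype=np.float32)  # (x, y), placeholder
    labels = np.array([1], dtype=np.int32)             # 1 = foreground
    predictor.add_new_points(state, frame_idx=0, obj_id=1, points=points, labels=labels)

    # Propagate through the repeated target frames; the last mask is usually
    # the most "converged" one for the target image.
    target_mask = None
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        target_mask = (mask_logits[0] > 0.0).cpu().numpy()
```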


mgrewe commented Oct 8, 2024

That is interesting. Do you mean that the segmentation on the right becomes better (e.g., less fuzzy) the more often the segmentation is repeated? Would you suspect this to be some kind of averaging/smoothing effect? I would be glad if you could share some example images.


heyoeyo commented Oct 8, 2024

Would you suspect this to be some kind of averaging/smoothing effect?

It does frequently have a smoothing effect. It often seems to converge on a better mask of the area that was initially (uncleanly) masked. This clean-up trick also works when cross-prompting an image with itself (though it may be simpler to just adjust the prompt).

Here's an example where the initial (bird) prompt causes both sheep to get messy masks, but repeating the frames cleans up the mask on the left sheep and removes the right sheep:

repeat_encoding_example.mp4

Interestingly, it also improves the object score which is normally an indicator for the video tracking that it 'found' the object in the next frame. It also converges on the legs of the sheep in the second mask output (far right, second from the top) and sort of converges on the sheep's face on the last mask output, so the behavior varies based on the initial mask.


mgrewe commented Oct 9, 2024

Thanks @heyoeyo . Very interesting insight! Do you have any clue if this works similarly for SAMv1?


heyoeyo commented Oct 9, 2024

Unfortunately, the SAMv1 models can't be used in the same way since they don't have the video capability/memory bank of the v2 models.


mgrewe commented Oct 10, 2024

Yeah, I know. I rather meant just repeatedly feeding a mask back in to improve it. I was hoping to use that to improve solutions from the mask input prompt, which to my knowledge gives worse results compared to other prompt types. But that is slightly off topic here.


heyoeyo commented Oct 10, 2024

Rather meant only repeatedly feeding in a mask to improve it

Oh I see what you mean. I haven't experimented with the mask prompts too much, since (as you say) they tend to give disappointing results compared to other prompts. From what little I have seen, the mask prompts (v1 or v2) don't seem to show the same convergence behavior as using the v2 memory bank. Surprisingly, the v2.1 mask prompts do show some cross-image prompting capability though (could be a coincidence, I haven't tried it on many images).

That being said, in the SAMv1 paper they mention (on page 17, under the Training Algorithm section) feeding the output of the model back in as a mask prompt and suggest that it helps, so maybe there is a way to make it work. The automatic mask generator even includes this sort of processing as an option.
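For reference, the v1 version of that idea is basically a two-pass call with the predictor, feeding the low-res logits from the first pass back in as the mask prompt on the second pass. This follows the public segment-anything predictor API; the checkpoint, image path and click location are placeholders:

```python
# Sketch of the SAMv1 "feed the output back in as a mask prompt" refinement.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point = np.array([[300, 250]])  # (x, y) click, placeholder
label = np.array([1])           # 1 = foreground

# Pass 1: point prompt only.
masks, scores, logits = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True)
best = int(np.argmax(scores))

# Pass 2: same point plus the previous low-res mask logits as a mask prompt.
masks, scores, logits = predictor.predict(
    point_coords=point, point_labels=label,
    mask_input=logits[best][None, :, :],  # 1x256x256 low-res logits from pass 1
    multimask_output=False)
```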
