
Use memory bank for images not belonging to the same video ? #352

Open

tcourat opened this issue Oct 3, 2024 · 8 comments

Comments


tcourat commented Oct 3, 2024

Has anyone tried to leverage the SAMv2 memory bank for purposes other than video segmentation? For instance, if I have 10 images of a given object (like a table, chair, etc.), can I click on the object in image 1 only and have SAM automatically segment similar objects in all the other images?


heyoeyo commented Oct 3, 2024

There was a related discussion in issue #210, though that was for using an image as a starting prompt for a video. In general it can work, but it's probably not capable of exact matching (i.e. I wouldn't expect it to pick out a specific person from a crowd, though I haven't tried). It seems to match based on multiple factors, not just the 'semantic' label we would assign but also color and positioning (of the original prompt) for example.

Here's a simple example of prompting on a picture of a bird and applying that to another similar picture, along with a couple of unrelated pictures (these photos are all from the BIG dataset):

Prompt image:
Screenshot from 2024-10-03 09-49-03

Transfer image (i.e. no prompt used, instead it uses the memory bank from the result above). It may prefer the central bird because the original prompt is centered, even though the left bird seems like a better match.
Screenshot from 2024-10-03 09-49-34

Two more transfer images, but unrelated to the original. Here it is picking stuff on the left, maybe because of the sizing of the objects (not really sure).
Screenshot from 2024-10-03 09-49-08
Screenshot from 2024-10-03 10-04-28


heyoeyo commented Oct 7, 2024

Just as a follow-up for anyone interested, I've set up an interactive script to make this easier to try. Here's a quick example video:

bird_cross_seg_example.mp4

On the left is an image being prompted with a single foreground point (the green dot), while the image on the right is segmented based on the 'video memory bank' from the first image result. The model seems to intuitively segment things like the eye, beak and feet, as well as the branch/platform the birds stand on. It gives lots of weird and often bad results too, and is very sensitive to which mask is being selected.

For anyone trying to use the model this way, it seems important to encode the second image in the memory bank several times (i.e. as if it appears repeatedly as a short video) for the segmentation to come out cleanly.
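In case it helps, here's a rough sketch of how that repeated-frame trick might look with the standard video predictor API. Method names follow the official video example notebook and may differ between releases; the file paths, repeat count and click coordinates below are just placeholders:

```python
# Sketch: cross-image "prompt transfer" by treating [prompt image, target image x N]
# as a short JPEG-frame "video", so the target image gets encoded into the
# memory bank several times. Paths, repeat count and the click point are placeholders.
import os
import shutil
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Frame 0 = the image that receives the point prompt, frames 1..N = copies of the target.
video_dir = "cross_seg_frames"
os.makedirs(video_dir, exist_ok=True)
shutil.copy("prompt_image.jpg", os.path.join(video_dir, "00000.jpg"))
num_repeats = 5
for i in range(1, num_repeats + 1):
    shutil.copy("target_image.jpg", os.path.join(video_dir, f"{i:05d}.jpg"))

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path=video_dir)

    # Single foreground click on the object in the prompt image (frame 0).
    points = np.array([[300, 250]], dtype=np.float32)  # (x, y), placeholder
    labels = np.array([1], dtype=np.int32)             # 1 = foreground
    predictor.add_new_points(state, frame_idx=0, obj_id=1, points=points, labels=labels)

    # Propagate through the repeated target frames; the last mask is usually
    # the most "converged" one for the target image.
    target_mask = None
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        target_mask = (mask_logits[0] > 0.0).cpu().numpy()
```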


mgrewe commented Oct 8, 2024

That is interesting. Do you mean that the segmentation on the right becomes better (e.g., less fuzzy) the more often the segmentation is repeated? Would you suspect this to be some kind of averaging/smoothing effect? I would be glad if you could share some example images.


heyoeyo commented Oct 8, 2024

Would you suspect this to be some kind of averaging/smoothing effect?

It does frequently have a smoothing effect. It often seems to converge on a better mask of the area that was initially (uncleanly) masked. This clean-up trick also works when cross-prompting an image with itself (though it may be simpler to just adjust the prompt).

Here's an example where the initial (bird) prompt causes both sheep to get messy masks, but repeating the frames cleans up the mask on the left sheep and removes the right sheep:

repeat_encoding_example.mp4

Interestingly, it also improves the object score which is normally an indicator for the video tracking that it 'found' the object in the next frame. It also converges on the legs of the sheep in the second mask output (far right, second from the top) and sort of converges on the sheep's face on the last mask output, so the behavior varies based on the initial mask.


mgrewe commented Oct 9, 2024

Thanks @heyoeyo . Very interesting insight! Do you have any clue if this works similarly for SAMv1?


heyoeyo commented Oct 9, 2024

Unfortunately, the SAMv1 models can't be used in the same way since they don't have the video capability/memory bank of the v2 models.


mgrewe commented Oct 10, 2024

Yeah, I know. I rather meant just repeatedly feeding a mask back in to improve it. I was hoping to use that to improve solutions from the mask input prompt, which to my knowledge gives worse results compared to other prompt types. But that is slightly off topic here.


heyoeyo commented Oct 10, 2024

Rather meant only repeatedly feeding in a mask to improve it

Oh I see what you mean. I haven't experimented with the mask prompts too much, since (as you say) they tend to give disappointing results compared to other prompts. From what little I have seen, the mask prompts (v1 or v2) don't seem to show the same convergence behavior as using the v2 memory bank. Surprisingly, the v2.1 mask prompts do show some cross-image prompting capability though (could be a coincidence, I haven't tried it on many images).

That being said, in the SAMv1 paper they mention (on page 17, under the Training Algorithm section) feeding the output of the model back in as a mask prompt and suggest that it helps, so maybe there is a way to make it work. The automatic mask generator even includes this sort of processing as an option.
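For reference, the v1 version of that idea is basically a two-pass call with the predictor, feeding the low-res logits from the first pass back in as the mask prompt on the second pass. This follows the public segment-anything predictor API; the checkpoint, image path and click location are placeholders:

```python
# Sketch of the SAMv1 "feed the output back in as a mask prompt" refinement.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point = np.array([[300, 250]])  # (x, y) click, placeholder
label = np.array([1])           # 1 = foreground

# Pass 1: point prompt only.
masks, scores, logits = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True)
best = int(np.argmax(scores))

# Pass 2: same point plus the previous low-res mask logits as a mask prompt.
masks, scores, logits = predictor.predict(
    point_coords=point, point_labels=label,
    mask_input=logits[best][None, :, :],  # 1x256x256 low-res logits from pass 1
    multimask_output=False)
```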
