Achieving Higher FPS with Multiple Object Tracking #367

Open
daniaFrenel opened this issue Oct 10, 2024 · 1 comment


@daniaFrenel

Hey,
I am using SAM2 to analyze videos recorded at 30 FPS, currently tracking around 16 objects and achieving a tracking speed of only 2 FPS (propagating every 4 frames). I am interested in understanding the factors impacting this performance, for example, the number of frames loaded for the video predictor.

Could you provide insights on optimizations or adjustments that might help improve performance? For example, would loading object IDs directly into tensors enhance processing speed? Any guidance on potential changes to reach my desired FPS would be greatly appreciated.

Dania

@heyoeyo

heyoeyo commented Oct 11, 2024

There are a few changes that can be made to speed things up, but they'll generally come at the cost of accuracy. The time required per frame is (roughly) something like:

time per frame = E + M*n

Where:
  E is the image encoding time
  M is the masking + memory encoding + memory attention time
  n is the number of objects being tracked
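Plugging illustrative numbers into this model shows why the object count dominates. The values for E and M below are made-up examples for the sake of the arithmetic, not measured SAM2 timings:

```python
# Toy illustration of the timing model above. E and M are hypothetical
# per-frame costs in seconds, chosen only to make the arithmetic concrete.
def time_per_frame(E, M, n):
    """Rough per-frame cost: one image encoding, plus memory/masking per object."""
    return E + M * n

def fps(E, M, n):
    """Frames per second implied by the per-frame cost."""
    return 1.0 / time_per_frame(E, M, n)

# With hypothetical E = 0.1 s and M = 0.025 s per object:
print(fps(0.1, 0.025, 16))  # 16 objects -> 2.0 FPS
print(fps(0.1, 0.025, 4))   # 4 objects  -> 5.0 FPS
```

With these example numbers, tracking 16 objects lands at 2 FPS, so cutting either M (memory settings, resolution) or n (fewer objects) gives the biggest wins.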

The image encoding time can be decreased by switching to smaller models (e.g. using the tiny model) as well as running at a lower image resolution (see issue #257). Both of these changes can reduce segmentation quality/accuracy though.
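As a sketch, switching to the tiny model just means pointing the predictor at the tiny config/checkpoint. The exact filenames vary between SAM2 releases, so treat the paths below as assumptions and check against your install:

```python
# Sketch: building the video predictor with the tiny model.
# The config/checkpoint filenames below are assumptions -- verify them
# against the names shipped with your SAM2 release.
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "checkpoints/sam2_hiera_tiny.pt"  # assumed filename
model_cfg = "sam2_hiera_t.yaml"                # assumed filename
device = "cuda" if torch.cuda.is_available() else "cpu"
predictor = build_sam2_video_predictor(model_cfg, checkpoint, device=device)
```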
The time required to load the image could also be considered part of this timing and could be reduced by loading images in parallel to running the model itself, though it should be a relatively small part of the total time either way.
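A minimal sketch of that overlap, using a background thread to prefetch frames while the model runs (`load_frame` here is a hypothetical stand-in for whatever image-loading function you use):

```python
# Sketch: prefetch frames in a background thread so frame I/O overlaps
# with model inference. load_frame is a hypothetical I/O function.
import queue
import threading

def prefetch_frames(frame_paths, load_frame, maxsize=4):
    """Yield loaded frames while a worker thread loads the next ones."""
    q = queue.Queue(maxsize=maxsize)  # bounded, so we don't load everything at once

    def worker():
        for path in frame_paths:
            q.put(load_frame(path))
        q.put(None)  # sentinel: no more frames

    threading.Thread(target=worker, daemon=True).start()
    while True:
        frame = q.get()
        if frame is None:
            break
        yield frame

# Usage sketch:
#   for frame in prefetch_frames(paths, load_frame):
#       run_model(frame)   # hypothetical inference call
```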

The masking/memory time can be decreased by using fewer previous frames in the memory attention step as well as using a lower image resolution. Again, these changes can reduce the quality of the outputs.
Using fewer memory frames unfortunately requires changes to the code. If you want to try it, a simple hack is to edit line 539 in sam_base.py:

```python
num_prev_frames = 1  # values between 0 and 6 are valid
for t_pos in range(1, 1 + num_prev_frames):  # originally: range(1, self.num_maskmem)
```

It's very situational, but in some scenes it might also be possible to decrease the number of objects by using a prompt that masks several objects together in a single mask, though you would have to separate the results after the fact.
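One way to do that after-the-fact separation, assuming the grouped objects don't touch, is a connected-components split of the combined binary mask. A pure-NumPy sketch (in practice `scipy.ndimage.label` or `cv2.connectedComponents` would do the same job faster):

```python
# Sketch: split a combined binary mask into one mask per connected
# component (4-connectivity). Assumes grouped objects do not touch.
from collections import deque
import numpy as np

def split_mask(mask):
    """Return a list of boolean masks, one per connected component."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    parts = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # BFS flood fill from this unvisited foreground pixel
        comp = np.zeros_like(mask)
        todo = deque([(sy, sx)])
        seen[sy, sx] = True
        while todo:
            y, x = todo.popleft()
            comp[y, x] = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    todo.append((ny, nx))
        parts.append(comp)
    return parts
```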
