Add new backbones trained with registers #282

Merged · 2 commits · Oct 27, 2023
Changes from all commits
MODEL_CARD.md: 87 changes (79 additions, 8 deletions)
@@ -1,22 +1,26 @@
# Model Card for DINOv2-S/B/L/g

-These are Vision Transformer models trained following the method described in the paper:
+These are Vision Transformer models trained following the method described in the papers:
"DINOv2: Learning Robust Visual Features without Supervision"
and
"Vision Transformers Need Registers".

-We provide 4 models: 1 ViT-g trained from scratch, and 3 ViT-S/B/L models distilled from the ViT-g.
+We provide 8 models:
+- 1 ViT-g trained from scratch and 3 ViT-S/B/L models distilled from the ViT-g, without registers.
+- 1 ViT-g trained from scratch and 3 ViT-S/B/L models distilled from the ViT-g, with registers.

## Model Details
-The model takes an image as input and returns a class token and patch tokens.
+The model takes an image as input and returns a class token, patch tokens, and optionally 4 register tokens.

The embedding dimension is:
- 384 for ViT-S.
- 768 for ViT-B.
- 1024 for ViT-L.
- 1536 for ViT-g.

-The models follow a Transformer architecture, with a patch size of 14.
+The models follow a Transformer architecture, with a patch size of 14. In the case of registers, we add 4 register tokens, learned during training, to the input sequence after the patch embedding.

-For a 224x224 image, this results in 1 class token + 256 patch tokens.
+For a 224x224 image, this results in 1 class token + 256 patch tokens, and optionally 4 register tokens.

The models can accept larger images provided the image shapes are multiples of the patch size (14).
If this condition is not met, the model will crop the input to the closest smaller multiple of the patch size.
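
As a quick sanity check on the numbers above, here is a small sketch of the resulting sequence length for an arbitrary input size, assuming the crop-to-the-closest-smaller-multiple behavior described above (the helper is purely illustrative and not part of the repository):

```python
# Illustrative helper: sequence length seen by a DINOv2 backbone, assuming
# inputs are cropped down to the closest smaller multiple of the patch size.
PATCH_SIZE = 14

def sequence_length(height: int, width: int, num_register_tokens: int = 0) -> int:
    h = (height // PATCH_SIZE) * PATCH_SIZE  # crop to closest smaller multiple
    w = (width // PATCH_SIZE) * PATCH_SIZE
    num_patch_tokens = (h // PATCH_SIZE) * (w // PATCH_SIZE)
    return 1 + num_patch_tokens + num_register_tokens  # 1 class token

print(sequence_length(224, 224))     # 257 = 1 class token + 256 patch tokens
print(sequence_length(224, 224, 4))  # 261, with the 4 register tokens
```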
@@ -63,10 +67,18 @@ Use the code below to get started with the model.

```python
import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
```
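
A minimal inference sketch follows. The `forward_features` output keys and shapes shown here follow the conventions used in the dinov2 codebase, but treat the exact names as assumptions to verify against the repository:

```python
import torch

# Load one of the backbones listed above (register variant shown here).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
model.eval()

# Dummy input: height and width must be multiples of the patch size (14).
img = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    cls_embedding = model(img)           # class-token embedding, (1, 384) for ViT-S
    feats = model.forward_features(img)  # dict of token-level features

# Assumed key names, following the dinov2 repository conventions:
print(feats["x_norm_clstoken"].shape)     # (1, 384)
print(feats["x_norm_patchtokens"].shape)  # (1, 256, 384)
print(feats["x_norm_regtokens"].shape)    # (1, 4, 384) for the *_reg models
```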

## Training Details
@@ -92,11 +104,11 @@ dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

## Evaluation

-We refer users to the associated paper for the evaluation protocols.
+We refer users to the associated papers for the evaluation protocols.

<table>
<tr>
-<th>model</th>
+<th colspan="2"></th>
<th colspan="3">ImageNet-1k</th>
<th>NYU-Depth v2</th>
<th>SUN-RGBD</th>
@@ -105,7 +117,8 @@ We refer users to the associated paper for the evaluation protocols.
<th>Oxford-H</th>
</tr>
<tr>
-<th rowspan="2">task</th>
+<th rowspan="2">model</th>
+<th rowspan="2">with <br /> registers</th>
<th>classif. (acc)</th>
<th>classif. (acc)</th>
<th>classif. V2 (acc)</th>
@@ -128,6 +141,7 @@ We refer users to the associated paper for the evaluation protocols.
</tr>
<tr>
<td>ViT-S/14</td>
<td align="center">:x:</td>
<td align="right">79.0%</td>
<td align="right">81.1%</td>
<td align="right">70.8%</td>
@@ -137,8 +151,21 @@ We refer users to the associated paper for the evaluation protocols.
<td align="right">69.5%</td>
<td align="right">43.2</td>
</tr>
<tr>
<td>ViT-S/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">79.1%</td>
<td align="right">80.9%</td>
<td align="right">71.0%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">67.6%</td>
<td align="right">39.5</td>
</tr>
<tr>
<td>ViT-B/14</td>
<td align="center">:x:</td>
<td align="right">82.1%</td>
<td align="right">84.5%</td>
<td align="right">74.9%</td>
@@ -147,9 +174,21 @@ We refer users to the associated paper for the evaluation protocols.
<td align="right">51.3</td>
<td align="right">76.3%</td>
<td align="right">49.5</td>
</tr>
<tr>
<td>ViT-B/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">82.0%</td>
<td align="right">84.6%</td>
<td align="right">75.6%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">73.8%</td>
<td align="right">51.0</td>
</tr>
<tr>
<td>ViT-L/14</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.3%</td>
<td align="right">77.6%</td>
@@ -159,8 +198,21 @@ We refer users to the associated paper for the evaluation protocols.
<td align="right">79.8%</td>
<td align="right">54.0</td>
</tr>
<tr>
<td>ViT-L/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.8%</td>
<td align="right">86.7%</td>
<td align="right">78.5%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">80.9%</td>
<td align="right">55.7</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.5%</td>
<td align="right">78.4%</td>
@@ -170,6 +222,19 @@ We refer users to the associated paper for the evaluation protocols.
<td align="right">81.6%</td>
<td align="right">52.3</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.7%</td>
<td align="right">87.1%</td>
<td align="right">78.8%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">81.5%</td>
<td align="right">58.2</td>
</tr>
</table>

## Environmental Impact
@@ -198,4 +263,10 @@ xFormers 0.0.18
journal={arXiv:2304.07193},
year={2023}
}
@misc{darcet2023vitneedreg,
title={Vision Transformers Need Registers},
author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
journal={arXiv:2309.16588},
year={2023}
}
```