Explanation of the method #1
by giacomov - opened
Cool space! @EduardoPacheco, do you have any pointer that explains the methodology used?
I looked into `app.py` expecting to see the extraction of the attention maps from the last layer; instead I found this rather obscure piece of code:
```python
with torch.no_grad():
    out = dino.forward_features(img_tensor)
features = out["x_prenorm"][:, 1:, :]
```
What does this last line do? What is `"x_prenorm"`, and why are we skipping the first element of the second dimension? Is that the CLS token?
Thanks for your work!
Hey @giacomov, I'm using the original implementation that the authors provided in this Space through `torch.hub`. You can take a look at `forward_features` here.
TL;DR
- `forward_features` passes the input tensor through the ViT model.
- `x_prenorm` is the last hidden state from the ViT, without the final `LayerNorm` applied.
- We skip the first token, the CLS token, because we only need the image patch token embeddings to make the visualizations.
- The double-PCA method I used is mentioned in the paper; it has also been discussed in the repo issues, and here is a good discussion.
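To make the steps above concrete, here is a minimal sketch of the pipeline using NumPy in place of torch tensors. The shapes, the mean-based foreground threshold, and the `pca` helper are illustrative assumptions, not the Space's exact code:

```python
import numpy as np

def pca(x, n_components):
    """Plain PCA via SVD: center the data, project onto top principal axes."""
    x_centered = x - x.mean(axis=0)
    # rows of vt are the principal directions, sorted by singular value
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

# Stand-in for the last hidden state of one image: 1 CLS token followed by
# 16x16 patch tokens, each a 384-dim embedding (ViT-S-like, chosen for illustration).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(1, 1 + 16 * 16, 384))

# Drop the CLS token (index 0 along the token axis) -> patch embeddings only.
patches = hidden[:, 1:, :].reshape(-1, 384)   # (256, 384)

# First PCA: 1 component, thresholded to separate foreground from background.
first = pca(patches, 1)[:, 0]
foreground = first > first.mean()             # illustrative threshold

# Second PCA: 3 components on the foreground patches only -> RGB channels.
rgb = pca(patches[foreground], 3)
# Min-max normalize each channel to [0, 1] for display.
rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0) + 1e-8)
```

Reshaping the normalized `rgb` values back onto the foreground patch grid gives the colored visualization; background patches are typically rendered black.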