# Transformer2DModel
A Transformer model for image-like data from CompVis that is based on the Vision Transformer introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
When the input is continuous:
- Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
- Apply the Transformer blocks in the standard way.
- Reshape to image (a minimal sketch of these steps follows the list).
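For illustration, here is a minimal sketch of the continuous path, assuming a recent `diffusers` release. The configuration values are illustrative, not from a pretrained checkpoint, and `in_channels` must be divisible by `norm_num_groups` (which defaults to 32):

```python
import torch
from diffusers import Transformer2DModel

# Continuous mode is selected by passing `in_channels`.
model = Transformer2DModel(
    num_attention_heads=4,
    attention_head_dim=8,  # inner feature_dimension = 4 * 8 = 32
    in_channels=32,        # divisible by norm_num_groups (default 32)
    num_layers=1,
)

# A batch of continuous latents: (batch_size, channels, height, width).
hidden_states = torch.randn(1, 32, 16, 16)

# Internally: project + reshape to (batch_size, sequence_length, feature_dimension),
# run the Transformer blocks, then reshape back to image form.
output = model(hidden_states).sample
print(output.shape)  # torch.Size([1, 32, 16, 16])
```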
When the input is discrete:
It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.
- Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
- Apply the Transformer blocks in the standard way.
- Predict classes of unnoised image (see the sketch after this list).
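And a sketch of the discrete path under the same assumptions. `num_vector_embeds` counts the latent-pixel classes including the masked class, so the model predicts log-probabilities over only `num_vector_embeds - 1` classes per latent pixel:

```python
import torch
from diffusers import Transformer2DModel

# Discrete mode is selected by passing `num_vector_embeds` (which includes the
# masked latent pixel class) together with `sample_size` (the latent grid size).
model = Transformer2DModel(
    num_attention_heads=4,
    attention_head_dim=8,
    num_vector_embeds=10,  # 9 real classes + 1 masked class
    sample_size=8,         # 8x8 grid = 64 latent pixels
    num_layers=1,
)

# A batch of latent pixel classes: (batch_size, num_latent_pixels).
latent_classes = torch.randint(0, 10, (1, 64))

# Internally: embed the classes, add positional embeddings, run the Transformer
# blocks, then predict log-probabilities over the unmasked classes only.
output = model(latent_classes).sample
print(output.shape)  # torch.Size([1, 9, 64]) -- no prediction for the masked class
```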
## Transformer2DModel
[[autodoc]] Transformer2DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput