---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- PleIAs/YouTube-Commons
- allenai/WildChat-1M
- Salesforce/xlam-function-calling-60k
- ShareGPT4Video/ShareGPT4Video
- OpenGVLab/ShareGPT-4o
- TempoFunk/webvid-10M
- MBZUAI/VideoInstruct-100K
- Isaak-Carter/j.o.s.i.e.v4.0.1o
- NousResearch/dolma-v1_7-c4
- NousResearch/dolma-v1_7-cc_en_head
- nyu-visionx/Cambrian-10M
- LargeWorldModel/ultrachat_qa_mix_1M
- LargeWorldModel/ultrachat_qa_mix_512K
- LargeWorldModel/ultrachat_qa_mix_256K
- LargeWorldModel/ultrachat_qa_mix_128K
- nkp37/OpenVid-1M
- HuggingFaceFV/finevideo
language:
- de
- en
library_name: mlx
tags:
- moe
- multimodal
- vision
- audio
- endtoend
- j.o.s.i.e.
---

# J.O.S.I.E. (Just a Smart and Intelligent Entity)

Welcome to the J.O.S.I.E. project repository! J.O.S.I.E. is a cutting-edge, super intelligent AI assistant designed to revolutionize the way we interact with smart home systems and general AI capabilities. This document provides an overview of J.O.S.I.E.'s features, capabilities, and development roadmap.
## Table of Contents

- [Updates](#updates)
- [Introduction](#introduction)
- [Features](#features)
- [Training Stages](#training-stages)
- [Current Progress](#current-progress)
- [Source Code](#source-code)
- [Contributing](#contributing)
- [License](#license)
- [Big Updates!](#big-updates)
## Updates

I'm currently creating the multimodal smart-home-management and tool-calling dataset in German and English.
## Introduction
J.O.S.I.E. stands for "Just a Smart and Intelligent Entity." It is not just a conversational AI assistant but a fully multimodal AI designed to understand and process images, videos, thermal images, depth, and audio in real-time. J.O.S.I.E. is built to autonomously manage smart homes and provide general-purpose assistance, with advanced capabilities accessible only to the main user.
## Features
- Real-Time Processing: J.O.S.I.E. operates in real-time, ensuring quick and efficient responses.
- Tool Calling: Capable of calling various tools to perform tasks (only for the main user).
- Short/Long-Term Memory: Remembers past interactions and uses this data to provide a more personalized experience.
- Secure Information Access: Accesses top-secret information upon receiving a special password from the main user.
- Contextual Greetings: Greets users based on contextual data such as time of day, birthdays, and more.
- Voice Interaction: Will support real-time voice responses with a response time under 0.3 ms.
- Advanced Multimodal Capabilities: Initially uses Meta's ImageBind model, transitioning to a self-implemented encoder.
- Uncensored Interaction: Full, uncensored interaction capabilities are reserved for the main user.
- Autonomous Smart Home Management: Manages smart home devices and systems autonomously.
## Training Stages
J.O.S.I.E.'s development is structured into several meticulously planned stages, each focusing on different aspects of its capabilities:
### Stage 1: Genesis
- Objective: Fine-tune the Large Language Model (LLM) with a custom dataset and prompt format. The LLMs used are Qwen2 7B and 0.5B.
- Outcome: A robust foundation for text-based interactions.
### Stage 2: Fusion
- Objective: Train the encoders separately using transfer learning to align their input embeddings with the text embeddings (a minimal sketch of this alignment step follows after the stage list).
- Outcome: Harmonized multimodal input processing.
### Stage 3: Synergy
- Objective: Fine-tune the LLM for multimodal reasoning using a custom dataset.
- Outcome: Enhanced reasoning capabilities across text and other modalities.
### Stage 4: Vocalize
- Objective: Fine-tune the decoder for audio output, giving J.O.S.I.E. a voice.
- Outcome: Synchronized text and audio responses.
### Stage 5: Convergence
- Objective: Perform full model fine-tuning for seamless integration of all components.
- Outcome: A fully multimodal, real-time interactive AI assistant.
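Stage 2's alignment objective is not spelled out anywhere in this card, so the snippet below is only a minimal sketch of what it could look like, assuming a small projection head on top of a modality encoder is trained with a symmetric contrastive loss against frozen text embeddings. The class and function names, dimensions, and loss choice are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAligner(nn.Module):
    """Projects a modality encoder's pooled output into the text embedding space."""
    def __init__(self, modality_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(modality_dim, text_dim, bias=False)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(modality_emb), dim=-1)

def alignment_loss(modality_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired (modality, text) embeddings together."""
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: align 1280-d vision features with 1024-d (frozen) text embeddings.
aligner = ModalityAligner(modality_dim=1280, text_dim=1024)
vision_emb = torch.randn(8, 1280)  # placeholder batch of vision encoder outputs
text_emb = torch.randn(8, 1024)    # placeholder batch of frozen text-encoder outputs
loss = alignment_loss(aligner(vision_emb), text_emb)
loss.backward()
```

The 1280-d and 1024-d sizes in the sketch mirror the vision trunk width and head output visible in the model dump further down.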
## Current Progress
J.O.S.I.E. is currently in its beta stage, specifically in Stage 1. The model is being actively developed, and the current version is focused on fine-tuning the LLM with custom datasets.
**Latest Beta Version 4 of Stage 1:**
For a sneak peek at the current progress, visit the GitHub Repo.
## Source Code

To see the latest updates on J.O.S.I.E.v4o, check out my GitHub repo.
## Contributing

I welcome contributions from you! To contribute to J.O.S.I.E., please fork the repository and create a pull request with your changes. Ensure that your code adheres to my coding standards and includes appropriate tests and comments.
## License

J.O.S.I.E. is licensed under the Apache 2.0 License. See the LICENSE file for more details.
## Big Updates!
I have finally trained the vision and audio encoder part. Big thanks to Facebook Research (Meta AI) for the ImageBind model, which is what I have built it on top of.

What I did was copy the weights from the original ImageBind model into a second, 'downcycled' ImageBindVisionAudioHuge model. After that, I continued training the model on a custom vision and audio dataset, using the contrastive learning objective Google introduced with PaliGemma, against the text embeddings from the original ImageBind model.
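To make the 'downcycling' and the contrastive step a bit more concrete, here is a rough, hypothetical sketch of both pieces: copying every weight whose name and shape match from the original ImageBind checkpoint into the smaller model, and a SigLIP-style pairwise sigmoid loss (the contrastive objective used in Google's PaliGemma work). Function names, hyperparameters, and the exact copying rule are placeholders, not the code that was actually used.

```python
import torch
import torch.nn.functional as F

def downcycle_state_dict(full_sd: dict, small_sd: dict) -> dict:
    """Copy every parameter from the full ImageBind checkpoint whose name and shape
    also exist in the smaller model; keep the small model's own init otherwise."""
    out = {}
    for name, small_param in small_sd.items():
        full_param = full_sd.get(name)
        out[name] = full_param.clone() if (
            full_param is not None and full_param.shape == small_param.shape
        ) else small_param
    return out

def sigmoid_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                             t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """SigLIP-style pairwise sigmoid loss: each (a, b) pair is treated as a binary
    classification problem, positives on the diagonal, negatives everywhere else."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() * t + b
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 on diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()

# Hypothetical usage (model variables are placeholders):
# small_model.load_state_dict(
#     downcycle_state_dict(full_imagebind.state_dict(), small_model.state_dict()))
# loss = sigmoid_contrastive_loss(vision_emb, frozen_text_emb)
```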
After merging the encoder with the test reasoner (Qwen2-0.5B-Instruct), I got successful inference on video, image, and audio. I will now slowly start writing the training script, creating the new dataset, and optimizing the model and inference code a little more, and then finally train the model.
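The merge with the reasoner is also not shown here, so what follows is only a plausible sketch of how the encoder's 1024-d joint embeddings could be handed to Qwen2-0.5B-Instruct: project them into the LLM's hidden size and prepend them to the text token embeddings, which a Hugging Face causal LM accepts through its `inputs_embeds` argument. The adapter class and the 896-d hidden size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Maps the encoder's 1024-d joint embeddings (see the modality_heads below)
    into the reasoner's hidden size so they can be spliced into its input sequence."""
    def __init__(self, encoder_dim: int = 1024, llm_hidden: int = 896):
        # 896 assumes Qwen2-0.5B's hidden size; adjust for a different reasoner.
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_hidden)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        # (batch, n_modality_tokens, encoder_dim) -> (batch, n_modality_tokens, llm_hidden)
        return self.proj(modality_emb)

def build_inputs_embeds(adapter: MultimodalAdapter,
                        modality_emb: torch.Tensor,
                        text_token_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected modality embeddings to the text token embeddings; the result
    can be passed to a Hugging Face causal LM through its `inputs_embeds` argument."""
    return torch.cat([adapter(modality_emb), text_token_embeds], dim=1)

# Toy shapes: one modality token per input, batch of 2, 896-d LLM embeddings.
adapter = MultimodalAdapter()
fused = build_inputs_embeds(adapter,
                            torch.randn(2, 1, 1024),   # placeholder encoder output
                            torch.randn(2, 16, 896))   # placeholder text embeddings
print(fused.shape)  # torch.Size([2, 17, 896])
```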
Here are the actual model layers:
```
ImageBindModelAudioVision(
(modality_preprocessors): ModuleDict(
(vision): RGBDTPreprocessor(
(cls_token): tensor((1, 1, 1280), requires_grad=True)
(rgbt_stem): PatchEmbedGeneric(
(proj): Sequential(
(0): PadIm2Video()
(1): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
)
(pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
(pos_embed): tensor((1, 257, 1280), requires_grad=True)
)
)
(audio): AudioPreprocessor(
(cls_token): tensor((1, 1, 768), requires_grad=True)
(rgbt_stem): PatchEmbedGeneric(
(proj): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10), bias=False)
(norm_layer): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
(pos_embed): tensor((1, 229, 768), requires_grad=True)
)
)
)
(modality_trunks): ModuleDict(
(vision): SimpleTransformer(
(pre_transformer_layer): Sequential(
(0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(1): EinOpsRearrange()
)
(blocks): Sequential(
(0): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(1): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(2): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(3): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(4): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(5): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(6): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(7): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(8): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(9): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(10): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(11): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(12): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(13): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(14): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(15): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(16): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(17): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(18): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(19): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(20): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(21): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(22): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(23): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(24): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(25): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(26): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(27): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(28): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(29): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(30): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
(31): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
)
)
(post_transformer_layer): EinOpsRearrange()
)
(audio): SimpleTransformer(
(pre_transformer_layer): Sequential(
(0): Identity()
(1): EinOpsRearrange()
)
(blocks): Sequential(
(0): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): Identity()
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(1): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.009)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(2): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.018)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(3): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.027)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(4): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.036)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(5): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.045)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(6): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.055)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(7): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.064)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(8): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.073)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(9): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.082)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(10): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.091)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(11): BlockWithMasking(
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(drop_path): DropPath(drop_prob=0.100)
(norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
(norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
)
(post_transformer_layer): EinOpsRearrange()
)
)
(modality_heads): ModuleDict(
(vision): Sequential(
(0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(1): SelectElement()
(2): Linear(in_features=1280, out_features=1024, bias=False)
)
(audio): Sequential(
(0): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(1): SelectElement()
(2): Linear(in_features=768, out_features=1024, bias=False)
)
)
(modality_postprocessors): ModuleDict(
(vision): Normalize()
(audio): Sequential(
(0): Normalize()
(1): LearnableLogitScaling(logit_scale_init=20.0,learnable=False, max_logit_scale=100)
)
)
)
```