---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- image-text-to-text
- text-to-text
- image-text-to-image-text
pipeline_tag: image-text-to-text
BaseModel:
- Mixtral_AI_Cyber_Matrix_2.0(7b)
Decoder:
- Locutusque/TinyMistral-248M-v2
ImageProcessor:
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
- Lin-Chen/ShareGPT4V-7B_Pretrained_vit-large336-l12
Encoder:
- google/vit-base-patch16-224-in21k
---

# LeroyDyer/Mixtral_AI_Cyber_Q_Vision

VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another as the decoder. The two halves are loaded with:

```python
# class method used to load the encoder
transformers.AutoModel.from_pretrained
# class method used to load the decoder
transformers.AutoModelForCausalLM.from_pretrained
```
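For illustration only, here is a minimal sketch of how the encoder and decoder listed in the metadata above might be combined with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`; this is an assumed reconstruction for demonstration, not the exact build script used for this repository:

```python
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Assumed components, taken from the checkpoint list above
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # loaded via AutoModel.from_pretrained
    "Locutusque/TinyMistral-248M-v2",      # loaded via AutoModelForCausalLM.from_pretrained
)

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("Locutusque/TinyMistral-248M-v2")

# tie the special tokens together so training and generation know where to start and pad
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# model, image_processor and tokenizer could then be saved together with save_pretrained(...)
```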
### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

Previous vision models have been hit-and-miss, because a multimodal model actually requires a lot of memory, GPU time and hard-drive space to create; past versions were attempts to merge the vision capabilities into the main Mistral model while still retaining its Mistral tag. After reading many Hugging Face articles, the backbone issue turned out to be the main obstacle when creating multimodal models. With the advent of tiny models we can leverage the decoding ability as a single "expert-ish" component within the model, by reducing the decoder to a fully trained tiny model. This decoder only produces decodings, not conversations, so it needs to respond with defined answers; in general it will produce captions, and since it is domain based it may be specialized in medicine, art, etc. The main LLM still needs to retain these components, hence the backbone method of instantiating a VisionEncoderDecoder model instead of a LLaVA-style model, which still needs wrangling to work correctly without spoiling the original transformers installation.

Previous experiments proved that the large Mistral model could be used as the decoder, but the total model then jumped to 13B parameters; with the tiny model, the size is only affected by the weight of that model (248M parameters).

This is an experiment in vision: the model has been created as a Mistral VisionEncoderDecoder, customized from:

```yaml
BaseModel:
- Mixtral_AI_Cyber_Matrix_2.0(7b)
Decoder:
- Locutusque/TinyMistral-248M-v2
ImageProcessor:
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
- Lin-Chen/ShareGPT4V-7B_Pretrained_vit-large336-l12
Encoder:
- google/vit-base-patch16-224-in21k
```

- **Developed by:** LeroyDyer
- **Model type:** image-text-to-image-text
- **Language(s) (NLP):** English

## How to Get Started with the Model

```python
from transformers import AutoProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# load an image from the IAM handwriting database
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# training setup: align the special tokens between processor and model
model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

pixel_values = processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = processor.tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)
loss = outputs.loss

# inference (generation)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

## Training Details

Currently the weights that join the blocks are raw and untrained; they NEED to be trained, since those tensors are presumably randomly initialized despite the pretrained starting blocks. The encoder and decoder modules are ready to be placed in train mode, and the main model (the LLM) will still need LoRA/QLoRA/PEFT fine-tuning. This model will stay in this state as a base training point, so later versions will be trained. The model is fully usable and still expected to score well. The tiny Mistral is also a great performer and a great block with which to begin a smaller experts model (later) or any multimodal project; it is like a mini pretrained BERT/LLaMA (Mistral is a clone of LLaMA/Alpaca!).

```python
from transformers import ViTImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
from datasets import load_dataset

image_processor = ViTImageProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# Mistral-style tokenizers have no CLS token, so use BOS as the decoder start token
model.config.decoder_start_token_id = tokenizer.bos_token_id
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values

labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(pixel_values=pixel_values, labels=labels).loss
```

### Model Architecture and Objective

```python
from transformers import MistralConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

# Initializing a ViT & Mistral style configuration
config_encoder = ViTConfig()
config_decoder = MistralConfig()

config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

# Initializing a ViT + Mistral model (with random weights) from the configurations above
model = VisionEncoderDecoderModel(config=config)

# Accessing the model configuration
config_encoder = model.config.encoder
config_decoder = model.config.decoder
# set the decoder config to causal LM with cross-attention
config_decoder.is_decoder = True
config_decoder.add_cross_attention = True

# Saving the model, including its configuration
model.save_pretrained("my-model")

# loading the model and config back from the saved folder
encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
```
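The Training Details section above notes that the language-model side will still need LoRA/QLoRA/PEFT fine-tuning. A minimal, hypothetical sketch of that setup, assuming the `peft` library is installed and that the decoder keeps the standard Mistral projection names (`q_proj`, `v_proj`), could look like the following; it freezes the vision encoder and trains only low-rank adapters on the decoder:

```python
from peft import LoraConfig, get_peft_model
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# freeze the pretrained vision encoder so only the language side is updated
for param in model.encoder.parameters():
    param.requires_grad = False

# LoRA settings are illustrative defaults, not tuned values
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # these module names exist in Mistral-style attention blocks;
    # they do not match the ViT encoder, so only the decoder gets adapters
    target_modules=["q_proj", "v_proj"],
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```

From here the wrapped model could be trained on `pixel_values`/`labels` batches like the ones built in the examples above, for instance with a standard `Seq2SeqTrainer` loop.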