somosnlp/GemmaColRAC-AeroExpert

NickyNicky

ejbejaranos committed on Apr 24

Commit 4b54df0 • 1 Parent(s): 1a2fe68

Update README.md (#2)


- Update README.md (85650469d6268508a974fe5cd7d12d99ae8a2529)

Co-authored-by: Edison Bejarano Sepulveda

Files changed (1)

README.md +198 -45

README.md CHANGED

@@ -14,21 +14,76 @@ widget:
 - text: >
     <bos><start_of_turn>system\n\nYou are a helpful AI assistant.\n\nResponde en formato json.\n\nEres un agente experto en la normativa aeronautica Colombiana.<end_of_turn>\n\n<start_of_turn>user\n\n¿Qué sucede con las empresas de servicios aéreos comerciales que no hayan actualizado su permiso de operación después del 31 de marzo de 2024?<end_of_turn>\n\n<start_of_turn>model
 ---
-# GemmaColRAC-AeroExpertV5 🛫
-This document offers a detailed overview of `GemmaColRAC-AeroExpertV5`, the fifth iteration of our model specialized in Colombian aeronautical regulations. It represents a qualitative leap over previous versions, with improved accuracy and more efficient use of GPU resources, reflecting our commitment to the sustainable, high-quality development of AI technologies for aviation.
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/0undo4kZc7OtfGI5nnAa8.png" alt="Imagen del Reglamento Aeronáutico Colombiano" style="width: 40%; max-height: 550px;">
 </p>
-## New Model Metadata
 - **Developed by:** [Edison Bejarano](https://huggingface.co/ejbejaranos), [Nicolai Potes](https://huggingface.co/NickyNicky) and [Santiago Pineda](https://huggingface.co/Sapinedamo) ✨
-- **Model Name:** GemmaColRAC-AeroExpertV4
 - **GPU Type:** NVIDIA GeForce RTX 3090
 - **Total Training Time:** 12607 seconds
 - **Optimizer:** AdamW with Bitfitting and Neutrino Noise
@@ -40,73 +95,171 @@ Este documento ofrece una visión detallada de `GemmaColRAC-AeroExpertV5`, la qu
 - **Quantization Methods:** bf16 with gradient_accumulation_steps of 2
 - **Activation Function:** gelu_pytorch_tanh
-## Comparison with the Previous Version
-The previous version of `GemmaColRAC-AeroExpertV4` used an NVIDIA A100-SXM4-40GB GPU, with a total training time of approximately 50 minutes (3007 seconds). It operated with a learning rate of 0.00005 and used a Paged AdamW 8bit optimizer. It was also trained with a per-device batch size of 1 and Transformers version 4.39.0.
-The key differences from the current version include:
-- **GPU Upgrade:** Switched from NVIDIA A100-SXM4-40GB to NVIDIA GeForce RTX 3090, offering better training performance.
-- **Training Time:** Increased to allow more extensive fine-tuning of the model, resulting in improved accuracy.
-- **Batch Size:** Increased the per-device batch size from 1 to 2, allowing more efficient optimization.
-- **Optimizer Update:** Introduction of advanced techniques such as Bitfitting and Neutrino Noise to improve model convergence.
-- **Maximum Steps:** Significant increase in maximum steps from 1638 to 4904, suggesting broader coverage of the data and deeper learning.
-These changes have resulted in a more robust and efficient version of our model, strengthening its ability to assist and provide guidance on Colombian aeronautical regulation.
-## Evaluation
-To evaluate `GemmaColRAC-AeroExpertV4`, we have set up platforms where experts in the field can run tests. These platforms provide an interactive environment in which users can try the model on various scenarios from Colombian aeronautical regulations and verify its performance and accuracy. Visit:
-- [Evaluation of GemmaColRAC-AeroExpertV4](https://somosnlp-rac-col-v1.hf.space)
-## Environmental Impact
-The development of `GemmaColRAC-AeroExpertV4` has been carried out with a focus on sustainability. We have worked to optimize efficiency and minimize environmental impact, including reducing energy consumption and lowering the carbon footprint during the training of our model. This not only improves operational efficiency but also supports our environmental responsibility goals.
-## Model Fine-Tuning
-To adapt and improve `GemmaColRAC-AeroExpertV4` for specific tasks or datasets, we provide a Jupyter notebook that guides users through the fine-tuning process.
-The notebook includes the following steps:
-- Environment setup: installing the required libraries and checking that suitable hardware (for example, a GPU) is available.
-- Data loading: instructions for importing your custom dataset.
-- Preprocessing: techniques for preparing and processing the data before training.
-- Fine-Tuning: detailed code for fine-tuning the `GemmaColRAC-AeroExpertV4` model, including hyperparameter configuration.
-- Evaluation: methods for evaluating the effectiveness of the fine-tuned model on your specific task.
-- Saving and loading the model: instructions for saving the fine-tuned model and loading it for future predictions or analysis.
-You can find the fine-tuning notebook at the following link:
-[Fine-Tuning Notebook for GemmaColRAC-AeroExpertV4](https://colab.research.google.com/drive/1VmcSVvkaXVe-ya5ATDxKilPY9kN-x2_I?usp=sharing)
-This resource is designed to be accessible to users of all technical skill levels, from beginners to machine learning experts.
-## Environmental Impact
-Given the use of an NVIDIA V100 GPU for approximately 4.67 hours, the carbon emissions can be estimated using the Machine Learning Impact calculator. This tool accounts for the hardware type, runtime, and other factors to provide a comprehensive view of the environmental impact of training large AI models.
-- **Hardware Type:** NVIDIA 3090 GPU
-- **Hours used:** ~3.0
-- **Carbon Emitted:** 356.25
-# Constants
-power_consumption_kW = 0.25  # 250 watts in kW
-runtime_hours = 3.0
-carbon_intensity_gCO2eq_per_kWh = 475  # Global average carbon intensity
-# Calculate carbon emissions
-carbon_emitted_gCO2eq = power_consumption_kW * runtime_hours * carbon_intensity_gCO2eq_per_kWh
-carbon_emitted_gCO2eq = 356.25
-## More Information
-For more details about `GemmaColRAC-AeroExpertV4`, including access to the model and its full capabilities, visit our [Hugging Face repository](https://huggingface.co/ejbejaranos/GemmaColRAC-AeroExpertV4).

 - text: >
     <bos><start_of_turn>system\n\nYou are a helpful AI assistant.\n\nResponde en formato json.\n\nEres un agente experto en la normativa aeronautica Colombiana.<end_of_turn>\n\n<start_of_turn>user\n\n¿Qué sucede con las empresas de servicios aéreos comerciales que no hayan actualizado su permiso de operación después del 31 de marzo de 2024?<end_of_turn>\n\n<start_of_turn>model
 ---
+# Model Card for GemmaColRAC-AeroExpert Language Model: Gemma 2B for Colombian Aviation Regulations 🛫
+## Model Details
+### Model Description
+This document offers a detailed overview of `GemmaColRAC-AeroExpert`, the fifth iteration of our model specialized in Colombian aeronautical regulations. It represents a qualitative leap over previous versions, with improved accuracy and more efficient use of GPU resources, reflecting our commitment to the sustainable, high-quality development of AI technologies for aviation.
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/0undo4kZc7OtfGI5nnAa8.png" alt="Imagen del Reglamento Aeronáutico Colombiano" style="width: 40%; max-height: 550px;">
 </p>
 - **Developed by:** [Edison Bejarano](https://huggingface.co/ejbejaranos), [Nicolai Potes](https://huggingface.co/NickyNicky) and [Santiago Pineda](https://huggingface.co/Sapinedamo) ✨
+- **Funded by:** Fundación Universitaria Los Libertadores, SomosNLP, HuggingFace
+- **Model type:** Specialized Language Model for Colombian Aeronautical Regulations
+- **Language(s):** Spanish (`es-CO`)
+- **License:** apache-2.0
+- **Fine-tuned from model:** [More Information Needed]
+- **Dataset used:** [RAC Corpus: Base de Datos del Reglamento Aeronáutico Colombiano 🛫📚🇨🇴](https://huggingface.co/datasets/somosnlp/Reglamento_Aeronautico_Colombiano_2024/blob/01bf7eebef40aaba374ffd30697582ab10ec3503/README.md)
+### Model Sources
+- **Demo:** [Model Demo on HuggingFace Spaces](https://huggingface.co/spaces/somosnlp/ColombiaRAC-V1)
+- **Video presentation:** [Aviación Inteligente: LLMs para Navegar el RAC | Hackathon](https://youtu.be/IGKU1qUur2c?si=Na4d3XIU3vbdaaJj)
+## Uses
+### Direct Use
+`GemmaColRAC-AeroExpert` is designed to assist professionals and students in the aviation industry by providing easier access to the Colombian Aeronautical Regulations through natural-language question answering.
+### Out-of-Scope Use
+This model is not intended for making legally binding decisions without human oversight.
+## Bias, Risks, and Limitations
+The model may inherit biases from the data used for training, which primarily includes official legal texts. Users should exercise caution and not rely solely on the model for critical decision-making.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```python
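+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "somosnlp/GemmaColRAC-AeroExpert"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Text generation needs the language-modeling head, which AutoModel does not load
+model = AutoModelForCausalLM.from_pretrained(model_id)
+
+# Tokenize an example query and generate a response
+encoded_input = tokenizer("Example query about aviation regulations", return_tensors="pt")
+output_ids = model.generate(**encoded_input, max_new_tokens=256)
+print(tokenizer.decode(output_ids[0], skip_special_tokens=True))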
+```
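The widget example at the top of this card wraps the conversation in Gemma-style turn markers. A minimal sketch of building such a prompt is shown below. `build_prompt` is a hypothetical helper, and the single-newline separator is an assumption based on the widget text, so it may need adjusting to match the layout used in the training notebook.

```python
def build_prompt(system: str, user: str, sep: str = "\n") -> str:
    """Assemble a Gemma-style chat prompt with a system turn and a user turn.

    The turn markers mirror the widget example in this card; `sep` is an
    assumption and may need to match the format used during fine-tuning.
    """
    return (
        f"<bos><start_of_turn>system{sep}{system}<end_of_turn>{sep}"
        f"<start_of_turn>user{sep}{user}<end_of_turn>{sep}"
        f"<start_of_turn>model{sep}"
    )


system_msg = (
    "You are a helpful AI assistant.\n"
    "Responde en formato json.\n"
    "Eres un agente experto en la normativa aeronautica Colombiana."
)
prompt = build_prompt(system_msg, "¿Qué es el RAC?")
print(prompt)
```

Because the string already begins with `<bos>`, passing `add_special_tokens=False` when tokenizing it avoids a second beginning-of-sequence token being prepended.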
+## Training Details
+### Training Data
+The model was trained on a curated dataset consisting of detailed question-answer pairs related to the Colombian Aeronautical Regulations.
+### Training Procedure
+The model was fine-tuned from a base language model using the following specifications:
 - **GPU Type:** NVIDIA GeForce RTX 3090
 - **Total Training Time:** 12607 seconds
 - **Optimizer:** AdamW with Bitfitting and Neutrino Noise
 - **Quantization Methods:** bf16 with gradient_accumulation_steps of 2
 - **Activation Function:** gelu_pytorch_tanh
+- [Notebook to train the model](https://colab.research.google.com/drive/1VmcSVvkaXVe-ya5ATDxKilPY9kN-x2_I?usp=sharing)
+### Comparison with Previous Version 🔄
+The previous iteration, `GemmaColRAC-AeroExpertV4`, utilized an NVIDIA A100-SXM4-40GB GPU and was trained for approximately 50 minutes (3007 seconds). It operated with a learning rate of 0.00005 and used an 8-bit Paged AdamW optimizer. Furthermore, it was trained with a batch size per device of 1 and utilized version 4.39.0 of the Transformers library.
+**Key differences with the current version include:**
+- **GPU Change:** 🆙 Switched from NVIDIA A100-SXM4-40GB to NVIDIA GeForce RTX 3090, a consumer-grade card with 24 GB of memory.
+- **Training Time:** ⏳ Increased to allow more extensive fine-tuning of the model, resulting in improved accuracy.
+- **Batch Size:** 🔢 Increased the batch size per device from 1 to 2, allowing for more efficient optimization.
+- **Optimizer Upgrade:** 🛠️ Introduction of advanced techniques such as Bitfitting and Neutrino Noise to enhance model convergence.
+- **Maximum Steps:** 🚶‍♂️ Significantly increased the maximum steps from 1638 to 4904, suggesting a broader coverage of data and deeper learning.
+These changes have resulted in a more robust and efficient version of our model, enhancing its capacity to assist and provide guidance in Colombian aeronautical regulation.
+#### Training Hyperparameters
+- **Training regime:** bf16 mixed precision
+- **Optimizer:** Paged AdamW 8-bit
+- **Learning Rate:** 5e-5
+- **Batch Size per Device:** 3
+- **Gradient Accumulation Steps:** 4
+- **Warmup Steps:** Computed as 3% of total steps
+- **Max Steps:** 14,688
+- **Total Training Time:** Approx. 5 hours 21 minutes (based on epochs and iteration speed)
+- **Max Sequence Length:** 2048
+- **Weight Decay:** 0.001
+- **Learning Rate Scheduler:** Cosine
+- **Adam Betas:** Beta1 = 0.99, Beta2 = 0.995
+- **Max Gradient Norm:** 0.4
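Several of the values listed above are derived quantities. The sketch below shows how the effective batch size, the warmup step count, and a cosine learning-rate schedule with linear warmup could be computed from them. It is an illustration under stated assumptions: single-GPU training, rounding the 3% warmup up to a whole step, and decay to zero at the last step. The training notebook may use different conventions.

```python
import math

# Values taken from the hyperparameter list above
PER_DEVICE_BATCH_SIZE = 3
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 14_688
BASE_LR = 5e-5
WARMUP_PERCENT = 3

# Sequences contributing to each optimizer update (assuming a single GPU)
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS

# 3% of total steps; rounding up is an assumption about the convention used
warmup_steps = math.ceil(MAX_STEPS * WARMUP_PERCENT / 100)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (MAX_STEPS - warmup_steps)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for s in (0, warmup_steps, MAX_STEPS // 2, MAX_STEPS):
    print(s, lr_at(s))
```

The schedule reaches the base rate of 5e-5 at the end of warmup and falls to zero at the final step, which is consistent with a final learning rate of 0 at the end of training.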
+#### Speeds, Sizes, Times
+- **Training Duration:** Approx. 3 hours 30 minutes for full training
+- **Training Throughput:** 0.76 iterations per second (it/s)
+- **Total Steps:** 14,688 steps over 8 epochs
+- **Checkpoint Size:** Final model size was not specified; typical sizes for models of this type are several gigabytes.
+- **Total Number of Trainable Parameters:** 78,446,592
+[More Information Needed]
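The reported count of 78,446,592 trainable parameters can be cross-checked against a LoRA setup. The sketch below assumes the published Gemma 2B layer shapes (hidden size 2048, MLP size 16384, 18 layers, 8 query heads and 1 key/value head of dimension 256) and a hypothetical configuration with adapters on all seven linear projections in each layer. The card does not state the LoRA rank or target modules, so this is only one configuration that reproduces the number, not a confirmed setting.

```python
# Assumed Gemma 2B shapes; these are not stated in the card
HIDDEN, MLP, LAYERS = 2048, 16384, 18
Q_OUT, KV_OUT = 8 * 256, 1 * 256

# (in_features, out_features) of each linear projection assumed to carry an adapter
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted weight matrix."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * LAYERS


for r in (8, 16, 32, 64):
    print(r, lora_trainable_params(r))
```

Under these assumptions, a rank of 64 yields exactly 78,446,592 parameters, matching the figure in the list above.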
+### Metrics
+Here is a detailed summary of the training metrics for `GemmaColRAC-AeroExpert`:
+- **Total Floating Point Operations (FLOPs):** 204,241,541,673,615,360
+- **Train Loss:** 0.393565042567292 (final reported loss)
+- **Training Runtime:** 10,763.56 seconds (approximately 2.99 hours)
+- **Samples per Second:** 4.556
+- **Steps per Second:** 0.456
+- **Total Training Epochs:** 2
+- **Total Training Steps:** 4,904
+- **Gradient Norm:** 3.515625
+- **Final Learning Rate:** 0 (end of training)
+- **Average Loss over Training:** 0.1934
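The throughput figures in this list can be checked against each other with simple arithmetic. The sketch below recomputes the runtime in hours and the steps per second from the reported runtime and step count, and derives the number of samples consumed per optimizer step implied by the two reported rates. The variable names are illustrative.

```python
# Values reported in the metrics list above
train_runtime_s = 10_763.56
total_steps = 4_904
samples_per_second = 4.556

runtime_hours = train_runtime_s / 3600
steps_per_second = total_steps / train_runtime_s
# Samples consumed per optimizer step implied by the two reported rates
samples_per_step = samples_per_second / steps_per_second

print(f"{runtime_hours:.2f} h")
print(f"{steps_per_second:.3f} steps/s")
print(f"{samples_per_step:.1f} samples/step")
```

The recomputed runtime and step rate agree with the reported 2.99 hours and 0.456 steps per second, and the two rates together imply roughly 10 samples per optimizer step.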
+### Results
+<p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/zuEetm8ifT5e3QtHfBBVD.png" alt="Training Loss" style="width: 80%; max-height: 350px;">
+</p>
+## Model Examination
+The model's performance in simplifying the RAC's content was evaluated using feedback from aeronautical experts, with the goal of improving regulatory compliance and understanding.
+<p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/5iPvAhaTMnqRDBn2g7XIK.png" alt="Evaluation for model by Aeronautical experts" style="width: 40%; max-height: 550px;">
+</p>
+The table above summarizes 276 expert-rated tests, with an average score of 7. RAC 3 scored low (mean 3.464, median 1), which marks it as an area needing improvement, while RACs 1 and 5 received high ratings. These results suggest the model is accurate and generalizes well overall, though its handling of RAC 3 requires adjustment.
+## Environmental Impact 🌱
+The development of `GemmaColRAC-AeroExpert` has been carried out with a strong focus on sustainability 🌿. Efforts have been made to optimize efficiency and minimize environmental impact, including reducing energy consumption and lowering the carbon footprint during the model's training process. These measures not only enhance operational efficiency but also align with our commitment to environmental responsibility 🌎.
+### Energy Consumption and Carbon Emissions 📉
+- **Power Consumption:** 0.25 kW (250 watts)
+- **Runtime Hours:** 3.6 hours
+- **Carbon Intensity:** 475 gCO2eq per kWh (Global average)
+The training run used an NVIDIA GeForce RTX 3090 GPU for approximately 3.6 hours. The estimated emissions are:
+- **Hardware Type:** NVIDIA GeForce RTX 3090 GPU
+- **Total Hours Used:** ~3.6 hours
+- **Total Carbon Emitted:** Approximately 356.25 grams of CO₂ equivalents (this figure equals 0.25 kW × 3.0 h × 475 gCO2eq/kWh; with the 3.6-hour runtime listed here, the same formula gives 427.5 g)
+These carbon emissions were calculated using the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute) introduced in Lacoste et al. (2019), which considers hardware type, runtime, and other relevant factors to provide a comprehensive view of the environmental impact of training large AI models 📊.
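The estimate reduces to a product of power draw, runtime and grid carbon intensity. The sketch below is not the Machine Learning Impact Calculator itself. It is a minimal reproduction of that arithmetic using the constants listed in this section, with an illustrative function name, evaluated for both a 3.0-hour and a 3.6-hour runtime.

```python
def carbon_emitted_g(power_kw: float, runtime_hours: float, intensity_g_per_kwh: float) -> float:
    """Energy used (kWh) multiplied by grid carbon intensity (gCO2eq per kWh)."""
    return power_kw * runtime_hours * intensity_g_per_kwh


POWER_KW = 0.25          # 250 watts
CARBON_INTENSITY = 475   # gCO2eq per kWh, global average

for hours in (3.0, 3.6):
    print(hours, carbon_emitted_g(POWER_KW, hours, CARBON_INTENSITY))
```

A 3.0-hour runtime reproduces the 356.25 g figure, while a 3.6-hour runtime gives about 427.5 g.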
+This proactive approach to understanding and mitigating our ecological footprint underlines our commitment to pioneering environmentally friendly AI development practices, setting a benchmark for sustainability within the AI industry 🌟.
+#### Hardware
+- **Hardware Used:** NVIDIA GeForce RTX 3090
+#### Software 🛠️
+The `GemmaColRAC-AeroExpert` model was developed and trained using a comprehensive stack of modern software libraries designed for high-performance machine learning tasks, particularly in Natural Language Processing (NLP). Here are the key libraries and tools used:
+- **Python Libraries:**
+  - `json`: For parsing JSON files and handling serialization 📄.
+  - `pandas`: A powerful data manipulation and analysis library providing data structures and operations for manipulating numerical tables and time series 📊.
+  - `torch`: PyTorch is an open-source machine learning library used for applications such as computer vision and natural language processing, developed by Facebook's AI Research lab (FAIR) 🔥.
+  - `datasets`: A lightweight and extensible library to easily share and access datasets and evaluation metrics for machine learning tasks 📚.
+  - `huggingface_hub`: Used for managing model repositories on Hugging Face and interacting with Hugging Face Hub APIs 🌐.
+- **Hugging Face Ecosystem:**
+  - `transformers`: Provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages. It's designed to be both user-friendly for machine learning researchers and efficient to use in production 🤖.
+  - `BitsAndBytesConfig`, `TrainingArguments`: Configuration classes from the Transformers library; `BitsAndBytesConfig` sets up quantized model loading, and `TrainingArguments` holds the training hyperparameters ⚙️.
+  - `pipeline`: A utility for creating easy-to-use pipelines for various NLP tasks 🧪.
+  - `AutoModelForCausalLM`, `AutoTokenizer`: Utilities for loading and initializing pre-trained language models and their tokenizers 📝.
+  - `logging`: For configuring the logging level and output formats to track model training and inference processes effectively 📌.
+- **PEFT and LoRA Extensions:**
+  - `LoraConfig`, `PeftModel`: Extensions from the PEFT (Parameter Efficient Fine-Tuning) library, which include LoRA (Low-Rank Adaptation of large models), allowing efficient fine-tuning and adaptation of large pre-trained models with minimal computational overhead 🚀.
+- **Transformer Reinforcement Learning (TRL):**
+  - `SFTTrainer`: A trainer from the TRL library for supervised fine-tuning (SFT) of language models on instruction or conversational data 🎮.
+These tools collectively support the robust training environment necessary to develop state-of-the-art NLP models like `GemmaColRAC-AeroExpert`, ensuring that the model is both highly effective and efficient in processing and understanding complex regulatory texts.
+## License 📜
+`GemmaColRAC-AeroExpert` is released under the Apache 2.0 license 🏷️. This license is one of the most permissive and widely used licenses in the open-source community, allowing for both academic and commercial use without significant restrictions.
+- **Why Apache 2.0?** 🤔
+  - **Openness:** The Apache 2.0 license allows users to use, modify, and distribute the software freely, which encourages innovation and widespread use.
+  - **Protection:** It provides an explicit grant of patent rights from contributors to users, which reduces their exposure to patent litigation.
+  - **Commercial friendly:** Apache 2.0 is business-friendly, allowing the commercial use of the software which is crucial for wider adoption in industry settings.
+By choosing Apache 2.0, we ensure that `GemmaColRAC-AeroExpert` can be freely used and integrated into a wide array of projects and products, from academic research to commercial applications, thus supporting the growth and accessibility of AI technologies across different sectors 🌐.
+## Glossary
+- **RAC**: Reglamento Aeronáutico Colombiano
+## More Information
+This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. The model was trained using GPUs sponsored by their own team.
+## Team 👥
+The development of the `GemmaColRAC-AeroExpert` model was supported by a dedicated team of experts specializing in machine learning, natural language processing, and aeronautics. Below are the key team members who contributed significantly to this project:
+- [Edison Bejarano](https://huggingface.co/ejbejaranos) - Lead AI Scientist, expert in NLP and machine learning, with a strong background in aeronautics.
+- [Nicolai Potes](https://huggingface.co/NickyNicky) - Data Scientist, specializes in AI-driven regulatory compliance solutions.
+- [Santiago Pineda](https://huggingface.co/Sapinedamo) - Project Manager and Senior ML Engineer, with extensive experience in deploying scalable AI solutions.
+- [Alec Mauricio](https://huggingface.co/alecrosales1) - AI Researcher, focused on developing innovative models for text analysis and interpretation.
+- [Danny Stevens](https://huggingface.co/dannystevens) - Software Engineer, provides expertise in software development and integration for machine learning applications.
+These individuals bring a wealth of knowledge and expertise, ensuring the highest quality and performance of the `GemmaColRAC-AeroExpert` model. Their collaborative efforts have been pivotal in pushing the boundaries of what's possible with AI in the aviation sector.
+## Contact
+Ejbejaranos@gmail.com