Unlocking the Power of Locally Running Llama-3 8B Model Agents with Chat-UI! 🔥🚀✨
I'm thrilled to share my hackathon-style side project:
1. Finetuned Llama-3 8B for function calling using PEFT QLoRA, since the instruct Llama-3 model doesn't support it out of the box (a minimal sketch of the setup follows right after this list). The Colab notebook is here: https://lnkd.in/ggJMzqh2 🛠️
2. The finetuned model along with the 4-bit quants is here: https://lnkd.in/gNpFKY6V ✨
3. Cloned Hugging Face Chat-UI https://lnkd.in/gKBKuUBQ and made it compatible with function calling for my model and local-inference use case with Ollama, building on the PR https://lnkd.in/gnqFuAd4. This was a steep learning curve; I stayed awake the whole night to get it working. 💪🏽
4. On top of that, I used SerpAPI for web browsing and the MongoDB Atlas free tier to persist conversations and assistant configs. 🔎
5. More work is needed on switching between using tools and responding directly; that's where I see the model break. 🧐
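For point 1, here is a minimal sketch of the QLoRA fine-tuning setup, assuming the standard transformers + PEFT + bitsandbytes stack; the LoRA hyperparameters and target modules below are illustrative rather than the exact ones from my notebook.

```python
# Minimal sketch of the QLoRA fine-tuning setup (hyperparameters and target modules
# are illustrative, not necessarily the exact ones used in the notebook).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the base model in 4-bit (NF4) with bf16 compute, i.e. the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare for k-bit training and attach LoRA adapters to attention/MLP projections.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train on a function-calling dataset (chat turns with tool-call targets)
# with your preferred trainer, then export the adapter / merged model for Ollama.
```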
How cool is it that we're approaching a ChatGPT-like experience with a locally hosted agent model running on your laptop! 💻
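For reference, the local wiring in Chat-UI's .env.local looks roughly like this; the exact field names can differ between chat-ui versions, and the model and key values below are placeholders, but the idea is an Ollama endpoint for inference, SerpAPI for web search, and MongoDB Atlas for persistence.

```
# Rough sketch of chat-ui's .env.local for this setup; field names may differ by version,
# and the model/key values are placeholders.
MONGODB_URL=mongodb+srv://<user>:<password>@<cluster>.mongodb.net  # MongoDB Atlas free tier
SERPAPI_KEY=<your-serpapi-key>                                     # web search tool

# Point chat-ui at the locally running Ollama server for inference.
MODELS=`[
  {
    "name": "llama-3-8b-function-calling",
    "endpoints": [
      { "type": "ollama", "url": "http://127.0.0.1:11434", "ollamaName": "llama3-fc" }
    ]
  }
]`
```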
Some highlights 📝:
1. FSDP+QLoRA and DeepSpeed Stage-3+QLoRA
2. Layer expansion + LoRA
3. DoRA support for Conv2D layers and quantized bitsandbytes layers
4. New LoftQ utility
5. Batched inference for mixed LoRA adapters
The Answer.AI team, in collaboration with bitsandbytes and Hugging Face 🤗, open-sourced code enabling FSDP+QLoRA and explained the whole process in their insightful blogpost: https://lnkd.in/g6jgfXyv. This is now integrated into the Hugging Face ecosystem.
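As I understand it from the blogpost, the key enabler is storing the packed 4-bit weights in a dtype FSDP can shard. A minimal sketch of the model-loading side (the FSDP sharding settings themselves live in a separate accelerate config and aren't shown):

```python
# Sketch of the model-loading side of FSDP+QLoRA (FSDP sharding settings live in a
# separate accelerate config file and are not shown here).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Storing the packed 4-bit weights in bf16 is what allows FSDP to shard them.
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative base model
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
# Then attach a LoRA adapter as usual and launch with something like
#   accelerate launch --config_file fsdp_config.yaml train.py
# so accelerate wraps the quantized model with FSDP across GPUs.
```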
Kudos to the Answer.AI team, Titus von Köller, Younes Belkada, Benjamin Bossan and Zachary Mueller for all the help, without which this wouldn't have been possible. 🤗
For efficient depthwise layer expansion, akin to mergekit's passthrough method but without using additional memory, and for attaching LoRAs to the expanded layers, refer to the details below! 🔥 https://lnkd.in/ge95ztjA
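If I understand the feature correctly, this is exposed through a layer-replication option on the LoRA config; a hedged sketch, with illustrative layer ranges:

```python
# Hedged sketch: depthwise layer expansion via PEFT's layer_replication option,
# assuming it takes [start, end) ranges of base layers to stack. Ranges are illustrative.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
    # Stack layers 0-23 followed by layers 8-31: the original weights are shared
    # (no extra memory), while each replicated layer gets its own LoRA adapter.
    layer_replication=[[0, 24], [8, 32]],
)
```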
DoRA is now supported for Conv2D layers as well as bitsandbytes quantized layers ✨. For more details, please refer to the thread below. https://lnkd.in/gsJbuWPD
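Enabling it is a one-flag change on the LoRA config; a minimal sketch with illustrative target modules:

```python
# Minimal sketch: DoRA is enabled via use_dora=True on LoraConfig, and it now also
# works on Conv2D targets and bitsandbytes-quantized linear layers.
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for a vision model, Conv2D module names work too
    use_dora=True,  # decompose the update into magnitude and direction (DoRA)
    task_type="CAUSAL_LM",
)
```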
You can now mix different LoRA adapters in a batch during inference. This speeds things up by avoiding multiple passes through the base model, which would otherwise be needed when running each adapter separately with batch_size=1! ⚡️ Details below. https://lnkd.in/gD-pcX_B
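Roughly, you pass per-row adapter names to generate; a sketch assuming two adapters are already available (the adapter repo ids are hypothetical placeholders), where "__base__" routes a row through the base model with no adapter:

```python
# Sketch of mixed-adapter batched inference with PEFT; the adapter repo ids are
# hypothetical placeholders. "__base__" routes a row through the base model directly.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

# Load two different LoRA adapters onto the same base model.
model = PeftModel.from_pretrained(base, "user/llama3-translation-lora", adapter_name="translation")
model.load_adapter("user/llama3-summarization-lora", adapter_name="summarization")

prompts = [
    "Translate to French: The weather is nice today.",
    "Summarize: PEFT lets you fine-tune large models cheaply by ...",
    "Write a haiku about GPUs.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(base.device)

# One batched generate call; each row is routed through its own adapter, so the
# base model forward pass is shared instead of being recomputed per adapter.
outputs = model.generate(
    **inputs,
    adapter_names=["translation", "summarization", "__base__"],
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```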
LoftQ reduces quantization error by appropriately initializing the LoRA adapter weights. Normally, this is a two-step process. Benjamin Bossan added a new utility, replace_lora_weights_loftq, that applies LoftQ on the fly with bnb.
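A minimal sketch of the on-the-fly usage, assuming a bnb-4bit-quantized model with a fresh LoRA adapter attached (default arguments shown; the utility may also need access to the original, unquantized checkpoint):

```python
# Sketch of on-the-fly LoftQ initialization: load the model in 4-bit, attach a fresh
# LoRA adapter, then rewrite its weights so they compensate the quantization error.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

lora_config = LoraConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)

# Replace the freshly initialized LoRA weights with a LoftQ-style initialization that
# approximates the difference between the original and the 4-bit weights.
replace_lora_weights_loftq(peft_model)
```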
For more details, refer to the release notes. 📝 https://lnkd.in/gg7-AmHA. As always, make sure the losses go down and enjoy watching your model train!