Spaces:
Runtime error
Runtime error
File size: 1,704 Bytes
330f643 c826025 4e336ec c826025 330f643 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
---
title: Image To Audio
emoji: 📢
colorFrom: gray
colorTo: yellow
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
---
# The Image Reader 📢
[The Image Reader 📢 - Playground](https://huggingface.co/spaces/thivav/image-to-audio)
This application analyzes the uploaded image, generates an imaginative phrase, and then converts it into audio.
- For **image_to_audio** following technologies were used:
- **Image Reader:**
- HuggingFace ```image-to-text``` task used with ```Salesforce/blip-image-captioning-base``` pretrained model. Which produces a small description about the image.
- [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- **Generate an imaginative phrase:**
- OpenAI ```GPT-3.5-Turbo``` used to produce an imaginative narrative from the description generated earlier.
- The phrase generated with more than 40 words.
- [GPT-3.5 Turbo](https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates)
- **text-to-audio:**
- ```suno/bark-small``` used to generate the audio version of the imaginative narrative earlier.
- [suno/bark-small](https://huggingface.co/suno/bark-small)
- **BARK**: Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying.
|