Spaces:

thivav
/

image-to-audio

Runtime error

App Files Files Community

image-to-audio / README.md

thivav

README.md updated

4e336ec 11 months ago

preview code

raw

history blame contribute delete

1.7 kB

	---
	title: Image To Audio
	emoji: 📢
	colorFrom: gray
	colorTo: yellow
	sdk: streamlit
	sdk_version: 1.29.0
	app_file: app.py
	pinned: false
	---

	# The Image Reader 📢

	[The Image Reader 📢 - Playground](https://huggingface.co/spaces/thivav/image-to-audio)

	This application analyzes the uploaded image, generates an imaginative phrase, and then converts it into audio.

	- For image_to_audio following technologies were used:
	- Image Reader:
	- HuggingFace ```image-to-text``` task used with ```Salesforce/blip-image-captioning-base``` pretrained model. Which produces a small description about the image.
	- [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
	- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
	- Generate an imaginative phrase:
	- OpenAI ```GPT-3.5-Turbo``` used to produce an imaginative narrative from the description generated earlier.
	- The phrase generated with more than 40 words.
	- [GPT-3.5 Turbo](https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates)
	- text-to-audio:
	- ```suno/bark-small``` used to generate the audio version of the imaginative narrative earlier.
	- [suno/bark-small](https://huggingface.co/suno/bark-small)
	- BARK: Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying.