Supported Models
Given the fast-paced nature of the open ML ecosystem, the Inference API exposes models that have large community interest and are in active use (based on recent likes, downloads, and usage). Because of this, deployed models can be swapped without prior notice. The Hugging Face stack aims to keep all the latest popular models warm and ready to use.
You can find:
- Warm models: models ready to be used.
- Cold models: models that are not loaded but can be used.
- Frozen models: models that currently can’t be run with the API.
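In practice, the difference between warm and cold models shows up as a loading delay on the first request. The sketch below retries while a model is loading. It assumes that the API answers a request to a cold model with HTTP 503 and a JSON body containing an `estimated_time` field (in seconds); verify this against the current API behavior. `query_with_retry` and the `send` callable are hypothetical names used for illustration, not part of any Hugging Face library.

```python
import time
from typing import Any, Callable, Tuple

# Hypothetical transport: takes a JSON-serialisable payload and
# returns (status_code, parsed_json_body).
Send = Callable[[dict], Tuple[int, Any]]


def query_with_retry(
    send: Send,
    payload: dict,
    max_attempts: int = 5,
    max_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send` until the model answers or the attempts run out.

    Assumption: a cold model replies with status 503 and a body such as
    {"error": "...", "estimated_time": 20.0} while it is being loaded.
    """
    for attempt in range(max_attempts):
        status, body = send(payload)
        if status == 200:
            return body
        if status == 503 and attempt < max_attempts - 1:
            # Wait for the estimated loading time, capped at max_wait.
            wait = min(float(body.get("estimated_time", 1.0)), max_wait)
            sleep(wait)
            continue
        raise RuntimeError(f"request failed with status {status}: {body}")
    raise RuntimeError("no attempts were made")
```

Passing `send` and `sleep` as parameters keeps the retry logic independent of any particular HTTP client, so it can be exercised without network access.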
What do I get with a PRO subscription?
In addition to the thousands of public models available on the Hub, PRO and Enterprise users get higher rate limits and free access to the following models:
| Model | Size | Supported Context Length | Use |
|---|---|---|---|
| Meta Llama 3.1 Instruct | 8B, 70B | 70B: 32k tokens / 8B: 8k tokens | High quality multilingual chat model with large context length |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Meta Llama Guard 3 | 8B | 4k tokens | |
| Llama 2 Chat | 7B, 13B, 70B | 4k tokens | One of the best conversational models |
| DeepSeek Coder v2 | 236B | 16k tokens | A model with coding capabilities |
| Bark | 0.9B | - | Text to audio generation |
This list is not exhaustive and might be updated in the future.
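As a rough illustration of how a request is tied to a PRO or Enterprise account, the sketch below builds an authenticated request with the standard library only. The base URL, the `{"inputs": ...}` payload shape, and the model ID are assumptions to check against the current API documentation; `hf_xxx` is a placeholder token and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumed base URL of the serverless Inference API; confirm in the current docs.
API_BASE = "https://api-inference.huggingface.co/models"


def build_request(model_id: str, token: str, inputs: str) -> urllib.request.Request:
    """Build a POST request that authenticates with a user access token."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{model_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # token of the subscribed account
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder token and an assumed model ID, for illustration only.
req = build_request("meta-llama/Meta-Llama-3-8B-Instruct", "hf_xxx", "Hello")
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```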
Running Private Models
The free Serverless API is designed to run popular public models. If you have a private model, you can use Inference Endpoints to deploy it.