Supported Models

Given the fast-paced nature of the open ML ecosystem, the Inference API exposes models that have large community interest and are in active use (based on recent likes, downloads, and usage). Because of this, deployed models can be swapped without prior notice. The Hugging Face stack aims to keep all the latest popular models warm and ready to use.

You can find:

Warm models: models ready to be used.
Cold models: models that are not loaded but can be used.
Frozen models: models that currently can’t be run with the API.
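
The three categories above can be made concrete with a small helper. This is a minimal sketch, not the API's own logic: it assumes a status payload with a boolean `loaded` field and a string `state` field (values such as "Loaded" or "Loadable"), modeled on the shape of a model-status response. The function name `classify_model_status` and the mapping of any other state to "frozen" are illustrative assumptions.

```python
from typing import Any, Mapping


def classify_model_status(status: Mapping[str, Any]) -> str:
    """Map a status payload to 'warm', 'cold', or 'frozen'.

    Assumed payload shape (hypothetical, for illustration only):
        {"loaded": bool, "state": "Loaded" | "Loadable" | <anything else>}
    """
    # Warm: the model is already loaded and ready to serve requests.
    if status.get("loaded") is True or status.get("state") == "Loaded":
        return "warm"
    # Cold: not loaded right now, but it can still be used.
    if status.get("state") == "Loadable":
        return "cold"
    # Frozen: anything else is treated as not runnable with the API.
    return "frozen"


if __name__ == "__main__":
    print(classify_model_status({"loaded": True, "state": "Loaded"}))
    print(classify_model_status({"loaded": False, "state": "Loadable"}))
    print(classify_model_status({"loaded": False, "state": "TooBig"}))
```

Keeping the classification separate from any network call makes the helper easy to test, and the mapping can be adjusted if the real payload differs from the shape assumed here.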

What do I get with a PRO subscription?

In addition to the thousands of public models available on the Hub, PRO and Enterprise users get higher rate limits and free access to the following models:

Model	Size	Supported Context Length	Use
Meta Llama 3.1 Instruct	8B, 70B	70B: 32k tokens / 8B: 8k tokens	High quality multilingual chat model with large context length
Meta Llama 3 Instruct	8B, 70B	8k tokens	One of the best chat models
Meta Llama Guard 3	8B	4k tokens	-
Llama 2 Chat	7B, 13B, 70B	4k tokens	One of the best conversational models
DeepSeek Coder v2	236B	16k tokens	A model with coding capabilities
Bark	0.9B	-	Text to audio generation

This list is not exhaustive and might be updated in the future.

Running Private Models

The free Serverless API is designed to run popular public models. If you have a private model, you can use Inference Endpoints to deploy it.
