Upload 5 files
Browse files
- README.md +68 -3
- config.json +48 -0
- configuration_rene.py +103 -0
- model.safetensors +3 -0
- rene.py +435 -0
README.md
CHANGED
@@ -1,3 +1,68 @@
---
license: apache-2.0
language:
- en
datasets:
- allenai/dolma
tags:
- rene
- mamba
- cartesia
---

# Model Card for Rene

Rene is a 1.3 billion-parameter language model trained by [Cartesia](https://cartesia.ai).
Rene has a hybrid architecture based on [Mamba-2](https://arxiv.org/abs/2405.21060), with feedforward and sliding-window attention layers interspersed.
It uses the [allenai/OLMo-1B-hf](https://huggingface.co/allenai/OLMo-1B-hf) tokenizer.
Rene was pretrained on 1.5 trillion tokens of the [Dolma-1.7](https://huggingface.co/datasets/allenai/dolma) dataset.
For more details, see our [blog post](https://cartesia.ai/blog/on-device).

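To make the interleaving concrete, here is a minimal sketch (the index lists come from the `config.json` uploaded in this commit; blocks not listed in either index are Mamba-2 blocks):

```python
# Sketch: reconstruct the per-block mixer layout from this repo's config.json.
attn_layer_idx = [6, 18, 30, 42]       # sliding-window attention blocks
mlp_layer_idx = list(range(2, 48, 3))  # feedforward blocks: 2, 5, 8, ..., 47

def mixer_type(layer_idx: int) -> str:
    if layer_idx in attn_layer_idx:
        return "attention"
    if layer_idx in mlp_layer_idx:
        return "mlp"
    return "mamba2"

print([mixer_type(i) for i in range(48)])  # 48 blocks total (n_layer)
```
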
## Usage

### Installation

The Rene model depends on the `cartesia-pytorch` package, which can be installed with `pip` as follows:

```shell
pip install --no-binary :all: cartesia-pytorch
```

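As a quick sanity check that the install succeeded (a generic import check, not a documented command):

```python
# Verify that the package is importable after installation.
import cartesia_pytorch
print(cartesia_pytorch.__name__)
```
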
### Generation example

```python
from cartesia_pytorch import ReneLMHeadModel
from transformers import AutoTokenizer

model = ReneLMHeadModel.from_pretrained("cartesia-ai/Rene-v0.1-1.3b-pytorch").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-hf")
in_message = ["Rene Descartes was"]
inputs = tokenizer(in_message, return_tensors="pt")
outputs = model.generate(inputs.input_ids.cuda(), max_length=50, top_k=100, top_p=0.99)
out_message = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(out_message)
# Example output: "Rene Descartes was a French mathematician, philosopher, and scientist. Descartes is famously credited for creating the Cartesian coordinate system: a 3 dimensional representation of points, vectors, and directions. This work is, for the most part" ...
```

### Evaluation example

You can use our `cartesia_lm_eval` wrapper around the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) to evaluate our model on standard text benchmarks. Example command (clone this repo and run the following from within the `cartesia-pytorch` directory):

```shell
python -m evals.cartesia_lm_eval --model rene_ssm --model_args pretrained=cartesia-ai/Rene-v0.1-1.3b-pytorch,trust_remote_code=True --trust_remote_code --tasks copa,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --cache_requests true --batch_size auto:4 --output_path outputs/rene_evals/
```
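
A note on the flags: `--batch_size auto:4` lets the harness auto-select the largest batch size that fits in memory (re-tuning it up to 4 times over the run), `--cache_requests true` caches tokenized requests across runs, and results are written under the `--output_path` directory.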

## Results on common benchmarks

| Model | Params (B) | Train Tokens (T) | COPA | HellaSwag | MMLU (5-shot) | PIQA | ARC-e | ARC-c | WinoGrande | OpenBookQA | Average |
|------------------------------------------------|------------|------------------|------|-----------|---------------|------|-------|-------|------------|------------|---------|
| allenai/OLMo-1B-hf                              | 1.2        | 3.0              | 82.0 | 62.9      | 26.2          | 75.1 | 57.4  | 31.1  | 60.0       | 36.2       | 53.9    |
| apple/OpenELM-1\_1B                             | 1.1        | 1.5              | 81.0 | 64.8      | 27.1          | 75.6 | 55.4  | 32.3  | 61.9       | 36.2       | 54.3    |
| state-spaces/mamba2-1.3b                        | 1.3        | 0.3              | 82.0 | 60.0      | 25.8          | 73.7 | 64.2  | 33.3  | 61.0       | 37.8       | 54.7    |
| microsoft/phi-1\_5                              | 1.4        | 0.15             | 79.0 | 62.6      | 42.5          | 75.5 | 73.2  | 48.0  | 72.8       | 48.0       | 62.7    |
| Qwen/Qwen2-1.5B                                 | 1.5        | 7.0              | 80.0 | 65.4      | 56.0          | 75.5 | 60.4  | 35.0  | 65.8       | 36.4       | 59.3    |
| RWKV/rwkv-6-world-1b6                           | 1.6        | 1.1              | 84.0 | 58.3      | 25.9          | 73.5 | 56.7  | 34.1  | 60.0       | 37.4       | 53.7    |
| stabilityai/stablelm-2-1\_6b                    | 1.6        | 4.0              | 86.0 | 69.0      | 38.1          | 76.7 | 68.1  | 38.9  | 63.6       | 38.8       | 59.9    |
| HuggingFaceTB/SmolLM-1.7B                       | 1.7        | 1.0              | 76.0 | 65.8      | 29.9          | 76.1 | 73.5  | 46.4  | 60.9       | 42.0       | 58.8    |
| h2oai/h2o-danube2-1.8b-base                     | 1.8        | 3.0              | 82.0 | 72.4      | 39.9          | 77.3 | 69.0  | 39.9  | 63.9       | 41.4       | 60.7    |
| google/recurrentgemma-2b                        | 2.7        | 2.0              | 62.0 | 61.8      | 32.3          | 68.8 | 46.4  | 29.9  | 57.1       | 29.0       | 48.4    |
| cognitivecomputations/TinyDolphin-2.8.1-1.1b    | 1.1        |                  | 71.0 | 59.9      | 25.7          | 73.1 | 55.8  | 33.0  | 59.7       | 36.6       | 51.9    |
| cartesia-ai/Rene-v0.1-1.3b-pytorch (OUR MODEL)  | 1.3        | 1.5              | 82.0 | 69.4      | 32.6          | 77.5 | 61.7  | 34.4  | 62.9       | 39.2       | 57.5    |

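The Average column is the unweighted mean of the eight task scores; a quick check for the Rene row:

```python
# Unweighted mean of the eight benchmark scores reported for Rene above.
rene_scores = [82.0, 69.4, 32.6, 77.5, 61.7, 34.4, 62.9, 39.2]
print(round(sum(rene_scores) / len(rene_scores), 1))  # 57.5
```
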
## Bias, Risks, and Limitations

Rene is a pretrained base model which has not undergone any alignment or instruction tuning, and therefore does not have any moderation or safety guarantees. Users should implement appropriate guardrails and moderation mechanisms based on their particular needs in order to ensure responsible and ethical usage.

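For illustration only, here is one minimal shape such a guardrail could take (a hypothetical keyword filter around the generation example above; `BLOCKED_TERMS`, `moderated_generate`, and the filtering policy are invented for this sketch, not part of the repo):

```python
# Hypothetical sketch only: a naive blocklist filter around generation.
# Real deployments need far more robust input/output moderation than this.
BLOCKED_TERMS = {"example-banned-topic"}  # placeholder policy

def moderated_generate(model, tokenizer, prompt, **generate_kwargs):
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        return "[request declined by input filter]"
    inputs = tokenizer([prompt], return_tensors="pt")
    outputs = model.generate(inputs.input_ids.cuda(), **generate_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```
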
## About Cartesia

At [Cartesia](https://cartesia.ai/), we're building real-time multimodal intelligence for every device.

config.json
ADDED
@@ -0,0 +1,48 @@
{
  "attn_cfg": {
    "causal": true,
    "head_dim": 64,
    "num_heads": 48,
    "out_proj_bias": true,
    "qkv_proj_bias": true,
    "sliding_window_length": 2048
  },
  "attn_layer_idx": [
    6,
    18,
    30,
    42
  ],
  "d_model": 2048,
  "eos_token_id": 50279,
  "mlp_cfg": {},
  "mlp_layer_idx": [
    2,
    5,
    8,
    11,
    14,
    17,
    20,
    23,
    26,
    29,
    32,
    35,
    38,
    41,
    44,
    47
  ],
  "model_type": "rene",
  "n_layer": 48,
  "pad_token_id": 1,
  "pad_vocab_size_multiple": 16,
  "residual_in_fp32": true,
  "rms_norm": true,
  "ssm_cfg": {
    "norm_before_gate": true
  },
  "tie_word_embeddings": true,
  "vocab_size": 50280
}
configuration_rene.py
ADDED
@@ -0,0 +1,103 @@
from typing import Dict, List, Optional

from transformers.configuration_utils import PretrainedConfig


class ReneConfig(PretrainedConfig):
    r"""Configuration class for the Rene model.

    This is the configuration class to store the configuration of a [`ReneLMHeadModel`].
    It is used to instantiate a Rene model according to the specified arguments,
    defining the model architecture. Instantiating a configuration with the defaults will yield
    a similar configuration to that of the Rene-v0.1-1.3b-pytorch model.
    [cartesia-ai/Rene-v0.1-1.3b-pytorch](https://huggingface.co/cartesia-ai/Rene-v0.1-1.3b-pytorch)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        d_model (`int`, *optional*, defaults to 2048):
            Dimension of the hidden representations.
        n_layer (`int`, *optional*, defaults to 48):
            Number of architecture blocks.
        vocab_size (`int`, *optional*, defaults to 50280):
            Vocabulary size of the Rene model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`ReneModel`].
        ssm_cfg (`dict`, *optional*):
            Configuration parameters for the SSM layers.
        attn_layer_idx (`List[int]`, *optional*):
            Indices of the architecture blocks that should have attention layers.
        attn_cfg (`dict`, *optional*):
            Configuration parameters for the attention layers.
        mlp_layer_idx (`List[int]`, *optional*):
            Indices of the architecture blocks that should have MLP layers.
        mlp_cfg (`dict`, *optional*):
            Configuration parameters for the MLP layers.
        rms_norm (`bool`, *optional*, defaults to `True`):
            Whether to use RMSNorm (instead of LayerNorm).
        residual_in_fp32 (`bool`, *optional*, defaults to `True`):
            Whether to keep residual values in fp32.
        pad_vocab_size_multiple (`int`, *optional*, defaults to 16):
            Pad the vocabulary size up to the next multiple of this value.
        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
            Whether the model's input and output word embeddings should be tied. Note that this is only relevant if
            the model has an output word embedding layer.
        pad_token_id (`int`, *optional*, defaults to 1):
            The id of the padding token.
        bos_token_id (`int`, *optional*):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 50279):
            The id of the "end-of-sequence" token.
    """

    model_type = "rene"

    def __init__(
        self,
        d_model: int = 2048,
        n_layer: int = 48,
        vocab_size: int = 50280,
        ssm_cfg: Optional[Dict] = None,
        attn_layer_idx: Optional[List] = None,
        attn_cfg: Optional[Dict] = None,
        mlp_layer_idx: Optional[List] = None,
        mlp_cfg: Optional[Dict] = None,
        rms_norm: bool = True,
        residual_in_fp32: bool = True,
        pad_vocab_size_multiple: int = 16,
        tie_word_embeddings: bool = True,
        pad_token_id=1,
        bos_token_id=None,
        eos_token_id=50279,
        **kwargs,
    ):
        if ssm_cfg is None:
            ssm_cfg = {}
        if attn_layer_idx is None:
            attn_layer_idx = []
        if attn_cfg is None:
            attn_cfg = {}
        if mlp_layer_idx is None:
            mlp_layer_idx = []
        if mlp_cfg is None:
            mlp_cfg = {}

        self.d_model = d_model
        self.n_layer = n_layer
        self.vocab_size = vocab_size
        self.ssm_cfg = ssm_cfg
        self.attn_layer_idx = attn_layer_idx
        self.attn_cfg = attn_cfg
        self.mlp_layer_idx = mlp_layer_idx
        self.mlp_cfg = mlp_cfg
        self.rms_norm = rms_norm
        self.residual_in_fp32 = residual_in_fp32
        self.pad_vocab_size_multiple = pad_vocab_size_multiple
        self.tie_word_embeddings = tie_word_embeddings
        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            pad_token_id=pad_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a62c98beb82cd70e4ff866b3cd479f836f17676a76b82a337a1dde2126673de
size 2866628624
rene.py
ADDED
@@ -0,0 +1,435 @@
from functools import partial

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from flash_attn import flash_attn_with_kvcache
from mamba_ssm.models.mixer_seq_simple import _init_weights
from mamba_ssm.modules.mamba2 import Mamba2
from mamba_ssm.modules.mha import _update_kv_cache
from mamba_ssm.utils.generation import GenerationMixin as MambaGenerationMixin
from transformers.modeling_outputs import CausalLMOutput
from transformers.modeling_utils import PreTrainedModel

from .configuration_rene import ReneConfig


class ReneMLP(nn.Module):
    """One-hidden-layer network with GELU activation.

    Args:
        d_input: Block input dimension.
        d_output: Block output dimension.
        expand: Block expansion factor.
        bias: Use biases in linear layers.
    """

    def __init__(self, d_input, d_output=None, expand=3, bias=True, device=None, dtype=None):
        super().__init__()
        factory_kwargs = {"device": device, "dtype": dtype}
        self.d_input = d_input
        self.d_output = d_input if d_output is None else d_output
        self.d_inner = int(round(expand * d_input))
        self.in_proj = nn.Linear(self.d_input, self.d_inner, bias=bias, **factory_kwargs)
        self.activation = nn.GELU()
        self.out_proj = nn.Linear(self.d_inner, self.d_output, bias=bias, **factory_kwargs)

    def forward(self, x, inference_params=None):
        """Forward pass through the MLP module."""
        y = self.in_proj(x)
        y = self.activation(y)
        y = self.out_proj(y)
        return y

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate inference cache for ReneMLP. (There is nothing to cache for this module.)"""
        return None


class ReneMHA(nn.Module):
    """Multi-head self-attention. Adapted from the mamba_ssm MHA class."""

    def __init__(
        self,
        embed_dim,
        num_heads,
        num_heads_kv=None,
        head_dim=None,  # If None, use embed_dim // num_heads
        qkv_proj_bias=True,
        out_proj_bias=True,
        softmax_scale=None,
        causal=True,
        sliding_window_length=None,  # If None, infinite context
        layer_idx=None,
        device=None,
        dtype=None,
    ) -> None:
        """
        num_heads_kv: can be used to toggle MQA / GQA. If None, use num_heads.
        """
        super().__init__()
        factory_kwargs = {"device": device, "dtype": dtype}
        self.embed_dim = embed_dim
        self.layer_idx = layer_idx
        self.softmax_scale = softmax_scale
        self.causal = causal
        assert self.causal, "Rene does not yet support non-causal modeling"

        self.num_heads = num_heads
        self.num_heads_kv = num_heads_kv if num_heads_kv is not None else num_heads
        assert (
            self.num_heads % self.num_heads_kv == 0
        ), "num_heads must be divisible by num_heads_kv"
        if head_dim is None:
            assert self.embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.head_dim = head_dim if head_dim is not None else self.embed_dim // num_heads
        qkv_dim = self.head_dim * (self.num_heads + 2 * self.num_heads_kv)
        out_dim = self.head_dim * self.num_heads

        self.sliding_window_length = sliding_window_length
        if self.sliding_window_length is None:
            self.window_size = (-1, -1)
        else:
            self.window_size = (self.sliding_window_length - 1, 0)  # for flash_attn

        self.in_proj = nn.Linear(embed_dim, qkv_dim, bias=qkv_proj_bias, **factory_kwargs)
        self.out_proj = nn.Linear(out_dim, embed_dim, bias=out_proj_bias, **factory_kwargs)

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None):
        """Allocate inference cache for the multi-head self-attention module."""
        dtype = self.out_proj.weight.dtype if dtype is None else dtype
        device = self.out_proj.weight.device
        kv_cache = torch.empty(
            batch_size,
            max_seqlen,
            2,
            self.num_heads_kv,
            self.head_dim,
            dtype=dtype,
            device=device,
        )
        return kv_cache, None

    def _pytorch_attn(self, q, kv):
        k, v = kv.unbind(dim=-3)
        k = torch.repeat_interleave(k, dim=2, repeats=self.num_heads // self.num_heads_kv)
        v = torch.repeat_interleave(v, dim=2, repeats=self.num_heads // self.num_heads_kv)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        L, S = q.size(-2), k.size(-2)
        if self.sliding_window_length is not None and S > self.sliding_window_length:
            attn_mask = (
                torch.ones(L, S, dtype=torch.bool)
                .tril(diagonal=0)
                .triu(-self.window_size[0])
                .to(device=q.device)
            )
            # Since we pass in an attn_mask explicitly, we need to pass is_causal=False to
            # `scaled_dot_product_attention` (even though the attn_mask itself is in fact causal).
            is_causal_arg = False
        else:
            # The previous branch would also handle this case correctly, but it is more efficient
            # to omit the attn_mask when we don't need it.
            attn_mask = None
            is_causal_arg = True
        return F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, is_causal=is_causal_arg, scale=self.softmax_scale
        ).transpose(1, 2)

    def _update_kv_cache(self, kv, inference_params):
        """kv: (batch_size, seqlen, 2, nheads, head_dim) or (batch_size, 1, 2, nheads, head_dim)."""
        assert self.layer_idx is not None, "Generation requires layer_idx in the constructor"
        return _update_kv_cache(kv, inference_params, self.layer_idx)

    def _update_kvcache_attention(self, q, kv, inference_params):
        """Write kv to inference_params, then compute attention."""
        if inference_params.seqlen_offset == 0 or flash_attn_with_kvcache is None:
            # TODO: this only uses seqlen_offset and not lengths_per_sample.
            kv = self._update_kv_cache(kv, inference_params)
            return self._pytorch_attn(q, kv)
        else:
            batch = q.shape[0]
            kv_cache, _ = inference_params.key_value_memory_dict[self.layer_idx]
            kv_cache = kv_cache[:batch]
            cache_seqlens = (
                inference_params.lengths_per_sample[:batch]
                if inference_params.lengths_per_sample is not None
                else inference_params.seqlen_offset
            )
            return flash_attn_with_kvcache(
                q,
                kv_cache[:, :, 0],
                kv_cache[:, :, 1],
                kv[:, :, 0],
                kv[:, :, 1],
                cache_seqlens=cache_seqlens,
                softmax_scale=self.softmax_scale,
                causal=self.causal,
                window_size=self.window_size,
            )

    def forward(self, x, inference_params=None):
        """Forward pass through the multi-head self-attention module."""
        if (
            inference_params is not None
            and self.layer_idx not in inference_params.key_value_memory_dict
        ):
            inference_params.key_value_memory_dict[self.layer_idx] = self.allocate_inference_cache(
                x.shape[0], inference_params.max_seqlen, dtype=x.dtype
            )
        qkv = self.in_proj(x)
        q, kv = qkv.split(
            [self.num_heads * self.head_dim, self.num_heads_kv * 2 * self.head_dim], dim=-1
        )
        q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
        kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
        if inference_params is None:
            context = self._pytorch_attn(q, kv)
        else:
            context = self._update_kvcache_attention(q, kv, inference_params)
        context = rearrange(context, "... h d -> ... (h d)")
        out = self.out_proj(context)
        return out


class Block(nn.Module):
    """Simple residual block with normalization that wraps an inner "mixer" module."""

    def __init__(self, dim, mixer_cls, norm_cls=nn.LayerNorm, residual_in_fp32=False):
        """
        dim: The dimension of the input data.
        mixer_cls: The class of the mixer module.
        norm_cls: The class of the normalization module.
        residual_in_fp32: Whether to keep residuals in fp32.
        """
        super().__init__()
        self.residual_in_fp32 = residual_in_fp32
        self.norm = norm_cls(dim)
        self.mixer = mixer_cls(dim)

    def forward(self, x, inference_params=None, **mixer_kwargs):
        """Forward pass through the block."""
        y = self.norm(x.to(dtype=self.norm.weight.dtype))
        y = self.mixer(y, inference_params=inference_params, **mixer_kwargs)

        residual = x
        if self.residual_in_fp32:
            residual = residual.to(torch.float32)
        y = y + residual
        y = y.to(dtype=x.dtype)

        return y

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate inference cache for the mixer module."""
        return self.mixer.allocate_inference_cache(batch_size, max_seqlen, dtype=dtype, **kwargs)


def _create_block(
    d_model,
    norm_cls,
    ssm_cfg=None,
    attn_layer_idx=None,
    attn_cfg=None,
    mlp_layer_idx=None,
    mlp_cfg=None,
    residual_in_fp32=False,
    layer_idx=None,
    device=None,
    dtype=None,
):
    factory_kwargs = {"device": device, "dtype": dtype}
    if ssm_cfg is None:
        ssm_cfg = {}
    if attn_layer_idx is None:
        attn_layer_idx = []
    if attn_cfg is None:
        attn_cfg = {}
    if mlp_layer_idx is None:
        mlp_layer_idx = []
    if mlp_cfg is None:
        mlp_cfg = {}
    if layer_idx in attn_layer_idx:
        mixer_cls = partial(ReneMHA, layer_idx=layer_idx, **attn_cfg, **factory_kwargs)
    elif layer_idx in mlp_layer_idx:
        mixer_cls = partial(ReneMLP, **mlp_cfg, **factory_kwargs)
    else:
        mixer_cls = partial(Mamba2, layer_idx=layer_idx, **ssm_cfg, **factory_kwargs)
    return Block(d_model, mixer_cls, norm_cls=norm_cls, residual_in_fp32=residual_in_fp32)


class MixerModel(nn.Module):
    """Adapted from mamba_ssm.models.mixer_seq_simple.MixerModel."""

    def __init__(
        self,
        d_model: int,
        n_layer: int,
        vocab_size: int,
        ssm_cfg=None,
        attn_layer_idx=None,
        attn_cfg=None,
        mlp_layer_idx=None,
        mlp_cfg=None,
        norm_epsilon: float = 1e-5,
        rms_norm: bool = False,
        initializer_cfg=None,
        residual_in_fp32=False,
        device=None,
        dtype=None,
    ) -> None:
        super().__init__()
        factory_kwargs = {"device": device, "dtype": dtype}
        self.residual_in_fp32 = residual_in_fp32

        if rms_norm:
            from mamba_ssm.ops.triton.layer_norm import RMSNorm as norm_cls_base
        else:
            norm_cls_base = nn.LayerNorm
        norm_cls = partial(norm_cls_base, eps=norm_epsilon, **factory_kwargs)

        self.embedding = nn.Embedding(vocab_size, d_model, **factory_kwargs)

        self.layers = nn.ModuleList(
            [
                _create_block(
                    d_model,
                    norm_cls=norm_cls,
                    ssm_cfg=ssm_cfg,
                    attn_layer_idx=attn_layer_idx,
                    attn_cfg=attn_cfg,
                    mlp_layer_idx=mlp_layer_idx,
                    mlp_cfg=mlp_cfg,
                    residual_in_fp32=residual_in_fp32,
                    layer_idx=i,
                    **factory_kwargs,
                )
                for i in range(n_layer)
            ]
        )

        self.norm_f = norm_cls(d_model)

        self.apply(
            partial(
                _init_weights,
                n_layer=n_layer,
                **(initializer_cfg if initializer_cfg is not None else {}),
                n_residuals_per_layer=1,
            )
        )

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate inference cache for all layers."""
        return {
            i: layer.allocate_inference_cache(batch_size, max_seqlen, dtype=dtype, **kwargs)
            for i, layer in enumerate(self.layers)
        }

    def forward(self, input_ids, inference_params=None, **mixer_kwargs):
        """Forward pass through the model."""
        hidden_states = self.embedding(input_ids)
        for layer in self.layers:
            hidden_states = layer(hidden_states, inference_params=inference_params, **mixer_kwargs)
        hidden_states = self.norm_f(hidden_states.to(dtype=self.norm_f.weight.dtype))
        return hidden_states


class ReneLMHeadModel(PreTrainedModel, MambaGenerationMixin):
    """
    Rene language model architecture.
    Based on mamba_ssm.models.mixer_seq_simple.MambaLMHeadModel, with several adaptations.
    """

    config_class = ReneConfig
    base_model_prefix = "backbone"
    _no_split_modules = ["Block", "Mamba2"]
    supports_gradient_checkpointing = True
    _is_stateful = True
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(
        self,
        config: ReneConfig,
        initializer_cfg=None,
        device=None,
        dtype=None,
    ) -> None:
        super().__init__(config)
        d_model = config.d_model
        n_layer = config.n_layer
        vocab_size = config.vocab_size
        ssm_cfg = config.ssm_cfg
        attn_layer_idx = config.attn_layer_idx
        attn_cfg = config.attn_cfg
        mlp_layer_idx = config.mlp_layer_idx
        mlp_cfg = config.mlp_cfg
        rms_norm = config.rms_norm
        residual_in_fp32 = config.residual_in_fp32
        pad_vocab_size_multiple = config.pad_vocab_size_multiple
        factory_kwargs = {"device": device, "dtype": dtype}

        if set(attn_layer_idx).intersection(mlp_layer_idx):
            raise ValueError(f"Conflicting {attn_layer_idx=} and {mlp_layer_idx=}")

        if vocab_size % pad_vocab_size_multiple != 0:
            vocab_size += pad_vocab_size_multiple - (vocab_size % pad_vocab_size_multiple)

        self.backbone = MixerModel(
            d_model=d_model,
            n_layer=n_layer,
            vocab_size=vocab_size,
            ssm_cfg=ssm_cfg,
            attn_layer_idx=attn_layer_idx,
            attn_cfg=attn_cfg,
            mlp_layer_idx=mlp_layer_idx,
            mlp_cfg=mlp_cfg,
            rms_norm=rms_norm,
            initializer_cfg=initializer_cfg,
            residual_in_fp32=residual_in_fp32,
            **factory_kwargs,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False, **factory_kwargs)

        # Initialize weights
        self.apply(
            partial(
                _init_weights,
                n_layer=n_layer,
                **(initializer_cfg if initializer_cfg is not None else {}),
            )
        )
        self.tie_weights()

    def tie_weights(self):
        """Tie embedding and softmax layer weights if specified by config."""
        if self.config.tie_word_embeddings:
            self.lm_head.weight = self.backbone.embedding.weight

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate inference cache."""
        return self.backbone.allocate_inference_cache(batch_size, max_seqlen, dtype=dtype, **kwargs)

    def forward(
        self, input_ids, position_ids=None, inference_params=None, num_last_tokens=0, **mixer_kwargs
    ):
        """
        "position_ids" is just to be compatible with Transformer generation. We don't use it.
        num_last_tokens: if > 0, only return the logits for the last n tokens.
        """
        hidden_states = self.backbone(input_ids, inference_params=inference_params, **mixer_kwargs)
        if num_last_tokens > 0:
            hidden_states = hidden_states[:, -num_last_tokens:]
        lm_logits = self.lm_head(hidden_states)

        return CausalLMOutput(logits=lm_logits)

    def generate(self, *args, **kwargs):
        """
        Calls the custom `generate` method from `mamba_ssm.utils.generation.GenerationMixin`.
        Refer to that method for argument names and defaults.
        """
        return MambaGenerationMixin.generate(self, *args, **kwargs)