Kashmiri Text Generation Model

Model Overview

This is a transformer-based model for generating Kashmiri-language text. It uses a decoder-only architecture with positional encoding and self-attention.

Intended Use

Primary Use: Generating coherent Kashmiri text continuations from given prompts
Intended Users: Researchers and developers working with Kashmiri language processing
Out-of-Scope Uses: Not intended for production deployment without further evaluation

Model Architecture

Type: Decoder-only Transformer
Components:
- Positional Encoding
- Embedding Layer
- Transformer Decoder Layers
- Linear Output Layer
Implementation: PyTorch
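
In a decoder-only model, self-attention is restricted so that each position can only look at itself and earlier positions. The following is a minimal pure-Python sketch of that causal mask, written to produce the same pattern as the generate_square_subsequent_mask function in the Usage section; the name causal_mask is illustrative and not part of the released code.

```python
def causal_mask(size):
    """Build a size x size mask: 0.0 where row i may attend to column j (j <= i),
    negative infinity where attention to a later position is blocked."""
    blocked = float("-inf")
    return [[0.0 if j <= i else blocked for j in range(size)] for i in range(size)]

# Print the mask for a three-word sequence, one row per query position.
for row in causal_mask(3):
    print(row)
```

The mask is added to the attention scores before the softmax, so blocked positions end up with zero attention weight.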

Model Details

Architecture: Custom Transformer Decoder
Vocabulary Size: 36100
Embedding Dimension: 256
Number of Layers: 4
Number of Attention Heads: 8
Sequence Length: 64
Training Data: Kashmiri text corpus
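
As a rough sanity check on model size, the hyperparameters above can be turned into a parameter count. The sketch below is plain arithmetic, not a measurement of the released checkpoint. It assumes the layer layout of PyTorch's nn.TransformerDecoderLayer with its default feed-forward width of 2048 (the model code does not set dim_feedforward), and the function name is illustrative.

```python
def estimate_parameters(vocab_size=36100, embed_dim=256, num_layers=4,
                        dim_feedforward=2048):
    """Rough parameter count. dim_feedforward=2048 is an assumed PyTorch default."""
    embedding = vocab_size * embed_dim
    # One attention block: fused Q/K/V projection plus output projection, with biases.
    attention = (3 * embed_dim * embed_dim + 3 * embed_dim) + (embed_dim * embed_dim + embed_dim)
    # Feed-forward block: two linear layers with biases.
    feed_forward = (embed_dim * dim_feedforward + dim_feedforward) + (dim_feedforward * embed_dim + embed_dim)
    # Three LayerNorms per decoder layer, each with a weight and a bias vector.
    layer_norms = 3 * 2 * embed_dim
    # Each decoder layer holds a self-attention and a cross-attention block.
    per_layer = 2 * attention + feed_forward + layer_norms
    output_layer = embed_dim * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "per_decoder_layer": per_layer,
        "decoder_stack": num_layers * per_layer,
        "output_layer": output_layer,
        "total": embedding + num_layers * per_layer + output_layer,
    }

counts = estimate_parameters()
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions most of the weights sit in the embedding and output layers, because of the 36,100-word vocabulary. The number of attention heads does not change the count, and the sinusoidal positional encoding is a fixed buffer rather than a trained parameter.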

Technical Specifications

Framework: PyTorch
Input: Text prompts in Kashmiri
Output: Generated text continuation
Model Parameters:
- Embedding Dimension: Specified in model_config.json
- Number of Layers: Specified in model_config.json
- Number of Attention Heads: Specified in model_config.json
- Sequence Length: Specified in model_config.json
- Dropout Rate: 0.2
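
The loading code in the Usage section reads five keys from model_config.json: vocab_size, embed_dim, num_layers, num_heads and sequence_length. A minimal sketch of a pre-flight check on such a config follows. The helper name is illustrative, the sample values are taken from Model Details and are assumed to match the file, and the divisibility rule reflects a requirement of PyTorch's multi-head attention rather than anything stated in this card.

```python
import json

REQUIRED_KEYS = ("vocab_size", "embed_dim", "num_layers", "num_heads", "sequence_length")

def validate_config(config):
    """Return a list of problems found in a model config dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    # Multi-head attention splits the embedding evenly across heads.
    if not problems and config["embed_dim"] % config["num_heads"] != 0:
        problems.append("embed_dim must be divisible by num_heads")
    return problems

# Sample config built from the values listed under Model Details (an assumption).
sample = json.loads(
    '{"vocab_size": 36100, "embed_dim": 256, "num_layers": 4, '
    '"num_heads": 8, "sequence_length": 64}'
)
print(validate_config(sample) or "config looks consistent")
```

To check the real file, replace the sample with the result of json.load on model_config.json.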

Files Structure

├── root /
│   ├── model.pt              # Trained model weights
│   ├── word_to_int.json      # Word to integer mapping
│   ├── int_to_word.json      # Integer to word mapping
│   └── model_config.json     # Model configuration

NOTE

Ensure all required files are present in the root directory
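
One way to check this before loading the model is a short script like the sketch below. It only tests that the four file names from the listing above exist in a directory; the function name is illustrative and the check does not inspect file contents.

```python
from pathlib import Path

REQUIRED_FILES = ("model.pt", "word_to_int.json", "int_to_word.json", "model_config.json")

def missing_files(directory="."):
    """Return the required file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```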

Setup in Google Colab

Create a new Google Colab notebook

Copy and paste the following code into a code cell:

!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1

Required Files

The model requires the following files, which are downloaded when the Hugging Face repository is cloned:

model.pt: The trained model weights
model_config.json: Model configuration parameters
word_to_int.json: Vocabulary mapping from words to integers
int_to_word.json: Vocabulary mapping from integers to words

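After cloning, the files sit in /content/Kashur_gpt_version_1/. Run the following in a new code cell to move them into /content/, the notebook's working directory, where the loading code expects them: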

import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")

Usage

1. Import Required Libraries

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os

2. Define and Load the Model

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out

def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']

@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    words = prompt.split()
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]

        output = model(current_seq)
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)

        current_seq = torch.cat([current_seq, next_token], dim=1)
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)

        if next_word == ".":
            break

    return " ".join(words)

if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()

Load the Model

Running the code above loads the model weights and both vocabularies. The model runs on a GPU if one is available and on the CPU otherwise.

3. Generate Text

To generate text, use the following format:

# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"   # Replace with your Kashmiri text prompt

generated_text = generate_text(
    model, 
    prompt, 
    word_to_int, 
    int_to_word,
    sequence_length, 
    max_length=100  # Adjust this value to control output length
)
print(f"Generated text: {generated_text}")

Generation Parameters

You can adjust the following parameters for text generation:

max_length: Maximum number of words to generate (default: 100)
temperature: Controls randomness in generation (default: 1.0)
- Higher values (>1.0) make the output more random and diverse
- Lower values (<1.0) make the output more focused and deterministic
sequence_length: Maximum context window size, read from model_config.json
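
Temperature works by dividing the model's scores (logits) by the chosen value before they are converted to probabilities and sampled, which is what generate_text does. The sketch below shows the effect in plain Python on three made-up scores; the function name and the toy values are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

The probability of the top-scoring word shrinks as the temperature rises, so sampling picks less likely words more often.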

Limitations

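The model uses word-level tokenization (prompts are split on whitespace), so words outside the vocabulary cannot be represented and are mapped to index 0
Context is limited to the sequence length in the configuration (64 words); older words are dropped during generation
Generation stops at the first period (.) or after max_length words, whichever comes first
Output quality may vary with the quality and length of the input prompt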
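
The first two limitations can be seen in how a prompt becomes model input. The sketch below mirrors the preprocessing in generate_text (whitespace split, index 0 for unknown words, only the last sequence_length tokens kept) on a made-up three-word vocabulary. The helper name and the toy vocabulary are illustrative and not part of the released files.

```python
def encode_prompt(prompt, word_to_int, sequence_length):
    """Map a prompt to token ids the way generate_text does, and report unknown words."""
    words = prompt.split()
    unknown = [w for w in words if w not in word_to_int]
    ids = [word_to_int.get(w, 0) for w in words]  # unknown words fall back to index 0
    return ids[-sequence_length:], unknown        # keep only the most recent tokens

toy_vocab = {"a": 1, "b": 2, "c": 3}  # toy vocabulary for illustration only
ids, unknown = encode_prompt("a b x c", toy_vocab, sequence_length=3)
print("token ids:", ids)
print("unknown words:", unknown)
```

Checking a prompt for unknown words this way, with the real word_to_int mapping, shows how much of the input the model can actually use.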

Performance Considerations

Runs on both CPU and CUDA-enabled GPUs
Memory usage scales with sequence length and batch size
Inference speed depends on hardware capabilities and generation parameters

Dependencies

Python 3.6+
PyTorch
math, json, os (Python standard library; no installation needed)

License

[See above card]

Citation

If you use this model in your research, please cite:

@misc{kashmiri_text_gen,
  author = {Haq Nawaz Malik},
  title = {Kashmiri Text Generation Model},
  year = {2024},
  journal = {for Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}

Contact

[Add contact information for model maintainers]

Updates and Maintenance

Version: 1.0
Last Updated: 26-10-2024
An updated version is in progress.