Kashmiri Text Generation Model
Model Overview
This is a transformer-based model for generating Kashmiri-language text. It uses a decoder-only architecture with sinusoidal positional encoding and causally masked self-attention.
Intended Use
- Primary Use: Generating coherent Kashmiri text continuations from given prompts
- Intended Users: Researchers and developers working with Kashmiri language processing
- Out-of-Scope Uses: Not intended for production deployment without further evaluation
Model Architecture
- Type: Decoder-only Transformer
- Components:
- Positional Encoding
- Embedding Layer
- Transformer Decoder Layers
- Linear Output Layer
- Implementation: PyTorch
Model Details
- Architecture: Custom Transformer Decoder
- Vocabulary Size: 36100
- Embedding Dimension: 256
- Number of Layers: 4
- Number of Attention Heads: 8
- Sequence Length: 64
- Training Data: Kashmiri text corpus
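For a rough sense of model size, the hyperparameters above can be turned into a parameter count. The sketch below is plain arithmetic and assumes the layer layout of the `TextGen` class in the Usage section, together with PyTorch's default `dim_feedforward=2048` for `nn.TransformerDecoderLayer`. The function name `count_parameters` is illustrative, and the result is an estimate, not a figure reported by the author.

```python
def count_parameters(vocab_size, embed_dim, num_layers, dim_feedforward=2048):
    """Estimate trainable parameters for an embedding + decoder stack + output layer."""
    d = embed_dim
    embedding = vocab_size * d
    # One attention block: packed Q/K/V projection plus output projection, with biases.
    attention = (3 * d * d + 3 * d) + (d * d + d)
    # Position-wise feed-forward network: two linear layers with biases.
    feed_forward = (d * dim_feedforward + dim_feedforward) + (dim_feedforward * d + d)
    # Three LayerNorms per decoder layer, each with a weight and a bias vector.
    layer_norms = 3 * 2 * d
    # A decoder layer holds both a self-attention and a cross-attention block.
    decoder_layer = 2 * attention + feed_forward + layer_norms
    output_layer = d * vocab_size + vocab_size
    return embedding + num_layers * decoder_layer + output_layer

total = count_parameters(vocab_size=36100, embed_dim=256, num_layers=4)
print(f"Estimated parameters: {total:,}")  # Estimated parameters: 24,834,308
```

Most of the estimate comes from the embedding and output layers, which both scale with the 36100-word vocabulary. The number of attention heads does not change the count, and the positional encoding is a fixed buffer rather than a trained parameter. Because `TextGen` also stores the template `decoder_layer` as an attribute, a count taken directly from the loaded model may come out one decoder layer higher than this estimate.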
Technical Specifications
- Framework: PyTorch
- Input: Text prompts in Kashmiri
- Output: Generated text continuation
- Model Parameters:
- Embedding Dimension: Specified in model_config.json
- Number of Layers: Specified in model_config.json
- Number of Attention Heads: Specified in model_config.json
- Sequence Length: Specified in model_config.json
- Dropout Rate: 0.2 (applied before the linear output layer)
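The loading code in the Usage section reads five keys from `model_config.json`: `vocab_size`, `embed_dim`, `num_layers`, `num_heads`, and `sequence_length`. As a minimal sketch, the check below validates a config dictionary before a model is built. The example values are taken from the Model Details list and are an assumption about the file's contents; `validate_config` is a hypothetical helper, not part of the repository.

```python
import json

REQUIRED_KEYS = ("vocab_size", "embed_dim", "num_layers", "num_heads", "sequence_length")

def validate_config(config):
    """Return a list of problems found in a model config dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems
    for key in REQUIRED_KEYS:
        if not isinstance(config[key], int) or config[key] <= 0:
            problems.append(f"{key} must be a positive integer")
    if not problems and config["embed_dim"] % config["num_heads"] != 0:
        # Multi-head attention splits the embedding evenly across heads.
        problems.append("embed_dim must be divisible by num_heads")
    if not problems and config["embed_dim"] % 2 != 0:
        # The sinusoidal positional encoding fills sine and cosine columns in pairs.
        problems.append("embed_dim must be even")
    return problems

# Assumed contents, mirroring the Model Details list.
example = json.loads(
    '{"vocab_size": 36100, "embed_dim": 256, "num_layers": 4, '
    '"num_heads": 8, "sequence_length": 64}'
)
print(validate_config(example))  # []
```

Running a check like this before `load_model()` turns a missing or mistyped key into a readable message instead of a `KeyError` or a shape mismatch when the weights are loaded.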
Files Structure
├── root /
│ ├── model.pt # Trained model weights
│ ├── word_to_int.json # Word to integer mapping
│ ├── int_to_word.json # Integer to word mapping
│ └── model_config.json # Model configuration
NOTE
- Ensure all required files are present in the root directory
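One way to act on this note is to check for the four files before loading anything. The snippet below is a small sketch that uses the file names from the tree above; `missing_files` is a hypothetical helper, and the default directory (`.`) assumes the code runs from the folder that holds the files.

```python
import os

REQUIRED_FILES = ("model.pt", "word_to_int.json", "int_to_word.json", "model_config.json")

def missing_files(directory="."):
    """Return the required files that are not present in `directory`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```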
Setup in Google Colab
- Create a new Google Colab notebook
- Copy and paste the following code into a code cell:
!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
Required Files
The model requires the following files, which are downloaded from the Hugging Face repository when it is cloned:
- model.pt: The trained model weights
- model_config.json: Model configuration parameters
- word_to_int.json: Vocabulary mapping from words to integers
- int_to_word.json: Vocabulary mapping from integers to words
NOTE
- The loading code reads these files from the current working directory. In Colab, move them from the cloned folder into /content/ with the following code:
import os
import shutil
# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"
# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)
print(f"All files from {source_path} moved to {destination_path}")
Usage
1. Import Required Libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os
2. Device configuration
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def generate_square_subsequent_mask(sz):
    # Causal mask: 0.0 where attention is allowed, -inf for future positions
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute sinusoidal encodings for every position up to max_len
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        # Stored as a buffer: saved with the model but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        # The input sequence serves as both target and memory, each under the causal mask
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out

def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)
    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)
    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)
    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()
    return model, word_to_int, int_to_word, config['sequence_length']

@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    # Word-level tokenization: split on whitespace; unknown words map to index 0
    words = prompt.split()
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)
    for _ in range(max_length):
        # Keep only the most recent sequence_length tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]
        output = model(current_seq)
        # Sample the next word from the temperature-scaled distribution
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)
        current_seq = torch.cat([current_seq, next_token], dim=1)
        # Keys in int_to_word.json are strings
        next_word = int_to_word.get(str(next_token.item()), "<UNK>")
        words.append(next_word)
        # Stop at the first period
        if next_word == ".":
            break
    return " ".join(words)

if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()
Load the Model
Running the code above loads the model automatically. It runs on the GPU if one is available and on the CPU otherwise.
3. Generate Text
To generate text, use the following format:
# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"  # Replace with your Kashmiri text prompt
generated_text = generate_text(
    model,
    prompt,
    word_to_int,
    int_to_word,
    sequence_length,
    max_length=100  # Adjust this value to control output length
)
print(f"Generated text: {generated_text}")
Parameters
You can adjust the following parameters for text generation:
- max_length: Maximum number of words to generate (default: 100)
- temperature: Controls randomness in generation (default: 1.0)
- Higher values (>1.0) make the output more random
- Lower values (<1.0) make the output more focused and deterministic
Generation Parameters
- Temperature: Controls randomness in generation (default: 1.0)
- Higher values (>1.0) result in more diverse outputs
- Lower values (<1.0) make the output more deterministic
- Max Length: Maximum number of tokens to generate (default: 100)
- Sequence Length: Maximum context window size (specified in config)
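The effect of temperature can be seen on a toy example. The sketch below reproduces, in plain Python, the scaling step that `generate_text` applies before sampling (logits divided by the temperature, then a softmax). The logits are made-up values for three candidate words, and `softmax_with_temperature` is an illustrative name that does not appear in the model code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.5 the highest-scoring word takes a larger share of the probability mass than at 1.0, and at 2.0 the three probabilities move closer together. This is why low temperatures give more repeatable text and high temperatures give more varied text.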
Limitations
- The model operates at word-level tokenization
- Limited by the maximum sequence length specified in the configuration
- Generation stops at the first period (.) encountered
- Performance may vary based on input prompt quality and length
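Two of these limitations can be checked before generating: words missing from the vocabulary, and prompts longer than the context window. The sketch below mirrors the prompt handling in `generate_text` (whitespace splitting, index 0 for unknown words, keeping only the last `sequence_length` tokens). `encode_prompt` is a hypothetical helper, and the toy vocabulary stands in for the real mapping in `word_to_int.json`.

```python
def encode_prompt(prompt, word_to_int, sequence_length, unk_id=0):
    """Split a prompt on whitespace, map words to ids, and keep the last window."""
    words = prompt.split()
    unknown = [w for w in words if w not in word_to_int]
    ids = [word_to_int.get(w, unk_id) for w in words]
    # Only the most recent `sequence_length` ids fit in the context window.
    return ids[-sequence_length:], unknown

# Toy vocabulary for illustration only; the real mapping is in word_to_int.json.
toy_vocab = {"a": 1, "b": 2, "c": 3}
ids, unknown = encode_prompt("a b x c", toy_vocab, sequence_length=3)
print(ids)      # [2, 0, 3]
print(unknown)  # ['x']
```

A prompt with many unknown words reaches the model mostly as index 0, so the continuation is likely to have little connection to the prompt. Inspecting the `unknown` list first can help explain poor output.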
Performance Considerations
- Runs on both CPU and CUDA-enabled GPUs
- Memory usage scales with sequence length and batch size
- Inference speed depends on hardware capabilities and generation parameters
Dependencies
- Python 3.6+
- PyTorch
- Python standard library modules: math, json, os
License
[See above card]
Citation
If you use this model in your research, please cite:
@misc{kashmiri_text_gen,
  author = {Haq Nawaz Malik},
  title = {Kashmiri Text Generation Model},
  year = {2024},
  journal = {for Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}
Contact
[Add contact information for model maintainers]
Updates and Maintenance
- Version: 1.0
- Last Updated: 26-10-2024
- Status: An updated version is in progress