metadata
language:
  - en
tags:
  - keyword-extraction
  - text-summarization
  - flan-t5
license: mit
datasets:
  - agentlans/wikipedia-paragraph-keywords
base_model: google/flan-t5-small
library_name: transformers

Keyword Extraction Model

This model is a fine-tuned version of google/flan-t5-small, adapted for extracting keywords from paragraphs. Given a paragraph of text, it generates key phrases that capture the main ideas of the input.

Model Description

The model takes a paragraph as input and generates a list of keywords or key phrases that summarize the main topics and themes of the text. It's particularly useful for:

  • Summarizing long texts
  • Generating tags for articles or blog posts
  • Identifying main themes in documents

Intended Uses & Limitations

Intended Uses:

  • Quick summarization of long paragraphs
  • Generating metadata for content management systems
  • Assisting in SEO keyword identification

Limitations:

  • The model may sometimes generate irrelevant keywords
  • Performance may vary depending on the length and complexity of the input text
    • For best results, use long, clean text
    • Input length is limited to 512 tokens, the Flan-T5 limit; longer inputs must be truncated or split
  • The model is trained on English text and may not perform well on other languages
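
One way to work within the 512-token limit is to split a long text into chunks and extract keywords from each chunk separately. The following is a minimal sketch, not part of the model's own tooling: `split_into_chunks` and `count_tokens` are hypothetical names, and the sentence splitting is deliberately naive.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens still becomes its own chunk,
    so truncation should remain enabled when tokenizing each chunk.
    """
    # Naive split on '.', '!' or '?' followed by whitespace (an assumption;
    # a dedicated sentence tokenizer would be more robust).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Demonstration with a whitespace word count standing in for the tokenizer.
demo = "One two three. Four five six. Seven eight."
print(split_into_chunks(demo, lambda s: len(s.split()), max_tokens=6))
```

With the real tokenizer, `count_tokens` could be `lambda s: len(tokenizer.encode(s))`, so that the budget is measured in the same tokens the model sees.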

Training and Evaluation

The model was fine-tuned on agentlans/wikipedia-paragraph-keywords, a dataset of English Wikipedia paragraphs paired with their keywords. The paragraphs cover a diverse range of topics, which is intended to make the model broadly applicable.

How to Use

Here's a simple example of how to use the model:

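from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "agentlans/flan-t5-small-keywords"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Your paragraph here..."
# Truncate the input so it stays within the 512-token limit
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=512)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The output separates keywords with '||': split, trim whitespace,
# and drop empty strings and duplicates while preserving order
parts = (part.strip() for part in decoded_output.split("||"))
keywords = list(dict.fromkeys(part for part in parts if part))
print(keywords)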

Example input paragraph:

In the heart of the bustling city, a hidden gem awaits discovery: a quaint little bookstore that seems to have escaped the relentless march of time. As you step inside, the scent of aged paper and rich coffee envelops you, creating an inviting atmosphere that beckons you to explore its shelves. Each corner is adorned with carefully curated collections, from classic literature to contemporary bestsellers, inviting readers of all tastes to lose themselves in the pages of a good book. The soft glow of warm lighting casts a cozy ambiance, while the gentle hum of conversation among fellow book lovers adds to the charm. This bookstore is not just a place to buy books; it's a sanctuary for those seeking solace, inspiration, and a sense of community in the fast-paced world outside.

Example output keywords:

['old paper coffee scent', 'cosy hum of conversation', 'quaint bookstore', 'community in the fast-paced world', 'solace inspiration', 'curated collections']
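
For the tagging use case, a keyword list like the one above can be normalized into tag slugs. This is a small sketch of one possible convention, not something the model produces itself: `keywords_to_tags` is a hypothetical helper, and it assumes ASCII-only English tags, so any other character is treated as a separator.

```python
import re
from typing import List


def keywords_to_tags(keywords: List[str]) -> List[str]:
    """Turn key phrases into lowercase, hyphen-separated tag slugs."""
    tags: List[str] = []
    for keyword in keywords:
        # Replace each run of characters outside a-z and 0-9 with one hyphen
        slug = re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")
        # Skip empty slugs and slugs already seen (order is preserved)
        if slug and slug not in tags:
            tags.append(slug)
    return tags


example = ["quaint bookstore", "curated collections", "Quaint Bookstore"]
print(keywords_to_tags(example))
# prints ['quaint-bookstore', 'curated-collections']
```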

Limitations and Bias

This model was trained on English Wikipedia paragraphs, so the keywords it generates may reflect the topical coverage and biases of that source. Review the output before relying on it.

Training Details

  • Training Data: agentlans/wikipedia-paragraph-keywords (English Wikipedia paragraphs paired with keywords)
  • Training Procedure: Fine-tuning of google/flan-t5-small

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10.0
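
To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given optimizer step under a linear decay from 5e-05 to zero. It assumes no warmup steps, a single device, and no gradient accumulation, none of which are stated above; the dataset size in the example is hypothetical, and the function names are illustrative only.

```python
import math


def total_training_steps(num_examples: int, batch_size: int = 8, epochs: int = 10) -> int:
    """Optimizer steps for the whole run (single device, no gradient accumulation)."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05, warmup_steps: int = 0) -> float:
    """Learning rate at `step` for a linear schedule with optional warmup."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup
        return base_lr * (step / max(1, warmup_steps))
    # Linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Hypothetical dataset size, chosen only to illustrate the arithmetic
steps = total_training_steps(num_examples=1000)
for step in (0, steps // 2, steps):
    print(step, linear_lr(step, steps))
```

Under these assumptions the rate starts at 5e-05, falls to half that value midway through training, and reaches zero at the final step.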

Framework versions

  • Transformers 4.45.1
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.1
  • Tokenizers 0.20.0

Ethical Considerations

When using this model, consider the potential impact of automated keyword extraction on content creation and SEO practices. Ensure that the use of this model complies with relevant guidelines and does not contribute to the creation of misleading or spammy content.