---
language:
  - en
tags:
  - keyword-extraction
  - text-summarization
  - flan-t5
license: mit
datasets:
  - agentlans/wikipedia-paragraph-keywords
base_model: google/flan-t5-small
library_name: transformers
---

# Keyword Extraction Model

This model is a fine-tuned version of [Flan-T5 small](https://huggingface.co/google/flan-t5-small), adapted for extracting keywords from paragraphs. Given an input passage, it identifies and outputs key phrases that capture the main topics of the text.

## Model Description

The model takes a paragraph as input and generates a list of keywords or key phrases that summarize the main topics and themes of the text. It's particularly useful for:

- Summarizing long texts
- Generating tags for articles or blog posts
- Identifying main themes in documents

## Intended Uses & Limitations

**Intended Uses:**
- Quick summarization of long paragraphs
- Generating metadata for content management systems
- Assisting in SEO keyword identification

**Limitations:**
- The model may sometimes generate irrelevant keywords
- Performance may vary depending on the length and complexity of the input text
  - For best results, use long, clean texts
  - Input length is limited to 512 tokens by the Flan-T5 architecture (see the truncation sketch after this list)
- The model is trained on English text and may not perform well on other languages
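
As a minimal sketch of working within the 512-token limit, the snippet below (using the same `agentlans/flan-t5-small-keywords` checkpoint shown in the How to Use section) checks the input length and truncates explicitly; longer documents can instead be split into paragraph-sized chunks before extraction:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "agentlans/flan-t5-small-keywords"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_text = "A very long document ..."  # placeholder input

# Check how many tokens the input produces; anything past 512 is cut off below.
n_tokens = len(tokenizer(long_text)["input_ids"])
print(f"Input length: {n_tokens} tokens")

# Truncate explicitly so text beyond the 512-token limit is not passed silently.
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```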

## Training and Evaluation

The model was fine-tuned on [agentlans/wikipedia-paragraph-keywords](https://huggingface.co/datasets/agentlans/wikipedia-paragraph-keywords), a dataset of English Wikipedia paragraphs paired with their keywords, which covers a diverse range of topics to ensure broad applicability.

## How to Use

Here's a simple example of how to use the model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "agentlans/flan-t5-small-keywords"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Your paragraph here..."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The model separates keywords with '||'; strip whitespace and remove
# duplicates while preserving their original order
keywords = list(dict.fromkeys(k.strip() for k in decoded_output.split('||')))
print(keywords)
```

Example input paragraph:

```
In the heart of the bustling city, a hidden gem awaits discovery: a quaint little bookstore that seems to have escaped the relentless march of time. As you step inside, the scent of aged paper and rich coffee envelops you, creating an inviting atmosphere that beckons you to explore its shelves. Each corner is adorned with carefully curated collections, from classic literature to contemporary bestsellers, inviting readers of all tastes to lose themselves in the pages of a good book. The soft glow of warm lighting casts a cozy ambiance, while the gentle hum of conversation among fellow book lovers adds to the charm. This bookstore is not just a place to buy books; it's a sanctuary for those seeking solace, inspiration, and a sense of community in the fast-paced world outside.
```

Example output keywords:

`['old paper coffee scent', 'cosy hum of conversation', 'quaint bookstore', 'community in the fast-paced world', 'solace inspiration', 'curated collections']`
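
To extract keywords from many paragraphs at once, here is a minimal batching sketch; the helper name `extract_keywords` and the batch size are illustrative assumptions, not part of this repository:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "agentlans/flan-t5-small-keywords"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def extract_keywords(paragraphs, batch_size=8):
    """Return one list of keywords per input paragraph."""
    results = []
    for start in range(0, len(paragraphs), batch_size):
        batch = paragraphs[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        outputs = model.generate(**inputs, max_length=512)
        for decoded in tokenizer.batch_decode(outputs, skip_special_tokens=True):
            # Keywords are separated by '||'; strip whitespace and drop
            # duplicates while keeping their original order.
            keywords = [k.strip() for k in decoded.split("||")]
            results.append(list(dict.fromkeys(keywords)))
    return results

print(extract_keywords(["Your first paragraph...", "Your second paragraph..."]))
```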

## Limitations and Bias

This model has been trained on English Wikipedia paragraphs, which may introduce biases. Users should be aware that the keywords generated might reflect these biases and should use the output judiciously.

## Training Details

- **Training Data:** [agentlans/wikipedia-paragraph-keywords](https://huggingface.co/datasets/agentlans/wikipedia-paragraph-keywords), English Wikipedia paragraphs paired with keywords
- **Training Procedure:** Fine-tuning of google/flan-t5-small

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10.0
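
As a rough sketch, these settings correspond to `Seq2SeqTrainingArguments` along the following lines; the output directory is an illustrative assumption and the exact training script is not reproduced here (the Adam betas and epsilon above match the Trainer defaults):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-keywords",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10.0,
)
```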

### Framework versions

- Transformers 4.45.1
- Pytorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.20.0

## Ethical Considerations

When using this model, consider the potential impact of automated keyword extraction on content creation and SEO practices. Ensure that the use of this model complies with relevant guidelines and does not contribute to the creation of misleading or spammy content.