---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
  results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
widget:
- text: "Would you like to join a major manufacturing company?"
- text: "You must be proficient in Excel."
- text: "Meta sells advertising placements for marketers to reach people based on various factors including age, gender, location, interests, and behavior."
---

## 🖊️ Model description

This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint that has been continuously pre-trained on ~3.2M sentences from job postings.

It has been fine-tuned with a classification head to perform binary classification of job advert sentences as either a `company description` or not.

The model was trained on **486 manually labelled company description sentences** and **1,000 non-company-description sentences of fewer than 250 characters**.


It achieves the following results on a held-out test set of 147 sentences:
- Accuracy: 0.92157

| Label                   | Precision | Recall   | F1-score | Support |
| ----------------------- | --------- | -------- | -------- | ------- |
| not company description | 0.930693  | 0.959184 | 0.944724 | 98      |
| company description     | 0.913043  | 0.857143 | 0.884211 | 49      |
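
For reference, per-label metrics in this format can be produced with scikit-learn's `classification_report`; this is a minimal sketch with toy labels, not the original evaluation script:

```python
from sklearn.metrics import classification_report

# Toy labels: 0 = not company description, 1 = company description.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(classification_report(
    y_true,
    y_pred,
    target_names=["not company description", "company description"],
))
```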

## 🖨️ Use

To use the model: 

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hub.
model = AutoModelForSequenceClassification.from_pretrained("ihk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("ihk/jobbert-base-cased-compdecs")

# Wrap them in a text-classification pipeline.
comp_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
An example use is as follows:

```python
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)

>> [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```

The intended use of this model is to extract company descriptions from online job adverts, for use in downstream tasks such as mapping them to [Standardised Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes.
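
As a sketch of this intended use, the classifier can be run over the sentences of an advert and the positive predictions kept. The sentence list, and the assumption that `LABEL_1` corresponds to `company description` (consistent with the example output above), are illustrative rather than part of the original pipeline:

```python
from transformers import pipeline

# Load the classifier directly by model id.
comp_classifier = pipeline("text-classification", model="ihk/jobbert-base-cased-compdecs")

# Sentences from a toy job advert; in practice these would come from a
# sentence splitter run over the raw advert text.
sentences = [
    "Would you like to join a major manufacturing company?",
    "You must be proficient in Excel.",
    "Salary is competitive and dependent on experience.",
]

predictions = comp_classifier(sentences)

# Keep the sentences predicted to be company descriptions.
# Assumption: LABEL_1 corresponds to `company description`.
company_desc = [s for s, p in zip(sentences, predictions) if p["label"] == "LABEL_1"]
print(company_desc)
```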


### ⚖️ Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
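
These settings map onto `transformers.TrainingArguments`; a minimal, hypothetical reconstruction is sketched below (the `output_dir` is a placeholder, and Adam's betas/epsilon and the linear scheduler are the library defaults, so they need no explicit arguments):

```python
from transformers import TrainingArguments

# The hyperparameters listed above, expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="jobbert-base-cased-compdecs",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10,
)
```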

### ⚖️ Training results

The fine-tuning metrics are as follows:
- eval_loss: 0.462236
- eval_runtime: 0.6293
- eval_samples_per_second: 233.582
- eval_steps_per_second: 15.89
- epoch: 10
- perplexity: 1.59

### ⚖️ Framework versions

- Transformers 4.32.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3