---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
  results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
widget:
- text: "You must be proficient in Excel."
- text: "Would you like to join a major manufacturing company?"
---
_Nesta, the UK's innovation agency, has been scraping online job adverts since 2021 and building algorithms to extract and structure information as part of the [Open Jobs Observatory](https://www.nesta.org.uk/project/open-jobs-observatory/) project._
_Although we are unable to share the raw data openly, we aim to open source **our models, algorithms and tools** so that anyone can use them for their own research and analysis._
## 🖊️ Model description
This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint that was further pre-trained on ~3.2M sentences from job postings.
It has been fine-tuned with a classification head to perform binary classification of job advert sentences: each sentence is labelled as a `company description` or not.
The model was trained on **486 manually labelled company description sentences** and **1000 non-company-description sentences less than 250 characters in length.**
It achieves the following results on a held-out test set of 147 sentences:
- Accuracy: 0.92517
| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| not company description | 0.930693 | 0.959184 | 0.944724 | 98 |
| company description | 0.913043 | 0.857143 | 0.884211 | 49 |
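As a quick consistency check, the per-class figures in the table can be reproduced from confusion-matrix counts. The counts below are not stated in this card; they are inferred from the support and recall columns (42 of 49 company description sentences and 94 of 98 other sentences classified correctly), so treat this as a sketch rather than the evaluation code that was actually used.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts, inferred from the support and recall columns of the table.
# A false positive for one class is a false negative for the other.
company = class_metrics(tp=42, fp=4, fn=7)
not_company = class_metrics(tp=94, fp=7, fn=4)

for name, metrics in [("not company description", not_company),
                      ("company description", company)]:
    precision, recall, f1 = (round(m, 6) for m in metrics)
    print(f"{name}: precision={precision} recall={recall} f1={f1}")
```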
The code for training the model is in our [ojd_daps_language_models repo](https://github.com/nestauk/ojd_daps_language_models), a central repository for fine-tuning transformer models on our database of scraped job adverts.
## 🖨️ Use
To load the model and build a classification pipeline:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("nestauk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("nestauk/jobbert-base-cased-compdecs")

# Wrap them in a text-classification pipeline
comp_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
An example use is as follows:
```python
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)
# Output:
# [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```
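To pull company descriptions out of a whole advert, the pipeline can be applied to a list of sentences and the predictions filtered by label. The helper below is a hypothetical sketch, not part of the Nesta codebase: the name `extract_company_descriptions`, the 0.5 threshold and the assumption that `LABEL_1` is the `company description` class (suggested by the example above, but not stated explicitly in this card) are all illustrative.

```python
from typing import Callable, Dict, List


def extract_company_descriptions(
    sentences: List[str],
    classifier: Callable[[List[str]], List[Dict]],
    positive_label: str = "LABEL_1",  # assumption: LABEL_1 == company description
    threshold: float = 0.5,           # illustrative cut-off, not from the card
) -> List[str]:
    """Keep the sentences that the classifier labels as company descriptions."""
    if not sentences:
        return []
    # A text-classification pipeline called on a list of strings returns one
    # {'label': ..., 'score': ...} dict per input sentence, in the same order.
    predictions = classifier(sentences)
    return [
        sentence
        for sentence, pred in zip(sentences, predictions)
        if pred["label"] == positive_label and pred["score"] >= threshold
    ]
```

With the pipeline built earlier, `extract_company_descriptions(sentences, comp_classifier)` would return only the sentences predicted as company descriptions; splitting an advert into sentences is left to the caller.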
The intended use of this model is to extract company descriptions from online job adverts for use in downstream tasks, such as mapping adverts to [Standard Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes.
### ⚖️ Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
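For reference, these settings map onto keyword arguments of the `transformers.TrainingArguments` class roughly as follows. This is a sketch, not the project's actual training script: the `output_dir` value and the helper name `build_training_arguments` are placeholders, and mapping `train_batch_size` to `per_device_train_batch_size` assumes training on a single device.

```python
# Hyperparameters from the list above, keyed by TrainingArguments argument names.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
}


def build_training_arguments(output_dir="jobbert-compdesc-output"):
    """Build a TrainingArguments object; output_dir is a placeholder path."""
    # Imported lazily so the dictionary above can be inspected without
    # transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```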
### ⚖️ Training results
The fine-tuning metrics are as follows:
- eval_loss: 0.462236
- eval_runtime: 0.629300
- eval_samples_per_second: 233.582000
- eval_steps_per_second: 15.890000
- epoch: 10.000000
- perplexity: 1.590000
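The reported perplexity appears to be the exponential of the evaluation loss, and the steps-per-second figure is consistent with batches of 16 over the 147 test sentences. A small check, assuming those relationships hold (the card does not state how the values were computed):

```python
import math

eval_loss = 0.462236
eval_runtime = 0.6293   # seconds
n_test = 147            # size of the held-out test set
eval_batch_size = 16

# Perplexity as exp(loss); an assumed relationship, not documented in the card.
perplexity = math.exp(eval_loss)

# Number of evaluation batches, rounding up for the final partial batch.
eval_steps = math.ceil(n_test / eval_batch_size)
steps_per_second = eval_steps / eval_runtime

print(round(perplexity, 2))        # 1.59
print(eval_steps)                  # 10
print(round(steps_per_second, 2))  # 15.89
```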
### ⚖️ Framework versions
- Transformers 4.32.0
- PyTorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3