---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
  results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
widget:
- text: "You must be proficient in Excel."
- text: "Would you like to join a major manufacturing company?"
---

_Nesta, the UK's innovation agency, has been scraping online job adverts since 2021 and building algorithms to extract and structure information from them as part of the [Open Jobs Observatory](https://www.nesta.org.uk/project/open-jobs-observatory/) project._

_Although we are unable to share the raw data openly, we aim to open-source **our models, algorithms and tools** so that anyone can use them for their own research and analysis._

## 🖊️ Model description

This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint continuously pre-trained on ~3.2M sentences from job postings. It has been fine-tuned with a classification head for binary classification of job advert sentences: `company description` or not.

The model was trained on **486 manually labelled company description sentences** and **1,000 non-company-description sentences**, each fewer than 250 characters long. It achieves the following results on a held-out test set of 147 sentences:

- Accuracy: 0.92157

| Label                   | Precision | Recall   | F1-score | Support |
| ----------------------- | --------- | -------- | -------- | ------- |
| not company description | 0.930693  | 0.959184 | 0.944724 | 98      |
| company description     | 0.913043  | 0.857143 | 0.884211 | 49      |

The code for training the model is in our [ojd_daps_language_models repo](https://github.com/nestauk/ojd_daps_language_models), a central repository for fine-tuning transformer models on our database of scraped job adverts.

## 🖨️ Use

To use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained("ihk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("ihk/jobbert-base-cased-compdecs")

comp_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```

An example use is as follows:

```python
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)
>> [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```

As the example shows, `LABEL_1` corresponds to `company description` and `LABEL_0` to `not company description`.

The intended use of this model is to extract company descriptions from online job adverts for downstream tasks such as mapping them to [Standardised Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes; a worked example is sketched at the end of this card.

### ⚖️ Training hyperparameters

The following hyperparameters were used during training (a sketch of this setup with the Hugging Face `Trainer` API appears at the end of this card):

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10

### ⚖️ Training results

The fine-tuning metrics are as follows:

- eval_loss: 0.4622
- eval_runtime: 0.6293
- eval_samples_per_second: 233.582
- eval_steps_per_second: 15.89
- epoch: 10
- perplexity: 1.59

### ⚖️ Framework versions

- Transformers 4.32.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3
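
### 🖨️ Example: extracting company descriptions

To illustrate the intended use above, here is a minimal sketch that classifies every sentence of a toy job advert and keeps those labelled as company descriptions. The sentence list and the 0.5 score threshold are illustrative assumptions; in practice the sentences would come from a sentence splitter run over the advert text.

```python
from transformers import pipeline

# The tokenizer is resolved from the same Hub repo as the model.
comp_classifier = pipeline("text-classification", model="ihk/jobbert-base-cased-compdecs")

# Toy advert sentences; in practice, the output of a sentence splitter.
sentences = [
    "Would you like to join a major manufacturing company?",
    "You must be proficient in Excel.",
    "We offer a competitive salary and flexible working.",
]

# LABEL_1 corresponds to `company description` (see the example output above);
# the 0.5 score threshold is an illustrative choice, not part of the model.
company_descriptions = [
    sentence
    for sentence, pred in zip(sentences, comp_classifier(sentences))
    if pred["label"] == "LABEL_1" and pred["score"] > 0.5
]
print(company_descriptions)
```

The kept sentences can then be pooled per advert and passed to a downstream step such as SIC code mapping.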
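
### ⚖️ Example: reproducing the fine-tuning setup

The hyperparameters listed above map directly onto the Hugging Face `Trainer` API. The sketch below is an outline under assumptions, not the authors' actual training script (which lives in the linked ojd_daps_language_models repo): the tiny in-memory dataset is a hypothetical stand-in for the labelled sentences, which are not openly released. The listed Adam settings and the linear scheduler match the `TrainingArguments` defaults, so they need no explicit flags.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("jjzha/jobbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "jjzha/jobbert-base-cased", num_labels=2  # binary: company description or not
)

# Hypothetical stand-in for the 486 + 1,000 labelled sentences
# (label 1 = company description), which are not openly released.
data = Dataset.from_dict({
    "text": [
        "Would you like to join a major manufacturing company?",
        "You must be proficient in Excel.",
    ],
    "label": [1, 0],
}).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

training_args = TrainingArguments(
    output_dir="jobbert-base-cased-compdecs",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data,
    eval_dataset=data,  # illustrative only; use a proper held-out split
    tokenizer=tokenizer,
)
trainer.train()
```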