
	---
	base_model: meta-llama/Meta-Llama-3.1-8B
	library_name: peft
	license: llama3.1
	tags:
	- axolotl
	- generated_from_trainer
	model-index:
	- name: llama-3.1-8b-ocr-correction
	  results: []
	datasets:
	- pbevan11/synthetic-ocr-correction-gpt4o
	---


	[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
	<details><summary>See axolotl config</summary>

	axolotl version: `0.4.1`
	```yaml
	base_model: meta-llama/Meta-Llama-3.1-8B
	model_type: AutoModelForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: true
	strict: false

	lora_fan_in_fan_out: false
	data_seed: 49
	seed: 49

	datasets:
	- path: ft_data/alpaca_data.jsonl
	  type: alpaca
	dataset_prepared_path: last_run_prepared
	val_set_size: 0.05
	output_dir: ./qlora-alpaca-out
	hub_model_id: pbevan11/llama-3.1-8b-ocr-correction

	adapter: qlora
	lora_model_dir:

	sequence_len: 8192
	sample_packing: true
	pad_to_sequence_len: true

	lora_r: 32
	lora_alpha: 16
	lora_dropout: 0.05
	lora_target_linear: true
	lora_fan_in_fan_out:
	lora_target_modules:
	- gate_proj
	- down_proj
	- up_proj
	- q_proj
	- v_proj
	- k_proj
	- o_proj

	wandb_project: ocr-ft
	wandb_entity: sncds
	wandb_name: llama31

	gradient_accumulation_steps: 4
	micro_batch_size: 2 # was 16
	eval_batch_size: 2 # was 16
	num_epochs: 2
	optimizer: paged_adamw_32bit
	lr_scheduler: cosine
	learning_rate: 0.0002

	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: false

	gradient_checkpointing: true
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	loss_watchdog_threshold: 5.0
	loss_watchdog_patience: 3

	warmup_steps: 10
	evals_per_epoch: 4
	eval_table_size:
	saves_per_epoch: 1
	debug:
	deepspeed:
	weight_decay: 0.0
	fsdp:
	fsdp_config:
	special_tokens:
	  pad_token: "<|end_of_text|>"
	```

	</details><br>

	# llama-3.1-8b-ocr-correction

	This model is a QLoRA adapter fine-tuned from [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) on the [pbevan11/synthetic-ocr-correction-gpt4o](https://huggingface.co/datasets/pbevan11/synthetic-ocr-correction-gpt4o) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.1901
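
For a rough sense of the adapter's size: with the LoRA settings from the config (`lora_r: 32`, `lora_alpha: 16`, all seven linear projections targeted), each adapted weight of shape `(out, in)` gains `r * (in + out)` trainable parameters. The sketch below works through that arithmetic. The layer dimensions are the published Llama 3.1 8B shapes and are not stated in this card, so treat them as assumptions.

```python
# Rough count of trainable LoRA parameters for the config in this card.
# Assumed (not stated in this card): Llama 3.1 8B layer shapes.
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads * 128 head dim (grouped-query attention)
NUM_LAYERS = 32

LORA_R = 32            # lora_r from the config
LORA_ALPHA = 16        # lora_alpha from the config

# (in_features, out_features) for each targeted projection in one decoder layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in) and B (out x r) next to the frozen weight
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGETS.values())
total_trainable = per_layer * NUM_LAYERS
lora_scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling (alpha / r): {lora_scaling}")
```

Under these assumptions the adapter trains roughly 84 million parameters, about 1% of the 8B base model.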

	## Usage

	First, download and load the model:

	```python
	from peft import AutoPeftModelForCausalLM
	from transformers import AutoTokenizer

	model_id = "pbevan11/llama-3.1-8b-ocr-correction"
	# Loads the base model together with the LoRA adapter, then moves it to the GPU
	model = AutoPeftModelForCausalLM.from_pretrained(model_id).cuda()
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	# Reuse the end-of-sequence token for padding
	tokenizer.pad_token = tokenizer.eos_token
	```

	Then, construct the prompt template like so:

	```python
	def prompt(instruction, inp):
	    """Build the Alpaca-style prompt."""
	    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

	### Instruction:
	{instruction}

	### Input:
	{inp}

	### Response:
	"""

	def prompt_tok(instruction, inp, return_ids=False):
	    """Generate a correction; return the raw token ids if return_ids is True."""
	    _p = prompt(instruction, inp)
	    input_ids = tokenizer(_p, return_tensors="pt", truncation=True).input_ids.cuda()
	    # Greedy decoding (do_sample=False) keeps the output deterministic
	    out_ids = model.generate(input_ids=input_ids, max_new_tokens=5000,
	                             do_sample=False)
	    if return_ids:
	        return out_ids

	    ids = out_ids.detach().cpu().numpy()
	    full_output = tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
	    # Keep only the text after the response marker
	    response_start = full_output.find("### Response:")
	    if response_start != -1:
	        return full_output[response_start + len("### Response:"):]
	    else:
	        return full_output[len(_p):]
	```
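
The template and the response-slicing logic can be checked without loading the model. The sketch below is a hypothetical stand-alone variant: `build_prompt` and `extract_response` are illustrative names, not part of the helpers above, and the "model output" is a hand-written string rather than a real generation.

```python
# Stand-alone check of the Alpaca-style template and response slicing.
# `build_prompt` and `extract_response` are hypothetical names for this sketch.
RESPONSE_MARKER = "### Response:"

def build_prompt(instruction, inp):
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{inp}\n\n"
        f"{RESPONSE_MARKER}\n"
    )

def extract_response(full_output, prompt_text):
    # Prefer the text after the first response marker; otherwise drop the prompt prefix
    start = full_output.find(RESPONSE_MARKER)
    if start != -1:
        return full_output[start + len(RESPONSE_MARKER):].strip()
    return full_output[len(prompt_text):].strip()

demo_prompt = build_prompt("Fix the OCR errors.", "Tlie qnick brown fox.")
# Hand-written stand-in for a decoded generation (prompt followed by the answer)
fake_output = demo_prompt + "The quick brown fox."
print(extract_response(fake_output, demo_prompt))
```

Because the slice starts at the first occurrence of the marker, an OCR input that itself contains the string `### Response:` would be cut in the wrong place; that edge case may be worth guarding against for unusual inputs.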

	Finally, you can get predictions like this:

	```python
	# model inputs
	instruction = "You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text."
	inp = "Do Not Kule Oi't hy.er-l'rieed AjijqIi: imac - Analyst (fteuiers) Hcuiers - A | ) | ilf, <;/) in |) nter |iic . conic! deeiilf. l.o sell n lower-|)rieofl wersinn oi its Macintosh cornutor to nttinct ronsnnu-rs already euami'red ot its iPod music jiayo-r untl annoyoil. by sccnrit.y problems ivitJi Willtlows PCs , Piper.iaffray analyst. (Jcne Muster <aid on Tlinrtiday."

	# print prediction
	out = prompt_tok(instruction, inp)
	# Replace stray backslashes and trim surrounding whitespace
	print(out.replace('\\', ' ').strip())
	```

	This will give you a prediction that looks like this:

	```md
	"Do Not Rule Out Lower-Priced Mac - Analyst (Reuters) Reuters - Apple Inc. may be considering a lower-priced version of its Macintosh computer to attract consumers already enamored of its iPod music player and annoyed by security problems with Windows PCs, PiperJaffray analyst Gene Munster said on Thursday."
	```

	Alternatively, you can play with this model on Replicate: [https://replicate.com/pbevan1/llama-3.1-8b-ocr-correction](https://replicate.com/pbevan1/llama-3.1-8b-ocr-correction)


	## Intended uses & limitations

	Reconstructions should not be taken as ground truth. The model is likely to invent text to fill in gaps, so its output may not be historically accurate.
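
One way to act on this caveat is to diff each reconstruction against its OCR input and have a person review the changed spans. The following is a minimal sketch using Python's standard `difflib`; the function name `changed_spans` and the sample strings are illustrative assumptions, not part of this model's tooling.

```python
import difflib

def changed_spans(ocr_text, corrected_text):
    """Return (operation, original_words, corrected_words) for every word-level edit."""
    src = ocr_text.split()
    dst = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return spans

ocr = "Gene Muster <aid on Tlinrtiday."
fixed = "Gene Munster said on Thursday."
for op, before, after in changed_spans(ocr, fixed):
    print(f"{op}: {before!r} -> {after!r}")
```

Long inserted spans with no counterpart in the OCR input are the most likely places for invented text, so they are a reasonable starting point for manual review.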

	This model is intended for restoring historical documents that have been imperfectly digitised using OCR.

	This model could also be used to transform poorly transcribed text into semi-synthetic training data, potentially unlocking millions of tokens of training data for future LLMs. The Llama 3.1 license permits training on model outputs, so this semi-synthetic data can be used for that purpose under the license terms.
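
If corrected outputs are collected as training data, it may help to store each one alongside its OCR source and a marker that the text was model-reconstructed. A minimal sketch, writing JSON Lines with the standard library; the field names (`text`, `source_ocr`, `generator`) and the helper `write_semi_synthetic` are hypothetical choices for illustration, not a format defined by this model card.

```python
import io
import json

MODEL_ID = "pbevan11/llama-3.1-8b-ocr-correction"

def write_semi_synthetic(pairs, fh):
    """Write (ocr_text, corrected_text) pairs as JSON Lines, one record per line."""
    count = 0
    for ocr_text, corrected_text in pairs:
        record = {
            "text": corrected_text,   # the reconstructed text to train on
            "source_ocr": ocr_text,   # kept so reconstructions can be audited
            "generator": MODEL_ID,    # marks the text as model-reconstructed
        }
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count

# Example with an in-memory buffer; a real run would open a file instead
buffer = io.StringIO()
written = write_semi_synthetic(
    [("Gene Muster <aid on Tlinrtiday.", "Gene Munster said on Thursday.")],
    buffer,
)
print(written, "record(s) written")
```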

	## Training and evaluation data

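	The adapter was trained on the [pbevan11/synthetic-ocr-correction-gpt4o](https://huggingface.co/datasets/pbevan11/synthetic-ocr-correction-gpt4o) dataset in Alpaca format (`type: alpaca`), with 5% of the data held out for evaluation (`val_set_size: 0.05`). More information needed.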

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0002
	- train_batch_size: 2
	- eval_batch_size: 2
	- seed: 49
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 10
	- num_epochs: 2
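
The `total_train_batch_size` above follows from the per-device batch size and gradient accumulation. A small worked sketch; the single-device count is an assumption inferred from 2 × 4 = 8, not something the card states, and the token figure is only an upper bound based on `sequence_len: 8192` with sample packing.

```python
# Effective batch size per optimizer step, from the hyperparameters listed above.
train_batch_size = 2              # per-device micro batch
gradient_accumulation_steps = 4
num_devices = 1                   # assumption: inferred from 2 * 4 = 8

total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# With sample packing, each sequence holds up to sequence_len tokens
sequence_len = 8192               # from the axolotl config
max_tokens_per_step = total_train_batch_size * sequence_len

print(f"sequences per optimizer step: {total_train_batch_size}")
print(f"upper bound on tokens per optimizer step: {max_tokens_per_step:,}")
```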

	### Training results

	| Training Loss | Epoch  | Step | Validation Loss |
	|:-------------:|:------:|:----:|:---------------:|
	| 0.61          | 0.0331 | 1    | 0.6018          |
	| 0.4379        | 0.2645 | 8    | 0.4256          |
	| 0.2531        | 0.5289 | 16   | 0.2714          |
	| 0.2366        | 0.7934 | 24   | 0.2247          |
	| 0.1839        | 1.0331 | 32   | 0.2053          |
	| 0.1752        | 1.2975 | 40   | 0.1961          |
	| 0.1629        | 1.5620 | 48   | 0.1909          |
	| 0.163         | 1.8264 | 56   | 0.1901          |
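
The learning rate behind these steps follows a warmup and then a cosine decay (`lr_scheduler: cosine`, `warmup_steps: 10`, peak `0.0002`). The sketch below shows the shape of such a schedule at the evaluated steps. The total of 60 optimizer steps is an estimate read off the table (about 30 steps per epoch over 2 epochs), and the formula is the common linear-warmup plus half-cosine form, assumed here rather than confirmed by this card.

```python
import math

PEAK_LR = 2e-4        # learning_rate from the hyperparameters
WARMUP_STEPS = 10     # lr_scheduler_warmup_steps
TOTAL_STEPS = 60      # assumption: ~30 steps/epoch * 2 epochs, read off the table

def lr_at(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at the steps where validation loss was measured
for step in (1, 8, 16, 24, 32, 40, 48, 56):
    print(f"step {step:2d}: lr = {lr_at(step):.2e}")
```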


	### Framework versions

	- PEFT 0.11.1
	- Transformers 4.43.2
	- Pytorch 2.1.2+cu118
	- Datasets 2.19.1
	- Tokenizers 0.19.1

	### Citation:
	```
	@misc {peter_j._bevan_2024,
	    author    = { {Peter J. Bevan} },
	    title     = { llama-3.1-8b-ocr-correction (Revision 2760c4e) },
	    year      = 2024,
	    url       = { https://huggingface.co/pbevan11/llama-3.1-8b-ocr-correction },
	    doi       = { 10.57967/hf/2791 },
	    publisher = { Hugging Face }
	}
	```