|
--- |
|
base_model: |
|
- Kansallisarkisto/multicentury-htr-model-onnx |
|
pipeline_tag: image-to-text |
|
license: mit |
|
--- |
|
|
|
## Handwritten text recognition for table cell images |
|
|
|
The model performs handwritten text recognition from text line images. |
|
It was trained by fine-tuning the National Archives' Multicentury HTR model, which is based on Microsoft's TrOCR model, using text line images taken from Finnish death record and census record tables from the 1930s.
|
|
|
## Intended uses & limitations |
|
|
|
The model has been trained to recognize handwritten text from a specific type of table cell data, |
|
and may generalize poorly to other datasets. |
|
|
|
The model takes text line images as input; the use of other types of input is not recommended.
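In practice, the text line images are usually cropped from scanned table pages. A minimal sketch of such a crop using PIL is shown below; the page path and bounding box coordinates are hypothetical placeholders that would come from your own layout analysis or cell detection step.

```python
from PIL import Image

# Illustrative only: crop a single table cell / text line from a scanned page.
# The path and bounding box below are hypothetical placeholders.
page = Image.open("/path/to/scanned_table_page.jpg").convert("RGB")
left, top, right, bottom = 120, 480, 620, 540
line_image = page.crop((left, top, right, bottom))
line_image.save("/path/to/textline_image.jpg")
```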
|
|
|
## How to use |
|
|
|
The model can be used to predict the text content of text line images, as shown in the code below.
|
It is recommended to use a GPU for inference if available.
|
|
|
```python |
|
from transformers import TrOCRProcessor, VisionEncoderDecoderModel |
|
from PIL import Image |
|
import torch |
|
|
|
# Use GPU if available |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
# Model location in Huggingface Hub |
|
model_checkpoint = "Kansallisarkisto/tablecell-htr" |
|
# Path to textline image |
|
line_image_path = "/path/to/textline_image.jpg" |
|
|
|
# Initialize processor and model |
|
processor = TrOCRProcessor.from_pretrained(model_checkpoint) |
|
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device) |
|
|
|
# Open image file and extract pixel values |
|
image = Image.open(line_image_path).convert("RGB") |
|
pixel_values = processor(image, return_tensors="pt").pixel_values |
|
|
|
# Use the model to generate predictions |
|
generated_ids = model.generate(pixel_values.to(device)) |
|
# Use the processor to decode ids to text |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
print(generated_text) |
|
``` |
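When many table cells need to be transcribed, inference can also be run in batches. The sketch below extends the example above (the processor, model, and device are assumed to be initialized as shown); `line_image_paths` is a hypothetical list of text line image paths.

```python
# Hypothetical list of text line image paths
line_image_paths = ["/path/to/line_01.jpg", "/path/to/line_02.jpg"]

# Open all images and extract pixel values as a single batch
images = [Image.open(path).convert("RGB") for path in line_image_paths]
pixel_values = processor(images, return_tensors="pt").pixel_values

# Generate predictions for the whole batch and decode them to text
generated_ids = model.generate(pixel_values.to(device))
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
for path, text in zip(line_image_paths, generated_texts):
    print(path, text)
```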
|
The model downloaded from the Hugging Face Hub is cached locally in `~/.cache/huggingface/hub/`.
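If the default cache location is not suitable, the download directory can be overridden, for example with the standard `cache_dir` argument of `from_pretrained` (the path below is a hypothetical placeholder):

```python
# Illustrative: store the downloaded files in a custom directory instead of the default cache
processor = TrOCRProcessor.from_pretrained(model_checkpoint, cache_dir="/path/to/model_cache")
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint, cache_dir="/path/to/model_cache").to(device)
```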
|
|
|
## Training data |
|
|
|
The model was trained using 6704 text line images, while the validation dataset contained 836 text line images.
|
|
|
## Training procedure |
|
|
|
This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
|
|
|
- train batch size: 16 |
|
- epochs: 15 |
|
- optimizer: AdamW |
|
- maximum length of text sequence: 64 |
|
|
|
For other parameters, the default values were used (find more information [here](https://huggingface.co/docs/transformers/model_doc/trocr)). |
|
The training code is available in the `train_trocr.py` file.
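As a rough illustration of the hyperparameters listed above, a comparable fine-tuning run could be configured with the `Seq2SeqTrainer` API roughly as follows. This is a hedged sketch, not the contents of `train_trocr.py`: the base checkpoint variable, output path, and dataset objects are hypothetical placeholders, and dataset preparation and metric computation are omitted.

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    TrOCRProcessor,
    VisionEncoderDecoderModel,
)

# Hypothetical placeholder: path or Hub id of the base Multicentury HTR checkpoint
base_checkpoint = "/path/to/multicentury-htr-model"

processor = TrOCRProcessor.from_pretrained(base_checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(base_checkpoint)

# Maximum length of the generated text sequence
model.config.max_length = 64

training_args = Seq2SeqTrainingArguments(
    output_dir="./tablecell-htr-finetuned",  # hypothetical output path
    per_device_train_batch_size=16,          # train batch size: 16
    num_train_epochs=15,                     # epochs: 15
    predict_with_generate=True,
    # AdamW is the default optimizer of the Trainer API
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,  # hypothetical dataset of text line images and transcriptions
#     eval_dataset=eval_dataset,
# )
# trainer.train()
```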
|
|
|
## Evaluation results |
|
|
|
Evaluation results using the validation dataset are listed below: |
|
|
|
|
|
| Validation loss | Validation CER | Validation WER |
| :-------------- | :------------- | :------------- |
| 0.903           | 0.107          | 0.237          |
|
|
|
|
|
|
|
The metrics were calculated using the [Evaluate](https://huggingface.co/docs/evaluate/index) library. |
|
More information on the CER metric can be found [here](https://huggingface.co/spaces/evaluate-metric/cer). |
|
More information on the WER metric can be found [here](https://huggingface.co/spaces/evaluate-metric/wer). |
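As an illustration of how these metrics are computed with the Evaluate library, a minimal sketch is shown below; the prediction and reference strings are hypothetical examples, not taken from the validation dataset.

```python
import evaluate

# Load the character error rate (CER) and word error rate (WER) metrics
cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

# Hypothetical example: model predictions vs. ground-truth transcriptions
predictions = ["Helsinki 1931"]
references = ["Helsingfors 1931"]

cer = cer_metric.compute(predictions=predictions, references=references)
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.3f}, WER: {wer:.3f}")
```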