## Model Description
The Swedish National Archives presents an end-to-end Handwritten Text Recognition (HTR) pipeline for running-text documents ranging from the mid-17th century to the late 19th century. The pipeline consists of the following components:
- **RTMDet Instance Segmentation Models:** The pipeline uses two RTMDet instance segmentation models, trained with MMDetection. The first segments text regions within a document; the second segments text lines within those regions. Together they identify and localize text lines, a crucial step in the pipeline, since text-recognition models work at the text-line level.
- **SATRN HTR Model:** The pipeline incorporates a SATRN (Self-Attention Text Recognition Network) model, trained with MMOCR (OpenMMLab's OCR toolbox). SATRN is a state-of-the-art model for irregular scene-text recognition, which makes it a strong choice for HTR, since handwriting is highly irregular. The model consists of a shallow CNN, a 2D-attention transformer encoder, and a transformer decoder that operates at the character level. It was trained on roughly one million text-line images from running-text handwritten documents spanning the mid-17th to the late 19th century.
Together, these models provide a generic HTR pipeline with robust performance on running-text documents from the mid-17th to the late 19th century. A minimal inference sketch is shown below.
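The three models chain together through the standard OpenMMLab inference APIs. The sketch below (assuming MMDetection >= 3.0 and MMOCR >= 1.0) illustrates the overall flow; all config and checkpoint filenames are hypothetical placeholders, and the mask-to-crop logic between stages is elided.

```python
from mmdet.apis import init_detector, inference_detector
from mmocr.apis import TextRecInferencer

# Stage 1: segment text regions on the full page.
region_model = init_detector("rtmdet_regions.py", "rtmdet_regions.pth", device="cuda:0")
regions = inference_detector(region_model, "page.jpg")

# Stage 2: segment text lines inside each cropped region (same API, second model).
line_model = init_detector("rtmdet_lines.py", "rtmdet_lines.pth", device="cuda:0")
region_crops = ["region_000.png"]  # crops cut from the region masks (elided)
lines = [inference_detector(line_model, crop) for crop in region_crops]

# Stage 3: run the SATRN recognizer on each cropped text line.
recognizer = TextRecInferencer(model="satrn_htr.py", weights="satrn_htr.pth")
line_crops = ["line_000.png", "line_001.png"]  # crops cut from the line masks (elided)
result = recognizer(line_crops)
for pred in result["predictions"]:
    print(pred["text"], pred["scores"])
```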
## Evaluation
The Swedish National Archives HTR pipeline is evaluated with standard HTR metrics. Character Error Rate (CER) is used to assess the accuracy of the text-recognition model. The most informative way to evaluate the entire pipeline is to run all three models on unsegmented document images and compute the CER end-to-end.
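For reference, CER is the character-level edit distance between the predicted and ground-truth transcriptions, normalized by the length of the ground truth. A minimal sketch (not the evaluation code used for the numbers below):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """CER = edit_distance / len(reference); 0.0 is a perfect match."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("Stockholn 1734", "Stockholm 1734"))  # 1 substitution / 14 chars ~= 0.071
```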
The reported metrics were obtained on several test sets from archives that were not included in the training set, spanning the entire period the model was trained on. These error rates are therefore what you should expect when running the pipeline out of the box on your own documents, provided they contain running text and fall within the model's time period. Note that actual performance may vary depending on the specific layout and handwriting styles in a given document. All values below are CER (lower is better).
| Model | train-eval | 1661 testset | 1664 testset | 1688 testset (unusual layout) | 1735 testset | 1740-1793 testset | 1777 testset | 1840-1890 testset | 1861 testset |
|---|---|---|---|---|---|---|---|---|---|
| SATRN_1650_1900 | 0.033 | 0.096 | 0.078 | 0.215 | 0.079 | 0.066 | 0.074 | 0.037 | 0.043 |
| SATRN_1650_1800 | 0.039 | 0.109 | 0.085 | 0.243 | 0.079 | 0.079 | 0.087 | 0.239 | 0.157 |
| SATRN_1800_1900 | 0.031 | 0.455 | 0.382 | 0.381 | 0.309 | 0.252 | 0.182 | 0.046 | 0.051 |
The lower two rows are included for comparison only. Note that the model trained exclusively on 19th-century data actually performed worse on the 19th-century test sets than the model trained on the entire period. This is why we published only the aggregated model rather than models specialized on a specific century.
We evaluate the pipeline regularly, and this table will be updated as new results become available.
We also ran fine-tuning experiments to quantify the performance benefit of fine-tuning the model on domain-specific material, and to give a rough estimate of how many pages one needs to transcribe for the fine-tuning.
| Model | 17th-century testsets combined | 18th-century testsets combined | 19th-century testsets combined |
|---|---|---|---|
| SATRN_1650_1900 | 0.124 | 0.095 | 0.038 |
| SATRN_1650_1900_ft | 0.064 | 0.084 | 0.026 |
| Pages transcribed for fine-tuning | 57 | 28 | 29 |
As the table shows, 50-60 transcribed pages are enough to halve the CER on 17th-century documents. Around 30 pages yield significant improvements on 18th- and 19th-century text, though the gains are not as steep. If you have a large domain you want to run the pipeline on, our recommendation is to transcribe 50-100 pages and fine-tune the text-recognition model on that data. Guides on how to do this are forthcoming; a sketch of what such a setup could look like follows below.
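Fine-tuning with MMOCR is configuration-driven. The snippet below is a minimal, hypothetical MMOCR 1.x config that inherits a SATRN base recipe and points it at your own transcribed line images; the base config name, dataset paths, and checkpoint path are all placeholders, and pipeline/annotation details are elided (see MMOCR's dataset docs).

```python
# finetune_satrn.py -- a hypothetical MMOCR 1.x fine-tuning config.
_base_ = ["satrn_shallow_5e_st_mj.py"]  # a SATRN recipe shipped with MMOCR

load_from = "satrn_htr.pth"  # start from the released HTR weights (placeholder)

# Point the train dataloader at your transcribed line crops.
train_dataloader = dict(
    dataset=dict(
        type="OCRDataset",
        data_root="data/my_domain",
        ann_file="train_labels.json",
    ),
)

# A short, low-learning-rate schedule: with only 50-100 pages the goal is to
# adapt the pretrained model, not retrain it from scratch.
optim_wrapper = dict(optimizer=dict(lr=1e-4))
train_cfg = dict(max_epochs=10)
```

Training would then be launched with MMOCR's standard entry point, e.g. `python tools/train.py finetune_satrn.py`.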
## Intended Use
The Swedish National Archives HTR pipeline is intended to be used for the following purposes:
- **Handwritten Text Recognition:** The pipeline enables automatic recognition of handwritten text in running-text documents from the 17th to the 19th century. It can be used by researchers, historians, and archivists to efficiently transcribe and analyze historical texts.
- **Document Digitization:** The pipeline aids the digitization of archival documents by automating the extraction and transcription of handwritten text, facilitating broader accessibility and preservation of historical materials.
Note that the pipeline is optimized for running-text documents from the specified time period and may not perform well on other document types or handwriting styles. It is also currently better suited to book-style pages than to complex layouts such as tables or newspapers.
## Performance and Limitations
The performance of the Swedish National Archives HTR pipeline is influenced by several factors:
- **Accuracy:** The pipeline achieves high accuracy in segmenting text regions and lines and in recognizing text content. Recognition accuracy may nevertheless vary with the quality of the original document, the handwriting style, and legibility.
- **Speed:** The pipeline aims for real-time or near real-time processing of handwritten text documents; throughput depends on the hardware used for inference.
- **Document Specificity:** The pipeline is trained specifically on running-text documents from the 17th to the 19th century. It may not perform well on documents outside this time period or on documents with atypical layouts.
- **Language Limitations:** The pipeline is primarily intended for Swedish text recognition. It may handle other languages, such as Finnish, to some extent, but its accuracy will likely be lower than for Swedish.
- **Handwriting Style:** The pipeline is optimized for the cursive handwriting prevalent in the historical documents of the Swedish National Archives. It may not perform as well on other handwriting styles, such as block letters or highly stylized scripts.
## Training Data
The Swedish National Archives HTR pipeline was trained using a diverse dataset of binarized, running-text documents from the 17th to the 19th century. The training data includes various types of historical texts, such as letters, manuscripts, and official records.
The dataset comprises both high-quality and challenging examples to ensure the models' robustness. It covers a wide range of handwriting styles, legibility levels, and document conditions.
The training data was annotated to provide ground truth for text region and line segmentation, as well as text transcription. Expert archivists and historians contributed to the annotation process to ensure accurate labeling.
The data can be found here: (WIP, will be added soon)
## Caveats and Future Work
Although the Swedish National Archives HTR pipeline has been trained and optimized for running-text documents from the specified time period, there are a few caveats and considerations to keep in mind:
- **Continuous Improvement:** The pipeline is continuously updated and improved as new training data becomes available and OCR technology advances. With access to more training data, the models will be updated to further enhance their performance and adaptability.
- **User Feedback:** Users are encouraged to provide feedback on the pipeline's performance and to report issues, potential biases, or limitations. This feedback is highly valuable in refining the pipeline, addressing concerns, and informing future updates.
## References
If you would like to learn more about the Swedish National Archives HTR pipeline or access the training data, please refer to the following resources: