- In Document Analysis and Recognition – ICDAR 2021 Workshops
- Publisher: Springer International Publishing
- Pages: 265-281
Abstract
Arabic scripts raise numerous issues in text recognition and layout analysis. To overcome these, several datasets and methods have been proposed in recent years. Although these focus on common scripts and layouts, many Arabic writings and written traditions remain under-resourced. We therefore propose a new dataset comprising 300 images representative of the handwritten production of the Arabic Maghrebi scripts. This dataset is the result of collaborative work undertaken in the first quarter of 2021, and it offers several levels of annotation and transcription. The article intends to shed light on the specificities of these writings and manuscripts, as well as highlight the challenges of their recognition. The collaborative tools used for the creation of the dataset are assessed, and the dataset itself is evaluated with state-of-the-art methods in layout analysis. The word-based text recognition method used and experimented on for these writings achieves a character error rate (CER) of 4.8% on average. The pipeline described offers practical lessons for the quick creation of data and the training of effective HTR systems for Arabic scripts and non-Latin scripts in general.
Researcher's publications
CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts
Conference paper
- Publication date: 2024
Layout Analysis Dataset with SegmOnto
Conference paper
- Publication date: 2024
Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts
Conference paper
- Publication date: 2024