You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

Journal: Journal of Data Mining and Digital Humanities (Historical Documents and...)

Abstract

Layout Analysis (the identification of zones and their classification) is the first step, alongside line segmentation, in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a work from a digitized book and producing noisy output. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target for the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from pixel classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter substantially outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
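To make the idea of isothetic rectangles concrete, the following is a minimal sketch, not the YALTAi implementation. It assumes a zone is annotated as a polygon of pixel coordinates, reduces it to its axis-aligned bounding rectangle, and writes it in the normalized `class x_center y_center width height` form used for YOLOv5 label files. The function names and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def polygon_to_isothetic_rectangle(polygon: List[Point]) -> Box:
    """Return the smallest axis-aligned rectangle enclosing the polygon
    as (x_min, y_min, x_max, y_max), in pixel coordinates."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def to_yolo_label(class_id: int, box: Box, image_width: int, image_height: int) -> str:
    """Format a pixel-space box as a normalized YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical zone polygon on a 1000 x 2000 pixel page image.
zone = [(100, 200), (500, 180), (520, 900), (90, 880)]
box = polygon_to_isothetic_rectangle(zone)
label = to_yolo_label(0, box, image_width=1000, image_height=2000)
```

The rectangle discards the exact outline of the zone, which is the trade-off the abstract describes: a coarser target in exchange for a simpler detection task.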

Researcher publications

CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts

Conference paper
- Ariane Pinche,
  Thibault Clérice,
  Jean-Baptiste Camps,
  Malamatenia Vlachou-Efstathiou,
  Matthias Gille Levenson,
  Olivier Brisville-Fertin,
  Federico Boschetti,
  Franz Fischer,
  Michael Gervers,
  Agnès Boutreux,
  Avery Manton,
  Simon Gabay,
  Wouter Haverals,
  Mike Kestemont,
  Caroline Vandyck,
  Patricia O'Connor,
  Alix Chagué
- Publication date: 2024

Layout Analysis Dataset with SegmOnto

Conference paper
- Thibault Clérice,
  Juliette Janes,
  Hugo Scheithauer,
  Sarah Bénière,
  Laurent Romary,
  Benoît Sagot
- Publication date: 2024

Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts

Conference paper (new)
- Thibault Clérice
- Publication date: 2024

Publications from the École's press

La véridique histoire de l’arobase
- Marc H. Smith

L’Ordinaire mestre Tancré
- Frédéric Duval

Le malheur d’être femme
- Pascale Bourgain

Abécédaire insolite du livre ancien
- Christine Bénévent

La bibliothèque de Thou et ses catalogues
- Valérie Neveu

Positions des thèses 2023
- Promotion 2023

Des archives considérées comme une substance hallucinogène
- Michel Melot

L’historien face à l’animal
- Michel Pastoureau

Applications, editions and datasets

DicoTopo

Production
- Led by the CJM

Elec

Production, dev, beta
- Text editing
- Led by the CJM

Pyrrha

Production
- Natural language processing
- Led by the CJM

On the same topics

Une « dissimulation profonde » : l’insondable duc de Marlborough

Où va l’État « à la française » ?

L’amitié au Moyen Âge. Le cas Joinville-saint Louis

L’ekphrasis dans la «Viennis» de Ioannes Damascenus (1717)
