medieval-french

HTRomance, Medieval French corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation
=====================

Introduction

This ground-truth dataset was built to provide generic data for training a strong and reliable HTR model for Medieval French manuscripts. Each manuscript should contribute around 10 columns (5 two-column pages or 10 single-column pages).

The data follow the SegmOnto guidelines.

[!NOTE] The repository contains two XML files per image. Files with the .chocomufin.xml suffix are normalized so that they are compatible with other datasets following the same guidelines. The other files are specific to this repository. We recommend using the normalized files.
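As an illustration of the note above, the following minimal sketch separates the normalized files from the repository-specific ones by suffix. The function name `split_alto_files` and the `data` root used in the usage lines are assumptions made for the example, not part of the repository.

```python
from pathlib import Path

# Suffix named in the note above for the normalized files.
NORMALIZED_SUFFIX = ".chocomufin.xml"


def split_alto_files(root):
    """Return (normalized, repository_specific) lists of XML paths found under root.

    Hypothetical helper: it only inspects file names, not file contents.
    """
    normalized, specific = [], []
    for path in sorted(Path(root).rglob("*.xml")):
        if path.name.endswith(NORMALIZED_SUFFIX):
            normalized.append(path)
        else:
            specific.append(path)
    return normalized, specific


if __name__ == "__main__":
    # "data" is assumed to be the data/ directory of a local clone.
    normalized, specific = split_alto_files("data")
    print(f"{len(normalized)} normalized files, {len(specific)} repository-specific files")
```

Filtering on the file name keeps the selection independent of the XML content, so it works without parsing any document.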

Credits

Transcription guidelines

The transcription guidelines are described in a paper available on HAL and published in the Journal of Open Humanities Data. The paper gives details about the selection process, the transcription methods and choices, and the output (mainly the Generic CREMMA Model for Medieval Manuscripts (Latin and Old French) for Kraken).

Data

ALTO and images can be found in the directory called data/. Each subfolder of data/ corresponds to a single manuscript, identified by its shelfmark.
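To show how the transcriptions can be read, here is a minimal sketch that extracts the text of each line from one ALTO file using only the Python standard library. It assumes the usual ALTO layout, where each `TextLine` element holds one or more `String` elements whose `CONTENT` attribute carries the text. The helper names `local_name` and `alto_lines` are hypothetical, and tag names are matched without their namespace so the ALTO schema version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(source):
    """Yield the transcription of each TextLine in an ALTO document.

    `source` is a file path or a binary file object. The text is read from
    the CONTENT attribute of the String elements inside each TextLine
    (assumed ALTO layout); several String elements are joined with spaces.
    """
    tree = ET.parse(source)
    for element in tree.iter():
        if local_name(element.tag) != "TextLine":
            continue
        strings = [
            child.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        yield " ".join(strings)
```

Calling `list(alto_lines(path))` on a file from one of the manuscript subfolders would return its lines in document order, which is enough to recompute line and character counts for a page.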

| Shelfmark | Links | Range | Type | Century | Color | Pages | Main Zones | Lines | Characters | Genre | Content |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BnF, NAF 23686 | 📁 | 112ra-114rb | prose | 13 | | 5 | 10 | 424 | 17817 | | Vie de saint Alexis |
| BnF, fr. 1443 | 📁 | 1ra-3rb | verse | 13 | | 5 | 11 | 418 | 10829 | | Garin le Loherain |
| BnF, fr. 1553 | 📁 | 506r-508v | verse | 13 | | 5 | 10 | 506 | 11154 | | Le Meunier d’Arleux |
| BnF, fr. 1635 | 📁 | fol. 4v-5v | verse | 13 | | 3 | 7 | 217 | 4833 | | Testament de l’âne |
| BnF, fr. 12581 | 📁 | 373r-375v | verse | 13 | | 4 | 8 | 306 | 9289 | | Li Fabliaus des Treces |
| BnF, fr. 20050 | 📁 | 4r-5v | verse | 13 | | 4 | 4 | 84 | 3793 | | Le chansonnier de saint Germain |
| BnF, fr. 1669 | 📁 | 1r-3v | prose | 13 | | 3 | 11 | 484 | 10183 | Narratives | roman |
| BnF, fr. 747 | 📁 | 1v-2v | prose | 13 | | 1 | 2 | 91 | 4351 | | Estoire du Roman del Saint Graal |
| BnF, fr. 104 | 📁 | 1r-2v | prose | 13 | | 4 | 10 | 404 | 15398 | | Roman de Tristan |
| BnF, fr. 2168 | 📁 | 88rb | verse | 13 | | 5 | 10 | 370 | 7964 | | Le sacristain |
| BnF, NAF 10039 | 📁 | 1r-3r | verse | 13 | | 4 | 4 | 118 | 3165 | | Roman d’Aspremont |
| BnF, fr. 1450 | 📁 | 1r-2v | verse | 13 | | 4 | 14 | 711 | 14855 | | Roman de Troie |
| BnF, fr. 17229 | 📁 | 127r-129r | prose | 13 | | 3 | 12 | 479 | 12511 | | Legendier |
| BnF, fr. 23117 | 📁 | 299vc-304rb | prose | 13 | | 5 | 21 | 736 | 19858 | | Vie de saint Martin |
| BnF, fr. 6447 | 📁 | 270r-271v | prose | 13 | | 4 | 8 | 383 | 13246 | | Vie de saint Martin |
| BnF, fr. 2173 | 📁 | 96r-97v | verse | 13 | | 4 | 8 | 240 | 5269 | | La Mal Honte |
| BnF, fr. 19152 | 📁 | 120vd-122rc | verse | 13 | | 4 | 13 | 529 | 11087 | | C’est li Romanz des Braies |
| BnF, fr. 12615 | 📁 | 230v-231r | verse | 13 | | 2 | 5 | 65 | 3336 | | chansonnier de Noailles, Chanson d’amour d’Adam le bossu |
| BnF, fr. 12603 | 📁 | 203ra-205ra | verse | 13 | | 5 | 16 | 442 | 14125 | | Fierabras |
| BnF, fr. 12554 | 📁 | 1r-2v | prose | 14 | | 4 | 4 | 184 | 7183 | Narratives | roman |
| BnF, fr. 5024 | 📁 | 1r-3r | prose | 14 | | 4 | 16 | 238 | 10631 | | Le formulaire d’Odart Morchesne |
| BnF, fr. 13568 | 📁 | 1r-3v | historical | 14 | | 5 | 10 | 199 | 3373 | | Mémoires de saint Louis |
| BnF, fr. 13568 | 📁 | 1r-5r | prose | 14 | | 5 | 10 | 199 | 3373 | | Mémoires de Froissart |
| BnF, fr. 574 | 📁 | 4v-5v | religious | 14 | | 3 | 6 | 116 | 2026 | | Image du monde |
| BnF, Arsenal, ms. 3525 | 📁 | 88v-91v | verse | 14 | | 7 | 8 | 185 | 4377 | | Dit des trois Dames de Paris |
| BnF, fr. 12558 | 📁 | 1ra-3ra | verse | 14 | | 5 | 10 | 440 | 14016 | | Chevalier du cygne |
| BnF, fr. 840 | 📁 | 266r-267v | didactic | 14 | | 4 | 8 | 256 | 6374 | | Art de Dictier |
| BnF, fr. 619 | 📁 | 1ra-4vb | prose | 14 | | 6 | 12 | 356 | 11147 | | Gaston Phébus, Livre de chasse |
| BnF, Arsenal, ms. 3346 | 📁 | 1r-3v | prose | 15 | | 5 | 11 | 286 | 7194 | | Garin le lorrain |
| BnF, fr. 1357 | 📁 | 1v-5r | prose | 15 | | 4 | 10 | 320 | 12682 | | Simon de Phares, Recueil des plus celebres astrologues |
| BnF, fr. 2701 | 📁 | 121r-121v | prose | 15 | | 2 | 11 | 551 | 11654 | | Epitre de Juvénal des ursins |
| BnF, Arsenal, ms. 3350 | 📁 | 1v-3v | - | 15 | | 4 | 8 | 271 | 7959 | Narratives | - |
| BnF, fr. 11610 | 📁 | 1r-4r | prose | 15 | | 7 | 8 | 166 | 5431 | | Roman du comte d’Artois |
| BnF, fr. 1881 | 📁 | 93r-95r | verse | 16 | | 4 | 11 | 194 | 3941 | Narratives | Vie de saint Alexis |
| BnF, fr. 1881 | 📁 | 93r-96r | verse | 16 | | 4 | 11 | 194 | 3941 | Narratives | chanson |

Metrics

Total number of pages

147
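The total can be cross-checked against the data table. The sketch below copies the Pages value of each row, grouped by century, and sums them. It assumes that the Pages value is the first count following the century in each row and that every row is counted, including the two shelfmarks that are listed twice. The variable names are chosen for the example.

```python
# Pages value of each table row, in table order, grouped by century.
PAGES_BY_CENTURY = {
    13: [5, 5, 5, 3, 4, 4, 3, 1, 4, 5, 4, 4, 3, 5, 4, 4, 4, 2, 5],
    14: [4, 4, 5, 5, 3, 7, 5, 4, 6],
    15: [5, 4, 2, 4, 7],
    16: [4, 4],
}

# Subtotal per century, then the grand total across all rows.
subtotals = {century: sum(pages) for century, pages in PAGES_BY_CENTURY.items()}
total = sum(subtotals.values())

print(subtotals)  # {13: 74, 14: 43, 15: 22, 16: 8}
print(total)      # 147
```

The sum over all 35 rows gives 147, which matches the figure reported above.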

Regions

Lines

Funding

This project was funded by the Bibliothèque nationale de France through the Datalab's 2022 call for projects for 2023.

Cite the project

Clérice, T., Chagué, A., Gille-Levenson, M., Brisville-Fertin, O., Pinche, A., Camps, J., Fischer, F., Boschetti, F., Guadagnini, E., Guilhem Couffignal, G., Canteaut, O., Romary, L., Reboul, M., Perreaux, N., Poibeau, T., Smith, M., Norindr, J., Glaise, A., Navas Farré, M., Bordier, J., Leroy, N., Alba, R., & Rubin, G. HTRomance [Data set]. https://htromance-project.github.io/

@misc{Clerice_HTRomance,
author = {Clérice, Thibault and Chagué, Alix and Gille-Levenson, Matthias and Brisville-Fertin, Olivier and Pinche, Ariane and Camps, Jean-Baptiste and Fischer, Franz and Boschetti, Federico and Guadagnini, Elisa  and Guilhem Couffignal, Gilles and Canteaut, Olivier and Romary, Laurent and Reboul, Marianne and Perreaux, Nicolas and Poibeau, Thierry and Smith, Marc and Norindr, Jade and Glaise, Anthony and Navas Farré, Marina and Bordier, Julie and Leroy, Noé and Alba, Rachele and Rubin, Giorgia},
title = {HTRomance},
url = {https://htromance-project.github.io/}
}

Infrastructure

This project relied on the CREMMA infrastructure.