Conference paper, Year: 2021

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

Mika Hämäläinen
Niko Partanen
  • Role: Author
  • PersonId : 1029028
Khalid Alnajjar
  • Role: Author
  • PersonId : 1102657

Abstract

Texts written in Old Literary Finnish represent the first literary works written in Finnish, dating from the 16th century onward. Several projects in Finland have digitized old publications and made them available for research use. However, applying modern NLP methods to such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.
Main file
47.pdf (201.1 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-03265899, version 1 (23-06-2021)

Identifiers

  • HAL Id: hal-03265899, version 1

Cite

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar. Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography. Traitement Automatique des Langues Naturelles, 2021, Lille, France. pp. 189-198. ⟨hal-03265899⟩
45 views
29 downloads
