Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts

Elaine Uí Dhonnchadha, Kevin Scannell, Ruairí Ó hUiginn, Eilís Ní Mhearraí, Máire Nic Mhaoláin, Brian Ó Raghallaigh, Gregory Toner, Séamus Mac Mathúna, Déirdre D’Auria, Eithne Ní Ghallchobhair, Niall O’Leary

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882-1926. The texts, which have
    been captured by typing or optical character recognition, are processed for the purpose of lexicography. First, all historical and
    dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the
    modern standard annotations, the texts are processed with an existing finite-state morphological analyser and part-of-speech tagger.
    This method retains the original historical text while providing full corpus-searching capabilities using modern
    lemmas and inflected forms (the historical forms can also be searched). It also makes use of existing NLP tools for modern Irish and
    enables the integration of historical and modern Irish corpora.
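
    The annotation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' software: the lookup table `STANDARD_FORMS`, the `Token` class, and the `annotate` and `search` functions are hypothetical names, and the three spelling pairs are illustrative examples rather than data from the paper. Each token keeps its original form alongside a modern standard form, so a search can match on either.

```python
from dataclasses import dataclass

# Toy lookup table of pre-standard spellings and modern standard equivalents.
# Illustrative entries only; a real system would need a far larger resource
# plus rules for forms that are not listed.
STANDARD_FORMS = {
    "gaedhilge": "Gaeilge",
    "bliadhain": "bliain",
    "léigheamh": "léamh",
}


@dataclass(frozen=True)
class Token:
    original: str  # form as it appears in the historical text
    standard: str  # modern standard form (same as original if not in the table)


def annotate(text: str) -> list[Token]:
    """Pair every whitespace-separated word with a modern standard form."""
    tokens = []
    for word in text.split():
        standard = STANDARD_FORMS.get(word.lower(), word)
        tokens.append(Token(original=word, standard=standard))
    return tokens


def search(tokens: list[Token], query: str) -> list[int]:
    """Return positions whose original OR standard form equals the query."""
    q = query.lower()
    return [
        i
        for i, t in enumerate(tokens)
        if q in (t.original.lower(), t.standard.lower())
    ]


tokens = annotate("bliadhain na Gaedhilge")
print(search(tokens, "Gaeilge"))    # query by modern form
print(search(tokens, "Gaedhilge"))  # query by historical form
```

    Because the original form is never overwritten, the standard form could be passed to tools built for modern Irish while the historical text stays available for display and search.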
    Original language: English
    Title of host publication: LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage
    Editors: Kristín Bjarnadóttir
    Place of Publication: Reykjavik, Iceland
    Pages: 12
    Number of pages: 18
    Publication status: Published - May 2014
    Event: LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage, Reykjavik, Iceland, May 2014
    Duration: 26 May 2014 to 26 May 2014

    Keywords

    • historical corpus
    • normalisation
    • standardisation
    • natural language processing
    • Irish
    • Gaeilge
