Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts

Elaine Uí Dhonnchadha, Kevin Scannell, Ruairí Ó hUiginn, Eilís Ní Mhearraí, Máire Nic Mhaoláin, Brian Ó Raghallaigh, Gregory Toner, Séamus Mac Mathúna, Déirdre D’Auria, Eithne Ní Ghallchobhair, Niall O’Leary

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    This paper describes the processing of a seven-million-word corpus of Irish texts from the period 1882-1926. The texts, which
    were captured by typing or optical character recognition, are processed for lexicographic purposes. First, all historical and
    dialectal word forms are annotated with their modern standard equivalents, using software developed for this purpose. Then, using
    these modern standard annotations, the texts are processed with an existing finite-state morphological analyser and part-of-speech
    tagger. This method retains the original historical text while providing full corpus-search capabilities over modern lemmas and
    inflected forms (the historical forms can also be searched). It also makes use of existing NLP tools for modern Irish and
    enables the integration of historical and modern Irish corpora.
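The two-stage approach described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' software: the lookup tables, the `Token` class, and the `annotate` and `search` functions are hypothetical stand-ins, and the word pairs are only illustrative examples of pre-standard spellings. A dictionary takes the place of the normalisation software, and a second dictionary takes the place of the finite-state morphological analyser and part-of-speech tagger.

```python
from dataclasses import dataclass

# Hypothetical normalisation table: pre-standard spelling -> modern standard form.
# Illustrative entries only; the real system uses dedicated software.
NORMALISATION = {
    "oidhche": "oíche",
    "saoghal": "saol",
    "biadh": "bia",
}

# Toy stand-in for a morphological analyser and tagger for modern Irish:
# modern standard form -> (lemma, part of speech).
MODERN_ANALYSES = {
    "oíche": ("oíche", "Noun"),
    "saol": ("saol", "Noun"),
    "bia": ("bia", "Noun"),
}


@dataclass
class Token:
    original: str   # historical form, kept exactly as in the source text
    standard: str   # modern standard equivalent (annotation layer)
    lemma: str      # lemma obtained by analysing the standard form
    pos: str        # part-of-speech tag obtained the same way


def annotate(words):
    """Stage 1: attach a modern standard form; stage 2: analyse that form."""
    tokens = []
    for word in words:
        # Words missing from the table are assumed to be standard already.
        standard = NORMALISATION.get(word.lower(), word)
        lemma, pos = MODERN_ANALYSES.get(standard.lower(), (standard.lower(), "Unknown"))
        tokens.append(Token(word, standard, lemma, pos))
    return tokens


def search(tokens, query):
    """Return positions where the query matches the historical form,
    the modern standard form, or the modern lemma."""
    q = query.lower()
    return [
        i for i, t in enumerate(tokens)
        if q in (t.original.lower(), t.standard.lower(), t.lemma.lower())
    ]


tokens = annotate(["saoghal", "oidhche", "bia"])
print(search(tokens, "oíche"))    # query by modern form
print(search(tokens, "oidhche"))  # query by historical form
```

Because the original form is stored alongside the annotation rather than overwritten, the same token can be found with either a modern or a historical query, which is the property the abstract highlights.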
    Language: English
    Title of host publication: LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage
    Editors: Kristín Bjarnadóttir
    Place of publication: Reykjavik, Iceland
    Pages: 12
    Number of pages: 18
    Publication status: Published - May 2014
    Event: LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage, Reykjavik, Iceland, May 2014
    Duration: 26 May 2014 - 26 May 2014


    Keywords

    • historical corpus
    • normalisation
    • standardisation
    • natural language processing
    • Irish
    • Gaeilge

    Cite this

    Uí Dhonnchadha, E., Scannell, K., Ó hUiginn, R., Ní Mhearraí, E., Nic Mhaoláin, M., Ó Raghallaigh, B., ... O’Leary, N. (2014). Corpas na Gaeilge 1882-1926: Integrating Historical and Modern Irish Texts. In K. Bjarnadóttir (Ed.), LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage (p. 12). Reykjavik, Iceland.