Abstract
This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882-1926. The texts, which have
been captured by typing or optical character recognition, are processed for the purpose of lexicography. Firstly, all historical and
dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the
modern standard annotations, the texts are processed using an existing finite-state morphological analyser and part-of-speech tagger.
This method enables us to retain the original historical text, and at the same time have full corpus-searching capabilities using modern
lemmas and inflected forms (one can also use the historical forms). It also makes use of existing NLP tools for modern Irish, and
enables integration of historical and modern Irish corpora.
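
The two-step approach described above can be illustrated with a minimal sketch. It is not the authors' software: the lexicon, the lookup table standing in for the finite-state morphological analyser, and the names `NORMALISATION`, `ANALYSES`, `Token`, `annotate` and `search_by_lemma` are all assumptions made for illustration, and the three word pairs are merely examples of pre-standard spellings. The sketch keeps each original form alongside its modern standard annotation, so a search on a modern lemma returns the historical text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical normalisation lexicon: historical or dialectal form -> modern standard form.
# The entries are illustrative only and are not taken from the paper.
NORMALISATION = {
    "sgéal": "scéal",
    "sgéalta": "scéalta",
    "gaedhilge": "gaeilge",
}

# Toy stand-in for a morphological analyser and tagger: standard form -> (lemma, POS).
ANALYSES = {
    "scéal": ("scéal", "Noun"),
    "scéalta": ("scéal", "Noun"),
    "gaeilge": ("gaeilge", "Noun"),
}


@dataclass
class Token:
    original: str                # form as it appears in the historical text
    standard: str                # modern standard equivalent (lower-cased)
    lemma: Optional[str] = None  # lemma of the standard form, if analysed
    pos: Optional[str] = None    # part-of-speech tag, if analysed


def annotate(words: List[str]) -> List[Token]:
    """Step 1: attach a standard form; step 2: analyse that standard form."""
    tokens = []
    for word in words:
        key = word.lower()
        standard = NORMALISATION.get(key, key)  # unknown words pass through unchanged
        lemma, pos = ANALYSES.get(standard, (None, None))
        tokens.append(Token(word, standard, lemma, pos))
    return tokens


def search_by_lemma(tokens: List[Token], lemma: str) -> List[str]:
    """Return the original historical forms whose modern lemma matches."""
    return [t.original for t in tokens if t.lemma == lemma]


tokens = annotate(["Sgéalta", "agus", "sgéal", "Gaedhilge"])
print(search_by_lemma(tokens, "scéal"))  # ['Sgéalta', 'sgéal']
```

Because the analysis runs on the `standard` field rather than on `original`, the historical spelling is never overwritten, which is the property the abstract highlights.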
Original language | English |
---|---|
Title of host publication | LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage |
Editors | Kristín Bjarnadóttir |
Place of Publication | Reykjavik, Iceland |
Pages | 12 |
Number of pages | 18 |
Publication status | Published (in print/issue) - May 2014 |
Event | LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage, Reykjavik, Iceland, May, 2014 - Duration: 26 May 2014 → 26 May 2014 |
Workshop
Workshop | LREC 2014 Workshop LRT4HDA: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage, Reykjavik, Iceland, May, 2014 |
---|---|
Period | 26/05/14 → 26/05/14 |
Keywords
- historical corpus
- normalisation
- standardisation
- natural language processing
- Irish
- Gaeilge