Example Corpus of Language Registers



Welsh is a very rich language with many different registers. People do not write an official report in the same way as a post for Facebook or Twitter, nor do they speak in the same way when giving a lecture as when talking with friends. These registers differ in their vocabulary and grammar.

We are researching whether a computer can recognize some of these registers automatically. This would help many areas of Welsh language technology, including translation memory systems and machine translation. In our work on developing speech recognition for Welsh, we are interested in the differences between spoken and written registers.

We have been using our internal corpus collected from Cysill Ar-lein (an online Welsh language spelling and grammar checker) as raw material for recognizing these different registers. Some of the characteristics of the different registers can be seen in our language register matrix. We have taken a selection of suitable segments from the Cysill Ar-lein corpus, tagged them appropriately, and included them in the example corpus below.

You can use the diagnostic characteristics shown in the language register matrix to find sentences that display those characteristics, together with the relevant register, by using the search function.
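As a rough illustration of this kind of search, the sketch below filters a handful of tagged segments by a diagnostic pattern. Everything in it is an assumption made for illustration: the sample sentences, the register labels, the feature names, and the `find_segments` helper are hypothetical and are not taken from the example corpus, the register matrix, or the site's actual search function. The two patterns contrast the colloquial form "dw i" with the more formal "yr wyf".

```python
import re

# Hypothetical tagged segments as (register label, text) pairs.
# These are invented examples, not entries from the actual corpus.
SEGMENTS = [
    ("formal", "Yr wyf yn ysgrifennu adroddiad."),
    ("informal", "Dw i'n mynd i'r dre heno."),
    ("informal", "Dw i ddim yn gwybod."),
]

# Illustrative diagnostic patterns; the real register matrix
# may describe different or more detailed characteristics.
DIAGNOSTICS = {
    "dw_i": re.compile(r"\bdw i\b", re.IGNORECASE),
    "yr_wyf": re.compile(r"\byr wyf\b", re.IGNORECASE),
}


def find_segments(feature, segments=SEGMENTS):
    """Return the (register, text) pairs whose text matches the named feature."""
    pattern = DIAGNOSTICS[feature]
    return [(register, text) for register, text in segments if pattern.search(text)]


if __name__ == "__main__":
    for register, text in find_segments("dw_i"):
        print(f"{register}: {text}")
```

Keeping the patterns in a dictionary keyed by feature name means further diagnostic characteristics could be added without changing the search helper.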