Corpus

During the project, we compiled a corpus of observed and staged communicative events collected in the three villages in which the project was based. The corpus can be accessed here.

This rich annotated multimedia corpus of multilingual language use at the Crossroads is the centrepiece of our joint research. The corpus contains a total of 516 session bundles, including ca. 150 hours of time-aligned, transcribed and translated speech data with ELAN annotations for speaker and language. Of these recorded hours, 95 were obtained within the lifespan of the current project. Each session bundle consists of a folder labelled with a mnemonic session name that indicates the location in which the recording took place, the date, and the researcher’s initials. Every session contains at minimum a sound file and its transcription and annotation, together with a metadata file which lists the languages spoken and the participants in the recording, with their age, gender, reported language repertoires, and ethnicity/village. As the corpus coordinator, Abbie Hantgan spent the third project year creating a corpus structure, uploading data, and unifying data and metadata formats.
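The session bundle layout described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the separator and date format used in `session_name` are hypothetical, and they do not reproduce the corpus's actual file naming or metadata schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Participant:
    """One participant entry, as listed in a session's metadata file."""
    name: str
    age: int
    gender: str
    reported_repertoire: List[str]  # languages the participant reports using
    ethnicity_or_village: str


@dataclass
class SessionBundle:
    """One session folder: recording, annotation, and metadata."""
    location: str
    recorded_on: date
    researcher_initials: str
    sound_file: str
    annotation_file: str  # e.g. an ELAN annotation file
    languages: List[str] = field(default_factory=list)
    participants: List[Participant] = field(default_factory=list)

    def session_name(self) -> str:
        # Hypothetical pattern: location, date, initials joined by underscores.
        # The real corpus may order or format these parts differently.
        return f"{self.location}_{self.recorded_on:%Y%m%d}_{self.researcher_initials}"

    def is_minimally_complete(self) -> bool:
        # A session needs at least a sound file and its annotation.
        return bool(self.sound_file) and bool(self.annotation_file)
```

A bundle built with placeholder values, for instance `SessionBundle("VillageA", date(2017, 3, 15), "XX", "rec.wav", "rec.eaf")`, would then expose the three parts of the mnemonic name in one string and could be checked for the minimal file set before upload.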

Within the LAT corpus structure, corpus nodes are named with the initials of the researchers whose data they contain. Most data were collected during the Crossroads project, with the exception of Friederike’s data in the node FL. These data were collected during the DoBeS project “Pots, plants and people – a documentation of Baïnounk knowledge systems” and are accessible in the DoBeS archive; they are integrated in the corpus to facilitate project-internal work. The corpus node SNA contains the data collected as part of a social network study during which two focal participants per village wore a lavalier microphone for two days each.

All data have been transcribed and translated by a multilingual local transcriber team composed of Aimé Césaire Biagui, Alpha Naby Mané, Laurent Manga, Jérémie Sagna, David Sagna and Lina Sagna. The language tagging reflects their perspectives. It is part of our continuing research to understand which linguistic and non-linguistic features different transcribers use to identify languages. In the light of the great variability of repertoires and of language use, this offers crucial insight into the factors that determine how language is categorised. Several of our existing publications (Goodchild 2019, Lüpke & Watson forthcoming, Watson 2019, Weidl 2019) address this issue, and we continue to work with the corpus.

The sound-grapheme associations used in transcriptions are those of the official alphabet for languages of Senegal. The transcription has not been standardised but preserves the variation inherent in language use and in the transcribers’ individual writing practices. This transcription method has given rise to the language-independent literacy programme LILIEMA, inspired by the transcribers’ positive experience of being able to write their entire repertoire, and developed together with them and other local research participants (Lüpke forthcoming, Lüpke et al. forthcoming). LILIEMA is a method steeped in existing grassroots literacy practices (Lüpke 2018c).