PhD Practical Seminar LACITO-LLACAN
Organizers: Jakob Lesage and Neige Rochant
The PhD Practical Seminar LACITO-LLACAN is dedicated to the technical issues (software, audiovisual recording, data archiving, etc.) and practical issues (fieldwork preparation, working with communities, collecting data remotely, etc.) encountered by field linguists.
The spirit of the seminar is horizontal. After a presentation given by one or several participants sharing their experience or skills, attendees are encouraged to participate in a roundtable.
The seminar meets one Monday a month between 14:00 and 15:30 (by default). Although it is aimed primarily at PhD students, it is open to everyone. Write to the organizers at firstname.lastname@example.org if you wish to participate and be added to the mailing list.
As of now, the seminar meets remotely on Zoom. In the future, it will probably be hybrid. By default, talks are given in English, but they can sometimes be given in French. Participants in the roundtables following the talks are free to express themselves in English or French.
Next meetings:
More information to be announced
Monday 10 October 2022, 14:00 – 15:30. Talk given by Tessa Vermeir: “Creating subtitles”.
Monday 14 November 2022, 14:00 – 15:30. TBA.
Monday 12 December 2022, 14:00 – 15:30. Talk given by Said Guerrab: “Mapping tools (geolinguistics)”.
Monday 12 September 2022, 14:00 – 15:30. TBA.
Monday 11 July 2022, 14:00 – 15:30. Roundtable: “How to prepare for fieldwork: joint production of a booklet targeted towards PhD students in France (but useful to any field linguist)” – 3rd session
13 June 2022. Talk given by Galla Althabégoity (University of Orléans): “Cocoon collection showcase”.
The COCOON platform hosts many oral corpora. It lets users archive documents and make them publicly available, and also add annotations to audio files. During this talk, I will show how the platform works and how I used it for my deposit. Data archiving and dissemination raise several questions: Which platform(s) to use? Must all oral data be made available? If not, what are the criteria for making data available? What is the purpose of archiving annotations? Which audience(s) might be interested in the archived corpus? The meeting will be an opportunity to discuss these topics.
23 May 2022. Talk given by Ekaterina Aplonova (Lacito) and Izabela Jordanoska (Lacito): “Searching in ELAN”.
ELAN is one of the most frequently used tools for linguistic annotation of audio and video recordings. Many linguists are familiar with annotating files in ELAN, but surprisingly few are aware of its potential as a tool for searching those files. In our tutorial, we will show how ELAN's structured search across multiple files works. We will begin with simple queries for finding a specific word, gloss or translation in multiple ELAN files, and then build up step by step to more complex queries involving several layers and regular expressions.
The target audience of this tutorial is linguists who already have files annotated in ELAN and who want to learn how to use ELAN as a corpus query tool. However, if you have only just started using ELAN and do not have many annotated files yet, or are just thinking of starting to use it, the tutorial will show you the potential of the software.
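For readers who want a feel for what such a query does, here is a minimal Python sketch of a regular-expression search over the annotations of an ELAN file. It is not ELAN's own search function: it simply reads the XML of an .eaf file directly, and the sample document, tier names ("tx", "gl") and the function name search_eaf are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A tiny hand-written stand-in for an .eaf file (hypothetical content;
# real files also contain a header, time slots and linguistic types).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>the dogs barked</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="gl" PARENT_REF="tx">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>DEF dog-PL bark-PST</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def search_eaf(eaf_xml, pattern, tier_id=None):
    """Return (tier id, annotation id, value) for every annotation whose
    value matches the regular expression `pattern`.

    If `tier_id` is given, only that tier is searched.
    """
    regex = re.compile(pattern)
    root = ET.fromstring(eaf_xml)
    hits = []
    for tier in root.iter("TIER"):
        if tier_id is not None and tier.get("TIER_ID") != tier_id:
            continue
        for annotation in tier.iter("ANNOTATION"):
            # Each ANNOTATION wraps one ALIGNABLE_ANNOTATION or REF_ANNOTATION.
            for inner in annotation:
                value = inner.findtext("ANNOTATION_VALUE") or ""
                if regex.search(value):
                    hits.append((tier.get("TIER_ID"), inner.get("ANNOTATION_ID"), value))
    return hits


# Find plural-marked "dog" on the gloss tier only.
print(search_eaf(SAMPLE_EAF, r"\bdog-PL\b", tier_id="gl"))
```

To search a whole corpus in the same way, one could loop over the .eaf files in a folder, read each one as text and pass it to the function.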
9 May 2022. Talk given by Songfolo Lacina Silué (Inalco; Lacito): “Format Factory”
Imagine that you recorded some very important data in the field and later discovered that your working software cannot open the files. Imagine that you have short audio files and you want to merge them into one single long audio file. Imagine that the sound quality of your videos is better than that of your audio recordings and you need to extract the sound from the video. Imagine that the sound of the video is bad and you need to replace it with a better-quality audio file. Imagine that you have heavy videos and you want to compress them without losing the original quality. Imagine that you saw some pictures or graphics in a research paper that you need to use in your study, but you cannot because they are embedded in the PDF. Imagine that you have a lot of PDF files that you need to join into one file. Imagine that you have a confidential document that only certain people should be able to access. If you are in one of the above situations, then Format Factory, a multimedia file converter, is what you need. In this session you will learn how to:
- Convert your multimedia data into various formats depending on your needs
- Merge several audio files into one
- Extract an audio file from a video
- Separate audio and video
- Extract pictures from a PDF file
- Encrypt a document
- Download videos online
11 April 2022. Talk given by Chika Ajede Kennedy (Inalco; Llacan): “Migrating from Toolbox to FLEx”
The linguistic community regularly updates its tools for language description and documentation. Sometimes, an upgrade is so big that a new tool is set to replace an important tool that has been used for decades. FLEx (FieldWorks Language Explorer), for instance, is steadily replacing Toolbox as the most commonly used dictionary-building application. In these cases, it is good to get used to the newer tool sooner rather than later, as developers and support teams reduce the maintenance of older tools and put more energy into developing and supporting newer ones.
One major challenge is to migrate our data from one environment to the next. We want to keep our data and our analyses, and we want to avoid duplicating work by manually typing entries in our FLEx database that we already have in Toolbox. Unfortunately, it is sometimes difficult to find tutorials on how to do this.
In this session, we will demonstrate how to migrate a lexical database from Toolbox to FLEx. We will illustrate this with a lexicon of Dijim [cfa, diji1241]. The demonstration will take about 30 minutes. If time allows, we can help others who are struggling to transfer their lexical data to FLEx and answer any questions the participants may have.
You can download FLEx here: https://software.sil.org/fieldworks/. Some tutorials are available here: https://software.sil.org/fieldworks/resources/tutorial/.
14 March 2022. Talk given by Christian Chanard (Llacan): “Introduction to ELAN” (in French)
ELAN is a computer program developed by the Max Planck Institute for Psycholinguistics in Nijmegen, the Netherlands. It allows the annotation of audio and/or video recordings.
Annotation tiers can be independent and time-aligned, or hierarchically dependent on a parent tier. This allows users to create annotations on different levels.
LLACAN has developed a separate module and released ELAN-CorpA, a derivative version. This version simplifies morpho-syntactic annotation by building up a lexicon and an affix-based parser while the user provides annotations.
ELAN allows the import and export of annotated files in different formats: tab-separated, HTML, Toolbox, FLEx, Praat, etc.
We will present:
- the CorpA morpho-syntactic annotation model, introducing the notions of independent tiers (segmentation, overlapping information) and hierarchical tiers (stereotype, dependence, levels of analysis),
- how a lexicon helps annotation,
- searching the data,
- examples of importing and exporting data.
14 February 2022. Talk given by Jakob Lesage (Humboldt-Universität zu Berlin): “Processing recordings”
Language documentation, archiving and dissemination of recordings in communities require basic knowledge of audio and video processing. In this session, we will discuss the technical issues that arise when receiving (in the case of a remote project), converting and making accessible the audio and video recordings made during a language documentation project. We will introduce tools such as:
- HJSplit, for splitting up large files so they can be transferred over an intermittent internet connection
- Mega.nz, a cloud storage website
- ffmpeg, a powerful video and audio processing tool that is operated from the command line
- HandBrake, a more accessible video processing tool that asks a lot from your processor
- BES (Battle Encoder Shirasé), a tool that prevents your computer’s processor from being fried like Jakob’s was when converting video files
- ELAN’s subtitle exporter, which allows you to export subtitle files that can be fed to YouTube or can be hard-coded into a video file using HandBrake
This will be an introductory session, with plenty of discussion of our experiences and workflow needs. Depending on the interest of participants, we could organize more specific tutorials on how to use (a subset of) these tools.
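To illustrate the idea behind splitting a large recording for transfer over an unreliable connection, here is a minimal Python sketch that cuts a file into fixed-size parts and joins them again. It is not HJSplit itself: the function names and the numbered-suffix naming (.001, .002, ...) are assumptions chosen for illustration, and the parts are not guaranteed to be interchangeable with those of any particular tool.

```python
from pathlib import Path


def split_file(path, chunk_size):
    """Cut the file at `path` into parts of at most `chunk_size` bytes.

    Parts are written next to the original as name.001, name.002, ...
    (hypothetical naming scheme) and their paths are returned in order.
    """
    path = Path(path)
    parts = []
    with path.open("rb") as source:
        index = 1
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            part = path.with_name(f"{path.name}.{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, destination):
    """Concatenate the parts, in name order, into `destination`."""
    destination = Path(destination)
    with destination.open("wb") as target:
        # Zero-padded suffixes sort correctly for up to 999 parts.
        for part in sorted(Path(p) for p in parts):
            target.write(part.read_bytes())
    return destination
```

Each part can then be sent separately, so an interrupted upload only costs one part rather than the whole recording. For example, split_file("interview.wav", 50 * 1024 * 1024) would produce parts of at most 50 MB, where the file name is a placeholder.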
6 December 2021. Roundtable: “How to prepare for fieldwork: joint production of a booklet targeted towards PhD students in France (but useful to any field linguist)” – 2nd session
Continuation of the first session.
22 November 2021. Roundtable: “How to prepare for fieldwork: joint production of a booklet targeted towards PhD students in France (but useful to any field linguist)” – First session
For this session, we invite you to collaborate on a brochure on fieldwork preparation. This brochure is primarily intended for linguists doing a PhD in France (in particular in a UMR, and in particular at LLACAN, LACITO or SeDyL). It will also be of interest to other field linguists of any level of experience.
We will collect the ideas of all seminar participants in a Google Docs document. The document will consist of three main sections:
- Administrative preparation (CNRS and university)
- Scientific preparation
- Practical preparation
The brochure will also cover other practical aspects, including the return from the field (e.g. justifying costs with the CNRS), as well as an indicative schedule for preparing a field trip.
The session will be an opportunity to discuss various aspects of fieldwork preparation and execution.
All participants are encouraged to raise any questions they wish to discuss on the topic. We do not aim to finish the brochure by the end of this meeting. We can continue during another meeting of the seminar if the activity turns out to be interesting and productive.
The organizers of the seminar will facilitate the activity and edit the document so it can be easily consulted and distributed.
11 October 2021. Talk given by Jakob Lesage (Humboldt-Universität zu Berlin): “Phonology assistant”
13 September 2021. Talk given by Neige Rochant (Sorbonne Nouvelle University; Lacito; Llacan): “Integrating audio into FLEx”
We will present a method for loading your sound recordings into FLEx and aligning them with your texts. You don’t need to know a lot about FLEx, but this meeting will be especially useful for you if you annotate (or intend to annotate) your texts in FLEx and want to be able to easily play any sentence while you’re working. This does not exist as a feature in FLEx, but it can be made possible.
2 July 2021. Talk given by Cécile Macaire (Université Grenoble Alpes; Lig / Getalp), Séverine Guillaume (Lacito) and Alexis Michaud (Lacito): “Computational tools for language documentation: Explorations of Automatic Speech Recognition on field data”
The LLACAN and LACITO laboratories host a number of projects exploring the potential of computational methods to facilitate (endangered) language documentation. Machine learning-based tools can effectively assist with linguistic annotation tasks including transcription, glossing and translation. Still, automatic processing remains little used, mainly because the technology is still new (and evolving rapidly) and because simple, user-friendly interfaces are lacking. Our laboratories aim to have field linguists and computer scientists build models and tools together.
In this context, explorations in Automatic Speech Recognition on field data are ongoing. After a general presentation of the “Elpis” project, current experiments will be presented (which use the Transformers library from Hugging Face and wav2vec Unsupervised from Facebook AI). The goal is to introduce the workings of the tools to a linguistic audience and to discuss the challenges of interdisciplinary collaboration with computer scientists.
Some references:
- Providing “field” linguists with access to automatic transcription
- User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis
- Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit
14 June 2021. Talk given by Maxime Fily (Université Grenoble-Alpes; Sorbonne Nouvelle; Lacito): “Conversion from TextGrid to XML”
Maxime will present an essential tool for submitting texts or word lists to the Pangloss collection (https://pangloss.cnrs.fr/). The tool is a Python script that converts TextGrid files (produced in Praat or ELAN) into XML files supported by Pangloss. The tool greatly simplifies the depositing process. Using this script is very simple but requires some configuration and a very basic understanding of how Python works.
Maxime will demonstrate the script with a guinea pig that already has Python installed (Neige). All participants are welcome to try the tool during the presentation, but there may not be time for technical support. Consequently, the session will essentially be a presentation, not a full tutorial. If you have never worked with Python, the presentation will allow you to judge whether it could be useful for you to use the tool and to learn the basics of Python (which is simple but will not be the subject of this session).
If, at the end of the session, enough participants are interested in this script or in learning Python, we will consider organizing a session (perhaps beyond the usual times of the seminar) devoted to getting started with Python and using pre-made scripts. This may also be relevant if you wish to use the script for identifying minimal pairs in a word list. That script will be the subject of a later session.
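To give an idea of what a TextGrid-to-XML conversion involves, here is a minimal Python sketch. It is not Maxime's script: it only reads the intervals of a single interval tier from a long-format TextGrid, and the output element names (TEXT, S, AUDIO, FORM) and the function name textgrid_to_xml are assumptions for illustration that should be checked against the actual Pangloss format before any real use.

```python
import re
import xml.etree.ElementTree as ET

# A tiny hand-written TextGrid in Praat's long text format (hypothetical data).
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phrase"
        xmin = 0
        xmax = 2.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.2
            text = "first sentence"
        intervals [2]:
            xmin = 1.2
            xmax = 2.5
            text = ""
'''

# One interval: start time, end time, and a label in which Praat
# writes a literal double quote as two double quotes.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "((?:[^"]|"")*)"'
)


def textgrid_to_xml(textgrid, text_id="example"):
    """Turn each non-empty interval into a time-aligned sentence element.

    Element and attribute names are illustrative assumptions,
    not a verified deposit schema.
    """
    root = ET.Element("TEXT", {"id": text_id})
    count = 0
    for match in INTERVAL.finditer(textgrid):
        start, end, label = match.groups()
        label = label.replace('""', '"').strip()
        if not label:
            continue  # skip silent or unlabelled intervals
        count += 1
        sentence = ET.SubElement(root, "S", {"id": f"{text_id}_S{count}"})
        ET.SubElement(sentence, "AUDIO", {"start": start, "end": end})
        ET.SubElement(sentence, "FORM").text = label
    return ET.tostring(root, encoding="unicode")


print(textgrid_to_xml(SAMPLE_TEXTGRID))
```

A real converter also has to deal with several tiers (for example transcription and translation), short-format TextGrids and file encodings, which is where the configuration mentioned above comes in.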
10 May 2021. Talk given by Evgeniya (Jenia) Gutova (Lacito): “Remote field work”
12 April 2021. Talk given by Jakob Lesage (Humboldt-Universität zu Berlin): “ELAN-FLEx-ELAN workflow”