CLARIAH Fellowship Call: WP3 Data and Tools
CLARIAH offers a wide range of data and tools for doing linguistic research. We provide a brief introduction to these data and tools here. The full overview can be found here: Link
There are more data and tools available than described here. This description focuses on the tools and data that the CLARIAH-CORE and CLARIAH-PLUS projects worked on or are working on. These are the focus of the CLARIAH Fellowship Call 2021.
We will first briefly describe what kinds of research can be carried out with these data and tools, and what educational materials are needed for students and researchers to learn to select the right tools and to optimally use them. After that, we will describe which data and tools are offered, distributed over three sections: Data and tools for non-textual data, data and tools for text corpora, and data and tools for lexica.
Many different forms of linguistic research can be supported by the CLARIAH WP3 data and tools. These include theoretical linguistic research (e.g. a detailed study of a particular construction), corpus-linguistic studies based on the many large corpora made available by CLARIAH, research into language acquisition (e.g. using the CHILDES treebank in the PaQu application), research into language variation (e.g. using the MIMORE application), lexicological and lexicographic research (e.g. analysing collocations in SoNaR via the OpenSoNaR application), and undoubtedly many others. Many of the applications and data can also be used for other types of research in the humanities, e.g. for literary studies, history, religious studies, and philosophy.
We list here several publications on (mostly linguistic) research using the CLARIAH tools and data. Full references can be found here: https://www.clariah.nl/nieuw/artikelen/overzicht-publicaties:
- On the use of corpora in linguistics: Odijk 2020
- Research on specific constructions: Hoeksema & Napoli 2019, Bouma 2018, Olthof et al. 2017, Arabic: Zwaan et al 2020
- Research using PaQu: Bloem 2021 (PhD, to appear), Dros-Hendriks 2018 (PhD), Odijk et al. 2017, Bloem 2016, Odijk 2016, Odijk 2015
- Wablieft corpus in PaQu: Vandeghinste et al 2019
- Research using PaQu and GrETEL: Scholten 2018 (PhD), van der Wouden et al. 2017, 2016, Bouma et al 2015
- Research using GrETEL: Odijk & Zwitserlood 2019, Augustinus et al 2017, Augustinus 2015
- Research using PICCL: Betti et al 2017
- More research: van Erp et al 2018, van Erp et al 2017 (2x)
- Research and data on dialects: van Hout et al 2018
There are also many presentations. We list just a few, especially the ones for which there is no corresponding publication. Full references can be found here: https://www.clariah.nl/papers/Presentations_CLARIAH-CORE.pdf
- Odijk, J. (2015) at Drongo, van Noord & Odijk 2016 (2x), Odijk 2017 Language Science Day, Odijk 2017 TIN-dag, Lange 2018 (2x), Odijk 2018 LREC, Odijk 2018 ESSLLI, Hoeksema et al 2019, Lange 2019 (2x), Odijk et al. 2019, Odijk 2019 Grote Taaldag
Education and Training
Many of the tools are web applications, so that one does not have to download or install any software or data. Many have dedicated user interfaces that, for many uses, do not require any programming skills or even knowledge of query languages. Nevertheless, it takes some training and practice to get to know the functionality offered by an application, and to learn all the options it offers, so that it can be used in the most effective and efficient way. It is therefore important that all kinds of educational material become available and are used in the regular curricula of linguistics departments. One can think of tutorials, short explanations (in text or video) of specific features offered by an application, short explanations of how a particular type of query can be formulated, as well as methodological considerations on the advantages, disadvantages and dangers of the use of corpora in general, of automatically generated annotations, etc.
Data and Tools for Non-textual data
If your data are audiodata or audiovisual data with recorded speech, you can use the speech recognition tools (Automatic Transcription of (Oral History) Interviews / KALDI Speech recogniser for Dutch, English Automatic Speech Recognition) to automatically transcribe the speech into text.
You can manually annotate such recordings with the Oral History Annotation Tool.
If you have a speech recording and a transcript of the speech, you can use Forced Alignment to align the speech with the transcript. This can be very useful for many purposes, e.g. for efficient searching in speech data.
Audio-visual data can be annotated with ELAN. New extensions for dealing with audio-visual data are being developed in WP5, so take a look there as well.
If a picture (image) contains text (as part of the picture), the PICCL pipeline can be used to convert the pictured text into machine-readable text, i.e. a sequence of characters (optical character recognition), and even to enrich the resulting text with linguistic annotations.
Data and Tools for Text Corpora
If you have textual data, you can do various things with it, in particular:
- process the text, for example to enrich it with additional information, automatically, manually or semi-automatically
- search in the (possibly enriched) textual data
CLARIAH makes many text corpora available, via web applications, via web services, and simply as data. They include:
- SoNaR (via OpenSoNaR)
- CGN (via OpenSoNaR)
- Lassy-Small (via PaQu and GrETEL)
- CGN Treebank (via PaQu and GrETEL)
- CHILDES Treebank (via PaQu, partially via GrETEL)
- BasiScript Treebank (via PaQu)
- Delpher Data (via Nederlab)
- Other Dutch text corpora covering 900-2000 (via Nederlab)
Several corpora in the Text-Fabric format: NENA, Greek New Testament (Tischendorf edition), Uruk, Old Babylonian Letters, Old Assyrian documents, Quran, Fusus Al Hikam, Generale Missieven, Fusus, Greek Literature, Dead Sea Scrolls, Biblia Hebraica Stuttgartensia Amstelodamensis, Peshitta.
And many other text corpora are made available.
Text corpora, their annotations and their metadata must be encoded in some way. An important de facto standard format for (mainly Dutch) annotated text corpora in the Netherlands is the FoLiA format, which comes with a lot of accompanying software.
A format that is more widely used internationally is TEI, subsets of which are supported by many text corpus applications (e.g. there is a tei2folia converter, and AutoSearch, PaQu and GrETEL support uploads of TEI-encoded text corpora).
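The basic idea of such token-based annotation formats can be illustrated with a heavily simplified, hand-written FoLiA-like fragment: each word occurrence is an element that carries the token text together with its annotations. (Real FoLiA documents have a namespace, identifiers and rich metadata; the element names below follow FoLiA, but the fragment itself is only a sketch.)

```python
import xml.etree.ElementTree as ET

# Simplified FoLiA-like fragment: each <w> holds the token text in <t>
# and annotations such as <pos> and <lemma> with a "class" attribute.
doc = """
<s>
  <w><t>katten</t><pos class="N"/><lemma class="kat"/></w>
  <w><t>slapen</t><pos class="WW"/><lemma class="slapen"/></w>
</s>
"""

root = ET.fromstring(doc)
# Extract (token, lemma) pairs from the annotated sentence.
tokens = [(w.find("t").text, w.find("lemma").get("class"))
          for w in root.iter("w")]
print(tokens)  # [('katten', 'kat'), ('slapen', 'slapen')]
```

Because the annotations are anchored to explicit token elements, tools can add new annotation layers without touching the text itself.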
Another approach to text corpora is Text-Fabric, a Python package for text plus annotations. It provides a data model, a text file format, and a binary format for (ancient) texts plus (linguistic) annotations.
Searching large text corpora is too slow when they are stored as plain text: the text has to be restructured to make efficient search possible. Search engines do exactly this. Usually existing open source search engines or indexing systems are used (e.g. Lucene, BerkeleyDB (in PaQu), BaseX (in GrETEL)).
Some institutions have built their own engine on top of such indexing systems.
The IVDNT developed BlackLab, and the Meertens Institute developed MTAS (both built on top of Lucene); MTAS is used in Nederlab.
Usually, a linguist does not have to know about these engines; they are mentioned here just for completeness' sake.
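The core idea behind these indexing systems can be sketched with a minimal inverted index: instead of scanning the raw text for every query, the corpus is restructured as a mapping from tokens to the positions where they occur. This is only the underlying idea, not how Lucene or BlackLab work internally:

```python
from collections import defaultdict

def build_index(sentences):
    """Map each token to the (sentence, position) pairs where it occurs."""
    index = defaultdict(list)
    for s_id, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            index[token].append((s_id, pos))
    return index

corpus = [
    "de kat zit op de mat",
    "de hond ligt op de bank",
]
index = build_index(corpus)

# Looking up a token is now a dictionary access, not a scan of the corpus.
print(index["op"])  # [(0, 3), (1, 3)]
```

Real engines add compression, ranking and query languages on top of this basic structure, but the lookup-instead-of-scan principle is the same.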
There are many tools for processing text. They are often used to enrich the text with additional annotations, but many of them can be used for other purposes as well. Enriching text with additional annotations can of course also be done manually.
Automatic Processing of Text
There are several tools for orthographic normalisation, e.g. for correcting optical character recognition errors (PICCL), or for detecting spelling errors and suggesting corrections (Gecco (Generic Environment for Context-Aware Correction of Orthography) and Valkuil).
Tokenisation converts a sequence of characters into a sequence of tokens. A token is an occurrence of a word, a symbol (e.g. a punctuation mark) or a symbol sequence.
The Ucto engine is a generic engine for tokenisation; Ucto runs this engine with tokenisation models for a whole range of languages.
Most of the other applications for processing text have tokenisation built in.
Most also include sentence splitting functionality (converting a sequence of tokens into a sequence of sentences).
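The two tasks just described can be sketched with a naive regex-based tokeniser and sentence splitter. Real tokenisers such as Ucto handle abbreviations, URLs, dates, etc. with language-specific rules; this sketch only illustrates the idea:

```python
import re

def tokenise(text):
    """Naively split text into word tokens and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(tokens):
    """Naively split a token sequence into sentences at ., ! and ?."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing material without end punctuation
        sentences.append(current)
    return sentences

tokens = tokenise("Dit is een zin. En dit ook!")
print(tokens)
# ['Dit', 'is', 'een', 'zin', '.', 'En', 'dit', 'ook', '!']
print(split_sentences(tokens))
# [['Dit', 'is', 'een', 'zin', '.'], ['En', 'dit', 'ook', '!']]
```

Note how the naive rules already fail on e.g. "dr. Jansen", which is exactly why trained, language-specific tokenisation models are needed.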
N-Gram analysis and more
An n-gram is a sequence of n tokens. Skip-grams and flex-grams are more sophisticated variants of n-grams. Colibri Core is software to extract and analyse patterns in the form of n-grams, skip-grams and flex-grams from large corpus data.
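These notions can be made concrete with a small sketch: contiguous n-grams, and one simple instance of a skip-gram (a trigram with the middle token abstracted away). Colibri Core supports far more general patterns and does this efficiently on large corpora; this is only the concept:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens):
    """Trigrams with the middle token replaced by a wildcard."""
    return [(a, "*", c) for a, _, c in ngrams(tokens, 3)]

tokens = "de kat zit op de mat".split()
print(ngrams(tokens, 2))
# [('de', 'kat'), ('kat', 'zit'), ('zit', 'op'), ('op', 'de'), ('de', 'mat')]
print(skipgrams(tokens))
# [('de', '*', 'zit'), ('kat', '*', 'op'), ('zit', '*', 'de'), ('op', '*', 'mat')]
```

Counting such patterns over a corpus is the basis for collocation and construction research.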
Alpino is software that automatically assigns a syntactic structure to a sentence (parsing). CLST provides a web service and a web application for Alpino. Frog can be used to generate a dependency analysis of a sentence (the grammatical relations that hold between its words).
Part of speech tagging, lemmatisation:
Part of speech tagging (assigning a part of speech tag to a token, taking its context into account) and lemmatisation (assigning to a token a lemma, i.e. the form of a word as one finds it in a traditional dictionary) can be done with UDPipe Frysk for Frisian, and with Frog for Dutch.
(One could also use Alpino for this, but that may be a bit of an overkill.)
Part of speech taggers for new languages or language varieties can be created with tools such as Toad (trainer of all data) and Mbt (memory-based tagger), provided that a sufficiently large corpus of training and test data is available.
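The principle behind training a tagger on annotated data can be illustrated with a minimal unigram tagger, which simply assigns each word its most frequent tag in the training corpus. Real trainable taggers such as Mbt also use the surrounding context and handle unseen words far better; the tag labels below are illustrative:

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn, for every word, its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, tokens, default="N"):
    """Tag tokens with the learned model; unseen words get a default tag."""
    return [(t, model.get(t, default)) for t in tokens]

training = [
    [("de", "LID"), ("kat", "N"), ("slaapt", "WW")],
    [("de", "LID"), ("hond", "N"), ("blaft", "WW")],
]
model = train(training)
print(tag(model, ["de", "kat", "blaft"]))
# [('de', 'LID'), ('kat', 'N'), ('blaft', 'WW')]
```

The need for "a sufficiently large corpus of training and test data" follows directly: the model can only be as good as the annotated examples it has seen.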
A treebank is a text corpus in which each sentence has been assigned a syntactic structure.
PaQu and GrETEL 4.0 are treebank search applications. They offer several treebanks in which one can search (inter alia Lassy-SMALL, Spoken Dutch Corpus Treebank, parts of Lassy-LARGE, CHILDES treebank, and more).
Both PaQu and GrETEL also enable you to upload a text corpus. Each sentence in this text corpus will then be automatically assigned a syntactic structure (by Alpino), resulting in a treebank, which is then made available for search.
SPOD and T-Scan are tools to characterise a text or text corpus grammatically (syntactic profiling). Both make use of Alpino.
In many cases one does not want to use a single service or tool, but to combine several of them in a pipeline. CLARIAH offers several predefined pipelines, e.g. PICCL, the AntiLoPe pipeline, the Nederlab pipeline, and the VU-reading-machine.
The GM-Processor provides functionality for preprocessing the VOC Generale Missiven.
There are also services and software distributions to support the development of new pipelines, e.g. CLAM (for turning a tool into a web service) and LaMachine.
The CLARIAH-CORE NewsGac project developed the NewsGac text genre classification system, with which one can experiment with different machine learning methods for genre classification.
There are various conversion tools, in particular in OpenConvert (which is somewhat outdated), and especially in Piereling. CHAMD converts CHAT data into the PaQu Enhanced text format (which was developed in PaQu and is supported by PaQu and GrETEL4). GrETEL4 includes CHAMD, so one can upload files in CHAT format there directly.
Manual Annotation of Text
One can use FLAT for manual annotation of text.
Search in Text
Search applications for token-annotated corpora
Many corpora have the token as the most important annotation unit. We call these token-annotated corpora. There are many search applications for such corpora, among which Nederlab, AutoSearch, CHN, Corpus Gysseling, OpenSoNaR, Bridge the Gap Arabic Corpora, and SHEBANQ.
MIMORE belongs to this class but has many features specific for searching in collections of dialect data.
There are also many corpus search applications that enable only searching for strings and metadata. These include Delpher Magazines, Delpher Newspapers, Delpher Books, KBNL Google Books, and DBNL.
Data and Tools for Lexical Data
CLARIAH offers many lexicons, among them: NAMES, GiGant, Open Dutch Wordnet & RBN, and Dutch FrameNet.
Lexicon search Applications
CLARIAH also offers many lexicon search applications, among them: MNW, ONW, VMNW, WNT, WebCELEX, DiaMaNT, Elektronisch Woordenboek van de Brabantse Dialecten (e-WBD), Elektronisch Woordenboek van de Limburgse Dialecten (e-WLD), Elektronisch Woordenboek van de Gelderse Dialecten (e-WGD), and Elektronisch Woordenboek van de Achterhoekse en Liemerse Dialecten (e-WALD).