In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?
Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the virtually unknown word 'difficulteren' (roughly: 'to make difficulties'). This is remarkable, because his party is known for its straightforward use of language that even 'ordinary' people can understand.
Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog post about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded to this post by running corpus searches on data that have been made accessible in the CLARIAH infrastructure, in order to test Marc’s conjecture. Marten tried to find out when this unknown word ‘difficulteren’ was first used, how often it has been used in recent years, and in what contexts it mainly occurs.
'increase our empirical base'
Marten searched six corpora: Staten Generaal Digitaal, Corpus Gesproken Nederlands, Corpus Hedendaags Nederlands, Brieven als Buit Corpus, Sonar, and the corpora of Nederlab (where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, and you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching these corpora easy. See below for links.
'increase options for analysing … data'
These resources make it possible to search by lemma rather than by word form, so a single query finds all inflected forms of the verb. This makes searching and analysing the results a lot easier and yields more relevant hits. Moreover, many of the sources contain metadata such as genre, time and place, so that it can also be quickly determined where, when and in which genres this word occurs more or less frequently.
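To make the difference between word search and lemma search concrete, here is a minimal sketch in Python. It is not how the CLARIAH web applications are implemented: the `Token` record, the helper functions and all data (word forms, years, genres) are invented for illustration and are not real corpus results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One corpus token with its lemma and some document metadata."""
    word: str    # the form as it appears in the text
    lemma: str   # the dictionary form assigned by a lemmatiser
    year: int    # year of the source document
    genre: str   # genre of the source document


# Invented toy data, NOT real attestations from any corpus.
TOKENS = [
    Token("difficulteren", "difficulteren", 1848, "parliamentary"),
    Token("difficulteert", "difficulteren", 1851, "parliamentary"),
    Token("gedifficulteerd", "difficulteren", 1790, "letter"),
    Token("moeilijk", "moeilijk", 2015, "newspaper"),
    Token("difficulteerde", "difficulteren", 2019, "social media"),
]


def search_by_word(tokens, word):
    """Return only tokens whose surface form matches exactly."""
    return [t for t in tokens if t.word == word]


def search_by_lemma(tokens, lemma):
    """Return all tokens that share the lemma, whatever their form."""
    return [t for t in tokens if t.lemma == lemma]


def count_by(hits, key):
    """Group hits by a metadata value and count them."""
    return Counter(key(t) for t in hits)


word_hits = search_by_word(TOKENS, "difficulteren")
lemma_hits = search_by_lemma(TOKENS, "difficulteren")

# Metadata lets us see when and in which genres the lemma occurs.
per_century = count_by(lemma_hits, lambda t: t.year // 100 + 1)
per_genre = count_by(lemma_hits, lambda t: t.genre)

print(len(word_hits), len(lemma_hits))
print(dict(per_century))
print(dict(per_genre))
```

In this toy example the word search finds only the infinitive, while the lemma search also picks up the inflected forms, and the metadata turns the list of hits into counts per century and per genre.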
'increase the efficiency of research'
Marten did this research within one day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.
Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical basis is then even larger. But then one has to look up every word form of this verb separately, and analysing the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten also searched with Google, but he was not able to analyse the results within that one day. He also searched the Corpus of the Web (COW) for Dutch, which is smaller than the whole internet but still quite large (7 billion words); it yielded fewer hits, so those could be analysed further.
The search query in question concerns a one-word lemma, and that is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs with a grammatical dependency relationship, and complete grammatical constructions.
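A search for word pairs with a grammatical dependency relationship can be sketched as follows. This is only an illustration under stated assumptions: the `Dep` record, the `find_pairs` helper, the relation labels `obj` and `nsubj`, and the three toy "sentences" are all hypothetical, and the corpora mentioned here may use a different annotation scheme and query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """One dependency relation between a head and a dependent."""
    head_lemma: str
    relation: str         # hypothetical labels, e.g. "obj", "nsubj"
    dependent_lemma: str


# Invented toy annotations, NOT taken from any real treebank.
SENTENCES = {
    "s1": [
        Dep("difficulteren", "obj", "voorstel"),
        Dep("difficulteren", "nsubj", "minister"),
    ],
    "s2": [Dep("lezen", "obj", "voorstel")],
    "s3": [Dep("difficulteren", "nsubj", "kamer")],
}


def find_pairs(sentences, head_lemma, relation):
    """Return (sentence id, dependent lemma) for every matching relation."""
    return [
        (sentence_id, dep.dependent_lemma)
        for sentence_id, deps in sentences.items()
        for dep in deps
        if dep.head_lemma == head_lemma and dep.relation == relation
    ]


# Which words occur as object or subject of the verb?
objects = find_pairs(SENTENCES, "difficulteren", "obj")
subjects = find_pairs(SENTENCES, "difficulteren", "nsubj")

print(objects)
print(subjects)
```

The point of the sketch is that such a query matches on grammatical structure rather than on adjacency: it finds the object of the verb wherever it stands in the sentence, which a plain word search cannot do.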
My conclusion is therefore that the CLARIAH infrastructure already substantiates the above claim.
Corpus Hedendaags Nederlands: http://corpushedendaagsnederlands.inl.nl/
(searching for word pairs with a grammatical dependency relationship)