In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?
Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the completely unknown word 'difficulteren' (literally 'doing difficult', i.e. making things difficult). Remarkable, because his party is known for its straightforward use of language that even 'ordinary' people can understand.
Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog post about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded to this blog post by conducting corpus searches in data that have been made accessible in the CLARIAH infrastructure in order to test Marc’s conjecture. Marten tried to find out when this unknown word ‘difficulteren’ was used for the first time, how often it has been used in recent years, and in what contexts it mainly occurs.
'increase our empirical base'
Marten searched six corpora: Staten Generaal Digitaal, Corpus Gesproken Nederlands, Corpus Hedendaags Nederlands, Brieven als Buit Corpus, Sonar, and the corpora of Nederlab (where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching in those corpora easy. See below for links.
'increase options for analysing … data'
These resources make it possible to search by lemma rather than by word, which makes the search and analysis of the search results a lot easier and results in a larger number of relevant data. Moreover, many of the sources contain metadata such as genre, time and place, so that it can also be quickly determined where, when and in which genres this word occurs frequently or less frequently.
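To illustrate why lemma search and metadata matter, here is a minimal sketch in Python. It is not how the CLARIAH web applications are implemented, and you do not need to write code to use them. The toy corpus, its word forms, years and genres are all invented for the example, and the function names are hypothetical.

```python
from collections import Counter

# Toy corpus, invented for illustration only (not real CLARIAH data).
# Each token has a surface form, a lemma, and document-level metadata.
CORPUS = [
    {"word": "difficulteert", "lemma": "difficulteren", "year": 1850, "genre": "parliament"},
    {"word": "gedifficulteerd", "lemma": "difficulteren", "year": 1850, "genre": "parliament"},
    {"word": "difficulteren", "lemma": "difficulteren", "year": 2017, "genre": "social media"},
    {"word": "moeilijk", "lemma": "moeilijk", "year": 2017, "genre": "social media"},
    {"word": "difficulteerde", "lemma": "difficulteren", "year": 1720, "genre": "letters"},
]


def search_by_word(corpus, word):
    """Return only the tokens whose surface form matches exactly."""
    return [token for token in corpus if token["word"] == word]


def search_by_lemma(corpus, lemma):
    """Return all tokens with this lemma, whatever their inflected form."""
    return [token for token in corpus if token["lemma"] == lemma]


def count_by(hits, field):
    """Count hits per value of a metadata field such as 'genre' or 'year'."""
    return Counter(hit[field] for hit in hits)


word_hits = search_by_word(CORPUS, "difficulteren")
lemma_hits = search_by_lemma(CORPUS, "difficulteren")
print(len(word_hits), len(lemma_hits))
print(count_by(lemma_hits, "genre"))
```

A single lemma query finds every inflected form at once, whereas a word query would have to be repeated for each form. Grouping the hits by a metadata field then gives the distribution over genre or time directly.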
'increase the efficiency of research'
Marten did this research within one day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.
Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical basis is then even larger. But then one has to look up all the word forms of this verb separately and the analysis of the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten has also searched with Google, but he has not yet been able to analyse the results in that one day. He also searched the Corpus of the Web (COW) for Dutch, smaller than the whole internet but still quite large (7 billion words), and there were fewer hits, so they could be analysed further.
The search query in question concerns a one-word lemma, and that is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs with a grammatical dependency relationship, and complete grammatical constructions.
My conclusion is therefore that CLARIAH facilitates and already substantiates the above claim.
|Corpus Hedendaags Nederlands||http://corpushedendaagsnederlands.inl.nl/|
(searching for word pairs with a grammatical dependency relationship)
Last week, the 16th International Semantic Web Conference (ISWC 2017) took place in Vienna, Austria. Around 600 researchers from all over the world came together to exchange knowledge and ideas in 7 tutorials, 18 workshops, and 3 full days of keynotes, conference talks, and a big poster & demo session. Needless to say, I only saw a small part of it, but all the papers and many of the tutorial materials are available through the conference website.
First of all, kudos to the organising committee for putting together a fantastic programme and great overall surroundings. The WU (workshops, posters & demos and jam session) has a really gorgeous campus with a marvellous spaceship-like library.
The main conference took place next door at the Messe, where the Wifi worked excellently (quite a feat at a CS conference where most participants carry more than one device). The bar for next year is set high!
But back to the conference:
On Sunday, I got to present the SERPENS CLARIAH research pilot during the Second Workshop on Humanities in the Semantic Web (WHISE II). There were about 30 participants in the workshop, and a variety of projects and topics was presented. I particularly liked the presentation by Mattia Egloff on his and Davide Picca's work on DHTK: The Digital Humanities ToolKit. They are working on a Python module that supports analysis of books and they are developing and testing it for an undergraduate course for humanities students. I really think that by providing (humanities) students with tools to start doing their own analyses, we can get them enthusiastic about programming, as well as thinking about the limitations of such tools, which can lead to better projects in the long run.
In the WHISE workshop, as well as in the main conference, there were several presentations on multimedia datasets for the Semantic Web. The multimedia domain is not new to the Semantic Web, but some of the work (such as Rick Meerwaldt, Albert Meroño-Peñuela and Stefan Schlobach. Mixing Music as Linked Data: SPARQL-based MIDI Mashups) doesn't just focus on the metadata but actually encodes the MIDI signal as RDF and then uses it for a mashup.
Another very interesting resource is IMGpedia, created by Sebastián Ferrada, Benjamin Bustos and Aidan Hogan, which was presented in a regular session (winner best student resource paper) as well as during the poster session (winner best poster). The interesting thing about this resource is that it doesn't only allow you to query on metadata elements, but also on visual characteristics.
Metadata and content features are also combined in The MIDI Linked Data Cloud by Albert Meroño-Peñuela, Rinke Hoekstra, Victor de Boer, Stefan Schlobach, Berit Janssen, Aldo Gangemi, Alo Allik, Reinier de Valk, Peter Bloem, Bas Stringer and Kevin Page, which would for example make studies in ethnomusicology possible. I think such combinations of modalities are super exciting for humanities research, where we work with extremely rich information sources and often need or want to combine sources to answer our research questions.
Enriching and making available cultural heritage data is also a topic that keeps popping up at ISWC. This year there was for example "Craig Knoblock, Pedro Szekely, Eleanor Fink, Duane Degler, David Newbury, Robert Sanderson, Kate Blanch, Sara Snyder, Nilay Chheda, Nimesh Jain, Ravi Raju Krishna, Nikhila Begur Sreekanth and Yixiang Yao: Lessons Learned in Building Linked Data for the American Art Collaborative". This project was a pretty big undertaking in terms of aligning and mapping museum collections. I really like that the first lesson learnt is to create reproducible workflows.
This doesn't only hold for conversion of museum collections, but for all research. But it's still nice to see it mentioned here. Reproducibility is also a motivator in "Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt Rosinach, Emilio Centeno and Laura Furlong: Reliable Granular References to Changing Linked Data", which investigates the use of nanopublications to enable referring to items or subsets within data collections for fine-grained referencing of previous work.
My favourite keynote at this conference (and they had three excellent ones) was by Jamie Taylor, formerly of Freebase, now Google. He argued for more commonsense knowledge in our knowledge graphs. I do think that is a great vision: many of our resources lack this, leading to all sorts of weird outcomes in, for instance, named entity linking (you can ask Filip Ilievski for the funniest examples). It was unclear, however, how to go about this and whether it would be possible at all. The examples he gave in the keynote for toasters and kettles would work out just fine (kettles heat up water, toasters heat up baked goods), but for complex concepts such as murders (Sherlock Holmes, anyone?) I'm not sure how this would work. Still, plenty of food for thought. See also Pascal Hitzler's take on this keynote.
See you in Monterey, California next year?
Submitted by Karolina Badzmierowska on 23 October 2017
Tour de CLARIN
“Tour de CLARIN” is a new CLARIN ERIC initiative that aims to periodically highlight prominent User Involvement (UI) activities of a particular CLARIN national consortium. The highlights include an interview with one or more prominent researchers who are using the national consortium’s infrastructure and can tell us more about their experience with CLARIN in general; one or more use cases that the consortium is particularly proud of; and any relevant user involvement activities carried out. “Tour de CLARIN” helps to increase the visibility of the national consortia, reveal the richness of the CLARIN landscape, and display the full range of activities throughout the network. The content is disseminated via the CLARIN Newsflash and blog posts, and linked to on our social media: Twitter and Facebook.
CLARIAH-NL is a project in the Netherlands that is setting up a distributed research infrastructure that provides humanities researchers with access to large collections of digital data and user-friendly processing tools. The Netherlands is a member of both CLARIN ERIC and DARIAH ERIC, so CLARIAH-NL contributes not only to CLARIN but also to DARIAH. CLARIAH-NL not only covers humanities disciplines that work with natural language (the defining characteristic of CLARIN) but also disciplines that work with structured quantitative data. Though CLARIAH aims to cover the humanities as a whole in the long run, it currently focusses on three core disciplines: linguistics, social-economic history, and media studies.
CLARIAH-NL is a partnership that involves around 50 partners from universities, knowledge institutions, cultural heritage organizations and several SAB-companies, the full list of which can be found here. Currently, the data and applications of CLARIAH-NL are managed and sustained at eight centres in the Netherlands: Huygens ING, the Meertens Institute, DANS, the International Institute for Social History, the Max Planck Institute for Psycholinguistics, the Netherlands Institute for Sound and Vision, the National Library of the Netherlands, and the Institute of Dutch Language. Huygens ING, the Meertens Institute, the Max Planck Institute for Psycholinguistics, and the Institute of Dutch Language are certified CLARIN Type B centres. The consortium is led by an eight-member board, and its director and national coordinator for CLARIN ERIC is Jan Odijk.
The research, development and outreach activities at CLARIAH-NL are distributed among five work packages: Dissemination and Education (WP1) and Technology (WP2) deal respectively with User Involvement and the technical design and construction of the infrastructure, whereas the remaining three work packages focus on three selected research areas: Linguistics (WP3), Social and Economic History (WP4) and Media Studies (WP5).
The full blog can be read here: https://www.clarin.eu/blog/tour-de-clarin-netherlands
17 October 2017, Christian Olesen
Early September, Liliana Melgar and I (Christian Olesen) received an invitation from Barbara Flückiger, Professor in Film Studies at the University of Zürich, to participate in the “Colloquium Visualization Strategies for the Digital Humanities”. The aim of the day was to bring together experts to discuss film data visualization opportunities in relation to Professor Flückiger’s current research projects on the history of film colors. Currently, Flückiger leads two large-scale projects on this topic: the ERC Advanced Grant FilmColors (2015-2020) and the Filmfarben project funded by the Swiss National Science Foundation (2016-2020). A presentation of the projects’ team members can be found here.
As a scholar, Barbara Flückiger has in-depth expertise on the interrelation between film technology, aesthetics and culture covering especially aspects of film sound, special effects, film digitization and film colors in her research. In recent years, her research has increasingly focussed on film colors, especially since the launch of the online database of film colors Timeline of Historical Film Colors in 2012 after a successful crowdfunding campaign. The Timeline of Historical Film Colors has since grown to become one of the leading authoritative resources on the history and aesthetics of film colors – it is presented as “a comprehensive resource for the investigation of film color technology and aesthetics, analysis and restoration”. It is now consolidating this position as it is being followed up by the two large-scale research projects mentioned above which merge perspectives from film digitization, restoration, aesthetic and cultural history.
These projects are entering a phase in which the involved researchers are beginning to conceive ways of visualizing the data they have created so far and need to consider the potential value which data visualization may have for historical research on film color aesthetics, technology and reception.
The full report, with a lot of impressions from the visit, can be read here.
On Friday, October 6th 2017 an enthusiastic group of engineers and digital humanities scholars gathered for the third annual CLARIAH Tech Day. There was an activist mood: this time we would do things differently!
Many developers in the project wanted a meeting in which building stuff would be the focus instead of listening to presentations on how other people had built stuff. The weeks before had seen a flurry of emails on the contents of such a day and the agenda, but also on doubts and concerns. And the truth was: none of us actually had the foggiest idea of how to do this.
I was asked to take the lead, and together with Roeland Ordelman, Richard Zijdeman and Marieke van Erp we sat down during the CLARIN Meeting in Budapest to kick around some ideas. We settled on a hackathon/unconference-style format. The agenda would be open to suggestions from the community and not be set until the meeting itself. And I’ll confess - I had some prior hesitations on this open format: what if nobody would come up with anything? Wouldn’t people want to know what the meeting was about before making time in busy schedules? But this was what the community itself had repeatedly asked for, so damn the torpedoes - full steam ahead.
And we were not disappointed! The ideas, suggestions and questions poured in and were eventually gathered into four main topics:
- Integration and modelling of shared data between the various domains and the generic CLARIAH infrastructure;
- Continued development of GRLC;
- A discussion on workflows, and how tool selection based on data mime-type can provide guidance for users;
- TEI/eXist-db/TEI Publisher and oXygen as the basis for digital editions and linguistic querying.
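
To make the third topic a little more concrete: the basic idea of guiding users to tools by the MIME type of their data could look roughly like the sketch below. This is only an illustration of the idea, not something that was built or agreed on at the Tech Day. The registry and all tool names in it are hypothetical.

```python
# Hypothetical registry: MIME type -> tools that accept that kind of data.
# The tool names are placeholders, not actual CLARIAH services.
TOOLS_BY_MIME_TYPE = {
    "text/plain": ["tokenizer", "lemmatizer"],
    "application/xml": ["tei-validator"],
    "text/csv": ["table-viewer"],
}


def suggest_tools(mime_type):
    """Return the tools registered for a MIME type, or an empty list."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return TOOLS_BY_MIME_TYPE.get(base_type, [])


print(suggest_tools("text/plain; charset=utf-8"))
print(suggest_tools("video/mp4"))
```

An empty result is itself useful guidance: it tells the user that no registered tool can handle their data as it stands, so a conversion step is needed first.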
The enthusiastic response continued into the event itself. It became immediately obvious that the restyled Tech Day would also be a lot of fun. The smiles, enthusiasm and flexibility were fantastic. The developers who had come from all over CLARIAH had brought many guests, turning this into a truly international day that generated a very positive vibe of its own.
After a five-minute pitch for each topic, the community basically took over the pantry, restaurant and meeting rooms at the IISH building. You could find groups of engineers working, discussing and building stuff everywhere. And these groups were extremely varied: people from Media Studies discussing GRLC with engineers working in the field of Social Economic History, and Linguists and Lexicographers getting stuff done with developers working on generic infrastructure. Many new ideas were born that day.
A lot of progress was made on the four main topics. Both Open Dutch Wordnet and the first version of the diachronous lexical corpus Diamant (INT, Kathrien Depuydt and Jesse de Does) were connected to the generic infrastructure, as were catalogues provided by NISV, and the Gemeente Geschiedenis dataset on Dutch municipalities (by Hic Sunt Leones). Carlos Martinez and a group of engineers added to GRLC the automatic inclusion of SPARQL queries stored in GitHub. And there were plenty of discussions on planned and unplanned subjects. Jan Odijk and Jesse de Does ran a very interesting meeting on workflow systems, and Eduard Drenth (Fryske Akademy) presented his ideas on digital editions, followed by a very detailed open discussion on the pros and cons of the software stack he proposed.
Completely spontaneously, Richard Zijdeman showed us a new way of implementing HDMI for the improvement of health in CLARIAH, and Roeland Ordelman and Liliana Melgar came up with very interesting ideas on a user workspace that may eventually become part of the generic infrastructure. Although interest in the first was quite short-lived, the latter we are definitely going to test.
In short: the CLARIAH tech community rallied around the open format! During the final meeting I was happy to announce that, given the excitement and energy, the board had decided right then and there that we could run another Tech meeting in late winter or early spring 2018. And, illustrating the enthusiasm, the first ideas for this meeting are already coming in.
- 07-06-2018 CLARIAH facilitates!
- 29-10-2017 ISWC Trip Report
- 23-10-2017 Tour de CLARIN: The Netherlands
- 18-10-2017 CLARIAH Media Studies and MIMEHIST in Zürich – A Report
- 16-10-2017 CLARIAH-Tech day blog
- 28-09-2017 CLARIN 2017 Annual Conference
- 29-06-2017 LDK Trip report
- 20-06-2017 Report CLARIAH Linked Data Workshop 2
- 25-05-2017 Catching Speech in Arezzo: A Clarin workshop for developing a transcription-chain for Oral History
- 23-02-2017 LD4LR: Linked Data for Linguistic Research