The CLARIAH CORE infrastructure was developed in three separate work packages (language technology, semantic structured data technology, and audio-video technology) serving largely separate user communities and running on separate platforms. The most important intangible success during the course of the project was the establishment of trust among the various partners. It took until the end of CLARIAH CORE for work package 2 and several centrally funded cross-discipline projects to achieve some success in bringing the technology of the three work packages together. This document is a plan for continuing the integration of infrastructure development in CLARIAH PLUS and for presenting it as a unified whole under the CLARIAH-as-a-Service (CLaaS) denominator.
The ideas presented here are based on the CLARIAH PLUS brainstorm sessions in Fall 2018, which defined a number of technical topics whose development went beyond the needs of single work packages. The original topics were:
- Geo technology
- Image technology
- Speech technology
- Text technology
- Recommender system technology
- Triplestore/Graph database technology
- Graphical User Interfaces
- Security, authentication, authorization, logging
- VRE/workflow management, pipelines and recipes
- Data standards, Vocabularies & Ontologies
The list appeared to be a selection of interests that required more fundamental classification. This was discussed in 2019-Q1 with the participation of Antal van den Bosch, Ronald Dekker, Marieke van Erp, Jauco Noordzij, Marijn Koolen and Arno Bosse.
The discussion commenced from Unsworth’s scholarly primitives (http://www.people.virginia.edu/~jmu2m/Kings.5-00/primitives.html).
The primitives were found to be applicable as a guide to classifying the components. They were eventually reduced to four classes:
- Models such as geo/image/speech/text data standards, vocabularies and ontologies;
- Prerequisites such as security, authentication, authorization, logging, etc.;
- Transformations such as VRE, workflow management, pipelines, recipes, etc.; and
- Interactions, which focus not only on visualisations and graphical user interfaces, but also on non-GUI interactions, effectively addressing the “80-20 discussion” in CLARIAH PLUS.
More and more researchers also want to interact with the infrastructure directly via code, in solutions such as Jupyter Notebooks. Although these may not yet have reached the illustrious status of 20%, adoption is increasing rapidly.
The classification (interactions, transformations, models, and prerequisites) can be layered from specific to generic. Interactions are closest to the user and often require custom solutions to answer specific research questions within an academic (sub-)discipline. Transformations, such as NLP pipelines and recipe workflows, are more generic and applicable to multiple use cases across SSH disciplines, while Models on the other hand transcend the boundaries of academic fields. Prerequisites are even more universal and can no longer be considered part of the academic domain.
These discussions led to a first version of the CLARIAH-as-a-Service (CLaaS) plan, which was presented to the CLARIAH Board in Spring 2019. The design was developed further with the CLARIAH TB in 2019-Q2 and presented to a range of (international) partners as well as the Dutch Network for Digital Heritage (NDE). This led to the addition of a new class: DevOps, a collection of services and components required by engineers to develop, test and deploy software on top of the infrastructure. The DevOps class is inherently more universal than Prerequisites, as it is generic enough to develop any of the other services.
The scale from specific-to-generic naturally led to the conclusion that the five component classes all rest on an ultimate generic class of infrastructure services: those that serve computational resources such as storage, processing, memory and network. These apply equally to cars, washing machines, doorbells, and space stations, as much as they do to analytical software for research questions in highly specialised academic sub-disciplines.
The architectural model behind CLaaS consists of layers in which the component classes described earlier are grouped together. The lowest layer contains the Computational Resources hosted by institutions. The Container Orchestration layer dynamically distributes these resources to software services. This layer allows for the fluid upscaling and downscaling of resource allocation, meaning that the resources of dormant service containers, such as processor capacity and memory, can be used by active services. This makes the infrastructure not only more cost effective but also more reliable and performant, and, as a result, more sustainable.
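The redistribution of resources from dormant to active services can be illustrated with a toy model. The sketch below simply shares a fixed capacity in proportion to current demand; it is not how any particular container orchestrator schedules workloads, and the service names, the `Service` class and the `allocate` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str      # hypothetical service name, for illustration only
    demand: float  # CPU cores currently requested; 0.0 means the container is dormant


def allocate(capacity: float, services: list[Service]) -> dict[str, float]:
    """Share a fixed capacity among services in proportion to their current demand."""
    total_demand = sum(s.demand for s in services)
    if total_demand == 0:
        return {s.name: 0.0 for s in services}
    # Never hand out more than a service asks for, even when capacity is abundant.
    scale = min(1.0, capacity / total_demand)
    return {s.name: s.demand * scale for s in services}


services = [
    Service("text-pipeline", demand=6.0),
    Service("av-transcoder", demand=0.0),  # dormant: claims nothing
    Service("sparql-endpoint", demand=2.0),
]
shares = allocate(capacity=4.0, services=services)
```

In this toy run the dormant service receives no capacity, so the full four cores are divided between the two active services.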
The actual CLARIAH software is divided over two layers. A layer for Provisioning Services contains components classified as Prerequisites and DevOps. A further layer contains Domain Services and consists of the Models, Transformations, and Interactions classes. Finally, in accordance with the specific-to-generic scale, one more layer contains both experiments and tools developed by scientific programmers in other projects (NWO Open Competition, NWO Groot, NWA, etc.). Both can make use of the CLARIAH infrastructure. This layer serves the 20% of researchers who are digitally oriented as well as the 80% who are more traditional scholars. Although the layer is outside the scope of this plan, it is likely to contain software components developed in CLARIAH WP3 to 6.
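As a compact summary, the layering described above can be written down as a small data structure. This is only a sketch: the layer and class names follow the text, but "Experiments and Tools" is an assumed label for the out-of-scope layer, and `Layer`, `CLAAS_LAYERS` and `layer_of` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    component_classes: tuple[str, ...]
    in_scope: bool = True  # whether the layer falls within the scope of the plan


# Ordered from most generic (first) to most specific (last).
CLAAS_LAYERS = (
    Layer("Computational Resources", ()),
    Layer("Container Orchestration", ()),
    Layer("Provisioning Services", ("Prerequisites", "DevOps")),
    Layer("Domain Services", ("Models", "Transformations", "Interactions")),
    # Assumed label: the text describes this layer but does not name it.
    Layer("Experiments and Tools", (), in_scope=False),
)


def layer_of(component_class: str) -> str:
    """Return the name of the layer that contains the given component class."""
    for layer in CLAAS_LAYERS:
        if component_class in layer.component_classes:
            return layer.name
    raise KeyError(component_class)
```

For example, `layer_of("DevOps")` yields the Provisioning Services layer, while `layer_of("Interactions")` yields Domain Services.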
Definition and Reference Implementation
To promote interoperability, all components in the architecture are primarily considered to be sets of conceptual agreements. A commonly accepted agreement within CLARIAH on a definition, e.g. of a ‘workflow engine’, is required before development of the software implementing such an engine commences. Definitions are formulated as one or more sets of APIs and data models, allowing anyone to adopt the infrastructure without being forced to run a specific software implementation. Over the course of the CLARIAH PLUS project each of the components in the CLaaS architecture is intended to be both precisely defined and implemented in at least one running software component. The CLaaS Reference Implementation is the full set of all these components.
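The separation between a definition and its implementations can be sketched in code. The example below is a minimal illustration, assuming a deliberately tiny, hypothetical ‘workflow engine’ definition with a single `run` method. It does not reflect any API actually agreed within CLARIAH, and `WorkflowEngine`, `InMemoryEngine` and `process` are invented names.

```python
from typing import Protocol


class WorkflowEngine(Protocol):
    """Hypothetical definition: anything offering this method counts as a workflow engine."""

    def run(self, recipe: list[str], document: str) -> str:
        ...


class InMemoryEngine:
    """One possible implementation; another could replace it without changing callers."""

    # Hypothetical recipe steps, mapped to plain string operations.
    STEPS = {"lowercase": str.lower, "strip": str.strip}

    def run(self, recipe: list[str], document: str) -> str:
        for step in recipe:
            document = self.STEPS[step](document)
        return document


def process(engine: WorkflowEngine, document: str) -> str:
    # Callers depend only on the agreed definition, never on a concrete engine.
    return engine.run(["strip", "lowercase"], document)


result = process(InMemoryEngine(), "  CLARIAH Plus  ")
```

Because `process` is written against the definition only, a reference implementation and any alternative implementation remain interchangeable as long as both honour the same API and data model.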
In many instances definitions already exist. For the lower layers of the architecture the necessity to come to a commonly accepted new definition is small. There is clearly no need to draw up a different concept of e.g. a processor or memory bank in CLARIAH. As a principle: the more specific a layer of infrastructure becomes, the more thorough the analysis of the concepts will have to be, the larger the effort required, and the likelier it is that CLARIAH will have to be involved. Since the greatest effort in terms of resources and time is the development of unique software, the development of CLARIAH-specific implementations in any layer below Domain Services is considered to be controversial. This principle aids in focussing the development effort. Although CLARIAH would never consider defining a new processor, there have been discussions on whether or not to develop software for e.g. security, preservation, cataloguing, etc. Whenever unique feature requests for such services occur, CLARIAH should challenge these requests instead of automatically building and maintaining a new and uniquely specific codebase.
CLARIAH Interest Groups
The goal of the CLARIAH Interest Groups (IGs) is to precisely define the concepts of the components in the CLaaS architecture and to reach broad agreement on the definitions as well as on the selection of working implementations. In other words, based on relevant use cases the groups select sets of APIs and models, as well as the software tools that implement them. The selection made in the groups should carry wide support in the CLARIAH community and especially within the relevant work packages. Specialists, both engineers and users, work together in the IGs and a coordinator is appointed to facilitate and encourage the debate. Since very few specialists are likely to be able to discuss, for example, data models in both Audio/Video and GIS with equal knowledge, it is virtually impossible for the IGs to follow the architectural classification outlined above. For this reason, the IGs are grouped thematically, and considered ‘topical views’ on the components in the CLaaS architecture.