The CLARIAH Call for Research Pilots states:
"A CLARIAH research pilot is a small project in which the CLARIAH infrastructure or particular components of it are tested by carrying out small research activities."
Under "About CLARIAH Components" it states that these include "generic infrastructure services, data such as databases, textual resources and audio-visual resources, and software applications and services that can be applied to these data for searching, analysis, enrichment, conversion, combining, visualization and other purposes."
This page provides a (non-exhaustive) list of such CLARIAH components. It is intended as an aid to researchers who want to make a proposal. If a component is mentioned on this page, one can be sure it falls in the scope of the call. Components not mentioned on this page are not excluded, but then the researcher has to make clear what the relevance of the component is for CLARIAH.
The components are listed in relation to the CLARIAH Work Packages (WP2 through WP5), and to a category "Other".
For each component a contact person is mentioned.
|Jauco Noordzij (HI)||ANANSI back-end infrastructure|
|Jauco Noordzij (HI)||ANANSI front-end infrastructure|
|Daan Broeder (MI)||CMD2RDF|
|Themis Karavellas & Jaap Blom (B&G)||CLARIAH OpenConext suite|
|Jan Theo Bakker (ivdNt)||PICCL production pipeline|
|Sebastiaan Derks (HI)||Add new Person datasets from fields in the humanities other than Linguistics, Social & Econ. History and Media Studies|
|Richard Zijdeman (IISG)||Add new Location datasets from fields in the humanities other than Linguistics, Social & Econ. History and Media Studies|
|Katrien Depuydt (ivdNt)||Add new Diachronous corpora from fields in the humanities other than Linguistics, Social & Econ. History and Media Studies|
Most of these components are in full development and in various stages of maturity.
There are many different components. They are listed under the research phase where they are most relevant, more or less following the overall linguistics plan for CLARIAH. Most component names link to more information about the component, including details of a contact person. One can also always contact Jan Odijk for assistance.
- Obtaining Data
- Search for and select data already contained in the CLARIAH infrastructure
- Incorporate existing data into the CLARIAH infrastructure
- Create new data, inter alia via crowd sourcing, and incorporate them into CLARIAH
- Obtaining tools (software services)
- Search for and select tools that are already part of the CLARIAH infrastructure
- Incorporate existing tools into the CLARIAH infrastructure
- Create new tools, and incorporate them into CLARIAH
- Enriching data incorporated in CLARIAH with various annotations
- TTNWW and components used in it (Frog, Alpino, NERD, ..), Frog, PICCL, TICCL, CLAM
Robust Semantic Parsing Dutch (state-of-the-art natural language processing pipeline). The following is a partial list of its modules. Contact for more information.
- Word Sense Disambiguation: system based on Support Vector Machines to assign senses and a system confidence score to words.
- Entity recognition, classification & linking: identifies names in text, assigns a type such as person, location or organisation and tries to anchor it to its DBpedia resource (DBpedia is a graph database that contains the structured information from Wikipedia)
- Ontotagger: module that inserts ontological labels to Wordnet synsets associated with terms or directly to the lemmas of the term based on the external resources provided.
- Semantic Role Labelling (event extraction): identifies and classifies the semantic arguments in a sentence, for example who was the perpetrator of an action and who was the subject, as well as locations which can be used to generate event descriptions
- Factuality/Attribution: qualifies the certainty (certain/probable/possible) of an event, whether the event is confirmed or denied (pos/neg) and whether it is in the future (future/non-future)
- Opinion extraction: detects opinion entities (holder, target and expression)
- Simple tagger: generic tagger that identifies concepts in text and links them to external references. It currently comes with a basic version that tags historical occupations (from HISCO) and identifies family relations for Dutch
- NLP2RDF crystallisation strategies NewsReader and BiographyNet: resolve coreference of entities and events within and across documents to generate event descriptions based on the Simple Event Model and source perspectives according to GRaSP.
- Manually, possibly automatically bootstrapped
- Linking to external resources
- Searching in and analysis of the data
- Upload (possibly enriched) data into a search engine
- Search and browse in the data, analyze the data and the results of searching
- Chaining Search (ivdNt)
- Federated Content search (FCS): SRU/CQL, CLARIN Protocol, CLARIN FCS workplan
- Lexical data: CELEX, CGN lexicon(s), Cornetto, Open Source Dutch Wordnet, Duelme-LMF, GTB, GrNe, …
- Token-annotated corpora: CGN, SONAR-100, SONAR-500 and SONAR New Media, VU-DNC, Childes Corpora, FESLI and other SLI databases, VALID databases, Basilex, Nederlab data, MIMORE data, Corpus Hedendaags Nederlands, Corpus Gysseling, etc.; SHEBANQ for Biblical Hebrew
- Treebanks: CGN-treebank, LASSY-Small, LASSY-LARGE, SONAR Treebank, treebanks created with PaQu, and CHILDES treebanks in production by UU
- Corpora with other annotations: certain annotations in VU-DNC, Discan, possibly some annotations in CHILDES corpora, UU learner corpora, UU and other correctness corpora
- Visualising search and analysis results
- Publishing the data and software in the CLARIAH infrastructure
- Make them visible (through metadata), accessible, and referable
- Ensure safe long term storage
- Creating and publishing enhanced publications (scientific article plus associated data and tools)
Structured Data (Social and Economic History)
Work Package 4 (WP-4) provides tools for creating, managing and querying Linked Data. WP-4 also contributes a wide variety of social and economic history datasets as Linked Data. For the call for research pilots, we welcome anyone who wants to work with our tools and/or our data. Below follows a non-technical description of our tools and their purpose, as well as our datasets. Technical users can find the GitHub pages for all our tools by clicking on the names of the tools below.
If you have any questions, please contact Richard Zijdeman.
|QBer||QBer allows researchers to convert their CSV or Excel files into Linked Data. The benefit of having your data as Linked Data is that it allows easy connection to other data sources, even if you are unaware of those sources. For example, if you have data with a place and date, you might be able to retrieve information on economic or even weather conditions at those places and times, providing important contextual information for your topic of study.
Moreover, by providing your data as Linked Data, you make it possible for others to benefit from your data by linking it to their own, provided you agree to that, of course.
|Iribaker||An important facet of creating Linked Data is creating so-called Uniform Resource Identifiers (URIs). Such URIs identify concepts, so we can use them to link information to. For example, this URI refers to the concept of age (the length of time that a person has lived or a thing has existed).
IRIs are like URIs, but the important difference is that IRIs can contain Unicode characters, while URIs are limited to a subset of ASCII characters.
The tool iribaker allows you to create such IRIs from strings (literals).
|inspector||When linking data, it is important to evaluate what data was linked and by whom. Inspector visualizes this for all datasets linked via QBer.|
|grlc||When doing research, we often take a selection of all the data we started out with to perform our analysis on. We basically subset or 'query' our data. When working with Linked Data, people write SPARQL queries in order to subset their data. Such queries can be very tedious to write, especially when just starting out with Linked Data.
grlc provides a way to share queries across the web. This means you can borrow queries from others to get the subset of data you are interested in. grlc also allows you to edit those queries if your interests differ from those of the person who created the original query. grlc provides the resulting subset of data in various formats, including .csv, allowing for easy import into, for example, MS Excel. A final benefit is that sharing your queries via grlc makes it very easy for others to replicate the research sample and the research results.
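To make the idea behind QBer and iribaker more concrete, the following is a minimal sketch of turning CSV rows into Linked Data statements with IRIs minted from strings. It is not the actual implementation of either tool: the base IRI `https://example.org/`, the column names (`name`, `birthplace`, `year`) and the helper functions are all hypothetical, and the encoding rules are a simplification chosen for illustration.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; a real project would use its own base IRI.
BASE = "https://example.org/"


def mint_iri(kind: str, label: str) -> str:
    """Build an IRI from a free-text label (illustrative rules only).

    Whitespace becomes underscores. Letters and digits, including
    non-ASCII ones such as "ö", are kept, since IRIs allow Unicode.
    Any other character is percent-encoded.
    """
    slug = "_".join(label.split())
    encoded = "".join(
        ch if ch.isalnum() or ch in "-_.~" else quote(ch, safe="")
        for ch in slug
    )
    return f"{BASE}{kind}/{encoded}"


def rows_to_ntriples(csv_text: str) -> list[str]:
    """Convert CSV rows into N-Triples-style statements.

    Assumes the (hypothetical) columns name, birthplace and year.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        person = mint_iri("person", row["name"])
        place = mint_iri("place", row["birthplace"])
        year = row["year"]
        triples.append(f"<{person}> <{BASE}vocab/birthPlace> <{place}> .")
        triples.append(f'<{person}> <{BASE}vocab/birthYear> "{year}" .')
    return triples


SAMPLE = "name,birthplace,year\nJan de Vries,Köln am Rhein,1850\n"
for line in rows_to_ntriples(SAMPLE):
    print(line)
```

Because the place is expressed as an IRI rather than as a plain string, another dataset that uses the same IRI for that place can be connected to these rows without any further matching step, which is the benefit described for QBer above.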
|WP-4 currently provides various datasets as Linked Data. From 2017 onwards we will provide those at an open endpoint. Currently, however, the endpoint is only available on request, as we are still in the testing phase.
Researchers are free to use any dataset they would like to convert into Linked Data, but to provide some examples of typical historical data sources, we refer to the repositories at the International Institute of Social History:
The contact person for all tools and data for WP5 is Julia Noordegraaf.
An overview of the data available in WP5 is provided here.
Full details are provided in the WP5 Collection Registry
An overview of the WP5 tools (or functionalities, as we call them) is provided in the spreadsheet below.
Further information about the use of the WP5 data and tools in the Research Pilots is provided in this PPT-presentation. (PDF-version)
In addition to the functionalities below (the ‘ingredients’), the Media Suite also provides four tools which combine several functionalities in one interface, or ‘recipe’. This facilitates use and has additional analytical value. Testing the added value of these recipes, compared to working with individual functionalities, could be a topic of a Research Pilot.
An overview of:
One possibility for disciplines other than the CLARIAH core disciplines is to use one of the components developed earlier in the CLARIN-NL project. The CLARIN-NL portal provides a list of software components and data. One can use the faceted browsing to get an overview of the components of interest by selecting the relevant research domain.