Input Pipeline
The maDMPs we use as raw input data for our project are taken directly from the Zenodo Community Data Stewardship 2021 - DMPs.
1. Start with the raw maDMPs from the Zenodo community.
2. Ensure schema conformity as well as uniform formatting and indentation.
3. Normalization: establish a uniform, alphabetical sorting of JSON properties (see the sketch after this list).
4. Convert the JSON/Turtle maDMPs to DCSO instances in a JSON-LD serialization using the `dcso-json` tool.
5. Apply postprocessing to the JSON-LD maDMPs (again, establishing a uniform, alphabetical sorting of the JSON-LD properties).
Regarding step 2, the following changes had to be made in order to achieve schema conformity for all input maDMPs (a validation sketch follows the list):
`4.json`
- removed line breaks within JSON string literals

`6.json`
- some closing brackets were missing
- the object nesting hierarchy in the `distribution` field was incorrect (we could only make a guess as to the author's original intentions)

`10.json`
- objects were used instead of single-element arrays
- in some places, strings were used instead of numerical (e.g. `int`) values
- two of the four datasets did not contain the required `dataset_id` field
- on one occasion, a datetime format was used instead of a date
`11.json`
- correction regarding the datetime format: `2021-04-12T25:10:16.8` -> `2021-04-12T25:10:16.8Z`
- correction regarding an incorrect time value: `2021-04-12T25:10:16.8Z` -> `2021-04-12T23:10:16.8Z` (a day does not have more than 24 hours)
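Checking whether the repaired files actually conform to the RDA maDMP JSON schema can also be automated. Below is a minimal sketch using the `jsonschema` package, assuming a local copy of the official schema file; the file and directory names are illustrative, not the ones used in this repository.

```python
import json
from pathlib import Path
from jsonschema import validate, ValidationError

# Assumed local copy of the RDA DMP Common Standard JSON schema.
schema = json.loads(Path("maDMP-schema.json").read_text(encoding="utf-8"))

for madmp in sorted(Path("madmps").glob("*.json")):
    instance = json.loads(madmp.read_text(encoding="utf-8"))
    try:
        validate(instance=instance, schema=schema)
        print(f"{madmp.name}: conforms to the schema")
    except ValidationError as error:
        print(f"{madmp.name}: {error.message}")
```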
Step 4 has been performed automatically via the `convert.sh` shell script. For more information on the `dcso-json` tool invoked by this script, please refer to the dcso-json overview.
Processing of the Semantic maDMP Representation
After having brought the maDMPs into a semantically enriched JSON-LD format, we were ready to express requirements from the evaluation rubric mentioned in the Project Overview. We developed queries that project certain subsets of the data into a customized view (`SELECT` queries) as well as ones that simply indicate whether some criteria are satisfied (`ASK` queries).
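As a rough illustration of the two query styles, the sketch below loads one converted JSON-LD maDMP into an in-memory rdflib graph (rdflib 6+ parses JSON-LD natively) and runs a generic `SELECT` alongside an `ASK`. The file name, the DCSO namespace URI, and the `dcso:license` property are assumptions made for illustration; the actual queries are the ones in the queries directory.

```python
from rdflib import Graph

g = Graph()
# Assumed file name; any of the converted maDMPs would do.
g.parse("madmps/4.jsonld", format="json-ld")

# SELECT: project a subset of the data into a customized view,
# here simply the classes instantiated in the maDMP.
select_query = "SELECT DISTINCT ?cls WHERE { ?s a ?cls }"
for row in g.query(select_query):
    print(row.cls)

# ASK: indicate whether a criterion is satisfied, e.g. whether any
# distribution declares a license (prefix and property are assumed;
# adjust them to the vocabulary actually used).
ask_query = """
    PREFIX dcso: <https://w3id.org/dcso/ns/core#>
    ASK { ?distribution dcso:license ?license }
"""
print(g.query(ask_query).askAnswer)
```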
A more in-depth, but still summarized, overview of the queries we created can be found in Covered Criteria. The queries themselves are available in the queries directory.
During our experiment, we used a local GraphDB instance as a triple store and SPARQL endpoint. Other triple stores such as Apache Jena Fuseki are of course eligible as well; however, we opted for GraphDB based on our previous experience with it.
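For reference, loading the JSON-LD maDMPs into a local GraphDB repository can be done through its RDF4J-style REST interface (or, alternatively, through the GraphDB Workbench UI). The sketch below assumes GraphDB's default port 7200 and a repository named `madmps`; both names are assumptions.

```python
from pathlib import Path
import requests

# Assumed local GraphDB endpoint and repository name.
GRAPHDB_STATEMENTS = "http://localhost:7200/repositories/madmps/statements"

for jsonld_file in sorted(Path("madmps").glob("*.jsonld")):
    response = requests.post(
        GRAPHDB_STATEMENTS,
        data=jsonld_file.read_bytes(),
        headers={"Content-Type": "application/ld+json"},
    )
    response.raise_for_status()
    print(f"imported {jsonld_file.name}")
```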
Report on Quality of Input maDMPs
Finally, after creating the queries, we applied them to the input maDMPs, which had previously been imported into a GraphDB repository. The results of this assessment can be found in the Assessment Report.
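A minimal sketch of this last step, assuming the same local repository as above and query files stored in the queries directory (the `.sparql` extension is an assumption): each `SELECT` query returns its bindings, while each `ASK` query returns a single boolean.

```python
from pathlib import Path
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local GraphDB SPARQL endpoint; the repository name is an assumption.
sparql = SPARQLWrapper("http://localhost:7200/repositories/madmps")
sparql.setReturnFormat(JSON)

for query_file in sorted(Path("queries").glob("*.sparql")):
    sparql.setQuery(query_file.read_text(encoding="utf-8"))
    result = sparql.query().convert()
    if "boolean" in result:  # ASK query
        print(f"{query_file.name}: {result['boolean']}")
    else:  # SELECT query
        print(f"{query_file.name}: {len(result['results']['bindings'])} result(s)")
```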