
3 Methodology

Input Pipeline

The maDMPs we use as raw input data for our project are taken directly from the Zenodo Community Data Stewardship 2021 - DMPs. Our input pipeline consists of the following steps:

  1. Start with raw maDMPs from the Zenodo Community.
  2. Ensure schema conformity as well as uniform formatting and indentation.
  3. Normalization: establish a uniform, alphabetical sorting of JSON properties (see the sketch below).
  4. Convert the JSON/Turtle maDMPs to DCSO instances in a JSON-LD serialization using the dcso-json tool.
  5. Apply postprocessing to the JSON-LD maDMPs (again, establishing a uniform, alphabetical sorting of the JSON-LD properties).

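Steps 2, 3 and 5 essentially amount to validating each maDMP against the maDMP JSON schema and re-serializing it with alphabetically sorted properties and uniform indentation. The following is a minimal sketch of that normalization, assuming Python with the jsonschema package; the file and directory names are illustrative assumptions and do not reflect our actual repository layout.

```python
import json
from pathlib import Path

from jsonschema import validate  # pip install jsonschema

# Illustrative paths -- not the actual layout of our repository.
SCHEMA_PATH = Path("maDMP-schema.json")
INPUT_DIR = Path("input")
OUTPUT_DIR = Path("normalized")

schema = json.loads(SCHEMA_PATH.read_text(encoding="utf-8"))
OUTPUT_DIR.mkdir(exist_ok=True)

for dmp_path in sorted(INPUT_DIR.glob("*.json")):
    dmp = json.loads(dmp_path.read_text(encoding="utf-8"))

    # Step 2: fail early if the maDMP does not conform to the schema.
    validate(instance=dmp, schema=schema)

    # Steps 3/5: alphabetical ordering of properties and uniform indentation.
    normalized = json.dumps(dmp, sort_keys=True, indent=2, ensure_ascii=False)
    (OUTPUT_DIR / dmp_path.name).write_text(normalized + "\n", encoding="utf-8")
```
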
Regarding step 2, the following changes had to be made in order to achieve schema conformity for all input maDMPs:

  • 4.json
    • removed line breaks within JSON string literals
  • 6.json
    • some closing brackets were missing
    • object nesting hierarchy in the distribution field was incorrect (we could only make a guess as to the author’s original intentions)
  • 10.json
    • single objects were used where the schema expects one-element arrays (illustrated in the sketch below)
    • in some places, strings were used instead of numerical values (e.g. integers)
    • two of the four datasets did not contain the required dataset_id field
    • on one occasion, a datetime format was used instead of a date
  • 11.json
    • added the missing timezone designator: 2021-04-12T25:10:16.8 -> 2021-04-12T25:10:16.8Z
    • corrected an invalid time value: 2021-04-12T25:10:16.8Z -> 2021-04-12T23:10:16.8Z (a day does not have more than 24 hours)

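These fixes were applied manually. Purely as an illustration of the two recurring problems in 10.json, coercions along the following lines would address single objects where the schema expects one-element arrays as well as numbers encoded as strings; the field names used in the usage comment are hypothetical.

```python
def as_array(value):
    """Wrap a lone object into the one-element array the schema expects."""
    return value if isinstance(value, list) else [value]

def as_int(value):
    """Convert numeric values that were encoded as JSON strings back to integers."""
    return int(value) if isinstance(value, str) else value

# Hypothetical usage on a dataset entry from 10.json:
# dataset["distribution"] = as_array(dataset["distribution"])
# dataset["distribution"][0]["byte_size"] = as_int(dataset["distribution"][0]["byte_size"])
```
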
Step 4 was performed automatically via the convert.sh shell script. For more information on the dcso-json tool invoked by this script, please refer to the dcso-json overview.

Processing of the Semantic maDMP Representation

After bringing the maDMPs into a semantically enriched JSON-LD format, we were ready to express the requirements from the evaluation rubric mentioned in the Project Overview as SPARQL queries. We developed queries that project certain subsets of the data into a customized view (SELECT queries) as well as queries that simply indicate whether certain criteria are satisfied (ASK queries).

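As a sketch of how such a query can be evaluated directly against one of the converted files, the snippet below runs an ASK query with rdflib (version 6 or later, which ships with JSON-LD support). The file name, the DCSO namespace IRI and the class/property names are assumptions; the actual queries are those in the queries directory.

```python
from rdflib import Graph

# Load one semantically enriched maDMP (the file name is an assumption).
g = Graph()
g.parse("converted/1.jsonld", format="json-ld")

# ASK whether the graph contains a DMP with a dataset that has a distribution.
# Namespace, class and property IRIs are assumptions about the DCSO vocabulary.
ask_query = """
PREFIX dcso: <https://w3id.org/dcso/ns/core#>

ASK {
    ?dmp a dcso:DMP ;
         dcso:dataset ?dataset .
    ?dataset dcso:distribution ?distribution .
}
"""

result = g.query(ask_query)
print(result.askAnswer)  # True if the pattern is satisfied, False otherwise
```
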
A more in-depth, but still summarized, overview of the queries we created can be found in Covered Criteria. The queries themselves are available in the queries directory.

During our experiment, we used a local GraphDB instance as the triple store and SPARQL endpoint. Other triple stores such as Apache Jena Fuseki would of course work as well; we opted for GraphDB based on previous experience with it.

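For illustration, a SELECT query can be sent to such a local GraphDB repository over the standard SPARQL protocol as sketched below; the repository name madmp is an assumption, and 7200 is GraphDB's default port.

```python
import requests

# Repository name is an assumption; 7200 is GraphDB's default HTTP port.
ENDPOINT = "http://localhost:7200/repositories/madmp"

select_query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": select_query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# Print the first few triples from the repository.
for binding in response.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```
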
Report on Quality of Input maDMPs

Finally, after creating the queries, we applied them to the input maDMPs, which had previously been imported into a GraphDB repository. The results of this assessment can be found in the Assessment Report.
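
The assessment step can be automated along the following lines: every ASK query in the queries directory is sent to the repository and its boolean outcome is recorded. The endpoint URL and the .rq file extension are assumptions about the local setup.

```python
from pathlib import Path

import requests

# Endpoint is an assumption about the local GraphDB setup (see the previous sketch).
ENDPOINT = "http://localhost:7200/repositories/madmp"

results = {}
for query_path in sorted(Path("queries").glob("*.rq")):  # .rq extension is an assumption
    response = requests.get(
        ENDPOINT,
        params={"query": query_path.read_text(encoding="utf-8")},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    # For ASK queries, the SPARQL JSON result carries a top-level "boolean" field.
    results[query_path.name] = response.json().get("boolean")

for name, satisfied in results.items():
    print(f"{name}: {'satisfied' if satisfied else 'not satisfied'}")
```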