Challenges in Open Data
School of Engineering, University of Edinburgh
Wednesday, 4th of September, 2024
A little about me:
We need high quality experimental measurements in research and industry
And what is FAIR data anyway?!
For the past two decades there has been significant discussion on the reproducibility crisis in science
Distinction between Reproducibility & Replicability
Reproducibility: Can you get the same answers as me when you analyze my data?
Replicability: Can you get the same answers when you do my experiment and collect your own data?
AI will not make this better!
Dependent on the quality of the data provided to it.
Managing research data is an often forgotten aspect of research. Traditionally datasets were quite small:
Significant changes in recent decades:
Ad-hoc management no longer sufficient
Create: Collect, Generate, Process, Clean
Use: Manipulate, Analyse, Visualise
Re-Use: By colleagues or starting point for new dataset
Data management best practices involve the entire data lifecycle from project start to end
Plan: Data Management Plan in grant submissions.
Create: Experiment, Simulation, Survey, Merge, etc…
Document: Describe the data collected in detail. Sooner rather than later!
Use: Analyse/Discover/Collaborate. Document the process.
Preserve: Store for future use. Version Control. Databases & Archives.
Share: Essential for translation of results into knowledge. Open Data? Repositories?
Re-Use: Collaborate / Derive / Develop. Teach. Policy.
FAIR is an acronym for Findable, Accessible, Interoperable, Reusable: the principles that should apply to scientific data management and stewardship.
Findable: The first step in making data reusable is to make it findable. Detailed and accurate metadata is key
Accessible: Data could be openly available or it could require prior authentication and authorisation
Interoperable: Data needs to be able to be used in different programs or workflows
Reusable: Well-defined data is essential because it is easier to understand and therefore to use, combine and/or extend
Guidance on how to assess the “fairness” of your data: https://bit.ly/yourFIP
The FAIR principles also emphasise machine-actionability
Machine readable and digitally accessible are not the same thing!
For example, traditional word processing documents and PDF files are easily read by humans but typically are difficult for machines to interpret.
A machine readable format is a file in a standard computer language (not English text) that can be read automatically by a web browser or computer system (e.g. xml, json).
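As a minimal sketch of the difference, the same measurement can be written as a sentence for a human or as JSON for a machine. The field names and values below are made up for illustration and do not come from any real dataset.

```python
import json

# Hypothetical record: field names and values are illustrative only.
record = {
    "material": "glass beads",
    "test": "Jenike shear",
    "cell_diameter_mm": 95.0,
    "normal_stress_kPa": [2.0, 4.0, 6.0],
}

# Human-readable sentence: easy for people, awkward for software to parse.
sentence = (
    f"We tested {record['material']} in a {record['test']} cell "
    f"({record['cell_diameter_mm']} mm diameter)."
)

# Machine-readable form: a standard format that any JSON parser can load.
text = json.dumps(record, indent=2)
loaded = json.loads(text)

# The round trip preserves names, types and values without custom parsing.
print(loaded["normal_stress_kPa"][1])
```

Recovering the same numbers from the sentence would need a custom parser for each phrasing, which is what makes free text hard for machines to act on.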
TUSAIL (www.tusail.eu) is an Innovative Training Network funded by Horizon 2020
“Training in Upscaling particle Systems: Advancing Industry across Length-scales”
Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN
TUSAIL Community on Zenodo
Do not use proprietary formats for data or metadata files
Use machine readable formats: csv, json, xml or (open) scientific formats like HDF5 and netCDF
Structured data is preferred over unstructured
Avoid image only datasets if possible
Use sensible variable names
Avoid commas as decimal separators in numbers; in csv files they are easily mistaken for field delimiters
Write 675454453.00 or 0.00007654
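The problem with decimal commas can be sketched with Python's standard csv module. The column names and values here are illustrative assumptions, not taken from a real dataset.

```python
import csv
import io

# The same row written with decimal commas and with decimal points.
# Column names and numbers are illustrative only.
comma_decimals = "sample,stress_kPa,strain\nA,2,5,0,013\n"
point_decimals = "sample,stress_kPa,strain\nA,2.5,0.013\n"


def read_rows(text):
    """Parse csv text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


bad = read_rows(comma_decimals)
good = read_rows(point_decimals)

# With decimal commas the data row splits into more fields than the header,
# so the values no longer line up with their column names.
print(len(bad[0]), len(bad[1]))
print(len(good[0]), len(good[1]))

# With decimal points each value converts directly to a number.
stress = float(good[1][1])
```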
Include a README file that describes the dataset
Why? Assist other researchers to understand:
Recommended Content:
Licensing can be complicated. Licensing considerations for datasets and for software/code may be very different.
Data management should adhere to FAIR principles. How do we achieve that?
Findable:
Accessible: Open access?
Interoperable: Uses common, open formats for data
Reusable: Document the data carefully. License appropriately.
A well-executed Research Data Management Plan enables data to be Accessible, Interoperable and Reusable.
Making datasets (for specific materials or test equipment) Findable is a challenge:
We can consider a dataset as two separate components:
GrainDB is an Experimental Measurements Database. It collates and stores the material and test equipment metadata that is missing from repositories, alongside a link to the full dataset, which is stored in an appropriate manner.
This becomes a searchable database of all available data, enabling the end-user to find detailed experimental datasets that match their needs
Enhancing dissemination
Provide high-quality machine-readable datasets suitable for use in AI/ML
Interface guides people on what information should be stored in accordance with best experimental practice
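The idea of a searchable metadata record that links out to the full dataset can be sketched as follows. This is not the GrainDB implementation: it uses an in-memory SQLite database as a stand-in, and the table name, column names, the "Compression" test type and the example.org URLs are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurement (
        id INTEGER PRIMARY KEY,
        material TEXT NOT NULL,
        test_category TEXT NOT NULL,
        test_subcategory TEXT NOT NULL,
        dataset_url TEXT NOT NULL
    )
    """
)

# Hypothetical records: metadata plus a link to where the full dataset lives.
rows = [
    ("glass beads", "Linear Shear", "Jenike", "https://example.org/dataset/1"),
    ("sand", "Linear Shear", "Jenike", "https://example.org/dataset/2"),
    ("glass beads", "Compression", "Uniaxial", "https://example.org/dataset/3"),
]
conn.executemany(
    "INSERT INTO measurement "
    "(material, test_category, test_subcategory, dataset_url) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def find_datasets(material, test_category):
    """Return dataset links whose metadata matches both search terms."""
    cursor = conn.execute(
        "SELECT dataset_url FROM measurement "
        "WHERE material = ? AND test_category = ? ORDER BY id",
        (material, test_category),
    )
    return [row[0] for row in cursor.fetchall()]


matches = find_datasets("glass beads", "Linear Shear")
print(matches)
```

Only the metadata is held in the database; the dataset itself stays in its repository and is reached through the stored link.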
Variety of test types in use to measure material properties
Experiment or Measurement?
In a measurement, one performs parameter inference
In an experiment, one performs hypothesis tests or model selection
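One way to read this distinction, sketched with made-up numbers: a measurement estimates a parameter and its uncertainty, while an experiment computes a statistic to compare two alternatives. The readings below are illustrative only, and Welch's t statistic is used here as one common choice rather than a prescribed method.

```python
import math
import statistics

# Illustrative repeated readings of one property for two batches (arbitrary units).
batch_a = [30.1, 29.8, 30.4, 30.0, 29.7]
batch_b = [31.2, 30.9, 31.5, 31.0, 31.4]


def mean_and_sem(values):
    """Parameter inference: estimate the mean and its standard error."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def welch_t(x, y):
    """Hypothesis test statistic: Welch's t for a difference in means."""
    mean_x, sem_x = mean_and_sem(x)
    mean_y, sem_y = mean_and_sem(y)
    return (mean_x - mean_y) / math.sqrt(sem_x**2 + sem_y**2)


# Measurement: report the estimate together with its uncertainty.
mean_a, sem_a = mean_and_sem(batch_a)
print(f"batch A mean = {mean_a:.2f} +/- {sem_a:.2f}")

# Experiment: ask whether the two batches differ.
t_statistic = welch_t(batch_a, batch_b)
print(f"Welch t = {t_statistic:.2f}")
```

Turning the t statistic into a p-value needs the t distribution, which is left out to keep the sketch within the standard library.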
Databases provide efficient and safe multi-user access to data
Accessible via a website: www.graindb.org
Need to be prescriptive
Define carefully what information is required for all aspects
Currently no Ontology or common descriptive language for defining granular materials and their measurement
Web interface to Relational Database
Database backend utilises PostgreSQL
Why?
All schemas are versioned so we can:
Schemas allow greater use of templating for dynamic page generation
Available at: git.ecdf.ed.ac.uk/jmorrise/tusail-experimental-database-schemas
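How a schema might drive page generation can be sketched as follows. The three properties are copied from the Jenike equipment schema in this deck; the `render_field` helper and the HTML it produces are hypothetical and are not the GrainDB templates.

```python
from html import escape

# Properties copied from the Jenike equipment schema. "label" and "units"
# are annotations used by these schemas, not standard JSON Schema keywords.
properties = {
    "cell_diameter": {
        "label": "Cell Diameter", "type": "number", "minimum": 0, "units": ["mm"],
    },
    "cell_material": {"label": "Cell Material", "type": "string"},
    "stress_application": {
        "label": "Vertical Stress Application Method",
        "enum": ["Constant Mass", "Servo-Controlled"],
    },
}


def render_field(name, spec):
    """Build one labelled HTML form control from a schema property."""
    label = escape(spec.get("label", name))
    field = escape(name)
    if "enum" in spec:
        options = "".join(f"<option>{escape(v)}</option>" for v in spec["enum"])
        control = f'<select name="{field}">{options}</select>'
    elif spec.get("type") == "number":
        minimum = spec.get("minimum")
        min_attr = "" if minimum is None else f' min="{minimum}"'
        units = " " + ", ".join(spec["units"]) if spec.get("units") else ""
        control = f'<input type="number" name="{field}"{min_attr}>{units}'
    else:
        control = f'<input type="text" name="{field}">'
    return f"<label>{label} {control}</label>"


form = "\n".join(render_field(name, spec) for name, spec in properties.items())
print(form)
```

Because the labels, units and allowed values live in the schema, adding a new piece of equipment would mean writing a new schema rather than a new page.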
{
  "$id": "equipment/linear_shear_jenike.v1-0-0.json",
  "$schema": "https://raw.githubusercontent.com/Vidminas/python-jsonschema-minmax/main/metaschema/minmax-metaschema.json",
  "type": "object",
  "properties": {
    "test_category": { "const": "Linear Shear" },
    "test_subcategory": { "const": "Jenike" },
    "class": { "label": "Test Regime", "const": "Quasi-static" },
    "rating": { "label": "Repeatability Rating", "const": 4 },
    "geometric_properties": { "$ref": "#/$defs/geometric_properties" },
    "measurement_parameters": { "$ref": "#/$defs/measurement_parameters" }
  },
  "required": ["geometric_properties", "measurement_parameters"],
  "$defs": {
    "geometric_properties": {
      "label": "Geometric Properties",
      "type": "object",
      "properties": {
        "cell_diameter": { "label": "Cell Diameter", "type": "number", "minimum": 0, "units": ["mm"] },
        "base_height": { "label": "Base Height", "type": "number", "minimum": 0, "units": ["mm"] },
        "ring_height": { "label": "Ring Height", "type": "number", "minimum": 0, "units": ["mm"] },
        "wall_thickness": { "label": "Wall Thickness", "type": "number", "minimum": 0, "units": ["mm"] },
        "max_translation_speed": { "label": "Max. Translation Speed", "type": "number", "minimum": 0, "units": ["mm/s"] },
        "cell_material": { "label": "Cell Material", "type": "string" },
        "stress_application": { "label": "Vertical Stress Application Method", "enum": ["Constant Mass", "Servo-Controlled"] }
      },
      "additionalProperties": false
    },
    "measurement_parameters": {
      "label": "Measurement Properties",
      "type": "object",
      "properties": {
        "force_accuracy": { "label": "Force Accuracy", "type": "number", "minimum": 0, "units": ["Pa"] },
        "force_resolution": { "label": "Force Resolution", "type": "number", "minimum": 0, "units": ["Pa"] },
        "displacement_accuracy": { "label": "Displacement Accuracy", "type": "number", "minimum": 0, "units": ["mm"] },
        "displacement_resolution": { "label": "Displacement Resolution", "type": "number", "minimum": 0, "units": ["mm"] }
      },
      "additionalProperties": false
    }
  }
}
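The constraints in the schema above can be checked against a submitted record. The sketch below is a hand-written check covering only the keywords used in `geometric_properties` (number and string types, `minimum`, `enum`, `additionalProperties`). It is not the project's validator, a full JSON Schema library would normally do this job, and the example records are hypothetical.

```python
# Subset of the "geometric_properties" definition, with labels and units omitted.
geometric_properties = {
    "type": "object",
    "properties": {
        "cell_diameter": {"type": "number", "minimum": 0},
        "cell_material": {"type": "string"},
        "stress_application": {"enum": ["Constant Mass", "Servo-Controlled"]},
    },
    "additionalProperties": False,
}


def check(record, schema):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    props = schema["properties"]
    for key, value in record.items():
        if key not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"{key}: unexpected property")
            continue
        spec = props[key]
        if spec.get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key}: expected a number")
                continue
            if "minimum" in spec and value < spec["minimum"]:
                problems.append(f"{key}: below minimum {spec['minimum']}")
        elif spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{key}: expected a string")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: not one of {spec['enum']}")
    return problems


# Hypothetical submissions: one valid, one with three separate faults.
good_record = {
    "cell_diameter": 95.0,
    "cell_material": "aluminium",
    "stress_application": "Constant Mass",
}
bad_record = {"cell_diameter": -1, "stress_application": "Spring", "colour": "red"}

print(check(good_record, geometric_properties))
for problem in check(bad_record, geometric_properties):
    print(problem)
```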
RESTful API implemented using Swagger and OpenAPI Specification
We haven’t always done a very good job of preserving our datasets for future use
We need to be able to find all datasets easily
A well-defined set of metadata schemas will make recording experimental measurements for lots of different pieces of equipment more straightforward
Having detailed stored records with unique ids (URLs) may make dissemination easier as well: a record could easily be included in a publication, much like a DOI