GrainDB - Making Experimental Measurements FAIR

Challenges in Open Data

John P. Morrissey

School of Engineering, University of Edinburgh

Wednesday, 4th of September, 2024

Who am I…

A little about me:

Research Fellow at The University of Edinburgh
Member of the Granular & Geomechanical Processes Research Group
Focus on Granular Materials and Particulate Mechanics
- Experimental Characterisation
- DEM Simulations
- Data Analysis
Some supervision of undergraduate and postgraduate students

Granular Mechanics & Industrial Infrastructure Research Group

And why am I talking to you about FAIR Data?

We need high quality experimental measurements in research and industry

Material characterisation for parameter estimation
For calibrating simulations
For validation of simulations
To introduce into large AI/ML/Data-driven models

And what is FAIR data anyway?!

Reminder - Research Data

Photo by Kelly Sikkema on Unsplash

Reproducibility & Replicability Crisis in Science

For the past two decades there has been significant discussion on the reproducibility crisis in science

A pattern of scientists being unable to obtain the same results as previous researchers

Distinction between Reproducibility & Replicability

Reproducibility: Can you get the same answers as me when you analyze my data?
Replicability: Can you get the same answers when you do my experiment and collect your own data?

AI will not make this better!

Dependent on the quality of the data provided to it.

Research Data Life-cycle

Managing research data is an often forgotten aspect of research. Traditionally datasets were quite small:

Data was not widely shared beyond colleagues
Often only ~~documented~~ mentioned in publications

Significant changes in recent decades:

The amount of research data generated has grown
- much larger datasets (e.g. imaging, DEM, etc.)
Sharing & collaboration encouraged (impact)

Ad-hoc management no longer sufficient

Poorly documented datasets are of limited use
Are they preserved beyond the current cloud storage subscription?

Create: Collect, Generate, Process, Clean
Use: Manipulate, Analyse, Visualise
Re-Use: By colleagues or starting point for new dataset

Research Data Life-cycle

Data management best practices involve the entire data lifecycle from project start to end

Plan: Data Management Plan in grant submissions.
Create: Experiment, Simulation, Survey, Merge, etc…
Document: Describe the data collected in detail. Sooner rather than later!
Use: Analyse/Discover/Collaborate. Document the process.
Preserve: Store for future use. Version Control. Databases & Archives.
Share: Essential for translation of results into knowledge. Open Data? Repositories?
Re-Use: Collaborate / Derive / Develop. Teach. Policy.

What is FAIR Data?

FAIR is an acronym for Findable, Accessible, Interoperable and Reusable: the principles that should apply to scientific data management and stewardship.

Findable: The first step in making data reusable is to make it findable. Detailed and accurate metadata is key
Accessible: Data could be openly available or it could require prior authentication and authorisation
Interoperable: Data needs to be able to be used in different programs or workflows
Reusable: Well defined data is essential as it makes it easier to understand and therefore use, combine and/or extend the dataset

**“FAIR Principles”**. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

FAIR Guiding Principles

Guidance on how to assess the “fairness” of your data: https://bit.ly/yourFIP

FAIR Data - Machine Actionable

The FAIR principles also emphasise machine-actionability

The capacity of computational systems to find, access, interoperate, and reuse data with minimal or no human intervention
We are increasingly dependent on computers as datasets grow and we move towards automation (AI/Deep Learning/etc.)

Machine readable and digitally accessible are not the same thing!

For example, traditional word processing documents and PDF files are easily read by humans but typically are difficult for machines to interpret.
A machine-readable format is a structured, standardised file format (not free-form English text) that a web browser or other computer system can parse automatically (e.g. XML, JSON).
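
As a minimal sketch of what machine-readable means in practice, the snippet below parses a JSON record with Python's standard `json` module and reads one value directly by name. The record contents and the field name `normal_stress_kPa` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical measurement record, used only for illustration.
record_text = '{"material": "limestone powder", "test": "shear", "normal_stress_kPa": 5.0}'


def read_normal_stress(text: str) -> float:
    """Parse a JSON record and return the normal stress as a number."""
    record = json.loads(text)
    return float(record["normal_stress_kPa"])


stress = read_normal_stress(record_text)
print(stress)
```

The same value buried in a sentence of a PDF or word processing document would need format-specific text extraction and pattern matching before a program could use it.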

Reality

Photo by Vitaly Gariev on Unsplash

TUSAIL

TUSAIL (www.tusail.eu) is an Innovative Training Network funded by Horizon 2020

Training 15 ESRs over 4 years through a combination of PhD research, scientific training and industrial secondments
Comprising 15 academic and industrial partners and led by the University of Edinburgh

“Training in Upscaling particle Systems: Advancing Industry across Length-scales”

Where to Store Datasets - Accessible Data

Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN

Requirement in many EU funded projects to make data available here
Free
Citable - every dataset receives a DOI
Maximum 50GB per dataset
Not just limited to datasets (papers, posters, presentations, etc.)

TUSAIL Zenodo Community - Accessible Data

TUSAIL Community on Zenodo

Outputs from the TUSAIL project
Will accept deposits from the wider community who wish to share particle/granular data in a common forum
Start creating a centralised source for datasets

Dataset Preparation (A very brief overview)

Photo by RDNE Stock project

Enhancing Interoperability

Do not use proprietary formats for data or metadata files
- Open formats where possible
Use machine readable formats
- ASCII, CSV, JSON, XML or (open) scientific formats like HDF5 and netCDF
Structured data is preferred over unstructured
Avoid image only datasets if possible
- difficult to extract information from images
Use sensible variable names
Avoid commas as decimal separators in numbers, as they can easily be mistaken for field delimiters in CSV files
- Format numbers as 675454453.00 or 0.00007654
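
The decimal separator problem can be sketched with Python's standard `csv` module. The two text snippets below are hypothetical rows holding the same time and force values, written once with decimal points and once with decimal commas.

```python
import csv
import io

# Hypothetical rows: a time and a force value.
good = "time,force\n0.5,12.25\n"
bad = "time,force\n0,5,12,25\n"  # same values written with decimal commas


def parse_rows(text: str) -> list[list[str]]:
    """Split CSV text into rows of string fields."""
    return list(csv.reader(io.StringIO(text)))


good_rows = parse_rows(good)
bad_rows = parse_rows(bad)
print(good_rows[1])  # two fields, matching the header
print(bad_rows[1])   # four fields, no longer matching the header
```

With decimal commas the data row splits into four fields under a two-column header, so a reader can no longer tell which digits belong to which variable.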

Metadata - The README - Interoperable and Reusable

Need to include a README file that describes the dataset

Suggestion: use markdown format (*.md)
Human readable
Easily rendered
Default across many repositories

Why? Assist other researchers to understand:

your dataset and its contents
provenance
licensing
how to interact with it

Recommended Content:

Title for dataset
Investigator / Contact person
Collection Date
Method used
Deviations from standards
Description of dataset
Filenames, file format, etc
License
Keywords
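
One possible way to keep READMEs consistent is to generate them from a fixed list of fields. The sketch below is an illustration only: the field names come from the recommended content above, while the function name `render_readme`, the heading layout and the example values are assumptions.

```python
# Recommended README fields, taken from the list above.
README_FIELDS = [
    "Title",
    "Investigator / Contact person",
    "Collection Date",
    "Method used",
    "Deviations from standards",
    "Description of dataset",
    "Filenames, file format, etc",
    "License",
    "Keywords",
]


def render_readme(values: dict[str, str]) -> str:
    """Render a markdown README; missing fields are flagged as TODO."""
    lines = [f"# {values.get('Title', 'TODO: Title')}", ""]
    for field in README_FIELDS[1:]:
        lines.append(f"## {field}")
        lines.append(values.get(field, "TODO"))
        lines.append("")
    return "\n".join(lines)


# Hypothetical example values.
example = render_readme({"Title": "Example shear test dataset", "License": "CC-BY 4.0"})
print(example)
```

Flagging missing fields as TODO, rather than dropping them, makes gaps in the documentation visible before the dataset is deposited.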

Licensing for Data - Reusable Data

Licensing can be complicated. Licensing considerations for datasets and for software/code may be very different.

Do I need a license?
Is there a specific requirement from your funding body or institution?
Attribution, Copyleft, Non-Commercial, No-Derivatives …
Problems with each type:
- Attribution stacking is when a derivative work must acknowledge all contributions to each work from which it was derived. Can become an unwieldy mess.
- Copyleft licenses prevent combining with data from different license types
- Non-commercial can be ambiguous as to what is a commercial use

Achieving FAIR Data Management

Data management should adhere to FAIR principles. How do we achieve that?

Findable:
- Data is available in Repositories. Does that make it easy to find data for a shear test on limestone powder?
Accessible: Open access?
Interoperable: Uses common, open formats for data
- Experimental apparatus often store data in proprietary binary formats - not always easy to convert to different formats
- Some simulation tools are commercial with proprietary formats
Reusable: Document the data carefully, Licence appropriately.

So now we have ‘AIR’ Data

Photo by Martin Adams on Unsplash

GrainDB

Adding the ‘F’ in FAIR

A well-executed Research Data Management Plan enables data to be Accessible, Interoperable and Reusable.

Making datasets (for specific materials or test equipment) Findable is a challenge:

Datasets may be spread across lots of different repositories which can make checking time consuming and difficult
Repository keywords are limited in number and the type of information they can store
Typically repositories do not have test or material specific metadata
- Need to browse each individual README and hope that it includes sufficient detail
How to store metadata?
What metadata to store?

What is GrainDB

We can consider a dataset as two separate components:

raw data: the actual recorded measurement/observation. (Typically time series data)
metadata: the description of all aspects of the experiment

GrainDB is an Experimental Measurements Database that collates and stores the relevant metadata for the material and test equipment that is missing from repositories, alongside a link to the full dataset which has been stored in an appropriate manner.

This becomes a searchable database of all available data, enabling the end-user to find detailed experimental datasets that match their needs
Enhancing dissemination
Provide high-quality machine-readable datasets suitable for use in AI/ML
Interface guides people on what information should be stored in accordance with best experimental practice

Real-world Challenge

(Bulk Powder Flow Characterisation Techniques, 2019, https://doi.org/10.1039/9781788016100-00064 )

Variety of test types in use to measure material properties

How to record details relating to all of these tests?
And others…

What is an Experiment?

Experiment or Measurement?

In a measurement, one performs parameter inference

Estimate quantities from observations

In an experiment, one performs hypothesis tests or model selection

Determine the best model for explaining observations

How to Record?

Databases provide efficient and safe multi-user access to data

Data records can be made “immutable”
Better handling of data redundancy and duplicate avoidance
Reduced curation requirements
Databases require a well-defined metadata structure
This provides a mechanism to make links between a material, a piece of equipment and a measurement
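
The linking mechanism can be sketched with a small relational example. This uses Python's built-in `sqlite3` purely as a stand-in for a full database server; the table names, column names, example rows and placeholder URL are all hypothetical and are not the GrainDB structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE material  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE equipment (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE measurement (
    id           INTEGER PRIMARY KEY,
    material_id  INTEGER NOT NULL REFERENCES material(id),
    equipment_id INTEGER NOT NULL REFERENCES equipment(id),
    dataset_url  TEXT NOT NULL
);
""")

# Hypothetical records: one material, one piece of equipment, one measurement.
conn.execute("INSERT INTO material (name) VALUES (?)", ("limestone powder",))
conn.execute("INSERT INTO equipment (name) VALUES (?)", ("Jenike shear cell",))
conn.execute(
    "INSERT INTO measurement (material_id, equipment_id, dataset_url) VALUES (1, 1, ?)",
    ("https://example.org/dataset",),
)


def find_datasets(material: str) -> list[tuple[str, str]]:
    """Return (equipment name, dataset URL) pairs for a given material."""
    rows = conn.execute(
        """SELECT e.name, m.dataset_url
           FROM measurement AS m
           JOIN material  AS mat ON mat.id = m.material_id
           JOIN equipment AS e   ON e.id = m.equipment_id
           WHERE mat.name = ?""",
        (material,),
    )
    return rows.fetchall()


print(find_datasets("limestone powder"))
```

Because each measurement row points at one material and one piece of equipment, a single query can answer a question such as "which datasets exist for limestone powder, and on what equipment?".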

Accessible via a website: www.graindb.org

Metadata Structure

Need to be prescriptive

Define carefully what information is required for all aspects

Avoid confusion about units, types, dimensions, etc.
What is minimum required data and what is optional data?
Enforce minimum quality of reported data

Currently no Ontology or common descriptive language for defining granular materials and their measurement

some examples in chemistry, physics

Entity Relationships

Material Description

Equipment Description

Capturing all Equipment?

Experiment Description

Result

Prototype Database

Database

Web interface to Relational Database

Individual page (record) for each material, equipment and experiment
- Provides a permanent citable description for each material/equipment/experiment
Searchable
Downloadable metadata records
REST API

Database Implementation

Database backend utilises PostgreSQL

Why?

PostgreSQL is a relational database that also supports NoSQL-style storage (e.g. JSON/JSONB columns), allowing advanced data types to be stored in hybrid tables
This is flexible and allows all the data for a specific category to be stored as one entry
Very good for handling complex data
Each category is defined by a schema
Provision of schemas will allow future implementations of POST API

Database Schemas

All schemas are versioned so we can:
- Update schemas over time adding new metadata
- Track which schema version an entry was made or edited with
Schemas allow greater use of templating for dynamic page generation
- All additional information to create the page can be found in the schema
- Easier maintenance
- Slightly less control over aesthetics
Available at: git.ecdf.ed.ac.uk/jmorrise/tusail-experimental-database-schemas
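
Tracking schema versions could look something like the sketch below. It assumes that schema identifiers follow the pattern `<category>/<name>.v<major>-<minor>-<patch>.json`, as the `$id` of the linear shear schema suggests; reading the three numbers as major, minor and patch is an assumption, and `parse_schema_id` is a hypothetical helper.

```python
import re

# Assumed identifier layout: <category>/<name>.v<major>-<minor>-<patch>.json
_ID_PATTERN = re.compile(
    r"^(?P<category>[^/]+)/(?P<name>.+)"
    r"\.v(?P<major>\d+)-(?P<minor>\d+)-(?P<patch>\d+)\.json$"
)


def parse_schema_id(schema_id: str) -> tuple[str, str, tuple[int, int, int]]:
    """Split a schema identifier into category, name and a version tuple."""
    match = _ID_PATTERN.match(schema_id)
    if match is None:
        raise ValueError(f"unrecognised schema id: {schema_id!r}")
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["category"], match["name"], version


category, name, version = parse_schema_id("equipment/linear_shear_jenike.v1-0-0.json")
print(category, name, version)
```

Storing the version as a tuple of integers means entries made with different schema versions can be compared and sorted directly.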

Example Schema - Linear Shear Test

{
  "$id": "equipment/linear_shear_jenike.v1-0-0.json",
  "$schema": "https://raw.githubusercontent.com/Vidminas/python-jsonschema-minmax/main/metaschema/minmax-metaschema.json",

  "type": "object",
  "properties": {
    "test_category": { "const": "Linear Shear" },
    "test_subcategory": { "const": "Jenike" },
    "class": { "label": "Test Regime", "const": "Quasi-static" },
    "rating": { "label": "Repeatability Rating", "const": 4 },
    "geometric_properties": { "$ref": "#/$defs/geometric_properties" },
    "measurement_parameters": { "$ref": "#/$defs/measurement_parameters" }
  },
  "required": ["geometric_properties", "measurement_parameters"],

  "$defs": {
    "geometric_properties": { "label": "Geometric Properties",
      "type": "object",
      "properties": {
        "cell_diameter": { "label": "Cell Diameter", "type": "number", "minimum": 0, "units": ["mm"] },
        "base_height": { "label": "Base Height", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "ring_height": { "label": "Ring Height", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "wall_thickness": { "label": "Wall Thickness", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "max_translation_speed": { "label": "Max. Translation Speed", "type": "number", "minimum": 0, "units": [ "mm/s" ] },
        "cell_material": { "label": "Cell Material", "type": "string" },
        "stress_application": { "label": "Vertical Stress Application Method", "enum": [ "Constant Mass", "Servo-Controlled" ] }
      },
      "additionalProperties": false
    },
    "measurement_parameters": { "label": "Measurement Properties",
      "type": "object",
      "properties": {
        "force_accuracy": { "label": "Force Accuracy", "type": "number", "minimum": 0, "units": [ "Pa" ] },
        "force_resolution": { "label": "Force Resolution", "type": "number", "minimum": 0, "units": [ "Pa" ] },
        "displacement_accuracy": { "label": "Displacement Accuracy", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "displacement_resolution": { "label": "Displacement Resolution", "type": "number", "minimum": 0, "units": [ "mm" ] }
      },
      "additionalProperties": false
    }
  }
}
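
To illustrate how a schema like this can enforce a minimum quality of reported data, the sketch below checks a record against a trimmed copy of the `geometric_properties` definition. It is a hand-rolled check covering only a few keywords (`type`, `minimum`, `enum`, `additionalProperties`), not a full JSON Schema validator and not necessarily how GrainDB validates entries; the custom `label` and `units` keywords are left out, and the example values are hypothetical.

```python
from numbers import Real

# Trimmed copy of the "geometric_properties" definition from the schema above.
GEOMETRIC_PROPERTIES = {
    "type": "object",
    "properties": {
        "cell_diameter": {"type": "number", "minimum": 0},
        "cell_material": {"type": "string"},
        "stress_application": {"enum": ["Constant Mass", "Servo-Controlled"]},
    },
    "additionalProperties": False,
}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    properties = schema.get("properties", {})
    for key, value in record.items():
        rule = properties.get(key)
        if rule is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"{key}: unexpected property")
            continue
        if rule.get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                errors.append(f"{key}: expected a number")
                continue
            if "minimum" in rule and value < rule["minimum"]:
                errors.append(f"{key}: below minimum {rule['minimum']}")
        elif rule.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key}: expected a string")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key}: not one of {rule['enum']}")
    return errors


# Hypothetical records: one acceptable, one with two problems.
ok = {"cell_diameter": 95.0, "cell_material": "aluminium", "stress_application": "Constant Mass"}
bad = {"cell_diameter": -1, "colour": "red"}
print(validate(ok, GEOMETRIC_PROPERTIES))
print(validate(bad, GEOMETRIC_PROPERTIES))
```

Returning every problem at once, rather than stopping at the first, suits a web form where the user should see all the fields that need correcting.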

Database API

RESTful API implemented using Swagger and OpenAPI Specification

An interface that two computer systems use to exchange information securely over the internet
Nicely documented, interactive API documentation
www.graindb.org/api/ui

Database Prototype Demo

Concluding Remarks

We haven’t always done a very good job of preserving our datasets for the future
- but we have the tools to do so
We need to be able to find all datasets easily
- A searchable database will make this a much easier task
A well defined set of metadata schemas will make recording experimental measurements for lots of different pieces of equipment more straightforward
Having detailed stored records with unique IDs (URLs) may make dissemination easier as well
- These could easily be cited in a publication, much like a DOI

Thank You!

Any Questions?

Email: J.Morrissey@ed.ac.uk