GrainDB - Making Experimental Measurements FAIR

Challenges in Open Data

John P. Morrissey

School of Engineering, University of Edinburgh

Wednesday, 4th of September, 2024

Who am I…

A little about me:

  • Research Fellow at The University of Edinburgh
  • Member of the Granular & Geomechanical Processes Research Group
  • Focus on Granular Materials and Particulate Mechanics
    • Experimental Characterisation
    • DEM Simulations
    • Data Analysis
  • Some supervision of undergraduate and postgraduate students


Granular Mechanics & Industrial Infrastructure Research Group

And why am I talking to you about FAIR Data?

We need high quality experimental measurements in research and industry

  • Material characterisation for parameter estimation
  • For calibrating simulations
  • For validation of simulations
  • To introduce into large AI/ML/Data-driven models

And what is FAIR data anyway?!

Reminder - Research Data

Photo by Kelly Sikkema on Unsplash

Reproducibility & Replicability Crisis in Science

For the past two decades there has been significant discussion on the reproducibility crisis in science

  • A pattern of scientists being unable to obtain the same results as previous researchers

Distinction between Reproducibility & Replicability

  • Reproducibility: Can you get the same answers as me when you analyze my data?

  • Replicability: Can you get the same answers when you do my experiment and collect your own data?

AI will not make this better!

Dependent on the quality of the data provided to it.

[Baker, 2015, Nature]

Research Data Life-cycle

Managing research data is an often forgotten aspect of research. Traditionally datasets were quite small:

  • Data was not widely shared beyond colleagues
  • Often only documented or mentioned in publications

Significant changes in recent decades:

  • The amount of research data generated has grown
    • much larger datasets (e.g. imaging, DEM, etc.)
  • Sharing & collaboration encouraged (impact)

Ad-hoc management no longer sufficient

  • Poorly documented datasets are of limited use
  • Are they preserved beyond the current cloud storage subscription?

  • Create: Collect, Generate, Process, Clean

  • Use: Manipulate, Analyse, Visualise

  • Re-Use: By colleagues or starting point for new dataset

Research Data Life-cycle

Data management best practices involve the entire data lifecycle from project start to end

  • Plan: Data Management Plan in grant submissions.

  • Create: Experiment, Simulation, Survey, Merge, etc…

  • Document: Describe the data collected in detail. Sooner rather than later!

  • Use: Analyse/Discover/Collaborate. Document the process.

  • Preserve: Store for future use. Version Control. Databases & Archives.

  • Share: Essential for translation of results into knowledge. Open Data? Repositories?

  • Re-Use: Collaborate / Derive / Develop. Teach. Policy.

What is FAIR Data?

FAIR is an acronym for Findable, Accessible, Interoperable and Reusable: the principles that should apply to scientific data management and guardianship.

  • Findable: The first part of making data re-useable is to make the data findable. Detailed and accurate metadata is key

  • Accessible: Data could be openly available or it could require prior authentication and authorisation

  • Interoperable: Data needs to be able to be used in different programs or workflows

  • Reusable: Well defined data is essential as it makes it easier to understand and therefore use, combine and/or extend the dataset

“FAIR Principles”. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

FAIR Guiding Principles

Guidance on how to assess the “fairness” of your data: https://bit.ly/yourFIP

FAIR Data - Machine Actionable

The FAIR principles also emphasise machine-actionability

  • The capacity of computational systems to find, access, interoperate, and reuse data with minimal or no human intervention
  • We are increasingly dependent on computers as datasets grow and we move towards automation (AI/Deep Learning/etc.)

Machine readable and digitally accessible are not the same thing!

  • For example, traditional word processing documents and PDF files are easily read by humans but typically are difficult for machines to interpret.

  • A machine readable format is a structured, standardised file format (not free English text) that can be read automatically by a web browser or computer system (e.g. XML, JSON).
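As a minimal sketch of the difference, the same measurement is written below once as a JSON record and once as a sentence. The field names and values are invented purely for illustration.

```python
import json

# Hypothetical measurement record; field names and values are illustrative only.
record_json = (
    '{"material": "limestone powder", "test": "shear", '
    '"normal_stress": {"value": 5.0, "units": "kPa"}}'
)

# The same information written for a human reader.
record_text = "A shear test was run on limestone powder at a normal stress of 5 kPa."

# Machine readable: one standard-library call yields typed, addressable fields.
record = json.loads(record_json)
stress = record["normal_stress"]["value"]
units = record["normal_stress"]["units"]
print(f"{record['material']}: {stress} {units}")

# Extracting the same number from record_text would need custom text parsing
# that breaks as soon as the wording of the sentence changes.
```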

Reality

Photo by Vitaly Gariev on Unsplash

TUSAIL

TUSAIL (www.tusail.eu) is an Innovative Training Network funded by Horizon 2020

  • Training 15 ESRs over 4 years through a combination of PhD research, scientific training and industrial secondments
  • Comprising 15 academic and industrial partners and led by the University of Edinburgh

“Training in Upscaling particle Systems: Advancing Industry across Length-scales”

Where to Store Datasets - Accessible Data

Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN

  • Requirement in many EU funded projects to make data available here
  • Free
  • Citable - every dataset receives a DOI
  • Maximum 50GB per dataset
  • Not just limited to datasets (papers, posters, presentations, etc.)

TUSAIL Zenodo Community - Accessible Data

TUSAIL Community on Zenodo

  • Outputs from the TUSAIL project
  • Will accept deposits from the wider community who wish to share particle/granular data in a common forum
  • Start creating a centralised source for datasets

Dataset Preparation (A very brief overview)

Photo by RDNE Stock project

Enhancing Interoperability

  • Do not use proprietary formats for data or metadata files

    • Open formats where possible
  • Use machine readable formats

    • ASCII, CSV, JSON, XML or (open) scientific formats like HDF5 and netCDF
  • Structured data is preferred over unstructured

  • Avoid image only datasets if possible

    • difficult to extract information from images
  • Use sensible variable names

  • Avoid using commas as decimal separators in numbers, as these are easily mistaken for field delimiters in CSV files

    • Format numbers as 675454453.00 or 0.00007654
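The decimal separator problem can be reproduced in a few lines. This is a small sketch using Python's standard csv module; the column names and values are made up for the example.

```python
import csv
import io

# Illustrative rows only: two columns, normal stress and shear stress.
good = "normal_stress,shear_stress\n5.0,3.25\n"
bad = "normal_stress,shear_stress\n5,0,3,25\n"  # same values with decimal commas

def read_rows(text):
    """Parse CSV text and return the data rows with the header skipped."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[1:]

good_rows = read_rows(good)
bad_rows = read_rows(bad)

print(good_rows)  # two fields per row, as intended
print(bad_rows)   # four fields: the decimal commas were read as delimiters
```

With decimal commas the parser cannot tell a separator inside a number from a separator between columns, so a two-column row silently becomes a four-column row.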

Metadata - The README - Interoperable and Re-useable

Need to include a README file that describes the dataset

  • Suggestion: use markdown format (*.md)
  • Human readable
  • Easily rendered
  • Default across many repositories

Why? Assist other researchers to understand:

  • your dataset and its contents
  • provenance 
  • licensing 
  • how to interact with it

Recommended Content:

  • Title for dataset
  • Investigator / Contact person
  • Collection Date
  • Method used
  • Deviations from standards
  • Description of dataset 
  • Filenames, file format, etc
  • License
  • Keywords
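One way to make sure no recommended item is forgotten is to generate the README skeleton from a fixed list of fields. The helper below is a hypothetical sketch, not part of any existing tool; the section names follow the recommended content list and the example values are placeholders.

```python
# Hypothetical helper: section names mirror the recommended content list.
README_FIELDS = [
    "Title",
    "Investigator / Contact person",
    "Collection Date",
    "Method used",
    "Deviations from standards",
    "Description of dataset",
    "Filenames and file formats",
    "License",
    "Keywords",
]

def render_readme(values):
    """Return markdown README text; fields without a value are marked TODO."""
    title = values.get("Title", "TODO")
    lines = [f"# {title}", ""]
    for field in README_FIELDS[1:]:
        lines.append(f"## {field}")
        lines.append(values.get(field, "TODO"))
        lines.append("")
    return "\n".join(lines)

readme = render_readme({
    "Title": "Example shear test dataset",  # placeholder values
    "License": "CC-BY 4.0",
})
print(readme)
```

Any section still marked TODO is easy to spot before the dataset is deposited.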

Licensing for Data - Reusable Data

Licensing can be complicated. Licensing considerations for datasets and for software/code may be very different.

  • Do I need a license?
  • Is there a specific requirement from your funding body or institution?
  • Attribution, Copyleft, Non-Commercial, No-Derivatives …
  • Problems with each type:
    • Attribution stacking is when a derivative work must acknowledge every contributor to each work from which it was derived. Can become an unwieldy mess.
    • Copyleft licenses prevent combining with data from different license types
    • Non-commercial can be ambiguous as to what is a commercial use

Achieving FAIR Data Management

Data management should adhere to FAIR principles. How do we achieve that?

  • Findable:

    • Data is available in Repositories. Does that make it easy to find data for a shear test on limestone powder?
  • Accessible: Open access?

  • Interoperable: Uses common, open formats for data

    • Experimental apparatus often store data in proprietary binary formats - not always easy to convert to different formats
    • Some simulation tools are commercial with proprietary formats
  • Reusable: Document the data carefully, Licence appropriately.

So now we have ‘AIR’ Data

Photo by Martin Adams on Unsplash

GrainDB

Adding the ‘F’ in FAIR

A well-executed Research Data Management Plan enables data to be Accessible, Interoperable and Reusable.

Making datasets (for specific materials or test equipment) Findable is a challenge:

  • Datasets may be spread across lots of different repositories which can make checking time consuming and difficult
  • Repository keywords are limited in number and the type of information they can store
  • Typically repositories do not have test or material specific metadata
    • Need to browse each individual README and hope that it includes sufficient detail
  • How to store metadata?
  • What metadata to store?

Photo by Jubbar J. on Unsplash

What is GrainDB

We can consider a dataset as two separate components:

  • raw data: the actual recorded measurement/observation. (Typically time series data)
  • metadata: the description of all aspects of the experiment

GrainDB is an Experimental Measurements Database that collates and stores the relevant metadata for the material and test equipment that is missing from repositories, alongside a link to the full dataset which has been stored in an appropriate manner.

  • This becomes a searchable database of all available data, enabling the end-user to find detailed experimental datasets that match their needs

  • Enhancing dissemination

  • Provide high-quality machine-readable datasets suitable for use in AI/ML

  • Interface guides people on what information should be stored in accordance with best experimental practice

Real-world Challenge

(Bulk Powder Flow Characterisation Techniques, 2019, https://doi.org/10.1039/9781788016100-00064 )

Variety of test types in use to measure material properties

  • How to record details relating to all of these tests?
  • And others…

What is an Experiment?

Experiment or Measurement?

In a measurement, one performs parameter inference

  • Estimate quantities from observations

In an experiment, one performs hypothesis tests or model selection

  • Determine the best model for explaining observations

How to Record?

Databases provide efficient and safe multi-user access to data

  • Data records can be made “immutable”
  • Better handling of data redundancy and duplicate avoidance
  • Reduced curation requirements
  • Databases require a well-defined metadata structure
  • This provides a mechanism to make links between a material, a piece of equipment and a measurement
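The linking mechanism can be sketched with three tables and foreign keys. This is an illustration only: it uses Python's built-in sqlite3 module so that it runs without a database server, and the table names, column names and URL are assumptions rather than the actual GrainDB structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces links when asked
conn.executescript("""
CREATE TABLE material  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE equipment (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE measurement (
    id           INTEGER PRIMARY KEY,
    material_id  INTEGER NOT NULL REFERENCES material(id),
    equipment_id INTEGER NOT NULL REFERENCES equipment(id),
    dataset_url  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO material (name) VALUES ('limestone powder')")
conn.execute("INSERT INTO equipment (name) VALUES ('Jenike shear cell')")
conn.execute(
    "INSERT INTO measurement (material_id, equipment_id, dataset_url) "
    "VALUES (1, 1, ?)",
    ("https://example.org/dataset",),  # placeholder link to the full dataset
)

# Find every measurement recorded for a given material.
rows = conn.execute("""
    SELECT m.name, e.name, x.dataset_url
    FROM measurement AS x
    JOIN material  AS m ON m.id = x.material_id
    JOIN equipment AS e ON e.id = x.equipment_id
    WHERE m.name = ?
""", ("limestone powder",)).fetchall()
print(rows)

# A measurement that points at a non-existent material is rejected.
try:
    conn.execute(
        "INSERT INTO measurement (material_id, equipment_id, dataset_url) "
        "VALUES (99, 1, 'x')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Because each material and each piece of equipment is stored once and referenced by id, the same description is reused by every measurement instead of being retyped.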

Accessible via a website: www.graindb.org

Metadata Structure

Need to be prescriptive

Define carefully what information is required for all aspects

  • Avoid confusion about units, types, dimensions, etc.
  • What is minimum required data and what is optional data?
  • Enforce minimum quality of reported data

Currently no Ontology or common descriptive language for defining granular materials and their measurement

  • some examples in chemistry, physics

Entity Relationships

Material Description

Equipment Description

Capturing all Equipment?

Experiment Description

Result

Prototype Database

Database

Web interface to Relational Database

  • Individual page (record) for each material, equipment and experiment
    • Provides permanent citable description for material/equipment/experiment
  • Searchable
  • Downloadable metadata records
  • REST API

Database Implementation

Database backend utilises PostgreSQL

Why?

  • PostgreSQL is a relational (SQL) database that also supports NoSQL-style data types such as JSON, so advanced data types can be stored in hybrid tables
  • This is flexible and allows all the data for a specific category to be stored as one entry
  • Very good for handling complex data
  • Each category is defined by a schema
  • Provision of schemas will allow future implementations of POST API

Database Schemas

  • All schemas are versioned so we can:

    • Update schemas over time adding new metadata
    • Track which schema version an entry was made or edited with
  • Schemas allow greater use of templating for dynamic page generation

    • All additional information to create the page can be found in the schema
    • Easier maintenance
    • Slightly less control over aesthetics
  • Available at: git.ecdf.ed.ac.uk/jmorrise/tusail-experimental-database-schemas
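A sketch of how a version could be read back from a schema identifier is given below. It assumes identifiers follow the pattern seen in this deck's example schema, `equipment/linear_shear_jenike.v1-0-0.json`, read as category, name and a three-part version; the helper names are hypothetical.

```python
import re

# Assumed pattern: "<category>/<name>.v<major>-<minor>-<patch>.json"
SCHEMA_ID = re.compile(
    r"^(?P<category>[^/]+)/(?P<name>.+)"
    r"\.v(?P<major>\d+)-(?P<minor>\d+)-(?P<patch>\d+)\.json$"
)

def parse_schema_id(schema_id):
    """Split a schema id into (category, name, (major, minor, patch))."""
    match = SCHEMA_ID.match(schema_id)
    if match is None:
        raise ValueError(f"unrecognised schema id: {schema_id!r}")
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["category"], match["name"], version

def latest(schema_ids):
    """Return the id with the highest version, compared numerically."""
    return max(schema_ids, key=lambda s: parse_schema_id(s)[2])

ids = [
    "equipment/linear_shear_jenike.v1-0-0.json",
    "equipment/linear_shear_jenike.v1-2-0.json",   # hypothetical later versions
    "equipment/linear_shear_jenike.v1-10-0.json",
]
print(parse_schema_id(ids[0]))
print(latest(ids))
```

Comparing versions as tuples of integers avoids the string-sorting trap where "1-10-0" would sort before "1-2-0".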

Example Schema - Linear Shear Test

{
  "$id": "equipment/linear_shear_jenike.v1-0-0.json",
  "$schema": "https://raw.githubusercontent.com/Vidminas/python-jsonschema-minmax/main/metaschema/minmax-metaschema.json",

  "type": "object",
  "properties": {
    "test_category": { "const": "Linear Shear" },
    "test_subcategory": { "const": "Jenike" },
    "class": { "label": "Test Regime", "const": "Quasi-static" },
    "rating": { "label": "Repeatability Rating", "const": 4 },
    "geometric_properties": { "$ref": "#/$defs/geometric_properties" },
    "measurement_parameters": { "$ref": "#/$defs/measurement_parameters" }
  },
  "required": ["geometric_properties", "measurement_parameters"],

  "$defs": {
    "geometric_properties": { "label": "Geometric Properties",
      "type": "object",
      "properties": {
        "cell_diameter": { "label": "Cell Diameter", "type": "number", "minimum": 0, "units": ["mm"] },
        "base_height": { "label": "Base Height", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "ring_height": { "label": "Ring Height", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "wall_thickness": { "label": "Wall Thickness", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "max_translation_speed": { "label": "Max. Translation Speed", "type": "number", "minimum": 0, "units": [ "mm/s" ] },
        "cell_material": { "label": "Cell Material", "type": "string" },
        "stress_application": { "label": "Vertical Stress Application Method", "enum": [ "Constant Mass", "Servo-Controlled" ] }
      },
      "additionalProperties": false
    },
    "measurement_parameters": { "label": "Measurement Properties",
      "type": "object",
      "properties": {
        "force_accuracy": { "label": "Force Accuracy", "type": "number", "minimum": 0, "units": [ "Pa" ] },
        "force_resolution": { "label": "Force Resolution", "type": "number", "minimum": 0, "units": [ "Pa" ] },
        "displacement_accuracy": { "label": "Displacement Accuracy", "type": "number", "minimum": 0, "units": [ "mm" ] },
        "displacement_resolution": { "label": "Displacement Resolution", "type": "number", "minimum": 0, "units": [ "mm" ] }
      },
      "additionalProperties": false
    }
  }
}
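To show what such a schema buys in practice, the sketch below checks two records against a trimmed copy of the geometric properties definition. It is a hand-rolled check covering only the keywords used here, not a full JSON Schema validator, and it ignores the custom "label" and "units" keywords. The record values are invented.

```python
# Trimmed copy of the "geometric_properties" definition from the schema.
GEOMETRIC_PROPERTIES = {
    "type": "object",
    "properties": {
        "cell_diameter": {"type": "number", "minimum": 0, "units": ["mm"]},
        "cell_material": {"type": "string"},
        "stress_application": {"enum": ["Constant Mass", "Servo-Controlled"]},
    },
    "additionalProperties": False,
}

def check(record, schema):
    """Return a list of problems; an empty list means the record passed.

    Handles only: type (number/string), minimum, enum, additionalProperties.
    """
    problems = []
    props = schema["properties"]
    if schema.get("additionalProperties") is False:
        for key in record:
            if key not in props:
                problems.append(f"{key}: unexpected property")
    for key, rule in props.items():
        if key not in record:
            continue  # none of these fields is marked as required
        value = record[key]
        if rule.get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key}: expected a number")
                continue
            if "minimum" in rule and value < rule["minimum"]:
                problems.append(f"{key}: below minimum {rule['minimum']}")
        elif rule.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{key}: expected a string")
        if "enum" in rule and value not in rule["enum"]:
            problems.append(f"{key}: must be one of {rule['enum']}")
    return problems

# Illustrative records only.
good = {"cell_diameter": 95.0, "cell_material": "aluminium",
        "stress_application": "Constant Mass"}
bad = {"cell_diameter": -1, "stress_application": "Spring", "colour": "red"}

print(check(good, GEOMETRIC_PROPERTIES))
print(check(bad, GEOMETRIC_PROPERTIES))
```

The second record is rejected for a negative diameter, an unknown stress application method and an unexpected property, which is the kind of minimum quality enforcement a prescriptive schema is meant to provide.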

Database API

RESTful API implemented using Swagger and OpenAPI Specification

  • An interface that two computer systems use to exchange information securely over the internet
  • Nicely documented, interactive API documentation
  • www.graindb.org/api/ui
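A client for a REST API of this kind might look like the sketch below. The base address is assumed from the documentation URL (including the https scheme), and the "materials" resource and "name" filter are hypothetical placeholders; the real paths and parameters should be taken from the interactive API documentation.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.graindb.org/api"  # assumed from the documentation address

def build_search_url(resource, **filters):
    """Build a query URL. Resource names and filter keys are hypothetical."""
    url = f"{BASE_URL}/{urllib.parse.quote(resource)}"
    query = urllib.parse.urlencode(filters)
    return f"{url}?{query}" if query else url

def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

url = build_search_url("materials", name="limestone powder")
print(url)

# Not executed here because the endpoint above is a placeholder:
# records = fetch_json(url)
```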

Database Prototype Demo

Concluding Remarks

  • We haven’t always done a very good job of preserving our datasets for the future

    • but we have the tools to do so
  • We need to be able to find all datasets easily

    • A searchable database will make this a much easier task
  • A well defined set of metadata schemas will make recording experimental measurements for lots of different pieces of equipment more straightforward

  • Having detailed stored records with unique ids (urls) may make dissemination easier as well

    • These could easily be included in a publication, much like a DOI

Thank You!

Any Questions?

Email: J.Morrissey@ed.ac.uk