Envisioning and proposing Data Mesh for Research Data Management in the Engineering Sciences

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Mario Moser, Tobias Hamann, Anas Abdelrazeq, Robert Schmitt

Abstract

In Research Data Management (RDM), data publishing infrastructures play a crucial role in efficient data provisioning and reuse. Data repositories (generic or discipline-specific) serve this purpose. Nevertheless, they focus on technical aspects without including sociological elements; they struggle to cover the heterogeneous nature of research data (formats, sources); and they are typically centralised, leading to increased complexity in operation and maintenance. In industrial data management, the Data Mesh concept has been introduced as a decentralised, socio-technical approach. Data is handled as a product for increased usability, ownership is shifted to the respective domain experts, and federated governance achieves standardisation while allowing discipline-specific decisions. Based on a literature review, the distributed characteristics and further requirements of (engineering) research are mapped to the Data Mesh concept. In this envisioning, Data Mesh and its design principles overall appear appropriate as a research data publishing infrastructure. A high-level architecture is presented that leverages existing RDM components. However, as differences in the details become apparent, items for further adaptation of Data Mesh to RDM are pointed out.

Comments

Invited Review Comment #220 Anonymous @ 2025-10-07 17:32

The authors propose data mesh as a central element of RDM that is to be adapted from industrial applications to the engineering domain. The idea is based on a decentralized structure in which the domain expert is responsible for structuring, publishing and maintaining the data set, integrated into a central platform that enables FAIR access to these objects.

The challenge of the paper is that it tries to discuss the concept on a very abstract level, primarily summarizing aspects from other papers. It is not fully clear what the new contribution of the paper actually is, and Chapter 5 is very difficult to distinguish from the literature review, even repeating many of the points already discussed there (e.g. the four principles are included in both).

In addition, the concept of RO-Crates (https://www.researchobject.org/ro-crate/) is not discussed at all. RO-Crates are a way to bundle decentralized research objects, either by adding files directly to the crate or by linking other resources such as git repositories, time series databases, etc. ROHub (https://www.rohub.org/), a central platform for handling these research objects that is already integrated as a service in the EOSC, provides a Python API, a web frontend and a SPARQL endpoint, together with a significant amount of storage resources for handling these RO-Crates. IMHO, the paper proposes, at a very general conceptual level, a platform and decentralized approach that is already well established and implemented.

As such, it is IMO important to highlight the new information the paper provides.

In addition, there are some minor formal comments.

· References at the end of a sentence should be placed before the period. This is currently inconsistent: sometimes “sentence [1].” and sometimes “sentence.[1]”

· Sometimes the article (the, a) is missing, e.g. lines 63, 144, 157, ...

· Check the sentences; sometimes a comma, an “s”, or a “to” is missing, e.g. lines 19, 280, 335, 429, 432.

· I’m also not sure whether the German scientific system really needs to be explained in such a level of detail, rather than just stating the general challenge. For example, IMO a detailed figure on the different engineering domains in the DFG, or information about the WissZeitVG, is not required. I have difficulties understanding the link between these detailed descriptions and the subsequently proposed data mesh; IMO the message is rather general and could be summarized in a much shorter paragraph.

· The description of data in engineering science is also very general but at the same time very long. The discussion of the general principles (FAIR, fourth paradigm, data life cycle, ...) does not go beyond the current state of the art and, at the presented level, repeats/cites many statements from other papers. IMO these general statements should therefore be condensed and included only where needed to support the message of the paper.

· The question of interoperability is not clearly discussed in the paper. On the top level, with very general metadata (authors, license, keywords, title, ...), a general metadata standard can be established and used for the complete platform (as e.g. defined with RO-Crates). However, on the level of the actual scientific metadata (e.g. what was the material composition, what were the test conditions/parameters, ...), interoperability is no longer so straightforward. This is discussed in lines 419ff, but it is not completely clear how it is supposed to be handled in practice, in particular since even a single engineering domain might have completely different specialization areas (as you have also highlighted in the introduction). For a practical implementation, it would be relevant to discuss the proposed structure of the scientific metadata in more detail.

· I’m not fully sure why the concept of maintenance is an open problem (lines 430ff). IMO as soon as data is generated and archived (at the end of a project), maintenance is only needed in very few cases. If the data is restructured (e.g. in a new project), this would then just be a new dataset that is derived from the old one. As far as I understand the concept, this is a feature of a distributed data mesh rather than a challenge.

· Lines 478ff. Why is it not possible to just link data (and that includes software) from different resources into a single RO?

· Why is the data mesh only related to public data (lines 480ff)? IMO only the metadata has to be public; the actual data (or even the scientific metadata) can be placed under access control in accordance with the IP of the data provider. I also do not understand the distinction between continuous data (such as weather data) and project data (with a project end). Even data that is continuous in time can be broken down into individual sets that can be published and referenced in a research object, or an RO can be created with a link to the actual data that is accessible via a REST API.

Invited Review Comment #217 Jane Wyngaard @ 2025-08-28 22:41

The whole paper needs an edit by someone whose first language is English. There are many grammatical errors, and many paragraphs repeat what is said elsewhere in buzzword-filled, repetitive sentences.

Chapters 1-4 are particularly bad.

Ch1:
States that the paper concludes that RDM, while brownfield, could and should use an industrial data mesh concept, but fails to clearly articulate what that is.

Ch2:

There's a good section talking through the past with data warehouses and lakes and ecosystems in real terms, but then the authors return to buzzwords and fail to make clear the distinction between, for instance, data ecosystems and data meshes.

There's an interesting idea being explored, but the context and concept are not described clearly. How exactly a data mesh approach would differ from others, and even how it would practically bring value, is not made clear. All the discussed value and outcomes sound great, but it's not at all clear how they could be achieved. While it's appropriate for this section not to discuss low-level details yet (these finally come in Ch5), the chapter lacks a structure and wording that would indicate to the reader that these details are coming. It fails to even hint at the how at an extremely high level, which would give the reader a preview of where the paper is going and help them start thinking about this idea and these terms in the right frame of reference.

Ch4 is supposed to be the methodology, but it doesn't describe a methodology and simply recaps the previous sections of the paper again.

Ch5: Finally expands on the definition and the how of the data mesh idea, putting it in terms that make the concept tractable.

They emphasise that the gain lies primarily in increasing findability and in allowing the domain/source/host/expert closest to the data to define the schema, tooling, metadata, and even analysis and product generation, rather than alternatives such as trying to enforce a global universal standard. The authors go into a helpful level of detail on how this would lead to all the gains originally promised in the introduction.

At various points in the paper, reference is made to the fact that the data mesh concept is not new and has been explored elsewhere, but these other explorations are never described. They should be added to give this paper more depth and value if it is to become the current reference for applying data mesh to research, and specifically engineering research, data management.

I would like to see this revised and published; it holds interesting ideas with potential value but needs to be restructured for that value to be realised.

Metadata

Published: 2025-04-28
Last Updated: 2025-04-28
License: Creative Commons Attribution 4.0
Subjects: Data Infrastructure
Keywords: Data Mesh, Research Data Management (RDM), Engineering Data, Engineering Sciences, Decentralised Data Architecture, Data Infrastructure, Data Publishing, Data Reuse