The steadily growing importance of research data management and the technologisation of almost all areas of the humanities, and especially of the natural sciences, are leading to ever-increasing volumes of research data. While this is a very welcome development, not least for the reusability of existing research results, it also poses technical and organisational challenges for research data managers and researchers. These include growing demands on technology and personnel in all phases of the research data lifecycle, legal and ethical questions, and quite simply the risk of losing the overview.
Against this background, the Working Group for Research Data Management in Baden-Württemberg (AK FDM), in collaboration with the state initiative for research data management bwFDM, has published a guideline with practical approaches for reducing the necessary technical and organisational resources through digital data economy.
Challenges in digital data economy
Decision-making authority
It must be clear who has the authority to decide on the data's use. Formally, this authority usually lies with the head of a research project, but academic reality is less clear-cut: the actual work, and with it the knowledge about the significance of the data, often lies with other academics. Furthermore, copyrights, rights of subsequent use and possible ownership rights to the data must be taken into account. It is therefore advisable to reach a joint decision at an early stage, documented in writing and with a clear allocation of roles.
Finality
Deleted data is usually irretrievably lost. Even though there are technical options for recovering deleted data in special cases, these are time-consuming, offer no guarantee of success and are hardly practicable, especially for large amounts of data. This irreversibility requires a well-considered, conscious decision to delete data. Such decisions should by no means be made lightly or under time pressure.
Data formats
Data formats differ in terms of their openness, long-term accessibility and space requirements. Compressing data helps to reduce storage requirements, but it must always be ensured that the data can be decompressed again in the future. In addition, processes with a high compression rate for images or videos are almost always lossy. It is therefore important to use formats that work either without loss or with an acceptable loss of information.
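Where lossless compression is used, the ability to restore the data can be verified programmatically before the original is deleted. The following is a minimal sketch using Python's standard gzip and hashlib modules (the function names are illustrative): it compresses a file and only trusts the archive after a successful round-trip checksum comparison.

```python
import gzip
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compress_and_verify(src: Path) -> Path:
    """Gzip-compress src and confirm the data survives a round trip."""
    dst = src.with_suffix(src.suffix + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        while chunk := fin.read(1 << 20):
            fout.write(chunk)
    # Decompress the archive and compare checksums before relying on it.
    h = hashlib.sha256()
    with gzip.open(dst, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != sha256_of(src):
        raise RuntimeError(f"Round trip failed for {src}")
    return dst
```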
Technical obstacles
Storage systems are rarely completely transparent to their users. With cloud systems, for example, it is generally impossible to determine whether internal copies of the data are kept on distributed servers, or whether and for how long deleted files remain in backup systems. This is particularly problematic when the deletion of sensitive data has been promised. Here, service operators are called upon to create appropriate transparency or to offer technical support, such as a garbage-collection service. Knowing about storage redundancy allows researchers to make informed decisions about whether or not to store copies of their data elsewhere.
Feasibility
Research is very individual, and so is the management of research data. Data generated during the various steps of the research process differs greatly in the effort required to reproduce it, which results in different options for reducing data or deleting it early in the research process. For example, aggregated data sets or text-analytical data can be reproduced quickly and easily and can therefore be deleted at an early stage or not even stored separately. Raw data and data from time-intensive simulations, on the other hand, offer hardly any potential for reduction, because reproducing them is complex and time-consuming. For this reason, general regulations on data economy can only be rudimentary; in all cases, case-by-case considerations with individual solutions are advisable. It makes sense to refer to the subject-specific recommendations of, for example, the National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur; NFDI) and to consult data stewards with the relevant subject expertise.
Awareness
Digital data is often not visible. In contrast to file folders on an office shelf, it is visually intangible, stored on a local computer, on remote IT infrastructure or in a cloud. It therefore makes sense to create automated reminder routines wherever possible, for example to notify users when retention periods have expired and to resubmit data sets for a new decision on their retention or deletion.
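Such a reminder routine can be very simple. The following minimal sketch assumes that retention deadlines are recorded in a small CSV inventory (the file name and column names are illustrative) and lists every data set due for a new retention decision:

```python
import csv
from datetime import date

# Hypothetical inventory file, one row per data set with its retention deadline:
# dataset,owner,retention_until
# raw_ms_2021,mayer,2026-01-31
INVENTORY = "retention_inventory.csv"

def due_for_review(today: date | None = None) -> list[dict]:
    """Return all data sets whose retention period has expired."""
    today = today or date.today()
    with open(INVENTORY, newline="") as f:
        return [row for row in csv.DictReader(f)
                if date.fromisoformat(row["retention_until"]) <= today]

if __name__ == "__main__":
    for row in due_for_review():
        # In practice this might send an e-mail; printing keeps the sketch simple.
        print(f"{row['dataset']} (owner: {row['owner']}) is due for a retention decision")
```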
Principles
- How many copies of the data sets are required? What safeguarding and backup functions do the storage systems in use offer? An automatic backup by a cloud service can, for example, make a separate copy on another storage system unnecessary.
- If data for your own research is obtained from external sources, simple redundancy is sufficient for local storage, since the data can be retrieved again from the source if necessary.
- Does it make sense to publish data on different systems, for example on an internationally recognised publication server and on your own university's local repository? This may be necessary as a backup if the external publication system's sustainability is not guaranteed or if you are worried about the data potentially disappearing behind a paywall in the future.
- What resolution is required for the results of data-generating processes, and can compression algorithms be useful? A lower resolution significantly reduces the storage required.
- How should older data that has become redundant or obsolete due to new data or versions be handled?
- Is it necessary to save the complete data set at every intermediate stage of data processing to ensure traceability? If the steps taken are documented and can be reapplied to the original data set, interim data does not have to be saved (see the sketch after this list).
- When working with subsets of data - e.g. for calculations or simulations - it must be considered whether each subset must be saved individually or whether the subset can be generated again from the data with little effort.
- When sharing data with project participants, it must be determined whether they always receive a complete copy or whether data is shared in a central location.
- How should unusable data be handled? If simulations or data processing yield results that cannot be used for further research or if measurements are invalid, a decision must be made as to whether these data should be deleted directly or retained for documentation purposes.
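Regarding the point on intermediate processing stages: if every step is captured as a named, deterministic function, intermediate results can be regenerated from the raw data on demand instead of being stored. A minimal sketch (the processing steps themselves are illustrative):

```python
import numpy as np

# Each step is a named, deterministic function. As long as the raw data and
# this script are preserved, every intermediate result can be regenerated
# on demand instead of being stored separately.

def remove_outliers(x: np.ndarray, z: float = 3.0) -> np.ndarray:
    """Drop values more than z standard deviations from the mean."""
    return x[np.abs(x - x.mean()) <= z * x.std()]

def hourly_means(x: np.ndarray, samples_per_hour: int = 60) -> np.ndarray:
    """Aggregate a 1-D measurement series into hourly means."""
    n = len(x) // samples_per_hour * samples_per_hour
    return x[:n].reshape(-1, samples_per_hour).mean(axis=1)

def pipeline(raw: np.ndarray) -> np.ndarray:
    """The documented sequence of steps; only 'raw' needs to be archived."""
    return hourly_means(remove_outliers(raw))
```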
Practical examples
Natural sciences
In the fictitious Mayer working group for environmental analysis, mass spectrometry data is collected daily from environmental samples, often in the double-digit gigabyte range. This includes data from method development, test measurements and the actual measurements ultimately intended for the publication of scientific papers. To ensure that the limited storage resources are not exhausted, all data is initially stored in a directory organised by project. Test measurements and other data not intended for archiving (temporary files) are stored in a separate folder, which is deleted regularly. All measurement data is accompanied by a description of the measurement method used (e.g. the instrument software's method file or a .txt file) so that the measurements can be reproduced and replicated. A further .txt file attached to the measurement data documents the context of the measurement and additional information (e.g. sampling, sample preparation, the person measuring, etc.). In addition, all of Prof. Mayer's employees are required to immediately delete measurement data that has become redundant or obsolete due to new measurements or findings. At monthly intervals, Prof. Mayer deletes the folder for temporary files and asks all employees to conscientiously check their data with regard to quality, reproducibility, traceability and potential future usefulness to other researchers, and to delete it where appropriate. In this way, Prof. Mayer always maintains an overview of her working group's research data and ensures that other people can easily find and reuse this data.
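A clean-up routine like Prof. Mayer's monthly deletion of temporary files lends itself to automation. The following sketch assumes one "temporary" subfolder per project directory (all paths are illustrative) and could be run monthly via a scheduler such as cron:

```python
import shutil
from pathlib import Path

# Illustrative layout: one directory per project, each with a 'temporary'
# subfolder for test measurements and other data not intended for archiving.
PROJECTS_ROOT = Path("/data/ag-mayer/projects")

def purge_temporary_folders(root: Path = PROJECTS_ROOT) -> None:
    """Delete and recreate every project's temporary folder."""
    for tmp in root.glob("*/temporary"):
        shutil.rmtree(tmp)  # irreversible: deleted data cannot be recovered
        tmp.mkdir()         # recreate the empty folder for the next cycle

if __name__ == "__main__":
    purge_temporary_folders()
```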
Humanities
In a social science research project, thirty guided interviews are conducted with experts from community foundations. The aim of the study is a text-based evaluation of the survey results. The interviews are conducted online via video conferences and recorded as video files. For data protection reasons, the interviewees are assured that the recording will be deleted after it has been transcribed. Once all interviews have been completed, they are transcribed and saved in text files. These serve as the basis for further analysis. The video recordings of the interviews can be deleted after transcription has been completed.
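The deletion promise can be implemented defensively so that no recording is removed before its transcript exists. A sketch under the assumption that recordings and transcripts share a base name (the naming scheme and directory are illustrative):

```python
from pathlib import Path

# Illustrative layout: recordings as 'interview_NN.mp4', transcripts as
# 'interview_NN.txt' in the same directory.
DATA_DIR = Path("interviews")

def delete_transcribed_recordings(data_dir: Path = DATA_DIR) -> None:
    """Delete each recording only if a non-empty transcript exists for it."""
    for video in data_dir.glob("interview_*.mp4"):
        transcript = video.with_suffix(".txt")
        if transcript.exists() and transcript.stat().st_size > 0:
            video.unlink()  # fulfils the deletion promise made to interviewees
        else:
            print(f"keeping {video.name}: no transcript yet")
```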
Biodiversity
In a research project on the biodiversity of birds, audio data is collected via recordings in a forest over a longer period of time with the aim of monitoring the number of birds present. These recordings are irreplaceable: if parts are lost, they cannot be re-recorded. The data is then analysed and further processed by several doctoral students to answer various scientific questions. Initially, the data is available in an uncompressed audio format (WAV files). After processing, this raw data, more than 100 TB in total, must remain available as a reference, but direct access to it is no longer necessary.
Initially, the data was stored on a large iSCSI storage system, provided as a network drive via a Linux virtual machine. This iSCSI system has built-in RAID redundancy but no other backup mechanisms, so the data was completely copied to a second storage system. That system had its own built-in safeguards, which meant that the absolute amount of stored data multiplied as a result of this measure. No versioning was used; the transfer was performed primarily with the rsync tool.
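A transfer of this kind might have looked roughly as follows (paths and host names are illustrative; only standard rsync options are used):

```python
import subprocess

# '-a' preserves permissions and timestamps, '-v' is verbose, and '--checksum'
# compares file contents rather than modification times when deciding what to
# copy. Note that without '--delete', files removed at the source remain on
# the target, which can silently inflate the stored volume.
subprocess.run(
    ["rsync", "-av", "--checksum",
     "/mnt/iscsi-volume/birdsong/",
     "backup-server:/archive/birdsong/"],
    check=True,
)
```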
After a consultation on data management, the project management decided to switch to a new storage solution. All data in active use is now stored on a secure central storage system. All data that is no longer in active use and needs to be retained only as a reference is moved to an object storage system with higher redundancy and, at the same time, higher storage efficiency. This step ultimately led to a significant reduction in the gross data volume. As a further step, lossless audio compression could be considered.
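For that final step, the WAV recordings could, for instance, be re-encoded as FLAC, a lossless format that typically reduces audio file sizes by roughly half. A sketch assuming ffmpeg is installed (the paths are illustrative):

```python
import subprocess
from pathlib import Path

def wav_to_flac(wav: Path) -> Path:
    """Losslessly re-encode a WAV file as FLAC using ffmpeg."""
    flac = wav.with_suffix(".flac")
    subprocess.run(["ffmpeg", "-i", str(wav), "-c:a", "flac", str(flac)],
                   check=True)
    return flac

# Walk the archive and convert every WAV recording.
for wav in Path("/archive/birdsong").rglob("*.wav"):
    wav_to_flac(wav)
```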