Massive simulation data dissemination through Galactica
Massively parallel astrophysical codes used in various fields of research often produce massive raw datasets in the GByte-PByte range. Such datasets are valuable since they contain all the information on the underlying numerical models, but they also have several disadvantages:
- high data volumes: difficult to process for non-HPC experts and very difficult to transfer over the network.
- custom data formats: the data format is more often than not specific to each simulation code. This lack of standardisation is a major obstacle to data dissemination and reuse; in most cases, the use of a dedicated data-processing library is mandatory.
- custom file formats: some legacy codes still use custom binary file formats instead of standard ones such as NetCDF, HDF5 or ROOT. In this case, dedicated data readers are also necessary for non-experts to extract scientific information (see the sketch after this list).
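To illustrate the last point, here is a minimal sketch of the kind of dedicated reader a custom binary format requires. The file layout, field names and data types below are purely illustrative assumptions, not the format of any actual simulation code:

```python
# Minimal sketch of a dedicated reader for a hypothetical raw binary snapshot.
# The layout below (a 3 x int32 header followed by float64 densities) is an
# assumption made for illustration only, not an actual simulation code format.
import numpy as np

def read_custom_snapshot(path):
    """Read a hypothetical raw binary snapshot into a 3D density array."""
    with open(path, "rb") as f:
        nx, ny, nz = np.fromfile(f, dtype=np.int32, count=3)  # grid dimensions
        ncells = int(nx) * int(ny) * int(nz)
        density = np.fromfile(f, dtype=np.float64, count=ncells)
    return density.reshape((int(nx), int(ny), int(nz)))
```

Without knowing the exact byte layout, such a file is unreadable; this is the kind of knowledge a data-processing service embeds so that end users never have to deal with it.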
The Galactica database does not store these massive simulation datasets on its own filesystem. In the vast majority of use cases, third-party research groups who wish to analyze simulation data do not require direct access to the low-level raw data: they would rather obtain high-level data products they can analyze more easily, preferably in a standard data/file format. The Galactica web application therefore provides access to such data product generators, which operate on the raw simulation data, by means of WebServices deployed on the platform.
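As a hedged illustration of such a data product generator, the sketch below reduces a raw 3D density cube to a small 2D projection map and writes it to a standard FITS file. The function name, unit and reduction are assumptions made for the example; numpy and astropy are assumed to be available on the processing side:

```python
# Illustrative sketch of a high-level data product generator: it reduces a
# large raw 3D array to a small 2D column-density map and writes it to a
# standard FITS file that is easy to transfer and open with any FITS viewer.
import numpy as np
from astropy.io import fits

def make_projection_product(density, dx, output_path):
    """Project a 3D density cube along the z axis and save a FITS image."""
    column_density = density.sum(axis=2) * dx   # simple column integration
    hdu = fits.PrimaryHDU(data=column_density)
    hdu.header["BUNIT"] = "g cm-2"              # assumed unit, for illustration
    hdu.writeto(output_path, overwrite=True)

# Example with a hypothetical 128^3 cube: the resulting FITS image weighs a
# few hundred kBytes, regardless of the size of the original raw snapshot.
cube = np.random.random((128, 128, 128))
make_projection_product(cube, dx=3.0857e18, output_path="projection.fits")
```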
The Terminus ecosystem
As a backend to the Galactica application WebServices, the database communicates with an ecosystem of data-processing servers, hosted by universities, research institutes or supercomputing facilities. These servers, or Terminus nodes, implement data-processing services and host the massive astrophysical simulation datasets, hence eliminating the need to move them around.
Each server is connected to the Galactica database and waits for incoming data-processing job requests posted on the web application by authenticated users. Once a job is completed, the data products generated on the Terminus server are uploaded back to the Galactica database filesystem, where members of the scientific community can download them.
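This cycle can be summarised by the following conceptual sketch. The TerminusClient class and its methods are hypothetical placeholders, not the actual galactica-terminus API; they only illustrate the request / process / upload loop running on a node:

```python
# Conceptual sketch of the node-side job lifecycle (placeholders only, not the
# real galactica-terminus API).
import time

class TerminusClient:
    """Hypothetical stand-in for the node-side client (illustration only)."""
    def fetch_job(self):
        ...  # ask Galactica for a pending job request posted by an authenticated user
    def upload_product(self, job, product_path):
        ...  # push the generated data product back to the Galactica filesystem

SERVICES = {}  # hypothetical registry: service name -> processing function

def run_node(client, poll_interval=30.0):
    """Wait for incoming job requests and process them on local raw data."""
    while True:
        job = client.fetch_job()
        if job is None:
            time.sleep(poll_interval)                 # no pending request, wait
            continue
        service = SERVICES[job["service"]]            # pick the requested service
        product_path = service(**job["parameters"])   # run on the local datasets
        client.upload_product(job, product_path)      # product goes back to Galactica
```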
This distributed data-processing architecture offers good scalability at low cost while guaranteeing data sovereignty. In addition, each project can apply its own data dissemination policy, in line with its own Data Management Plan (DMP).
How to deploy a Terminus server?
To share massive simulation datasets on Galactica, deploy the galactica-terminus python package on a server offering the storage and computational resources you wish to share. For further details, see:
What kind of data-processing service?
Once the node is configured, you are free to install on it any data-processing service you want, using any library that handles the custom format of the datasets produced by any simulation code. This versatility makes Terminus a powerful tool for massive simulation data dissemination in all kinds of research fields. Typical data-processing services are:
- 1D profile samplers,
- 2D projection generators,
- 2D slice generators,
- 3D datacube extractors,
- spectrum generators,
- mock observation synthesizers,
but you are free to implement any type of data treatment you wish to make available to the scientific community. We encourage the generation of data products written in standard formats (VOTables, FITS files, PNG images, HDF5 files, CSV files, etc.).
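As an example of one such service, here is a minimal sketch of a 1D radial profile sampler writing its product as a CSV file. The binning scheme and field names are illustrative assumptions; only numpy and the standard library are used:

```python
# Illustrative 1D radial profile sampler: azimuthally averaged density around
# a given centre, written to a CSV product (assumed layout, for example only).
import csv
import numpy as np

def radial_profile_product(density, cell_size, center, output_path, nbins=64):
    """Compute a mean radial density profile and save it as a CSV file."""
    nx, ny, nz = density.shape
    x, y, z = np.indices((nx, ny, nz)) * cell_size
    r = np.sqrt((x - center[0])**2 + (y - center[1])**2 + (z - center[2])**2)
    edges = np.linspace(0.0, r.max(), nbins + 1)
    which = np.digitize(r.ravel(), edges) - 1          # bin index of each cell
    sums = np.bincount(which, weights=density.ravel(), minlength=nbins)
    counts = np.maximum(np.bincount(which, minlength=nbins), 1)
    profile = sums / counts                            # mean density per bin
    midpoints = 0.5 * (edges[:-1] + edges[1:])
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["radius", "mean_density"])
        for radius, value in zip(midpoints, profile[:nbins]):
            writer.writerow([radius, value])
    return output_path
```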
Need support?
If you need help sharing massive simulation datasets on Galactica with Terminus, please contact the Terminus support team.