Data formats
Research data can take many different forms: text, numerical data, models, software, multimedia, and data that are specific to a discipline or characteristic of the instrument used to measure them.
A data format or file format is the form in which information is coded. Information is coded in such a way that a programme or application can recognize, read and use the data.
There are three ways to lose data:
- Lose the bits: the carrier, bit rot
- Lose the documentation: versions, metadata
- Lose the access: operating system, hardware, application
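Losing the bits can be detected, though not repaired, by recording a checksum when a file is stored and recomputing it later. The following is a minimal sketch in Python, assuming SHA-256 as the checksum; the function name `checksum` and the sample bytes are made up for illustration.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

original = b"temperature;21.5\n"      # made-up file contents
stored_digest = checksum(original)    # recorded when the file is archived

# Simulate bit rot: flip the lowest bit of the first byte.
damaged = bytes([original[0] ^ 0b00000001]) + original[1:]

print(checksum(original) == stored_digest)  # the intact copy still matches
print(checksum(damaged) == stored_digest)   # the damaged copy does not
```

A single flipped bit is enough to change the checksum, which is why archives can use such checks to notice damaged carriers.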
The history of digital storage [1] gives a good overview of the transience of information carriers. When software or hardware becomes obsolete, data can become unreadable. To prevent this, it is important to select a data format that does not depend on any particular software producer. In other words: no proprietary formats. Standard formats can be read both by people and by computers. The formats that best ensure longevity are standard, exchangeable and open formats.
Data archives often have a list of preferred formats for the submission of data. One example is the list of preferred formats used by DANS [2].
3TU.Datacentrum does not restrict the data formats allowed to be used, but encourages open and standard formats. On the left you can see a list of formats included in 3TU.Datacentrum at the moment.
When learning about data formats, you do not need to know all the technical ins and outs. You need a sense of the possible complexities of the different data formats, and when talking to a researcher you need some general knowledge of the best data formats for the submission of datasets. If it gets too complicated, you can always refer to the expert in this field. You can at least explain that long-term, permanent storage requires certain data formats. If researchers still want to submit their data in a different format, that is possible, but the long-term accessibility of the data cannot be guaranteed.
Data formats are described by their MIME types. A MIME type consists of two parts separated by a slash (type/subtype). For example, text/plain is the MIME type for ordinary text. MIME is short for Multipurpose Internet Mail Extensions. Among other things, the MIME type tells a web browser how to deal with a certain file.
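The two parts of a MIME type can be separated with a few lines of Python. This is a minimal sketch that uses the standard `email.message` module; the helper name `split_mime` is made up for illustration. Any parameters after a semicolon, such as a character set, are ignored.

```python
from email.message import Message

def split_mime(content_type: str) -> tuple[str, str]:
    """Split a MIME type into (type, subtype), ignoring any parameters."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_maintype(), msg.get_content_subtype()

print(split_mime("text/plain"))
print(split_mime("text/html; charset=utf-8"))
print(split_mime("application/pdf"))
```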
Perhaps you usually recognise data formats by their extensions: the three or four letters following the final dot in a file name. A video on your computer could, for example, have the extension .avi. The associated MIME type is video/msvideo (in practice often written as video/x-msvideo). If the .avi video is placed on a website, the URL does not have to end in .avi. However, the web server normally forwards the MIME type “in the background”, in the Content-Type header of its response, where it can be read by computers.
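The link between extension and MIME type can be explored with Python's standard `mimetypes` module. This is only a sketch with made-up file names. The module guesses from the extension alone and does not inspect the file contents, and the exact label returned for .avi may differ between Python versions and operating systems.

```python
import mimetypes

# Made-up file names; only the extension is used for the guess.
for name in ["notes.txt", "report.pdf", "figure.png", "movie.avi"]:
    mime_type, _encoding = mimetypes.guess_type(name)
    print(name, "->", mime_type)
```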
There are three formats for text in the list: plain, html and xml. Plain text is text without any formatting. HTML (HyperText Markup Language) is a format that includes information about how the information will look in a website. In the code you can, for example, indicate whether text should be bold or italic. These options are not available in plain text. In XML (eXtensible Markup Language) you do not describe the formatting, but you give information about the contents of the file, for example, by adding metadata.
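The difference between the three text formats can be illustrated with one measurement written in three ways. This is a sketch with made-up content: the element name `measurement` and the attributes `quantity` and `unit` are assumptions for the example, not part of any standard. Only the XML version tells a programme what the number means.

```python
import xml.etree.ElementTree as ET

plain_text = "Water temperature 21.5"                    # no formatting, no description
html_text = "<p>Water temperature <b>21.5</b></p>"       # describes appearance (bold)
xml_text = '<measurement quantity="water temperature" unit="degC">21.5</measurement>'

# A programme can read the description of the contents from the XML.
record = ET.fromstring(xml_text)
print(record.get("quantity"))   # what was measured
print(record.get("unit"))       # in which unit
print(float(record.text))       # the value itself
```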
When information is exchanged between programmes, formatting is often lost or damaged. To prevent this, there are formats that ensure a universal representation of a document. Take PDF (Portable Document Format, MIME type application/pdf): an open and universal file format for the electronic exchange of documents that preserves the formatting, which is convenient for both sender and receiver. Strictly speaking, MIME types that start with application denote data formats that are meant to be read by a particular application or programme.
The preferred formats for audiovisual material are mpeg and msvideo. MPEG is a compression standard for video and audio developed by the Moving Picture Experts Group. Msvideo stands for Microsoft Video. That may not seem to be an open format (after all, it includes the name Microsoft), but it is treated as one here because it is widely used and established.
For images, 3TU.Datacentrum prefers the PNG format. PNG stands for Portable Network Graphics. It is a file format that supports lossless compression: the file becomes smaller, but no information is thrown away.
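What lossless means can be shown with Python's standard `zlib` module, which implements DEFLATE, the compression method that PNG builds on. This sketch is not a PNG encoder; the bytes below are made-up, highly repetitive stand-ins for pixel data.

```python
import zlib

pixels = bytes([255, 255, 255, 0] * 1000)   # made-up, repetitive "image" data
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

print(len(pixels), "bytes before,", len(compressed), "bytes after compression")
print(restored == pixels)   # lossless: every byte comes back unchanged
```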
Data may be written in many different languages.
The more universal the language, the easier the conversation.
Application means that the file is associated with a specific application or programme. For example:
- Application/vnd.google-earth.kml+xml. The geographic data are coded in such a way that they can be read in a so-called Earth Browser, like Google Earth, Google Maps, and Google Maps on your smartphone.
- Application/gml+xml indicates Geography Markup Language: a standard way to describe geographic information in plain text, independent of any form of visualisation of the data (as opposed to the data coding for an Earth Browser, where the specific aim is to visualise the data).
- Application/x-java-archive indicates a Java archive (JAR file): a package of files for the Java programming language.
- Application/octet-stream indicates general binary data that is not defined any further. It is a catch-all category for datasets whose format is unclear.
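Many binary formats can be recognised by a fixed signature in their first bytes, with application/octet-stream as the fallback when nothing matches. The following is a minimal sketch covering only a handful of well-known signatures. The function name `sniff` is made up, and real detection tools check far more cases (for example, an HDF5 signature may also appear at a later offset in the file).

```python
# Well-known signatures at the start of a file, mapped to the MIME types used in this text.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",        # non-empty zip archive
    b"\x1f\x8b": "application/x-gzip",
    b"\x89HDF\r\n\x1a\n": "application/x-hdf5",
}

def sniff(first_bytes: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    for signature, mime_type in SIGNATURES.items():
        if first_bytes.startswith(signature):
            return mime_type
    return "application/octet-stream"        # unknown binary data

print(sniff(b"%PDF-1.7\n"))
print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00"))
print(sniff(b"\x00\x01\x02\x03"))
```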
Numerical data
HDF5 (application/x-hdf5) and NetCDF (application/x-netcdf) are both data formats that are often used for storing large volumes of numerical data (data in the form of numbers). Files in these formats are binary files: the numbers are stored directly as patterns of ones and zeros rather than as readable text. The words ‘binary digit’ are contracted to ‘bit’. The way in which the ones and zeros are combined represents the information. The information can be anything that can be described digitally, such as sound waves or high-resolution MRI scans.
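The difference between a number stored as text and the same number stored in binary form can be sketched with Python's standard `struct` module. The example assumes an 8-byte IEEE 754 floating point number in big-endian byte order, which is one common binary representation; the value itself is made up.

```python
import struct

value = 21.5
as_text = str(value).encode("ascii")    # four readable characters: 2, 1, ., 5
as_binary = struct.pack(">d", value)    # 8 bytes: IEEE 754 double, big-endian

# Write out the 64 individual bits of the binary form.
bits = "".join(f"{byte:08b}" for byte in as_binary)

print(as_text)
print(as_binary.hex())
print(bits)
print(struct.unpack(">d", as_binary)[0])   # reading the bits back gives the number again
```

Without knowing the format (here: how many bytes, which byte order, which number representation), the same ones and zeros cannot be interpreted, which is why the documentation of a binary format matters as much as the bits themselves.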
In a simple table, values are presented in two dimensions. However, many datasets use three, four, five or even more dimensions, in a multidimensional array [3]. You can imagine that the file size increases exponentially with the number of dimensions represented, because every added dimension multiplies the number of values. These data formats allow metadata to be added to a complete dataset, but also to the variables and dimensions in the dataset. Within both HDF5 and NetCDF, standard definitions are used for the units placed on the axes.
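How quickly the size grows can be worked out with a short calculation. The sketch below uses a hypothetical dataset: the dimension names, their lengths and the 8 bytes per value are all assumptions for illustration. The number of values is the product of the lengths of all dimensions.

```python
import math

# Hypothetical dataset: one 8-byte number for every combination of the dimensions.
dimensions = {"time": 365, "depth": 20, "latitude": 180, "longitude": 360}
bytes_per_value = 8

shape = []
for name, length in dimensions.items():
    shape.append(length)
    cells = math.prod(shape)    # number of values with the dimensions added so far
    print(f"{len(shape)} dimension(s), up to {name}: "
          f"{cells} values, {cells * bytes_per_value} bytes")
```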
Application/x-matlab-data is another example of a format for numerical data: Matlab is an advanced programme for scientific calculations.
Application/zip and Application/x-gzip are discussed in module 3.
1. Vasilev, M. (2011). The history of digital storage. Retrieved 8-12-2011 from http://mashable.com/2011/10/08/digital-storage-infographic/
2. DANS. (2011). Overzicht van de Preferred Formats bij DANS. Retrieved 8-12-2011 from http://www.dans.knaw.nl/sites/default/files/file/archief/Preferred%20formats%20NL.pdf
3. Folk, M., Koziol, Q. (1996). HDF - The next generation. Retrieved 8-12-2011 from http://access.ncsa.illinois.edu/Archive/backissues/96.1/hdf-tng.html
