Data Organisation


Paul van Genuchten


November 10, 2022

This article describes the data organisation step in the Soil Information Workflow and references various cookbook recipes related to this step in the workflow.

In this step a data curator prepares the organisation for incoming soil observations from field studies and laboratory work. Data from external organisations is prepared and made available to augment and/or validate local data. Relevant measures are put in place to authorise relevant staff, prevent data loss and data redundancy and assess privacy concerns. Concrete deliverables of this phase are: - data model materialised in a physical database - implemented procedures to prevent data loss - access grants to relevant staff - metadata document in which these aspects are described - procedures to evaluate quality of the data and infrastructure

Note that in this phase no effort is made to publish data to remote partners or the general public, but it does make sense to capture on the metadata if any access constraints apply which may prevent publication to a wider audience in a later stage.

Types of soil data

Within the soil domain (and wider earth sciences) typically 4 types of data are identified:

  1. ‘As-is’ data, legacy datasets in their original form or field and analytical data as it is received from field surveyors and laboratories, including reporting about the applied procedures.
  2. Soil analytical data and field descriptions of a) soil profiles and b) soil plots, standardised to a common model (iso28258). The data includes the lineage, the definition of the soil in space and time (xyzt), the attribute concerned, the procedure used and the unit of expression. Domain values are selected from common code lists.
  3. A dataset in which the soil parameter values are harmonised as if they were analysed using a single procedure. The metadata of individual observations can be left out, resulting in a abbreviated dataset, typically presented as multiple parameter values (texture fractions, pH, Ca content, Vegetation, Groundwater level) of a feature of interest (horizon, profile, plot defined by xyzt) in a flat table.
  4. Predicted distribution of soil properties or soil classes, either as grid cells through a statistical model or as vector polygons drawn with expert knowledge in the field.

Code lists

Another aspect to consider as part of data standardization is the adoption of common code lists that support interoperability of data. Datasets which use for example different texture classes are hard to compare. Within the soil community there are a number of common code lists available.

  • World Reference Base (WRB), available in digital form via Agrovoc.
  • FAO Guidelines of Soil Description (these practices are integrated in the latest version of WRB), digitised as part of the GLOSIS Web Ontology effort.
  • The GLOSIS Web Ontology includes code lists of common soil properties and analysis procedures. This list was originally collected in A compilation of georeferenced and standardised legacy soil profile data for Sub-Saharan Africa (Africa Soil Profiles Database; Leenaars J.G.B et al 2014). And later extended by the soil community.

In some cases it is relevant to define code lists at a country or even regional level. A regional code list repository supports practitioners in standardizing their data early in the data lifecycle. Extending a codelist is then a relevant recipe.

Metadata management

Metadata documents are preferably based on a common ontology such as DCAT, DataCite or iso19115. Roughly 2 approaches, with various implementation options, exist to maintain metadata within an organisation:

  • Metadata is included in the data file or is stored as a sidecar file next to the actual data file. At intervals a crawler imports metadata to make the data findable from a central location.
  • Metadata is captured in a separate system and linkage is maintained between the actual dataset and its metadata.

The second option is relatively easy to adopt, however it comes with a potential big effort to keep the metadata system in sync within the organisation. In practice organisations tend to implement a combination of both options.

This step includes the organisation of data arriving from external sources and the standardisation of that data with local data (transform to a common model). As soon as a user imports a remote dataset for use in a project, it is useful to include a metadata record about that dataset in the local repository, so colleagues understand where/when the data was imported. If the dataset is available online, it is better to link to the remote dataset, so users would always use the latest version.