A pythonic & participatory metadata workflow

This recipe presents a participatory, integrated and standardised approach to metadata management. Each data file on a file system is accompanied by a minimal metadata file. Crawler scripts pick up these metadata files and publish them as iso19139 (or alternative models) on a searchable catalogue. iso19139 is the metadata model currently mandated by INSPIRE and very common in the geospatial domain. Other communities tend to use different standards, such as STAC (Earth Observation), DCAT (Open Data) and DataCite (Academia); these can benefit from the same metadata workflow.

The recipe introduces you to the participatory metadata workflow step by step.

Initial setup

The initial step assumes a folder of data files on a network drive, SharePoint or Git repository. Datasets stored in a database are not considered for now, but can follow a similar workflow.

For each data file in the folder we will create a metadata control file (MCF). MCF is a metadata format from the pygeometa community. It is a YAML-encoded subset of the iso19139:2007 model. YAML is easy for humans to read and optimal for content versioning (in Git). The pygeometa library can export a metadata control file to various common metadata formats.

Consider setting up a virtual environment for the workshop:

virtualenv pygeometa && cd pygeometa && . bin/activate

Then install the pygeometa library:

pip install pygeometa

Create an MCF

A minimal example of MCF is (see also a more extended version):

mcf:
    version: 1.0

metadata:
    identifier: 3f342f64-9348-11df-ba6a-0014c2c00eab
    language: en
    hierarchylevel: dataset
    datestamp: 2023-01-01

spatial:
    datatype: grid

identification:
    language: eng
    title: Soilgrids sample Dataset
    abstract: This is a sample dataset for the EJP Soil Dataset Assimilation Masterclass
    dates:
        creation: 2023-01-01
    keywords:
        default:
            keywords: ["sample"]
    topiccategory:
        - geoscientificInformation
    extents:
        spatial:
            - bbox: [2,50,4,52]
              crs: 4326
    fees: None
    accessconstraints: otherRestrictions
    rights: CC-BY

contact:
    pointOfContact: 
        organization: ISRIC - World Soil Information
        url: https://www.isric.org
        city: Wageningen
        country: The Netherlands
        email: info@isric.org

distribution:
    wms:
        url: https://maps.isric.org
        type: OGC:WMS
        rel: service
        name: soilgrids

Copy the content above into a new text file and save it with the same name as the dataset, but with the extension .yml. Use an advanced text editor such as Notepad++ or Visual Studio Code to benefit from YAML syntax highlighting and validation. Note that parsing in, for example, pygeometa fails if the YAML is incorrectly formatted.
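
A quick way to check that the file parses correctly is pygeometa's Python API (a minimal sketch, assuming the MCF was saved as soilgrids-sample.yml; adjust the file name to your dataset):

from pygeometa.core import read_mcf

# read_mcf parses the MCF YAML into a Python dict and
# raises an exception when the YAML is malformed;
# the file name is a placeholder for your own MCF
mcf = read_mcf('soilgrids-sample.yml')
print(mcf['identification']['title'])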

MDME

Model Driven Metadata Editor (MDME) is an online visual editor for MCF files. You can create a new MCF file or load an existing one, populate the relevant fields and save it locally.

Generate iso19139:2007

As soon as you have a folder of MCFs, you can use pygeometa generate to convert them to iso19139:2007.

pygeometa metadata generate path/to/file.yml --schema=iso19139 --output=some_file.xml

Or, for a folder of files, save the content below to a .sh file and run it (mac/linux; on Windows, run it in Git Bash or WSL, or write an equivalent .bat loop):

FILES="/path/to/*.yml"
for f in $FILES
do
  echo "Processing $f file..."
  pygeometa metadata generate $f --schema=iso19139 --output=$f.xml
done
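
The same batch conversion can be done from Python with pygeometa's API (a sketch using pygeometa's ISO19139OutputSchema; the folder path is a placeholder):

from pathlib import Path
from pygeometa.core import read_mcf
from pygeometa.schemas.iso19139 import ISO19139OutputSchema

iso_os = ISO19139OutputSchema()

# convert every MCF in the folder to an iso19139 XML file
for mcf_file in Path('/path/to').glob('*.yml'):
    print(f'Processing {mcf_file} ...')
    mcf = read_mcf(str(mcf_file))  # parse the MCF YAML into a dict
    xml = iso_os.write(mcf)        # serialise the dict as iso19139 XML
    mcf_file.with_suffix('.xml').write_text(xml, encoding='utf-8')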

Note that you can also create your own schema for the iso19139 generation. Using a customised template, you can include additional properties, for example to improve INSPIRE compliance.

pygeometa metadata generate path/to/file.yml --schema_local=/path/to/my-schema --output=some_file.xml
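
A local schema is a folder of Jinja2 templates with a main.j2 entry point. One way to start is to copy the built-in iso19139 templates and adapt them (a sketch; the path lookup assumes a default pip install):

# locate the built-in iso19139 templates shipped with pygeometa
python -c "import pygeometa.schemas.iso19139 as s; print(s.__file__)"
# copy that folder to ./my-schema, edit main.j2, then run the
# generate command above with --schema_local=./my-schema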

Import existing metadata

If a data file already has a metadata document (for example, a shapefile may be accompanied by a file with the extension .shp.xml), you can try to convert it to MCF using pygeometa. pygeometa requires you to indicate the metadata schema in advance.

For iso19139:2007 use:

pygeometa metadata import path/to/file.xml --schema=iso19139

For FGDC (typically used with shapefiles) use:

pygeometa metadata import path/to/file.xml --schema=fgdc
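
Assuming the import command writes the resulting MCF to standard output (as pygeometa's generate command does when no output is given), you can redirect it to a file next to the dataset (the path is a placeholder):

pygeometa metadata import path/to/file.shp.xml --schema=fgdc > path/to/file.yml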

Import generated metadata to a searchable catalogue

pycsw is a Python-based OGC reference implementation of Catalogue Service for the Web (CSW) and an early adopter of OGC API - Records and STAC. We'll use pycsw via a Docker image to publish the metadata records in a search service. We run it in detached mode so we can keep using the terminal; type docker stop pycsw to stop the container.

docker run -d --rm --name pycsw -p 8000:8000 geopython/pycsw
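
To verify that the container is up, you can request the collections endpoint (a quick check; the f=json parameter asks pycsw for a JSON response):

curl "http://localhost:8000/collections?f=json"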

We now have a running pycsw at http://localhost:8000/collections with some sample data. We will now remove the sample data and insert our own metadata. To do so, we mount our current folder with XML files into the container:

docker stop pycsw
docker run -d --rm --name pycsw -v ${PWD}:/metadata -p 8000:8000 geopython/pycsw

Now we can trigger pycsw-admin to remove the default records and import our metadata. As part of the calls we reference the config file, which contains the connection details for the database.

docker exec -ti pycsw pycsw-admin.py delete-records -c /etc/pycsw/pycsw.cfg
docker exec -ti pycsw pycsw-admin.py load-records -c /etc/pycsw/pycsw.cfg -p /metadata -r

Check out the new content at http://localhost:8000/collections. Note that if you restart the container, all records are removed, because the database is currently not persisted on a volume.
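
You can also query the catalogue programmatically with OWSLib (a sketch; the /csw endpoint path is an assumption and may differ per pycsw version and configuration):

from owslib.csw import CatalogueServiceWeb

# connect to the CSW endpoint of the local pycsw container
csw = CatalogueServiceWeb('http://localhost:8000/csw')

# fetch up to 10 records and print their identifiers and titles
csw.getrecords2(maxrecords=10)
for identifier, record in csw.records.items():
    print(identifier, record.title)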

You can also mount a customised configuration file into the container, so you can tune the configuration of the catalogue. Also have a look at the INSPIRE extension for pycsw.

pyGeoDataCrawler

pyGeoDataCrawler is a tool under active development which builds on top of the mechanism described above, internally using the libraries mentioned. It crawls a folder structure for metadata and data files; if no metadata file exists for a dataset, it creates one based on metadata derived from the dataset. Discovered metadata files are exported to iso19139, ready to be imported into a catalogue like pycsw or GeoNetwork.

Automated workflows

The tasks above are carried out manually. However, they can also be set up to run automatically on file changes or at regular intervals using cron jobs. With such an approach you can automatically update the catalogue content when, for example, metadata records are updated or added to the Git repository (CI/CD). To facilitate the participatory approach, consider including a link from the dataset page in the catalogue back to the Git source, inviting users to suggest improvements to the metadata records (as a Git issue or pull request).
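
For example, a crontab entry could regenerate and reload the metadata nightly (a sketch; generate-metadata.sh is a hypothetical wrapper around the generate and load-records steps above):

# run the conversion and catalogue import every night at 02:00 (add via crontab -e)
0 2 * * * /path/to/generate-metadata.sh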

Evaluate Metadata and Discovery Service

You can evaluate individual iso19139 records in the INSPIRE reference validator. The validator can also evaluate the discovery service itself. If a service is running on localhost, use the tunnel approach to evaluate it. GeoHealthCheck also includes a probe for testing availability of a CSW service.

Access the service from QGIS

QGIS contains a default plugin called MetaSearch which enables catalogue searches from within QGIS. You can find the plugin in the Web menu or on the toolbar as a set of binoculars. Open the plugin. First you need to set up a new service connection: on the Services tab, click New, choose a name and add the URL http://localhost:8000/csw. Click the Service Info button to view the metadata of the service. Now return to the Search tab and perform a search. Notice that when you select a search result, it is highlighted on the map and may enable the Add Data button in the footer (depending on whether QGIS recognises the protocol mentioned in the metadata).

Read more

At the 2023 masterclass edition, Tom Kralidis presented the geopython ecosystem, including pycsw.