A pythonic & participatory metadata workflow
This recipe presents a participatory, integrated and standardised approach to metadata management. Each data file on a file system will be accompanied by a minimal metadata file. Crawler scripts will pick up these metadata files and publish them as iso19139 (or alternative models) on a searchable catalogue. iso19139 is the metadata model currently mandated by INSPIRE and very common in the geospatial domain. Other communities tend to use different standards, such as STAC (Earth Observation), DCAT (Open Data) and DataCite (Academia), which can benefit from the same metadata workflow.
The recipe introduces you to the participatory metadata workflow step by step.
Initial
The initial step assumes a folder of data files on a network drive, SharePoint or git repository. Datasets stored in a database will not be considered for now, but can follow a similar workflow.
For each data file in the folder we will create a Metadata Control File (MCF). MCF is a metadata format from the pygeometa community: a YAML-encoded subset of the iso19139:2007 model. YAML is easy for humans to read and well suited for content versioning (in git). The pygeometa library can export a metadata control file to various common metadata formats.
Consider setting up a virtual environment for the workshop:
virtualenv pygeometa && cd pygeometa && . bin/activate
Then install the pygeometa library.
pip install pygeometa
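To quickly verify the installation, you can import the library from Python; a minimal sketch, assuming pygeometa exposes __version__ (as most geopython packages do):

import pygeometa  # raises ImportError if the install failed

print(pygeometa.__version__)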
Create an MCF
A minimal example of an MCF (see also a more extended version):
mcf:
    version: 1.0

metadata:
    identifier: 3f342f64-9348-11df-ba6a-0014c2c00eab
    language: en
    hierarchylevel: dataset
    datestamp: 2023-01-01

spatial:
    datatype: grid

identification:
    language: eng
    title: Soilgrids sample Dataset
    abstract: This is a sample dataset for the EJP Soil Dataset Assimilation Masterclass
    dates:
        creation: 2023-01-01
    keywords:
        default:
            keywords: ["sample"]
    topiccategory:
        - geoscientificInformation
    extents:
        spatial:
            - bbox: [2,50,4,52]
              crs: 4326
    fees: None
    accessconstraints: otherRestrictions
    rights: CC-BY

contact:
    pointOfContact:
        organization: ISRIC - World Soil Information
        url: https://www.isric.org
        city: Wageningen
        country: The Netherlands
        email: info@isric.org

distribution:
    wms:
        url: https://maps.isric.org
        type: OGC:WMS
        rel: service
        name: soilgrids
Copy the content above into a new text file and save it with the same name as the dataset, but with the extension .yml. Use an advanced text editor such as Notepad++ or Visual Studio Code to benefit from YAML syntax highlighting and validation. Note that tools such as pygeometa will fail to parse an incorrectly formatted YAML file.
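Before moving on, you can check that the file parses. A minimal sketch using the pygeometa Python API (the file name soilgrids.yml is an assumption; use whatever name you chose):

from pygeometa.core import read_mcf

# read_mcf raises an exception when the YAML is malformed
# or when required MCF sections are missing
try:
    mcf = read_mcf('soilgrids.yml')
    print(mcf['identification']['title'])
except Exception as err:
    print(f'MCF could not be parsed: {err}')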
MDME
Model Driven Metadata Editor (MDME) is an online visual editor for MCF files. You can create or load an existing MCF file, populate the relevant fields and save the result locally.
Generate iso19139:2007
As soon as you have a folder of MCFs, you can use pygeometa generate to convert them to iso19139:2007.
pygeometa metadata generate path/to/file.yml --schema=iso19139 --output=some_file.xml
Or, for a folder of files, save the content below to a .sh file (macOS/Linux) and run it; on Windows, write an equivalent .bat loop:
FILES="/path/to/*.yml"
for f in $FILES
do
  echo "Processing $f file..."
  pygeometa metadata generate "$f" --schema=iso19139 --output="$f.xml"
done
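The same conversion can also be scripted with the pygeometa Python API instead of the CLI; a sketch, assuming your MCFs live in /path/to:

from pathlib import Path

from pygeometa.core import read_mcf
from pygeometa.schemas.iso19139 import ISO19139OutputSchema

schema = ISO19139OutputSchema()

for mcf_file in Path('/path/to').glob('*.yml'):
    print(f'Processing {mcf_file}...')
    xml = schema.write(read_mcf(str(mcf_file)))  # MCF dict in, iso19139 XML out
    # mirror the shell script's naming: file.yml -> file.yml.xml
    Path(f'{mcf_file}.xml').write_text(xml, encoding='utf-8')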
Notice that you can also create your own schema for the iso19139 generation. By using a customised template you are able to include additional properties, for example to achieve better INSPIRE compliance.
pygeometa metadata generate path/to/file.yml --schema_local=/path/to/my-schema --output=some_file.xml
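With the Python API, a customised template directory can be rendered through render_j2_template; a sketch, assuming /path/to/my-schema holds your adapted Jinja2 templates:

from pygeometa.core import read_mcf, render_j2_template

mcf = read_mcf('path/to/file.yml')
# render with a customised template, e.g. one tuned for INSPIRE compliance
xml = render_j2_template(mcf, template_dir='/path/to/my-schema')
with open('some_file.xml', 'w', encoding='utf-8') as fh:
    fh.write(xml)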
Import existing metadata
If a data file already has a metadata document (for example a shapefile accompanied by a file with the extension .shp.xml), you can try to convert it to MCF using pygeometa. pygeometa requires you to indicate the metadata schema in advance.
For iso19139:2007 use:
pygeometa metadata import path/to/file.xml --schema=iso19139
For fgdc (typically used with shapefiles) use:
pygeometa metadata import path/to/file.xml --schema=fgdc
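The import can also be done from Python. A sketch, assuming the schema classes expose an import_ method (as in recent pygeometa versions):

from pygeometa.schemas.iso19139 import ISO19139OutputSchema

with open('path/to/file.xml', encoding='utf-8') as fh:
    mcf = ISO19139OutputSchema().import_(fh.read())  # XML string in, MCF dict out
print(mcf['identification']['title'])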
Import generated metadata to a searchable catalogue
pycsw is a Python-based OGC reference implementation of Catalogue Service for the Web and an early adopter of OGC API - Records and STAC Catalog. We'll use pycsw via a Docker image to publish the metadata records in a search service. We run it in detached mode so we can interact with the running container; type docker stop pycsw to stop the container.
docker run -d --rm --name pycsw -p 8000:8000 geopython/pycsw
We now have a running pycsw at http://localhost:8000/collections with some sample data. We will now remove the sample data and insert our own metadata. For that reason we mount our current folder with XML files into the container:
docker stop pycsw
docker run -d --rm --name pycsw -v ${PWD}:/metadata -p 8000:8000 geopython/pycsw
Now we can trigger the pycsw admin tool to remove the default records and import our metadata. As part of these calls we reference the config file, which contains the connection details for the database.
docker exec -ti pycsw pycsw-admin.py delete-records -c /etc/pycsw/pycsw.cfg
docker exec -ti pycsw pycsw-admin.py load-records -c /etc/pycsw/pycsw.cfg -p /metadata -r
Check out the new content at http://localhost:8000/collections. Note that if you restart the container, all records are removed, because the database is currently not persisted on a volume.
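Besides browsing, you can query the records programmatically through the OGC API - Records interface. A sketch with the requests library, assuming pycsw's default collection id metadata:main:

import requests

resp = requests.get(
    'http://localhost:8000/collections/metadata:main/items',
    params={'q': 'soil'},  # OGC API - Records full-text search parameter
    headers={'Accept': 'application/json'},
)
for rec in resp.json().get('features', []):
    print(rec['id'], rec['properties'].get('title'))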
Try also mounting a customised configuration file into the container, so you can tune the configuration of the catalogue. Also have a look at the INSPIRE extension for pycsw.
pyGeoDataCrawler
pyGeoDataCrawler is a tool under active development which builds on the mechanism described above, internally using the libraries mentioned. It crawls a folder structure for metadata and data files; if no metadata file exists for a dataset, it creates one based on metadata derived from the dataset. Discovered metadata files are exported to iso19139, ready to be imported into a catalogue like pycsw or GeoNetwork.
Automated workflows
The tasks above are carried out manually. However, they can also be set up to run automatically on file changes or at regular intervals using cron jobs. With such an approach you can automatically update the catalogue content when, for example, metadata records are updated or added to the git repository (CI/CD). To facilitate the participatory approach, consider including a link from the dataset page in the catalogue back to the git source, inviting users to suggest improvements to the metadata records (as a git issue or pull request).
Evaluate Metadata and Discovery Service
You can evaluate individual iso19139 records in the INSPIRE reference validator. The validator can also evaluate the discovery service itself. If a service is running on localhost, use the tunnel approach to evaluate it. GeoHealthCheck also includes a probe for testing availability of a CSW service.
Access the service from QGIS
QGIS contains a default plugin called MetaSearch which enables catalogue searches from within QGIS. You can find the plugin in the web
menu or on the toolbar as a set of binoculars. Open the plugin. First you need to set up a new service connection. On the services tab, click new, choose a name and add the url http://localhost:8080/csw. Click the serviceinfo
button to view the metadata of the service. Now return to the Search
tab and perform a search. Notice that if you select a search result, it highlights on the map and may trigger the Add data
button in the footer (this depends on if QGIS recognises the protocol mentioned in the metadata).
Read more
At the 2023 masterclass edition, Tom Kralidis presented the geopython ecosystem, including pycsw.
- GitHub: the geopython community welcomes your questions and contributions.
- pygeometa
- pycsw
- pyGeoDataCrawler is a set of scripts to manage MCFs. It supports importing MCFs from a CSV, MCF inheritance, generating an MCF from a data file, etc.
- Model Driven Metadata Editor: a web-based GUI for populating MCFs.