Metadata and attributes#
Note: at the moment this tutorial only applies to surface observation data, not to the other data types.
0. Using the tutorial object store#
To avoid adding the example data we use in this tutorial to your normal object store, we need to tell OpenGHG to use a separate sandboxed object store that we'll call the tutorial store. To do this we use the use_tutorial_store function from openghg.tutorial. This sets the OPENGHG_TUT_STORE environment variable for this session and won't affect your use of OpenGHG outside of this tutorial.
from openghg.tutorial import use_tutorial_store
use_tutorial_store()
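If you'd like to confirm the tutorial store is active, you can check the environment variable mentioned above directly (an optional sanity check):
import os

# Set for this session by use_tutorial_store()
print(os.environ.get("OPENGHG_TUT_STORE"))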
1. Add example observation data#
For this tutorial, we can add some example Tacolneston data to the tutorial object store.
from openghg.tutorial import retrieve_example_data
data_url = "https://github.com/openghg/example_data/raw/main/timeseries/tac_example.tar.gz"
tac_data = retrieve_example_data(url=data_url)
from openghg.standardise import standardise_surface
standardise_surface(filepaths=tac_data, source_format="CRDS", site="TAC", network="DECC")
We’ll also retrieve some dummy alternative data which we can use to demonstrate how to deal with mismatching metadata and attributes.
from openghg.tutorial import retrieve_example_data
data_url = "https://github.com/openghg/example_data/raw/main/timeseries/tac_dummy_attr_mismatch_example.tar.gz"
dummy_data = retrieve_example_data(url=data_url)
We will demonstrate how to add this to the object store below.
2. Metadata and attributes#
Within OpenGHG, metadata is used to categorise a data source by applying a unique set of keys. These keys distinguish between stored data sources and make the data searchable.
For CF compliance and ease of use, we also store a set of internal attributes alongside the data, within the NetCDF file itself (accessible as Dataset.attrs when using xarray, for instance).
The metadata and attributes stored do not necessarily need to be the same for a given data source. Indeed, we may want different information stored as tags in the metadata and more extensive details included in the attributes. However, for any tags (keys) that overlap between the metadata and the attributes, the values must match.
For the current observation data we have available, we can retrieve the carbon dioxide data for the Tacolneston site at the 185m inlet.
from openghg.retrieve import get_obs_surface
co2_data = get_obs_surface(site="tac", species="co2", inlet="185m")
Previously, we have focussed on the data attribute to access the stored data for this output. We can access the metadata details in a similar way by looking at the metadata attribute:
co2_data.metadata
Output:
{'data_type': 'surface',
'site': 'tac',
'instrument': 'picarro',
'sampling_period': '3600.0',
'inlet': '185m',
'port': '10',
'type': 'air',
'network': 'decc',
'species': 'co2',
'calibration_scale': 'wmo-x2007',
'long_name': 'tacolneston',
'inlet_height_magl': '185m',
'data_owner': "Simon O'Doherty",
'data_owner_email': 's.odoherty@bristol.ac.uk',
'station_longitude': 1.13872,
'station_latitude': 52.51775,
'station_long_name': 'Tacolneston Tower, UK',
'station_height_masl': 50.0,
'uuid': 'f3e1ef46-8907-4096-8215-19bd6e1c55e3',
'comment': 'Cavity ring-down measurements. Output from GCWerks',
'conditions_of_use': 'Ensure that you contact the data owner at the outset of your project.',
'source': 'In situ measurements of air',
'Conventions': 'CF-1.8',
'file_created': '2022-12-13 10:23:34.956121+00:00',
'processed_by': 'OpenGHG_Cloud',
'sampling_period_unit': 's',
'scale': 'WMO-X2007'}
You will see this is stored as a dictionary containing the unique keys associated with this data. These details are what allow OpenGHG to search and retrieve specific data sources.
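Because these keys are what OpenGHG indexes, they are also what you pass when searching. As a quick illustration (here using the search_surface function from openghg.retrieve), a search on a subset of these keys should find this same data source; the exact output will depend on what is in your object store:
from openghg.retrieve import search_surface

# Search using a subset of the metadata keys we saw above
results = search_surface(site="tac", species="co2", inlet="185m")
print(results)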
The attributes are associated internally with the data itself:
co2_ds = co2_data.data
co2_ds
Output:
xarray.Dataset
Dimensions:          (time: 39114)
Coordinates:
  * time             (time) datetime64[ns] 2013-01-31T00:13:28 ... 2017-12-...
Data variables:
    mf               (time) float64 401.6 403.4 403.1 ... 411.1 411.1
    mf_variability   (time) float64 0.155 0.088 0.204 ... 0.421 0.325
    mf_number_...    (time) float64 259.0 251.0 252.0 ... 596.0 596.0
Indexes: (1)
Attributes: (25)
To access the attributes we can use the attrs property of an xarray Dataset.
co2_ds.attrs
Output:
{'data_type': 'surface',
'site': 'tac',
'instrument': 'picarro',
'sampling_period': '3600.0',
'inlet': '185m',
'port': '10',
'type': 'air',
'network': 'decc',
'species': 'co2',
'calibration_scale': 'wmo-x2007',
'long_name': 'tacolneston',
'inlet_height_magl': '185m',
'data_owner': "Simon O'Doherty",
'data_owner_email': 's.odoherty@bristol.ac.uk',
'station_longitude': 1.13872,
'station_latitude': 52.51775,
'station_long_name': 'Tacolneston Tower, UK',
'station_height_masl': 50.0,
'uuid': 'f3e1ef46-8907-4096-8215-19bd6e1c55e3',
'comment': 'Cavity ring-down measurements. Output from GCWerks',
'conditions_of_use': 'Ensure that you contact the data owner at the outset of your project.',
'source': 'In situ measurements of air',
'Conventions': 'CF-1.8',
'file_created': '2022-12-13 10:23:34.956121+00:00',
'processed_by': 'OpenGHG_Cloud',
'sampling_period_unit': 's',
'scale': 'WMO-X2007'}
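As noted above, any keys shared between the metadata and the attributes must hold matching values. If you'd like to convince yourself of this, a minimal check along these lines will do (plain Python; we compare string forms since the two stores may hold slightly different types):
# Find the keys present in both the metadata and the attributes
shared_keys = set(co2_data.metadata) & set(co2_ds.attrs)
for key in sorted(shared_keys):
    if str(co2_data.metadata[key]) != str(co2_ds.attrs[key]):
        print(f"Mismatch for '{key}'")
For our retrieved data this should print nothing, since the overlapping values agree.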
Storing attributes in this way means it's easy to create a CF-compliant NetCDF file from the standardised data in the object store, for example using the to_netcdf() method on our Dataset:
# co2_ds.to_netcdf(...)
Here we would uncomment this line and substitute in a filepath.
3. Resolving mismatches#
When the metadata and attributes are created as part of the OpenGHG standardisation process (and when using retrieve_atmospheric), these sets of details are often collated from different sources.
In general:
- attributes are drawn from internal attributes of the data itself
- metadata is drawn from additional external details, including user inputs and the openghg/openghg_defs data repository
Depending on the standardisation procedure, there are cases where there may be a mismatch between these two sets of details. For instance, you may wish to specify a station long name when adding a new data file, but this may conflict with attributes stored within the data file itself. You may also find, when retrieving data from an external source such as the ICOS Carbon Portal, that the attributes stored alongside the retrieved data do not match the definitions stored within the openghg/openghg_defs site_info details for that site.
Though overlapping details stored in the attributes and metadata must match, how these mismatches are handled is up to the user. When adding new data via standardise_surface (or pulling data using retrieve_atmospheric), this can be controlled through the update_mismatch keyword.
In Step 1, you should already have retrieved some dummy data which we can use to demonstrate this. It will have been stored in a variable called dummy_data, which we use below. If you're unable to access this variable, check that the cells in Step 1 have been run.
standardise_surface(filepaths=dummy_data,
                    source_format="openghg",
                    site="TAC",
                    species="co2",
                    network="DECC",
                    inlet="998m",
                    instrument="picarro",
                    sampling_period="1H")
Output:
---------------------------------------------------------------------------
AttrMismatchError Traceback (most recent call last)
...
AttrMismatchError: Metadata mismatch / value not within tolerance for the following keys:
- 'station_long_name', metadata: Tacolneston Tower, UK, attributes: ATTRIBUTE DATA
- 'station_height_masl', metadata: 64, attributes: 50.0
If we try to add this dummy data, we'll see that it fails with an AttrMismatchError. This is because some details stored within the input file (the attributes) don't match the site_info details for that site stored within the openghg/openghg_defs repository. By default update_mismatch is set to "never", which means a mismatch produces an error rather than OpenGHG guessing how to resolve it.
The error message above also tells us what doesn't match:
- station_long_name: metadata "Tacolneston Tower, UK" vs attributes "ATTRIBUTE DATA"
- station_height_masl: metadata 64 vs attributes 50.0
We can choose how we want to resolve this using the options for the update_mismatch keyword:
- "from_source" (or "attributes"): use the value(s) included within the current attributes
- "from_definition" (or "metadata"): use the value(s) included within the current metadata
In this case, we choose to use the details from the metadata (derived from the site_info details) by running standardise_surface again, this time with update_mismatch="from_definition".
standardise_surface(filepaths=dummy_data,
                    source_format="openghg",
                    site="TAC",
                    species="co2",
                    network="DECC",
                    inlet="998m",
                    instrument="picarro",
                    sampling_period="1H",
                    update_mismatch="from_definition")
This should now run without error (warnings will be printed and logged instead). We can now retrieve this newly added data:
dummy_obs = get_obs_surface(site="tac", species="co2", inlet="998m")
We can look at the station_long_name stored within the metadata:
dummy_obs.metadata["station_long_name"]
Output:
'Tacolneston Tower, UK'
and attributes:
dummy_obs.data.attrs["station_long_name"]
Output:
'Tacolneston Tower, UK'
to check this is what we expected.
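We can run the same check for the other mismatched key, station_height_masl, where we would expect the value from the site_info definition (64, per the error message above) in both places:
# Both the metadata and the attributes should now hold the definition's value
print(dummy_obs.metadata["station_height_masl"])
print(dummy_obs.data.attrs["station_height_masl"])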
4. Cleanup#
If you’re finished with the data in this tutorial you can cleanup the
tutorial object store using the clear_tutorial_store
function.
from openghg.tutorial import clear_tutorial_store
clear_tutorial_store()