Adding observation data#

This tutorial demonstrates how OpenGHG can be used to process new measurement data, search the data present in the object store, and retrieve it for analysis and visualisation.

What is an object store?#

Each object and piece of data in the object store is stored at a specific key, which can be thought of as the address of the data. The data is stored in a bucket which in the cloud is a section of the OpenGHG object store. Locally a bucket is just a normal directory in the user’s filesystem specified by the path given in the configuration file at ~/.config/openghg/openghg.conf.
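Conceptually, a local object store maps each key onto a path beneath the bucket directory. A minimal sketch of the idea, using an illustrative key layout rather than OpenGHG's actual internal key scheme:

```python
from pathlib import Path
import tempfile

# The bucket is just a directory; each key becomes a path beneath it.
bucket = Path(tempfile.mkdtemp())

def write_object(key: str, data: bytes) -> None:
    """Store data at the given key inside the bucket."""
    path = bucket / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)

def read_object(key: str) -> bytes:
    """Retrieve the data stored at the given key."""
    return (bucket / key).read_bytes()

# An illustrative key, not OpenGHG's real key scheme
write_object("surface/tac/ch4/data", b"measurements")
print(read_object("surface/tac/ch4/data"))  # b'measurements'
```

Retrieving data is then just a matter of looking up the key, which is why the key can be thought of as the data's address.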

0. Using the tutorial object store#

An object store is a folder with a fixed structure within which openghg can read and write data. To avoid adding the example data we use in this tutorial to your normal object store, we need to tell OpenGHG to use a separate sandboxed object store that we’ll call the tutorial store. To do this we use the use_tutorial_store function from openghg.tutorial. This sets the OPENGHG_TUT_STORE environment variable for this session and won’t affect your use of OpenGHG outside of this tutorial.

from openghg.tutorial import use_tutorial_store

use_tutorial_store()

1. Adding and standardising data#

Adding and standardising surface data#

Note

Outside of this tutorial, if you have write access to multiple object stores you will need to pass the name of the object store you wish to write to via the store argument of the standardise functions.

Source formats#

OpenGHG can process and store several source formats in the object store, including data from the AGAGE, DECC, NOAA, LondonGHG and BEACO2N networks. The process of adding data to the object store is called standardisation.

To standardise a new data file, you must specify the source format and other keywords for the data. Which keywords are required depends on the source format itself, as some details can be inferred from the data or may not be relevant. For the full list of accepted observation inputs and source formats, call the function summary_source_formats:

from openghg.standardise import summary_source_formats

summary = summary_source_formats()

## UNCOMMENT THIS CODE TO SHOW ALL ENTRIES
# import pandas as pd; pd.set_option('display.max_rows', None)

summary
Source format Site code Instrument Network Species file_format Long name Platform
0 CRDS RPB CRDS AGAGE NaN NaN Ragged Point, Barbados surface site
1 CRDS HFD CRDS DECC NaN NaN Heathfield, UK surface site
2 CRDS BSD CRDS DECC NaN NaN Bilsdale, UK surface site
3 CRDS TTA CRDS DECC NaN NaN Angus Tower, UK surface site
4 CRDS RGL CRDS DECC NaN NaN Ridge Hill, UK surface site
... ... ... ... ... ... ... ... ...
357 NIWA BHD NaN NaN NaN NaN Baring Head, New Zealand surface site
358 NIWA LAU NaN NaN NaN NaN Lauder, New Zealand surface site
359 NIWA MKG NaN NaN NaN NaN Rainbow Mountain, New Zealand surface site
360 NIWA MKH NaN NaN NaN NaN Manakau Heads, New Zealand surface site
361 NIWA WIN NaN NaN NaN NaN Winchmore, New Zealand surface site

362 rows × 8 columns

There may be multiple source formats for a given site. For instance, the Tacolneston site in the UK (site code “TAC”) has four entries:

summary[summary["Site code"] == "TAC"]
Source format Site code Instrument Network Species file_format Long name Platform
5 CRDS TAC CRDS DECC NaN NaN Tacolneston Tower, UK surface site
32 GCWERKS TAC GCMD DECC NaN NaN Tacolneston Tower, UK surface site
34 GCWERKS TAC medusa DECC NaN NaN Tacolneston Tower, UK surface site
315 NOAA TAC NaN NaN NaN NaN Tacolneston Tower, UK surface site

Let’s see what data is available for a given source. First, we’ll list all source formats.

summary["Source format"].unique()
array(['CRDS', 'GCWERKS', 'AGAGE', 'ICOS', 'NOAA', 'NPL', 'NIWA'],
      dtype=object)

Now we’ll find all data with source format "CRDS".

summary[summary["Source format"] == "CRDS"]
Source format Site code Instrument Network Species file_format Long name Platform
0 CRDS RPB CRDS AGAGE NaN NaN Ragged Point, Barbados surface site
1 CRDS HFD CRDS DECC NaN NaN Heathfield, UK surface site
2 CRDS BSD CRDS DECC NaN NaN Bilsdale, UK surface site
3 CRDS TTA CRDS DECC NaN NaN Angus Tower, UK surface site
4 CRDS RGL CRDS DECC NaN NaN Ridge Hill, UK surface site
5 CRDS TAC CRDS DECC NaN NaN Tacolneston Tower, UK surface site

DECC network#

We will start by adding data to the object store from Tacolneston, which is a surface site in the DECC network. (Data at surface sites is measured in-situ.)

First we retrieve the raw data.

from openghg.tutorial import retrieve_example_data

data_url = "https://github.com/openghg/example_data/raw/main/timeseries/tac_example.tar.gz"

tac_data = retrieve_example_data(url=data_url)

Now we add this data to the object store using standardise_surface, passing the following arguments:

  • filepath: list of paths to .dat files

  • site: "TAC", the site code for Tacolneston

  • network: "DECC"

  • source_format: "CRDS", the type of data we want to process

from openghg.standardise import standardise_surface

decc_results = standardise_surface(filepath=tac_data, source_format="CRDS", site="TAC", network="DECC")

decc_results
[{'uuid': '251497ae-2790-4bcc-a2f7-df29f67d92c7',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '54m',
  'network': 'decc',
  'species': 'ch4',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.54m.dat'},
 {'uuid': '20aaa696-c415-4fca-9966-62ded2f4a83a',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '54m',
  'network': 'decc',
  'species': 'co2',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.54m.dat'},
 {'uuid': '5e58a6d0-063e-4bc2-a091-e6ca83d51730',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '100m',
  'network': 'decc',
  'species': 'ch4',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.100m.dat'},
 {'uuid': 'b07b189d-4acc-40cd-88e2-e74937a581e0',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '100m',
  'network': 'decc',
  'species': 'co2',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.100m.dat'},
 {'uuid': 'b2bf58e1-bce6-4ed4-bffb-3ccff1a8114b',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '185m',
  'network': 'decc',
  'species': 'ch4',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.185m.dat'},
 {'uuid': '131f8f2e-e1b4-43ab-85da-037a3a4150e9',
  'new': True,
  'site': 'tac',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '185m',
  'network': 'decc',
  'species': 'co2',
  'data_source': 'internal',
  'source_format': 'CRDS',
  'file': 'tac.picarro.hourly.185m.dat'}]

This extracts the data and metadata from the files, standardises them, and adds them to our object store. The site and network keywords, along with details extracted from the data itself, allow us to store the data uniquely.

The returned decc_results dictionary shows how the data has been stored: each file has been split into several entries, each with a unique ID (UUID). Each entry is known as a Datasource (see Note on Datasources).

The decc_results output includes details of the processed data and tells us that the data has been stored correctly. This will also tell us if any errors have been encountered when trying to access and standardise this data.
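The returned list is plain Python, so it is straightforward to summarise what was stored. For example, grouping species by inlet, using a trimmed, illustrative copy of the entries shown above:

```python
from collections import defaultdict

# A trimmed copy of the decc_results entries shown above
decc_results = [
    {"inlet": "54m", "species": "ch4"},
    {"inlet": "54m", "species": "co2"},
    {"inlet": "100m", "species": "ch4"},
    {"inlet": "100m", "species": "co2"},
    {"inlet": "185m", "species": "ch4"},
    {"inlet": "185m", "species": "co2"},
]

# Group the stored species by inlet height
species_by_inlet = defaultdict(list)
for entry in decc_results:
    species_by_inlet[entry["inlet"]].append(entry["species"])

print(dict(species_by_inlet))
# → {'54m': ['ch4', 'co2'], '100m': ['ch4', 'co2'], '185m': ['ch4', 'co2']}
```

This confirms that each file produced one entry per species at each inlet height.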

AGAGE data#

OpenGHG can also process data from the AGAGE network.

Historically, the AGAGE network has produced output files from GCWERKS alongside a separate precisions file. To use this form of input, we create a tuple containing the data filename and the precisions filename.

First we retrieve example data from the Cape Grim station in Australia (site code “CGO”).

cgo_url = "https://github.com/openghg/example_data/raw/main/timeseries/capegrim_example.tar.gz"

capegrim_data = retrieve_example_data(url=cgo_url)

capegrim_data is a list of two file paths, one for the data file and one for the precisions file, for example:

from pathlib import Path

base_path = Path.home() / "openghg_store" / "tutorial_store" / "extracted_files"
files = [
    base_path / "capegrim.18.C",
    base_path / "capegrim.18.precisions.C"
]

We put the data file and precisions file into a tuple:

capegrim_tuple = (capegrim_data[0], capegrim_data[1])

We can add these files to the object store in the same way as the DECC data by including the right arguments:

  • filepath: tuple (or list of tuples) with paths to data and precision files

  • site (site code): "CGO"

  • network: "AGAGE"

  • instrument: "medusa"

  • source_format (data type): "GCWERKS"

agage_results = standardise_surface(filepath=capegrim_tuple, source_format="GCWERKS", site="CGO",
                              network="AGAGE", instrument="medusa")
agage_results
[{'uuid': '76f0c1d7-dbd1-44e4-9d3f-899470dafec1',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'ch4',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '93bc1191-c6e5-4131-8ddb-8160972076fa',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'cfc12',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '925aaf01-957c-49f1-aa02-546d4f260540',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'n2o',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': 'cbd5e422-2f6b-441b-b5ea-c4bc686b39a9',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'cfc11',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '33468231-e9e9-4ad1-be79-a4d3c9abf28b',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'cfc113',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '988aab7e-4a0e-47d7-9052-2638b9213491',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'chcl3',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': 'fe675202-afde-49d5-8548-52d004e76ae6',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'ch3ccl3',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '44c48902-9b2f-4af0-9b36-f988b6254366',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'ccl4',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '60d04003-eef1-4a72-b3ae-ea243c8a0e07',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'h2',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': '29db5230-1753-443f-a753-36ee92c24426',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'co',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'},
 {'uuid': 'df1a12cc-4bcc-4504-9170-e192f035e660',
  'new': True,
  'instrument': 'medusa',
  'site': 'cgo',
  'network': 'agage',
  'sampling_period': '1200',
  'inlet': '70m',
  'species': 'ne',
  'data_source': 'internal',
  'source_format': 'GCWERKS',
  'file': 'capegrim.18.C'}]

When viewing agage_results, a large number of Datasource UUIDs will be shown due to the large number of gases in each data file.

However, the AGAGE network has recently begun to also produce netCDF files, which are processed by Matt Rigby’s agage-archive repository. These files are split by site, species and instrument and do not need an accompanying precisions file. They can also be read in by the openghg.standardise.standardise_surface function, with the arguments:

  • filepath: filepath to the .nc file

  • site (site code): "CGO"

  • source_format (data type): "AGAGE"

  • network: "AGAGE"

  • instrument: "medusa"

The data will be processed in the same way as the old AGAGE data, and stored in the object store accordingly. Ensure that the source_format argument matches the input filetype, as the two are not compatible.

Note on Datasources#

Datasources are objects that are stored in the object store that hold the data and metadata associated with each measurement we upload to the platform.

For example, if we upload a file that contains readings for three gas species from a single site at a specific inlet height OpenGHG will assign this data to three different Datasources, one for each species. Metadata such as the site, inlet height, species, network etc are stored alongside the measurements for easy searching.

Datasources can also handle multiple versions of data from a single site, so if scales or other factors change multiple versions may be stored for easy future comparison.

Other keywords#

When adding data in this way, there are other keywords which can be used to distinguish between different data sets as required, including:

  • instrument: Name of the instrument

  • sampling_period: The time taken for each measurement to be sampled

  • data_level: The level of quality control which has been applied to the data.

  • data_sublevel: Optional level to include between data levels. Typically for level 1 data where multiple steps of initial QA may have been applied.

  • dataset_source: Name of the dataset if data is taken from a larger source e.g. from an ObsPack

See the standardise_surface documentation for a full list of inputs.

Informational keywords#

In addition to the keywords described above, which are used to distinguish between the different data sets being stored, the following informational details can also be added to help describe the data.

Using the tag keyword#

The tag keyword allows one or more short labels to be specified, and these can be shared across multiple data sources. For instance, data from different sites associated with a particular project could all be added with the same tag. Below we add the same data as above with tags:

  • Tacolneston (TAC) data with a tag of “project1”

  • Cape Grim (CGO) data with a tag of both “project1” and “project2”

from openghg.standardise import standardise_surface

decc_results = standardise_surface(filepath=tac_data,
                                   source_format="CRDS",
                                   site="TAC",
                                   network="DECC",
                                   tag="project1",
                                   force=True)

agage_results = standardise_surface(filepath=capegrim_tuple,
                                    source_format="GCWERKS",
                                    site="CGO",
                                    network="AGAGE",
                                    instrument="medusa",
                                    tag=["project1", "project2"],
                                    force=True)

Note: here we included the force=True keyword because we are re-adding data which was already added in a previous step of the tutorial - see the “Updating existing data” tutorial for more details.

As will be covered in the 2. Searching for data section, these keywords can then be used when searching the object store. The tag keyword can be used to return all data which includes the chosen tag.
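The membership-style matching used for tag can be sketched in plain Python, contrasted with the exact matching used for keywords such as site. The metadata records below are illustrative, mirroring the tags added above:

```python
# Illustrative metadata records for the data tagged above
records = [
    {"site": "tac", "tag": ["project1"]},
    {"site": "cgo", "tag": ["project1", "project2"]},
]

def search(records, *, site=None, tag=None):
    """Exact match on site; membership match on tag."""
    hits = records
    if site is not None:
        hits = [r for r in hits if r["site"] == site]
    if tag is not None:
        hits = [r for r in hits if tag in r["tag"]]
    return hits

print(len(search(records, tag="project1")))  # 2: both sites carry this tag
print(len(search(records, tag="project2")))  # 1: only the CGO data
```

A tag search therefore returns every datasource whose tag list contains the requested label, however many other tags it also carries.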

Adding informational keys#

Informational keys and associated values can also be added using the info_metadata input. The most common example for this would be to add a comment input. For example:

decc_results = standardise_surface(filepath=tac_data,
                                   source_format="CRDS",
                                   site="TAC",
                                   network="DECC",
                                   info_metadata={"comment": "Automatic quality checks have been applied."})

Note that both info_metadata and tag are available for all data types (not just observations).

Multiple stores#

If you have write access to more than one object store you’ll need to pass in the name of that store to the store argument. So instead of the standardise_surface call above, we’ll tell it to write to our default user object store. This is our default local object store created when we run openghg --quickstart.

from openghg.standardise import standardise_surface

decc_results = standardise_surface(filepath=tac_data, source_format="CRDS", site="TAC", network="DECC", store="user")

The store argument can be passed to any of the standardise functions in OpenGHG and is required if you have write access to more than one store.

Adding and standardising column data#

Similar to the surface data, we can also add column data to the object store. Column data can come from two platforms: “site-column” and “satellite”.

The input formats supported for standardise_column are:

The raw GOSAT data from the University of Leicester can be pre-processed to match our expected internal “openghg” format using the ACRG repository, and then added to an object store with the “openghg” source_format. These routines also allow satellite data points to be selected within a specific area, downsampled onto a specific grid, or filtered based on criteria or flags.

To demonstrate this we will retrieve some example data (pre-processed methane column data from the GOSAT satellite)

satellite_data_url = "https://github.com/openghg/example_data/raw/main/column/gosat-fts_gosat_20160101_ch4-column.nc.tar.gz"

satellite_data = retrieve_example_data(url=satellite_data_url)

Now we add this data to the object store using standardise_column, passing the arguments shown below:

Note

For column data measured at a site, the satellite argument is replaced with the site argument and platform is set to “site-column”. (Inversions check the platform value to determine whether the data is satellite or site-column data.)

from openghg.standardise import standardise_column

standardise_column(
                filepath=satellite_data[-1],
                species="ch4",
                platform="satellite",
                satellite="gosat",
                obs_region="brazil",
                network="gosat",
            )
[{'uuid': 'a8951864-2f3a-4afc-90ad-c366606b7312',
  'new': True,
  'satellite': 'gosat',
  'species': 'ch4',
  'network': 'gosat',
  'platform': 'satellite',
  'obs_region': 'brazil',
  'file': 'gosat-fts_gosat_20160101_ch4-column.nc'}]

Note

For this GOSAT data we have selected measurement points over Brazil only. To describe this we have used the keyword obs_region=”brazil”. For our ancillary data (Adding ancillary spatial data) we show how the domain keyword is used to describe a specific area covered by our 2D lat-lon maps. If the observation region selected from our satellite data corresponds exactly to a known domain, we can use domain in place of obs_region when adding the data.

Adding an in-memory dataset to the object store#

Up to this point, we have seen how to specify a file path and add data to the object store. In many workflows, however, you may already have your data available in memory as an xarray.Dataset. For the appropriate source_format inputs, the standardise_surface function also supports this workflow, allowing you to pass a dataset directly without needing to first write it to disk.

This approach is especially useful when your data has already been processed in Python or retrieved from another source (such as a remote server or API) and you want to store it straight away.

Let’s start by importing xarray and converting an example file into an xarray.Dataset:

import xarray as xr

data_url = "https://github.com/openghg/example_data/raw/main/timeseries/decc-picarro_co2.tar.gz"

tac_openghg_data = retrieve_example_data(url=data_url)
data = xr.open_dataset(tac_openghg_data[0])

Now that we have our dataset in memory, we can provide it directly to the dataset argument of standardise_surface. This will standardise the data and add it to the object store just as if we had supplied a file path:

from openghg.standardise import standardise_surface

decc_results = standardise_surface(
    data=data,
    source_format="openghg",
    site="TAC",
    network="DECC",
    instrument="picarro",
    sampling_period="1h",
    tag="in-memory-dataset",
)

The details of the added data can be viewed in the returned results dictionary.

decc_results
[{'uuid': 'dc9dc901-a753-475c-84d2-cf9c26afb274',
  'new': True,
  'site': 'tac',
  'species': 'co2',
  'network': 'decc',
  'instrument': 'picarro',
  'sampling_period': '3600.0',
  'inlet': '185m',
  'data_source': 'internal',
  'source_format': 'OPENGHG'}]
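Note that the sampling_period we supplied as “1h” appears in the stored metadata as seconds (“3600.0”). A sketch of that kind of normalisation (OpenGHG’s own parsing may differ):

```python
# Map simple period unit suffixes onto seconds; illustrative, not OpenGHG's parser
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def period_to_seconds(period: str) -> str:
    """Convert a string like '1h' into a seconds string like '3600.0'."""
    value, unit = float(period[:-1]), period[-1]
    return str(value * UNIT_SECONDS[unit])

print(period_to_seconds("1h"))   # 3600.0
print(period_to_seconds("20m"))  # 1200.0
```

Storing periods in a single unit means searches do not need to compare "1h" against "3600.0" as different strings.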

2. Searching for data#

Searching the object store#

We can search the object store by property using the search_surface(...) function. This function retrieves all of the metadata associated with the search query from the data in the object store.

For example, we can find all sites which have measurements for trichlorofluoromethane (CFC-11, “cfc11”) using the species keyword:

from openghg.retrieve import search_surface

cfc_results = search_surface(species="cfc11")
cfc_results.results
instrument site network sampling_period units calibration_scale inlet species data_type species_alt ... source_format uuid tag period latest_version timestamp start_date end_date versions object_store
0 medusa cgo agage 1200 ppt sio-05 70m cfc11 surface cfc-11 ... gcwerks cbd5e422-2f6b-441b-b5ea-c4bc686b39a9 [project1, project2] 1200s v2 2025-11-15 21:14:23.857926+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store

1 rows × 32 columns

We could also look for details of all the data measured at the Tacolneston (“TAC”) site using the site keyword:

tac_results = search_surface(site="tac")
tac_results.results
site instrument sampling_period inlet type port network species calibration_scale long_name ... source_format uuid tag period latest_version timestamp start_date end_date versions object_store
0 tac picarro 3600.0 54m air 8 decc ch4 wmo-x2004a tacolneston ... crds 251497ae-2790-4bcc-a2f7-df29f67d92c7 [project1] 3600s v2 2025-11-15 21:14:22.973351+00:00 2012-07-26 11:23:05+00:00 2017-12-19 11:25:19+00:00 {'v1': ['2012-07-26-11:23:05+00:00_2017-12-19-... /home/runner/openghg_store/tutorial_store
1 tac picarro 3600.0 54m air 8 decc co2 wmo-x2019 tacolneston ... crds 20aaa696-c415-4fca-9966-62ded2f4a83a [project1] 3600s v2 2025-11-15 21:14:22.994671+00:00 2012-07-26 11:23:05+00:00 2017-12-19 11:25:19+00:00 {'v1': ['2012-07-26-11:23:05+00:00_2017-12-19-... /home/runner/openghg_store/tutorial_store
2 tac picarro 3600.0 100m air 9 decc ch4 wmo-x2004a tacolneston ... crds 5e58a6d0-063e-4bc2-a091-e6ca83d51730 [project1] 3600s v2 2025-11-15 21:14:23.221427+00:00 2012-07-26 11:04:07+00:00 2018-01-01 00:23:04+00:00 {'v1': ['2012-07-26-11:04:07+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
3 tac picarro 3600.0 100m air 9 decc co2 wmo-x2019 tacolneston ... crds b07b189d-4acc-40cd-88e2-e74937a581e0 [project1] 3600s v2 2025-11-15 21:14:23.244944+00:00 2012-07-26 11:04:07+00:00 2018-01-01 00:23:04+00:00 {'v1': ['2012-07-26-11:04:07+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
4 tac picarro 3600.0 185m air 10 decc ch4 wmo-x2004a tacolneston ... crds b2bf58e1-bce6-4ed4-bffb-3ccff1a8114b [project1] 3600s v2 2025-11-15 21:14:23.451239+00:00 2013-01-31 00:13:28+00:00 2018-01-01 00:53:06+00:00 {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
5 tac picarro 3600.0 185m air 10 decc co2 wmo-x2019 tacolneston ... crds 131f8f2e-e1b4-43ab-85da-037a3a4150e9 [project1] 3600s v2 2025-11-15 21:14:23.473931+00:00 2013-01-31 00:13:28+00:00 2018-01-01 00:53:06+00:00 {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
6 tac picarro 3600.0 185m NaN NaN decc co2 wmo-x2019 NaN ... openghg dc9dc901-a753-475c-84d2-cf9c26afb274 [in-memory-dataset] 3600s v1 2025-11-15 21:14:25.481796+00:00 2013-01-31 00:05:00+00:00 2022-09-29 12:44:59+00:00 {'v1': ['2013-01-31-00:05:00+00:00_2022-09-29-... /home/runner/openghg_store/tutorial_store

7 rows × 33 columns

For this site you can see that the results contain details of each species as well as the inlet heights at which these were measured.

Searching by tag keyword#

We can also search by the tag keyword when this has been set. Because the tag keyword can contain multiple values, this search finds all the datasources whose tag list includes the given value (rather than requiring an exact match like the other keywords).

For the “TAC” and “CGO” data we added the “project1” tag and so this data can be found using this keyword:

results = search_surface(tag="project1")
results.results
site instrument sampling_period inlet type port network species calibration_scale long_name ... tag period latest_version timestamp start_date end_date versions object_store units species_alt
0 tac picarro 3600.0 54m air 8 decc ch4 wmo-x2004a tacolneston ... [project1] 3600s v2 2025-11-15 21:14:22.973351+00:00 2012-07-26 11:23:05+00:00 2017-12-19 11:25:19+00:00 {'v1': ['2012-07-26-11:23:05+00:00_2017-12-19-... /home/runner/openghg_store/tutorial_store NaN NaN
1 tac picarro 3600.0 54m air 8 decc co2 wmo-x2019 tacolneston ... [project1] 3600s v2 2025-11-15 21:14:22.994671+00:00 2012-07-26 11:23:05+00:00 2017-12-19 11:25:19+00:00 {'v1': ['2012-07-26-11:23:05+00:00_2017-12-19-... /home/runner/openghg_store/tutorial_store NaN NaN
2 tac picarro 3600.0 100m air 9 decc ch4 wmo-x2004a tacolneston ... [project1] 3600s v2 2025-11-15 21:14:23.221427+00:00 2012-07-26 11:04:07+00:00 2018-01-01 00:23:04+00:00 {'v1': ['2012-07-26-11:04:07+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store NaN NaN
3 tac picarro 3600.0 100m air 9 decc co2 wmo-x2019 tacolneston ... [project1] 3600s v2 2025-11-15 21:14:23.244944+00:00 2012-07-26 11:04:07+00:00 2018-01-01 00:23:04+00:00 {'v1': ['2012-07-26-11:04:07+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store NaN NaN
4 tac picarro 3600.0 185m air 10 decc ch4 wmo-x2004a tacolneston ... [project1] 3600s v2 2025-11-15 21:14:23.451239+00:00 2013-01-31 00:13:28+00:00 2018-01-01 00:53:06+00:00 {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store NaN NaN
5 tac picarro 3600.0 185m air 10 decc co2 wmo-x2019 tacolneston ... [project1] 3600s v2 2025-11-15 21:14:23.473931+00:00 2013-01-31 00:13:28+00:00 2018-01-01 00:53:06+00:00 {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store NaN NaN
6 cgo medusa 1200 70m NaN NaN agage ch4 tu1987 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.820665+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppb NaN
7 cgo medusa 1200 70m NaN NaN agage cfc12 sio-05 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.833160+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt cfc-12
8 cgo medusa 1200 70m NaN NaN agage n2o sio-16 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.845745+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppb NaN
9 cgo medusa 1200 70m NaN NaN agage cfc11 sio-05 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.857926+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt cfc-11
10 cgo medusa 1200 70m NaN NaN agage cfc113 sio-05 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.870309+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt cfc-113
11 cgo medusa 1200 70m NaN NaN agage chcl3 sio-98 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.882756+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt NaN
12 cgo medusa 1200 70m NaN NaN agage ch3ccl3 sio-05 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.896105+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt NaN
13 cgo medusa 1200 70m NaN NaN agage ccl4 sio-05 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.908491+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppt NaN
14 cgo medusa 1200 70m NaN NaN agage h2 mpi-2009 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.920602+00:00 2018-01-01 00:30:00+00:00 2018-12-27 00:02:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-27-... /home/runner/openghg_store/tutorial_store ppb h2_pdd
15 cgo medusa 1200 70m NaN NaN agage co csiro-94 NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.933063+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store ppb NaN
16 cgo medusa 1200 70m NaN NaN agage ne na NaN ... [project1, project2] 1200s v2 2025-11-15 21:14:23.944938+00:00 2018-01-01 00:30:00+00:00 2018-11-19 10:54:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-11-19-... /home/runner/openghg_store/tutorial_store na ne_pdd

17 rows × 35 columns

For the “CGO” data we also included the “project2” tag, and we can find this data by searching for it:

results = search_surface(tag="project2")
results.results
instrument site network sampling_period units calibration_scale inlet species data_type inlet_height_magl ... uuid tag period latest_version timestamp start_date end_date versions object_store species_alt
0 medusa cgo agage 1200 ppb tu1987 70m ch4 surface 70 ... 76f0c1d7-dbd1-44e4-9d3f-899470dafec1 [project1, project2] 1200s v2 2025-11-15 21:14:23.820665+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
1 medusa cgo agage 1200 ppt sio-05 70m cfc12 surface 70 ... 93bc1191-c6e5-4131-8ddb-8160972076fa [project1, project2] 1200s v2 2025-11-15 21:14:23.833160+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store cfc-12
2 medusa cgo agage 1200 ppb sio-16 70m n2o surface 70 ... 925aaf01-957c-49f1-aa02-546d4f260540 [project1, project2] 1200s v2 2025-11-15 21:14:23.845745+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
3 medusa cgo agage 1200 ppt sio-05 70m cfc11 surface 70 ... cbd5e422-2f6b-441b-b5ea-c4bc686b39a9 [project1, project2] 1200s v2 2025-11-15 21:14:23.857926+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store cfc-11
4 medusa cgo agage 1200 ppt sio-05 70m cfc113 surface 70 ... 33468231-e9e9-4ad1-be79-a4d3c9abf28b [project1, project2] 1200s v2 2025-11-15 21:14:23.870309+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store cfc-113
5 medusa cgo agage 1200 ppt sio-98 70m chcl3 surface 70 ... 988aab7e-4a0e-47d7-9052-2638b9213491 [project1, project2] 1200s v2 2025-11-15 21:14:23.882756+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
6 medusa cgo agage 1200 ppt sio-05 70m ch3ccl3 surface 70 ... fe675202-afde-49d5-8548-52d004e76ae6 [project1, project2] 1200s v2 2025-11-15 21:14:23.896105+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
7 medusa cgo agage 1200 ppt sio-05 70m ccl4 surface 70 ... 44c48902-9b2f-4af0-9b36-f988b6254366 [project1, project2] 1200s v2 2025-11-15 21:14:23.908491+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
8 medusa cgo agage 1200 ppb mpi-2009 70m h2 surface 70 ... 60d04003-eef1-4a72-b3ae-ea243c8a0e07 [project1, project2] 1200s v2 2025-11-15 21:14:23.920602+00:00 2018-01-01 00:30:00+00:00 2018-12-27 00:02:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-27-... /home/runner/openghg_store/tutorial_store h2_pdd
9 medusa cgo agage 1200 ppb csiro-94 70m co surface 70 ... 29db5230-1753-443f-a753-36ee92c24426 [project1, project2] 1200s v2 2025-11-15 21:14:23.933063+00:00 2018-01-01 00:30:00+00:00 2018-12-31 23:24:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-12-31-... /home/runner/openghg_store/tutorial_store NaN
10 medusa cgo agage 1200 na na 70m ne surface 70 ... df1a12cc-4bcc-4504-9170-e192f035e660 [project1, project2] 1200s v2 2025-11-15 21:14:23.944938+00:00 2018-01-01 00:30:00+00:00 2018-11-19 10:54:59+00:00 {'v1': ['2018-01-01-00:30:00+00:00_2018-11-19-... /home/runner/openghg_store/tutorial_store ne_pdd

11 rows × 32 columns

Quickly retrieve data#

Say we want to retrieve all of the co2 data from Tacolneston. We can perform a search and expect a SearchResults object to be returned; if no results are found, None is returned.

results = search_surface(site="tac", species="co2")
results.results
site instrument sampling_period inlet type port network species calibration_scale long_name ... source_format uuid tag period latest_version timestamp start_date end_date versions object_store
0 tac picarro 3600.0 54m air 8 decc co2 wmo-x2019 tacolneston ... crds 20aaa696-c415-4fca-9966-62ded2f4a83a [project1] 3600s v2 2025-11-15 21:14:22.994671+00:00 2012-07-26 11:23:05+00:00 2017-12-19 11:25:19+00:00 {'v1': ['2012-07-26-11:23:05+00:00_2017-12-19-... /home/runner/openghg_store/tutorial_store
1 tac picarro 3600.0 100m air 9 decc co2 wmo-x2019 tacolneston ... crds b07b189d-4acc-40cd-88e2-e74937a581e0 [project1] 3600s v2 2025-11-15 21:14:23.244944+00:00 2012-07-26 11:04:07+00:00 2018-01-01 00:23:04+00:00 {'v1': ['2012-07-26-11:04:07+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
2 tac picarro 3600.0 185m air 10 decc co2 wmo-x2019 tacolneston ... crds 131f8f2e-e1b4-43ab-85da-037a3a4150e9 [project1] 3600s v2 2025-11-15 21:14:23.473931+00:00 2013-01-31 00:13:28+00:00 2018-01-01 00:53:06+00:00 {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-... /home/runner/openghg_store/tutorial_store
3 tac picarro 3600.0 185m NaN NaN decc co2 wmo-x2019 NaN ... openghg dc9dc901-a753-475c-84d2-cf9c26afb274 [in-memory-dataset] 3600s v1 2025-11-15 21:14:25.481796+00:00 2013-01-31 00:05:00+00:00 2022-09-29 12:44:59+00:00 {'v1': ['2013-01-31-00:05:00+00:00_2022-09-29-... /home/runner/openghg_store/tutorial_store

4 rows × 33 columns

We can retrieve either some or all of the data easily using the retrieve method.

inlet_54m_data = results.retrieve(inlet="54m")
inlet_54m_data
ObsData(metadata={'site': 'tac', 'instrument': 'picarro', 'sampling_period': '3600.0', 'inlet': '54m', ...}, uuid=20aaa696-c415-4fca-9966-62ded2f4a83a)

Or we can retrieve all of the data with retrieve_all, which returns a list of ObsData objects.

all_co2_data = results.retrieve_all()
all_co2_data
[ObsData(metadata={'site': 'tac', 'instrument': 'picarro', 'sampling_period': '3600.0', 'inlet': '54m', ...}, uuid=20aaa696-c415-4fca-9966-62ded2f4a83a),
 ObsData(metadata={'site': 'tac', 'instrument': 'picarro', 'sampling_period': '3600.0', 'inlet': '100m', ...}, uuid=b07b189d-4acc-40cd-88e2-e74937a581e0),
 ObsData(metadata={'site': 'tac', 'instrument': 'picarro', 'sampling_period': '3600.0', 'inlet': '185m', ...}, uuid=131f8f2e-e1b4-43ab-85da-037a3a4150e9),
 ObsData(metadata={'site': 'tac', 'species': 'co2', 'network': 'decc', 'instrument': 'picarro', ...}, uuid=dc9dc901-a753-475c-84d2-cf9c26afb274)]

3. Retrieving data#

To retrieve the standardised data from the object store there are several functions we can use, depending on the type of data we want to access.

To access the surface data we have added so far we can use the get_obs_surface function, passing keywords for the site code, species and inlet height. The get_* functions return only a single dataset, and will give details of the matches if the keywords match more than one.

In this case we want to extract the carbon dioxide (“co2”) data from the Tacolneston (“TAC”) site, measured at the “185m” inlet and tagged “project1”:

from openghg.retrieve import get_obs_surface

co2_data = get_obs_surface(site="tac", species="co2", inlet="185m", tag="project1")

If we view the returned co2_data variable this will contain:

  • data - The standardised data (accessed using co2_data.data). This is returned as an xarray Dataset.

  • metadata - The associated metadata (accessed using co2_data.metadata).

co2_data
ObsData(metadata={'site': 'tac', 'instrument': 'picarro', 'sampling_period': '3600.0', 'inlet': '185m', ...}, uuid=131f8f2e-e1b4-43ab-85da-037a3a4150e9)
co2_data.data
<xarray.Dataset> Size: 1MB
Dimensions:                    (time: 39114)
Coordinates:
  * time                       (time) datetime64[ns] 313kB 2013-01-31T00:13:2...
Data variables:
    mf                         (time) float64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
    mf_number_of_observations  (time) int64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
    mf_variability             (time) float64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
Attributes: (12/28)
    Conventions:           CF-1.8
    comment:               Cavity ring-down measurements. Output from GCWerks
    conditions_of_use:     Ensure that you contact the data owner at the outs...
    data_owner:            Simon O'Doherty
    data_owner_email:      s.odoherty@bristol.ac.uk
    data_source:           internal
    ...                    ...
    station_latitude:      52.51882
    station_long_name:     Tacolneston Tower, UK
    station_longitude:     1.1387
    tag:                   project1
    type:                  air
    scale:                 WMO-X2019
co2_data.metadata
{'data_type': 'surface',
 'site': 'tac',
 'instrument': 'picarro',
 'sampling_period': '3600.0',
 'inlet': '185m',
 'type': 'air',
 'port': '10',
 'network': 'decc',
 'species': 'co2',
 'calibration_scale': 'WMO-X2019',
 'long_name': 'tacolneston',
 'inlet_height_magl': '185',
 'data_owner': "Simon O'Doherty",
 'data_owner_email': 's.odoherty@bristol.ac.uk',
 'station_longitude': '1.1387',
 'station_latitude': '52.51882',
 'station_long_name': 'Tacolneston Tower, UK',
 'station_height_masl': 64,
 'data_level': 'not_set',
 'data_sublevel': 'not_set',
 'dataset_source': 'not_set',
 'platform': 'not_set',
 'data_source': 'internal',
 'source_format': 'CRDS',
 'uuid': '131f8f2e-e1b4-43ab-85da-037a3a4150e9',
 'tag': 'project1',
 'period': '3600s',
 'latest_version': 'v2',
 'timestamp': '2025-11-15 21:14:23.473931+00:00',
 'start_date': '2013-01-31 00:13:28+00:00',
 'end_date': '2018-01-01 00:53:06+00:00',
 'versions': {'v1': ['2013-01-31-00:13:28+00:00_2018-01-01-00:53:06+00:00'],
  'v2': ['2013-01-31-00:13:28+00:00_2018-01-01-00:53:06+00:00']},
 'object_store': '/home/runner/openghg_store/tutorial_store',
 'Conventions': 'CF-1.8',
 'comment': 'Cavity ring-down measurements. Output from GCWerks',
 'conditions_of_use': 'Ensure that you contact the data owner at the outset of your project.',
 'file_created': '2025-11-15 21:14:23.424807+00:00',
 'processed_by': 'OpenGHG_Cloud',
 'sampling_period_unit': 's',
 'source': 'In situ measurements of air',
 'scale': 'WMO-X2019'}

We can now make a simple plot using the plot_timeseries method of the ObsData object.

NOTE: the plot created below may not show up on the online documentation version of this notebook.

co2_data.plot_timeseries()

You can also pass any of title, xlabel, ylabel and units to the plot_timeseries method to modify the labels.

You can request the mole fraction data in a different unit by specifying the target_units argument when calling get_obs_surface.

For example, to convert the mole fraction from the default unit (usually ppm for CO₂) to ppb:

co2_ppb = get_obs_surface(
    site="tac", species="co2", inlet="185m", target_units={"mf": "ppb"}, tag="project1"
)
co2_ppb.data
<xarray.Dataset> Size: 1MB
Dimensions:                    (time: 39114)
Coordinates:
  * time                       (time) datetime64[ns] 313kB 2013-01-31T00:13:2...
Data variables:
    mf                         (time) float64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
    mf_number_of_observations  (time) int64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
    mf_variability             (time) float64 313kB dask.array<chunksize=(19557,), meta=np.ndarray>
Attributes: (12/28)
    Conventions:           CF-1.8
    comment:               Cavity ring-down measurements. Output from GCWerks
    conditions_of_use:     Ensure that you contact the data owner at the outs...
    data_owner:            Simon O'Doherty
    data_owner_email:      s.odoherty@bristol.ac.uk
    data_source:           internal
    ...                    ...
    station_latitude:      52.51882
    station_long_name:     Tacolneston Tower, UK
    station_longitude:     1.1387
    tag:                   project1
    type:                  air
    scale:                 WMO-X2019

By default, the returned data is dequantified (plain numeric values with a units attribute), so you can confirm the unit conversion using:

co2_ppb.data["mf"].attrs["units"]
'1e-9'

This confirms that the mole fraction (mf) was converted from the default parts per million (ppm) to parts per billion (ppb). The units attribute is preserved in scalar form ('1e-9') for compatibility with the rest of the workflow. We can display the units in other formats:

# quantify, then get pint units
pint_units = co2_ppb.data.mf.pint.quantify().pint.units

# print in cf format
print(f"{pint_units:cf}")

# print in default format
print(f"{pint_units:D}")
parts_per_billion
ppb
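If you only need a human-readable name for the scalar units attribute, without quantifying the data, a small lookup is enough. This is a plain-Python sketch; the mapping below is an illustration, not part of the OpenGHG API:

```python
# Common scalar unit strings and their conventional short names.
# NOTE: this mapping is illustrative, not an OpenGHG API.
SCALAR_UNITS = {
    "1e-6": "ppm",   # parts per million
    "1e-9": "ppb",   # parts per billion
    "1e-12": "ppt",  # parts per trillion
}

def unit_name(units_attr: str) -> str:
    """Translate a scalar units attribute (e.g. '1e-9') to a short name."""
    return SCALAR_UNITS.get(units_attr, units_attr)

print(unit_name("1e-9"))  # prints "ppb"
```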

If you prefer to keep the data quantified (i.e., retaining the Pint unit objects), set the is_dequantified argument to False when calling get_obs_surface.

co2_ppb_quantified = get_obs_surface(
    site="tac", species="co2", inlet="185m", target_units={"mf": "ppb"}, is_dequantified=False, tag="project1"
)

You can then access the Pint units directly:

co2_ppb_quantified.data["mf"].pint.units
ppb

Note

The unit conversion described above can also be applied to get_obs_column, get_flux, get_footprint and get_bc.
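For instance, a flux retrieval might look like the sketch below. The source and domain values are hypothetical placeholders (substitute values present in your object store), and note that the variable to convert is the flux variable rather than "mf":

```python
from openghg.retrieve import get_flux

# Hypothetical source/domain values -- replace with ones in your object store.
co2_flux = get_flux(
    species="co2",
    source="example-source",
    domain="europe",
    target_units={"flux": "kg/m2/s"},
)
```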

4. Cleanup#

If you’re finished with the data in this tutorial you can clean up the tutorial object store using the clear_tutorial_store function.

from openghg.tutorial import clear_tutorial_store
clear_tutorial_store()