Store#

Storage#

These classes are used to store each type of data in the object store. Each has a static load function that loads a version of itself from the object store. The read_file function then reads data files, calls the standardisation function appropriate to the format of each data file, collects metadata, and stores both the data and the metadata in the object store.
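
As an illustration of this flow, the storage classes are usually driven through the openghg.standardise functions rather than instantiated directly. A minimal sketch (the filename and keyword values are illustrative; the required keywords depend on the source format being read):

    from openghg.standardise import standardise_surface

    # Standardise a surface observation file and add it to the object store.
    # The filepath here is hypothetical; "CRDS", "TAC" and "DECC" follow the
    # usual OpenGHG tutorial values.
    results = standardise_surface(
        filepath="tac_picarro_ch4.dat",
        source_format="CRDS",
        site="TAC",
        network="DECC",
    )
    print(results)  # details of the Datasources the data was assigned to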

openghg.store.BoundaryConditions#

The BoundaryConditions class is used to standardise and store boundary conditions data.

class openghg.store.BoundaryConditions(bucket)[source]#

This class is used to process boundary condition data

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for BoundaryConditions. Expected inputs will typically be defined within the openghg.standardise.standardise_bc() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters for this data type.

Return type:

dict

TODO: Decide if we can phase out additional_metadata or if this could be added to params.

read_raw_data(binary_data, metadata, file_metadata, source_format)[source]#

Read boundary conditions data from binary data

Parameters:
  • binary_data (bytes) – Boundary conditions data

  • metadata (dict) – Dictionary of metadata

  • file_metadata (dict) – File metadata

  • source_format (str) – Type of data being input e.g. openghg (internal format)

Returns:

UUIDs of the Datasources the data has been assigned to

Return type:

dict

static schema()[source]#

Define schema for boundary conditions Dataset.

Includes volume mole fractions for each time at each cardinal, vertical boundary at the edge of the defined domain:
  • “vmr_n”, “vmr_s”
    • expected dimensions: (“time”, “height”, “lon”)

  • “vmr_e”, “vmr_w”
    • expected dimensions: (“time”, “height”, “lat”)

Expected data types for all variables and coordinates are also included.

Returns:

Contains schema for BoundaryConditions.

Return type:

DataSchema
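
For illustration, a Dataset matching the schema just described could be constructed as follows (a minimal sketch with arbitrary sizes and zero-filled values; only the variable and dimension names are taken from the schema above):

    import numpy as np
    import xarray as xr

    ntime, nheight, nlat, nlon = 12, 20, 50, 40

    # North/south curtains vary along longitude; east/west along latitude.
    ds = xr.Dataset(
        {
            "vmr_n": (("time", "height", "lon"), np.zeros((ntime, nheight, nlon))),
            "vmr_s": (("time", "height", "lon"), np.zeros((ntime, nheight, nlon))),
            "vmr_e": (("time", "height", "lat"), np.zeros((ntime, nheight, nlat))),
            "vmr_w": (("time", "height", "lat"), np.zeros((ntime, nheight, nlat))),
        }
    )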

transform_data(datapath, database, if_exists='auto', save_current='auto', overwrite=False, compressor=None, filters=None, info_metadata=None, **kwargs)[source]#
Return type:

list[dict]

Read and transform CAMS boundary conditions data. This will find the appropriate parser function to use for the database specified. The necessary inputs are determined by which database is being used. The underlying parser functions will be of the form (see the sketch after this list):

  • openghg.transform.boundary_conditions.parse_{database.lower()}
    • e.g. openghg.transform.boundary_conditions.parse_cams()
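
The lookup implied by this naming convention could be sketched as follows (illustrative only, not the actual internal code):

    import importlib

    def get_bc_parser(database: str):
        # Resolve the parse function for a database using the convention
        # above, e.g. "CAMS" -> parse_cams.
        module = importlib.import_module("openghg.transform.boundary_conditions")
        return getattr(module, f"parse_{database.lower()}")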

openghg.store.Emissions#

The Emissions class is used to process emissions / flux data files.

openghg.store.EulerianModel#

The EulerianModel class is used to process Eulerian model data.

class openghg.store.EulerianModel(bucket)[source]#

This class is used to process Eulerian model data

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for EulerianModel. Expected inputs will typically be defined within the openghg.standardise.standardise_eulerian() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters for this data type.

Return type:

dict

static schema()[source]#

Define schema for Eulerian model Dataset.

At present, this doesn’t check the variables but does check that “lat”, “lon” and “time” are included with appropriate types.

Returns:

Contains dummy schema for EulerianModel.

Return type:

DataSchema

TODO: Decide on data_vars checks as we build up the use of this data_type

openghg.store.Footprints#

The Footprints class is used to standardise and store footprint model output.

class openghg.store.Footprints(bucket)[source]#

This class is used to process footprints model output

chunking_schema(time_resolved=False, high_time_resolution=False, high_spatial_resolution=False, short_lifetime=False, source_format='')[source]#

Get chunking schema for footprint data.

Parameters:
  • time_resolved (bool) – Set footprint variable to be high time resolution.

  • high_time_resolution (bool) – This argument is deprecated and will be replaced in future versions with time_resolved.

  • high_spatial_resolution (bool) – Set footprint variables to include high and low spatial resolution options.

  • short_lifetime (bool) – Include additional particle age parameters for short-lived species.

Returns:

Chunking schema for footprint data.

Return type:

dict

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for Footprints. Expected inputs will typically be defined within the openghg.standardise.standardise_footprint() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters for this data type.

Return type:

dict

TODO: Decide if we can phase out additional_metadata or if this could be added to params.

read_raw_data(binary_data, metadata, file_metadata)[source]#

Read a footprint from binary data

Parameters:
  • binary_data (bytes) – Footprint data

  • metadata (dict) – Dictionary of metadata

  • file_metadata (dict) – File metadata

Returns:

UUIDs of the Datasources the data has been assigned to

Return type:

dict

static schema(particle_locations=True, high_spatial_resolution=False, time_resolved=False, high_time_resolution=False, short_lifetime=False, source_format=None)[source]#

Define schema for footprint Dataset.

The returned schema depends on what the footprint represents, indicated using the keywords. By default this will include the “fp” variable, but this will be superseded if high_spatial_resolution or time_resolved are specified.

Parameters:
  • particle_locations (bool) – Include the 4-directional particle location variables (“particle_location_[nesw]”) and the associated additional dimension (“height”).

  • high_spatial_resolution (bool) – Set footprint variables to include high and low resolution options (“fp_low”, “fp_high”) and the associated additional dimensions (“lat_high”, “lon_high”).

  • time_resolved (bool) – Set the footprint variable to be high time resolution (“fp_HiTRes”) and include the associated dimension (“H_back”).

  • high_time_resolution (bool) – This argument is deprecated and will be replaced in future versions with time_resolved.

  • short_lifetime (bool) – Include additional particle age parameters for short-lived species (“mean_age_particles_[nesw]”).

  • source_format (str | None) – Optional string containing the source format; necessary for time-resolved footprints since the schema differs between the PARIS/FLEXPART and ACRG formats.

Return type:

DataSchema

Returns:

DataSchema object describing this format.

Note: In PARIS format the coordinate dimensions are (“latitude”, “longitude”) rather than (“lat”, “lon”), but given that all other openghg internal formats are (“lat”, “lon”), we are currently keeping all footprint internal formats consistent with this.
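
Because schema() is a static method, the expected layout can be inspected without an object store. A minimal sketch (the source_format value is an assumption; check the accepted strings):

    from openghg.store import Footprints

    # Schema for a time-resolved footprint including particle location
    # variables; source_format is needed for time-resolved footprints.
    schema = Footprints.schema(
        particle_locations=True,
        time_resolved=True,
        source_format="paris",  # assumed value
    )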

openghg.store.ObsColumn#

The ObsColumn class is used to process column / satellite observation data.

class openghg.store.ObsColumn(bucket)[source]#

This class is used to process column / satellite observation data

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for ObsColumn. Expected inputs will typically be defined within the openghg.standardise.standardise_column() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters and any additional parameters for this data type.

Return type:

(dict, dict)

TODO: Decide if we can phase out additional_metadata or if this could be added to params.

static schema(species, vertical_name='lev')[source]#

Define schema for a column Dataset.

Includes column data for each time point:
  • standardised species and column name as “x{species}” (e.g. “xch4”)

  • averaging kernel variable as “x{species_name}_averaging_kernel”

  • profile apriori variable as “{species_name}_profile_apriori”

  • expected “time” dimension

  • expected vertical dimension (defined by input)

Expected data types for all variables and coordinates are also included.

Parameters:
  • species (str) – Species name which will be used to construct appropriate variable names e.g. “ch4” will create “xch4”

  • vertical_name (str) – Name of the vertical dimension for averaging kernel and apriori. Default = “lev”

Returns:

Contains schema for ObsColumn.

Return type:

DataSchema

TODO: Expand valid list of vertical names as needed (e.g. “lev”, “height”) and check vertical_name inputs against valid list of options.
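
As a usage sketch, per the description above this yields entries for “xch4”, “xch4_averaging_kernel” and “ch4_profile_apriori” with “time” and “lev” dimensions:

    from openghg.store import ObsColumn

    # Expected layout for a methane column Dataset.
    schema = ObsColumn.schema(species="ch4", vertical_name="lev")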

openghg.store.ObsSurface#

The ObsSurface class is used to process surface observation data.

class openghg.store.ObsSurface(bucket)[source]#

This class is used to process surface observation data

align_metadata_attributes(data, update_mismatch)[source]#

Check that values within the metadata and attributes are consistent and update them in place. This is a wrapper for the separate openghg.util.align_metadata_attributes() function.

Parameters:
  • data (list[MetadataAndData]) – sequence of MetadataAndData objects

  • update_mismatch (str) –

    This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:

    • “never” - don’t update mismatches and raise an AttrMismatchError

    • “from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)

    • “from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)

Return type:

None

Returns:

None

TODO: At the moment the align_metadata_attributes() function is only applicable to surface data but this should be generalised to all data types.
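
The three update_mismatch modes amount to the following decision per mismatched key (a hypothetical standalone sketch of the documented behaviour, not the actual openghg.util implementation):

    def resolve_mismatch(key, attr_value, meta_value, update_mismatch):
        # Hypothetical helper mirroring the documented options.
        if attr_value == meta_value:
            return attr_value
        if update_mismatch == "never":
            # The real function raises an AttrMismatchError here.
            raise ValueError(f"Mismatch for {key!r}: {attr_value!r} != {meta_value!r}")
        if update_mismatch in ("from_source", "attributes"):
            return attr_value  # trust the input data attributes
        if update_mismatch in ("from_definition", "metadata"):
            return meta_value  # trust the defined/associated metadata
        raise ValueError(f"Unknown update_mismatch: {update_mismatch!r}")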

define_loop_params()[source]#

If filepath is supplied as a list then, depending on the data type, it will be looped over to extract each file. Any additional parameters which need to be looped over alongside it are defined here.

Returns:

Dictionary of the loop parameter names within the inputs to pass to the relevant parse functions.

Return type:

dict

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for ObsSurface. Expected inputs will typically be defined within the openghg.standardise.standardise_surface() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters for this data type.

Return type:

dict

TODO: Decide if we can phase out additional_metadata or if this could be added to params.

read_multisite_aqmesh(filepath, metadata_filepath, network='aqmesh_glasgow', instrument='aqmesh', sampling_period=60, measurement_type='insitu', if_exists='auto', overwrite=False)[source]#

Read AQMesh data for the Glasgow network

NOTE - temporary function until we know what kind of AQMesh data we’ll be retrieving in the future.

This data is different in that it contains multiple sites in the same file.

Return type:

defaultdict

read_raw_data(binary_data, metadata, file_metadata, precision_data=None, site_filepath=None)[source]#

Reads binary data passed in by serverless function. The data dictionary should contain sub-dictionaries that contain data and metadata keys.

This is clunky and the ObsSurface.read_file function could be tidied up quite a lot to be more flexible.

Parameters:
  • binary_data (bytes) – Binary measurement data

  • metadata (dict) – Metadata

  • file_metadata (dict) – File metadata such as original filename

  • precision_data (bytes | None) – GCWERKS precision data

  • site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.

Returns:

Dictionary of result

Return type:

dict

static schema(species)[source]#

Define schema for surface observations Dataset.

Only includes mandatory variables:
  • standardised species name (e.g. “ch4”)

  • expected dimensions: (“time”)

Expected data types for variables and coordinates are also included.

Returns:

Contains basic schema for ObsSurface.

Return type:

DataSchema

TODO: Decide how best to incorporate optional variables, e.g. “ch4_variability”, “ch4_number_of_observations”
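
As a usage sketch (schema() is static; the variable and dimension names follow the description above):

    from openghg.store import ObsSurface

    # Basic schema for methane surface observations: a "ch4" variable
    # with a "time" dimension.
    schema = ObsSurface.schema(species="ch4")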

store_data(data, if_exists='auto', overwrite=False, force=False, required_metakeys=None, compressor=None, filters=None)[source]#

This expects already-standardised data such as ICOS / CEDA data

Parameters:
  • data (MutableSequence[MetadataAndData]) – Data in standard format; see the data spec under Development -> Data specifications in the documentation.

  • if_exists (str) –

    What to do if existing data is present:

    • “auto” - checks new and current data for timeseries overlap

      • adds data if no overlap

      • raises DataOverlapError if there is an overlap

    • “new” - creates new version with just new data

    • “combine” - replace and insert new data into current timeseries

  • overwrite (bool) – Deprecated. This will use the options for if_exists="new".

  • force (bool) – Force adding of data even if this is identical to data stored (checked based on previously retrieved file hashes).

  • required_metakeys (Sequence | None) –

    Keys in the metadata we should use to store this metadata in the object store. If None, it defaults to:

    {“species”, “site”, “station_long_name”, “inlet”, “instrument”, “network”, “source_format”, “data_source”, “icos_data_level”}

  • compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.

  • filters (Any | None) – Filters to apply to the data on storage; this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.

Return type:

list[dict] | None

Returns:

list of dicts containing details of stored data, or None
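
For example, the documented default compressor can be recreated (or tuned) with numcodecs and passed explicitly; obs and standardised_data below are assumed to exist in the calling code:

    from numcodecs import Blosc

    # Matches the documented default; raise clevel for smaller, slower writes.
    compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

    # results = obs.store_data(data=standardised_data, compressor=compressor)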

openghg.store.FluxTimeseries#

The FluxTimeseries class is used to process UK inventory data.

class openghg.store.FluxTimeseries(bucket)[source]#

This class is used to process one-dimensional timeseries data

_data_type = 'flux_timeseries'#

_root = "FluxTimeseries"

_uuid = "099b597b-0598-4efa-87dd-472dfe027f5d8"

_metakey = f"{_root}/uuid/{_uuid}/metastore"

format_inputs(**kwargs)[source]#

Apply appropriate formatting for expected inputs for FluxTimeseries. Expected inputs will typically be defined within the openghg.standardise.standardise_flux_timeseries() function.

Parameters:

kwargs (Any) – Set of keyword arguments. Selected keywords will be appropriately formatted.

Returns:

Formatted parameters for this data type.

Return type:

dict

read_raw_data(binary_data, metadata, file_metadata)[source]#

Read flux timeseries data from binary data

Parameters:
  • binary_data (bytes) – Flux timeseries data

  • metadata (dict) – Dictionary of metadata

  • file_metadata (dict) – File metadata

Returns:

UUIDs of the Datasources the data has been assigned to

Return type:

dict

static schema()[source]#

Define schema for a one-dimensional timeseries (FluxTimeseries) Dataset.

Includes an observation for each time within the defined domain:
  • “Obs”
    • expected dimensions: (“time”)

Expected data types for all variables and coordinates are also included.

Returns:

Contains schema for FluxTimeseries.

Return type:

DataSchema

Recombination functions#

These handle the recombination of data retrieved from the object store.

Segmentation functions#

These handle the segmentation of data ready for storage in the object store.

Metadata Handling#

The data_manager function is used in the same way as the search functions. It takes any number of keyword arguments for searching the metadata, plus a data_type argument. It returns a DataManager object.
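
For example (a minimal sketch; the keyword values are illustrative):

    from openghg.store import data_manager

    # Search the metastore for surface CH4 records from a given site and
    # return a DataManager for inspecting or modifying their metadata.
    dm = data_manager(data_type="surface", site="tac", species="ch4")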

Data types#

These helper functions provide a useful way of retrieving the data types OpenGHG can process and their associated storage classes.