Store#
Storage#
These classes are used to store each type of data in the object store. Each has a static load function that loads a stored version of itself from the object store. Each class's read_file function then reads data files, calls the appropriate standardisation functions for the format of the data file, collects metadata and stores both the data and the metadata in the object store.
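As a hedged sketch of this pattern (the filename and argument values below are illustrative only):

```python
from openghg.store import ObsSurface

# Load the stored ObsSurface object from the object store
obs = ObsSurface.load()

# read_file standardises the file, collects metadata and stores both,
# returning the UUIDs of the Datasources the data was assigned to
uuids = ObsSurface.read_file(
    filepath="tac.picarro.1minute.100m.dat",  # hypothetical raw data file
    source_format="CRDS",
    network="DECC",
    site="TAC",
)
```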
openghg.store.BoundaryConditions#
The BoundaryConditions class is used to standardise and store boundary conditions data.
- class openghg.store.BoundaryConditions[source]#
This class is used to process boundary condition data
- static read_data(binary_data, metadata, file_metadata)[source]#
Read boundary conditions data from binary data
- Parameters:
binary_data (bytes) – Boundary conditions data
metadata (Dict) – Dictionary of metadata
file_metadata – File metadata
- Returns:
UUIDs of the Datasources the data has been assigned to
- Return type:
dict
- static read_file(filepath, species, bc_input, domain, period=None, continuous=True, overwrite=False)[source]#
Read boundary conditions file
- Parameters:
filepath (Union[str, Path]) – Path of boundary conditions file
species (str) – Species name
bc_input (str) – Input used to create boundary conditions. For example: a model name such as “MOZART” or “CAMS”, or a description such as “UniformAGAGE” (uniform values based on AGAGE average)
domain (str) – Region for boundary conditions
period (Union[str, tuple, None]) – Period of measurements. Only needed if this can not be inferred from the time coords. If specified, should be one of:
“yearly”, “monthly”
suitable pandas Offset Alias
tuple of (value, unit) as would be passed to pandas.Timedelta function
continuous (bool) – Whether time stamps have to be continuous.
overwrite (bool) – Should this data overwrite currently stored data.
- Returns:
Dictionary of Datasource UUIDs the data has been assigned to
- Return type:
dict
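A minimal usage sketch; the filename and inputs are illustrative, not taken from the OpenGHG documentation:

```python
from openghg.store import BoundaryConditions

# Hypothetical CAMS-derived methane boundary conditions for a European domain
uuids = BoundaryConditions.read_file(
    filepath="ch4_EUROPE_201801.nc",
    species="ch4",
    bc_input="CAMS",
    domain="EUROPE",
    period="monthly",
)
print(uuids)  # dictionary of Datasource UUIDs
```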
- static schema()[source]#
Define schema for boundary conditions Dataset.
- Includes volume mole fractions for each time and ordinal, vertical boundary at the edge of the defined domain:
- “vmr_n”, “vmr_s” – expected dimensions: (“time”, “height”, “lon”)
- “vmr_e”, “vmr_w” – expected dimensions: (“time”, “height”, “lat”)
Expected data types for all variables and coordinates also included.
- Returns:
Contains schema for BoundaryConditions.
- Return type:
DataSchema
- static validate_data(data)[source]#
Validate input data against BoundaryConditions schema - definition from BoundaryConditions.schema() method.
- Parameters:
data (Dataset) – xarray Dataset in expected format
- Return type:
None
- Returns:
None
Raises a ValueError with details if the input data does not adhere to the BoundaryConditions schema.
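A sketch of validating a toy Dataset built with the dimensions listed above; a real boundary conditions file carries full domain coordinates, and validation may still raise if data types differ from the schema:

```python
import numpy as np
import xarray as xr
from openghg.store import BoundaryConditions

coords = {
    "time": np.array(["2018-01-01"], dtype="datetime64[ns]"),
    "height": np.array([500.0, 1500.0]),
    "lat": np.array([50.0, 51.0, 52.0]),
    "lon": np.array([0.0, 1.0, 2.0]),
}
ds = xr.Dataset(
    {
        "vmr_n": (("time", "height", "lon"), np.zeros((1, 2, 3))),
        "vmr_s": (("time", "height", "lon"), np.zeros((1, 2, 3))),
        "vmr_e": (("time", "height", "lat"), np.zeros((1, 2, 3))),
        "vmr_w": (("time", "height", "lat"), np.zeros((1, 2, 3))),
    },
    coords=coords,
)

# Raises a ValueError with details if the Dataset does not match the schema
BoundaryConditions.validate_data(ds)
```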
openghg.store.Emissions#
The Emissions class is used to process emissions / flux data files.
- class openghg.store.Emissions[source]#
This class is used to process emissions / flux data
- static read_data(binary_data, metadata, file_metadata)[source]#
Read emissions data from binary data
- Parameters:
binary_data (bytes) – Emissions data
metadata (Dict) – Dictionary of metadata
file_metadata – File metadata
- Returns:
UUIDs of the Datasources the data has been assigned to
- Return type:
dict
- static read_file(filepath, species, source, domain, database=None, database_version=None, model=None, source_format='openghg', high_time_resolution=False, period=None, chunks=None, continuous=True, overwrite=False)[source]#
Read emissions file
- Parameters:
filepath (Union[str, Path]) – Path of emissions file
species (str) – Species name
domain (str) – Emissions domain
source (str) – Emissions source
database (Optional[str]) – Name of database source for this input (if relevant)
database_version (Optional[str]) – Name of database version (if relevant)
model (Optional[str]) – Model name (if relevant)
source_format (str) – Type of data being input e.g. openghg (internal format)
high_time_resolution (Optional[bool]) – If this is a high resolution file
period (Union[str, tuple, None]) – Period of measurements. Only needed if this can not be inferred from the time coords. If specified, should be one of:
“yearly”, “monthly”
suitable pandas Offset Alias
tuple of (value, unit) as would be passed to pandas.Timedelta function
continuous (bool) – Whether time stamps have to be continuous.
overwrite (bool) – Should this data overwrite currently stored data.
- Returns:
Dictionary of Datasource UUIDs the data has been assigned to
- Return type:
dict
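A hedged sketch of ingesting a flux file (filename and values are illustrative):

```python
from openghg.store import Emissions

# Hypothetical anthropogenic methane flux file on the EUROPE domain
uuids = Emissions.read_file(
    filepath="ch4-anthro_EUROPE_2012.nc",
    species="ch4",
    source="anthro",
    domain="EUROPE",
    period="yearly",
)
```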
- static schema()[source]#
Define schema for emissions Dataset.
- Includes flux/emissions for each time and position:
- “flux” – expected dimensions: (“time”, “lat”, “lon”)
Expected data types for all variables and coordinates also included.
- Returns:
Contains schema for Emissions.
- Return type:
DataSchema
- static transform_data(datapath, database, overwrite=False, **kwargs)[source]#
Read and transform an emissions database. This will find the appropriate parser function to use for the database specified. The necessary inputs are determined by which database is being used.
- The underlying parser functions will be of the form:
- openghg.transform.emissions.parse_{database.lower()}
e.g. openghg.transform.emissions.parse_edgar()
- Parameters:
datapath (Union[str, Path]) – Path to local copy of database archive (for now)
database (str) – Name of database
overwrite (bool) – Should this data overwrite currently stored data which matches.
**kwargs (Dict) – Inputs for underlying parser function for the database. Necessary inputs will depend on the database being parsed.
- Return type:
Dict
TODO: Could allow a Callable[…, Dataset] type for a pre-defined function to be passed
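For example, a sketch for the EDGAR database; the archive path and the parser-specific keyword argument are assumptions, so check the parse_edgar signature for the real inputs:

```python
from openghg.store import Emissions

# Dispatches to openghg.transform.emissions.parse_edgar under the hood
uuids = Emissions.transform_data(
    datapath="/path/to/EDGAR_archive",  # hypothetical local archive path
    database="EDGAR",
    species="ch4",  # assumed parser-specific keyword argument
)
```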
- static validate_data(data)[source]#
Validate input data against Emissions schema - definition from Emissions.schema() method.
- Parameters:
data (Dataset) – xarray Dataset in expected format
- Return type:
None
- Returns:
None
Raises a ValueError with details if the input data does not adhere to the Emissions schema.
openghg.store.EulerianModel#
The EulerianModel class is used to process Eulerian model data.
- class openghg.store.EulerianModel[source]#
This class is used to process Eulerian model data
- static read_file(filepath, model, species, start_date=None, end_date=None, setup=None, overwrite=False)[source]#
Read Eulerian model output
- Parameters:
filepath (Union[str, Path]) – Path of Eulerian model species output
model (str) – Eulerian model name
species (str) – Species name
start_date (Optional[str]) – Start date (inclusive) associated with model run
end_date (Optional[str]) – End date (exclusive) associated with model run
setup (Optional[str]) – Additional setup details for run
overwrite (bool) – Should this data overwrite currently stored data.
- Return type:
Dict
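A minimal sketch (model output filename and dates are illustrative):

```python
from openghg.store import EulerianModel

# Hypothetical global model output for methane
results = EulerianModel.read_file(
    filepath="geoschem_ch4_monthly.nc",
    model="GEOSChem",
    species="ch4",
    start_date="2015-01-01",
    end_date="2016-01-01",
)
```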
openghg.store.Footprints#
The Footprints class is used to process footprints model output data.
- class openghg.store.Footprints[source]#
This class is used to process footprints model output
- static read_data(binary_data, metadata, file_metadata)[source]#
Read a footprint from binary data
- Parameters:
binary_data (bytes) – Footprint data
metadata (Dict) – Dictionary of metadata
file_metadata – File metadata
- Returns:
UUIDs of the Datasources the data has been assigned to
- Return type:
dict
- static read_file(filepath, site, domain, model, inlet=None, height=None, metmodel=None, species=None, network=None, period=None, chunks=None, continuous=True, retrieve_met=False, high_spatial_res=False, high_time_res=False, short_lifetime=False, overwrite=False)[source]#
Reads footprints data files and returns the UUIDs of the Datasources the processed data has been assigned to
- Parameters:
filepath (Union[str, Path]) – Path of file to load
site (str) – Site name
domain (str) – Domain of footprints
model (str) – Model used to create footprint (e.g. NAME or FLEXPART)
inlet (Optional[str]) – Height above ground level in metres. Format ‘NUMUNIT’ e.g. “10m”
height (Optional[str]) – Alias for inlet. One of height or inlet MUST be included.
metmodel (Optional[str]) – Underlying meteorological model used (e.g. UKV)
species (Optional[str]) – Species name. Only needed if footprint is for a specific species e.g. co2 (and not inert)
network (Optional[str]) – Network name
period (Union[str, tuple, None]) – Period of measurements. Only needed if this can not be inferred from the time coords
continuous (bool) – Whether time stamps have to be continuous.
retrieve_met (bool) – Whether to also download meteorological data for this footprints area
high_spatial_res (bool) – Indicate footprints include both a low and high spatial resolution.
high_time_res (bool) – Indicate footprints are high time resolution (include H_back dimension). Note this will be set to True automatically if species=“co2” (Carbon Dioxide).
short_lifetime (bool) – Indicate footprint is for a short-lived species. Needs species input. Note this will be set to True if species has an associated lifetime.
overwrite (bool) – Overwrite any currently stored data
- Returns:
UUIDs of the Datasources the data has been assigned to
- Return type:
dict
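A hedged sketch of ingesting a footprint file (site, filename and inlet are illustrative):

```python
from openghg.store import Footprints

# Hypothetical NAME footprint for a 100 m inlet on the EUROPE domain
uuids = Footprints.read_file(
    filepath="TAC-100magl_EUROPE_201607.nc",
    site="TAC",
    domain="EUROPE",
    model="NAME",
    inlet="100m",
    metmodel="UKV",
)
```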
- static schema(particle_locations=True, high_spatial_res=False, high_time_res=False, short_lifetime=False)[source]#
Define schema for footprint Dataset.
The returned schema depends on what the footprint represents, indicated using the keywords. By default this will include the “fp” variable, but this will be superseded if high_spatial_res or high_time_res are specified.
- Parameters:
particle_locations (bool) – Include 4-directional particle location variables (“particle_location_[nesw]”) and associated additional dimensions (“height”)
high_spatial_res (bool) – Set footprint variables to include high and low resolution options (“fp_low”, “fp_high”) and associated additional dimensions (“lat_high”, “lon_high”).
high_time_res (bool) – Set footprint variable to be high time resolution (“fp_HiTRes”) and include associated dimensions (“H_back”).
short_lifetime (bool) – Include additional particle age parameters for short lived species: “mean_age_particles_[nesw]”
- Return type:
DataSchema
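For example, a sketch requesting the schema for a high time resolution footprint:

```python
from openghg.store import Footprints

# Schema including particle locations and the fp_HiTRes variable
schema = Footprints.schema(particle_locations=True, high_time_res=True)
```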
- static validate_data(data, particle_locations=True, high_spatial_res=False, high_time_res=False, short_lifetime=False)[source]#
Validate data against Footprint schema - definition from Footprints.schema(…) method.
- Parameters:
data (Dataset) – xarray Dataset in expected format. See Footprints.schema() method for details on optional inputs.
- Return type:
None
- Returns:
None
Raises a ValueError with details if the input data does not adhere to the Footprints schema.
openghg.store.ObsColumn#
The ObsColumn class is used to process column / satellite observation data.
- class openghg.store.ObsColumn[source]#
This class is used to process column / satellite observation data
- static read_file(filepath, satellite=None, domain=None, selection=None, site=None, species=None, network=None, instrument=None, platform='satellite', source_format='openghg', overwrite=False)[source]#
Read column observation file
- Parameters:
filepath (Union[str, Path]) – Path of observation file
satellite (Optional[str]) – Name of satellite (if relevant)
domain (Optional[str]) – For satellite only. If data has been selected on an area include the identifier name for domain covered. This can map to previously defined domains (see openghg_defs “domain_info.json” file) or a newly defined domain.
selection (Optional[str]) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”. If not specified, domain will be used.
site (Optional[str]) – Site code/name (if relevant). Can include satellite OR site.
species (Optional[str]) – Species name or synonym e.g. “ch4”
instrument (Optional[str]) – Instrument name e.g. “TANSO-FTS”
network (Optional[str]) – Name of in-situ or satellite network e.g. “TCCON”, “GOSAT”
platform (str) – Type of platform. Should be one of: “satellite”, “site”
source_format (str) – Type of data being input e.g. openghg (internal format)
overwrite (bool) – Should this data overwrite currently stored data.
- Returns:
Dictionary of Datasource UUIDs the data has been assigned to
- Return type:
dict
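A hedged sketch of ingesting a satellite column file (filename and selection are illustrative):

```python
from openghg.store import ObsColumn

# Hypothetical GOSAT methane column file with a land-only selection
uuids = ObsColumn.read_file(
    filepath="gosat_ch4_20170101.nc",
    satellite="GOSAT",
    domain="INDIA",
    selection="land",
    species="ch4",
    network="GOSAT",
    platform="satellite",
)
```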
openghg.store.ObsSurface#
The ObsSurface class is used to process surface observation data.
- class openghg.store.ObsSurface[source]#
This class is used to process surface observation data
- delete(uuid)[source]#
Delete a Datasource with the given UUID
This function deletes both the record of the Datasource in the object store and the data itself.
- Parameters:
uuid (str) – UUID of Datasource
- Return type:
None
- Returns:
None
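A sketch, assuming the object has been loaded as described above; the UUID is a placeholder:

```python
from openghg.store import ObsSurface

obs = ObsSurface.load()
obs.delete(uuid="123e4567-e89b-12d3-a456-426614174000")  # hypothetical UUID
```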
- static read_data(binary_data, metadata, file_metadata, precision_data=None, site_filepath=None)[source]#
Reads binary data passed in by serverless function. The data dictionary should contain sub-dictionaries that contain data and metadata keys.
This is clunky and the ObsSurface.read_file function could be tidied up quite a lot to be more flexible.
- Parameters:
binary_data (bytes) – Binary measurement data
metadata (Dict) – Metadata
file_metadata (Dict) – File metadata such as original filename
precision_data (Optional[bytes]) – GCWERKS precision data
site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/supplementary_data repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.
- Returns:
Dictionary of result
- Return type:
dict
- static read_file(filepath, source_format, network, site, inlet=None, height=None, instrument=None, sampling_period=None, calibration_scale=None, measurement_type='insitu', overwrite=False, verify_site_code=True, site_filepath=None)[source]#
- Process files and store in the object store. This function utilises the process functions of the other classes in this submodule to handle each data type.
- Parameters:
filepath (Union[str, Path, Tuple, List]) – Filepath(s)
source_format (str) – Data format, for example CRDS, GCWERKS
site (str) – Site code/name
network (str) – Network name
inlet (Optional[str]) – Inlet height. Format ‘NUMUNIT’ e.g. “10m”. If reading multiple files pass None and OpenGHG will attempt to read the inlets from the data.
height (Optional[str]) – Alias for inlet.
instrument (Optional[str]) – Instrument name
sampling_period (Union[Timedelta, str, None]) – Sampling period in pandas style (e.g. 2H for 2 hour period, 2m for 2 minute period).
measurement_type (str) – Type of measurement e.g. insitu, flask
overwrite (bool) – Overwrite previously uploaded data
verify_site_code (bool) – Verify the site code
site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/supplementary_data repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.
- Returns:
Dictionary of Datasource UUIDs
- Return type:
dict
TODO: Should “measurement_type” be changed to “platform” to align with ModelScenario and ObsColumn?
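A sketch ingesting two hypothetical CRDS files for one site; with inlet=None OpenGHG will attempt to read the inlets from the data:

```python
from openghg.store import ObsSurface

results = ObsSurface.read_file(
    filepath=["bsd.picarro.1minute.108m.dat", "bsd.picarro.1minute.248m.dat"],
    source_format="CRDS",
    network="DECC",
    site="BSD",
    inlet=None,  # inlets read from the files themselves
)
```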
- static read_multisite_aqmesh(data_filepath, metadata_filepath, network='aqmesh_glasgow', instrument='aqmesh', sampling_period=60, measurement_type='insitu', overwrite=False)[source]#
Read AQMesh data for the Glasgow network
NOTE - temporary function until we know what kind of AQMesh data we’ll be retrieving in the future.
This data is different in that it contains multiple sites in the same file.
- Return type:
DefaultDict
- static schema(species)[source]#
Define schema for surface observations Dataset.
- Only includes mandatory variables:
- standardised species name (e.g. “ch4”) – expected dimensions: (“time”)
Expected data types for variables and coordinates also included.
- Returns:
Contains basic schema for ObsSurface.
- Return type:
DataSchema
TODO: Decide how best to incorporate optional variables, e.g. “ch4_variability”, “ch4_number_of_observations”
- static store_data(data, overwrite=False, required_metakeys=None)[source]#
This expects already standardised data, such as that retrieved from ICOS / CEDA
- Parameters:
data (Dict) – Dictionary of data in standard format, see the data spec under Development -> Data specifications in the documentation
overwrite (bool) – If True overwrite currently stored data
required_metakeys (Optional[Sequence]) – Keys in the metadata we should use to store this metadata in the object store. If None it defaults to {“species”, “site”, “station_long_name”, “inlet”, “instrument”, “network”, “source_format”, “data_source”, “icos_data_level”}
- Return type:
Dict or None
- store_hashes(hashes)[source]#
Store hashes of data retrieved from a remote data source such as ICOS or CEDA. This takes the full dictionary of hashes, removes the ones we’ve seen before and adds the new ones.
- Parameters:
hashes (Dict) – Dictionary of hashes provided by the hash_retrieved_data function
- Return type:
None
- Returns:
None
- static validate_data(data, species)[source]#
Validate input data against ObsSurface schema - definition from ObsSurface.schema() method.
- Parameters:
data (Dataset) – xarray Dataset in expected format
species (str) – Species name
- Return type:
None
- Returns:
None
Raises a ValueError with details if the input data does not adhere to the ObsSurface schema.
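A sketch validating a toy methane time series; a real Dataset would carry full attributes, and validation may still raise if data types differ from the schema:

```python
import numpy as np
import xarray as xr
from openghg.store import ObsSurface

ds = xr.Dataset(
    {"ch4": (("time",), np.array([1900.1, 1900.5]))},
    coords={"time": np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")},
)

# Raises a ValueError if the data does not match the ch4 schema
ObsSurface.validate_data(ds, species="ch4")
```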
Recombination functions#
These handle the recombination of data retrieved from the object store.
- openghg.store.recombine_datasets(keys, sort=False, attrs_to_check=None, elevate_inlet=False)[source]#
Combines datasets stored separately in the object store into a single dataset
- Parameters:
keys (List[str]) – List of object store keys
sort (Optional[bool]) – Sort the resulting Dataset by the time dimension. Default = True
attrs_to_check (Optional[Dict[str, str]]) – Attributes to check for duplicates. If duplicates are present a new data variable will be created containing the values from each dataset. If a dictionary is passed, the attribute(s) will be retained and the new value assigned. If a list/string is passed, the attribute(s) will be removed.
elevate_inlet (bool) – Force the elevation of the inlet attribute
- Returns:
Combined Dataset
- Return type:
xarray.Dataset
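A hedged sketch; the object store keys below are fabricated placeholders and would normally come from a search:

```python
from openghg.store import recombine_datasets

keys = [
    "data/uuid/example-uuid/v1/2014-01-01_2014-12-31",  # hypothetical keys
    "data/uuid/example-uuid/v1/2015-01-01_2015-12-31",
]
combined = recombine_datasets(keys=keys, sort=True)
```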
- openghg.store.recombine_multisite(keys, sort=True)[source]#
Recombine the keys from multiple sites into a single Dataset for each site
- Parameters:
keys – A dictionary of lists of keys, keyed by site
sort (Optional[bool]) – Sort the resulting Dataset by the time dimension
- Returns:
Dictionary of xarray.Datasets
- Return type:
dict
Segmentation functions#
These handle the segmentation of data ready for storage in the object store.
- openghg.store.assign_data(data_dict, lookup_results, overwrite, data_type)[source]#
Assign data to a Datasource. This will either create a new Datasource or get an existing Datasource for each gas in the file.
- Parameters:
data_dict – Dictionary containing data and metadata for species
lookup_results – Dictionary of lookup results
overwrite – If True overwrite currently stored data
- Return type:
Dict[str, Dict]
- Returns:
Dictionary of UUIDs of the Datasources the data has been assigned to, keyed by species name
Metadata Handling#
The data_handler_lookup function is used in the same way as the search functions. It takes any number of keyword arguments for searching of metadata and a data_type argument. It returns a DataHandler object.
- openghg.store.data_handler_lookup(data_type, **kwargs)[source]#
Lookup the data / metadata you’d like to modify.
- Parameters:
data_type (str) – Type of data, for example surface, flux, footprint
kwargs (Dict) – Any pair of keyword arguments for searching
- Returns:
A handler object to help modify the metadata
- Return type:
DataHandler
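A minimal sketch (the site and species values are illustrative):

```python
from openghg.store import data_handler_lookup

# Look up surface observation metadata matching these keywords
handler = data_handler_lookup(data_type="surface", site="tac", species="ch4")
```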
Data types#
These helper functions provide a useful way of retrieving the data types OpenGHG can process and their associated storage classes.