Standardise

Functions that accept data in specific formats, standardise it to a CF-compliant format and ensure it has the correct metadata attached. The data returned from these functions is then stored in the object store.

Measurement Standardisation

These functions cover the four types of measurement we currently support.

Surface measurements

openghg.standardise.standardise_surface(source_format, network, site, filepath, inlet=None, height=None, instrument=None, data_level=None, data_sublevel=None, dataset_source=None, sampling_period=None, calibration_scale=None, platform=None, measurement_type=None, verify_site_code=True, site_filepath=None, tag=None, store=None, update_mismatch='never', if_exists='auto', save_current='auto', overwrite=False, force=False, compression=True, compressor=None, filters=None, chunks=None, info_metadata=None, sort_files=False)

Standardise surface measurements and store the data in the object store.

Parameters:

filepath (Union[str, Path, tuple, list]) – Filepath(s)
source_format (str) – Data format, for example CRDS, GCWERKS
site (str) – Site code/name
network (str) – Network name
inlet (str | None) – Inlet height. Format ‘NUMUNIT’ e.g. “10m”. If retrieving multiple files, pass None and OpenGHG will attempt to extract this from the file.
height (str | None) – Alias for inlet.
instrument (str | None) – Instrument name
data_level (str | int | float | None) –
The level of quality control which has been applied to the data. This should follow the convention of:
- “0”: raw sensor output
- “1”: automated quality assurance (QA) performed
- “2”: final data set
- “3”: elaborated data products using the data
data_sublevel (str | float | None) – Typically used for “L1” data depending on different QA performed before data is finalised.
dataset_source (str | None) – Dataset source name, for example “ICOS”, “InGOS”, “European ObsPack”, “CEDA 2023.06”.
sampling_period (Timedelta | str | None) – Sampling period as pandas time code, e.g. 1m for 1 minute, 1h for 1 hour
calibration_scale (str | None) – Calibration scale for data
platform (str | None) – Type of measurement platform e.g. “surface-insitu”, “surface-flask”
measurement_type (str | None) – Type of measurement. For some source_formats this value is added to the attributes. Platform should be used in preference. If platform is specified and measurement_type is not, this will be set to match the platform.
verify_site_code (bool) – Verify the site code
site_filepath (str | Path | None) – Alternative site info file (see openghg/openghg_defs repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.
tag (str | list | None) – Special tagged values to add to the Datasource. This will be added to any current values if the tag key already exists in a list.
store (str | None) – Name of object store to write to, required if user has access to more than one writable store
update_mismatch (str) –
This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:
- “never” - don’t update mismatches and raise an AttrMismatchError
- “from_source” / “attributes” - update mismatches based on input attributes
- “from_definition” / “metadata” - update mismatches based on input metadata
if_exists (str) –
What to do if existing data is present.
- “auto” - checks new and current data for timeseries overlap
  - adds data if no overlap
  - raises DataOverlapError if there is an overlap
- “new” - just include new data and ignore previous
- “combine” - replace and insert new data into current timeseries
save_current (str) –
Whether to save data in current form and create a new version.
- “auto” - this will depend on if_exists input (“auto” -> False), (other -> True)
- “y” / “yes” - Save current data exactly as it exists as a separate (previous) version
- “n” / “no” - Allow current data to be updated / deleted
overwrite (bool) – Deprecated. This will use options for if_exists=“new”.
force (bool) – Force adding of data even if this is identical to data stored.
compression (bool) – Enable compression in the store
compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname=”zstd”, clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.
filters (Any | None) – Filters to apply to the data on storage, this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.
chunks (dict | None) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass an empty dictionary.
info_metadata (dict | None) – Additional tags to describe the data, e.g. {“comment”: “Quality checks have been applied”}
sort_files (bool) – Sorts multiple files date-wise.

Returns:

Dictionary of result data

Return type:

dict
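
A minimal sketch of how a call might be assembled, assuming OpenGHG is installed and an object store is configured. The file path, site code and network are placeholders rather than values taken from this page, and check_inlet and surface_kwargs are hypothetical helpers (not part of OpenGHG) that only check the 'NUMUNIT' inlet format described above before collecting the keyword arguments.

```python
import re
from pathlib import Path

# Hypothetical helper pattern, not part of OpenGHG: a number followed by a unit, e.g. "10m".
_INLET_PATTERN = re.compile(r"^\d+(\.\d+)?[a-zA-Z]+$")


def check_inlet(inlet):
    """Return True if inlet is None or looks like 'NUMUNIT' (e.g. "10m")."""
    return inlet is None or bool(_INLET_PATTERN.match(inlet))


def surface_kwargs(filepath, site, network, source_format, inlet=None):
    """Collect keyword arguments for standardise_surface after a basic inlet check."""
    if not check_inlet(inlet):
        raise ValueError(f"Inlet should be in 'NUMUNIT' format, e.g. '10m'; got {inlet!r}")
    return {
        "filepath": Path(filepath),
        "source_format": source_format,
        "site": site,
        "network": network,
        "inlet": inlet,
    }


def run_example():
    """Call OpenGHG. Needs the package, a configured object store and a real data file."""
    from openghg.standardise import standardise_surface

    kwargs = surface_kwargs(
        filepath="/path/to/site_data.dat",  # placeholder path
        site="TAC",  # placeholder site code
        network="DECC",  # placeholder network name
        source_format="CRDS",
        inlet="10m",
    )
    return standardise_surface(**kwargs)
```

run_example is deliberately not called at import time, since it writes to the object store.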

Boundary Conditions

openghg.standardise.standardise_bc(filepath, species, bc_input, domain, source_format='openghg', period=None, continuous=True, tag=None, store=None, if_exists='auto', save_current='auto', overwrite=False, force=False, compression=True, compressor=None, filters=None, chunks=None, info_metadata=None)

Standardise boundary condition data and store it in the object store.

Parameters:

filepath (str | Path) – Path of boundary conditions file
species (str) – Species name
bc_input (str) –
Input used to create boundary conditions. For example:
- a model name such as “MOZART” or “CAMS”
- a description such as “UniformAGAGE” (uniform values based on AGAGE average)
domain (str) – Region for boundary conditions
source_format (str) – Type of data being input e.g. openghg (internal format).
period (str | tuple | None) – Period of measurements, if not passed this is inferred from the time coords
continuous (bool) – Whether time stamps have to be continuous.
tag (str | list | None) – Special tagged values to add to the Datasource. This will be added to any current values if the tag key already exists in a list.
store (str | None) – Name of store to write to
if_exists (str) –
What to do if existing data is present.
- “auto” - checks new and current data for timeseries overlap
  - adds data if no overlap
  - raises DataOverlapError if there is an overlap
- “new” - just include new data and ignore previous
- “combine” - replace and insert new data into current timeseries
save_current (str) –
Whether to save data in current form and create a new version.
- “auto” - this will depend on if_exists input (“auto” -> False), (other -> True)
- “y” / “yes” - Save current data exactly as it exists as a separate (previous) version
- “n” / “no” - Allow current data to be updated / deleted
overwrite (bool) – Deprecated. This will use options for if_exists=“new”.
force (bool) – Force adding of data even if this is identical to data stored.
compression (bool) – Enable compression in the store
compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname=”zstd”, clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.
filters (Any | None) – Filters to apply to the data on storage, this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.
chunks (dict | None) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass an empty dictionary.
info_metadata (dict | None) – Additional tags to describe the data, e.g. {“comment”: “Quality checks have been applied”}

Returns:

Dictionary containing confirmation of standardisation process.

Return type:

dict
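
The if_exists options described above can be pictured with a toy model. This is only a sketch: timeseries are plain dictionaries mapping timestamp to value, overlap is modelled as shared timestamps (an assumption), and DataOverlapError here is a local stand-in class rather than the OpenGHG exception.

```python
class DataOverlapError(Exception):
    """Local stand-in for the error named in the if_exists description."""


def merge_timeseries(current, new, if_exists="auto"):
    """Toy model of if_exists; `current` and `new` map timestamp -> value."""
    if if_exists == "auto":
        # Add the new data only when no timestamp is shared with the current data.
        overlap = sorted(set(current) & set(new))
        if overlap:
            raise DataOverlapError(f"Overlapping timestamps: {overlap}")
        return {**current, **new}
    if if_exists == "new":
        # Keep only the new data and ignore what was there before.
        return dict(new)
    if if_exists == "combine":
        # New values replace current ones where timestamps match, and are inserted otherwise.
        return {**current, **new}
    raise ValueError(f"Unknown if_exists option: {if_exists!r}")


current = {"2020-01-01": 1.0, "2020-01-02": 2.0}
update = {"2020-01-02": 9.0, "2020-01-03": 3.0}

combined = merge_timeseries(current, update, if_exists="combine")
replaced = merge_timeseries(current, update, if_exists="new")
```

With these inputs, the default "auto" option would raise the overlap error because both dictionaries contain "2020-01-02".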

Emissions / Flux

openghg.standardise.standardise_flux(filepath, species, source, domain, database=None, source_format='openghg', database_version=None, model=None, time_resolved=False, high_time_resolution=False, period=None, chunks=None, continuous=True, tag=None, store=None, if_exists='auto', save_current='auto', overwrite=False, force=False, compression=True, compressor=None, filters=None, info_metadata=None)

Process flux / emissions data

Parameters:

filepath (str | Path) – Path of flux / emissions file
species (str) – Species name
source (str) – Flux / Emissions source
domain (str) – Flux / Emissions domain
source_format (str) – Data format, for example openghg, intem
date – Date as a string, e.g. “2012” or “201206”, associated with emissions. Only needed if this cannot be inferred from the time coords
time_resolved (bool) – If this is a high resolution file
high_time_resolution (bool) – This argument is deprecated and will be replaced in future versions with time_resolved.
period (str | tuple | None) – Period of measurements, if not passed this is inferred from the time coords
chunks (dict | None) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass an empty dictionary.
continuous (bool) – Whether time stamps have to be continuous.
tag (str | list | None) – Special tagged values to add to the Datasource. This will be added to any current values if the tag key already exists in a list.
store (str | None) – Name of store to write to
if_exists (str) –
What to do if existing data is present.
- “auto” - checks new and current data for timeseries overlap
  - adds data if no overlap
  - raises DataOverlapError if there is an overlap
- “new” - just include new data and ignore previous
- “combine” - replace and insert new data into current timeseries
save_current (str) –
Whether to save data in current form and create a new version.
- “auto” - this will depend on if_exists input (“auto” -> False), (other -> True)
- “y” / “yes” - Save current data exactly as it exists as a separate (previous) version
- “n” / “no” - Allow current data to be updated / deleted
overwrite (bool) – Deprecated. This will use options for if_exists=“new”.
force (bool) – Force adding of data even if this is identical to data stored.
compression (bool) – Enable compression in the store
compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname=”zstd”, clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.
filters (Any | None) – Filters to apply to the data on storage, this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.
info_metadata (dict | None) – Additional tags to describe the data, e.g. {“comment”: “Quality checks have been applied”}

Returns:

Dictionary of the Datasource UUIDs the data was assigned to

Return type:

dict
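
The save_current options above reduce to a yes/no decision that depends on if_exists. The function below is a hypothetical illustration of that mapping as described in the parameter text, not the OpenGHG implementation.

```python
def resolve_save_current(save_current="auto", if_exists="auto"):
    """Return True if the current data would be kept as a separate (previous) version."""
    value = save_current.lower()
    if value == "auto":
        # Documented rule: if_exists "auto" -> False, any other option -> True.
        return if_exists != "auto"
    if value in ("y", "yes"):
        return True
    if value in ("n", "no"):
        return False
    raise ValueError(f"Unknown save_current option: {save_current!r}")


keep_by_default = resolve_save_current()
keep_when_combining = resolve_save_current("auto", if_exists="combine")
```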

Footprints

openghg.standardise.standardise_footprint(filepath, model, domain, site=None, satellite=None, obs_region=None, selection=None, inlet=None, height=None, met_model=None, species=None, network=None, source_format='acrg_org', period=None, chunks=None, continuous=True, retrieve_met=False, tag=None, store=None, if_exists='auto', save_current='auto', overwrite=False, force=False, high_spatial_resolution=False, time_resolved=False, high_time_resolution=False, short_lifetime=False, sort=False, drop_duplicates=False, compression=True, compressor=None, filters=None, info_metadata=None, sort_files=False)

Reads footprint data files and returns the UUIDs of the Datasources the processed data has been assigned to

Parameters:

filepath (Union[str, Path, tuple, list]) – Path(s) of file to standardise
model (str) – Model used to create footprint (e.g. NAME or FLEXPART)
domain (str) – Domain of footprints
site (str | None) – Site name
satellite (str | None) – Satellite name
obs_region (str | None) – The geographic region covered by the data (“BRAZIL”, “INDIA”, “UK”).
selection (str | None) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”.
inlet (str | None) – Height above ground level in metres. Format ‘NUMUNIT’ e.g. “10m”
height (str | None) – Alias for inlet. One of height or inlet must be included.
met_model (str | None) – Underlying meteorological model used (e.g. UKV)
species (str | None) – Species name. Only needed if footprint is for a specific species e.g. co2 (and not inert)
network (str | None) – Network name
source_format (str) – Format of the input data, for example acrg_org
period (str | tuple | None) – Period of measurements. Only needed if this can not be inferred from the time coords
chunks (dict | None) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass an empty dictionary.
continuous (bool) – Whether time stamps have to be continuous.
retrieve_met (bool) – Whether to also download meteorological data for this footprint’s area
high_spatial_resolution (bool) – Indicate footprints include both a low and high spatial resolution.
time_resolved (bool) – Indicate footprints are high time resolution (include H_back dimension). Note this will be set to True automatically for Carbon Dioxide data.
short_lifetime (bool) – Indicate footprint is for a short-lived species. Needs species input. Note this will be set to True if species has an associated lifetime.
high_time_resolution (bool) – This argument is deprecated and will be replaced in future versions with time_resolved.
tag (str | list | None) – Special tagged values to add to the Datasource. This will be added to any current values if the tag key already exists in a list.
store (str | None) – Name of store to write to
if_exists (str) –
What to do if existing data is present.
- “auto” - checks new and current data for timeseries overlap
  - adds data if no overlap
  - raises DataOverlapError if there is an overlap
- “new” - just include new data and ignore previous
- “combine” - replace and insert new data into current timeseries
save_current (str) –
Whether to save data in current form and create a new version.
- “auto” - this will depend on if_exists input (“auto” -> False), (other -> True)
- “y” / “yes” - Save current data exactly as it exists as a separate (previous) version
- “n” / “no” - Allow current data to be updated / deleted
overwrite (bool) – Deprecated. This will use options for if_exists=“new”.
force (bool) – Force adding of data even if this is identical to data stored.
sort (bool) – Sort data by time
drop_duplicates (bool) – Drop duplicate timestamps, keeping the first value
compression (bool) – Enable compression in the store
compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname=”zstd”, clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.
filters (Any | None) – Filters to apply to the data on storage, this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.
info_metadata (dict | None) – Additional tags to describe the data, e.g. {“comment”: “Quality checks have been applied”}
sort_files (bool) – Sort multiple files datewise

Returns:

Dictionary containing confirmation of standardisation process. None if file already processed.

Return type:

dict / None
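
The effect of the sort and drop_duplicates flags can be sketched on a plain list of (timestamp, value) pairs. This is an illustration only: the helper name and the order in which the two steps are applied are assumptions, and OpenGHG operates on its own datasets rather than Python lists.

```python
def sort_and_dedupe(records, sort=False, drop_duplicates=False):
    """Toy version of the sort / drop_duplicates flags for (timestamp, value) pairs."""
    result = list(records)
    if sort:
        # list.sort is stable, so records sharing a timestamp keep their input order.
        result.sort(key=lambda record: record[0])
    if drop_duplicates:
        seen = set()
        unique = []
        for timestamp, value in result:
            if timestamp not in seen:
                # Keep only the first value encountered for each timestamp.
                seen.add(timestamp)
                unique.append((timestamp, value))
        result = unique
    return result


records = [("2020-01-02", 2.0), ("2020-01-01", 1.0), ("2020-01-02", 5.0)]
cleaned = sort_and_dedupe(records, sort=True, drop_duplicates=True)
```

ISO-formatted date strings are used so that lexical order matches chronological order.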

Flux Timeseries

openghg.standardise.standardise_flux_timeseries(filepath, species, source, region='UK', source_format='crf', domain=None, database=None, database_version=None, model=None, tag=None, store=None, if_exists='auto', save_current='auto', overwrite=False, force=False, compressor=None, filters=None, period=None, continuous=None, info_metadata=None)

Process a one-dimensional timeseries file

Parameters:

filepath (str | Path) – Path of flux timeseries file
species (str) – Species name
source (str) – Flux / Emissions source
region (str) – Region/Country of the CRF data
source_format (str) – Type of data being input e.g. openghg (internal format)
period (str | tuple | None) –
Period of measurements. Only needed if this can not be inferred from the time coords. If specified, should be one of:
- “yearly”, “monthly”
- suitable pandas Offset Alias
- tuple of (value, unit) as would be passed to pandas.Timedelta function
domain (str | None) – If flux is related to pre-existing domain (e.g. “EUROPE”) with defined latitude-longitude bounds this can be used to flag that. Otherwise, use region input to describe the name of a region (e.g. “UK”).
database (str | None) – Name of database source for this input (if relevant)
database_version (str | None) – Name of database version (if relevant)
model (str | None) – Model name (if relevant)
tag (str | list | None) – Special tagged values to add to the Datasource. This will be added to any current values if the tag key already exists in a list.
chunks – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass in an empty dictionary.
continuous (bool | None) – Whether time stamps have to be continuous.
if_exists (str) –
What to do if existing data is present.
- “auto” - checks new and current data for timeseries overlap
  - adds data if no overlap
  - raises DataOverlapError if there is an overlap
- “new” - just include new data and ignore previous
- “combine” - replace and insert new data into current timeseries
save_current (str) –
Whether to save data in current form and create a new version.
- “auto” - this will depend on if_exists input (“auto” -> False), (other -> True)
- “y” / “yes” - Save current data exactly as it exists as a separate (previous) version
- “n” / “no” - Allow current data to be updated / deleted
overwrite (bool) – Deprecated. This will use options for if_exists=“new”.
force (bool) – Force adding of data even if this is identical to data stored.
compressor (Any | None) – A custom compressor to use. If None, this will default to Blosc(cname=”zstd”, clevel=5, shuffle=Blosc.SHUFFLE). See https://zarr.readthedocs.io/en/stable/api/codecs.html for more information on compressors.
filters (Any | None) – Filters to apply to the data on storage, this defaults to no filtering. See https://zarr.readthedocs.io/en/stable/tutorial.html#filters for more information on picking filters.
info_metadata (dict | None) – Additional tags to describe the data, e.g. {“comment”: “Quality checks have been applied”}

Returns:

Dictionary of the Datasource UUIDs the data was assigned to

Return type:

dict
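
To get a feel for the chunks dictionary described above, the sketch below counts how many chunks each dimension would be split into for a given schema. The helper and the dimension sizes are hypothetical, and treating a dimension that is missing from the dictionary as a single chunk is an assumption made for illustration.

```python
import math


def chunk_counts(dim_sizes, chunks):
    """Number of chunks per dimension for a {dimension: chunk size} schema."""
    if chunks is None:
        # The schema would be chosen automatically by OpenGHG; that choice is not modelled here.
        return None
    return {
        dim: math.ceil(size / chunks[dim]) if dim in chunks else 1
        for dim, size in dim_sizes.items()
    }


sizes = {"time": 250, "lat": 10}  # hypothetical dimension sizes
with_time_chunks = chunk_counts(sizes, {"time": 100})
without_chunking = chunk_counts(sizes, {})
```

Here 250 time steps with a chunk size of 100 give three chunks along time, while the empty dictionary leaves every dimension in one piece.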

Helpers

Some of the functions above require quite specific arguments, as we must ensure all metadata attributed to data is as correct as possible. These functions help you find the correct arguments in each case.

Behind the scenes these functions use parsing functions that are written specifically for each data type. Please see the Developer API for these functions.