Standardise - data
Each of these functions parses a specific type of data file and returns a dictionary containing the data and metadata.
Surface observations
- openghg.standardise.surface.parse_beaco2n(filepath, site, network, inlet, instrument='shinyei', sampling_period=None, **kwargs)
Read BEACO2N data files
- Parameters:
filepath (str | Path) – Data filepath
site (str) – Site name
network (str) – Network name
inlet (str) – Inlet height in metres
instrument (str | None) – Instrument name
sampling_period (Optional[str]) – Measurement sampling period
- Returns:
Dictionary of data
- Return type:
dict
- openghg.standardise.surface.parse_crds(filepath, site, network, inlet=None, instrument=None, sampling_period=None, measurement_type=None, drop_duplicates=True, update_mismatch='never', site_filepath=None, **kwargs)
Parses a CRDS data file and creates a dictionary of xarray Datasets ready for storage in the object store.
- Parameters:
filepath (str | Path) – Path to file
site (str) – Three letter site code
network (str) – Network name
inlet (Optional[str]) – Inlet height
instrument (Optional[str]) – Instrument name
sampling_period (Union[str, int, float, None]) – Sampling period in seconds
measurement_type (Optional[str]) – Measurement type e.g. insitu, flask
drop_duplicates (bool) – Drop measurements at duplicate timestamps, keeping the first.
update_mismatch (str) – Determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)
“from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)
site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). By default, the data stored within the openghg_defs/data/site_info JSON file is used.
- Returns:
Dictionary of gas data
- Return type:
dict
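As a rough illustration of how a surface parser such as parse_crds might be called, the sketch below wraps it in a small helper. The helper name standardise_crds_file and the values in the usage comment are hypothetical; only the keyword arguments listed in the signature above are used, and the structure of the returned dictionary is not assumed beyond it being a dict.

```python
from pathlib import Path
from typing import Union


def standardise_crds_file(
    filepath: Union[str, Path], site: str, network: str, inlet: str
) -> dict:
    """Hypothetical convenience wrapper around parse_crds (not part of OpenGHG)."""
    filepath = Path(filepath)

    # Fail early with a clear message before handing the path to the parser.
    if not filepath.exists():
        raise FileNotFoundError(f"No CRDS data file found at {filepath}")

    # Imported lazily so this helper can be defined without OpenGHG installed.
    from openghg.standardise.surface import parse_crds

    # update_mismatch="never" is the documented default: any mismatch between
    # attributes and metadata raises an AttrMismatchError.
    return parse_crds(
        filepath=filepath,
        site=site,
        network=network,
        inlet=inlet,
        update_mismatch="never",
    )


# Illustrative call only; the file name and argument values are placeholders.
# gas_data = standardise_crds_file("my_crds_file.dat", site="bsd", network="DECC", inlet="248m")
```

Checking the path before importing the parser keeps the error message specific to the missing file rather than to the parser internals.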
- openghg.standardise.surface.parse_gcwerks(filepath, precision_filepath, site, network, inlet=None, instrument=None, sampling_period=None, measurement_type=None, update_mismatch='never', site_filepath=None)
Reads a GC data file by creating a GC object and associated datasources.
- Parameters:
filepath (str | Path) – Path of data file
precision_filepath (str | Path) – Path of precision file
site (str) – Three letter code or name for site
instrument (Optional[str]) – Instrument name
network (str) – Network name
update_mismatch (str) – Determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)
“from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)
site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). By default, the data stored within the openghg_defs/data/site_info JSON file is used.
- Returns:
Dictionary of source_name : UUIDs
- Return type:
dict
- openghg.standardise.surface.parse_noaa(filepath, site, measurement_type, inlet=None, network='NOAA', instrument=None, sampling_period=None, update_mismatch='never', site_filepath=None, **kwarg)
Read NOAA data from a raw text file or an ObsPack NetCDF file.
- Parameters:
filepath (str | Path) – Data filepath
site (str) – Three letter site code
inlet (Optional[str]) – Inlet height (as value unit e.g. “10m”)
measurement_type (str) – One of (“flask”, “insitu”, “pfp”)
network (str) – Network, defaults to NOAA
instrument (Optional[str]) – Instrument name
sampling_period (Optional[str]) – Sampling period
update_mismatch (str) – Determines how mismatches between the internal data attributes and the supplied / derived metadata are handled. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“attributes” - update mismatches based on input attributes
“metadata” - update mismatches based on input metadata
site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). By default, the data stored within the openghg_defs/data/site_info JSON file is used.
- Returns:
Dictionary of data and metadata
- Return type:
dict
- openghg.standardise.surface.parse_npl(filepath, site='NPL', network='LGHG', inlet=None, instrument=None, sampling_period=None, measurement_type=None, update_mismatch='never')
Reads NPL data files and returns the UUIDs of the Datasources the processed data has been assigned to.
- Parameters:
filepath (Union[str, Path]) – Path of file to load
site (str) – Site name
network (str) – Network, defaults to LGHG
inlet (Optional[str]) – Inlet height. Will be inferred if not specified
instrument (Optional[str]) – Instrument name
sampling_period (Optional[str]) – Sampling period
measurement_type (Optional[str]) – Type of measurement taken e.g. “flask”, “insitu”
update_mismatch (str) – Determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)
“from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)
- Returns:
UUIDs of Datasources data has been assigned to
- Return type:
list
Column data
- openghg.standardise.column.parse_openghg(filepath, satellite=None, domain=None, selection=None, site=None, species=None, network=None, instrument=None, platform='satellite', chunks=None, **kwargs)
Parse and extract data from a pre-formatted netCDF file which already matches the expected OpenGHG format.
The arguments specified below are the metadata needed to store this observation file within the object store. If these keywords are not included within the attributes of the netCDF file being passed then these arguments must be specified.
For column data the source can be either a satellite (e.g. satellite=“GOSAT”) or a site (site=“RUN”, network=“TCCON”). Either can be specified, or this function will attempt to extract this from the data file.
- Parameters:
filepath (str | Path) – Path of observation file
satellite (Optional[str]) – Name of satellite (if relevant)
domain (Optional[str]) – For satellite only. If data has been selected on an area, include the identifier name for the domain covered. This can map to previously defined domains (see openghg_defs “domain_info.json” file) or a newly defined domain.
selection (Optional[str]) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”. If not specified, domain will be used.
site (Optional[str]) – Site code/name (if relevant). Can include satellite OR site.
species (Optional[str]) – Species name or synonym e.g. “ch4”
instrument (Optional[str]) – Instrument name e.g. “TANSO-FTS”
network (Optional[str]) – Name of in-situ or satellite network e.g. “TCCON”, “GOSAT”
platform (str) – Type of platform. Should be one of:
“satellite”
“site”
Note: this will be superseded if site or satellite keywords are specified.
chunks (Optional[dict]) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass in an empty dictionary.
kwargs (str) – Any additional attributes to be associated with the data.
- Returns:
Dictionary of source_name : data, metadata, attributes
- Return type:
Dict
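The note on platform above says that the satellite and site keywords take precedence over the platform argument. The following is a minimal sketch of that precedence rule only; resolve_platform is a hypothetical helper, not an OpenGHG function, and raising an error when both satellite and site are given is an assumption rather than documented behaviour.

```python
from typing import Optional


def resolve_platform(
    satellite: Optional[str] = None,
    site: Optional[str] = None,
    platform: str = "satellite",
) -> str:
    """Hypothetical illustration of how platform could be derived."""
    # Assumption: supplying both is treated as ambiguous in this sketch.
    if satellite is not None and site is not None:
        raise ValueError("Specify either satellite or site, not both")

    # An explicit satellite or site keyword supersedes the platform argument.
    if satellite is not None:
        return "satellite"
    if site is not None:
        return "site"

    # Otherwise fall back to the platform argument, restricted to the documented values.
    if platform not in ("satellite", "site"):
        raise ValueError(f"Unknown platform: {platform!r}")
    return platform


print(resolve_platform(site="RUN"))         # site keyword wins over the default platform
print(resolve_platform(satellite="GOSAT"))
```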
Emissions / flux
Flux Timeseries
- openghg.standardise.flux_timeseries.parse_crf(filepath, species, source='anthro', region='UK', domain=None, data_type='flux_timeseries', database=None, database_version=None, model=None, period=None, continuous=True)
Parse CRF emissions data from the specified file.
- Parameters:
filepath (Path) – Path to the ‘.xlsx’ file containing CRF emissions data.
species (str) – Name of species
source (str) – Source of the emissions data, e.g. “energy”, “anthro”. Default is ‘anthro’.
region (str) – Region/country of the CRF data
domain (Optional[str]) – Geographic domain. Defaults to None, in which case region is used to identify the area.
data_type (str) – Type of data, default is ‘flux_timeseries’.
database (Optional[str]) – Database name if applicable.
database_version (Optional[str]) – Version of the database if applicable.
model (Optional[str]) – Model name if applicable.
period (Union[str, tuple, None]) – Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of:
“yearly”, “monthly”
suitable pandas Offset Alias
tuple of (value, unit) as would be passed to pandas.Timedelta function
continuous (bool) – Whether time stamps have to be continuous.
- Returns:
Parsed flux timeseries data in dictionary format.
- Return type:
Dict
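The three accepted forms of the period argument described above can be illustrated with pandas directly. This is a sketch only: interpret_period is a hypothetical helper that shows what each form corresponds to in pandas, and it makes no claim about how parse_crf handles the value internally.

```python
from typing import Optional, Tuple, Union

import pandas as pd
from pandas.tseries.frequencies import to_offset


def interpret_period(period: Union[str, Tuple[float, str], None]):
    """Hypothetical helper showing the pandas object behind each period form."""
    if period is None:
        # Nothing supplied: the period would need to be inferred from the time coords.
        return None

    if isinstance(period, tuple):
        # (value, unit) pair, as would be passed to pandas.Timedelta.
        value, unit = period
        return pd.Timedelta(value, unit)

    if period in ("yearly", "monthly"):
        # Keyword form: returned unchanged in this sketch.
        return period

    # Anything else is treated as a pandas offset alias, e.g. "MS" for month start.
    return to_offset(period)


print(interpret_period((12, "h")))
print(interpret_period("MS"))
print(interpret_period("yearly"))
```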
Metadata
These ensure the metadata and attributes stored with data are correct.
- openghg.standardise.meta.assign_attributes(data, site=None, network=None, sampling_period=None, update_mismatch='never', site_filepath=None, species_filepath=None)
Assign attributes to each site and species dataset. This ensures that the xarray Datasets produced are CF 1.7 compliant. Some of the attributes written to the Dataset are saved as metadata to the Datasource, allowing more detailed searching of data.
Underlying stored site or species definitions are accessed from the openghg/openghg_defs repository by default.
- Parameters:
data (dict) – Dictionary containing data, metadata and attributes
site (Optional[str]) – Site code
sampling_period (Union[str, int, float, None]) – Number of seconds for which air sample is taken. Only for time variable attribute
network (Optional[str]) – Network name
update_mismatch (str) – Determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)
“from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)
site_filepath (Union[str, Path, None]) – Alternative site info file
species_filepath (Union[str, Path, None]) – Alternative species info file
- Returns:
Dictionary of combined data with correct attributes assigned to Datasets
- Return type:
dict
- openghg.standardise.meta.get_attributes(ds, species, site, network=None, global_attributes=None, units=None, scale=None, sampling_period=None, date_range=None, site_filepath=None, species_filepath=None)
This function writes attributes to an xarray.Dataset so that they conform with the CF Convention v1.6.
Attributes of the xarray Dataset are modified, and variable names are changed.
Underlying stored site or species definitions are accessed from the openghg/openghg_defs repository by default.
Variable naming related to species name will be defined using the define_species_label() function.
- Parameters:
ds (Dataset) – Should contain variables such as “ch4”, “ch4 repeatability”. Must have a “time” dimension.
species (str) – Species name. e.g. “CH4”, “HFC-134a”, “dCH4C13”
site (str) – Three-letter site code
network (Optional[str]) – Network site is associated with
global_attributes – Dictionary containing any info you want to add to the file header (e.g. {“Contact”: “Contact_Name”})
units (Optional[str]) – This routine will try to guess the units unless this is specified. Options are in units_interpret
scale (Optional[str]) – Calibration scale for species.
sampling_period (Union[str, int, float, None]) – Number of seconds for which air sample is taken. Only for time variable attribute
date_range (Optional[list[str]]) – Start and end date for output. If you only want an end date, just put a very early start date (e.g. [“1900-01-01”, “2010-01-01”])
site_filepath (Union[str, Path, None]) – Alternative site info file
species_filepath (Union[str, Path, None]) – Alternative species info file
- Return type:
Dataset
- openghg.standardise.meta.assign_flux_attributes(data, species=None, source=None, domain=None, units='mol/m2/s', prior_info_dict=None)
Assign attributes for the input flux dataset within dictionary based on metadata and passed arguments.
- Parameters:
data (dict) – Dictionary containing data, metadata and attributes
species (Optional[str]) – Species name
source (Optional[str]) – Source name
domain (Optional[str]) – Domain name
units (str) – Unit values for the “flux” variable. Default = “mol/m2/s”
prior_info_dict (Optional[dict]) – Dictionary of additional “prior” information about the emissions sources. Expected to be of the form, e.g.:
{"EDGAR": {"version": "v4.3.2",
"raw_resolution": "0.1 degree x 0.1 degree",
"reference": "http://edgar.jrc.ec.europa.eu/overview.php?v=432_GHG", …},
…}
- Returns:
Same format as inputted but with updated “data” component (Dataset)
- Return type:
Dict
- openghg.standardise.meta.define_species_label(species, species_filepath=None)
Define standardised label to use for observation datasets. This uses the data stored within openghg_defs/data/site_info JSON file by default with alternative names (‘alt’) defined within.
- Formatting:
species label will be all lower case
any spaces will be replaced with underscores
if species or synonym cannot be found, species name will be used but with any hyphens taken out (see also openghg.util.clean_string function)
Note: Suggested naming for isotopologues should be d<species><isotope>, e.g. dCH4C13, or dCO2C14
- Parameters:
species (str) – Species name.
species_filepath (Union[str, Path, None]) – Alternative species info file.
- Returns:
Both the species label to be used exactly and the original attribute key needed to extract additional data from the ‘site_info.json’ attributes file.
- Return type:
str, str
Example
>>> define_species_label("methane")
("ch4", "CH4")
>>> define_species_label("radon")
("rn", "Rn")
>>> define_species_label("cfc-11")
("cfc11", "CFC11")
>>> define_species_label("CH4C13")
("dch4c13", "DCH4C13")
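The formatting rules listed for define_species_label can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: sketch_species_label and EXAMPLE_SPECIES_INFO are hypothetical, the lookup table is a tiny stand-in for the real definitions file, the fallback returns the name as given for the second element, and the isotopologue “d” prefix handling is not reproduced.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the species definitions, keyed by species with 'alt' names.
EXAMPLE_SPECIES_INFO: Dict[str, Dict[str, List[str]]] = {
    "CH4": {"alt": ["methane"]},
    "Rn": {"alt": ["radon"]},
    "CFC11": {"alt": ["cfc-11"]},
}


def sketch_species_label(
    species: str,
    species_info: Dict[str, Dict[str, List[str]]] = EXAMPLE_SPECIES_INFO,
) -> Tuple[str, str]:
    """Return (label, key) following the documented formatting rules."""
    lookup = species.strip().lower()

    # Match against each key and its alternative names, ignoring case.
    for key, info in species_info.items():
        names = [key] + info.get("alt", [])
        if lookup in (name.lower() for name in names):
            # Label is lower case with spaces replaced by underscores.
            return key.lower().replace(" ", "_"), key

    # Not found: use the supplied name, lower case, spaces to underscores, hyphens removed.
    label = lookup.replace(" ", "_").replace("-", "")
    return label, species


print(sketch_species_label("methane"))
print(sketch_species_label("cfc-11"))
print(sketch_species_label("my-new gas"))
```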
- openghg.standardise.meta.attributes_default_keys()
Defines default values expected within ObsSurface metadata.
- Returns:
keys required in metadata
- Return type:
list
- openghg.standardise.meta.sync_surface_metadata(metadata, attributes, keys_to_add=None, update_mismatch='never')
Makes sure any duplicated keys between the metadata and attributes dictionaries match and that certain keys are present in the metadata.
- Parameters:
metadata (dict) – Dictionary of metadata
attributes (dict) – Attributes
keys_to_add (Optional[list]) – Add these keys to the metadata, if not present, based on the attribute values. Note: this skips any keys which can’t be copied from the attribute values.
update_mismatch (str) – If a case-insensitive mismatch is found between an attribute and a metadata value, this determines the function behaviour. The options are:
“never” - don’t update mismatches and raise an AttrMismatchError
“from_source” / “attributes” - update mismatches based on input attributes
“from_definition” / “metadata” - update mismatches based on input metadata
- Returns:
Aligned metadata, attributes
- Return type:
dict, dict
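The update_mismatch options described above can be illustrated with a small stand-alone sketch. It assumes plain dictionaries of string values; sketch_sync and the local AttrMismatchError class are hypothetical stand-ins written for this example, and the keys_to_add behaviour is not covered.

```python
from typing import Any, Dict, Tuple


class AttrMismatchError(Exception):
    """Local stand-in for the error named in the documentation."""


def sketch_sync(
    metadata: Dict[str, Any],
    attributes: Dict[str, Any],
    update_mismatch: str = "never",
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Align keys shared by metadata and attributes (illustrative only)."""
    meta, attrs = dict(metadata), dict(attributes)

    for key in sorted(meta.keys() & attrs.keys()):
        # Values that differ only by case are not treated as a mismatch.
        if str(meta[key]).lower() == str(attrs[key]).lower():
            continue

        if update_mismatch == "never":
            raise AttrMismatchError(
                f"{key!r}: metadata={meta[key]!r}, attributes={attrs[key]!r}"
            )
        elif update_mismatch in ("from_source", "attributes"):
            # The input attributes win: copy the attribute value into the metadata.
            meta[key] = attrs[key]
        elif update_mismatch in ("from_definition", "metadata"):
            # The input metadata wins: copy the metadata value into the attributes.
            attrs[key] = meta[key]
        else:
            raise ValueError(f"Unknown update_mismatch option: {update_mismatch!r}")

    return meta, attrs


meta, attrs = sketch_sync(
    {"site": "BSD", "inlet": "248m"},
    {"site": "bsd", "inlet": "100m"},
    update_mismatch="from_source",
)
print(meta, attrs)
```

Copies of both dictionaries are returned so the caller's inputs are left unchanged.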