Standardise - data#

Each of these functions parses a specific type of data file and returns a dictionary containing the data and metadata.

Surface observations#

openghg.standardise.surface.parse_beaco2n(filepath, site, network, inlet, instrument='shinyei', sampling_period=None, **kwargs)[source]#

Read BEACO2N data files

Parameters:
  • filepath (str | Path) – Data filepath

  • site (str) – Site name

  • network (str) – Network name

  • inlet (str) – Inlet height in metres

  • instrument (str | None) – Instrument name

  • sampling_period (Optional[str]) – Measurement sampling period

Returns:

Dictionary of data

Return type:

dict
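
The sketch below shows a minimal, hypothetical call; the file path, site name and inlet are placeholders and should be replaced with values matching your own BEACO2N file.

    from openghg.standardise.surface import parse_beaco2n

    # Placeholder path, site and inlet for illustration only
    parsed = parse_beaco2n(
        filepath="/path/to/beaco2n_node.csv",
        site="example_site",
        network="BEACO2N",
        inlet="10m",
    )

    # Inspect the returned dictionary of data and metadata
    print(parsed.keys())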

openghg.standardise.surface.parse_crds(filepath, site, network, inlet=None, instrument=None, sampling_period=None, measurement_type=None, drop_duplicates=True, update_mismatch='never', site_filepath=None, **kwargs)[source]#

Parses a CRDS data file and creates a dictionary of xarray Datasets ready for storage in the object store.

Parameters:
  • filepath (str | Path) – Path to file

  • site (str) – Three letter site code

  • network (str) – Network name

  • inlet (Optional[str]) – Inlet height

  • instrument (Optional[str]) – Instrument name

  • sampling_period (Union[str, int, float, None]) – Sampling period in seconds

  • measurement_type (Optional[str]) – Measurement type e.g. insitu, flask

  • drop_duplicates (bool) – Drop measurements at duplicate timestamps, keeping the first.

  • update_mismatch (str) –

    This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)

    • ”from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)

  • site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.

Returns:

Dictionary of gas data

Return type:

dict
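
As a hedged usage sketch, parse_crds might be called as below; the file path, site code, inlet and sampling period are illustrative placeholders.

    from openghg.standardise.surface import parse_crds

    # Placeholder file path and site details; update_mismatch="never"
    # (the default) raises an AttrMismatchError on attribute/metadata
    # mismatches rather than resolving them
    gas_data = parse_crds(
        filepath="/path/to/bsd.picarro.1minute.108m.dat",
        site="BSD",
        network="DECC",
        inlet="108m",
        sampling_period="60.0",
        update_mismatch="never",
    )

    # Each entry is expected to hold an xarray Dataset with its
    # associated metadata, ready for storage in the object store
    for key in gas_data:
        print(key)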

openghg.standardise.surface.parse_gcwerks(filepath, precision_filepath, site, network, inlet=None, instrument=None, sampling_period=None, measurement_type=None, update_mismatch='never', site_filepath=None)[source]#

Reads a GC data file by creating a GC object and associated datasources

Parameters:
  • filepath (str | Path) – Path of data file

  • precision_filepath (str | Path) – Path of precision file

  • site (str) – Three letter code or name for site

  • instrument (Optional[str]) – Instrument name

  • network (str) – Network name

  • update_mismatch (str) –

    This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)

    • ”from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)

  • site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.

Returns:

Dictionary of source_name : UUIDs

Return type:

dict
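
A possible call is sketched below, assuming hypothetical GCWERKS data and precision file paths and illustrative site, network and instrument names.

    from openghg.standardise.surface import parse_gcwerks

    # Placeholder paths and identifiers; GCWERKS exports are read as a
    # data file plus a matching precision file
    result = parse_gcwerks(
        filepath="/path/to/capegrim-medusa.18.C",
        precision_filepath="/path/to/capegrim-medusa.18.precisions.C",
        site="CGO",
        network="AGAGE",
        instrument="medusa",
    )

    # As documented above, the return value is a dictionary of
    # source_name : UUIDs
    print(result)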

openghg.standardise.surface.parse_noaa(filepath, site, measurement_type, inlet=None, network='NOAA', instrument=None, sampling_period=None, update_mismatch='never', site_filepath=None, **kwarg)[source]#

Read NOAA data from raw text file or ObsPack NetCDF

Parameters:
  • filepath (str | Path) – Data filepath

  • site (str) – Three letter site code

  • inlet (Optional[str]) – Inlet height (as value unit e.g. “10m”)

  • measurement_type (str) – One of (“flask”, “insitu”, “pfp”)

  • network (str) – Network, defaults to NOAA

  • instrument (Optional[str]) – Instrument name

  • sampling_period (Optional[str]) – Sampling period

  • update_mismatch (str) –

    This determines how mismatches between the internal data attributes and the supplied / derived metadata are handled. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”attributes” - update mismatches based on input attributes

    • ”metadata” - update mismatches based on input metadata

  • site_filepath (Union[str, Path, None]) – Alternative site info file (see openghg/openghg_defs repository for format). Otherwise will use the data stored within openghg_defs/data/site_info JSON file by default.

Returns:

Dictionary of data and metadata

Return type:

dict
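
A hedged sketch of reading a NOAA ObsPack file; the path, site code and inlet are placeholders, and measurement_type must be one of “flask”, “insitu” or “pfp”.

    from openghg.standardise.surface import parse_noaa

    # Placeholder ObsPack NetCDF path and site details
    noaa_data = parse_noaa(
        filepath="/path/to/ch4_mhd_surface-flask_1_ccgg_event.nc",
        site="MHD",
        measurement_type="flask",
        inlet="10m",
        network="NOAA",
    )

    # Inspect the returned dictionary of data and metadata
    print(noaa_data.keys())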

openghg.standardise.surface.parse_npl(filepath, site='NPL', network='LGHG', inlet=None, instrument=None, sampling_period=None, measurement_type=None, update_mismatch='never')[source]#

Reads NPL data files and returns the UUIDs of the Datasources to which the processed data has been assigned

Parameters:
  • filepath (Union[str, Path]) – Path of file to load

  • site (str) – Site name

  • network (str) – Network, defaults to LGHG

  • inlet (Optional[str]) – Inlet height. Will be inferred if not specified

  • instrument (Optional[str]) – Instrument name

  • sampling_period (Optional[str]) – Sampling period

  • measurement_type (Optional[str]) – Type of measurement taken, e.g. “flask”, “insitu”

  • update_mismatch (str) –

    This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)

    • ”from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)

Returns:

UUIDs of Datasources data has been assigned to

Return type:

list
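
A minimal sketch with a hypothetical file path; site and network keep their defaults of “NPL” and “LGHG”.

    from openghg.standardise.surface import parse_npl

    # Placeholder path; inlet will be inferred if not specified
    uuids = parse_npl(
        filepath="/path/to/npl_data_2020.csv",
        inlet="10m",
        sampling_period="60",
    )

    # As documented, the return value is a list of Datasource UUIDs
    print(uuids)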

Column data#

openghg.standardise.column.parse_openghg(filepath, satellite=None, domain=None, selection=None, site=None, species=None, network=None, instrument=None, platform='satellite', chunks=None, **kwargs)[source]#

Parse and extract data from a pre-formatted NetCDF file which already matches the expected OpenGHG format.

The arguments specified below are the metadata needed to store this observation file within the object store. If these keywords are not included within the attributes of the NetCDF file being passed, then these arguments must be specified.

For column data this can either be a satellite (e.g. satellite=”GOSAT”) or a site (e.g. site=”RUN”, network=”TCCON”). Either can be specified, or this function will attempt to extract this information from the data file.

Parameters:
  • filepath (str | Path) – Path of observation file

  • satellite (Optional[str]) – Name of satellite (if relevant)

  • domain (Optional[str]) – For satellite only. If data has been selected on an area include the identifier name for domain covered. This can map to previously defined domains (see openghg_defs “domain_info.json” file) or a newly defined domain.

  • selection (Optional[str]) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”. If not specified, domain will be used.

  • site (Optional[str]) – Site code/name (if relevant). Can include satellite OR site.

  • species (Optional[str]) – Species name or synonym e.g. “ch4”

  • instrument (Optional[str]) – Instrument name e.g. “TANSO-FTS”

  • network (Optional[str]) – Name of in-situ or satellite network e.g. “TCCON”, “GOSAT”

  • platform (str) – Type of platform. Should be one of: “satellite”, “site”. Note: this will be superseded if site or satellite keywords are specified.

  • chunks (Optional[dict]) – Chunking schema to use when storing data. It expects a dictionary of dimension name and chunk size, for example {“time”: 100}. If None then a chunking schema will be set automatically by OpenGHG. See documentation for guidance on chunking: https://docs.openghg.org/tutorials/local/Adding_data/Adding_ancillary_data.html#chunking. To disable chunking pass in an empty dictionary.

  • kwargs (str) – Any additional attributes to be associated with the data.

Returns:

Dictionary of source_name : data, metadata, attributes

Return type:

Dict
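
Two hedged sketches follow: one for satellite data with a domain and selection, and one for a ground-based site. The file paths and identifiers are placeholders; the satellite, site and network names echo the examples given in the description above.

    from openghg.standardise.column import parse_openghg

    # Satellite example (placeholder path, domain and selection)
    satellite_data = parse_openghg(
        filepath="/path/to/gosat_ch4_selection.nc",
        satellite="GOSAT",
        domain="SOUTHAMERICA",
        selection="land",
        species="ch4",
        network="GOSAT",
        platform="satellite",
        chunks={"time": 100},
    )

    # Site example (placeholder path; the site keyword supersedes platform)
    site_data = parse_openghg(
        filepath="/path/to/tccon_run_xch4.nc",
        site="RUN",
        network="TCCON",
        species="ch4",
        platform="site",
    )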

Emissions / flux#

Flux Timeseries#

openghg.standardise.flux_timeseries.parse_crf(filepath, species, source='anthro', region='UK', domain=None, data_type='flux_timeseries', database=None, database_version=None, model=None, period=None, continuous=True)[source]#

Parse CRF emissions data from the specified file.

Parameters:
  • filepath (Path) – Path to the ‘.xlsx’ file containing CRF emissions data.

  • species (str) – Name of species

  • source (str) – Source of the emissions data, e.g. “energy”, “anthro”, default is ‘anthro’.

  • region (str) – Region/Country of the CRF data

  • domain (Optional[str]) – Geographic domain, default is None; the region argument is used to identify the area instead.

  • data_type (str) – Type of data, default is ‘flux_timeseries’.

  • database (Optional[str]) – Database name if applicable.

  • database_version (Optional[str]) – Version of the database if applicable.

  • model (Optional[str]) – Model name if applicable.

  • period (Union[str, tuple, None]) –

    Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of:

    • ”yearly”, “monthly”

    • suitable pandas Offset Alias

    • tuple of (value, unit) as would be passed to pandas.Timedelta function

  • continuous (bool) – Whether time stamps have to be continuous.

Returns:

Parsed flux timeseries data in dictionary format.

Return type:

Dict
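
A hedged sketch of parsing a hypothetical CRF ‘.xlsx’ inventory file; the path and argument values are placeholders chosen to match the parameter descriptions above.

    from pathlib import Path

    from openghg.standardise.flux_timeseries import parse_crf

    # Placeholder '.xlsx' file path; region identifies the area since
    # domain is not supplied
    crf_data = parse_crf(
        filepath=Path("/path/to/uk_crf_emissions.xlsx"),
        species="ch4",
        source="anthro",
        region="UK",
        period="yearly",
    )

    # Parsed flux timeseries data is returned in dictionary format
    print(crf_data.keys())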

Metadata#

These ensure the metadata and attributes stored with data are correct.

openghg.standardise.meta.assign_attributes(data, site=None, network=None, sampling_period=None, update_mismatch='never', site_filepath=None, species_filepath=None)[source]#

Assign attributes to each site and species dataset. This ensures that the xarray Datasets produced are CF 1.7 compliant. Some of the attributes written to the Dataset are saved as metadata to the Datasource allowing more detailed searching of data.

If accessing underlying stored site or species definitions, this will be accessed from the openghg/openghg_defs repository by default.

Parameters:
  • data (dict) – Dictionary containing data, metadata and attributes

  • site (Optional[str]) – Site code

  • sampling_period (Union[str, int, float, None]) – Number of seconds for which air sample is taken. Only for time variable attribute

  • network (Optional[str]) – Network name

  • update_mismatch (str) –

    This determines how mismatches between the internal data “attributes” and the supplied / derived “metadata” are handled. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”from_source” / “attributes” - update mismatches based on input data (e.g. data attributes)

    • ”from_definition” / “metadata” - update mismatches based on associated data (e.g. site_info.json)

  • site_filepath (Union[str, Path, None]) – Alternative site info file

  • species_filepath (Union[str, Path, None]) – Alternative species info file

Returns:

Dictionary of combined data with correct attributes assigned to Datasets

Return type:

dict
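
A hedged sketch of applying assign_attributes to the dictionary returned by one of the surface parse functions above; the file path and site details are placeholders.

    from openghg.standardise.meta import assign_attributes
    from openghg.standardise.surface import parse_crds

    # Placeholder path and site details
    parsed = parse_crds(
        filepath="/path/to/bsd.picarro.1minute.108m.dat",
        site="BSD",
        network="DECC",
    )

    # Make the Datasets CF compliant and reconcile attributes with metadata
    combined = assign_attributes(
        data=parsed,
        site="BSD",
        network="DECC",
        update_mismatch="never",
    )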

openghg.standardise.meta.get_attributes(ds, species, site, network=None, global_attributes=None, units=None, scale=None, sampling_period=None, date_range=None, site_filepath=None, species_filepath=None)[source]#

This function writes attributes to an xarray.Dataset so that they conform with the CF Convention v1.6

Attributes of the xarray Dataset are modified, and variable names are changed.

If accessing underlying stored site or species definitions, this will be accessed from the openghg/openghg_defs repository by default.

Variable naming related to the species name will be defined using the define_species_label() function.

Parameters:
  • ds (Dataset) – Should contain variables such as “ch4”, “ch4 repeatability”. Must have a “time” dimension.

  • species (str) – Species name. e.g. “CH4”, “HFC-134a”, “dCH4C13”

  • site (str) – Three-letter site code

  • network (Optional[str]) – Network site is associated with

  • global_attributes – Dictionary containing any info you want to add to the file header (e.g. {“Contact”: “Contact_Name”})

  • units (Optional[str]) – This routine will try to guess the units unless this is specified. Options are in units_interpret

  • scale (Optional[str]) – Calibration scale for species.

  • sampling_period (Union[str, int, float, None]) – Number of seconds for which air sample is taken. Only for time variable attribute

  • date_range (Optional[list[str]]) – Start and end date for output. If you only want an end date, just put a very early start date (e.g. [“1900-01-01”, “2010-01-01”])

  • site_filepath (Union[str, Path, None]) – Alternative site info file

  • species_filepath (Union[str, Path, None]) – Alternative species info file

Return type:

Dataset
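
The sketch below builds a tiny illustrative Dataset with a “time” dimension and a “ch4” variable and passes it through get_attributes; the site code, network and mole fraction values are placeholders.

    import numpy as np
    import pandas as pd
    import xarray as xr

    from openghg.standardise.meta import get_attributes

    # Minimal illustrative Dataset; real data would come from a parsed file
    times = pd.to_datetime(
        ["2020-01-01T00:00:00", "2020-01-01T01:00:00", "2020-01-01T02:00:00"]
    )
    ds = xr.Dataset(
        {"ch4": ("time", np.array([1900.0, 1902.5, 1898.7]))},
        coords={"time": times},
    )

    ds_with_attrs = get_attributes(
        ds,
        species="CH4",
        site="BSD",        # placeholder three-letter site code
        network="DECC",    # placeholder network name
        sampling_period=60,
    )
    print(ds_with_attrs.attrs)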

openghg.standardise.meta.assign_flux_attributes(data, species=None, source=None, domain=None, units='mol/m2/s', prior_info_dict=None)[source]#

Assign attributes for the input flux datasets within the dictionary, based on the metadata and passed arguments.

Parameters:
  • data (dict) – Dictionary containing data, metadata and attributes

  • species (Optional[str]) – Species name

  • source (Optional[str]) – Source name

  • domain (Optional[str]) – Domain name

  • units (str) – Unit values for the “flux” variable. Default = “mol/m2/s”

  • prior_info_dict (Optional[dict]) –

    Dictionary of additional ‘prior’ information about the emissions sources. Expect this to be of the form e.g.

    {“EDGAR”: {“version”: “v4.3.2”,
               “raw_resolution”: “0.1 degree x 0.1 degree”,
               “reference”: “http://edgar.jrc.ec.europa.eu/overview.php?v=432_GHG”,
               …},
     …}

Returns:

Same format as the input, but with an updated “data” component (Dataset)

Return type:

Dict
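
A hedged sketch of assigning flux attributes; the nesting of the data dictionary below ({name: {“data”: Dataset, “metadata”: dict}}) and the EUROPE domain name are assumptions made for illustration, while prior_info follows the form shown above.

    import numpy as np
    import xarray as xr

    from openghg.standardise.meta import assign_flux_attributes

    # Minimal illustrative flux Dataset on a tiny lat/lon grid
    flux_ds = xr.Dataset(
        {"flux": (("time", "lat", "lon"), np.zeros((1, 2, 2)))},
        coords={
            "time": [np.datetime64("2020-01-01")],
            "lat": [50.0, 51.0],
            "lon": [-1.0, 0.0],
        },
    )

    # Assumed dictionary structure for this sketch
    data = {
        "ch4_anthro": {
            "data": flux_ds,
            "metadata": {"species": "ch4", "source": "anthro", "domain": "EUROPE"},
        }
    }

    prior_info = {
        "EDGAR": {
            "version": "v4.3.2",
            "raw_resolution": "0.1 degree x 0.1 degree",
            "reference": "http://edgar.jrc.ec.europa.eu/overview.php?v=432_GHG",
        }
    }

    updated = assign_flux_attributes(
        data=data,
        species="ch4",
        source="anthro",
        domain="EUROPE",
        prior_info_dict=prior_info,
    )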

openghg.standardise.meta.define_species_label(species, species_filepath=None)[source]#

Define standardised label to use for observation datasets. This uses the data stored within the openghg_defs/data/species_info JSON file by default, with alternative names (‘alt’) defined within.

Formatting:
  • species label will be all lower case

  • any spaces will be replaced with underscores

  • if species or synonym cannot be found, the species name will be used but with any hyphens taken out (see also the openghg.util.clean_string function)

Note: Suggested naming for isotopologues should be d<species><isotope>, e.g. dCH4C13, or dCO2C14

Parameters:
  • species (str) – Species name.

  • species_filepath (Union[str, Path, None]) – Alternative species info file.

Returns:

Both the species label to be used exactly and the original attribute key needed to extract additional data from the ‘species_info.json’ attributes file.

Return type:

str, str

Example

>>> define_species_label("methane")
("ch4", "CH4")
>>> define_species_label("radon")
("rn", "Rn")
>>> define_species_label("cfc-11")
("cfc11", "CFC11")
>>> define_species_label("CH4C13")
("dch4c13", "DCH4C13")

openghg.standardise.meta.attributes_default_keys()[source]#

Defines default values expected within ObsSurface metadata.

Returns:

keys required in metadata

Return type:

list

openghg.standardise.meta.sync_surface_metadata(metadata, attributes, keys_to_add=None, update_mismatch='never')[source]#

Makes sure any duplicated keys between the metadata and attributes dictionaries match and that certain keys are present in the metadata.

Parameters:
  • metadata (dict) – Dictionary of metadata

  • attributes (dict) – Attributes

  • keys_to_add (Optional[list]) – Add these keys to the metadata, if not present, based on the attribute values. Note: this skips any keys which can’t be copied from the attribute values.

  • update_mismatch (str) –

    If a case-insensitive mismatch is found between an attribute and a metadata value, this determines the function behaviour. This includes the options:

    • ”never” - don’t update mismatches and raise an AttrMismatchError

    • ”from_source” / “attributes” - update mismatches based on input attributes

    • ”from_definition” / “metadata” - update mismatches based on input metadata

Returns:

Aligned metadata, attributes

Return type:

dict, dict
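
A hedged sketch of aligning two small dictionaries with one deliberate mismatch; the key values are placeholders, and a real call would normally include the default keys listed by attributes_default_keys().

    from openghg.standardise.meta import sync_surface_metadata

    # Illustrative dictionaries; "inlet" deliberately disagrees
    metadata = {"site": "bsd", "species": "ch4", "inlet": "108m"}
    attributes = {
        "site": "bsd",
        "species": "ch4",
        "inlet": "108m_corrected",
        "instrument": "picarro",
    }

    # update_mismatch="metadata" resolves the "inlet" mismatch in favour of
    # the metadata value; "never" would raise an AttrMismatchError instead
    aligned_metadata, aligned_attributes = sync_surface_metadata(
        metadata=metadata,
        attributes=attributes,
        keys_to_add=["instrument"],
        update_mismatch="metadata",
    )
    print(aligned_metadata)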