Retrieve#

These functions handle the retrieval of data from the object store.

Search functions#

We have a number of search functions, most customised to the data type, which we hope will make it easier for users to find the data they require from the object store.

Surface observations#

To search for surface observations we recommend the use of search_surface.

openghg.retrieve.search_surface(species=None, site=None, inlet=None, height=None, instrument=None, data_level=None, data_sublevel=None, dataset_source=None, data_source=None, measurement_type=None, source_format=None, network=None, start_date=None, end_date=None, sampling_height=None, icos_data_level=None, **kwargs)[source]#

Cloud object store search.

Parameters:
  • species (Union[str, list[str], None]) – Species

  • site (Union[str, list[str], None]) – Three letter site code

  • inlet (Union[str, slice, None, list[Union[str, slice, None]]]) – Inlet height above ground level in metres; use slice(lower, upper) to search for a range of values. lower and upper can be int, float, or strings such as ‘100m’.

  • height (Union[str, slice, None, list[Union[str, slice, None]]]) – Alias for inlet

  • instrument (Union[str, list[str], None]) – Instrument name

  • data_level (Union[str, list[str], dict, None]) – Data quality assurance level (0-3)

  • data_sublevel (Union[str, list[str], None]) – Typically used for “L1” data depending on different QA performed before data is finalised.

  • data_source (Optional[str]) – Where data was retrieved from (used especially when retrieved from external archives) e.g. “internal”, “noaa_obspack”, “icoscp”, “ceda_archive”. This argument only needs to be used to narrow the search to data solely from retrieval methods.

  • dataset_source (Optional[str]) – External name applied to source of the dataset, for example “ICOS”, “InGOS”, “European ObsPack”, “CEDA 2023.06”

  • measurement_type (Union[str, list[str], None]) – Measurement type

  • data_type – Data type e.g. “surface”, “column”, “flux” See openghg.store.spec.define_data_types() for full details.

  • start_date (Union[str, list[str], None]) – Start date

  • end_date (Union[str, list[str], None]) – End date

  • sampling_height (Optional[str]) – Sampling height of measurements

  • icos_data_level (Union[int, str, None]) – ICOS data level, see ICOS documentation

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults
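The inlet argument accepts either a single height or slice(lower, upper), where the bounds may be int, float, or strings such as ‘100m’. The sketch below shows one way such bounds could be normalised and compared. parse_height and inlet_matches are hypothetical helpers written for illustration and are not part of OpenGHG; treating both bounds as inclusive is also an assumption.

```python
from typing import Union

Height = Union[int, float, str]


def parse_height(value: Height) -> float:
    """Convert 100, 100.0 or '100m' to a float number of metres."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text.endswith("m"):
            text = text[:-1]
        return float(text)
    return float(value)


def inlet_matches(stored: Height, query: Union[Height, slice, None]) -> bool:
    """Return True if a stored inlet height satisfies the query.

    Assumption: slice bounds are inclusive and an open bound is unbounded.
    """
    if query is None:
        return True  # no inlet constraint given
    height = parse_height(stored)
    if isinstance(query, slice):
        lower = float("-inf") if query.start is None else parse_height(query.start)
        upper = float("inf") if query.stop is None else parse_height(query.stop)
        return lower <= height <= upper
    return height == parse_height(query)


# With OpenGHG installed and a populated object store, a range query
# would be passed straight to the documented function, for example:
#   from openghg.retrieve import search_surface
#   results = search_surface(site="mhd", species="ch4", inlet=slice(5, "50m"))
```

Mixed bound types such as slice(5, "50m") are handled because each bound is parsed independently.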

Flux data#

openghg.retrieve.search_flux(species=None, source=None, domain=None, database=None, database_version=None, model=None, start_date=None, end_date=None, time_resolved=None, high_time_resolution=None, period=None, continuous=None, **kwargs)[source]#

Search for flux / emissions data.

Parameters:
  • species (Optional[str]) – Species name

  • domain (Optional[str]) – Flux / Emissions domain

  • source (Optional[str]) – Flux / Emissions source

  • database (Optional[str]) – Name of database source for this input (if relevant)

  • database_version (Optional[str]) – Name of database version (if relevant)

  • model (Optional[str]) – Model name (if relevant)

  • source_format – Type of data being input e.g. openghg (internal format)

  • time_resolved (Optional[bool]) – If this is a high resolution file

  • period (Union[str, tuple, None]) –

    Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of:

    • “yearly”, “monthly”

    • suitable pandas Offset Alias

    • tuple of (value, unit) as would be passed to pandas.Timedelta function

  • high_time_resolution (Optional[bool]) – This argument is deprecated and will be replaced in future versions with time_resolved.

  • continuous (Optional[bool]) – Whether time stamps have to be continuous.

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults
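Because high_time_resolution is deprecated in favour of time_resolved, calling code may want to accept both for a while. The following is a minimal sketch of that pattern using only the standard library. resolve_time_resolved is a hypothetical helper, and letting time_resolved take precedence when both are given is an assumption rather than documented OpenGHG behaviour.

```python
import warnings
from typing import Optional


def resolve_time_resolved(
    time_resolved: Optional[bool] = None,
    high_time_resolution: Optional[bool] = None,
) -> Optional[bool]:
    """Map the deprecated keyword onto its replacement.

    Assumption: an explicit time_resolved value wins over the deprecated one.
    """
    if high_time_resolution is not None:
        warnings.warn(
            "high_time_resolution is deprecated; use time_resolved instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if time_resolved is None:
            return high_time_resolution
    return time_resolved
```

The resolved value can then be passed on as time_resolved, so the deprecated keyword never reaches the search call.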

Boundary conditions data#

openghg.retrieve.search_bc(species=None, bc_input=None, domain=None, start_date=None, end_date=None, period=None, continuous=None, **kwargs)[source]#

Search for boundary condition data.

Parameters:
  • species (Optional[str]) – Species name

  • bc_input (Optional[str]) –

    Input used to create boundary conditions. For example:

    • a model name such as “MOZART” or “CAMS”

    • a description such as “UniformAGAGE” (uniform values based on AGAGE average)

  • domain (Optional[str]) – Region for boundary conditions

  • start_date (Optional[str]) – Start date (inclusive) for boundary conditions

  • end_date (Optional[str]) – End date (exclusive) for boundary conditions

  • period (Union[str, tuple, None]) –

    Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of:

    • “yearly”, “monthly”

    • suitable pandas Offset Alias

    • tuple of (value, unit) as would be passed to pandas.Timedelta function

  • continuous (Optional[bool]) – Whether time stamps have to be continuous.

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults
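The start_date is described as inclusive and the end_date as exclusive, which makes the search window a half-open interval. A small sketch of that convention with the standard library follows; in_window is a hypothetical helper for illustration only and says nothing about how OpenGHG compares dates internally.

```python
from datetime import datetime


def in_window(timestamp: str, start_date: str, end_date: str) -> bool:
    """Half-open interval test: start_date <= timestamp < end_date."""
    ts = datetime.fromisoformat(timestamp)
    return datetime.fromisoformat(start_date) <= ts < datetime.fromisoformat(end_date)


# The first instant of the window is included, the end date itself is not.
first_included = in_window("2020-01-01", "2020-01-01", "2021-01-01")
end_included = in_window("2021-01-01", "2020-01-01", "2021-01-01")
```

With this convention, consecutive windows such as 2020 and 2021 can share a boundary date without any timestamp falling into both.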

Eulerian data#

openghg.retrieve.search_eulerian(model=None, species=None, start_date=None, end_date=None, **kwargs)[source]#

Search for eulerian data.

Parameters:
  • model (Optional[str]) – Eulerian model name

  • species (Optional[str]) – Species name

  • start_date (Optional[str]) – Start date (inclusive) associated with model run

  • end_date (Optional[str]) – End date (exclusive) associated with model run

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults

Column / satellite data#

openghg.retrieve.search_column(satellite=None, domain=None, selection=None, site=None, species=None, network=None, instrument=None, platform=None, **kwargs)[source]#

Search column data.

Parameters:
  • satellite (Optional[str]) – Name of satellite (if relevant)

  • domain (Optional[str]) – For satellite only. If data has been selected on an area include the identifier name for domain covered. This can map to previously defined domains (see openghg_defs “domain_info.json” file) or a newly defined domain.

  • selection (Optional[str]) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”. If not specified, domain will be used.

  • site (Optional[str]) – Site code/name (if relevant). Can include satellite OR site.

  • species (Optional[str]) – Species name or synonym e.g. “ch4”

  • instrument (Optional[str]) – Instrument name e.g. “TANSO-FTS”

  • network (Optional[str]) – Name of in-situ or satellite network e.g. “TCCON”, “GOSAT”

  • platform (Optional[str]) –

    Type of platform. Should be one of:

    • “satellite”

    • “site”

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults

Footprints#

openghg.retrieve.search_footprints(site=None, inlet=None, domain=None, model=None, height=None, met_model=None, species=None, start_date=None, end_date=None, network=None, period=None, continuous=None, high_spatial_resolution=None, time_resolved=None, high_time_resolution=None, short_lifetime=None, **kwargs)[source]#

Search for footprints data.

Parameters:
  • site (Optional[str]) – Site name

  • inlet (Optional[str]) – Height above ground level in metres

  • domain (Optional[str]) – Domain of footprints

  • model (Optional[str]) – Model used to create footprint (e.g. NAME or FLEXPART)

  • height (Optional[str]) – Alias for inlet

  • met_model (Optional[str]) – Underlying meteorological model used (e.g. UKV)

  • species (Optional[str]) – Species name. Only needed if footprint is for a specific species e.g. co2 (and not inert)

  • network (Optional[str]) – Network name

  • period (Union[str, tuple, None]) – Period of measurements. Only needed if this can not be inferred from the time coords

  • continuous (Optional[bool]) – Whether time stamps have to be continuous.

  • retrieve_met – Whether to also download meteorological data for this footprints area

  • high_spatial_resolution (Optional[bool]) – Indicate footprints include both a low and high spatial resolution.

  • time_resolved (Optional[bool]) – Indicate footprints are high time resolution (include H_back dimension). Note this will be set to True automatically if species=“co2” (Carbon Dioxide).

  • high_time_resolution (Optional[bool]) – This argument is deprecated and will be replaced in future versions with time_resolved.

  • short_lifetime (Optional[bool]) – Indicate footprint is for a short-lived species. Needs species input. Note this will be set to True if species has an associated lifetime.

  • kwargs (Any) – Additional search terms

Returns:

SearchResults object

Return type:

SearchResults

General#

For a more general search you can use the search function directly. This function accepts any number of keyword arguments.

openghg.retrieve.search(**kwargs)[source]#

Search for observations data. Any keyword arguments may be passed to the function and these keywords will be used to search the metadata associated with each Datasource.

Though any types can be passed as keyword arguments, these will be interpreted in the following ways:
  • None - argument will be ignored.

  • list/tuple - an OR search will be created for the argument and each of the values.

  • dict - an OR search will be created for the key, value pairs. Note: in this case the name of the argument itself will be ignored.

  • str/other - argument used directly.

All input search values are formatted (openghg.utils.clean_string).

This function detects the running environment and routes the call to either the cloud or local search function.

Example / commonly used arguments are given below.

Parameters:
  • species – Terms to search for in Datasources

  • locations – Where to search for the terms in species

  • inlet – Inlet height such as 100m

  • instrument – Instrument name such as picarro

  • find_all – Require all search terms to be satisfied

  • start_date – Start datetime for search. If None a start datetime of UNIX epoch is set.

  • end_date – End datetime for search. If None an end datetime of the current datetime is set.

Returns:

SearchResults object if results found, otherwise None

Return type:

SearchResults or None
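The interpretation rules listed for search (None ignored, list/tuple as an OR over values, dict as an OR over key, value pairs, anything else used directly) can be pictured with a small in-memory matcher. This is only an illustrative sketch: matches and _clean are hypothetical names, the lower-casing in _clean is an assumed stand-in for the string cleaning step, and requiring every supplied argument to be satisfied is an assumption rather than a description of OpenGHG internals.

```python
from typing import Any, Dict


def _clean(value: Any) -> str:
    """Assumed stand-in for the cleaning step: strip and lower-case."""
    return str(value).strip().lower()


def matches(metadata: Dict[str, Any], **kwargs: Any) -> bool:
    """Return True if a metadata record satisfies every search argument."""
    record = {key: _clean(value) for key, value in metadata.items()}
    for name, term in kwargs.items():
        if term is None:
            continue  # None: the argument is ignored
        if isinstance(term, dict):
            # dict: OR over the key, value pairs; the argument name is ignored
            ok = any(record.get(k) == _clean(v) for k, v in term.items())
        elif isinstance(term, (list, tuple)):
            # list/tuple: OR over the values for this argument
            ok = any(record.get(name) == _clean(v) for v in term)
        else:
            # str/other: used directly
            ok = record.get(name) == _clean(term)
        if not ok:
            return False
    return True


# Hypothetical metadata records, invented for this example.
records = [
    {"site": "bsd", "species": "ch4", "inlet": "248m"},
    {"site": "mhd", "species": "co2", "inlet": "10m"},
]
hits = [r["site"] for r in records if matches(r, species=["ch4", "n2o"], inlet=None)]
```

Here hits contains only the first record's site, because the species list acts as an OR search and the None inlet is skipped.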

Retrieving from other data sources#

ICOS#

OpenGHG can retrieve data from the ICOS Carbon Portal.

openghg.retrieve.icos.retrieve_atmospheric(site, species=None, inlet=None, sampling_height=None, start_date=None, end_date=None, force_retrieval=False, data_level=2, dataset_source=None, store=None, update_mismatch='never', force=False)[source]#

Retrieve ICOS atmospheric measurement data. If data is found in the object store it is returned. Otherwise data will be retrieved from the ICOS Carbon Portal. Data retrieval from the Carbon Portal may take a short time. If only a single data source is found an ObsData object is returned, if multiple are found a list of ObsData objects is returned, and if nothing is found None is returned.

Parameters:
  • site (str) – Site code

  • species (Union[str, List, None]) – Species name

  • inlet (Optional[str]) – Height of the inlet for sampling in metres.

  • sampling_height (Optional[str]) – Alias for inlet

  • start_date (Optional[str]) – Start date

  • end_date (Optional[str]) – End date

  • force_retrieval (bool) – Force the retrieval of data from the ICOS Carbon Portal

  • data_level (int) –

    ICOS data level (1, 2):

    • Data level 1 – Near Real Time Data (NRT) or Internal Work data (IW).

    • Data level 2 – The final quality checked ICOS RI data set, published by the CFs, to be distributed through the Carbon Portal. This level is the ICOS data product and is freely available to users.

    See https://icos-carbon-portal.github.io/pylib/modules/#stationdatalevelnone

  • dataset_source (Optional[str]) – Dataset source name, for example ICOS, InGOS, European ObsPack

  • store (Optional[str]) – Name of object store to search/store data to

  • update_mismatch (str) –

    This determines how mismatches between the “metadata” derived from stored data and “attributes” derived from ICOS Header are handled. This includes the options:

    • “never” - don’t update mismatches and raise an AttrMismatchError

    • “from_source” / “attributes” - update mismatches based on attributes from ICOS Header

    • “from_definition” / “metadata” - update mismatches based on input metadata

  • force (bool) – Force adding of data even if this is identical to data stored (checked based on previously retrieved file hashes).

Return type:

Union[ObsData, List[ObsData], None]

Returns:

ObsData, list[ObsData] or None
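Since retrieve_atmospheric may return a single ObsData, a list of ObsData, or None, downstream code often benefits from normalising the result to a list first. The helper below is a hypothetical convenience, not part of OpenGHG, and works for any object type.

```python
from typing import Any, List, Union


def as_list(result: Union[Any, List[Any], None]) -> List[Any]:
    """Normalise a 'single object, list, or None' return value to a list."""
    if result is None:
        return []
    if isinstance(result, list):
        return result
    return [result]


# With OpenGHG installed, usage could look like this (site code is illustrative):
#   from openghg.retrieve.icos import retrieve_atmospheric
#   for obs in as_list(retrieve_atmospheric(site="WAO", species="co2")):
#       ...  # handle each ObsData object
```

A loop over as_list(...) then covers all three cases without separate branches.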

CEDA#

Pulling from the CEDA archive is also possible. After finding the URL to the dataset you require you can retrieve it using

openghg.retrieve.ceda.retrieve_surface(site=None, species=None, inlet=None, url=None, force_retrieval=False, additional_metadata=None, store=None)[source]#

Retrieve surface measurements from the CEDA archive. This function will route the call to either local or cloud functions based on the environment.

Parameters:
  • site (Optional[str]) – Site name

  • species (Optional[str]) – Species name

  • inlet – Inlet height

  • url (Optional[str]) – URL of data in CEDA archive

  • force_retrieval (bool) – Force the retrieval of data from a URL

  • additional_metadata (Optional[Dict]) – Additional metadata to pass if the returned data doesn't contain everything we need. At the moment we try and find site and inlet keys if they aren't found in the dataset's attributes. For example: {“site”: “AAA”, “inlet”: “10m”}

  • store (Optional[str]) – Name of object store to use

Returns:

ObsData if data found / retrieved successfully.

Return type:

ObsData or None

Example

To retrieve new data from the CEDA archive using a URL:

>>> retrieve_surface(url="https://dap.ceda.ac.uk/badc/…")

To retrieve already cached data from the object store:

>>> retrieve_surface(site="BSD", species="ch4")

Specific retrieval functions#

openghg.retrieve.get_obs_surface(site, species, inlet=None, height=None, start_date=None, end_date=None, average=None, network=None, instrument=None, calibration_scale=None, keep_missing=False, skip_ranking=False, **kwargs)[source]#

This is the equivalent of the get_obs function from the ACRG repository.

Usage and return values are the same whilst implementation may differ.

Parameters:
  • site (str) – Site of interest e.g. MHD for the Mace Head site.

  • species (str) – Species identifier e.g. ch4 for methane.

  • start_date (Union[str, Timestamp, None]) – Output start date in a format that Pandas can interpret

  • end_date (Union[str, Timestamp, None]) – Output end date in a format that Pandas can interpret

  • inlet (Union[str, slice, None]) – Inlet height above ground level in metres; This can be a single value or slice(lower, upper) can be used to search for a range of values. lower and upper can be int, float, or strings such as ‘100m’.

  • height (Optional[str]) – Alias for inlet

  • average (Optional[str]) – Averaging period for each dataset. Each value should be a string of the form e.g. "2H", "30min" (should match pandas offset aliases format).

  • keep_missing (bool) – Keep missing data points or drop them.

  • network (Optional[str]) – Network for the site/instrument (must match number of sites).

  • instrument (Optional[str]) – Specific instrument for the site (must match number of sites).

  • calibration_scale (Optional[str]) – Convert to this calibration scale

  • kwargs (Any) – Additional search terms

Returns:

ObsData object if data found, else None

Return type:

ObsData or None

openghg.retrieve.get_flux(species, source, domain, database=None, database_version=None, model=None, start_date=None, end_date=None, time_resolution=None, **kwargs)[source]#

The flux function reads in all flux files for the domain and species as an xarray Dataset. Note that at present ALL flux data is read in per species per domain or by emissions name. To be consistent with the footprints, fluxes should be in mol/m2/s.

Parameters:
  • species (str) – Species name

  • source (str) – Source name

  • domain (str) – Domain e.g. EUROPE

  • start_date (Union[str, Timestamp, None]) – Start date

  • end_date (Union[str, Timestamp, None]) – End date

  • time_resolution (Optional[str]) – One of [“standard”, “high”]

  • kwargs (Any) – Additional search terms

Returns:

FluxData object

Return type:

FluxData

openghg.retrieve.get_footprint(site, domain, inlet=None, height=None, model=None, start_date=None, end_date=None, species=None, **kwargs)[source]#

Get footprints from one site.

Parameters:
  • site (str) – The name of the site given in the footprints. This often matches the site name, but may differ if the same site's footprints are run with a different met and are named slightly differently from the obs file, e.g. site=“DJI”, site_modifier=“DJI-SAM” for a station called DJI whose footprints site is called DJI-SAM.

  • domain (str) – Domain name for the footprints

  • inlet (Optional[str]) – Height above ground level in metres

  • height (Optional[str]) – Alias for inlet

  • model (Optional[str]) – Model used to create footprint (e.g. NAME or FLEXPART)

  • start_date (Union[str, Timestamp, None]) – Output start date in a format that Pandas can interpret

  • end_date (Union[str, Timestamp, None]) – Output end date in a format that Pandas can interpret

  • species (Optional[str]) – Species identifier e.g. “co2” for carbon dioxide. Only needed if the species requires footprints modified from the typical 30-day footprints appropriate for a long-lived species (like methane), e.g. for high time resolution (co2) or for a short-lived species.

  • kwargs (Any) – Additional search terms

Returns:

FootprintData dataclass

Return type:

FootprintData

openghg.retrieve.get_bc(species, domain, bc_input=None, start_date=None, end_date=None, **kwargs)[source]#

Get boundary conditions for a given species, domain and bc_input name.

Parameters:
  • species (str) – Species name

  • bc_input (Optional[str]) –

    Input used to create boundary conditions. For example:

    • a model name such as “MOZART” or “CAMS”

    • a description such as “UniformAGAGE” (uniform values based on AGAGE average)

  • domain (str) – Region for boundary conditions e.g. EUROPE

  • start_date (Union[str, Timestamp, None]) – Start date

  • end_date (Union[str, Timestamp, None]) – End date

Returns:

BoundaryConditionsData object

Return type:

BoundaryConditionsData
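The four specific retrieval functions are often used together for one site, species and domain. The sketch below gathers their results in a dictionary using only the keyword arguments listed in the signatures above. load_model_inputs is a hypothetical wrapper, it assumes OpenGHG is installed with a populated object store, and the argument values in the usage comment are illustrative.

```python
from typing import Any, Dict, Optional


def load_model_inputs(
    site: str,
    species: str,
    domain: str,
    source: str,
    inlet: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect observations, footprints, flux and boundary conditions."""
    # Imported lazily so this module can be loaded without OpenGHG present.
    from openghg.retrieve import get_bc, get_flux, get_footprint, get_obs_surface

    dates = {"start_date": start_date, "end_date": end_date}
    return {
        "obs": get_obs_surface(site=site, species=species, inlet=inlet, **dates),
        "footprint": get_footprint(site=site, domain=domain, inlet=inlet, **dates),
        "flux": get_flux(species=species, source=source, domain=domain, **dates),
        "bc": get_bc(species=species, domain=domain, **dates),
    }


# Example call (values are illustrative and depend on what the store holds):
#   inputs = load_model_inputs(site="MHD", species="ch4", domain="EUROPE",
#                              source="total", start_date="2020-01-01",
#                              end_date="2020-02-01")
```

Passing the same date range to every call keeps the four datasets aligned in time.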