Retrieve
These functions handle the retrieval of data from the object store.
Search functions
OpenGHG provides several search functions, most of them tailored to a data type, to make it easier to find the data you need in the object store.
Surface observations
To search for surface observations we recommend using search_surface.
- openghg.retrieve.search_surface(species=None, site=None, inlet=None, height=None, instrument=None, data_level=None, data_sublevel=None, dataset_source=None, data_source=None, measurement_type=None, source_format=None, network=None, start_date=None, end_date=None, sampling_height=None, icos_data_level=None, **kwargs)
Cloud object store search.
- Parameters:
species (Union[str, list[str], None]) – Species
site (Union[str, list[str], None]) – Three letter site code
inlet (Union[str, slice, None, list[str | slice | None]]) – Inlet height above ground level in metres; use slice(lower, upper) to search for a range of values. lower and upper can be int, float, or strings such as ‘100m’.
height (Union[str, slice, None, list[str | slice | None]]) – Alias for inlet
instrument (Union[str, list[str], None]) – Instrument name
data_level (Union[str, list[str], dict, None]) – Data quality assurance level (0-3)
data_sublevel (Union[str, list[str], None]) – Typically used for “L1” data depending on different QA performed before data is finalised.
data_source (Optional[str]) – Where data was retrieved from (used especially when retrieved from external archives) e.g. “internal”, “noaa_obspack”, “icoscp”, “ceda_archive”. This argument only needs to be used to narrow the search to data solely from retrieval methods.
dataset_source (Optional[str]) – External name applied to source of the dataset, for example “ICOS”, “InGOS”, “European ObsPack”, “CEDA 2023.06”
measurement_type (Union[str, list[str], None]) – Measurement type
data_type – Data type e.g. “surface”, “column”, “flux”. See openghg.store.spec.define_data_types() for full details.
start_date (Union[str, list[str], None]) – Start date
end_date (Union[str, list[str], None]) – End date
sampling_height (Optional[str]) – Sampling height of measurements
icos_data_level (Union[int, str, None]) – ICOS data level, see ICOS documentation
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
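The inlet parameter accepts either a single height or slice(lower, upper), where the bounds may be int, float, or strings such as "100m". The sketch below illustrates how such a range query could be evaluated against a stored inlet value. It is not OpenGHG's implementation: the helper names inlet_to_metres and inlet_matches are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import math
import re
from typing import Union

Height = Union[int, float, str]


def inlet_to_metres(value: Height) -> float:
    """Convert 100, 100.0, "100" or "100m" to a height in metres (hypothetical helper)."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*m?\s*", str(value))
    if match is None:
        raise ValueError(f"Cannot interpret inlet height: {value!r}")
    return float(match.group(1))


def inlet_matches(stored: Height, query: Union[Height, slice, None]) -> bool:
    """Return True if a stored inlet satisfies the query (hypothetical helper).

    Assumption: slice bounds are inclusive and an open bound means "no limit".
    """
    if query is None:
        # None means the argument is not used to filter
        return True
    height = inlet_to_metres(stored)
    if isinstance(query, slice):
        lower = -math.inf if query.start is None else inlet_to_metres(query.start)
        upper = math.inf if query.stop is None else inlet_to_metres(query.stop)
        return lower <= height <= upper
    return height == inlet_to_metres(query)


# Mixed bound types are allowed, as in the parameter description
print(inlet_matches("100m", slice(50, "150m")))
print(inlet_matches("10m", slice(50, 150)))
```

With the real function, the equivalent query would be passed as inlet=slice(50, "150m") alongside the other search terms.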
Flux data
- openghg.retrieve.search_flux(species=None, source=None, domain=None, database=None, database_version=None, model=None, start_date=None, end_date=None, time_resolved=None, high_time_resolution=None, period=None, continuous=None, **kwargs)
Search for flux / emissions data.
- Parameters:
species (Optional[str]) – Species name
domain (Optional[str]) – Flux / Emissions domain
source (Optional[str]) – Flux / Emissions source
database (Optional[str]) – Name of database source for this input (if relevant)
database_version (Optional[str]) – Name of database version (if relevant)
model (Optional[str]) – Model name (if relevant)
source_format – Type of data being input e.g. openghg (internal format)
time_resolved (Optional[bool]) – Whether this is a high resolution file
period (Union[str, tuple, None]) – Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of: “yearly” or “monthly”; a suitable pandas Offset Alias; a tuple of (value, unit) as would be passed to the pandas.Timedelta function.
high_time_resolution (Optional[bool]) – This argument is deprecated and will be replaced in future versions with time_resolved.
continuous (Optional[bool]) – Whether time stamps have to be continuous.
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
Boundary conditions data
- openghg.retrieve.search_bc(species=None, bc_input=None, domain=None, start_date=None, end_date=None, period=None, continuous=None, **kwargs)
Search for boundary condition data.
- Parameters:
species (Optional[str]) – Species name
bc_input (Optional[str]) – Input used to create boundary conditions, for example a model name such as “MOZART” or “CAMS”, or a description such as “UniformAGAGE” (uniform values based on AGAGE average)
domain (Optional[str]) – Region for boundary conditions
start_date (Optional[str]) – Start date (inclusive) for boundary conditions
end_date (Optional[str]) – End date (exclusive) for boundary conditions
period (Union[str, tuple, None]) – Period of measurements. Only needed if this cannot be inferred from the time coords. If specified, should be one of: “yearly” or “monthly”; a suitable pandas Offset Alias; a tuple of (value, unit) as would be passed to the pandas.Timedelta function.
continuous (Optional[bool]) – Whether time stamps have to be continuous.
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
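The start_date is documented as inclusive and the end_date as exclusive, which describes a half-open interval. The sketch below shows what that convention implies for individual timestamps, using only the standard library. The function in_date_range is a hypothetical helper written for illustration, not part of OpenGHG, and it assumes ISO 8601 date strings.

```python
from datetime import datetime
from typing import Optional


def in_date_range(
    timestamp: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> bool:
    """Check start_date <= timestamp < end_date (hypothetical helper).

    A bound of None leaves that side of the interval open.
    """
    ts = datetime.fromisoformat(timestamp)
    if start_date is not None and ts < datetime.fromisoformat(start_date):
        return False
    if end_date is not None and ts >= datetime.fromisoformat(end_date):
        return False
    return True


# The first instant of 2020 is inside [2020-01-01, 2021-01-01) ...
print(in_date_range("2020-01-01", "2020-01-01", "2021-01-01"))
# ... but the first instant of 2021 is not, so yearly queries never overlap
print(in_date_range("2021-01-01", "2020-01-01", "2021-01-01"))
```

One consequence of the half-open convention is that back-to-back queries such as 2020-01-01 to 2021-01-01 and 2021-01-01 to 2022-01-01 cover every timestamp exactly once.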
Eulerian data
- openghg.retrieve.search_eulerian(model=None, species=None, start_date=None, end_date=None, **kwargs)
Search for Eulerian data.
- Parameters:
model (Optional[str]) – Eulerian model name
species (Optional[str]) – Species name
start_date (Optional[str]) – Start date (inclusive) associated with model run
end_date (Optional[str]) – End date (exclusive) associated with model run
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
Column / satellite data
- openghg.retrieve.search_column(satellite=None, domain=None, selection=None, site=None, species=None, network=None, instrument=None, platform=None, **kwargs)
Search column data.
- Parameters:
satellite (Optional[str]) – Name of satellite (if relevant)
domain (Optional[str]) – For satellite only. If data has been selected on an area, include the identifier name for the domain covered. This can map to previously defined domains (see openghg_defs “domain_info.json” file) or a newly defined domain.
selection (Optional[str]) – For satellite only, identifier for any data selection which has been performed on satellite data. This can be based on any form of filtering, binning etc. but should be unique compared to other selections made e.g. “land”, “glint”, “upperlimit”. If not specified, domain will be used.
site (Optional[str]) – Site code/name (if relevant). Can include satellite OR site.
species (Optional[str]) – Species name or synonym e.g. “ch4”
instrument (Optional[str]) – Instrument name e.g. “TANSO-FTS”
network (Optional[str]) – Name of in-situ or satellite network e.g. “TCCON”, “GOSAT”
platform (Optional[str]) – Type of platform. Should be one of “satellite” or “site”.
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
Footprints
- openghg.retrieve.search_footprints(site=None, inlet=None, domain=None, model=None, height=None, met_model=None, species=None, start_date=None, end_date=None, network=None, period=None, continuous=None, high_spatial_resolution=None, time_resolved=None, high_time_resolution=None, short_lifetime=None, **kwargs)
Search for footprints data.
- Parameters:
site (Optional[str]) – Site name
inlet (Optional[str]) – Height above ground level in metres
domain (Optional[str]) – Domain of footprints
model (Optional[str]) – Model used to create footprint (e.g. NAME or FLEXPART)
height (Optional[str]) – Alias for inlet
met_model (Optional[str]) – Underlying meteorological model used (e.g. UKV)
species (Optional[str]) – Species name. Only needed if footprint is for a specific species e.g. co2 (and not inert)
network (Optional[str]) – Network name
period (Union[str, tuple, None]) – Period of measurements. Only needed if this cannot be inferred from the time coords
continuous (Optional[bool]) – Whether time stamps have to be continuous.
retrieve_met – Whether to also download meteorological data for this footprints area
high_spatial_resolution (Optional[bool]) – Indicate footprints include both a low and high spatial resolution.
time_resolved (Optional[bool]) – Indicate footprints are high time resolution (include H_back dimension). Note this will be set to True automatically if species="co2" (carbon dioxide).
high_time_resolution (Optional[bool]) – This argument is deprecated and will be replaced in future versions with time_resolved.
short_lifetime (Optional[bool]) – Indicate footprint is for a short-lived species. Needs species input. Note this will be set to True if species has an associated lifetime.
kwargs (Any) – Additional search terms
- Returns:
SearchResults object
- Return type:
SearchResults
General
For a more general search you can use the search function directly. This function accepts any number of keyword arguments.
- openghg.retrieve.search(**kwargs)
Search for observations data. Any keyword arguments may be passed to the function and these keywords will be used to search the metadata associated with each Datasource.
Though any type can be passed as a keyword argument, values are interpreted in the following ways:
- None: argument will be ignored.
- list/tuple: an OR search will be created for the argument and each of the values.
- dict: an OR search will be created for the key, value pairs. Note: in this case the name of the argument itself will be ignored.
- str/other: argument used directly.
All input search values are formatted (openghg.utils.clean_string).
This function detects the running environment and routes the call to either the cloud or local search function.
Example / commonly used arguments are given below.
- Parameters:
species – Terms to search for in Datasources
locations – Where to search for the terms in species
inlet – Inlet height such as 100m
instrument – Instrument name such as picarro
find_all – Require all search terms to be satisfied
start_date – Start datetime for search. If None, a start datetime of the UNIX epoch is set.
end_date – End datetime for search. If None, an end datetime of the current datetime is set.
- Returns:
SearchResults object if results found, otherwise None
- Return type:
SearchResults or None
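The interpretation rules for keyword arguments can be sketched with plain dictionaries standing in for Datasource metadata. This is an illustrative toy, not the OpenGHG implementation: build_clauses, matches and clean are hypothetical names, clean is only a rough stand-in for openghg.utils.clean_string, and combining separate arguments with AND is an assumption.

```python
from typing import Any

# One clause is a list of (key, value) alternatives joined by OR
Clause = list[tuple[str, str]]


def clean(value: Any) -> str:
    """Rough stand-in for the formatting step: lower-case, no padding."""
    return str(value).strip().lower()


def build_clauses(**kwargs: Any) -> list[Clause]:
    """Turn keyword arguments into OR-clauses following the rules above."""
    clauses: list[Clause] = []
    for name, value in kwargs.items():
        if value is None:
            continue  # None: argument is ignored
        if isinstance(value, (list, tuple)):
            # list/tuple: OR over the values for this argument
            clauses.append([(name, clean(v)) for v in value])
        elif isinstance(value, dict):
            # dict: OR over the key, value pairs; the argument name is ignored
            clauses.append([(key, clean(v)) for key, v in value.items()])
        else:
            # str/other: used directly
            clauses.append([(name, clean(value))])
    return clauses


def matches(metadata: dict[str, str], clauses: list[Clause]) -> bool:
    """Assumption: every clause must be satisfied by at least one alternative."""
    return all(any(metadata.get(k) == v for k, v in clause) for clause in clauses)


records = [
    {"site": "mhd", "species": "ch4", "inlet": "10m"},
    {"site": "tac", "species": "co2", "inlet": "100m"},
]
clauses = build_clauses(site=["MHD", "TAC"], species="CH4", instrument=None)
print([r["site"] for r in records if matches(r, clauses)])
```

In this toy, the instrument=None argument contributes no clause at all, which mirrors the rule that None arguments are ignored rather than matched against missing metadata.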
Retrieving from other data sources
ICOS
OpenGHG can retrieve data from the ICOS Carbon Portal.
- openghg.retrieve.icos.retrieve_atmospheric(site, species=None, inlet=None, sampling_height=None, start_date=None, end_date=None, force_retrieval=False, data_level=2, dataset_source=None, store=None, update_mismatch='never', force=False)
Retrieve ICOS atmospheric measurement data. If data is found in the object store it is returned. Otherwise data will be retrieved from the ICOS Carbon Portal. Data retrieval from the Carbon Portal may take a short time. If only a single data source is found an ObsData object is returned, if multiple are found a list of ObsData objects is returned, and if nothing is found None is returned.
- Parameters:
site (str) – Site code
species (Union[str, list, None]) – Species name
inlet (Optional[str]) – Height of the inlet for sampling in metres.
sampling_height (Optional[str]) – Alias for inlet
start_date (Optional[str]) – Start date
end_date (Optional[str]) – End date
force_retrieval (bool) – Force the retrieval of data from the ICOS Carbon Portal
data_level (int) – ICOS data level (1, 2). Data level 1: Near Real Time Data (NRT) or Internal Work data (IW). Data level 2: The final quality checked ICOS RI data set, published by the CFs, to be distributed through the Carbon Portal. This level is the ICOS data product and is freely available to users. See https://icos-carbon-portal.github.io/pylib/modules/#stationdatalevelnone
dataset_source (Optional[str]) – Dataset source name, for example ICOS, InGOS, European ObsPack
store (Optional[str]) – Name of object store to search/store data to
update_mismatch (str) – This determines how mismatches between the “metadata” derived from stored data and “attributes” derived from the ICOS Header are handled. The options are:
- “never”: don’t update mismatches and raise an AttrMismatchError
- “from_source” / “attributes”: update mismatches based on attributes from the ICOS Header
- “from_definition” / “metadata”: update mismatches based on input metadata
force (bool) – Force adding of data even if this is identical to data stored (checked based on previously retrieved file hashes).
- Returns:
ObsData, list[ObsData] or None
- Return type:
ObsData, list[ObsData] or None
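The three update_mismatch behaviours can be illustrated with two plain dictionaries, one for the stored “metadata” and one for the header “attributes”. The sketch below is a simplified reading of the option descriptions, not the OpenGHG code: resolve_mismatches is a hypothetical function, the AttrMismatchError class here is a local stand-in for the real exception, and only keys present in both dictionaries are compared (an assumption).

```python
class AttrMismatchError(Exception):
    """Local stand-in for the error named in the option description."""


def resolve_mismatches(
    metadata: dict, attributes: dict, update_mismatch: str = "never"
) -> tuple[dict, dict]:
    """Return reconciled copies of (metadata, attributes) (hypothetical helper)."""
    # Assumption: only keys found in both dictionaries can mismatch
    mismatched = {
        key
        for key in metadata.keys() & attributes.keys()
        if metadata[key] != attributes[key]
    }
    new_metadata, new_attributes = dict(metadata), dict(attributes)
    if not mismatched:
        return new_metadata, new_attributes

    if update_mismatch == "never":
        raise AttrMismatchError(f"Mismatched keys: {sorted(mismatched)}")
    if update_mismatch in ("from_source", "attributes"):
        winner = attributes  # values from the data source header win
    elif update_mismatch in ("from_definition", "metadata"):
        winner = metadata  # values from the input metadata win
    else:
        raise ValueError(f"Unknown update_mismatch option: {update_mismatch!r}")

    for key in mismatched:
        new_metadata[key] = winner[key]
        new_attributes[key] = winner[key]
    return new_metadata, new_attributes


stored = {"site": "mhd", "inlet": "10m"}
header = {"site": "mhd", "inlet": "24m", "units": "ppb"}
print(resolve_mismatches(stored, header, "from_source"))
```

The default of "never" is the conservative choice: a disagreement between the two sources stops the operation instead of silently picking one side.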
CEDA
Pulling from the CEDA archive is also possible. After finding the URL of the dataset you require, you can retrieve it using retrieve_surface.
- openghg.retrieve.ceda.retrieve_surface(site=None, species=None, inlet=None, url=None, force_retrieval=False, additional_metadata=None, store=None)
Retrieve surface measurements from the CEDA archive. This function will route the call to either local or cloud functions based on the environment.
- Parameters:
site (Optional[str]) – Site name
species (Optional[str]) – Species name
inlet – Inlet height
url (Optional[str]) – URL of data in CEDA archive
force_retrieval (bool) – Force the retrieval of data from a URL
additional_metadata (Optional[dict]) – Additional metadata to pass if the returned data doesn't contain everything we need. At the moment we try to find the site and inlet keys if they aren't found in the dataset's attributes. For example: {“site”: “AAA”, “inlet”: “10m”}
store (Optional[str]) – Name of object store to use
- Returns:
ObsData if data found / retrieved successfully.
- Return type:
ObsData or None
Example
To retrieve new data from the CEDA archive using a URL:
>>> retrieve_surface(url="https://dap.ceda.ac.uk/badc/…")
To retrieve already cached data from the object store:
>>> retrieve_surface(site="BSD", species="ch4")
Specific retrieval functions
- openghg.retrieve.get_obs_surface(site, species, inlet=None, height=None, start_date=None, end_date=None, average=None, network=None, instrument=None, calibration_scale=None, rename_vars=True, keep_missing=False, skip_ranking=False, **kwargs)
This is the equivalent of the get_obs function from the ACRG repository.
Usage and return values are the same whilst implementation may differ.
- Parameters:
site (str) – Site of interest e.g. MHD for the Mace Head site.
species (str) – Species identifier e.g. ch4 for methane.
start_date (Union[str, Timestamp, None]) – Output start date in a format that Pandas can interpret
end_date (Union[str, Timestamp, None]) – Output end date in a format that Pandas can interpret
inlet (Union[str, slice, None]) – Inlet height above ground level in metres. This can be a single value, or slice(lower, upper) can be used to search for a range of values. lower and upper can be int, float, or strings such as ‘100m’.
height (Optional[str]) – Alias for inlet
average (Optional[str]) – Averaging period for each dataset. Each value should be a string of the form e.g. "2H", "30min" (should match pandas offset aliases format)
keep_missing (bool) – Keep missing data points or drop them.
network (Optional[str]) – Network for the site/instrument (must match number of sites).
instrument (Optional[str]) – Specific instrument for the site (must match number of sites).
calibration_scale (Optional[str]) – Convert to this calibration scale
rename_vars (bool) – Rename variables from species names to use “mf” explicitly.
kwargs (Any) – Additional search terms
- Returns:
ObsData object if data found, else None
- Return type:
ObsData or None
- openghg.retrieve.get_flux(species, source, domain, database=None, database_version=None, model=None, start_date=None, end_date=None, time_resolution=None, **kwargs)
The flux function reads in all flux files for the domain and species as an xarray Dataset. Note that at present ALL flux data is read in per species per domain or by emissions name. To be consistent with the footprints, fluxes should be in mol/m2/s.
- Parameters:
species (str) – Species name
source (str) – Source name
domain (str) – Domain e.g. EUROPE
start_date (Union[str, Timestamp, None]) – Start date
end_date (Union[str, Timestamp, None]) – End date
time_resolution (Optional[str]) – One of [“standard”, “high”]
kwargs (Any) – Additional search terms
- Returns:
FluxData object
- Return type:
FluxData
- openghg.retrieve.get_footprint(site, domain, inlet=None, height=None, model=None, start_date=None, end_date=None, species=None, **kwargs)
Get footprints from one site.
- Parameters:
site (str) – The name of the site as given in the footprints. This often matches the site name, but it can differ if footprints for the same site were run with a different met and named slightly differently from the obs file. E.g. site="DJI", site_modifier="DJI-SAM": station called DJI, footprints site called DJI-SAM
domain (str) – Domain name for the footprints
inlet (Optional[str]) – Height above ground level in metres
height (Optional[str]) – Alias for inlet
model (Optional[str]) – Model used to create footprint (e.g. NAME or FLEXPART)
start_date (Union[str, Timestamp, None]) – Output start date in a format that Pandas can interpret
end_date (Union[str, Timestamp, None]) – Output end date in a format that Pandas can interpret
species (Optional[str]) – Species identifier e.g. “co2” for carbon dioxide. Only needed if the species requires footprints that differ from the typical 30-day footprints appropriate for a long-lived species (like methane), e.g. high time resolution footprints (co2) or footprints for a short-lived species.
kwargs (Any) – Additional search terms
- Returns:
FootprintData dataclass
- Return type:
FootprintData
- openghg.retrieve.get_bc(species, domain, bc_input=None, start_date=None, end_date=None, **kwargs)
Get boundary conditions for a given species, domain and bc_input name.
- Parameters:
species (str) – Species name
bc_input (Optional[str]) – Input used to create boundary conditions, for example a model name such as “MOZART” or “CAMS”, or a description such as “UniformAGAGE” (uniform values based on AGAGE average)
domain (str) – Region for boundary conditions e.g. EUROPE
start_date (Union[str, Timestamp, None]) – Start date
end_date (Union[str, Timestamp, None]) – End date
- Returns:
BoundaryConditionsData object
- Return type:
BoundaryConditionsData