Util#

Helper functions used throughout OpenGHG, from file hashing to timestamp handling.

Domain#

openghg.util.convert_longitude(longitude, return_index=False)[source]#

Convert longitude extent to the range -180 to 180 degrees and reorder.

Parameters:
  • longitude (ndarray) – Array of valid longitude values in degrees.

  • return_index (bool) – Return re-ordering index as well as updated longitude

Returns:

Updated longitude values and new indices if requested.

Return type:

ndarray(, ndarray)
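The wrap-and-reorder step can be sketched in plain Python. This is an illustration only: it assumes values above 180 are shifted down by 360 and the result is sorted in ascending order. `wrap_longitude` is a hypothetical name, and the sketch uses lists, whereas the OpenGHG function works on NumPy arrays.

```python
def wrap_longitude(longitude, return_index=False):
    """Hypothetical list-based sketch of converting 0-360 longitudes to -180-180."""
    # Assumption: anything above 180 degrees is shifted into the negative half.
    wrapped = [lon - 360 if lon > 180 else lon for lon in longitude]
    # Indices that put the wrapped values in ascending order.
    index = sorted(range(len(wrapped)), key=lambda i: wrapped[i])
    reordered = [wrapped[i] for i in index]
    if return_index:
        return reordered, index
    return reordered


lons, order = wrap_longitude([0, 90, 180, 270], return_index=True)
print(lons)   # [-90, 0, 90, 180]
print(order)  # [3, 0, 1, 2]
```

The returned index can be used to reorder any data array that shares the longitude axis.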

openghg.util.find_domain(domain, domain_filename=None)[source]#

Finds the latitude and longitude values in degrees associated with a given domain name.

Parameters:
  • domain (str) – Pre-defined domain name

  • domain_filename (Union[str, Path, None]) – Alternative domain info file. Defaults to openghg_defs input.

Returns:

Latitude and longitude values for the domain in degrees.

Return type:

array, array

Downloading data#

openghg.util.download_data(url, filepath=None, timeout=10)[source]#

Download data file, with progress bar.

Based on https://stackoverflow.com/a/63831344/1303032

Parameters:
  • url (str) – URL of content to download

  • filepath (Union[str, Path, None]) – Filepath to write out data

  • timeout – Timeout for HTTP request (seconds)

Returns:

Bytes if no filepath given

Return type:

bytes / None

openghg.util.parse_url_filename(url)[source]#

Get the filename from a (messy) URL.

Parameters:

url (str) – URL of file

Returns:

Filename

Return type:

str
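One way to pull a filename out of a URL with a query string or fragment is sketched below using only the standard library. `filename_from_url` is a hypothetical helper and is not necessarily how `parse_url_filename` is implemented.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Hypothetical sketch: take the last path segment, ignoring query and fragment."""
    path = urlparse(url).path  # drops "?query" and "#fragment"
    # Decode percent-escapes such as %20 before taking the final segment.
    return PurePosixPath(unquote(path)).name


print(filename_from_url("https://example.com/data/ch4%20obs.nc?download=1#top"))
# ch4 obs.nc
```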

File handling, compression#

openghg.util.compress(data)[source]#

Compress the given data

Parameters:

data (bytes) – Binary data

Returns:

Compressed data

Return type:

bytes

openghg.util.compress_json(data)[source]#

Convert object to JSON string and compress

Parameters:

data (Any) – Object to pass to json.dumps

Returns:

Compressed binary data

Return type:

bytes

openghg.util.compress_str(s)[source]#

Compress a string

Parameters:

s (str) – String

Returns:

Compressed data

Return type:

bytes

openghg.util.decompress(data)[source]#

Decompress the given data

Parameters:

data (bytes) – Compressed data

Returns:

Decompressed data

Return type:

bytes

openghg.util.decompress_json(data)[source]#

Decompress a string and load to JSON

Parameters:

data (bytes) – Compressed binary data

Return type:

Any

Returns:

Object loaded from JSON

openghg.util.decompress_str(data)[source]#

Decompress a string from bytes

Parameters:

data (bytes) – Compressed data

Returns:

Decompressed str

Return type:

str
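The compress and decompress helpers above come in matching pairs, so a round trip should return the original object. The sketch below shows the idea for the JSON pair. It assumes bz2 compression and UTF-8 encoding, neither of which is stated on this page, and `pack_json` / `unpack_json` are hypothetical names.

```python
import bz2
import json


def pack_json(data):
    """Serialise to a JSON string, encode as UTF-8 and compress (bz2 is an assumption)."""
    return bz2.compress(json.dumps(data).encode("utf-8"))


def unpack_json(data):
    """Reverse of pack_json: decompress, decode and load the JSON."""
    return json.loads(bz2.decompress(data).decode("utf-8"))


record = {"site": "bsd", "inlets": ["108m", "248m"]}
packed = pack_json(record)
print(type(packed).__name__)          # bytes
print(unpack_json(packed) == record)  # True
```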

openghg.util.get_datapath(filename, directory=None)[source]#

Returns the correct path to data files used for assigning attributes

Parameters:

filename (Union[str, Path]) – Name of file to be accessed

Returns:

Path of file

Return type:

pathlib.Path

openghg.util.get_logfile_path()[source]#

Get the logfile path

Returns:

Path to logfile

Return type:

Path

openghg.util.load_column_parser(source_format)[source]#

Load a parsing object for the obscolumn data type. Used with openghg.standardise.column sub-module

Parameters:

source_format (str) – Name of data type e.g. OPENGHG

Returns:

parser function

Return type:

callable

openghg.util.load_column_source_parser(source_format)[source]#

Load a parsing object for the source of column data. Used with openghg.transform.column sub-module

Parameters:

source_format (str) – Name of data source e.g. GOSAT

Returns:

parser function

Return type:

callable

openghg.util.load_emissions_database_parser(database)[source]#

Load a parsing object for an emissions database. Used with openghg.transform.emissions sub-module

Parameters:

database (str) – Name of data source e.g. EDGAR

Returns:

parser function

Return type:

callable

openghg.util.load_emissions_parser(source_format)[source]#

Load a parsing object for the emissions data type. Used with openghg.standardise.emissions sub-module

Parameters:

source_format (str) – Name of data type e.g. OPENGHG

Returns:

parser function

Return type:

callable

openghg.util.load_json(filename, internal_data=False)[source]#

Returns a dictionary deserialised from JSON.

Parameters:
  • filename (Union[str, Path]) – Name of JSON file

  • internal_data (bool) – Whether to use data internal to OpenGHG. This refers to JSON files stored within the openghg/data/ folder. If this is set to False, the full path to the file needs to be included.

Returns:

Dictionary created from JSON

Return type:

dict

openghg.util.load_surface_parser(source_format)[source]#

Load parsing object for the obssurface data type. Used with openghg.standardise.surface sub-module

Parameters:

source_format (str) – Name of data type such as CRDS

Returns:

class_name object

Return type:

callable

openghg.util.read_header(filepath, comment_char='#')[source]#

Reads the header lines denoted by the comment_char

Parameters:
  • filepath (Union[str, Path]) – Path to file

  • comment_char (str) – Character that denotes a comment line at the start of a file

Returns:

List of lines in the header

Return type:

list
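A minimal sketch of header reading follows. It assumes the header is the unbroken run of lines at the top of the file that start with `comment_char`, and that lines are returned with their trailing newlines. `read_header_lines` is a hypothetical name.

```python
import os
import tempfile


def read_header_lines(filepath, comment_char="#"):
    """Hypothetical sketch: collect leading lines that start with comment_char."""
    header = []
    with open(filepath, "r") as f:
        for line in f:
            if not line.startswith(comment_char):
                break  # the first non-comment line ends the header
            header.append(line)
    return header


# Small demonstration with a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# site: bsd\n# units: ppb\ntime,ch4\n2020-01-01,1900\n")

header = read_header_lines(tmp.name)
os.remove(tmp.name)
print(header)  # ['# site: bsd\n', '# units: ppb\n']
```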

Hashing#

openghg.util.hash_bytes(data)[source]#

Calculate the SHA1 sum of some data

Parameters:

data (bytes) – Binary data

Returns:

SHA1 hash

Return type:

str

openghg.util.hash_file(filepath)[source]#

Opens the file at filepath and calculates its SHA1 hash

Taken from https://stackoverflow.com/a/22058673

Parameters:

filepath (pathlib.Path) – Path to file

Returns:

SHA1 hash

Return type:

str
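Both hashing helpers can be sketched with `hashlib`. Reading the file in chunks keeps memory use flat for large files. The function names and chunk size here are illustrative choices, not taken from OpenGHG.

```python
import hashlib
import os
import tempfile


def sha1_of_bytes(data):
    """Return the hexadecimal SHA1 digest of some bytes."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(filepath, chunk_size=65536):
    """Hash a file in fixed-size chunks so it never has to fit in memory at once."""
    sha1 = hashlib.sha1()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


print(sha1_of_bytes(b"abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d

# Hashing a file gives the same digest as hashing its contents directly.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"abc")
same = sha1_of_file(tmp.name) == sha1_of_bytes(b"abc")
os.remove(tmp.name)
print(same)  # True
```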

openghg.util.hash_retrieved_data(to_hash)[source]#

Hash data retrieved from a data platform. This calculates the SHA1 of the metadata and the start date, end date and the number of timestamps in the Dataset.

Parameters:
  • to_hash (Dict[str, Dict]) – Dictionary to hash. We expect this to be a dictionary such as {species_key: {"data": xr.Dataset, "metadata": {…}}}

Returns:

Dictionary of hash: species_key

Return type:

dict

openghg.util.hash_string(to_hash)[source]#

Return the SHA-1 hash of a string

Parameters:

to_hash (str) – String to hash

Returns:

SHA1 hash of string

Return type:

str

Measurement helpers#

openghg.util.check_lifetime_monthly(lifetime)[source]#

Check whether retrieved lifetime value represents monthly lifetimes. This checks whether lifetime is a list and contains 12 values.

Parameters:

lifetime (Union[str, List[str], None]) – str or list representation of lifetime value

Returns:

True if lifetime matches criteria for monthly data, False otherwise

Raises ValueError:

if lifetime is a list but does not contain exactly 12 entries, one for each month

Return type:

bool
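The rule described above can be sketched as follows. `is_monthly_lifetime` is a hypothetical name, and the lifetime strings in the demonstration are placeholders rather than values from the OpenGHG species file.

```python
def is_monthly_lifetime(lifetime):
    """Hypothetical sketch: a list of exactly 12 values counts as monthly."""
    if isinstance(lifetime, list):
        if len(lifetime) != 12:
            raise ValueError("Monthly lifetimes need exactly 12 entries, one per month.")
        return True
    return False  # a single string or None is not a monthly lifetime


print(is_monthly_lifetime(["30D"] * 12))  # True
print(is_monthly_lifetime("10Y"))         # False
print(is_monthly_lifetime(None))          # False
```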

openghg.util.format_inlet(inlet, units='m', key_name=None, special_keywords=None)[source]#

Make sure inlet / height name conforms to standard. The standard imposed can depend on the associated key_name itself (can be supplied as an option to check).

This standard is as follows:
  • number followed by unit

  • number alone if unit / derivative is specified at the end of key_name (e.g. station_height_masl)

  • unchanged if this is one of the special keywords (by default “multiple” or “various”)

Other considerations:
  • For units of “m”, we will also look for “magl” and “masl” (metres above ground and sea level)

  • If the input string just contains numbers, it is assumed this is already within the correct unit.

Parameters:
  • inlet (Optional[str]) – Inlet / Height value in the specified units

  • units (str) – Units for the inlet value (“m” by default)

  • key_name (Optional[str]) – Name of the associated key. This is optional but will be used to determine whether the unit value should be added to the output string.

  • special_keywords (Optional[list]) – Special keywords that inlet could be set to. If inlet matches one of these, no formatting is applied. If this is not set, the special keywords “multiple” and “column” will still be allowed.

Returns:

formatted inlet string / None

Return type:

str

Usage:
>>> format_inlet("10")
    "10m"
>>> format_inlet("10m")
    "10m"
>>> format_inlet("10magl")
    "10m"
>>> format_inlet("10.111")
    "10.1m"
>>> format_inlet("multiple")
    "multiple"
>>> format_inlet("10m", key_name="inlet")
    "10m"
>>> format_inlet("10m", key_name="inlet_magl")
    "10"
>>> format_inlet("10m", key_name="station_height_masl")
    "10"

openghg.util.find_matching_site(site_name, possible_sites)[source]#

Try to find a similar name to site_name in possible_sites and return a suggestion or error string.

Parameters:
  • site_name (str) – Name of site

  • possible_sites – List of sites to check

Returns:

Suggestion / error message

Return type:

str
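Closest-name suggestions of this kind can be sketched with the standard library's `difflib`. This is a stand-in technique: the matcher OpenGHG actually uses is not stated here, and `suggest_site`, the cutoff and the message wording are all hypothetical.

```python
import difflib


def suggest_site(site_name, possible_sites):
    """Hypothetical sketch using difflib; the real function may use another matcher."""
    # Compare in lowercase but report the original spelling.
    lowered = {site.lower(): site for site in possible_sites}
    matches = difflib.get_close_matches(site_name.lower(), list(lowered), n=3, cutoff=0.6)
    if matches:
        names = ", ".join(lowered[m] for m in matches)
        return f"Did you mean: {names}?"
    return f"Unable to find a site matching '{site_name}'."


sites = ["Bilsdale", "Heathfield", "Tacolneston"]
print(suggest_site("bilsdal", sites))  # Did you mean: Bilsdale?
print(suggest_site("zzz", sites))      # Unable to find a site matching 'zzz'.
```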

openghg.util.multiple_inlets(site)[source]#

Check if the passed site has more than one inlet

Parameters:

site (str) – Three letter site code

Returns:

True if multiple inlets

Return type:

bool

openghg.util.molar_mass(species, species_filename=None)[source]#

This function extracts the molar mass of a species.

Parameters:
  • species (str) – Species name

  • species_filename (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.

Returns:

Molar mass of species

Return type:

float

openghg.util.species_lifetime(species, species_filename=None)[source]#

Find species lifetime. This can either be labelled as “lifetime” or “lifetime_monthly”.

Note: no species synonyms accepted yet

Parameters:
  • species (Optional[str]) – Species name e.g. “ch4” or “co2”

  • species_filename (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.

Returns:

Extracted lifetime or None if no lifetime was present.

Return type:

str / list / None

openghg.util.synonyms(species, lower=True, allow_new_species=True, species_filename=None)[source]#

Check to see if there are other names that we should be using for a particular input. E.g. If CFC-11 or CFC11 was input, go on to use cfc11.

Parameters:
  • species (str) – Input string that you’re trying to match

  • lower (bool) – Return all lower case

  • allow_new_species (bool) – Return original value (may be lower case) if this (or a synonym) is not found in the database. If False, raise a ValueError.

  • species_filename (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.

Returns:

Matched species string

Return type:

str

TODO: Decide if we need to make this lower case or not. Included this here so this occurs in one place which can be linked to and changed if needed.

openghg.util.site_code_finder(site_name)[source]#

Find the site code for a given site name.

Parameters:

site_name (str) – Site long name

Returns:

Three letter site code if found

Return type:

str or None

openghg.util.verify_site(site)[source]#

Check if the passed site is a valid one and returns the three letter site code if found. Otherwise we use fuzzy text matching to suggest sites with similar names.

Parameters:

site (str) – Three letter site code or site name

Returns:

Verified three letter site code if valid site

Return type:

str

String handling#

openghg.util.clean_string(to_clean)[source]#

Returns a lowercase string with only alphanumeric characters and underscores.

Parameters:

to_clean (Optional[str]) – String to clean

Returns:

Clean string

Return type:

str or None
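Taking the description above literally, the cleaning step can be sketched with a regular expression. `clean` is a hypothetical name, and the exact set of characters the OpenGHG function keeps may differ.

```python
import re


def clean(to_clean):
    """Hypothetical sketch: lowercase, then drop everything except a-z, 0-9 and "_"."""
    if to_clean is None:
        return None
    return re.sub(r"[^a-z0-9_]", "", to_clean.lower())


print(clean("Mace Head (MHD)_v2!"))  # maceheadmhd_v2
print(clean(None))                   # None
```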

openghg.util.is_number(s)[source]#

Is it a number?

https://stackoverflow.com/q/354038

Parameters:

s (Any) – String which may be a number

Return type:

bool

Returns:

bool

openghg.util.remove_punctuation(s)[source]#

Removes punctuation and converts the passed string to lowercase

Parameters:

s (str) – String to convert

Returns:

Unpunctuated, lowercased string

Return type:

str

openghg.util.to_lowercase(d, skip_keys=None)[source]#

Convert an object to lowercase. All keys and values in a dictionary will be converted to lowercase as will all objects in a list, tuple or set. You can optionally pass in a list of keys to skip when lowercasing a dictionary.

Based on the answer https://stackoverflow.com/a/40789531/1303032

Parameters:
  • d (Union[Dict, List, Tuple, Set, str]) – Object to lower case

  • skip_keys (Optional[List]) – List of keys to skip when lowercasing.

Returns:

Dictionary of lower case keys and values

Return type:

dict
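Recursive lowercasing of nested containers might look like the sketch below. It assumes that an entry whose key is in `skip_keys` keeps both its key and its value unchanged, which is one reading of the description above. `lowercase` is a hypothetical name.

```python
def lowercase(d, skip_keys=None):
    """Hypothetical recursive sketch; entries under skip_keys are left untouched."""
    skip_keys = skip_keys or []
    if isinstance(d, dict):
        result = {}
        for key, value in d.items():
            if key in skip_keys:
                result[key] = value  # assumption: skip both key and value
            else:
                result[lowercase(key)] = lowercase(value, skip_keys)
        return result
    if isinstance(d, (list, tuple, set)):
        # Rebuild the same container type from lowercased items.
        return type(d)(lowercase(item, skip_keys) for item in d)
    if isinstance(d, str):
        return d.lower()
    return d  # numbers, None and other types pass through unchanged


metadata = {"Site": "BSD", "Inlets": ["108M", "248M"], "Species": "CH4"}
print(lowercase(metadata, skip_keys=["Species"]))
# {'site': 'bsd', 'inlets': ['108m', '248m'], 'Species': 'CH4'}
```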

Dates and times#

openghg.util.check_date(date)[source]#

Check if a date string can be converted to a pd.Timestamp and returns NA if not.

Returns a string that can be JSON serialised.

Parameters:

date (str) – String to test

Returns:

Returns NA if not a date, otherwise date string

Return type:

str

openghg.util.check_nan(data)[source]#

Check if a number is NaN.

Returns a string that can be JSON serialised.

Parameters:

data (Union[int, float]) – Number

Returns:

Returns NA if not a number else number

Return type:

str, float, int

openghg.util.closest_daterange(to_compare, dateranges)[source]#

Finds the closest daterange in a list of dateranges

Parameters:
  • to_compare (str) – Daterange (as a string) to compare

  • dateranges (Union[str, List[str]]) – List of dateranges

Returns:

Daterange from dateranges that’s the closest in time to to_compare

Return type:

str

openghg.util.combine_dateranges(dateranges)[source]#

Combine dateranges

Parameters:

dateranges (List[str]) – Daterange strings

Returns:

List of combined dateranges

Return type:

list

Modified from https://codereview.stackexchange.com/a/69249
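Combining dateranges is an instance of merging overlapping intervals. A standard-library sketch follows. It assumes date-only "start_end" strings and drops any time-of-day component on output. `merge_dateranges` is a hypothetical name.

```python
from datetime import datetime


def merge_dateranges(dateranges):
    """Hypothetical sketch: merge overlapping "start_end" strings into the fewest ranges."""
    # Parse each string into a (start, end) pair and sort by start.
    spans = sorted(
        tuple(datetime.fromisoformat(part) for part in dr.split("_"))
        for dr in dateranges
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [f"{s.date().isoformat()}_{e.date().isoformat()}" for s, e in merged]


combined = merge_dateranges(
    ["2014-01-01_2014-06-01", "2014-05-01_2015-01-01", "2016-01-01_2016-06-01"]
)
print(combined)  # ['2014-01-01_2015-01-01', '2016-01-01_2016-06-01']
```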

openghg.util.create_daterange(start, end, freq='D')[source]#

Create a minute aligned daterange

Parameters:
  • start (Timestamp) – Start date

  • end (Timestamp) – End date

Return type:

DatetimeIndex

Returns:

pandas.DatetimeIndex

openghg.util.create_daterange_str(start, end)[source]#

Convert the passed datetimes into a daterange string for use in searches and Datasource interactions

Parameters:
  • start – Start date

  • end – End date

Returns:

Daterange string

Return type:

str

openghg.util.create_frequency_str(value=None, unit=None, period=None)[source]#

Create a suitable frequency string based on either a value and unit pair or a period value. The unit will be made singular if the value is 1.

Check time_offset_definition() for accepted input units.

Parameters:
  • value (Union[int, float, None]) – Value and unit pair to use

  • unit (Optional[str]) – Value and unit pair to use

  • period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)

Returns:

Formatted string

Examples:
>>> create_frequency_str(value=1, unit="hour")
    "1 hour"
>>> create_frequency_str(period="3MS")
    "3 months"
>>> create_frequency_str(period="yearly")
    "1 year"

Return type:

str

openghg.util.daterange_contains(container, contained)[source]#

Check if the daterange container contains the daterange contained

Parameters:
  • container (str) – Daterange

  • contained (str) – Daterange

Return type:

bool

Returns:

bool

openghg.util.daterange_from_str(daterange_str, freq='D')[source]#

Get a Pandas DatetimeIndex from a string. The created DatetimeIndex has minute frequency.

Parameters:
  • daterange_str (str) – Daterange string of the form 2019-01-01T00:00:00_2019-12-31T00:00:00

Returns:

DatetimeIndex covering daterange

Return type:

pandas.DatetimeIndex

openghg.util.daterange_overlap(daterange_a, daterange_b)[source]#

Check if daterange_a is within daterange_b.

Parameters:
  • daterange_a (str) – Timezone aware daterange string. Example: 2014-01-30-10:52:30+00:00_2014-01-30-13:22:30+00:00

  • daterange_b (str) – As daterange_a

Returns:

True if daterange included

Return type:

bool

openghg.util.daterange_to_str(daterange)[source]#

Takes a pandas DatetimeIndex created by pandas date_range and converts it to a string of the form 2019-01-01-00:00:00_2019-03-16-00:00:00

Parameters:

daterange (pandas.DatetimeIndex) –

Returns:

Daterange in string format

Return type:

str

openghg.util.find_daterange_gaps(start_search, end_search, dateranges)[source]#

Given a start and end date and a list of dateranges find the gaps.

For example, given a list of dateranges:

    example = ["2014-09-02_2014-11-01", "2016-09-02_2018-11-01"]
    start = timestamp_tzaware("2012-01-01")
    end = timestamp_tzaware("2019-09-01")
    gaps = find_daterange_gaps(start, end, example)
    gaps == ["2012-01-01-00:00:00+00:00_2014-09-01-00:00:00+00:00",
             "2014-11-02-00:00:00+00:00_2016-09-01-00:00:00+00:00",
             "2018-11-02-00:00:00+00:00_2019-09-01-00:00:00+00:00"]

Parameters:
  • start_search (Timestamp) – Start timestamp

  • end_search (Timestamp) – End timestamp

  • dateranges (List[str]) – List of daterange strings

Returns:

List of dateranges

Return type:

list
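The gap-finding logic can be sketched with plain dates. This simplified version works on `(start, end)` date pairs instead of daterange strings and pads each gap by one day, which matches the example output above. `find_gaps` is a hypothetical name.

```python
from datetime import date, timedelta


def find_gaps(start_search, end_search, dateranges):
    """Hypothetical date-only sketch; dateranges is a list of (start, end) date pairs."""
    one_day = timedelta(days=1)
    gaps = []
    cursor = start_search  # first day not yet covered
    for start, end in sorted(dateranges):
        if start > cursor:
            # The gap runs up to the day before this range starts.
            gaps.append((cursor, min(start - one_day, end_search)))
        cursor = max(cursor, end + one_day)
        if cursor > end_search:
            break
    if cursor <= end_search:
        gaps.append((cursor, end_search))
    return gaps


covered = [(date(2014, 9, 2), date(2014, 11, 1)), (date(2016, 9, 2), date(2018, 11, 1))]
gaps = find_gaps(date(2012, 1, 1), date(2019, 9, 1), covered)
for gap_start, gap_end in gaps:
    print(gap_start, gap_end)
# 2012-01-01 2014-09-01
# 2014-11-02 2016-09-01
# 2018-11-02 2019-09-01
```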

openghg.util.find_duplicate_timestamps(data)[source]#

Check for duplicates

Parameters:

data (Union[Dataset, DataFrame]) – Data object to check. Should have a time attribute or index

Returns:

A list of duplicates

Return type:

list

openghg.util.first_last_dates(keys)[source]#

Find the first and last timestamp from a list of keys

Parameters:

keys (List) – List of keys

Returns:

First and last timestamp

Return type:

tuple

openghg.util.in_daterange(start_a, end_a, start_b, end_b)[source]#

Check if two dateranges overlap.

Parameters:
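  • start_a – Start datetime of the first daterange

  • end_a – End datetime of the first daterange

  • start_b – Start datetime of the second daterange

  • end_b – End datetime of the second daterange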

Returns:

True if overlap

Return type:

bool
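Two ranges overlap when each one starts before the other ends. A sketch of that test follows. It treats the endpoints as inclusive, which is an assumption, and `ranges_overlap` is a hypothetical name.

```python
from datetime import datetime


def ranges_overlap(start_a, end_a, start_b, end_b):
    """Hypothetical sketch: True if [start_a, end_a] and [start_b, end_b] share any time."""
    return start_a <= end_b and end_a >= start_b


jan = (datetime(2020, 1, 1), datetime(2020, 1, 31))
mid_jan_to_feb = (datetime(2020, 1, 15), datetime(2020, 2, 15))
march = (datetime(2020, 3, 1), datetime(2020, 3, 31))

print(ranges_overlap(*jan, *mid_jan_to_feb))  # True
print(ranges_overlap(*jan, *march))           # False
```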

openghg.util.parse_period(period)[source]#

Parses period input and converts to a value, unit pair.

Check time_offset_definition() for accepted input units.

Parameters:

period (Union[str, tuple]) –

Period of measurements. Should be one of:

  • ”yearly”, “monthly”

  • suitable pandas Offset Alias

  • tuple of (value, unit) as would be passed to pandas.Timedelta function

Returns:

value and associated time period

Examples:
>>> parse_period("12H")
    (12, "hours")
>>> parse_period("yearly")
    (1, "years")
>>> parse_period("monthly")
    (1, "months")
>>> parse_period((1, "minute"))
    (1, "minutes")

Return type:

int, str

openghg.util.relative_time_offset(value=None, unit=None, period=None)[source]#

Create relative time offset based on inputs. This is based on the pandas DateOffset and Timedelta functions.

Check time_offset_definition() for accepted input units.

If the input is “years” or “months” a relative offset (DateOffset) will be created since these are variable units. For example:

  • “2013-01-01” + 1 year relative offset = “2014-01-01”

  • “2012-05-01” + 2 months relative offset = “2012-07-01”

Otherwise the Timedelta function will be used.

Parameters:
  • value (Union[int, float, None]) – Value and unit pair to use

  • unit (Optional[str]) – Value and unit pair to use

  • period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)

Returns:

Time offset object, appropriate for the period type

Return type:

DateOffset/Timedelta
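The reason months and years need a relative offset can be shown with the standard library alone: a fixed number of days drifts across leap years and uneven months, while a calendar-aware shift does not. `add_months` below is a hypothetical helper written for this illustration. It is not the pandas DateOffset that the function uses.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Hypothetical calendar-aware shift, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


print(add_months(date(2013, 1, 1), 12))        # 2014-01-01
print(add_months(date(2012, 5, 1), 2))         # 2012-07-01
# A fixed 365-day step falls short in a leap year:
print(date(2012, 1, 1) + timedelta(days=365))  # 2012-12-31
```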

openghg.util.sanitise_daterange(daterange)[source]#

Make sure the daterange is correct and return tzaware daterange.

Parameters:

daterange (str) – Daterange str

Returns:

Timezone aware daterange str

Return type:

str

openghg.util.split_daterange_str(daterange_str, date_only=False)[source]#

Split a daterange string to the component start and end Timestamps

Parameters:
  • daterange_str (str) – Daterange string of the form 2019-01-01T00:00:00_2019-12-31T00:00:00

  • date_only (bool) – Return only the date portion of the Timestamp, removing the hours / seconds component

Returns:

Tuple of start, end timestamps / dates

Return type:

tuple (Timestamp / datetime.date, Timestamp / datetime.date)
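Splitting a daterange string comes down to splitting on the underscore and parsing each half. The sketch below uses standard-library datetimes rather than pandas Timestamps and does not add timezone information. `split_daterange` is a hypothetical name.

```python
from datetime import datetime


def split_daterange(daterange_str, date_only=False):
    """Hypothetical sketch: split "start_end" and parse both halves as ISO datetimes."""
    start_str, end_str = daterange_str.split("_")
    start = datetime.fromisoformat(start_str)
    end = datetime.fromisoformat(end_str)
    if date_only:
        return start.date(), end.date()
    return start, end


start, end = split_daterange("2019-01-01T00:00:00_2019-12-31T00:00:00")
print(start, end)  # 2019-01-01 00:00:00 2019-12-31 00:00:00

start_date, end_date = split_daterange("2019-01-01T00:00:00_2019-12-31T00:00:00", date_only=True)
print(start_date, end_date)  # 2019-01-01 2019-12-31
```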

openghg.util.split_encompassed_daterange(container, contained)[source]#

Checks if one of the passed dateranges contains the other. If so, the larger daterange is split into three sections.

      <---a--->
<---------b----------->

Here b is split into three and we end up with:

<-dr1-><---a---><-dr2->

Parameters:
  • container – Daterange

  • contained – Daterange

Returns:

Dictionary of results

Return type:

dict

openghg.util.time_offset(value=None, unit=None, period=None)[source]#

Create time offset based on inputs. This will return a Timedelta object and cannot create relative offsets (this includes “weeks”, “months”, “years”).

Parameters:
  • value (Union[int, float, None]) – Value and unit pair to use

  • unit (Optional[str]) – Value and unit pair to use

  • period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)

Returns:

Time offset object

Return type:

Timedelta

openghg.util.time_offset_definition()[source]#

Returns synonym definition for time offset inputs.

Accepted inputs are as follows:
  • “months”: “monthly”, “months”, “month”, “MS”

  • “years”: “yearly”, “years”, “annual”, “year”, “AS”, “YS”

  • “weeks”: “weekly”, “weeks”, “week”, “W”

  • “days”: “daily”, “days”, “day”, “D”

  • “hours”: “hourly”, “hours”, “hour”, “hr”, “h”, “H”

  • “minutes”: “minutely”, “minutes”, “minute”, “min”, “m”, “T”

  • “seconds”: “secondly”, “seconds”, “second”, “sec”, “s”, “S”

This is to ensure the correct keyword for using the DateOffset and Timedelta functions can be supplied (want this to be “years”, “months” etc.)

Returns:

Dictionary mapping each time unit to its list of synonyms

Return type:

dict
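The synonym table above lends itself to a reverse lookup from any accepted input to its canonical unit. The sketch below copies the table from the list above and shows how a period string such as "12H" could be split into a value and a unit. `split_period` and the regular expression are hypothetical, not the OpenGHG implementation of parse_period().

```python
import re

# Synonym table copied from the list above.
TIME_OFFSET_SYNONYMS = {
    "months": ["monthly", "months", "month", "MS"],
    "years": ["yearly", "years", "annual", "year", "AS", "YS"],
    "weeks": ["weekly", "weeks", "week", "W"],
    "days": ["daily", "days", "day", "D"],
    "hours": ["hourly", "hours", "hour", "hr", "h", "H"],
    "minutes": ["minutely", "minutes", "minute", "min", "m", "T"],
    "seconds": ["secondly", "seconds", "second", "sec", "s", "S"],
}

# Reverse lookup, kept case-sensitive because "m" means minutes while "MS" means months.
UNIT_LOOKUP = {syn: unit for unit, syns in TIME_OFFSET_SYNONYMS.items() for syn in syns}


def split_period(period):
    """Hypothetical sketch of turning a period into a (value, unit) pair."""
    if isinstance(period, tuple):
        value, unit = period
    else:
        # Optional leading integer followed by letters, e.g. "12H" or "yearly".
        match = re.fullmatch(r"(\d+)?\s*([A-Za-z]+)", period)
        if match is None:
            raise ValueError(f"Unable to parse period: {period}")
        value = int(match.group(1) or 1)
        unit = match.group(2)
    if unit not in UNIT_LOOKUP:
        raise ValueError(f"Unknown time unit: {unit}")
    return value, UNIT_LOOKUP[unit]


print(split_period("12H"))          # (12, 'hours')
print(split_period("yearly"))       # (1, 'years')
print(split_period((1, "minute")))  # (1, 'minutes')
```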

openghg.util.timestamp_epoch()[source]#

Returns the UNIX epoch time, 1st of January 1970.

Returns:

Timestamp object at epoch

Return type:

pandas.Timestamp

openghg.util.timestamp_now()[source]#

Returns a pandas timezone (UTC) aware Timestamp for the current time.

Returns:

Timestamp at current time

Return type:

pandas.Timestamp

openghg.util.timestamp_tzaware(timestamp)[source]#

Returns the pandas Timestamp passed as a timezone (UTC) aware Timestamp.

Parameters:

timestamp (pandas.Timestamp) – Timezone naive Timestamp

Returns:

Timezone aware Timestamp

Return type:

pandas.Timestamp
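The same idea can be shown with standard-library datetimes. This sketch assumes a naive value is already in UTC and only needs labelling, while an aware value is converted to UTC. `to_utc` is a hypothetical name, and the OpenGHG function works on pandas Timestamps.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp):
    """Hypothetical stdlib sketch: naive datetimes are assumed to already be in UTC."""
    if timestamp.tzinfo is None:
        return timestamp.replace(tzinfo=timezone.utc)
    return timestamp.astimezone(timezone.utc)


naive = datetime(2020, 1, 1, 12, 0)
print(to_utc(naive).isoformat())  # 2020-01-01T12:00:00+00:00

# An aware datetime two hours ahead of UTC is shifted back to UTC.
plus_two = datetime(2020, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc(plus_two).isoformat())  # 2020-01-01T10:00:00+00:00
```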

openghg.util.trim_daterange(to_trim, overlapping)[source]#

Removes overlapping dates from to_trim

Parameters:
  • to_trim – Daterange to trim down. Dates that overlap with overlapping are removed from to_trim

  • overlapping – Daterange containing dates we want to trim from to_trim

Returns:

Trimmed daterange

Return type:

str

openghg.util.valid_daterange(daterange)[source]#

Check if the passed daterange is valid

Parameters:

daterange (str) – Daterange string

Returns:

True if valid

Return type:

bool

User#

Handling user configuration files.

openghg.util.create_default_config()[source]#

Creates a default user config in the user’s home directory.

Return type:

None

Returns:

None

openghg.util.get_user_config_path()[source]#

Checks if a config file has already been created for OpenGHG to use. This file is created in the user’s home directory in ~/.config/openghg/user.conf on Linux / macOS or in LOCALAPPDATA/openghg/openghg.conf on Windows.

Returns:

Path to user config file

Return type:

pathlib.Path

openghg.util.read_local_config()[source]#

Reads the local config file.

Returns:

OpenGHG configurations

Return type:

dict

Environment detection#

openghg.util.running_locally()[source]#

Are we running OpenGHG locally?

Returns:

True if running locally

Return type:

bool

openghg.util.running_in_cloud()[source]#

Are we running in the cloud?

Checks for the OPENGHG_CLOUD environment variable being set

Returns:

True if running in cloud

Return type:

bool

openghg.util.running_on_hub()[source]#

Are we running on the OpenGHG Hub?

Checks for the OPENGHG_CLOUD environment variable being set

Returns:

True if running on the OpenGHG Hub

Return type:

bool

Miscellaneous#

Some itertools-like functions.

openghg.util.pairwise(iterable)[source]#

Return a zip of an iterable where a is the iterable and b is the iterable advanced one step.

Parameters:

iterable (Iterable) – Any iterable type

Returns:

Tuple of iterables

Return type:

tuple
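The behaviour described matches the well-known itertools recipe, sketched here under a different name to make clear that it is an illustration and not the OpenGHG source.

```python
from itertools import tee


def pairwise_sketch(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one step
    return zip(a, b)


print(list(pairwise_sketch([1, 2, 3, 4])))  # [(1, 2), (2, 3), (3, 4)]
```

Pairs like these are convenient for walking over consecutive entries, for example when comparing each daterange in a sorted list with the next one.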

openghg.util.unanimous(seq)[source]#

Checks that all values in an iterable object are the same

Parameters:

seq (Dict) – Iterable object

Return type:

bool

Returns:

True if all values are the same
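A check of this kind can be sketched as below. It assumes the input is a dictionary, as the parameter type suggests, and treats an empty dictionary as unanimous. `all_values_equal` is a hypothetical name.

```python
def all_values_equal(seq):
    """Hypothetical sketch: True when every value in the dictionary is the same."""
    values = iter(seq.values())
    try:
        first = next(values)
    except StopIteration:
        return True  # assumption: an empty dictionary counts as unanimous
    return all(value == first for value in values)


print(all_values_equal({"a": "ppb", "b": "ppb"}))  # True
print(all_values_equal({"a": "ppb", "b": "ppm"}))  # False
```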