Util#
Helper functions used throughout OpenGHG, from file hashing to timestamp handling.
Domain#
- openghg.util.find_domain(domain, domain_filepath=None)[source]#
Finds the latitude and longitude values in degrees associated with a given domain name.
- Parameters:
domain (str) – Pre-defined domain name
domain_filepath (Union[str, Path, None]) – Alternative domain info file. Defaults to openghg_defs input.
- Returns:
Latitude and longitude values for the domain in degrees.
- Return type:
array, array
Downloading data#
- openghg.util.download_data(url, filepath=None, timeout=10)[source]#
Download data file, with progress bar.
Based on https://stackoverflow.com/a/63831344/1303032
- Parameters:
url (str) – URL of content to download
filepath (Union[str, Path, None]) – Filepath to write out data
timeout – Timeout for HTTP request (seconds)
- Returns:
Bytes if no filepath given
- Return type:
bytes / None
File handling, compression#
- openghg.util.compress(data)[source]#
Compress the given data
- Parameters:
data (bytes) – Binary data
- Returns:
Compressed data
- Return type:
bytes
- openghg.util.compress_json(data)[source]#
Convert object to JSON string and compress
- Parameters:
data (Any) – Object to pass to json.dumps
- Returns:
Compressed binary data
- Return type:
bytes
- openghg.util.compress_str(s)[source]#
Compress a string
- Parameters:
s (str) – String
- Returns:
Compressed data
- Return type:
bytes
- openghg.util.decompress(data)[source]#
Decompress the given data
- Parameters:
data (bytes) – Compressed data
- Returns:
Decompressed data
- Return type:
bytes
- openghg.util.decompress_json(data)[source]#
Decompress binary data and load the resulting JSON
- Parameters:
data (bytes) – Compressed binary data
- Return type:
Any
- Returns:
Object loaded from JSON
- openghg.util.decompress_str(data)[source]#
Decompress a string from bytes
- Parameters:
data (bytes) – Compressed data
- Returns:
Decompressed str
- Return type:
str
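The compression helpers above pair up as round trips: whatever compress_json produces, decompress_json should restore. The sketch below illustrates that pattern with the standard library only. The codec (bz2) and the function names are assumptions made for illustration; the reference above does not state which algorithm OpenGHG uses.

```python
import bz2
import json
from typing import Any


def compress_json_sketch(data: Any) -> bytes:
    """Serialise to a JSON string, encode as UTF-8, then compress.

    bz2 is an assumed codec, chosen only for illustration.
    """
    return bz2.compress(json.dumps(data).encode("utf-8"))


def decompress_json_sketch(data: bytes) -> Any:
    """Reverse of compress_json_sketch: decompress, decode, parse JSON."""
    return json.loads(bz2.decompress(data).decode("utf-8"))


record = {"site": "bsd", "inlets": ["10m", "100m"]}
packed = compress_json_sketch(record)
restored = decompress_json_sketch(packed)
```

Here restored compares equal to record, and packed is a bytes object suitable for writing to storage.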
- openghg.util.get_datapath(filename, directory=None)[source]#
Returns the correct path to data files used for assigning attributes
- Parameters:
filename (Union[str, Path]) – Name of file to be accessed
- Returns:
Path of file
- Return type:
pathlib.Path
- openghg.util.get_logfile_path()[source]#
Get the logfile path
- Returns:
Path to logfile
- Return type:
Path
- openghg.util.load_json(path)[source]#
Returns a dictionary deserialised from JSON.
- Parameters:
path (Union[str, Path]) – Path to file, can be any filepath
- Returns:
Dictionary created from JSON
- Return type:
dict
- openghg.util.read_header(filepath, comment_char='#')[source]#
Reads the header lines denoted by the comment_char
- Parameters:
filepath (Union[str, Path]) – Path to file
comment_char (str) – Character that denotes a comment line at the start of a file
- Returns:
List of lines in the header
- Return type:
list
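As a rough illustration of header reading, the sketch below collects the leading lines that start with the comment character and stops at the first line that does not. The function name and the stop-at-first-data-line behaviour are assumptions; only the parameters and the list return type come from the entry above.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def read_header_sketch(filepath: Union[str, Path], comment_char: str = "#") -> List[str]:
    """Return the leading lines of a file that start with comment_char."""
    header = []
    with open(filepath, "r") as f:
        for line in f:
            if line.startswith(comment_char):
                header.append(line)
            else:
                # Assumed behaviour: the header ends at the first data line
                break
    return header


with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "example.csv"
    example_path.write_text("# site: bsd\n# units: ppb\ntime,value\n# not part of header\n")
    header_lines = read_header_sketch(example_path)
```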
Hashing#
- openghg.util.hash_bytes(data)[source]#
Calculate the SHA1 sum of some data
- Parameters:
data (bytes) – Binary data
- Returns:
SHA1 hash
- Return type:
str
- openghg.util.hash_file(filepath)[source]#
Opens the file at filepath and calculates its SHA1 hash
Taken from https://stackoverflow.com/a/22058673
- Parameters:
filepath (pathlib.Path) – Path to file
- Returns:
SHA1 hash
- Return type:
str
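A minimal sketch of SHA1 hashing for bytes and files, using only hashlib. The function names and the chunk size are illustrative assumptions, not OpenGHG's own code. Reading the file in chunks keeps memory use flat for large data files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_bytes(data: bytes) -> str:
    """Hex digest of the SHA1 sum of some binary data."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(filepath: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks (chunk_size is an assumed value)."""
    sha1 = hashlib.sha1()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.dat"
    path.write_bytes(b"abc")
    file_digest = sha1_of_file(path)

bytes_digest = sha1_of_bytes(b"abc")
```

Both digests are identical because the file holds exactly the bytes that were hashed directly.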
- openghg.util.hash_retrieved_data(to_hash)[source]#
Hash data retrieved from a data platform. This calculates the SHA1 of the metadata and the start date, end date and the number of timestamps in the Dataset.
- Parameters:
to_hash (Dict[str, Dict]) – Dictionary to hash. We expect this to be a dictionary such as {species_key: {"data": xr.Dataset, "metadata": {…}}}
- Returns:
Dictionary of hash: species_key
- Return type:
dict
Measurement helpers#
- openghg.util.check_lifetime_monthly(lifetime)[source]#
Check whether retrieved lifetime value represents monthly lifetimes. This checks whether lifetime is a list and contains 12 values.
- Parameters:
lifetime (Union[str, List[str], None]) – str or list representation of lifetime value
- Returns:
True if lifetime matches criteria for monthly data, False otherwise
- Return type:
bool
- Raises ValueError:
if lifetime is a list but does not contain exactly 12 entries, one for each month
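The rule described above (a list means monthly values, and a list must have exactly 12 entries) can be sketched as follows. The function name is hypothetical and the error message is an assumption.

```python
from typing import List, Optional, Union


def is_monthly_lifetime(lifetime: Union[str, List[str], None]) -> bool:
    """True for a 12-entry list, False for a str or None.

    Raises ValueError for a list of any other length.
    """
    if isinstance(lifetime, list):
        if len(lifetime) != 12:
            raise ValueError(
                f"Expected 12 monthly lifetime values, got {len(lifetime)}"
            )
        return True
    return False


monthly = is_monthly_lifetime(["30D"] * 12)
single = is_monthly_lifetime("12Y")
missing = is_monthly_lifetime(None)
```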
- openghg.util.format_inlet(inlet, units='m', key_name=None, special_keywords=None)[source]#
Make sure inlet / height name conforms to standard. The standard imposed can depend on the associated key_name itself (can be supplied as an option to check).
- This standard is as follows:
number followed by unit
number alone if unit / derivative is specified at the end of key_name (e.g. station_height_masl)
unchanged if this is one of the special keywords (by default “multiple” or “various”)
- Other considerations:
For units of “m”, we will also look for “magl” and “masl” (metres above ground and sea level)
If the input string just contains numbers, it is assumed this is already within the correct unit.
- Parameters:
inlet (Union[str, slice, None, list[Union[str, slice, None]]]) – Inlet / Height value in the specified units
units (str) – Units for the inlet value (“m” by default)
key_name (Optional[str]) – Name of the associated key. This is optional but will be used to determine whether the unit value should be added to the output string.
special_keywords (Optional[list]) – Special keywords that inlet could be set to. If inlet matches one of these, no formatting is applied. If this is not set, the special keywords “multiple” and “column” will still be allowed.
- Return type:
Union[str, slice, None, list[Union[str, slice, None]]]
- Returns:
Same type as input, with all strings formatted
- Usage:
>>> format_inlet("10")
"10m"
>>> format_inlet("10m")
"10m"
>>> format_inlet("10magl")
"10m"
>>> format_inlet("10.111")
"10.1m"
>>> format_inlet(["10", 100])
["10m", "100m"]
>>> format_inlet("multiple")
"multiple"
>>> format_inlet("10m", key_name="inlet")
"10m"
>>> format_inlet("10m", key_name="inlet_magl")
"10"
>>> format_inlet("10m", key_name="station_height_masl")
"10"
- openghg.util.find_matching_site(site_name, possible_sites)[source]#
Try to find a similar name to site_name in possible_sites and return a suggestion or error string.
- Parameters:
site_name (str) – Name of site
possible_sites – List of sites to check
- Returns:
Suggestion / error message
- Return type:
str
- openghg.util.multiple_inlets(site)[source]#
Check if the passed site has more than one inlet
- Parameters:
site (str) – Three letter site code
- Returns:
True if multiple inlets
- Return type:
bool
- openghg.util.molar_mass(species, species_filepath=None)[source]#
Extracts the molar mass of a species.
- Parameters:
species (str) – Species name
species_filepath (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.
- Returns:
Molar mass of species
- Return type:
float
- openghg.util.species_lifetime(species, species_filepath=None)[source]#
Find species lifetime. This can either be labelled as “lifetime” or “lifetime_monthly”.
Note: no species synonyms accepted yet
- Parameters:
species (Optional[str]) – Species name e.g. “ch4” or “co2”
species_filepath (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.
- Returns:
Extracted lifetime, or None if no lifetime was present.
- Return type:
str / list / None
- openghg.util.synonyms(species, lower=True, allow_new_species=True, species_filepath=None)[source]#
Check to see if there are other names that we should be using for a particular input. E.g. If CFC-11 or CFC11 was input, go on to use cfc11.
- Parameters:
species (str) – Input string that you’re trying to match
lower (bool) – Return all lower case
allow_new_species (bool) – Return original value (may be lower case) if this (or a synonym) is not found in the database. If False, raise a ValueError.
species_filepath (Union[str, Path, None]) – Alternative species info file. Defaults to openghg_defs input.
- Returns:
Matched species string
- Return type:
str
TODO: Decide if we need to make this lower case or not. Included this here so this occurs in one place which can be linked to and changed if needed.
- openghg.util.site_code_finder(site_name)[source]#
Find the site code for a given site name.
- Parameters:
site_name (str) – Site long name
- Returns:
Three letter site code if found
- Return type:
str or None
- openghg.util.verify_site(site)[source]#
Check if the passed site is valid and return the three letter site code if found. Otherwise fuzzy text matching is used to suggest sites with similar names.
- Parameters:
site (str) – Three letter site code or site name
- Returns:
Verified three letter site code if valid site
- Return type:
str
String handling#
- openghg.util.clean_string(to_clean)[source]#
Returns a lowercase string with only alphanumeric characters and underscores.
- Parameters:
to_clean (Optional[str]) – String to clean
- Returns:
Clean string
- Return type:
str or None
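One way to read "lowercase string with only alphanumeric characters and underscores" is sketched below with a single regular expression. This is an assumption about the behaviour, not OpenGHG's implementation: it drops every other character, including whitespace and hyphens, and restricts matching to ASCII.

```python
import re
from typing import Optional


def clean_string_sketch(to_clean: Optional[str]) -> Optional[str]:
    """Lowercase the input and drop anything that is not a-z, 0-9 or '_'."""
    if to_clean is None:
        return None
    return re.sub(r"[^a-z0-9_]", "", to_clean.lower())


cleaned_species = clean_string_sketch("CH4 (Methane)!")
cleaned_site = clean_string_sketch("Site_Name-01")
cleaned_none = clean_string_sketch(None)
```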
- openghg.util.is_number(s)[source]#
Is it a number?
https://stackoverflow.com/q/354038
- Parameters:
s (Any) – String which may be a number
- Return type:
bool
- Returns:
bool
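The usual approach to this check, and the one discussed in the linked Stack Overflow question, is to attempt a float conversion and catch the failure. The sketch below uses a hypothetical name and is not taken from OpenGHG. Note that under this approach strings such as "nan" and "inf" also count as numbers.

```python
from typing import Any


def is_number_sketch(s: Any) -> bool:
    """True if float(s) succeeds, False otherwise."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers None and other non-convertible objects
        return False


checks = {
    "3.14": is_number_sketch("3.14"),
    "1e-3": is_number_sketch("1e-3"),
    "abc": is_number_sketch("abc"),
    "None": is_number_sketch(None),
}
```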
- openghg.util.remove_punctuation(s)[source]#
Removes punctuation and converts the passed string to lowercase
- Parameters:
s (str) – String to convert
- Returns:
Unpunctuated, lowercased string
- Return type:
str
- openghg.util.to_lowercase(d, skip_keys=None)[source]#
Convert an object to lowercase. All keys and values in a dictionary will be converted to lowercase as will all objects in a list, tuple or set. You can optionally pass in a list of keys to skip when lowercasing a dictionary.
Based on the answer https://stackoverflow.com/a/40789531/1303032
- Parameters:
d (Union[Dict, List, Tuple, Set, str]) – Object to lower case
skip_keys (Optional[List]) – List of keys to skip when lowercasing.
- Returns:
Dictionary of lower case keys and values
- Return type:
dict
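The recursive lowercasing described above can be sketched as below. Two points are assumptions rather than documented behaviour: dictionary keys are expected to be strings, and an entry whose key is in skip_keys is kept entirely unchanged (both key and value).

```python
from typing import Any, List, Optional


def to_lowercase_sketch(d: Any, skip_keys: Optional[List[str]] = None) -> Any:
    """Recursively lowercase strings in dicts, lists, tuples and sets."""
    skip = skip_keys or []
    if isinstance(d, dict):
        out = {}
        for key, value in d.items():
            if key in skip:
                # Assumed: skipped entries keep their original key and value
                out[key] = value
            else:
                out[key.lower()] = to_lowercase_sketch(value, skip_keys)
        return out
    if isinstance(d, (list, tuple, set)):
        # Rebuild the same container type from the lowercased items
        return type(d)(to_lowercase_sketch(item, skip_keys) for item in d)
    if isinstance(d, str):
        return d.lower()
    return d


lowered = to_lowercase_sketch(
    {"Site": "BSD", "Species": ["CO2", "CH4"], "Title": "Keep Me"},
    skip_keys=["Title"],
)
```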
Dates and times#
- openghg.util.check_date(date)[source]#
Check if a date string can be converted to a pd.Timestamp and returns NA if not.
Returns a string that can be JSON serialised.
- Parameters:
date (str) – String to test
- Returns:
Returns NA if not a date, otherwise date string
- Return type:
str
- openghg.util.check_nan(data)[source]#
Check if a number is NaN.
Returns a string that can be JSON serialised.
- Parameters:
data (Union[int, float]) – Number
- Returns:
NA if not a number, otherwise the number
- Return type:
str, float, int
- openghg.util.closest_daterange(to_compare, dateranges)[source]#
Finds the closest daterange in a list of dateranges
- Parameters:
to_compare (str) – Daterange (as a string) to compare
dateranges (Union[str, List[str]]) – List of dateranges
- Returns:
Daterange from dateranges that’s the closest in time to to_compare
- Return type:
str
- openghg.util.combine_dateranges(dateranges)[source]#
Combine dateranges
- Parameters:
dateranges (List[str]) – Daterange strings
- Returns:
List of combined dateranges
- Return type:
list
Modified from https://codereview.stackexchange.com/a/69249
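Combining dateranges is an interval-merging problem: sort by start, then fold each range into the previous one when they overlap. The sketch below shows the idea on date-only strings of the form start_end, using the standard library. The function name and the simplified string format are assumptions; the real function works with full timestamp strings.

```python
from datetime import date
from typing import List, Tuple


def combine_ranges_sketch(dateranges: List[str]) -> List[str]:
    """Merge overlapping 'YYYY-MM-DD_YYYY-MM-DD' ranges."""
    parsed: List[Tuple[date, date]] = sorted(
        (date.fromisoformat(dr.split("_")[0]), date.fromisoformat(dr.split("_")[1]))
        for dr in dateranges
    )
    merged: List[Tuple[date, date]] = []
    for start, end in parsed:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend its end if needed
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [f"{s.isoformat()}_{e.isoformat()}" for s, e in merged]


combined = combine_ranges_sketch(
    ["2014-03-01_2014-09-01", "2014-01-01_2014-06-01", "2016-01-01_2016-02-01"]
)
```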
- openghg.util.create_daterange(start, end, freq='D')[source]#
Create a minute aligned daterange
- Parameters:
start (Timestamp) – Start date
end (Timestamp) – End date
- Return type:
DatetimeIndex
- Returns:
pandas.DatetimeIndex
- openghg.util.create_daterange_str(start, end)[source]#
Convert the passed datetimes into a daterange string for use in searches and Datasource interactions
- Parameters:
start – Start date
end – End date
- Returns:
Daterange string
- Return type:
str
- openghg.util.create_frequency_str(value=None, unit=None, period=None, include_units=True)[source]#
Create a suitable frequency string based on either a value and unit pair or a period value. The unit will be made singular if the value is 1.
Check time_offset_definition() for accepted input units.
- Parameters:
value (Union[int, float, None]) – Value and unit pair to use
unit (Optional[str]) – Value and unit pair to use
period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)
- Returns:
Formatted string
Examples:
>>> create_frequency_str(value=1, unit="hour")
"1 hour"
>>> create_frequency_str(period="3MS")
"3 months"
>>> create_frequency_str(period="yearly")
"1 year"
- Return type:
str
- openghg.util.daterange_contains(container, contained)[source]#
Check if the daterange container contains the daterange contained
- Parameters:
container (str) – Daterange
contained (str) – Daterange
- Return type:
bool
- Returns:
bool
- openghg.util.daterange_from_str(daterange_str, freq='D')[source]#
Get a Pandas DatetimeIndex from a string. The created DatetimeIndex has minute frequency.
- Parameters:
daterange_str (str) – Daterange string of the form 2019-01-01T00:00:00_2019-12-31T00:00:00
- Returns:
DatetimeIndex covering daterange
- Return type:
pandas.DatetimeIndex
- openghg.util.daterange_overlap(daterange_a, daterange_b)[source]#
Check if daterange_a is within daterange_b.
- Parameters:
daterange_a (str) – Timezone aware daterange string. Example: 2014-01-30-10:52:30+00:00_2014-01-30-13:22:30+00:00
daterange_b (str) – As daterange_a
- Returns:
True if daterange included
- Return type:
bool
- openghg.util.daterange_to_str(daterange)[source]#
Takes a pandas DatetimeIndex created by pandas date_range and converts it to a string of the form 2019-01-01-00:00:00_2019-03-16-00:00:00
- Parameters:
daterange (pandas.DatetimeIndex)
- Returns:
Daterange in string format
- Return type:
str
- openghg.util.find_daterange_gaps(start_search, end_search, dateranges)[source]#
Given a start and end date and a list of dateranges find the gaps.
For example, given a list of dateranges:
example = ['2014-09-02_2014-11-01', '2016-09-02_2018-11-01']
start = timestamp_tzaware("2012-01-01")
end = timestamp_tzaware("2019-09-01")
gaps = find_daterange_gaps(start, end, example)
gaps == ['2012-01-01-00:00:00+00:00_2014-09-01-00:00:00+00:00',
'2014-11-02-00:00:00+00:00_2016-09-01-00:00:00+00:00',
'2018-11-02-00:00:00+00:00_2019-09-01-00:00:00+00:00']
- Parameters:
start_search (Timestamp) – Start timestamp
end_search (Timestamp) – End timestamp
dateranges (List[str]) – List of daterange strings
- Returns:
List of dateranges
- Return type:
list
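The gap-finding logic can be sketched by walking the sorted dateranges with a cursor. The version below works on plain dates rather than timezone aware Timestamps and uses a hypothetical function name. It assumes the dateranges lie inside the search window, and that each gap ends the day before the next range starts, which matches the dates in the example above.

```python
from datetime import date, timedelta
from typing import List


def find_gaps_sketch(start_search: date, end_search: date, dateranges: List[str]) -> List[str]:
    """Return the 'start_end' gaps not covered by the given dateranges."""
    one_day = timedelta(days=1)
    parsed = sorted(
        (date.fromisoformat(dr.split("_")[0]), date.fromisoformat(dr.split("_")[1]))
        for dr in dateranges
    )
    gaps = []
    cursor = start_search  # first day not yet known to be covered
    for start, end in parsed:
        if start > cursor:
            gaps.append((cursor, start - one_day))
        cursor = max(cursor, end + one_day)
    if cursor <= end_search:
        gaps.append((cursor, end_search))
    return [f"{s.isoformat()}_{e.isoformat()}" for s, e in gaps]


gaps = find_gaps_sketch(
    date(2012, 1, 1),
    date(2019, 9, 1),
    ["2014-09-02_2014-11-01", "2016-09-02_2018-11-01"],
)
```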
- openghg.util.find_duplicate_timestamps(data)[source]#
Check for duplicates
- Parameters:
data (Union[Dataset, DataFrame]) – Data object to check. Should have a time attribute or index
- Returns:
A list of duplicates
- Return type:
list
- openghg.util.first_last_dates(keys)[source]#
Find the first and last timestamp from a list of keys
- Parameters:
keys (List) – List of keys
- Returns:
First and last timestamp
- Return type:
tuple
- openghg.util.in_daterange(start_a, end_a, start_b, end_b)[source]#
Check if two dateranges overlap.
- Parameters:
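start_a – Start datetime of the first daterange
end_a – End datetime of the first daterange
start_b – Start datetime of the second daterange
end_b – End datetime of the second daterange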
- Returns:
True if overlap
- Return type:
bool
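Two closed ranges overlap exactly when each one starts no later than the other ends. A minimal sketch with standard-library datetimes follows; the function name is hypothetical, and treating touching endpoints as an overlap is an assumption.

```python
from datetime import datetime


def ranges_overlap(start_a: datetime, end_a: datetime, start_b: datetime, end_b: datetime) -> bool:
    """True if [start_a, end_a] and [start_b, end_b] share any instant."""
    return start_a <= end_b and end_a >= start_b


overlap_ab = ranges_overlap(
    datetime(2014, 1, 1), datetime(2014, 6, 1),
    datetime(2014, 5, 1), datetime(2014, 12, 1),
)
overlap_ac = ranges_overlap(
    datetime(2014, 1, 1), datetime(2014, 6, 1),
    datetime(2015, 1, 1), datetime(2015, 2, 1),
)
```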
- openghg.util.parse_period(period)[source]#
Parses period input and converts to a value, unit pair.
Check time_offset_definition() for accepted input units.
- Parameters:
period (Union[str, tuple]) – Period of measurements. Should be one of:
“yearly”, “monthly”
suitable pandas Offset Alias
tuple of (value, unit) as would be passed to pandas.Timedelta function
- Returns:
class containing value and associated time period (subclass of NamedTuple)
Examples:
>>> parse_period("12H")
TimePeriod(12, "hours")
>>> parse_period("yearly")
TimePeriod(1, "years")
>>> parse_period("monthly")
TimePeriod(1, "months")
>>> parse_period((1, "minute"))
TimePeriod(1, "minutes")
- Return type:
TimePeriod
- openghg.util.relative_time_offset(value=None, unit=None, period=None)[source]#
Create relative time offset based on inputs. This is based on the pandas DateOffset and Timedelta functions.
Check time_offset_definition() for accepted input units.
If the input is “years” or “months” a relative offset (DateOffset) will be created since these are variable units. For example:
“2013-01-01” + 1 year relative offset = “2014-01-01”
“2012-05-01” + 2 months relative offset = “2012-07-01”
Otherwise the Timedelta function will be used.
- Parameters:
value (Union[int, float, None]) – Value and unit pair to use
unit (Optional[str]) – Value and unit pair to use
period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)
- Returns:
Time offset object, appropriate for the period type
- Return type:
DateOffset/Timedelta
- openghg.util.sanitise_daterange(daterange)[source]#
Make sure the daterange is correct and return tzaware daterange.
- Parameters:
daterange (str) – Daterange str
- Returns:
Timezone aware daterange str
- Return type:
str
- openghg.util.split_daterange_str(daterange_str, date_only=False)[source]#
Split a daterange string to the component start and end Timestamps
- Parameters:
daterange_str (str) – Daterange string of the form 2019-01-01T00:00:00_2019-12-31T00:00:00
date_only (bool) – Return only the date portion of the Timestamp, removing the hours / seconds component
- Returns:
Tuple of start, end timestamps / dates
- Return type:
tuple (Timestamp / datetime.date, Timestamp / datetime.date)
- openghg.util.split_encompassed_daterange(container, contained)[source]#
Checks if one of the passed dateranges contains the other, if so, then split the larger daterange into three sections.
<—a—>
<———b———–>
Here b is split into three and we end up with:
<-dr1-><—a—><-dr2->
- Parameters:
container – Daterange
contained – Daterange
- Returns:
Dictionary of results
- Return type:
dict
- openghg.util.time_offset(value=None, unit=None, period=None)[source]#
Create time offset based on inputs. This will return a Timedelta object and cannot create relative offsets (this includes “weeks”, “months”, “years”).
- Parameters:
value (Union[int, float, None]) – Value and unit pair to use
unit (Optional[str]) – Value and unit pair to use
period (Union[str, tuple, None]) – Suitable input for period (see parse_period() for more details)
- Returns:
Time offset object
- Return type:
Timedelta
- openghg.util.time_offset_definition()[source]#
Returns synonym definition for time offset inputs.
- Accepted inputs are as follows:
“months”: “monthly”, “months”, “month”, “MS”
“years”: “yearly”, “years”, “annual”, “year”, “AS”, “YS”
“weeks”: “weekly”, “weeks”, “week”, “W”
“days”: “daily”, “days”, “day”, “D”
“hours”: “hourly”, “hours”, “hour”, “hr”, “h”, “H”
“minutes”: “minutely”, “minutes”, “minute”, “min”, “m”, “T”
“seconds”: “secondly”, “seconds”, “second”, “sec”, “s”, “S”
This is to ensure the correct keyword for using the DateOffset and Timedelta functions can be supplied (want this to be “years”, “months” etc.)
- Returns:
Dictionary mapping each unit keyword to its list of synonyms
- Return type:
dict
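The synonym table above can be turned into a lookup that maps any accepted input to its canonical keyword. The sketch below copies the table from this entry; the lookup function and its name are assumptions. Matching is case sensitive, which matters because "m" means minutes while "MS" means months.

```python
from typing import Dict, List

# Synonym table as listed in the entry above
TIME_OFFSET_SYNONYMS: Dict[str, List[str]] = {
    "months": ["monthly", "months", "month", "MS"],
    "years": ["yearly", "years", "annual", "year", "AS", "YS"],
    "weeks": ["weekly", "weeks", "week", "W"],
    "days": ["daily", "days", "day", "D"],
    "hours": ["hourly", "hours", "hour", "hr", "h", "H"],
    "minutes": ["minutely", "minutes", "minute", "min", "m", "T"],
    "seconds": ["secondly", "seconds", "second", "sec", "s", "S"],
}


def canonical_unit(unit: str) -> str:
    """Map an accepted synonym to its canonical keyword (hypothetical helper)."""
    for keyword, names in TIME_OFFSET_SYNONYMS.items():
        if unit in names:
            return keyword
    raise ValueError(f"Unrecognised time unit: {unit!r}")


resolved = [canonical_unit(u) for u in ("MS", "m", "AS", "hr", "T")]
```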
- openghg.util.timestamp_epoch()[source]#
Returns the UNIX epoch time (1st of January 1970)
- Returns:
Timestamp object at epoch
- Return type:
pandas.Timestamp
- openghg.util.timestamp_now()[source]#
Returns a pandas timezone (UTC) aware Timestamp for the current time.
- Returns:
Timestamp at current time
- Return type:
pandas.Timestamp
- openghg.util.timestamp_tzaware(timestamp)[source]#
Returns the pandas Timestamp passed as a timezone (UTC) aware Timestamp.
- Parameters:
timestamp (pandas.Timestamp) – Timezone naive Timestamp
- Returns:
Timezone aware Timestamp
- Return type:
pandas.Timestamp
User#
Handling user configuration files.
- openghg.util.create_config(silent=False)[source]#
Creates a user config.
- Parameters:
silent (bool) – Creates the basic configuration file with only the user's object store in a default location.
- Return type:
None
- Returns:
None
- openghg.util.get_user_config_path()[source]#
Returns path to user config file.
This file is created in the user’s home directory in ~/.ghgconfig/openghg/user.conf on Linux / macOS or in LOCALAPPDATA/openghg/openghg.conf on Windows.
- Returns:
Path to user config file
- Return type:
pathlib.Path
Environment detection#
- openghg.util.running_locally()[source]#
Are we running OpenGHG locally?
- Returns:
True if running locally
- Return type:
bool
Miscellaneous#
Some itertools-like functions.