climind.data_types package

There are two main data types implemented in this package: time series and grids. In both cases, a data set consists of a data-carrying part and a metadata part. For a time series, the data-carrying part is a pandas DataFrame; for a grid, it is an xarray Dataset.

Submodules

climind.data_types.grid module

class climind.data_types.grid.GridAnnual(input_data, metadata: CombinedMetadata)

Bases: object

A GridAnnual combines an xarray Dataset with a CombinedMetadata to bring together data and metadata in one object. It represents annual averages of data.

Create an annual gridded data set from an xarray Dataset and CombinedMetadata object.

Parameters

input_data (xa.Dataset) – xarray dataset
metadata (CombinedMetadata) – CombinedMetadata object

get_end_year() → int

Get the last year in the dataset

Returns: Last year in the data set
Return type: int

get_start_year() → int

Get the first year in the dataset

Returns: First year in the dataset
Return type: int

get_year_range(start_year: int, end_year: int)

Select a range of consecutive years from the data set.

Parameters

start_year (int) – start year
end_year (int) – end year

Returns

Returns a GridAnnual containing only data within the specified year range.

Return type

GridAnnual

rank()

Return a data set where the values are the ranks of each grid cell value.

Returns: A GridAnnual containing the values as ranks from highest (1) to lowest.
Return type: GridAnnual
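
To illustrate the ranking convention described above (rank 1 for the highest value), here is a minimal pure-Python sketch that ranks a flat list of values, for example the values of one grid cell. The helper name rank_descending, the handling of ties and the treatment of missing values are assumptions made for this example and are not taken from climind.

```python
import math
from typing import List, Optional


def rank_descending(values: List[float]) -> List[Optional[int]]:
    """Rank values so that the largest gets rank 1 (hypothetical helper).

    NaN entries are left unranked (None) and tied values share a rank.
    Both choices are assumptions of this sketch.
    """
    valid = sorted((v for v in values if not math.isnan(v)), reverse=True)
    ranks: List[Optional[int]] = []
    for v in values:
        if math.isnan(v):
            ranks.append(None)
        else:
            # index() returns the first match, so tied values share a rank
            ranks.append(valid.index(v) + 1)
    return ranks


ranks = rank_descending([0.3, 1.2, float("nan"), 0.7])
```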

running_average(n_year: int)

Calculate an n_year running average of the data in the dataset

Parameters: n_year (int) – Number of years for which the running average is calculated
Returns: Annual gridded dataset which contains the running averages
Return type: GridAnnual
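
The idea of an n_year running average can be sketched on a plain list of annual values. The documentation does not say how the window is aligned or how incomplete windows are treated, so this example assumes a trailing window and returns None until the window is full. The function below is illustrative and is not the climind implementation.

```python
from typing import List, Optional


def running_average(values: List[float], n_year: int) -> List[Optional[float]]:
    """Trailing n_year running mean of an annual series (illustrative only)."""
    if n_year < 1:
        raise ValueError("n_year must be at least 1")
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < n_year:
            # Not enough preceding years to fill the window yet
            out.append(None)
        else:
            window = values[i + 1 - n_year : i + 1]
            out.append(sum(window) / n_year)
    return out


smoothed = running_average([1.0, 2.0, 3.0, 4.0], n_year=2)
```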

select_year_range(start_year: int, end_year: int)

Select a particular range of consecutive years from the data set and throw away the rest.

Parameters

start_year (int) – First year of selection
end_year (int) – Final year of selection

Returns

Returns a GridAnnual containing only data within the specified year range.

Return type

GridAnnual

update_history(message: str) → None

Update the history metadata

Parameters: message (str) – Message to be added to history
Return type: None

write_grid(filename: Path, metadata_filename: Optional[Path] = None, name: Optional[str] = None) → None

Write the grid to file.

Parameters

filename (Path) – Filename to write grid to
metadata_filename (Path) – Filename to write metadata to
name (str) – Optional name to give the data set being written. Note that names should be unique in any data archive.

Return type

None

class climind.data_types.grid.GridMonthly(input_data: xarray.Dataset, metadata: CombinedMetadata)

Bases: object

A GridMonthly combines an xarray Dataset with a CombinedMetadata to bring together data and metadata in one object. It represents monthly averages of data on a regular grid.

Create a GridMonthly object from an xarray Dataset and a CombinedMetadata object.

Parameters

input_data (xa.Dataset) – xarray dataset
metadata (CombinedMetadata) – CombinedMetadata object

calculate_regional_average(regions, region_number, land_only=True) → TimeSeriesMonthly

Calculate a regional average from the grid. The region is specified by a geopandas Geodataframe and the index (region_number) of the chosen shape. By default, the output is masked to land areas only; this can be switched off by setting land_only to False.

Parameters

regions (Geodataframe) – geopandas Geodataframe specifying the region to be averaged over
region_number (int) – the index of the particular region in the Geodataframe
land_only (bool) – By default, output is masked to land areas only. To calculate a full area average, set land_only to False

Returns

Returns time series of area averages.

Return type

ts.TimeSeriesMonthly

calculate_regional_average_missing(regions, region_number, threshold=0.3, land_only=True) → TimeSeriesMonthly

Calculate a regional average from the grid, returning NaN where the area covered by data in the region falls below threshold. The region is specified by a geopandas Geodataframe and the index (region_number) of the chosen shape. By default, the output is masked to land areas only; this can be switched off by setting land_only to False.

Parameters

regions (Geodataframe) – geopandas Geodataframe specifying the region to be averaged over
region_number (int) – the index of the particular region in the Geodataframe
threshold (float) – If the area covered by data in the region drops below this threshold then NaN is returned.
land_only (bool) – By default, output is masked to land areas only. To calculate a full area average, set land_only to False

Returns

Returns time series of area averages.

Return type

ts.TimeSeriesMonthly
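
The role of the threshold parameter can be sketched for a single time step. This example assumes that coverage is measured as the fraction of the region's area that has data and that cell area is approximated by the cosine of latitude. Neither detail is confirmed by the documentation above, and the helper name regional_average is hypothetical.

```python
import math
from typing import List, Optional


def regional_average(
    cells: List[Optional[float]],
    latitudes: List[float],
    threshold: float = 0.3,
) -> float:
    """Area-weighted mean of the cells in a region (hypothetical helper).

    cells holds one value per grid cell in the region, None where data are
    missing. Weights are cos(latitude), a common stand-in for cell area.
    """
    weights = [math.cos(math.radians(lat)) for lat in latitudes]
    total_area = sum(weights)
    covered = [(v, w) for v, w in zip(cells, weights) if v is not None]
    covered_area = sum(w for _, w in covered)
    # Too little of the region has data: report the average as missing
    if total_area == 0 or covered_area / total_area < threshold:
        return float("nan")
    return sum(v * w for v, w in covered) / covered_area


full = regional_average([1.0, 3.0], [0.0, 0.0])
sparse = regional_average([1.0, None, None, None], [0.0, 0.0, 0.0, 0.0])
```

In this sketch, sparse is NaN because only a quarter of the region's area has data, which is below the default threshold of 0.3.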

calculate_time_mean(cumulative=False)

Calculate the time mean of the map

Returns: Returns a GridMonthly containing the time mean of the data.
Return type: GridMonthly

get_last_month() → datetime

Get the date of the last month in the dataset

Returns: Date of the last month in the dataset
Return type: datetime

make_annual()

Calculate an annual average from a monthly grid by taking the arithmetic mean of available monthly anomalies.

Returns: Return annual average of the grid
Return type: GridAnnual
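
For a single grid cell, taking the arithmetic mean of the available monthly anomalies could look like the sketch below. The row layout (year, month, value) and the name annual_means are assumptions for illustration. Months with no data are absent from the input, so they do not contribute to the mean.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def annual_means(monthly: List[Tuple[int, int, float]]) -> Dict[int, float]:
    """Average whatever months are available in each year (illustrative)."""
    by_year = defaultdict(list)
    for year, _month, value in monthly:
        by_year[year].append(value)
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}


annual = annual_means([(2000, 1, 0.25), (2000, 2, 0.75), (2001, 1, 1.0)])
```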

rebaseline(first_year: int, final_year: int) → xarray.Dataset

Change the baseline of the data to the period between first_year and final_year by subtracting the average of the available data between those two years (inclusive).

Parameters

first_year (int) – First year of climatology period
final_year (int) – Final year of climatology period

Returns

Changes the dataset in place, but also returns the dataset if needed

Return type

xa.Dataset
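
A rebaselining step can be sketched for one grid cell as follows. This example assumes the baseline mean is computed separately for each calendar month, as is usual for monthly anomalies. The description above does not state whether climind does this, and the sketch returns a new list rather than modifying data in place.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int, float]  # (year, month, value), a layout assumed here


def rebaseline(rows: List[Row], first_year: int, final_year: int) -> List[Row]:
    """Subtract a per-calendar-month mean over first_year..final_year inclusive."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for year, month, value in rows:
        if first_year <= year <= final_year:
            sums[month] = sums.get(month, 0.0) + value
            counts[month] = counts.get(month, 0) + 1
    out: List[Row] = []
    for year, month, value in rows:
        if month in sums:
            out.append((year, month, value - sums[month] / counts[month]))
        else:
            # No baseline data for this calendar month: anomaly is undefined
            out.append((year, month, float("nan")))
    return out


rows = [(1990, 1, 1.0), (1991, 1, 3.0), (1992, 1, 5.0)]
anomalies = rebaseline(rows, 1990, 1991)
```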

select_period(start_year: int, start_month: int, end_year: int, end_month: int)

Select a period from the grid specified by start year and month and end year and month, inclusive.

Parameters

start_year (int) – Year of start date
start_month (int) – Month of start date
end_year (int) – Year of end date
end_month (int) – Month of end date

Returns

Returns a GridMonthly containing only data within the specified date range.

Return type

GridMonthly
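
The inclusive selection can be sketched on plain (year, month, value) rows by comparing (year, month) tuples. The function below is a hypothetical stand-in that operates on a list, not on a GridMonthly.

```python
from typing import List, Tuple

Row = Tuple[int, int, float]  # (year, month, value), a layout assumed here


def select_period(
    rows: List[Row], start_year: int, start_month: int, end_year: int, end_month: int
) -> List[Row]:
    """Keep rows whose (year, month) lies in the inclusive range given."""
    start = (start_year, start_month)
    end = (end_year, end_month)
    # Tuple comparison orders by year first, then by month
    return [row for row in rows if start <= (row[0], row[1]) <= end]


rows = [(1999, 12, 0.1), (2000, 1, 0.2), (2000, 6, 0.3), (2000, 7, 0.4)]
selected = select_period(rows, 2000, 1, 2000, 6)
```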

select_year_and_month(year: int, month: int)

Select a particular month from the data set and throw away the rest.

Parameters

year (int) – Year of selection
month (int) – Month of selection

Returns

Returns a GridMonthly containing only data for the specified year and month.

Return type

GridMonthly

update_history(message: str) → None

Update the history metadata with a message.

Parameters: message (str) – Message to be added to history
Return type: None

climind.data_types.grid.get_1d_transfer(zero_point_original: float, grid_space_original: float, zero_point_target: float, grid_space_target: float, index_in_original: int) → tuple

Find the overlapping grid spacings for a new grid based on an index in the old grid

Parameters

zero_point_original (float) – longitude or latitude of the zero-indexed grid cell
grid_space_original (float) – grid spacing in degrees
zero_point_target (float) – longitude or latitude of the zero-indexed grid cell in the target grid
grid_space_target (float) – grid spacing in degrees of the target grid
index_in_original (int) – index of the gridcell in the original grid

Returns

Returns the longitude of the first grid cell in the new grid, the number of steps, and the first and last indices on the new grid.

Return type

tuple

climind.data_types.grid.get_start_and_end_year(all_datasets: List[GridAnnual]) → Tuple[int, int]

Given a list of GridAnnual datasets, find the earliest start year and the latest end year

Parameters: all_datasets (List[GridAnnual]) – List of datasets for which we want to find the first and last year
Return type: Tuple[int, int]

climind.data_types.grid.make_standard_grid(out_grid: numpy.ndarray, start_date: datetime, freq: str, number_of_times: int) → xarray.Dataset

Make the standard 5x5 grid from a numpy array, start date, temporal frequency and number of time steps.

Parameters

out_grid (np.ndarray) – Numpy array containing the data. Shape should be (number_of_times, 36, 72)
start_date (datetime) – Date of the first time step
freq (str) – Temporal frequency
number_of_times (int) – Number of time steps, should match the first dimension of the out_grid

Returns

xarray Dataset containing the data in out_grid with the specified temporal frequency and number of time steps

Return type

xa.Dataset
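
The expected shape (number_of_times, 36, 72) follows from dividing 180 degrees of latitude and 360 degrees of longitude into 5-degree cells. The sketch below computes the cell centres for such a grid. The starting edges (-90 and -180) and the south-to-north, west-to-east ordering are assumptions, since the documentation does not state the orientation.

```python
from typing import List


def cell_centres(first_edge: float, spacing: float, count: int) -> List[float]:
    """Centres of count equally spaced cells starting at first_edge."""
    return [first_edge + spacing * (i + 0.5) for i in range(count)]


# 180 / 5 = 36 latitude cells and 360 / 5 = 72 longitude cells
latitudes = cell_centres(-90.0, 5.0, 36)
longitudes = cell_centres(-180.0, 5.0, 72)
```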

climind.data_types.grid.make_xarray(target_grid, times, latitudes, longitudes, variable: str = 'tas_mean') → xarray.Dataset

Make an xarray Dataset for a regular lat-lon grid from a numpy grid (ntime, nlat, nlon), and arrays of time (ntime), latitude (nlat) and longitude (nlon).

Parameters

target_grid (np.ndarray) – numpy array of shape (ntime, nlat, nlon)
times (np.ndarray) – Array of times, shape (ntime)
latitudes (np.ndarray) – Array of latitudes, shape (nlat)
longitudes (np.ndarray) – Array of longitudes, shape (nlon)
variable (str) – Variable name

Returns

Dataset built from the input components

Return type

xa.Dataset

climind.data_types.grid.median_of_datasets(all_datasets: List[GridAnnual]) → GridAnnual

Calculate the median of a list of GridAnnual data sets

Parameters: all_datasets (List[GridAnnual]) – List of GridAnnual datasets from which the medians will be calculated.
Return type: GridAnnual

climind.data_types.grid.process_datasets(all_datasets: List[GridAnnual], grid_type: str) → GridAnnual

Calculate the median or range (depending on selected type) of a list of GridAnnual data sets. Medians are calculated on a grid cell by grid cell basis based on all available data in the list of data sets.

Parameters

all_datasets (List[GridAnnual]) – list of GridAnnual data sets
grid_type (str) – Either ‘median’ or ‘range’

Returns

Data set containing the median (or half-range) values from all the data sets supplied

Return type

GridAnnual
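
For one grid cell, the median or half-range across data sets can be sketched as below. The helper name combine_cell and the use of None for missing values are assumptions of this example. The half-range is taken here to be half the difference between the largest and smallest available value.

```python
import statistics
from typing import List, Optional


def combine_cell(values: List[Optional[float]], grid_type: str) -> float:
    """Median or half-range of the available values for one grid cell."""
    available = [v for v in values if v is not None]
    if not available:
        return float("nan")
    if grid_type == "median":
        return statistics.median(available)
    if grid_type == "range":
        return (max(available) - min(available)) / 2.0
    raise ValueError("grid_type must be 'median' or 'range'")


cell_values = [1.0, None, 4.0, 2.0]  # one value per data set
median_value = combine_cell(cell_values, "median")
half_range = combine_cell(cell_values, "range")
```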

climind.data_types.grid.range_of_datasets(all_datasets: List[GridAnnual]) → GridAnnual

Calculate the half-range of a list of GridAnnual data sets

Parameters: all_datasets (List[GridAnnual]) – List of GridAnnual datasets from which the ranges will be calculated.
Return type: GridAnnual

climind.data_types.grid.rank_array(in_array: numpy.ndarray) → int

Rank array

Parameters: in_array (np.ndarray) – Array to be ranked
Return type: int

climind.data_types.grid.simple_regrid(ingrid: numpy.ndarray, lon0: float, lat0: float, dx: float, target_dy: float) → numpy.ndarray

Perform a simple regridding, using a simple average of grid cells from the original grid that fall within the target grid cell.

Parameters

ingrid (np.ndarray) – Starting grid which we want to regrid
lon0 (float) – Longitude of zero-indexed grid cell in longitudinal direction
lat0 (float) – Latitude of zero-indexed grid cell in latitudinal direction
dx (float) – Grid spacing in degrees
target_dy (float) – Target grid spacing

Returns

Returns regridded array.

Return type

np.ndarray
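
The averaging idea behind this kind of regridding can be sketched with nested lists. The example assumes the target spacing is an integer multiple of the original spacing, so that each target cell covers a whole block of original cells, and it ignores missing values. It takes a simplified factor argument and does not reproduce the signature of simple_regrid.

```python
from typing import List, Optional

Grid = List[List[Optional[float]]]


def block_average(grid: Grid, factor: int) -> Grid:
    """Average factor x factor blocks of cells, skipping missing (None) values."""
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out: Grid = []
    for r0 in range(0, n_rows, factor):
        row: List[Optional[float]] = []
        for c0 in range(0, n_cols, factor):
            block = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if grid[r][c] is not None
            ]
            # A target cell with no data at all stays missing
            row.append(sum(block) / len(block) if block else None)
        out.append(row)
    return out


fine = [[1.0, 3.0, 5.0, 7.0], [1.0, 3.0, None, 6.0]]
coarse = block_average(fine, factor=2)
```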

climind.data_types.timeseries module

class climind.data_types.timeseries.AveragesCollection(all_datasets)

Bases: object

A simple class to perform specific tasks on lists of TimeSeriesAnnual

best_estimate()

count()

lower_range()

range()

upper_range()

class climind.data_types.timeseries.TimeSeries(metadata: Optional[CombinedMetadata] = None)

Bases: ABC

A base class for representing time series data sets. Note that this class should not generally be used and only its subclasses TimeSeriesMonthly, TimeSeriesAnnual and TimeSeriesIrregular should be used. This class contains shared functionality from these classes but does not work on its own.

add_offset(**kwargs)

get_first_and_last_year() → Tuple[int, int]

Get the first and last year in the series

Returns: first and last year
Return type: Tuple[int, int]

abstract get_string_date_range() → str

Create a string which specifies the date range covered by the time series

Return type: str

manually_set_baseline(**kwargs)

select_year_range(**kwargs)

update_history(message: str) → None

Update the history metadata

Parameters: message (str) – Message to be added to history
Return type: None

write_generic_csv(filename: Path, metadata_filename: Path, monthly: bool, uncertainty: bool, irregular: bool, columns_to_write: List[str]) → None

Write the dataset out into csv format

Parameters

filename (Path) – Path of the csv file to which the data will be written.
metadata_filename (Path) – Path of the json file to which the metadata will be written.
monthly (bool) – Set to True for monthly data
uncertainty (bool) – Set to True to print uncertainties
irregular (bool) – Set to True for irregular data
columns_to_write (List[str]) – List of the columns from the dataframe to be written to the data file

Return type

None

class climind.data_types.timeseries.TimeSeriesAnnual(years: list, data: list, metadata=None, uncertainty: Optional[list] = None)

Bases: TimeSeries

A TimeSeriesAnnual combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents annual averages of data.

Create TimeSeriesAnnual object from its components.

Parameters

years (list) – List of years
data (list) – List of data values
metadata (CombinedMetadata) – Dictionary containing the metadata

df

Pandas dataframe containing the time and data information

Type: pd.DataFrame

metadata

Dictionary containing the metadata. The only guaranteed entry is ‘history’

Type: dict

add_year(year: int, value: float, uncertainty: Optional[float] = None) → None

Add a year of data.

Parameters

year (int) – the year to be added
value (float) – the data value to be added
uncertainty – the uncertainty of the data value to be added (optional)

Return type

None

generate_dates(time_units: str) → List[datetime]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters: time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”
Returns: List of dates
Return type: List[datetime]

get_rank_from_year(**kwargs)

get_string_date_range() → str

Create a string which specifies the date range covered by the TimeSeriesAnnual in the format YYYY-YYYY

Returns: String that specifies the date range covered
Return type: str

get_uncertainty_from_year(**kwargs)

get_value_from_year(**kwargs)

get_year_axis() → List[float]

Return a year axis with dates represented as decimal years.

Returns: List of dates as decimal years.
Return type: List[float]

get_year_from_rank(**kwargs)

static make_from_df(df: pandas.DataFrame, metadata: CombinedMetadata)

Create a TimeSeriesAnnual from a pandas data frame.

Parameters

df (pd.DataFrame) – Pandas dataframe containing columns ‘year’ and ‘data’
metadata (dict) – Dictionary containing the metadata

Returns

TimeSeriesAnnual created from the elements in the dataframe and metadata.

Return type

TimeSeriesAnnual

rebaseline(**kwargs)

running_mean(**kwargs)

running_stdev(**kwargs)

select_decade(**kwargs)

write_csv(filename, metadata_filename=None)

Write the timeseries to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters

filename (Path) – Path of the filename to write the data to
metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

write_simple_csv(filename)

class climind.data_types.timeseries.TimeSeriesIrregular(years: List[int], months: List[int], days: List[int], data: List[float], metadata: Optional[CombinedMetadata] = None, uncertainty: Optional[List[float]] = None)

Bases: TimeSeries

A TimeSeriesIrregular combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents non-monthly, non-annual averages of data such as weekly, or 5-day averages.

Create TimeSeriesIrregular object.

Parameters

years (List[int]) – List of integers specifying the year of each data point
months (List[int]) – List of integers specifying the month of each data point
days (List[int]) – List of integers specifying the day of each data point
data (List[float]) – List of floats with the data values
metadata (CombinedMetadata) – CombinedMetadata object holding the metadata for the dataset
uncertainty (List[float]) – List of floats with the uncertainty values for each data point

generate_dates(time_units: str) → List[int]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters: time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”
Return type: List[int]

get_start_and_end_dates() → Tuple[datetime, datetime]

Get the first and last dates in the dataset

Return type: Tuple[datetime, datetime]

get_string_date_range() → str

Create a string which specifies the date range covered by the TimeSeriesIrregular in the format YYYY.MM.DD-YYYY.MM.DD

Returns: String that specifies the date range covered
Return type: str

get_year_axis() → List[float]

Return a year axis in which all dates are represented as decimal years. January 1st 1984 is 1984.00.

Returns: List of dates represented as decimal years.
Return type: List[float]

make_monthly(**kwargs)

rebaseline(**kwargs)

write_csv(filename: Path, metadata_filename: Optional[Path] = None) → None

Write the timeseries to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters

filename (Path) – Path of the filename to write the data to
metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

class climind.data_types.timeseries.TimeSeriesMonthly(years: List[int], months: List[int], data: List[float], metadata: Optional[CombinedMetadata] = None, uncertainty: Optional[List[float]] = None)

Bases: TimeSeries

A TimeSeriesMonthly combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents monthly averages of data.

Create TimeSeriesMonthly object.

Parameters

years (List[int]) – List of years
months (List[int]) – List of months
data (List[float]) – List of data values
metadata (CombinedMetadata) – CombinedMetadata object containing the metadata
uncertainty (Optional[List[float]]) –

df

Pandas dataframe used to contain the time and data information.

Type: pd.DataFrame

metadata

Dictionary containing metadata. The only guaranteed entry is “history”

Type: dict

generate_dates(time_units: str) → List[int]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters: time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”
Return type: List[int]
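
One way to turn a units string of this form into integer times is sketched below. It assumes that only "days since" units are used and that each month is placed on its first day. Both are assumptions for illustration, and the helper days_since is not part of climind.

```python
from datetime import datetime
from typing import List


def days_since(time_units: str, years: List[int], months: List[int]) -> List[int]:
    """Whole days between the reference date in time_units and each month start."""
    prefix = "days since "
    if not time_units.startswith(prefix):
        raise ValueError("only 'days since ...' units are handled in this sketch")
    # Keep the date part of the reference and drop the time of day
    reference = datetime.strptime(time_units[len(prefix):].split()[0], "%Y-%m-%d")
    return [(datetime(y, m, 1) - reference).days for y, m in zip(years, months)]


offsets = days_since("days since 1800-01-01 00:00:00.0", [1800, 1800], [1, 2])
```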

get_rank_from_year_and_month(**kwargs)

get_start_and_end_dates() → Tuple[datetime, datetime]

Get the first and last dates in the dataset

Returns: Start and end dates.
Return type: Tuple[datetime, datetime]

get_string_date_range() → str

Create a string which specifies the date range covered by the TimeSeriesMonthly in the format YYYY.MM-YYYY.MM

Returns: String that specifies the date range covered
Return type: str

get_uncertainty(year: int, month: int) → Optional[float]

Get the current uncertainty for a particular year and month

Parameters

year (int) – Year for which the uncertainty is required.
month (int) – Month for which the uncertainty is required.

Returns

Value for the specified year and month or None if it does not exist

Return type

Optional[float]

get_value(year: int, month: int) → Optional[float]

Get the current value for a particular year and month

Parameters

year (int) – Year for which the value is required.
month (int) – Month for which the value is required.

Returns

Value for the specified year and month or None if it does not exist

Return type

Optional[float]

get_year_axis() → List[float]

Return a year axis as decimal year. 1st January 1984 is 1984.00.

Returns: List of dates expressed as a decimal year.
Return type: List[float]
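
A decimal-year axis consistent with "1st January 1984 is 1984.00" can be sketched by placing each month at its start and dividing the year into twelve equal parts. The exact convention used by climind is not given above, so treat this formula as an assumption.

```python
from typing import List


def decimal_year(year: int, month: int) -> float:
    """Decimal year of the start of a month, with twelve equal months assumed."""
    return year + (month - 1) / 12.0


def year_axis(years: List[int], months: List[int]) -> List[float]:
    """Decimal years for paired lists of years and months."""
    return [decimal_year(y, m) for y, m in zip(years, months)]


axis = year_axis([1984, 1984], [1, 7])
```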

make_annual(**kwargs)

make_annual_by_selecting_month(**kwargs)

static make_from_df(df: pandas.DataFrame, metadata: CombinedMetadata)

Create a TimeSeriesMonthly from a pandas data frame.

Parameters

df (pd.DataFrame) – Pandas dataframe containing columns ‘year’, ‘month’ and ‘data’ (optionally ‘uncertainty’)
metadata (dict) – Dictionary containing the metadata

Returns

TimeSeriesMonthly built from input components.

Return type

TimeSeriesMonthly

rebaseline(**kwargs)

write_csv(filename: Path, metadata_filename: Optional[Path] = None) → None

Write the TimeSeriesMonthly to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters

filename (Path) – Path of the filename to write the data to
metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

zero_on_month(**kwargs)

climind.data_types.timeseries.create_common_dataframe(dataframes: List[pandas.DataFrame], monthly: bool = False, annual: bool = False, irregular: bool = False) → pandas.DataFrame

Given a list of dataframes make a single dataframe which has rows corresponding to all time steps in the input dataframes

Parameters

dataframes (List[pd.DataFrame]) – List of dataframes which are to be used as the basis for the common data frame
monthly (bool) – Set to true for monthly data
annual (bool) – Set to true for annual data
irregular (bool) – Set to true for daily/irregular data

Returns

Pandas dataframe with one row for each row in the input dataframes

Return type

pd.DataFrame

climind.data_types.timeseries.equalise_datasets(all_datasets: List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) → pandas.DataFrame

Given a list of datasets, combine their data columns into a single data frame.

Parameters: all_datasets (List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) – List of time series datasets whose data is to be combined in a single data frame. The data column from each data set will be combined into a single data frame, with each data column becoming a column identified by the “name” of the data set from its metadata.
Returns: Pandas dataframe containing the data columns from all the input datasets.
Return type: pd.DataFrame
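
The combination described above can be sketched without pandas by treating each annual data set as a mapping from year to value, keyed by the data set name. The function equalise and the use of None for years a data set does not cover are assumptions of this example.

```python
from typing import Dict, List


def equalise(datasets: Dict[str, Dict[int, float]]) -> List[dict]:
    """One row per year found in any data set, one column per data set name."""
    all_years = sorted({year for series in datasets.values() for year in series})
    rows = []
    for year in all_years:
        row = {"year": year}
        for name, series in datasets.items():
            # None marks a year that this data set does not cover
            row[name] = series.get(year)
        rows.append(row)
    return rows


table = equalise({"A": {2000: 0.1, 2001: 0.2}, "B": {2001: 0.5}})
```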

climind.data_types.timeseries.get_list_of_unique_variables(all_datasets: List[TimeSeriesAnnual]) → List[str]

Given a list of TimeSeriesAnnual, get a list of the unique variable names represented in that list.

Parameters: all_datasets (List[TimeSeriesAnnual]) –
Returns: List of the unique variable names.
Return type: List[str]

climind.data_types.timeseries.get_start_and_end_year(all_datasets: List[TimeSeriesAnnual]) → Tuple[Optional[int], Optional[int]]

Given a list of TimeSeriesAnnual, extract the first year in any of the data sets and the last year in any of the data sets.

Parameters: all_datasets (List[TimeSeriesAnnual]) – List of datasets from which to extract the earliest first year and latest final year.
Returns: Return the first and last years in the list of data sets
Return type: Tuple[Optional[int], Optional[int]]
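
The logic can be sketched on a list of (first_year, last_year) pairs, one per data set. Returning (None, None) for an empty list is an assumption suggested by the Optional return type and is not confirmed by the documentation.

```python
from typing import List, Optional, Tuple


def start_and_end_year(
    year_ranges: List[Tuple[int, int]]
) -> Tuple[Optional[int], Optional[int]]:
    """Earliest first year and latest last year across all data sets."""
    if not year_ranges:
        return None, None
    first = min(start for start, _ in year_ranges)
    last = max(end for _, end in year_ranges)
    return first, last


span = start_and_end_year([(1850, 2020), (1880, 2022), (1979, 2021)])
```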

climind.data_types.timeseries.log_activity(in_function: Callable) → Callable

Decorator function to log name of function run and with which arguments. This aims to provide some traceability in the output.

Parameters: in_function (Callable) – The function to be decorated
Return type: Callable
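
A decorator of this kind can be sketched with the standard logging and functools modules. The name log_call_sketch, the log message format and the decorated example function are hypothetical. The real log_activity may record different information or write it elsewhere, for example to the history metadata.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def log_call_sketch(in_function: Callable) -> Callable:
    """Log the name and arguments of each call, then run the function."""

    @functools.wraps(in_function)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        logger.info(
            "Running %s with args=%r kwargs=%r", in_function.__name__, args, kwargs
        )
        return in_function(*args, **kwargs)

    return wrapper


@log_call_sketch
def shift(value: float, offset: float = 0.0) -> float:
    """Toy function standing in for a time series method."""
    return value + offset


result = shift(1.0, offset=0.5)
```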

climind.data_types.timeseries.make_combined_series(all_datasets: List[TimeSeriesAnnual]) → TimeSeriesAnnual

Combine a list of datasets into a single TimeSeriesAnnual by taking the arithmetic mean of all available datasets for each year. Merges the metadata for all the input time series.

Parameters: all_datasets (List[TimeSeriesAnnual]) – List of datasets to be combined
Returns: TimeSeriesAnnual which is the mean of all available datasets in each year.
Return type: TimeSeriesAnnual
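
Taking the mean of all available data sets in each year can be sketched with year-to-value mappings. The metadata merging mentioned above is not shown, and the function name combine is an assumption for this example.

```python
from typing import Dict, List


def combine(series_list: List[Dict[int, float]]) -> Dict[int, float]:
    """Mean across data sets for each year, using only those that have the year."""
    years = sorted({year for series in series_list for year in series})
    combined = {}
    for year in years:
        values = [series[year] for series in series_list if year in series]
        combined[year] = sum(values) / len(values)
    return combined


combined = combine([{2000: 1.0, 2001: 2.0}, {2001: 4.0}])
```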

climind.data_types.timeseries.superset_dataset_list(all_datasets: List[TimeSeriesAnnual], variables: List[str]) → List[List[TimeSeriesAnnual]]

Given a list of variables, create a list where each entry is a list of all TimeSeriesAnnual objects corresponding to the variable in that index position.

Parameters

all_datasets (List[TimeSeriesAnnual]) – List of datasets
variables (List[str]) – List of variable names

Returns

List of lists of TimeSeriesAnnual.

Return type

List[List[TimeSeriesAnnual]]
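
The grouping can be sketched with each data set reduced to a (variable, name) pair. In climind the variable would presumably come from the metadata of each TimeSeriesAnnual, but that detail is an assumption here, as are the example variable and data set names.

```python
from typing import List, Tuple

Dataset = Tuple[str, str]  # (variable, name), a stand-in for TimeSeriesAnnual


def group_by_variable(
    all_datasets: List[Dataset], variables: List[str]
) -> List[List[Dataset]]:
    """Entry i of the result lists every data set whose variable is variables[i]."""
    return [
        [dataset for dataset in all_datasets if dataset[0] == variable]
        for variable in variables
    ]


datasets = [("tas", "A"), ("ohc", "B"), ("tas", "C")]
grouped = group_by_variable(datasets, ["tas", "ohc"])
```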

climind.data_types.timeseries.write_dataset_summary_file(all_datasets, csv_filename)

climind.data_types.timeseries.write_dataset_summary_file_with_metadata(all_datasets: List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]], csv_filename: Union[str, Path]) → None

Given a list of time series data sets, write them out in a single BADC CSV format csv file with complete metadata.

Parameters

all_datasets (List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) – A list of time series which are going to be equalised
csv_filename (str or Path) – The name of the file to which the summary will be written.

Return type

None