climind.data_types package

There are two main data types implemented in this package: timeseries and grids. In each of those two cases, the data set consists of a data-carrying part and a metadata part. For timeseries, the data-carrying part is a pandas dataframe and for a grid, it’s an xarray dataset.

Submodules

climind.data_types.grid module

class climind.data_types.grid.GridAnnual(input_data, metadata: CombinedMetadata)

Bases: object

A GridAnnual combines an xarray Dataset with a CombinedMetadata to bring together data and metadata in one object. It represents annual averages of data.

Create an annual gridded data set from an xarray Dataset and CombinedMetadata object.

Parameters
  • input_data (xa.Dataset) – xarray dataset

  • metadata (CombinedMetadata) – CombinedMetadata object

get_end_year() int

Get the last year in the dataset

Returns

Last year in the data set

Return type

int

get_start_year() int

Get the first year in the dataset

Returns

First year in the dataset

Return type

int

get_year_range(start_year: int, end_year: int)

Select a range of consecutive years from the data set.

Parameters
  • start_year (int) – start year

  • end_year (int) – end year

Returns

Returns a GridAnnual containing only data within the specified year range.

Return type

GridAnnual

rank()

Return a data set where the values are the ranks of each grid cell value.

Returns

Return a GridAnnual containing the values as ranks from highest (1) to lowest.

Return type

GridAnnual
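As a rough illustration of ranking from highest (1) to lowest, the sketch below ranks a plain numpy array along its time axis. It does not call climind. The helper name rank_highest_first, the assumption that ranks are computed over years separately at each grid cell, and the lack of any special handling for ties or missing values are all choices made for this example only.

```python
import numpy as np


def rank_highest_first(values: np.ndarray) -> np.ndarray:
    """Rank along axis 0 so that the largest value gets rank 1 (hypothetical helper)."""
    # Sorting the negated values orders the indices from largest to smallest.
    order = np.argsort(-values, axis=0)
    # A second argsort turns that ordering into zero-based ranks.
    return np.argsort(order, axis=0) + 1


# Three "years" of a grid with one latitude and two longitudes
grid = np.array([[[0.1, 0.9]], [[0.5, 0.2]], [[0.3, 0.4]]])
ranks = rank_highest_first(grid)
```

In this example the first grid cell has its highest value in the second year, so that year receives rank 1 there.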

running_average(n_year: int)

Calculate an n_year running average of the data in the dataset

Parameters

n_year (int) – Number of years for which the running average is calculated

Returns

Annual gridded dataset which contains the running averages

Return type

GridAnnual
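The following is a minimal sketch of an n_year running average on a plain numpy array with shape (n_years, nlat, nlon). It assumes a trailing window whose mean is assigned to the final year of the window, with NaN where a full window is not yet available. The documentation above does not say whether the package uses a trailing or centred window, so treat that choice as an assumption.

```python
import numpy as np


def running_average_sketch(values: np.ndarray, n_year: int) -> np.ndarray:
    """Trailing n_year mean along axis 0 (hypothetical helper, not climind code)."""
    out = np.full(values.shape, np.nan, dtype=float)
    for i in range(n_year - 1, values.shape[0]):
        # Average the window that ends at year index i
        out[i] = values[i - n_year + 1 : i + 1].mean(axis=0)
    return out


annual = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(5, 1, 1)
smoothed = running_average_sketch(annual, 3)
```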

select_year_range(start_year: int, end_year: int)

Select a particular range of consecutive years from the data set and throw away the rest.

Parameters
  • start_year (int) – First year of selection

  • end_year (int) – Final year of selection

Returns

Returns a GridAnnual containing only data within the specified year range.

Return type

GridAnnual

update_history(message: str) None

Update the history metadata

Parameters

message (str) – Message to be added to history

Return type

None

write_grid(filename: Path, metadata_filename: Optional[Path] = None, name: Optional[str] = None) None

Write the grid to file.

Parameters
  • filename (Path) – Filename to write grid to

  • metadata_filename (Path) – Filename to write metadata to

  • name (str) – Optional name to give the data set being written. Note that names should be unique in any data archive.

Return type

None

class climind.data_types.grid.GridMonthly(input_data: xarray.Dataset, metadata: CombinedMetadata)

Bases: object

A GridMonthly combines an xarray Dataset with a CombinedMetadata to bring together data and metadata in one object. It represents monthly averages of data on a regular grid.

Create a GridMonthly object from an xarray Dataset and a CombinedMetadata object.

Parameters
  • input_data (xa.Dataset) – xarray dataset

  • metadata (CombinedMetadata) – CombinedMetadata object

calculate_regional_average(regions, region_number, land_only=True) TimeSeriesMonthly

Calculate a regional average from the grid. The region is specified by a geopandas Geodataframe and the index (region_number) of the chosen shape. By default, the output is masked to land areas only; this can be switched off by setting land_only to False.

Parameters
  • regions (Geodataframe) – geopandas Geodataframe specifying the region to be averaged over

  • region_number (int) – the index of the particular region in the Geodataframe

  • land_only (bool) – By default, output is masked to land areas only. To calculate a full area average, set land_only to False

Returns

Returns time series of area averages.

Return type

ts.TimeSeriesMonthly

calculate_regional_average_missing(regions, region_number, threshold=0.3, land_only=True) TimeSeriesMonthly

Calculate a regional average from the grid, returning NaN where the area covered by data in the region drops below threshold. The region is specified by a geopandas Geodataframe and the index (region_number) of the chosen shape. By default, the output is masked to land areas only; this can be switched off by setting land_only to False.

Parameters
  • regions (Geodataframe) – geopandas Geodataframe specifying the region to be averaged over

  • region_number (int) – the index of the particular region in the Geodataframe

  • threshold (float) – If the area covered by data in the region drops below this threshold then NaN is returned.

  • land_only (bool) – By default, output is masked to land areas only. To calculate a full area average, set land_only to False

Returns

Returns time series of area averages.

Return type

ts.TimeSeriesMonthly
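To make the role of threshold concrete, here is a hedged sketch of an area-weighted average for a single time step that returns NaN when too little of the region is covered by data. It is not the package's implementation. The helper name, the cosine-of-latitude weights, the boolean in_region mask (standing in for the geopandas shape) and the interpretation of threshold as a fraction of the region's area are all assumptions.

```python
import numpy as np


def area_average_with_threshold(field, latitudes, in_region, threshold=0.3):
    """Weighted mean of one 2D field, or NaN if coverage is too low (hypothetical helper).

    field: array (nlat, nlon) with NaN marking missing data
    latitudes: array (nlat) of grid cell centre latitudes in degrees
    in_region: boolean array (nlat, nlon), True inside the region
    """
    field = np.asarray(field, dtype=float)
    # Assumed weighting: grid cell area is proportional to cos(latitude)
    weights = np.cos(np.deg2rad(np.asarray(latitudes)))[:, np.newaxis] * np.ones(field.shape)
    weights = np.where(in_region, weights, 0.0)
    has_data = ~np.isnan(field)
    total_area = weights.sum()
    covered_area = weights[has_data].sum()
    if total_area == 0 or covered_area / total_area < threshold:
        return np.nan
    return float(np.sum(field[has_data] * weights[has_data]) / covered_area)


lats = np.array([-2.5, 2.5])
field = np.array([[1.0, np.nan], [3.0, np.nan]])
region = np.ones((2, 2), dtype=bool)

average = area_average_with_threshold(field, lats, region, threshold=0.3)
too_sparse = area_average_with_threshold(field, lats, region, threshold=0.6)
```

Half of the region has data here, so the first call returns an average and the second, with a stricter threshold, returns NaN.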

calculate_time_mean(cumulative=False)

Calculate the time mean of the map

Returns

Returns a GridMonthly containing the time mean of the data.

Return type

GridMonthly

get_last_month() datetime

Get the date of the last month in the dataset

Returns

Date of the last month in the dataset

Return type

datetime

make_annual()

Calculate an annual average from a monthly grid by taking the arithmetic mean of available monthly anomalies.

Returns

Return annual average of the grid

Return type

GridAnnual
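The idea of averaging the available monthly anomalies can be sketched with numpy as follows. This is an illustration rather than the package's code: the helper name and the use of a separate years array to label each time step are assumptions, and months with NaN are simply left out of the mean.

```python
import numpy as np


def annual_means_sketch(monthly: np.ndarray, years: np.ndarray):
    """Mean of the non-missing months in each year (hypothetical helper).

    monthly: array (ntime, nlat, nlon); years: array (ntime) giving the year of each step
    """
    unique_years = np.unique(years)
    out = np.full((len(unique_years),) + monthly.shape[1:], np.nan)
    for i, year in enumerate(unique_years):
        block = monthly[years == year]
        count = np.sum(~np.isnan(block), axis=0)
        total = np.nansum(block, axis=0)
        # Cells with no data at all in a year stay NaN
        out[i] = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    return unique_years, out


years = np.repeat([2000, 2001], 12)
monthly = np.ones((24, 1, 1))
monthly[3] = np.nan                      # one missing month in 2000
monthly[12:, 0, 0] = np.arange(12.0)     # 0, 1, ..., 11 in 2001
annual_years, annual = annual_means_sketch(monthly, years)
```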

rebaseline(first_year: int, final_year: int) xarray.Dataset

Change the baseline of the data to the period between first_year and final_year by subtracting the average of the available data between those two years (inclusive).

Parameters
  • first_year (int) – First year of climatology period

  • final_year (int) – Final year of climatology period

Returns

Changes the dataset in place, but also returns the dataset if needed

Return type

xa.Dataset
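A rebaselining step of this kind can be sketched on plain numpy arrays. The version below assumes that the climatology is computed separately for each calendar month over the inclusive baseline period, which is a common convention but is not stated in the description above. The helper name and the years and months label arrays are likewise assumptions, and unlike the method described above it returns a new array rather than changing the data in place.

```python
import numpy as np


def rebaseline_sketch(monthly, years, months, first_year, final_year):
    """Subtract a per-calendar-month climatology (hypothetical helper, not climind code)."""
    in_period = (years >= first_year) & (years <= final_year)  # inclusive at both ends
    out = monthly.astype(float).copy()
    for month in range(1, 13):
        select = in_period & (months == month)
        if not select.any():
            continue  # no baseline data for this month, leave values unchanged
        climatology = np.nanmean(monthly[select], axis=0)
        out[months == month] -= climatology
    return out


years = np.repeat([2000, 2001], 12)
months = np.tile(np.arange(1, 13), 2)
# Every month of 2000 is 10.0 and every month of 2001 is 11.0
monthly = (years - 2000 + 10).astype(float).reshape(24, 1, 1)
anomalies = rebaseline_sketch(monthly, years, months, 2000, 2001)
```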

select_period(start_year: int, start_month: int, end_year: int, end_month: int)

Select a period from the grid specified by start year and month and end year and month, inclusive.

Parameters
  • start_year (int) – Year of start date

  • start_month (int) – Month of start date

  • end_year (int) – Year of end date

  • end_month (int) – Month of end date

Returns

Returns a GridMonthly containing only data within the specified date range.

Return type

GridMonthly

select_year_and_month(year: int, month: int)

Select a particular month from the data set and throw away the rest.

Parameters
  • year (int) – Year of selection

  • month (int) – Month of selection

Returns

Returns a GridMonthly containing only data for the specified year and month.

Return type

GridMonthly

update_history(message: str) None

Update the history metadata with a message.

Parameters

message (str) – Message to be added to history

Return type

None

climind.data_types.grid.get_1d_transfer(zero_point_original: float, grid_space_original: float, zero_point_target: float, grid_space_target: float, index_in_original: int) tuple

Find the overlapping grid spacings for a new grid based on an index in the old grid

Parameters
  • zero_point_original (float) – longitude or latitude of the zero-indexed grid cell

  • grid_space_original (float) – grid spacing in degrees

  • zero_point_target (float) – longitude or latitude of the zero-indexed grid cell in the target grid

  • grid_space_target (float) – grid spacing in degrees of the target grid

  • index_in_original (int) – index of the gridcell in the original grid

Returns

Returns the longitude of the first grid cell in the new grid, the number of steps, and the first and last indices on the new grid.

Return type

tuple

climind.data_types.grid.get_start_and_end_year(all_datasets: List[GridAnnual]) Tuple[int, int]

Given a list of GridAnnual datasets, find the earliest start year and the latest end year

Parameters

all_datasets (List[GridAnnual]) – List of datasets for which we want to find the first and last year

Return type

Tuple[int, int]

climind.data_types.grid.make_standard_grid(out_grid: numpy.ndarray, start_date: datetime, freq: str, number_of_times: int) xarray.Dataset

Make the standard 5x5 grid from a numpy array, start date, temporal frequency and number of time steps.

Parameters
  • out_grid (np.ndarray) – Numpy array containing the data. Shape should be (number_of_times, 36, 72)

  • start_date (datetime) – Date of the first time step

  • freq (str) – Temporal frequency

  • number_of_times (int) – Number of time steps, should match the first dimension of the out_grid

Returns

xarray Dataset containing the data in out_grid with the specified temporal frequency and number of time steps

Return type

xa.Dataset

climind.data_types.grid.make_xarray(target_grid, times, latitudes, longitudes, variable: str = 'tas_mean') xarray.Dataset

Make an xarray Dataset for a regular lat-lon grid from a numpy grid (ntime, nlat, nlon), and arrays of time (ntime), latitude (nlat) and longitude (nlon).

Parameters
  • target_grid (np.ndarray) – numpy array of shape (ntime, nlat, nlon)

  • times (np.ndarray) – Array of times, shape (ntime)

  • latitudes (np.ndarray) – Array of latitudes, shape (nlat)

  • longitudes (np.ndarray) – Array of longitudes, shape (nlon)

  • variable (str) – Variable name

Returns

Dataset built from the input components

Return type

xa.Dataset

climind.data_types.grid.median_of_datasets(all_datasets: List[GridAnnual]) GridAnnual

Calculate the median of a list of GridAnnual data sets

Parameters

all_datasets (List[GridAnnual]) – List of GridAnnual datasets from which the medians will be calculated.

Return type

GridAnnual

climind.data_types.grid.process_datasets(all_datasets: List[GridAnnual], grid_type: str) GridAnnual

Calculate the median or range (depending on selected type) of a list of GridAnnual data sets. Medians are calculated on a grid cell by grid cell basis based on all available data in the list of data sets.

Parameters
  • all_datasets (List[GridAnnual]) – list of GridAnnual data sets

  • grid_type (str) – Either ‘median’ or ‘range’

Returns

Data set containing the median (or half-range) values from all the data sets supplied

Return type

GridAnnual
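The cell-by-cell median and half-range can be illustrated with numpy on a stack of arrays, one per data set. This is a sketch under assumptions: the helper name is hypothetical, the data sets are taken to share a grid already, and the half-range is taken to be half the difference between the largest and smallest available values.

```python
import numpy as np


def median_and_half_range(stack: np.ndarray):
    """Per-cell median and half-range over axis 0, ignoring NaN (hypothetical helper)."""
    median = np.nanmedian(stack, axis=0)
    half_range = (np.nanmax(stack, axis=0) - np.nanmin(stack, axis=0)) / 2.0
    return median, half_range


# Three data sets on a grid with one latitude and two longitudes;
# the first data set is missing in the second cell.
stack = np.array([[[1.0, np.nan]], [[2.0, 4.0]], [[6.0, 8.0]]])
median, half_range = median_and_half_range(stack)
```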

climind.data_types.grid.range_of_datasets(all_datasets: List[GridAnnual]) GridAnnual

Calculate the half-range of a list of GridAnnual data sets

Parameters

all_datasets (List[GridAnnual]) – List of GridAnnual datasets from which the ranges will be calculated.

Return type

GridAnnual

climind.data_types.grid.rank_array(in_array: numpy.ndarray) int

Rank array

Parameters

in_array (np.ndarray) – Array to be ranked

Return type

int

climind.data_types.grid.simple_regrid(ingrid: numpy.ndarray, lon0: float, lat0: float, dx: float, target_dy: float) numpy.ndarray

Perform a simple regridding, using a simple average of grid cells from the original grid that fall within the target grid cell.

Parameters
  • ingrid (np.ndarray) – Starting grid which we want to regrid

  • lon0 (float) – Longitude of zero-indexed grid cell in longitudinal direction

  • lat0 (float) – Latitude of zero-indexed grid cell in latitudinal direction

  • dx (float) – Grid spacing in degrees

  • target_dy (float) – Target grid spacing

Returns

Returns regridded array.

Return type

np.ndarray
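For intuition, the simplest case of such a regridding is a block average in which each target cell covers a whole number of original cells. The sketch below handles only that aligned case and takes an integer factor instead of the lon0, lat0, dx and target_dy arguments listed above, so it is an illustration of the averaging idea, not of this function's interface.

```python
import numpy as np


def block_average(ingrid: np.ndarray, factor: int) -> np.ndarray:
    """Average factor x factor blocks of a 2D grid (hypothetical helper)."""
    nlat, nlon = ingrid.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, position within block) and average the blocks
    blocks = ingrid.reshape(nlat // factor, factor, nlon // factor, factor)
    return blocks.mean(axis=(1, 3))


fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = block_average(fine, 2)
```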

climind.data_types.timeseries module

class climind.data_types.timeseries.AveragesCollection(all_datasets)

Bases: object

A simple class to perform specific tasks on lists of TimeSeriesAnnual

best_estimate()
count()
lower_range()
range()
upper_range()
class climind.data_types.timeseries.TimeSeries(metadata: Optional[CombinedMetadata] = None)

Bases: ABC

A base class for representing time series data sets. Note that this class should not generally be used and only its subclasses TimeSeriesMonthly, TimeSeriesAnnual and TimeSeriesIrregular should be used. This class contains shared functionality from these classes but does not work on its own.

add_offset(**kwargs)
get_first_and_last_year() Tuple[int, int]

Get the first and last year in the series

Returns

first and last year

Return type

Tuple[int, int]

abstract get_string_date_range() str

Create a string which specifies the date range covered by the time series

Return type

str

manually_set_baseline(**kwargs)
select_year_range(**kwargs)
update_history(message: str) None

Update the history metadata

Parameters

message (str) – Message to be added to history

Return type

None

write_generic_csv(filename: Path, metadata_filename: Path, monthly: bool, uncertainty: bool, irregular: bool, columns_to_write: List[str]) None

Write the dataset out into csv format

Parameters
  • filename (Path) – Path of the csv file to which the data will be written.

  • metadata_filename (Path) – Path of the json file to which the metadata will be written.

  • monthly (bool) – Set to True for monthly data

  • uncertainty (bool) – Set to True to print uncertainties

  • irregular (bool) – Set to True for irregular data

  • columns_to_write (List[str]) – List of the columns from the dataframe to be written to the data file

Return type

None

class climind.data_types.timeseries.TimeSeriesAnnual(years: list, data: list, metadata=None, uncertainty: Optional[list] = None)

Bases: TimeSeries

A TimeSeriesAnnual combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents annual averages of data.

Create TimeSeriesAnnual object from its components.

Parameters
  • years (list) – List of years

  • data (list) – List of data values

  • metadata (CombinedMetadata) – Dictionary containing the metadata

df

Pandas dataframe containing the time and data information

Type

pd.DataFrame

metadata

Dictionary containing the metadata. The only guaranteed entry is ‘history’

Type

dict

add_year(year: int, value: float, uncertainty: Optional[float] = None) None

Add a year of data.

Parameters
  • year (int) – the year to be added

  • value (float) – the data value to be added

  • uncertainty (Optional[float]) – the uncertainty of the data value to be added (optional)

Return type

None

generate_dates(time_units: str) List[datetime]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters

time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”

Returns

List of dates

Return type

List[datetime]

get_rank_from_year(**kwargs)
get_string_date_range() str

Create a string which specifies the date range covered by the TimeSeriesAnnual in the format YYYY-YYYY

Returns

String that specifies the date range covered

Return type

str

get_uncertainty_from_year(**kwargs)
get_value_from_year(**kwargs)
get_year_axis() List[float]

Return a year axis with dates represented as decimal years.

Returns

List of dates as decimal years.

Return type

List[float]

get_year_from_rank(**kwargs)
static make_from_df(df: pandas.DataFrame, metadata: CombinedMetadata)

Create a TimeSeriesAnnual from a pandas data frame.

Parameters
  • df (pd.DataFrame) – Pandas dataframe containing columns ‘year’ and ‘data’

  • metadata (dict) – Dictionary containing the metadata

Returns

TimeSeriesAnnual created from the elements in the dataframe and metadata.

Return type

TimeSeriesAnnual

rebaseline(**kwargs)
running_mean(**kwargs)
running_stdev(**kwargs)
select_decade(**kwargs)
write_csv(filename, metadata_filename=None)

Write the timeseries to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters
  • filename (Path) – Path of the filename to write the data to

  • metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

write_simple_csv(filename)
class climind.data_types.timeseries.TimeSeriesIrregular(years: List[int], months: List[int], days: List[int], data: List[float], metadata: Optional[CombinedMetadata] = None, uncertainty: Optional[List[float]] = None)

Bases: TimeSeries

A TimeSeriesIrregular combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents non-monthly, non-annual averages of data such as weekly, or 5-day averages.

Create TimeSeriesIrregular object.

Parameters
  • years (List[int]) – List of integers specifying the year of each data point

  • months (List[int]) – List of integers specifying the month of each data point

  • days (List[int]) – List of integers specifying the day of each data point

  • data (List[float]) – List of floats with the data values

  • metadata (CombinedMetadata) – CombinedMetadata object holding the metadata for the dataset

  • uncertainty (List[float]) – List of floats with the uncertainty values for each data point

generate_dates(time_units: str) List[int]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters

time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”

Return type

List[int]

get_start_and_end_dates() Tuple[datetime, datetime]

Get the first and last dates in the dataset

Return type

Tuple[datetime, datetime]

get_string_date_range() str

Create a string which specifies the date range covered by the TimeSeriesIrregular in the format YYYY.MM.DD-YYYY.MM.DD

Returns

String that specifies the date range covered

Return type

str

get_year_axis() List[float]

Return a year axis in which all dates are represented as decimal years. January 1st 1984 is 1984.00.

Returns

List of dates represented as decimal years.

Return type

List[float]
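One way to turn a calendar date into a decimal year that satisfies "January 1st 1984 is 1984.00" is to divide the days elapsed since 1 January by the length of that year. The sketch below uses only the standard library. The helper name and this particular convention are assumptions, and the package may compute the fraction differently.

```python
from datetime import date


def decimal_year(d: date) -> float:
    """Year plus the fraction of the year elapsed at the start of day d (hypothetical helper)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


new_year = decimal_year(date(1984, 1, 1))
mid_year = decimal_year(date(1984, 7, 2))  # 183 days into a 366-day leap year
```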

make_monthly(**kwargs)
rebaseline(**kwargs)
write_csv(filename: Path, metadata_filename: Optional[Path] = None) None

Write the timeseries to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters
  • filename (Path) – Path of the filename to write the data to

  • metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

class climind.data_types.timeseries.TimeSeriesMonthly(years: List[int], months: List[int], data: List[float], metadata: Optional[CombinedMetadata] = None, uncertainty: Optional[List[float]] = None)

Bases: TimeSeries

A TimeSeriesMonthly combines a pandas Dataframe with a CombinedMetadata to bring together data and metadata in one object. It represents monthly averages of data.

Create TimeSeriesMonthly object.

Parameters
  • years (List[int]) – List of years

  • months (List[int]) – List of months

  • data (List[float]) – List of data values

  • metadata (CombinedMetadata) – CombinedMetadata object containing the metadata

  • uncertainty (Optional[List[float]]) – List of uncertainty values for each data point (optional)

df

Pandas dataframe used to contain the time and data information.

Type

pd.DataFrame

metadata

Dictionary containing metadata. The only guaranteed entry is “history”

Type

dict

generate_dates(time_units: str) List[int]

Given a string specifying the required time units (something like days since 1800-01-01 00:00:00.0), generate a list of times from the time series corresponding to those units.

Parameters

time_units (str) – String specifying the units to use for generating the times e.g. “days since 1800-01-01 00:00:00.0”

Return type

List[int]
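A units string such as "days since 1800-01-01 00:00:00.0" can be interpreted with the standard library as shown in this sketch. It handles only "days since" units, ignores the time-of-day part of the reference, and takes explicit dates as input. The helper name and those simplifications are assumptions, and the documentation does not say which day the package assigns to each month.

```python
from datetime import datetime
from typing import List


def days_since(time_units: str, dates: List[datetime]) -> List[int]:
    """Convert dates to whole days after the reference date (hypothetical helper)."""
    prefix = "days since "
    if not time_units.startswith(prefix):
        raise ValueError("only 'days since ...' units are handled in this sketch")
    # Keep the YYYY-MM-DD part and drop the time of day
    reference = datetime.strptime(time_units[len(prefix):].split()[0], "%Y-%m-%d")
    return [(d - reference).days for d in dates]


offsets = days_since(
    "days since 1800-01-01 00:00:00.0",
    [datetime(1800, 1, 1), datetime(1800, 2, 1), datetime(1801, 1, 1)],
)
```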

get_rank_from_year_and_month(**kwargs)
get_start_and_end_dates() Tuple[datetime, datetime]

Get the first and last dates in the dataset

Returns

Start and end dates.

Return type

Tuple[datetime, datetime]

get_string_date_range() str

Create a string which specifies the date range covered by the TimeSeriesMonthly in the format YYYY.MM-YYYY.MM

Returns

String that specifies the date range covered

Return type

str

get_uncertainty(year: int, month: int) Optional[float]

Get the current uncertainty for a particular year and month

Parameters
  • year (int) – Year for which the uncertainty is required.

  • month (int) – Month for which the uncertainty is required.

Returns

Uncertainty for the specified year and month or None if it does not exist

Return type

Optional[float]

get_value(year: int, month: int) Optional[float]

Get the current value for a particular year and month

Parameters
  • year (int) – Year for which the value is required.

  • month (int) – Month for which the value is required.

Returns

Value for the specified year and month or None if it does not exist

Return type

Optional[float]

get_year_axis() List[float]

Return a year axis as decimal year. 1st January 1984 is 1984.00.

Returns

List of dates expressed as a decimal year.

Return type

List[float]
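For monthly data, a decimal year axis consistent with "1st January 1984 is 1984.00" can be built by placing each month at its start, as in the sketch below. Placing months at their start rather than their midpoint is an assumption, as is the helper name.

```python
from typing import List


def monthly_year_axis(years: List[int], months: List[int]) -> List[float]:
    """Decimal year for the start of each month (hypothetical helper)."""
    return [year + (month - 1) / 12.0 for year, month in zip(years, months)]


axis = monthly_year_axis([1984, 1984, 1985], [1, 7, 1])
```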

make_annual(**kwargs)
make_annual_by_selecting_month(**kwargs)
static make_from_df(df: pandas.DataFrame, metadata: CombinedMetadata)

Create a TimeSeriesMonthly from a pandas data frame.

Parameters
  • df (pd.DataFrame) – Pandas dataframe containing columns ‘year’, ‘month’ and ‘data’ (optionally ‘uncertainty’)

  • metadata (dict) – Dictionary containing the metadata

Returns

TimeSeriesMonthly built from input components.

Return type

TimeSeriesMonthly

rebaseline(**kwargs)
write_csv(filename: Path, metadata_filename: Optional[Path] = None) None

Write the TimeSeriesMonthly to a csv file with the specified filename. The format used for writing is given by the BADC CSV format. This has a lot of upfront metadata before the data section. An option for writing a metadata file is also provided.

Parameters
  • filename (Path) – Path of the filename to write the data to

  • metadata_filename (Path) – Path of the filename to write the metadata to

Return type

None

zero_on_month(**kwargs)
climind.data_types.timeseries.create_common_dataframe(dataframes: List[pandas.DataFrame], monthly: bool = False, annual: bool = False, irregular: bool = False) pandas.DataFrame

Given a list of dataframes make a single dataframe which has rows corresponding to all time steps in the input dataframes

Parameters
  • dataframes (List[pd.DataFrame]) – List of dataframes which are to be used as the basis for the common data frame

  • monthly (bool) – Set to true for monthly data

  • annual (bool) – Set to true for annual data

  • irregular (bool) – Set to true for daily/irregular data

Returns

Pandas dataframe with one row for each row in the input dataframes

Return type

pd.DataFrame

climind.data_types.timeseries.equalise_datasets(all_datasets: List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) pandas.DataFrame

Given a list of datasets, combine their data columns into a single data frame on a common time axis.

Parameters

all_datasets (List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) – List of time series datasets whose data is to be combined in a single data frame. The data column from each data set becomes a column in the combined data frame, identified by the “name” of the data set from its metadata.

Returns

Pandas dataframe containing the data columns from all the input datasets.

Return type

pd.DataFrame
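The combination described here resembles an outer merge on the time columns. The pandas sketch below shows that idea for annual data only, using plain dataframes keyed by a dataset name. The helper name, the dictionary input and the restriction to a 'year' key are assumptions made for illustration and do not reflect the function's actual interface.

```python
import pandas as pd


def combine_data_columns(named_frames: dict) -> pd.DataFrame:
    """Outer-merge the 'data' column of each frame on 'year' (hypothetical helper)."""
    if not named_frames:
        raise ValueError("at least one dataframe is required")
    combined = None
    for name, df in named_frames.items():
        # Rename 'data' so that each data set gets its own column
        renamed = df[["year", "data"]].rename(columns={"data": name})
        combined = renamed if combined is None else combined.merge(renamed, on="year", how="outer")
    return combined.sort_values("year").reset_index(drop=True)


frames = {
    "dataset_a": pd.DataFrame({"year": [2000, 2001], "data": [0.1, 0.2]}),
    "dataset_b": pd.DataFrame({"year": [2001, 2002], "data": [0.3, 0.4]}),
}
common = combine_data_columns(frames)
```

Years present in only one input appear in the result with NaN in the other column.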

climind.data_types.timeseries.get_list_of_unique_variables(all_datasets: List[TimeSeriesAnnual]) List[str]

Given a list of TimeSeriesAnnual, get a list of the unique variable names represented in that list.

Parameters

all_datasets (List[TimeSeriesAnnual]) – List of datasets from which to extract the unique variable names.

Returns

List of the unique variable names.

Return type

List[str]

climind.data_types.timeseries.get_start_and_end_year(all_datasets: List[TimeSeriesAnnual]) Tuple[Optional[int], Optional[int]]

Given a list of TimeSeriesAnnual, extract the first year in any of the data sets and the last year in any of the data sets.

Parameters

all_datasets (List[TimeSeriesAnnual]) – List of datasets from which to extract the earliest first year and latest final year.

Returns

Return the first and last years in the list of data sets

Return type

Tuple[Optional[int], Optional[int]]

climind.data_types.timeseries.log_activity(in_function: Callable) Callable

Decorator that logs the name of the function being run and the arguments it was called with. This aims to provide some traceability in the output.

Parameters

in_function (Callable) – The function to be decorated

Return type

Callable
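A logging decorator of this general kind can be sketched with the standard library as follows. This is not the package's implementation: the name log_activity_sketch, the log message format and the use of functools.wraps to keep the wrapped function's name are all assumptions.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def log_activity_sketch(in_function: Callable) -> Callable:
    """Log the function name and arguments on each call (hypothetical decorator)."""

    @functools.wraps(in_function)
    def wrapper(*args, **kwargs):
        logger.info("Running %s with args=%r kwargs=%r", in_function.__name__, args, kwargs)
        return in_function(*args, **kwargs)

    return wrapper


@log_activity_sketch
def add_offset(value: float, offset: float) -> float:
    """Toy function used only to demonstrate the decorator."""
    return value + offset


result = add_offset(1.0, offset=0.5)
```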

climind.data_types.timeseries.make_combined_series(all_datasets: List[TimeSeriesAnnual]) TimeSeriesAnnual

Combine a list of datasets into a single TimeSeriesAnnual by taking the arithmetic mean of all available datasets for each year. Merges the metadata for all the input time series.

Parameters

all_datasets (List[TimeSeriesAnnual]) – List of datasets to be combined

Returns

TimeSeriesAnnual which is the mean of all available datasets in each year.

Return type

TimeSeriesAnnual
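The averaging rule can be illustrated without the package by representing each series as a mapping from year to value. In this sketch, the helper name and the dictionary representation are assumptions, and the metadata merging mentioned above is not shown.

```python
from statistics import mean
from typing import Dict, List


def combine_by_year(all_series: List[Dict[int, float]]) -> Dict[int, float]:
    """Mean of whichever series have a value in each year (hypothetical helper)."""
    years = sorted({year for series in all_series for year in series})
    return {
        year: mean(series[year] for series in all_series if year in series)
        for year in years
    }


combined = combine_by_year([{2000: 1.0, 2001: 2.0}, {2001: 4.0, 2002: 5.0}])
```

Only 2001 is covered by both series, so it is the only year where two values are averaged.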

climind.data_types.timeseries.superset_dataset_list(all_datasets: List[TimeSeriesAnnual], variables: List[str]) List[List[TimeSeriesAnnual]]

Given a list of variables, create a list where each entry is a list of all TimeSeriesAnnual objects corresponding to the variable in that index position.

Parameters
  • all_datasets (List[TimeSeriesAnnual]) – List of datasets

  • variables (List[str]) – List of variable names

Returns

List of lists of TimeSeriesAnnual.

Return type

List[List[TimeSeriesAnnual]]
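The grouping described here, together with the extraction of unique variable names, can be sketched with a small stand-in class. The Series tuple and both helper names below are hypothetical. In the package the variable name would presumably come from each data set's metadata.

```python
from typing import List, NamedTuple


class Series(NamedTuple):
    """Stand-in for a time series carrying a name and a variable (hypothetical)."""
    name: str
    variable: str


def unique_variables(all_datasets: List[Series]) -> List[str]:
    """Variable names in order of first appearance."""
    seen: List[str] = []
    for dataset in all_datasets:
        if dataset.variable not in seen:
            seen.append(dataset.variable)
    return seen


def group_by_variable(all_datasets: List[Series], variables: List[str]) -> List[List[Series]]:
    """One list of data sets per variable, in the same order as variables."""
    return [[ds for ds in all_datasets if ds.variable == v] for v in variables]


datasets = [Series("a", "tas"), Series("b", "ohc"), Series("c", "tas")]
variables = unique_variables(datasets)
grouped = group_by_variable(datasets, variables)
```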

climind.data_types.timeseries.write_dataset_summary_file(all_datasets, csv_filename)
climind.data_types.timeseries.write_dataset_summary_file_with_metadata(all_datasets: List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]], csv_filename: Union[str, Path]) None

Given a list of time series data sets, write them out in a single BADC CSV format csv file with complete metadata.

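Parameters
  • all_datasets (List[Union[TimeSeriesAnnual, TimeSeriesMonthly, TimeSeriesIrregular]]) – List of time series data sets to be written

  • csv_filename (Union[str, Path]) – Path of the csv file to which the data sets will be written

Return type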

None