climind.data_manager package

The scripts and files in the data manager module are used to create and process the metadata entities that underpin the dashboards.

There are basic metadata classes (BaseMetadata, CollectionMetadata and DatasetMetadata) which contain metadata and allow for simple tasks like checking whether the metadata they contain matches what’s in a dictionary.

Then there are classes (DataSet, DataCollection, DataArchive) that use the metadata classes to define data sets, collections of related data sets and data archives. These classes allow you to select subsets of data, download data sets and read them in.

The metadata themselves are stored in json files in the climind.metadata_files directory.

Submodules

climind.data_manager.metadata module

These metadata classes contain all the information about the datasets that are manipulated by the packages. The BaseMetadata class contains much of the functionality, with CollectionMetadata and DatasetMetadata inheriting that functionality and differing chiefly in the schemas used to validate their contents. The CombinedMetadata class comprises a CollectionMetadata object and a DatasetMetadata object.

class climind.data_manager.metadata.BaseMetadata(metadata: dict)

Bases: object

Simple class to store metadata and find matches. Metadata items can be set and recovered using a dictionary-like syntax:

metadata_object['key'] = value

value = metadata_object['key']

Testing whether a key is present is also dict-like:

key in metadata_object

Create a BaseMetadata object from a dictionary containing the metadata in key-value pairs.

Parameters: metadata (dict) – Dictionary containing the metadata

metadata

Contains the metadata information in key-value pairs

Type: dict
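The dictionary-like access described above can be pictured with a small stand-in class. This is a minimal sketch, not the climind implementation: MetadataSketch and the keys "variable" and "time_resolution" are hypothetical names chosen for illustration, and the real BaseMetadata may do more (for example, validation).

```python
class MetadataSketch:
    """Hypothetical stand-in for BaseMetadata; not the climind implementation."""

    def __init__(self, metadata: dict):
        # The wrapped dictionary of key-value pairs
        self.metadata = metadata

    def __getitem__(self, key):
        # Enables: value = metadata_object['key']
        return self.metadata[key]

    def __setitem__(self, key, value):
        # Enables: metadata_object['key'] = value
        self.metadata[key] = value

    def __contains__(self, key):
        # Enables: key in metadata_object
        return key in self.metadata


sketch = MetadataSketch({"name": "HadCRUT5", "variable": "tas"})
sketch["time_resolution"] = "monthly"
```

Delegating the three special methods to the wrapped dictionary is enough to support item assignment, item lookup and the in operator.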

fill_string(string_to_replace: str, replacement: str)

Replace string_to_replace with the replacement value in all elements of the metadata.

Parameters

string_to_replace (str) – string to be replaced in metadata elements
replacement (str) – replacement string

Return type

None
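As an illustration of this kind of replacement, the sketch below substitutes a placeholder in every string-valued element of a plain dictionary. It is a stand-alone function written under assumptions, not the method itself: it assumes elements are either strings or lists of strings and leaves everything else untouched, and the keys and URL in the example are hypothetical.

```python
def fill_string(metadata: dict, string_to_replace: str, replacement: str) -> None:
    """Replace a substring in every string-valued element, in place (sketch)."""
    for key, value in metadata.items():
        if isinstance(value, str):
            metadata[key] = value.replace(string_to_replace, replacement)
        elif isinstance(value, list):
            # Assumption: lists hold strings; non-string items are kept as-is
            metadata[key] = [
                item.replace(string_to_replace, replacement)
                if isinstance(item, str)
                else item
                for item in value
            ]


# Hypothetical metadata containing a VVVV version placeholder
example = {
    "url": ["https://example.org/data_VVVV.csv"],
    "filename": "data_VVVV.csv",
    "start_year": 1850,
}
fill_string(example, "VVVV", "5.0.2")
```

After the call, both the filename and the URL carry the version string, while the integer start_year is unchanged.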

match_metadata(metadata_to_match: dict) → bool

Check if metadata match contents of dictionary, metadata_to_match. Only definite non-matches are rejected. If a key is not found in the dictionary this is not counted as a non-match.

Parameters: metadata_to_match (dict) – Key-value or key-list pairs for match
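One plausible reading of the matching rule is sketched below as a stand-alone function on plain dictionaries. It assumes that a requested key missing from the metadata is skipped rather than treated as a mismatch, and that a list value matches when the stored value equals any item in the list. The actual method may differ in detail, and the keys used here are hypothetical.

```python
def match_metadata(metadata: dict, metadata_to_match: dict) -> bool:
    """Return False only on a definite mismatch (sketch under stated assumptions)."""
    for key, wanted in metadata_to_match.items():
        if key not in metadata:
            # Assumption: an absent key is not a definite non-match
            continue
        actual = metadata[key]
        if isinstance(wanted, list):
            # Key-list pair: any one item matching is enough
            if actual not in wanted:
                return False
        elif actual != wanted:
            # Key-value pair: must be equal
            return False
    return True


stored = {"variable": "tas", "time_resolution": "monthly"}
```

Under these assumptions, {"variable": ["tas", "ohc"]} matches stored, {"variable": "ohc"} does not, and a requirement on a key that stored lacks is ignored.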

class climind.data_manager.metadata.CollectionMetadata(metadata: dict)

Bases: BaseMetadata

Class to store collection-level metadata, containing information that refers to all data sets in the collection.

Create CollectionMetadata from a dictionary containing metadata. Metadata are validated using the metadata_schema.json file.

Parameters: metadata (dict) – Dictionary containing metadata in key value pairs.

class climind.data_manager.metadata.CombinedMetadata(dataset: DatasetMetadata, collection: CollectionMetadata)

Bases: object

CombinedMetadata combines DatasetMetadata and CollectionMetadata in one single object so that both sets of metadata elements are available in one container.

creation_message() → None

Add a creation message to the dataset history and populate the wildcards in the metadata, such as AAAA (last modified/download time), YYYY (year), VVVV (version number).

Return type: None

match_metadata(metadata_to_match: dict) → bool

Test whether the metadata match metadata_to_match. Returns True unless there is a mismatch between the required metadata_to_match and the metadata.

Parameters: metadata_to_match (dict) – Dictionary of metadata terms to match
Returns: Return True unless an element in metadata_to_match conflicts with an entry in the metadata
Return type: bool

write_metadata(filename: Path) → None

Write out the metadata in json format to a file specified by filename

Parameters: filename (Path) – Path of filename to be created
Return type: None

class climind.data_manager.metadata.DatasetMetadata(metadata: dict)

Bases: BaseMetadata

Class to store dataset-level metadata, containing information that refers specifically to a single data set.

Create DatasetMetadata from a dictionary containing metadata. Metadata are validated using the dataset_schema.json file.

Parameters: metadata (dict) – Dictionary containing metadata in key value pairs.

creation_message() → None

Add creation message to the history.

Return type: None

climind.data_manager.metadata.list_match(list_to_match: list, attribute: str) → bool

If attribute matches any item in list_to_match return True, otherwise False

Parameters

list_to_match (list) – List of metadata to match
attribute (str) – attribute to check against

Returns

Set to True if attribute matches element in list_to_match, False otherwise

Return type

bool

climind.data_manager.processing module

The classes and functions in this script describe groupings of metadata. The basic building block is a DataSet, which specifies a file (or files) containing the data for a single data set. DataSet objects are grouped into DataCollection objects, which gather together all the individual data sets derived from a single product. For example, HadCRUT5 is a product, so it has a corresponding DataCollection made up of several DataSet objects. Finally, a DataArchive contains one or more DataCollection objects. All DataSet objects in a DataCollection describe the same variable. However, DataCollection objects in a DataArchive need not describe the same variable.

class climind.data_manager.processing.DataArchive

Bases: object

A set of DataCollection objects. A DataArchive is the starting point for the analysis. Particular DataSet objects are selected from the DataArchive before plotting or summarising the data.

Create a DataArchive object, initially empty.

collections

A dictionary containing the DataCollection objects in the archive

Type: dict

add_collection(data_collection: DataCollection) → None

Add a DataCollection to the archive

Parameters: data_collection (DataCollection) – DataCollection to be added to the DataArchive
Return type: None

download(out_dir: Path) → None

Download all files in the DataArchive.

Parameters: out_dir (Path) – Directory to which the files should be downloaded
Return type: None

static from_directory(path_to_dir: Path)

Create a DataArchive from a directory of metadata. The directory should contain a set of json files, each of which contains a set of metadata describing a DataCollection.

Parameters: path_to_dir (Path) – Path to the directory containing the metadata files that will be used to populate the DataArchive
Returns: DataArchive containing all DataCollection objects described in the metadata files
Return type: DataArchive
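The directory scan behind a method like this can be sketched with the standard library alone. The helper below is hypothetical: it simply loads every .json file in a directory into a dictionary keyed by file stem, whereas the real method builds DataCollection objects. The file names and the "name" and "datasets" keys are assumptions made for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def load_metadata_files(path_to_dir: Path) -> dict:
    """Read every .json file in a directory into a dict keyed by file stem."""
    loaded = {}
    for json_path in sorted(path_to_dir.glob("*.json")):
        with open(json_path, encoding="utf-8") as handle:
            loaded[json_path.stem] = json.load(handle)
    return loaded


# Demonstration in a temporary directory with one metadata file and one stray file
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "HadCRUT5.json").write_text(
        json.dumps({"name": "HadCRUT5", "datasets": []}), encoding="utf-8"
    )
    (tmp_dir / "notes.txt").write_text("ignored", encoding="utf-8")
    collections = load_metadata_files(tmp_dir)
```

Only the .json file is picked up; the text file is ignored by the glob pattern.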

read_datasets(out_dir: Path, **kwargs) → list

Read all the datasets in the DataArchive.

Parameters: out_dir (Path) – Path of directory containing the data
Returns: List of datasets specified by metadata in the archive.
Return type: list

select(metadata_to_match: dict)

Select datasets from the DataArchive that meet the metadata requirements specified in the metadata_to_match dictionary.

Parameters: metadata_to_match (dict) – Metadata to be matched. For each requirement, there should be a key-value pair
Returns: Returns DataArchive containing only data that match the metadata_to_match
Return type: DataArchive
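The selection step can be pictured as filtering a nested structure. The sketch below uses hypothetical stand-ins (CollectionSketch, plain dictionaries for data set metadata, and made-up keys and values) and handles scalar requirements only. Dropping collections that end up with no matching data sets is an assumption, not documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionSketch:
    """Hypothetical stand-in for a DataCollection."""
    name: str
    datasets: list = field(default_factory=list)  # plain metadata dicts


def matches(metadata: dict, metadata_to_match: dict) -> bool:
    # A requested key that is absent from the metadata is treated as compatible
    return all(metadata.get(k, v) == v for k, v in metadata_to_match.items())


def select(collections: dict, metadata_to_match: dict) -> dict:
    """Return a new mapping holding only the matching data sets."""
    selected = {}
    for name, collection in collections.items():
        kept = [ds for ds in collection.datasets if matches(ds, metadata_to_match)]
        if kept:  # assumption: collections left empty are dropped
            selected[name] = CollectionSketch(name, kept)
    return selected


archive = {
    "HadCRUT5": CollectionSketch("HadCRUT5", [
        {"variable": "tas", "time_resolution": "monthly"},
        {"variable": "tas", "time_resolution": "annual"},
    ]),
    "NSIDC": CollectionSketch("NSIDC", [
        {"variable": "sea_ice_extent", "time_resolution": "monthly"},
    ]),
}
annual_tas = select(archive, {"variable": "tas", "time_resolution": "annual"})
```

The original archive is left untouched, so several different selections can be made from the same starting point.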

class climind.data_manager.processing.DataCollection(metadata: dict)

Bases: object

A grouping of DataSet objects derived from a single product or source, e.g. HadCRUT5. This could include, for example, monthly and annual time series along with the gridded data.

Create DataCollection from a metadata dictionary.

Parameters: metadata (dict) – Metadata dictionary from which the DataCollection is created

global_attributes

Metadata containing the attributes that apply to all DataSets in the DataCollection

Type: CollectionMetadata

datasets

List containing all the DataSet objects in this collection

Type: List[DataSet]

add_dataset(ds: DataSet) → None

Add DataSet object to DataCollection

Parameters: ds (DataSet) – DataSet to be added
Return type: None

download(data_dir: Path) → None

Download all the data sets described by DataSet objects in the DataCollection.

Parameters: data_dir (Path) – Location to which the datasets should be downloaded
Return type: None

static from_file(filename: Path)

Given a file path create the DataCollection from metadata in that file

Parameters: filename (Path) – Filename of the metadata file in json format
Returns: DataCollection containing all the DataSet objects specified by the metadata file
Return type: DataCollection

get_collection_dir(data_dir: Path) → Path

Get the Path to the directory where the data for this DataCollection are stored. If the directory does not exist, then create it.

Parameters: data_dir (Path) – Path to the general data directory for managed data in the project
Returns: Path to the directory for this DataCollection.
Return type: Path
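The get-or-create behaviour described here maps onto a single pathlib call. The function below is a sketch, not the method itself: it takes the collection name as an explicit argument, and the assumption that the subdirectory is named after the collection is mine rather than the documentation's.

```python
import tempfile
from pathlib import Path


def get_collection_dir(data_dir: Path, collection_name: str) -> Path:
    """Return data_dir/collection_name, creating it if it does not exist."""
    collection_dir = data_dir / collection_name
    # exist_ok=True makes repeated calls safe; parents=True creates data_dir too
    collection_dir.mkdir(parents=True, exist_ok=True)
    return collection_dir


with tempfile.TemporaryDirectory() as tmp:
    first = get_collection_dir(Path(tmp), "HadCRUT5")
    second = get_collection_dir(Path(tmp), "HadCRUT5")  # already exists, no error
    created = first.is_dir()
```

Calling the function twice returns the same path and does not raise, which is the property a download or read routine would rely on.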

match_metadata(metadata_to_match: dict)

Given a dictionary of metadata keys and required values for each key, return a DataCollection which contains only data sets matching the specified metadata

Parameters: metadata_to_match (dict) – Dictionary containing key:value pairs that specify the data sets required in the output DataCollection
Returns: Return DataCollection that matches the metadata_to_match
Return type: DataCollection

read_datasets(out_dir: Path, **kwargs) → list

Read all the datasets described by DataSet objects in the DataCollection

Parameters: out_dir (Path) – Directory in which the datasets are found
Returns: Return list of all data sets described in the DataCollection.
Return type: list

to_file(filename: Path) → None

Write the DataCollection metadata to file in json format.

Parameters: filename (Path) – Path to the file to be written
Return type: None

class climind.data_manager.processing.DataSet(metadata: DatasetMetadata, global_metadata: CollectionMetadata)

Bases: object

A DataSet contains metadata for a single dataset (one that might be split across multiple files). For example, NSIDC monthly sea ice extent data is a single data set provided in 12 files, one for each month. In contrast, HadCRUT5 monthly global mean temperature is a single file. Both of these would be described by a DataSet. A DataSet can be used to read in the actual data.

Create a DataSet from DatasetMetadata and CollectionMetadata.

Parameters

metadata (DatasetMetadata) – DatasetMetadata containing the dataset metadata.
global_metadata (CollectionMetadata) – CollectionMetadata containing the global metadata

name

Name of the data set

Type: str

metadata

Dictionary of attributes

Type: dict

global_metadata

Dictionary of global attributes inherited from collection

Type: dict

download(out_dir: Path) → None

Download the data set using its “fetcher” function. Fetcher functions are contained in the fetchers package.

Parameters: out_dir (Path) – Directory to which the data set will be downloaded
Return type: None

match_metadata(metadata_to_match: dict) → bool

Check if there is a mismatch between attributes of DataSet and the contents of a dictionary, metadata_to_match. Only items that are in the attributes are checked.

Parameters: metadata_to_match (dict) – Dictionary of key-value or key-list pairs to match. If a key-list is provided then each element of the list is checked and a mismatch only occurs if all of the items in the list cause a mismatch.
Returns: Return True unless there is a mismatch in which case return False
Return type: bool

read_dataset(out_dir: Path, **kwargs)

Read in the dataset and output an object of the appropriate type.

Parameters: out_dir (Path) – Directory in which the data are to be found (dictated by the Collection)
Return type: Object of the appropriate type

climind.data_manager.processing.get_function(module_path: str, script_name: str, function_name: str) → Callable

For a particular module and script in that module, return the function with a specified name as a callable object.

Parameters

module_path (str) – The path to the module written using dot separation between directories
script_name (str) – The name of the script
function_name (str) – The name of the function in the script to be returned

Returns

Returns the function with the specified function name from the script with the specified script name in the specified module path

Return type

Callable
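A lookup of this kind is commonly written with importlib. The sketch below shows that general technique and is not necessarily how the climind function is implemented. It assumes the module path and script name are joined with a dot to form the full module name, and it is demonstrated on the standard library rather than on climind itself.

```python
import importlib
from typing import Callable


def get_function(module_path: str, script_name: str, function_name: str) -> Callable:
    """Import module_path.script_name and return the named attribute from it."""
    module = importlib.import_module(f"{module_path}.{script_name}")
    return getattr(module, function_name)


# Standard-library demonstration: fetch os.path.basename by name
basename = get_function("os", "path", "basename")
```

Resolving functions by name in this way lets the metadata files name a reader or fetcher as a string, with the lookup deferred until the data are actually needed.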