Data Submodule

This submodule contains tools for working with time series datasets. For clarity, we need some definitions:

  • An array is a Python object whose indexing interface resembles that of a NumPy array or PyTorch tensor. In particular, it has shape and ndim properties and supports len().

  • A series is a 2-dimensional array, where the 0th dimension indexes the channel and the 1st dimension indexes the time. If a 1-dimensional array is passed to a function expecting a series, it will be interpreted as a univariate series and coerced to 2 dimensions.

  • A multiseries is a 3-dimensional array consisting of a collection of series, where the 0th dimension indexes the series, the 1st dimension indexes the channel, and the 2nd dimension indexes the time. If a 2-dimensional array is passed to a function expecting a multiseries, it will be interpreted as a single multivariate series and coerced to 3 dimensions.

  • A dataset is a collection of one or more multiseries. The multiseries in the dataset must all have broadcastable shapes, except for the number of channels, which is allowed to vary. That is, across the multiseries in the dataset, the sizes of the 0th (series) dimension must each be either equal to a common value or one, and likewise for the 2nd (time) dimension.
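
The coercion and broadcasting rules above can be sketched with NumPy. The helpers as_series, as_multiseries and shapes_compatible are hypothetical names used only for illustration; they are not part of torchcast, and the library's own checks may differ in detail.

```python
import numpy as np


def as_series(x):
    """Coerce a 1-D array to a (channels, time) series. Hypothetical helper."""
    x = np.asarray(x)
    if x.ndim == 1:
        # A 1-D input is read as a univariate series with a single channel.
        x = x[np.newaxis, :]
    if x.ndim != 2:
        raise ValueError(f"expected 1 or 2 dimensions, got {x.ndim}")
    return x


def as_multiseries(x):
    """Coerce a 2-D array to a (series, channels, time) multiseries."""
    x = np.asarray(x)
    if x.ndim == 2:
        # A 2-D input is read as a single multivariate series.
        x = x[np.newaxis, :, :]
    if x.ndim != 3:
        raise ValueError(f"expected 2 or 3 dimensions, got {x.ndim}")
    return x


def shapes_compatible(*multiseries):
    """Check that the series (0th) and time (2nd) sizes are equal or one.

    The channel (1st) dimension is deliberately ignored, since it may vary.
    """
    for dim in (0, 2):
        sizes = {m.shape[dim] for m in multiseries} - {1}
        if len(sizes) > 1:
            return False
    return True


univariate = as_series(np.zeros(100))        # shape (1, 100)
single = as_multiseries(np.zeros((3, 100)))  # shape (1, 3, 100)
targets = np.zeros((8, 2, 100))              # 8 series, 2 channels, 100 steps
static = np.zeros((8, 5, 1))                 # 8 series, 5 channels, 1 step
```

Here targets, static and single could sit in one dataset under the stated rule: their series sizes are 8, 8 and 1, and their time sizes are 100, 1 and 100, while the channel counts differ freely.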

class torchcast.data.SeriesDataset(*data: ArrayLike, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)

This is a base class for time series datasets. It is intended to be used only through a subclass, such as torchcast.data.TensorSeriesDataset.

Data held by a SeriesDataset is always returned in shape (channels, time steps), so that it can be stacked to form a batch of series in shape (series, channels, time steps).
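
As a small illustration of why the (channels, time steps) layout is convenient, the sketch below stacks several series along a new leading axis. It uses NumPy only to stay self-contained; torch.stack does the equivalent for tensors. The array sizes are arbitrary example values.

```python
import numpy as np

# Three series, each with 2 channels and 50 time steps.
series = [np.random.rand(2, 50) for _ in range(3)]

# Stacking along a new leading axis yields (series, channels, time steps).
batch = np.stack(series, axis=0)
print(batch.shape)  # (3, 2, 50)
```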

split_by_time(t: int | float) → Tuple[SeriesDataset, SeriesDataset]

Splits the dataset by time.

Parameters:
  • t (int or float) – If this is an integer, then perform the split at this time. If it is a float, perform the split at this percentage of the time.
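
One way to read the int/float distinction for t is sketched below. The function split_index is a hypothetical helper, not part of torchcast. It assumes that a float is a fraction in [0, 1] and that the resulting index is truncated toward zero; the library may round or validate differently.

```python
def split_index(t, n_time):
    """Return the time index at which a series of length n_time is split.

    Hypothetical helper: an int is used as the index directly, while a
    float is read as a fraction of the total length (assumed in [0, 1]).
    """
    if isinstance(t, bool):
        raise TypeError("t must be an int or a float")
    if isinstance(t, int):
        idx = t
    elif isinstance(t, float):
        if not 0.0 <= t <= 1.0:
            raise ValueError("a float t is assumed to lie in [0, 1]")
        idx = int(t * n_time)  # truncation is an assumption
    else:
        raise TypeError("t must be an int or a float")
    if not 0 <= idx <= n_time:
        raise ValueError("split point lies outside the series")
    return idx


# Split a toy series of 10 time steps into its first 80% and the rest.
data = list(range(10))
i = split_index(0.8, len(data))
head, tail = data[:i], data[i:]
```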

class torchcast.data.TensorSeriesDataset(*data: ArrayLike, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)

This encapsulates one or more torch.Tensor containing a multiseries as a dataset, for use in a torch.utils.data.DataLoader. The underlying data can be stored either as a torch.Tensor or as a ListOfTensors.

class torchcast.data.H5SeriesDataset(path: str, keys: List[str] | str, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)

This encapsulates an h5py.File containing a series stored on disk.

Utility Classes

class torchcast.data.Metadata(name: str | None = None, channel_names: List[str] | None = None, series_names: List[str] | None = None)

Metadata encapsulates metadata about a multiseries. In a torchcast.data.SeriesDataset, each multiseries will have a corresponding Metadata object. All fields of Metadata are optional. The fields that may be available are:

  • name: Name of the series.

  • channel_names: A list of the names of each channel.

  • series_names: A list of the names of each series.

check_consistency(multiseries: ArrayLike)

Checks if an array-like object is compatible with the metadata.
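
The source does not say what "compatible" means here. A plausible reading, sketched below, is that the number of names must match the corresponding dimension of the multiseries. MetadataSketch is a stand-in that mirrors the three documented fields; it is not the real Metadata class, and whether the real method returns a value or raises an error is not known from this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MetadataSketch:
    """Stand-in mirroring the three documented fields; not the real class."""

    name: Optional[str] = None
    channel_names: Optional[List[str]] = None
    series_names: Optional[List[str]] = None

    def is_consistent(self, shape: Tuple[int, int, int]) -> bool:
        """Compare name counts against a (series, channels, time) shape."""
        n_series, n_channels, _ = shape
        if self.channel_names is not None and len(self.channel_names) != n_channels:
            return False
        if self.series_names is not None and len(self.series_names) != n_series:
            return False
        # Fields left as None are optional and impose no constraint.
        return True


meta = MetadataSketch(name="weather", channel_names=["temp", "humidity"])
ok = meta.is_consistent((4, 2, 100))   # two channel names, two channels
bad = meta.is_consistent((4, 3, 100))  # two channel names, three channels
```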

class torchcast.data.ListOfTensors(tensors: List[ArrayLike])

This class encapsulates a list of torch.Tensor, and gives it an external API similar to a single torch.Tensor. This is used so that we can have a single multiseries whose constituent series have varying lengths without wasting memory.
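
The memory argument can be made concrete with a short comparison. The sketch below does not use ListOfTensors itself. It only contrasts a plain list of arrays of different lengths with a single array padded to the longest length, using NumPy and arbitrary example lengths.

```python
import numpy as np

# Three series with 2 channels each and very different lengths.
lengths = [10, 250, 1000]
series = [np.zeros((2, n), dtype=np.float32) for n in lengths]

# Keeping the series in a list stores only the real values.
ragged_elements = sum(s.size for s in series)

# Padding to a single (series, channels, time) array stores the longest
# length for every series.
padded = np.zeros((len(series), 2, max(lengths)), dtype=np.float32)
for i, s in enumerate(series):
    padded[i, :, : s.shape[1]] = s
padded_elements = padded.size

print(ragged_elements, padded_elements)  # 2520 6000
```

In this example the padded layout holds more than twice as many elements as the list, and the gap grows as the lengths become more uneven.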