Data Submodule
This submodule contains tools for working with time series datasets. For clarity, we need some definitions:
An array is a Python object that supports indexing in the same way as a NumPy array or PyTorch tensor. In particular, it has shape and ndim properties and has a length.
A series is a 2-dimensional array, where the 0th dimension indexes the channel and the 1st dimension indexes time. If a 1-dimensional array is passed to a function expecting a series, it will be interpreted as a univariate series and coerced to 2 dimensions.
A multiseries is a 3-dimensional array consisting of a collection of series, where the 0th dimension indexes the series, the 1st dimension indexes the channel, and the 2nd dimension indexes time. If a 2-dimensional array is passed to a function expecting a multiseries, it will be interpreted as a single multivariate series and coerced to 3 dimensions.
A dataset is a collection of one or more multiseries. The multiseries in the dataset must all have broadcastable shapes, except for the number of channels, which is allowed to vary. That is, across the multiseries in the dataset, the sizes of the 0th (series) dimension must either all be equal or be one, and likewise for the 2nd (time) dimension.
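The coercion and shape rules above can be sketched with NumPy. This is only an illustration of the definitions: as_series, as_multiseries, and shapes_compatible are hypothetical helper names, not functions provided by torchcast, and the library's own checks may differ in detail.

```python
import numpy as np


def as_series(x):
    """Coerce a 1-D or 2-D array to shape (channels, time). Hypothetical helper."""
    x = np.asarray(x)
    if x.ndim == 1:
        # A 1-D array is read as a univariate series: a single channel.
        x = x[np.newaxis, :]
    if x.ndim != 2:
        raise ValueError(f"expected 1 or 2 dimensions, got {x.ndim}")
    return x


def as_multiseries(x):
    """Coerce a 2-D or 3-D array to shape (series, channels, time). Hypothetical helper."""
    x = np.asarray(x)
    if x.ndim == 2:
        # A 2-D array is read as a single multivariate series.
        x = x[np.newaxis, :, :]
    if x.ndim != 3:
        raise ValueError(f"expected 2 or 3 dimensions, got {x.ndim}")
    return x


def shapes_compatible(*shapes):
    """Check that the series (0th) and time (2nd) sizes broadcast.

    The channel (1st) dimension is deliberately ignored, since it may vary.
    """
    for axis in (0, 2):
        # Sizes of one always broadcast, so only the remaining sizes must agree.
        sizes = {shape[axis] for shape in shapes} - {1}
        if len(sizes) > 1:
            return False
    return True


# 4 series, 3 channels, 24 time steps.
hourly = as_multiseries(np.zeros((4, 3, 24)))
# A 2-D array with 2 channels and 1 time step becomes shape (1, 2, 1).
static = as_multiseries(np.zeros((2, 1)))
print(hourly.shape, static.shape)
print(shapes_compatible(hourly.shape, static.shape))
```

Here the second array broadcasts against the first along both the series and time dimensions, even though the two have different numbers of channels.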
- class torchcast.data.SeriesDataset(*data: ArrayLike, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)
This is a base class for time series datasets. It is expected to be used only through a subclass, such as torchcast.data.TensorSeriesDataset.
Data held by a SeriesDataset is always returned in shape (channels, time steps), so that it can be stacked to form a batch of series in shape (series, channels, time steps).
- split_by_time(t: int | float) → Tuple[SeriesDataset, SeriesDataset]
Splits the dataset by time.
- Parameters:
t (int or float) – If this is an integer, then perform the split at this time. If it is a float, perform the split at this percentage of the time.
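One way the two cases of t might be resolved into a single time index is sketched below. This is not torchcast's implementation: resolve_split_index is a hypothetical helper, and the sketch assumes that a float is a fraction between 0 and 1 of the total number of time steps and that the result is truncated to an integer. The library may interpret and round the value differently.

```python
import numpy as np


def resolve_split_index(t, num_steps):
    """Turn an int or float split point into a time index. Hypothetical helper."""
    if isinstance(t, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("t must be an int or a float")
    if isinstance(t, int):
        # An integer is taken as the time index itself.
        index = t
    elif isinstance(t, float):
        # Assumption: a float is a fraction of the series length in [0, 1].
        if not 0.0 <= t <= 1.0:
            raise ValueError("a float split point must lie in [0, 1]")
        index = int(t * num_steps)
    else:
        raise TypeError("t must be an int or a float")
    if not 0 <= index <= num_steps:
        raise ValueError("split point lies outside the series")
    return index


# A multiseries with 1 series, 2 channels and 10 time steps.
data = np.arange(20).reshape(1, 2, 10)
index = resolve_split_index(0.8, data.shape[-1])

# Split along the last (time) dimension.
head, tail = data[..., :index], data[..., index:]
print(head.shape, tail.shape)
```

Under these assumptions, the first part holds the time steps before the split point and the second part holds the rest.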
- class torchcast.data.TensorSeriesDataset(*data: ArrayLike, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)
This encapsulates one or more torch.Tensor objects, each containing a multiseries, as a dataset for use in a torch.utils.data.DataLoader. The underlying data can be stored either as a torch.Tensor or as a ListOfTensors.
- class torchcast.data.H5SeriesDataset(path: str, keys: List[str] | str, return_length: int | None = None, transform: Callable | None = None, metadata: Metadata | List[Metadata] | None = None)
This encapsulates an h5py.File containing a series stored on disk.
Utility Classes
- class torchcast.data.Metadata(name: str | None = None, channel_names: List[str] | None = None, series_names: List[str] | None = None)
Metadata encapsulates metadata about a multiseries. In a torchcast.data.SeriesDataset, each multiseries will have a corresponding Metadata object. All fields of Metadata are optional. The fields that may be available are:
name: Name of the series.
channel_names: A list of the names of each channel.
series_names: A list of the names of each series.
- check_consistency(multiseries: ArrayLike)
Checks if an array-like object is compatible with the metadata.
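The documentation does not spell out what compatibility means here. A plausible reading, sketched below, is that the number of channel names must match the channel dimension and the number of series names must match the series dimension. MetadataSketch is a hypothetical stand-in written for illustration, not the torchcast class, and the real check may cover more or less than this.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class MetadataSketch:
    """Hypothetical stand-in with the same three optional fields."""

    name: Optional[str] = None
    channel_names: Optional[List[str]] = None
    series_names: Optional[List[str]] = None

    def check_consistency(self, multiseries) -> None:
        """Raise ValueError if the names do not fit the multiseries shape."""
        if multiseries.ndim != 3:
            raise ValueError("a multiseries must have 3 dimensions")
        num_series, num_channels, _ = multiseries.shape
        # Assumption: a missing field places no constraint on the data.
        if self.channel_names is not None and len(self.channel_names) != num_channels:
            raise ValueError("channel_names does not match the channel dimension")
        if self.series_names is not None and len(self.series_names) != num_series:
            raise ValueError("series_names does not match the series dimension")


meta = MetadataSketch(name="weather", channel_names=["temp", "humidity"])

# 5 series, 2 channels, 100 time steps: consistent with two channel names.
meta.check_consistency(np.zeros((5, 2, 100)))

# 3 channels against two channel names: reported as inconsistent.
try:
    meta.check_consistency(np.zeros((5, 3, 100)))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```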
- class torchcast.data.ListOfTensors(tensors: List[ArrayLike])
This class encapsulates a list of torch.Tensor objects and gives it an external API similar to that of a single torch.Tensor. This is used so that we can have a single multiseries whose constituent series have varying lengths without wasting memory.
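The memory argument can be illustrated with a small stand-in built on NumPy arrays. RaggedMultiseries is a hypothetical class written for this sketch, not the torchcast implementation. In particular, reporting the longest series as the time dimension of shape is an assumption, and ListOfTensors may expose its shape differently.

```python
import numpy as np


class RaggedMultiseries:
    """Hypothetical stand-in: series of different lengths behind one array-like API."""

    def __init__(self, series):
        self.series = [np.asarray(s) for s in series]
        # Every series must be (channels, time) with the same channel count.
        if any(s.ndim != 2 for s in self.series):
            raise ValueError("each series must have 2 dimensions")
        if len({s.shape[0] for s in self.series}) != 1:
            raise ValueError("all series must have the same number of channels")

    def __len__(self):
        return len(self.series)

    def __getitem__(self, i):
        # Indexing the 0th dimension returns one series at its own length.
        return self.series[i]

    @property
    def ndim(self):
        return 3

    @property
    def shape(self):
        # Assumption: the time dimension reports the longest series.
        longest = max(s.shape[1] for s in self.series)
        return (len(self.series), self.series[0].shape[0], longest)

    def num_stored(self):
        """Number of values actually held in memory."""
        return sum(s.size for s in self.series)

    def num_padded(self):
        """Number of values a single padded 3-D array would need."""
        n, c, t = self.shape
        return n * c * t


# Three series with 2 channels and lengths 100, 10 and 5.
ragged = RaggedMultiseries([np.zeros((2, 100)), np.zeros((2, 10)), np.zeros((2, 5))])
print(ragged.shape, ragged.num_stored(), ragged.num_padded())
```

In this example a padded array would hold 600 values, while the list stores only the 230 that carry data.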