downhill.dataset.Dataset

class downhill.dataset.Dataset(inputs, name=None, batch_size=32, iteration_size=None, axis=0)

This class handles batching and shuffling a dataset.

In downhill, losses are optimized using sets of data collected from the problem that generated the loss.

During optimization, data are grouped into “mini-batches”: chunks larger than a single sample but smaller than the entire set of samples. A mini-batch typically contains between 10 and 100 samples, though the best size depends on your model, hardware, dataset, and so forth. Mini-batches must be presented to the optimizer in pseudo-random order to match the stochasticity assumptions of many optimization algorithms. This class groups data into mini-batches and handles shuffling and iterating over those mini-batches dynamically as the dataset is consumed by the optimization algorithm.

For many tasks, a dataset is obtained as a large block of sample data, which in Python is normally assembled as a numpy ndarray. To use this class on such a dataset, just pass in a list or tuple containing numpy arrays; the number of these arrays must match the number of inputs that your loss computation requires.
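For example, a loss that takes a feature matrix and a target array as its two inputs might be wrapped in a dataset like the following sketch (the array shapes and names here are illustrative, not part of the API):

    import numpy as np

    import downhill

    # Illustrative data: 1000 samples with 10 features each, plus one
    # target value per sample. Both arrays share length 1000 along axis 0.
    features = np.random.randn(1000, 10).astype('float32')
    targets = np.random.randn(1000, 1).astype('float32')

    # One array per input required by the loss, in order.
    train = downhill.dataset.Dataset(
        [features, targets], name='train', batch_size=32)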

Sometimes a suitable set of training data would be prohibitively expensive to assemble in memory as a single numpy array. For these cases, this class also accepts a dataset provided via a Python callable. For more information on using callables to provide data to your model, see Using Callables.
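For instance, a callable that draws each mini-batch on demand might look like this sketch (the batch contents are made up purely for illustration):

    import numpy as np

    import downhill

    def batches():
        # Called with no arguments; returns a tuple of ndarrays forming
        # one mini-batch. Here each batch is generated on the fly rather
        # than sliced from arrays held in memory.
        x = np.random.randn(32, 10).astype('float32')
        y = np.random.randn(32, 1).astype('float32')
        return x, y

    # The callable has no length, so iterate() would yield 100 batches
    # by default; iteration_size overrides that here.
    train = downhill.dataset.Dataset(batches, name='train', iteration_size=20)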

Parameters:

inputs : ndarray, tuple, list, or callable

One or more sets of data.

If this parameter is callable, then mini-batches will be obtained by calling the callable with no arguments; the callable is expected to return a tuple of ndarrays that will be suitable for optimizing the loss at hand.

If this parameter is a list or tuple, it must contain ndarrays (or something similar with a shape attribute, like a pandas DataFrame). These are assumed to contain data for computing the loss; the length of this tuple or list should match the number of inputs required by the loss computation. If multiple ndarrays are provided, their lengths along the axis given by the axis parameter (defaults to 0) must match.

name : str, optional

A string that is used to describe this dataset. Usually something like ‘test’ or ‘train’.

batch_size : int, optional

The size of the mini-batches to create from the data sequences. Defaults to 32.

iteration_size : int, optional

The number of batches to yield for each call to iterate(). Defaults to the length of the data divided by batch_size. If the dataset is provided by a callable, the default is the length of the callable; if the callable has no length, the default is 100.

axis : int, optional

The axis along which to split the data arrays, if the first parameter is given as one or more ndarrays. If not provided, defaults to 0.
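To make these defaults concrete, here is a sketch of how the parameters interact (array sizes chosen purely for illustration):

    import numpy as np

    import downhill

    samples = np.random.randn(1000, 10).astype('float32')

    # batch_size=50 with no iteration_size: iterate() yields
    # 1000 / 50 = 20 mini-batches per call, split along axis 0.
    rows = downhill.dataset.Dataset([samples], batch_size=50)

    # axis=1 splits along columns instead: each mini-batch holds all
    # 1000 rows but only 2 of the 10 columns.
    cols = downhill.dataset.Dataset([samples], batch_size=2, axis=1)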

__init__(inputs, name=None, batch_size=32, iteration_size=None, axis=0)

Create a mini-batch dataset from data arrays or a callable.

Methods

__init__(inputs[, name, batch_size, ...])
    Create a mini-batch dataset from data arrays or a callable.

iterate([shuffle])
    Iterate over batches in the dataset.

shuffle()
    Shuffle the batches in the dataset.

iterate(shuffle=True)

Iterate over batches in the dataset.

This method generates iteration_size batches from the dataset and then returns.

Parameters:

shuffle : bool, optional

Shuffle the batches in this dataset if the iteration reaches the end of the batch list. Defaults to True.
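A minimal sketch of consuming batches by hand follows; during training, downhill's optimizers normally call iterate() for you. The data here are random placeholders:

    import numpy as np

    import downhill

    features = np.random.randn(256, 8).astype('float32')
    targets = np.random.randn(256, 1).astype('float32')
    ds = downhill.dataset.Dataset([features, targets], batch_size=32)

    # Each call to iterate() generates iteration_size batches (here
    # 256 / 32 = 8), each containing one ndarray per input, in order.
    for x, y in ds.iterate(shuffle=True):
        print(x.shape, y.shape)  # (32, 8) (32, 1)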

shuffle()

Shuffle the batches in the dataset.

If this dataset was constructed using a callable, this method has no effect.