downhill.dataset.Dataset

class downhill.dataset.Dataset(inputs, name=None, batch_size=32, iteration_size=None, axis=0, rng=None)

This class handles batching and shuffling a dataset.

In downhill, losses are optimized using sets of data collected from the problem that generated the loss.

During optimization, data are grouped into “mini-batches”—that is, chunks that are larger than 1 sample and smaller than the entire set of samples; typically the size of a mini-batch is between 10 and 100, but the specific setting can be varied depending on your model, hardware, dataset, and so forth. These mini-batches must be presented to the optimization algorithm in pseudo-random order to match the underlying stochasticity assumptions of many optimization algorithms. This class handles the process of grouping data into mini-batches as well as iterating and shuffling these mini-batches dynamically as the dataset is consumed by the optimization algorithm.

For many tasks, a dataset is obtained as a large block of sample data, which in Python is normally assembled as a numpy ndarray. To use this class on such a dataset, just pass in a list or tuple containing numpy arrays; the number of these arrays must match the number of inputs that your loss computation requires.
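For example, here is a minimal sketch of building a Dataset from in-memory arrays (the array shapes and the two-input loss are assumptions made for illustration):

    import numpy as np
    import downhill

    # Hypothetical data: 1000 samples with 10 features each, plus one target per sample.
    features = np.random.randn(1000, 10).astype('float32')
    targets = np.random.randn(1000, 1).astype('float32')

    # The loss is assumed to take two inputs, so two arrays are passed; both are
    # split into mini-batches of 32 samples along axis 0.
    train = downhill.dataset.Dataset([features, targets], name='train', batch_size=32)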

In some cases, a suitable set of training data would be prohibitively expensive to assemble in memory as a single numpy array. To cover these cases, this class also accepts a dataset provided via a Python callable. For more information on using callables to provide data to your model, see Using Callables.
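As a rough sketch (the batch shapes and the two-input loss below are assumptions), a callable dataset can assemble each mini-batch on demand instead of holding everything in memory:

    import numpy as np
    import downhill

    def draw_batch():
        # Called with no arguments; returns a tuple of ndarray-like objects,
        # one per input required by the loss. Here each call fabricates a
        # fresh mini-batch rather than slicing a preloaded array.
        features = np.random.randn(32, 10).astype('float32')
        targets = np.random.randn(32, 1).astype('float32')
        return features, targets

    # batch_size has no effect for callables; iteration_size sets how many
    # batches (callable invocations) each call to iterate() yields.
    stream = downhill.dataset.Dataset(draw_batch, name='train', iteration_size=50)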

Parameters:

inputs : callable or list of ndarray/sparse matrix/DataFrame/theano shared var

One or more sets of data.

If this parameter is callable, then mini-batches will be obtained by calling the callable with no arguments; the callable is expected to return a tuple of ndarray-like objects that will be suitable for optimizing the loss at hand.

If this parameter is a list (or a tuple), it must contain array-like objects: numpy.ndarray, scipy.sparse.csc_matrix, scipy.sparse.csr_matrix, pandas.DataFrame or theano.shared. These are assumed to contain data for computing the loss, so the length of this tuple or list should match the number of inputs required by the loss computation. If multiple arrays are provided, their lengths along the axis given by the axis parameter (defaults to 0) must match.

name : str, optional

A string that is used to describe this dataset. Usually something like ‘test’ or ‘train’.

batch_size : int, optional

The size of the mini-batches to create from the data sequences. If this is negative or zero, all data in the dataset will be used in one batch. Defaults to 32. This parameter has no effect if inputs is callable. (See the sketch following this parameter list for batch_size and rng in use.)

iteration_size : int, optional

The number of batches to yield for each call to iterate(). Defaults to the length of the data divided by batch_size. If the dataset is provided by a callable, this defaults to len(callable); if the callable has no length, it defaults to 100.

axis : int, optional

The axis along which to split the data arrays, if the first parameter is given as one or more ndarrays. If not provided, defaults to 0.

rng : numpy.random.RandomState or int, optional

A random number generator, or an integer seed for a random number generator. If not provided, the random number generator will be created with an automatically chosen seed.
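As mentioned under batch_size, here is a sketch combining batch_size and rng (the validation array is an illustrative assumption):

    import numpy as np
    import downhill

    valid_data = np.random.randn(256, 8).astype('float32')

    # batch_size <= 0 keeps all 256 samples in a single batch, which is common
    # for a validation set; an integer rng seed makes shuffling reproducible.
    valid = downhill.dataset.Dataset([valid_data], name='valid', batch_size=0, rng=13)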

__init__(inputs, name=None, batch_size=32, iteration_size=None, axis=0, rng=None)

Methods

__init__(inputs[, name, batch_size, ...])
iterate([shuffle])    Iterate over batches in the dataset.
shuffle()             Shuffle the batches in the dataset.

iterate(shuffle=True)

Iterate over batches in the dataset.

This method generates iteration_size batches from the dataset and then returns.

Parameters:

shuffle : bool, optional

Shuffle the batches in this dataset if the iteration reaches the end of the batch list. Defaults to True.

Yields:

batches : data batches

A sequence of batches—often from a training, validation, or test dataset.
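A minimal sketch of consuming a dataset with iterate() (the array and batch size are assumptions):

    import numpy as np
    import downhill

    data = np.arange(20, dtype='float32').reshape(10, 2)
    ds = downhill.dataset.Dataset([data], name='demo', batch_size=4)

    # Each call to iterate() yields iteration_size mini-batches; each batch is
    # a sequence of arrays, one per input array given to the constructor.
    for batch in ds.iterate(shuffle=True):
        (chunk,) = batch          # a single input array was provided above
        print(chunk.shape)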

shuffle()

Shuffle the batches in the dataset.

If this dataset was constructed using a callable, this method has no effect.
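For example, shuffle() can be called once up front and combined with iterate(shuffle=False) to walk the batches in a fixed, pre-shuffled order; a small sketch under those assumptions:

    import numpy as np
    import downhill

    data = np.arange(12, dtype='float32').reshape(6, 2)
    ds = downhill.dataset.Dataset([data], batch_size=2)

    ds.shuffle()                       # reorder the mini-batches once
    for batch in ds.iterate(shuffle=False):
        print(batch[0])                # batches arrive in the shuffled order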