gluon.data

mxnet.gluon.data

Dataset utilities.

Datasets

Dataset

Abstract dataset class.

ArrayDataset(*args)

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

RecordFileDataset(filename)

A dataset wrapping over a RecordIO (.rec) file.

SimpleDataset(data)

Simple Dataset wrapper for lists and arrays.

Sampling

Sampler

Base class for samplers.

SequentialSampler(length[, start])

Samples elements from [start, start+length) sequentially.

RandomSampler(length)

Samples elements from [0, length) randomly without replacement.

BatchSampler(sampler, batch_size[, last_batch])

Wraps over another Sampler and returns mini-batches of samples.

IntervalSampler(length, interval[, rollover])

Samples elements from [0, length) at fixed intervals.

DataLoader

DataLoader(dataset[, batch_size, shuffle, …])

Loads data from a dataset and returns mini-batches of data.

API Reference

Dataset utilities.

Classes

ArrayDataset(*args)

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

BatchSampler(sampler, batch_size[, last_batch])

Wraps over another Sampler and returns mini-batches of samples.

DataLoader(dataset[, batch_size, shuffle, …])

Loads data from a dataset and returns mini-batches of data.

Dataset

Abstract dataset class.

FilterSampler(fn, dataset)

Samples elements from a Dataset for which fn returns True.

IntervalSampler(length, interval[, rollover])

Samples elements from [0, length) at fixed intervals.

RandomSampler(length)

Samples elements from [0, length) randomly without replacement.

RecordFileDataset(filename)

A dataset wrapping over a RecordIO (.rec) file.

Sampler

Base class for samplers.

SequentialSampler(length[, start])

Samples elements from [start, start+length) sequentially.

SimpleDataset(data)

Simple Dataset wrapper for lists and arrays.

class ArrayDataset(*args)[source]

Bases: mxnet.gluon.data.dataset.Dataset

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

The i-th sample is defined as (x1[i], x2[i], …).

Parameters

*args (one or more dataset-like objects) – The data arrays.
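Examples

A minimal sketch, combining an NDArray of features with a plain Python list of labels (the data here is hypothetical):

>>> import mxnet as mx
>>> from mxnet import gluon
>>> X = mx.nd.random.uniform(shape=(10, 2))
>>> y = list(range(10))
>>> dataset = gluon.data.ArrayDataset(X, y)
>>> len(dataset)
10
>>> dataset[3][1]  # the i-th sample is (X[i], y[i]); take the label
3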

class BatchSampler(sampler, batch_size, last_batch='keep')[source]

Bases: mxnet.gluon.data.sampler.Sampler

Wraps over another Sampler and returns mini-batches of samples.

Parameters
  • sampler (Sampler) – The source Sampler.

  • batch_size (int) – Size of mini-batch.

  • last_batch ({'keep', 'discard', 'rollover'}) –

    Specifies how the last batch is handled if batch_size does not evenly divide sequence length.

    If ‘keep’, the last batch will be returned directly, but will contain fewer elements than batch_size requires.

    If ‘discard’, the last batch will be discarded.

    If ‘rollover’, the remaining elements will be rolled over to the next iteration.

Examples

>>> sampler = gluon.data.SequentialSampler(10)
>>> batch_sampler = gluon.data.BatchSampler(sampler, 3, 'keep')
>>> list(batch_sampler)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
class DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None, batchify_fn=None, num_workers=0, pin_memory=False, pin_device_id=0, prefetch=None, thread_pool=False, timeout=120, try_nopython=None)[source]

Bases: object

Loads data from a dataset and returns mini-batches of data.

Parameters
  • dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.

  • batch_size (int) – Size of mini-batch.

  • shuffle (bool) – Whether to shuffle the samples.

  • sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.

  • last_batch ({'keep', 'discard', 'rollover'}) –

    How to handle the last batch if batch_size does not evenly divide len(dataset).

    keep - A batch with fewer samples than previous batches is returned.

    discard - The last batch is discarded if it is incomplete.

    rollover - The remaining samples are rolled over to the next epoch.

  • batch_sampler (Sampler) – A sampler that returns mini-batches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.

  • batchify_fn (callable) –

    Callback function to allow users to specify how to merge samples into a batch. Defaults to gluon.data.batchify.Stack().

    from mxnet import nd
    import numpy as np

    def default_batchify_fn(data):
        if isinstance(data[0], nd.NDArray):
            # Stack MXNet NDArrays along a new batch axis.
            return nd.stack(*data)
        elif isinstance(data[0], np.ndarray):
            # Stack numpy arrays along a new batch axis.
            return np.stack(data)
        elif isinstance(data[0], tuple):
            # Batchify each component of the sample tuples recursively.
            data = zip(*data)
            return [default_batchify_fn(i) for i in data]
        else:
            # Fall back to converting the samples to a numpy array.
            return np.asarray(data)
    

  • num_workers (int, default 0) – The number of multiprocessing workers to use for data preprocessing.

  • pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory.

  • pin_device_id (int, default 0) – The device id to use for allocating pinned memory if pin_memory is True.

  • prefetch (int, default is num_workers * 2) – The number of batches to prefetch; only effective if num_workers > 0. If prefetch > 0, worker processes prefetch batches before the data is requested from the iterator. A larger value gives smoother performance at startup but consumes more shared memory, while a value that is too small may forfeit the benefit of multiple worker processes; consider reducing num_workers in that case. Defaults to num_workers * 2.

  • thread_pool (bool, default False) – If True, use a thread pool instead of a multiprocessing pool. Using a thread pool avoids shared memory usage. If the DataLoader is I/O-bound, or the GIL is not a bottleneck, the thread pool version may achieve better performance than multiprocessing.

  • timeout (int, default is 120) – The timeout in seconds for each worker to fetch a batch of data. Do not modify this number unless you are experiencing timeouts and know they are caused by slow data loading. Sometimes full shared_memory will cause all workers to hang, causing a timeout; in that case, reduce num_workers or increase the system shared_memory size instead.

  • try_nopython (bool or None, default is None) – Try to compile the Python data-loading pipeline into a pure MXNet C++ implementation. The benefits are potentially faster iteration, no shared_memory usage, and fewer processes managed by Python. The compilation is not guaranteed to support all use cases, but it will fall back to Python in case of failure. Set try_nopython to False to disable auto-detection of the compilation feature, or leave it as None to let MXNet decide automatically. If you set try_nopython to True and the compilation fails, a RuntimeError is raised with the failure reason.
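Examples

A usage sketch with small arbitrary arrays; each iteration yields one batchified (data, label) pair:

>>> import mxnet as mx
>>> from mxnet import gluon
>>> X = mx.nd.random.uniform(shape=(10, 3))
>>> y = mx.nd.arange(10)
>>> loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
...                                batch_size=4, last_batch='keep')
>>> for data, label in loader:
...     print(data.shape, label.shape)
(4, 3) (4,)
(4, 3) (4,)
(2, 3) (2,)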

class Dataset[source]

Bases: object

Abstract dataset class. All datasets should have this interface.

Subclasses need to override __getitem__, which returns the i-th element, and __len__, which returns the total number of elements.

Note

An mxnet or numpy array can be directly used as a dataset.
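Examples

An illustrative sketch of a minimal custom Dataset (the class name and contents are hypothetical):

>>> from mxnet import gluon
>>> class SquaresDataset(gluon.data.Dataset):
...     def __init__(self, length):
...         self._length = length
...     def __getitem__(self, idx):
...         return idx * idx  # the i-th element
...     def __len__(self):
...         return self._length  # total number of elements
>>> ds = SquaresDataset(4)
>>> [ds[i] for i in range(len(ds))]
[0, 1, 4, 9]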

Methods

filter(fn)

Returns a new dataset with samples filtered by the filter function fn.

sample(sampler)

Returns a new dataset with elements sampled by the sampler.

shard(num_shards, index)

Returns a new dataset that includes only 1/num_shards of this dataset.

take(count)

Returns a new dataset with at most count samples in it.

transform(fn[, lazy])

Returns a new dataset with each sample transformed by the transformer function fn.

transform_first(fn[, lazy])

Returns a new dataset with the first element of each sample transformed by the transformer function fn.

filter(fn)[source]

Returns a new dataset with samples filtered by the filter function fn.

Note that if the Dataset is the result of a lazy transform with transform(lazy=True), the filter is eagerly applied to the transformed samples without materializing the transformed result. That is, the transformation will be applied again whenever a sample is retrieved after filter().

Parameters

fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False are discarded.

Returns

The filtered dataset.

Return type

Dataset
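Examples

A minimal sketch over a small in-memory dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> even = dataset.filter(lambda x: x % 2 == 0)
>>> list(even)
[0, 2, 4, 6, 8]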

sample(sampler)[source]

Returns a new dataset with elements sampled by the sampler.

Parameters

sampler (Sampler) – A Sampler that returns the indices of sampled elements.

Returns

The result dataset.

Return type

Dataset
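Examples

A minimal sketch; the IntervalSampler here yields the indices [0, 2, 4, 1, 3], which sample() uses to reorder the dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([10, 11, 12, 13, 14])
>>> sampled = dataset.sample(gluon.data.IntervalSampler(5, interval=2))
>>> list(sampled)
[10, 12, 14, 11, 13]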

shard(num_shards, index)[source]

Returns a new dataset that includes only 1/num_shards of this dataset.

For distributed training, be sure to shard before you randomize the dataset (for example, with shuffle) if you want each worker to see a unique subset.

Parameters
  • num_shards (int) – An integer representing the number of data shards.

  • index (int) – An integer representing the index of the current shard.

Returns

The result dataset.

Return type

Dataset
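Examples

A minimal sketch; how the remainder is distributed across shards is an implementation detail, but the shards always partition the dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> shards = [dataset.shard(num_shards=3, index=i) for i in range(3)]
>>> sum(len(s) for s in shards)
10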

take(count)[source]

Returns a new dataset with at most count samples in it.

Parameters

count (int or None) – An integer representing the number of elements of this dataset that should be taken to form the new dataset. If count is None, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset.

Returns

The result dataset.

Return type

Dataset
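Examples

A minimal sketch:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> list(dataset.take(3))
[0, 1, 2]
>>> len(dataset.take(None)) == len(dataset)
True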

transform(fn, lazy=True)[source]

Returns a new dataset with each sample transformed by the transformer function fn.

Parameters
  • fn (callable) – A transformer function that takes a sample as input and returns the transformed sample.

  • lazy (bool, default True) – If False, transforms all samples at once. Otherwise, transforms each sample on demand. Note that if fn is stochastic, you must set lazy to True or you will get the same result on all epochs.

Returns

The transformed dataset.

Return type

Dataset
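Examples

A minimal sketch with a deterministic transformer:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([1, 2, 3])
>>> doubled = dataset.transform(lambda x: x * 2)
>>> list(doubled)
[2, 4, 6]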

transform_first(fn, lazy=True)[source]

Returns a new dataset with the first element of each sample transformed by the transformer function fn.

This is mostly applicable when each sample contains two components - features and label, i.e., (X, y), and you only want to transform the first element X (i.e., the features) while keeping the label y unchanged.

Parameters
  • fn (callable) – A transformer function that takes the first element of a sample as input and returns the transformed element.

  • lazy (bool, default True) – If False, transforms all samples at once. Otherwise, transforms each sample on demand. Note that if fn is stochastic, you must set lazy to True or you will get the same result on all epochs.

Returns

The transformed dataset.

Return type

Dataset
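Examples

A minimal sketch; the features are scaled while the labels stay unchanged:

>>> from mxnet import gluon
>>> dataset = gluon.data.ArrayDataset([1, 2, 3], [0, 1, 0])
>>> scaled = dataset.transform_first(lambda x: x * 10)
>>> scaled[1]
(20, 1)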

class FilterSampler(fn, dataset)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from a Dataset for which fn returns True.

Parameters
  • fn (callable) – A function that takes a sample and returns a boolean.

  • dataset (Dataset) – The dataset to filter.
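Examples

A minimal sketch; the sampler yields the indices of the samples for which fn returns True:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([5, 0, 8, 0, 3])
>>> sampler = gluon.data.FilterSampler(lambda x: x > 0, dataset)
>>> list(sampler)
[0, 2, 4]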

class IntervalSampler(length, interval, rollover=True)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [0, length) at fixed intervals.

Parameters
  • length (int) – Length of the sequence.

  • interval (int) – The number of items to skip between two samples.

  • rollover (bool, default True) – Whether to start again from the first skipped item after reaching the end. If True, the sampler starts again from the first skipped item until all items are visited. Otherwise, iteration stops when the end is reached and skipped items are ignored.

Examples

>>> sampler = gluon.data.IntervalSampler(13, interval=3)
>>> list(sampler)
[0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
>>> sampler = gluon.data.IntervalSampler(13, interval=3, rollover=False)
>>> list(sampler)
[0, 3, 6, 9, 12]
class RandomSampler(length)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [0, length) randomly without replacement.

Parameters

length (int) – Length of the sequence.
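Examples

A minimal sketch; the order is random, but the result is always a permutation of [0, length):

>>> from mxnet import gluon
>>> sampler = gluon.data.RandomSampler(5)
>>> sorted(sampler)
[0, 1, 2, 3, 4]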

class RecordFileDataset(filename)[source]

Bases: mxnet.gluon.data.dataset.Dataset

A dataset wrapping over a RecordIO (.rec) file.

Each sample is a string representing the raw content of a record.

Parameters

filename (str) – Path to rec file.
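Examples

A usage sketch; 'data.rec' is a hypothetical file created beforehand, e.g. with MXNet's im2rec tool:

>>> from mxnet import gluon
>>> dataset = gluon.data.RecordFileDataset('data.rec')  # hypothetical path
>>> record = dataset[0]  # raw content of the first record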

class Sampler[source]

Bases: object

Base class for samplers.

All samplers should subclass Sampler and define __iter__ and __len__ methods.
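Examples

An illustrative sketch of a custom sampler (hypothetical) that visits even indices before odd ones:

>>> from mxnet import gluon
>>> class EvenFirstSampler(gluon.data.Sampler):
...     def __init__(self, length):
...         self._length = length
...     def __iter__(self):
...         # Even indices first, then odd indices.
...         indices = list(range(0, self._length, 2)) + list(range(1, self._length, 2))
...         return iter(indices)
...     def __len__(self):
...         return self._length
>>> list(EvenFirstSampler(5))
[0, 2, 4, 1, 3]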

class SequentialSampler(length, start=0)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [start, start+length) sequentially.

Parameters
  • length (int) – Length of the sequence.

  • start (int, default is 0) – The starting index of the sequence.
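Examples

A minimal sketch:

>>> from mxnet import gluon
>>> list(gluon.data.SequentialSampler(4, start=10))
[10, 11, 12, 13]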

class SimpleDataset(data)[source]

Bases: mxnet.gluon.data.dataset.Dataset

Simple Dataset wrapper for lists and arrays.

Parameters

data (dataset-like object) – Any object that implements len() and [].
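Examples

A minimal sketch wrapping a plain Python list of (feature, label) pairs:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([('a', 0), ('b', 1)])
>>> len(dataset)
2
>>> dataset[1]
('b', 1)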