gluon.data

mxnet.gluon.data

Dataset utilities.

Datasets

Dataset

Abstract dataset class.

ArrayDataset(*args)

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

RecordFileDataset(filename)

A dataset wrapping over a RecordIO (.rec) file.

SimpleDataset(data)

Simple Dataset wrapper for lists and arrays.

Sampling

Sampler

Base class for samplers.

SequentialSampler(length[, start])

Samples elements from [start, start+length) sequentially.

RandomSampler(length)

Samples elements from [0, length) randomly without replacement.

BatchSampler(sampler, batch_size[, last_batch])

Wraps over another Sampler and returns mini-batches of samples.

IntervalSampler(length, interval[, rollover])

Samples elements from [0, length) at fixed intervals.

DataLoader

DataLoader(dataset[, batch_size, shuffle, …])

Loads data from a dataset and returns mini-batches of data.

API Reference

Dataset utilities.

Classes

ArrayDataset(*args)

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

BatchSampler(sampler, batch_size[, last_batch])

Wraps over another Sampler and returns mini-batches of samples.

DataLoader(dataset[, batch_size, shuffle, …])

Loads data from a dataset and returns mini-batches of data.

Dataset

Abstract dataset class.

FilterSampler(fn, dataset)

Samples elements from a Dataset for which fn returns True.

IntervalSampler(length, interval[, rollover])

Samples elements from [0, length) at fixed intervals.

RandomSampler(length)

Samples elements from [0, length) randomly without replacement.

RecordFileDataset(filename)

A dataset wrapping over a RecordIO (.rec) file.

Sampler

Base class for samplers.

SequentialSampler(length[, start])

Samples elements from [start, start+length) sequentially.

SimpleDataset(data)

Simple Dataset wrapper for lists and arrays.

class ArrayDataset(*args)[source]

Bases: mxnet.gluon.data.dataset.Dataset

A dataset that combines multiple dataset-like objects, e.g. Datasets, lists, arrays, etc.

The i-th sample is defined as (x1[i], x2[i], …).

Parameters

*args (one or more dataset-like objects) – The data arrays.
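Examples

A minimal sketch, combining an NDArray of features with a plain Python list of labels (the data here is hypothetical):

>>> import mxnet as mx
>>> from mxnet import gluon
>>> X = mx.nd.random.uniform(shape=(10, 2))
>>> y = list(range(10))
>>> dataset = gluon.data.ArrayDataset(X, y)
>>> len(dataset)
10
>>> dataset[3][1]  # the i-th sample is (X[i], y[i]); take the label
3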

class BatchSampler(sampler, batch_size, last_batch='keep')[source]

Bases: mxnet.gluon.data.sampler.Sampler

Wraps over another Sampler and returns mini-batches of samples.

Parameters
  • sampler (Sampler) – The source Sampler.

  • batch_size (int) – Size of mini-batch.

  • last_batch ({'keep', 'discard', 'rollover'}) –

    Specifies how the last batch is handled if batch_size does not evenly divide sequence length.

    If ‘keep’, the last batch will be returned directly, but will contain fewer elements than batch_size requires.

    If ‘discard’, the last batch will be discarded.

    If ‘rollover’, the remaining elements will be rolled over to the next iteration.

Examples

>>> sampler = gluon.data.SequentialSampler(10)
>>> batch_sampler = gluon.data.BatchSampler(sampler, 3, 'keep')
>>> list(batch_sampler)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
class DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None, batchify_fn=None, num_workers=0, pin_memory=False, pin_device_id=0, prefetch=None, thread_pool=False, timeout=120, try_nopython=None)[source]

Bases: object

Loads data from a dataset and returns mini-batches of data.

Parameters
  • dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.

  • batch_size (int) – Size of mini-batch.

  • shuffle (bool) – Whether to shuffle the samples.

  • sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.

  • last_batch ({'keep', 'discard', 'rollover'}) –

    How to handle the last batch if batch_size does not evenly divide len(dataset).

    keep - A batch with fewer samples than previous batches is returned.

    discard - The last batch is discarded if it is incomplete.

    rollover - The remaining samples are rolled over to the next epoch.

  • batch_sampler (Sampler) – A sampler that returns mini-batches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.

  • batchify_fn (callable) –

    Callback function to allow users to specify how to merge samples into a batch. Defaults to gluon.data.batchify.Stack().

    from mxnet import nd
    import numpy as np

    def default_batchify_fn(data):
        if isinstance(data[0], nd.NDArray):
            # Stack MXNet NDArrays along a new batch axis.
            return nd.stack(*data)
        elif isinstance(data[0], np.ndarray):
            # Stack numpy arrays along a new batch axis.
            return np.stack(data)
        elif isinstance(data[0], tuple):
            # Batchify each component of the sample tuples recursively.
            data = zip(*data)
            return [default_batchify_fn(i) for i in data]
        else:
            # Fall back to converting the samples to a numpy array.
            return np.asarray(data)
    

  • num_workers (int, default 0) – The number of multiprocessing workers to use for data preprocessing.

  • pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory.

  • pin_device_id (int, default 0) – The device id to use for allocating pinned memory if pin_memory is True.

  • prefetch (int, default is num_workers * 2) – The number of batches to prefetch; only effective if num_workers > 0. If prefetch > 0, worker processes prefetch batches before the data is requested from the iterator. A larger value gives smoother performance at startup but consumes more shared memory, while a value that is too small may forfeit the benefit of multiple worker processes; consider reducing num_workers in that case. Defaults to num_workers * 2.

  • thread_pool (bool, default False) – If True, use a thread pool instead of a multiprocessing pool. Using a thread pool avoids shared memory usage. If the DataLoader is I/O-bound, or the GIL is not a bottleneck, the thread pool version may achieve better performance than multiprocessing.

  • timeout (int, default is 120) – The timeout in seconds for each worker to fetch a batch of data. Do not modify this number unless you are experiencing timeouts and know they are caused by slow data loading. Sometimes full shared_memory will cause all workers to hang, causing a timeout; in that case, reduce num_workers or increase the system shared_memory size instead.

  • try_nopython (bool or None, default is None) – Try to compile the Python data-loading pipeline into a pure MXNet C++ implementation. The benefits are potentially faster iteration, no shared_memory usage, and fewer processes managed by Python. The compilation is not guaranteed to support all use cases, but it will fall back to Python in case of failure. Set try_nopython to False to disable auto-detection of the compilation feature, or leave it as None to let MXNet decide automatically. If you set try_nopython to True and the compilation fails, a RuntimeError is raised with the failure reason.
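Examples

A usage sketch with small arbitrary arrays; each iteration yields one batchified (data, label) pair:

>>> import mxnet as mx
>>> from mxnet import gluon
>>> X = mx.nd.random.uniform(shape=(10, 3))
>>> y = mx.nd.arange(10)
>>> loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
...                                batch_size=4, last_batch='keep')
>>> for data, label in loader:
...     print(data.shape, label.shape)
(4, 3) (4,)
(4, 3) (4,)
(2, 3) (2,)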

class Dataset[source]

Bases: object

Abstract dataset class. All datasets should have this interface.

Subclasses need to override __getitem__, which returns the i-th element, and __len__, which returns the total number of elements.

Note

An mxnet or numpy array can be directly used as a dataset.
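Examples

An illustrative sketch of a minimal custom Dataset (the class name and contents are hypothetical):

>>> from mxnet import gluon
>>> class SquaresDataset(gluon.data.Dataset):
...     def __init__(self, length):
...         self._length = length
...     def __getitem__(self, idx):
...         return idx * idx  # the i-th element
...     def __len__(self):
...         return self._length  # total number of elements
>>> ds = SquaresDataset(4)
>>> [ds[i] for i in range(len(ds))]
[0, 1, 4, 9]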

Methods

filter(fn)

Returns a new dataset with samples filtered by the filter function fn.

sample(sampler)

Returns a new dataset with elements sampled by the sampler.

shard(num_shards, index)

Returns a new dataset that includes only 1/num_shards of this dataset.

take(count)

Returns a new dataset with at most count samples in it.

transform(fn[, lazy])

Returns a new dataset with each sample transformed by the transformer function fn.

transform_first(fn[, lazy])

Returns a new dataset with the first element of each sample transformed by the transformer function fn.

filter(fn)[source]

Returns a new dataset with samples filtered by the filter function fn.

Note that if the Dataset is the result of a lazy transform with transform(lazy=True), the filter is eagerly applied to the transformed samples without materializing the transformed result. That is, the transformation will be applied again whenever a sample is retrieved after filter().

Parameters

fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples that return False are discarded.

Returns

The filtered dataset.

Return type

Dataset
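Examples

A minimal sketch over a small in-memory dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> even = dataset.filter(lambda x: x % 2 == 0)
>>> list(even)
[0, 2, 4, 6, 8]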

sample(sampler)[source]

Returns a new dataset with elements sampled by the sampler.

Parameters

sampler (Sampler) – A Sampler that returns the indices of sampled elements.

Returns

The result dataset.

Return type

Dataset
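Examples

A minimal sketch; the IntervalSampler here yields the indices [0, 2, 4, 1, 3], which sample() uses to reorder the dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([10, 11, 12, 13, 14])
>>> sampled = dataset.sample(gluon.data.IntervalSampler(5, interval=2))
>>> list(sampled)
[10, 12, 14, 11, 13]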

shard(num_shards, index)[source]

Returns a new dataset that includes only 1/num_shards of this dataset.

For distributed training, be sure to shard before you randomize the dataset (for example, with shuffle) if you want each worker to see a unique subset.

Parameters
  • num_shards (int) – An integer representing the number of data shards.

  • index (int) – An integer representing the index of the current shard.

Returns

The result dataset.

Return type

Dataset
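Examples

A minimal sketch; how the remainder is distributed across shards is an implementation detail, but the shards always partition the dataset:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> shards = [dataset.shard(num_shards=3, index=i) for i in range(3)]
>>> sum(len(s) for s in shards)
10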

take(count)[source]

Returns a new dataset with at most count samples in it.

Parameters

count (int or None) – An integer representing the number of elements of this dataset that should be taken to form the new dataset. If count is None, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset.

Returns

The result dataset.

Return type

Dataset
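Examples

A minimal sketch:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset(list(range(10)))
>>> list(dataset.take(3))
[0, 1, 2]
>>> len(dataset.take(None)) == len(dataset)
True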

transform(fn, lazy=True)[source]

Returns a new dataset with each sample transformed by the transformer function fn.

Parameters
  • fn (callable) – A transformer function that takes a sample as input and returns the transformed sample.

  • lazy (bool, default True) – If False, transforms all samples at once. Otherwise, transforms each sample on demand. Note that if fn is stochastic, you must set lazy to True or you will get the same result on all epochs.

Returns

The transformed dataset.

Return type

Dataset
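Examples

A minimal sketch with a deterministic transformer:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([1, 2, 3])
>>> doubled = dataset.transform(lambda x: x * 2)
>>> list(doubled)
[2, 4, 6]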

transform_first(fn, lazy=True)[source]

Returns a new dataset with the first element of each sample transformed by the transformer function fn.

This is mostly applicable when each sample contains two components - features and label, i.e., (X, y), and you only want to transform the first element X (i.e., the features) while keeping the label y unchanged.

Parameters
  • fn (callable) – A transformer function that takes the first element of a sample as input and returns the transformed element.

  • lazy (bool, default True) – If False, transforms all samples at once. Otherwise, transforms each sample on demand. Note that if fn is stochastic, you must set lazy to True or you will get the same result on all epochs.

Returns

The transformed dataset.

Return type

Dataset
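Examples

A minimal sketch; the features are scaled while the labels stay unchanged:

>>> from mxnet import gluon
>>> dataset = gluon.data.ArrayDataset([1, 2, 3], [0, 1, 0])
>>> scaled = dataset.transform_first(lambda x: x * 10)
>>> scaled[1]
(20, 1)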

class FilterSampler(fn, dataset)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from a Dataset for which fn returns True.

Parameters
  • fn (callable) – A function that takes a sample and returns a boolean.

  • dataset (Dataset) – The dataset to filter.
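Examples

A minimal sketch; the sampler yields the indices of the samples for which fn returns True:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([5, 0, 8, 0, 3])
>>> sampler = gluon.data.FilterSampler(lambda x: x > 0, dataset)
>>> list(sampler)
[0, 2, 4]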

class IntervalSampler(length, interval, rollover=True)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [0, length) at fixed intervals.

Parameters
  • length (int) – Length of the sequence.

  • interval (int) – The number of items to skip between two samples.

  • rollover (bool, default True) – Whether to start again from the first skipped item after reaching the end. If True, the sampler starts again from the first skipped item until all items are visited. Otherwise, iteration stops when the end is reached and skipped items are ignored.

Examples

>>> sampler = gluon.data.IntervalSampler(13, interval=3)
>>> list(sampler)
[0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
>>> sampler = gluon.data.IntervalSampler(13, interval=3, rollover=False)
>>> list(sampler)
[0, 3, 6, 9, 12]
class RandomSampler(length)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [0, length) randomly without replacement.

Parameters

length (int) – Length of the sequence.
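Examples

A minimal sketch; the order is random, but the result is always a permutation of [0, length):

>>> from mxnet import gluon
>>> sampler = gluon.data.RandomSampler(5)
>>> sorted(sampler)
[0, 1, 2, 3, 4]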

class RecordFileDataset(filename)[source]

Bases: mxnet.gluon.data.dataset.Dataset

A dataset wrapping over a RecordIO (.rec) file.

Each sample is a string representing the raw content of a record.

Parameters

filename (str) – Path to rec file.
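Examples

A usage sketch; 'data.rec' is a hypothetical file created beforehand, e.g. with MXNet's im2rec tool:

>>> from mxnet import gluon
>>> dataset = gluon.data.RecordFileDataset('data.rec')  # hypothetical path
>>> record = dataset[0]  # raw content of the first record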

class Sampler[source]

Bases: object

Base class for samplers.

All samplers should subclass Sampler and define __iter__ and __len__ methods.
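Examples

An illustrative sketch of a custom sampler (hypothetical) that visits even indices before odd ones:

>>> from mxnet import gluon
>>> class EvenFirstSampler(gluon.data.Sampler):
...     def __init__(self, length):
...         self._length = length
...     def __iter__(self):
...         # Even indices first, then odd indices.
...         indices = list(range(0, self._length, 2)) + list(range(1, self._length, 2))
...         return iter(indices)
...     def __len__(self):
...         return self._length
>>> list(EvenFirstSampler(5))
[0, 2, 4, 1, 3]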

class SequentialSampler(length, start=0)[source]

Bases: mxnet.gluon.data.sampler.Sampler

Samples elements from [start, start+length) sequentially.

Parameters
  • length (int) – Length of the sequence.

  • start (int, default is 0) – The starting index of the sequence.
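Examples

A minimal sketch:

>>> from mxnet import gluon
>>> list(gluon.data.SequentialSampler(4, start=10))
[10, 11, 12, 13]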

class SimpleDataset(data)[source]

Bases: mxnet.gluon.data.dataset.Dataset

Simple Dataset wrapper for lists and arrays.

Parameters

data (dataset-like object) – Any object that implements len() and [].
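Examples

A minimal sketch wrapping a plain Python list of (feature, label) pairs:

>>> from mxnet import gluon
>>> dataset = gluon.data.SimpleDataset([('a', 0), ('b', 1)])
>>> len(dataset)
2
>>> dataset[1]
('b', 1)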