Data Loading API

Overview

This document summarizes the supported data formats and the iterator APIs for reading them, including:

mxnet.io Data iterators for common data formats and utility functions.
mxnet.recordio Read and write for the RecordIO data format.
mxnet.image Image iterators and image augmentation functions.

First, let’s see how to create a data iterator. The following iterator can be used to train a symbol whose input data variable is named data and whose input label variable is named softmax_label. The iterator also provides information about each batch, including the shapes and names.

>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
...                             label={'softmax_label':mx.nd.ones((100,))},
...                             batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),,NCHW]]

Let’s see a complete example of how to use a data iterator in model training.

>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)

A detailed tutorial is available at Iterators - Loading data.

Data iterators

io.NDArrayIter Returns an iterator for mx.nd.NDArray, numpy.ndarray, h5py.Dataset, mx.nd.sparse.CSRNDArray or scipy.sparse.csr_matrix.
io.CSVIter Returns the CSV file iterator.
io.LibSVMIter Returns the LibSVM iterator which returns data with csr storage type.
io.ImageRecordIter Iterates over image RecordIO files.
io.ImageRecordUInt8Iter Iterates over image RecordIO files.
io.MNISTIter Iterates over the MNIST dataset.
recordio.MXRecordIO Reads/writes RecordIO data format, supporting sequential read and write.
recordio.MXIndexedRecordIO Reads/writes RecordIO data format, supporting random access.
image.ImageIter Image data iterator with a large number of augmentation choices.
image.ImageDetIter Image iterator with a large number of augmentation choices for detection.

Helper classes and functions

Data structures and other iterators provided in the mxnet.io package.

io.DataDesc DataDesc is used to store name, shape, type and layout information of the data or the label.
io.DataBatch A data batch.
io.DataIter The base class for an MXNet data iterator.
io.ResizeIter Resize a data iterator to a given number of batches.
io.PrefetchingIter Performs pre-fetch for other data iterators.
io.MXDataIter A Python wrapper for a C++ data iterator.

Functions to read and write RecordIO files.

recordio.pack Pack a string into MXImageRecord.
recordio.unpack Unpack a MXImageRecord to string.
recordio.unpack_img Unpack a MXImageRecord to image.
recordio.pack_img Pack an image into MXImageRecord.

Develop a new iterator

Writing a new data iterator in Python is straightforward. Most MXNet training/inference programs accept an iterable object with provide_data and provide_label properties. This tutorial explains how to write an iterator from scratch.
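
As a rough illustration of that protocol, the sketch below builds a toy iterator in pure Python. Batch, Desc and ListIter are hypothetical stand-ins invented for this example (they are not MXNet classes), and dropping an incomplete final batch is a simplifying assumption rather than documented MXNet behavior.

```python
from collections import namedtuple

# Hypothetical stand-ins for mxnet.io.DataBatch and mxnet.io.DataDesc.
Batch = namedtuple("Batch", ["data", "label"])
Desc = namedtuple("Desc", ["name", "shape"])


class ListIter:
    """Toy iterator that serves fixed-size batches from two Python lists."""

    def __init__(self, values, labels, batch_size):
        self.values = values
        self.labels = labels
        self.batch_size = batch_size
        self.cursor = 0

    @property
    def provide_data(self):
        return [Desc("data", (self.batch_size,))]

    @property
    def provide_label(self):
        return [Desc("softmax_label", (self.batch_size,))]

    def reset(self):
        # Rewind to the first sample so a new epoch can start.
        self.cursor = 0

    def next(self):
        end = self.cursor + self.batch_size
        if end > len(self.values):
            # An incomplete final batch is dropped in this sketch.
            raise StopIteration
        batch = Batch(data=[self.values[self.cursor:end]],
                      label=[self.labels[self.cursor:end]])
        self.cursor = end
        return batch

    # Support "for batch in it" in Python 3.
    def __iter__(self):
        return self

    def __next__(self):
        return self.next()


it = ListIter([1, 2, 3, 4, 5], [0, 1, 0, 1, 0], batch_size=2)
epoch1 = [batch.data[0] for batch in it]
it.reset()  # without this call the second epoch would be empty
epoch2 = [batch.data[0] for batch in it]
print(epoch1)
```

The essential pieces are next() raising StopIteration at the end of an epoch, reset() rewinding the cursor, and the two provide_* properties describing what each batch contains.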

The following example demonstrates how to combine multiple data iterators into a single one. It can be used for multi-modal training such as image captioning, in which images are read by ImageRecordIter while documents are read by CSVIter.

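import mxnet as mx
from mxnet.io import DataBatch

class MultiIter:
    """Combines several data iterators into a single one."""
    def __init__(self, iter_list):
        self.iters = iter_list
    def next(self):
        # Raises StopIteration as soon as any underlying iterator is exhausted.
        batches = [i.next() for i in self.iters]
        # Flatten the per-iterator lists into one list of arrays.
        return DataBatch(data=[x for b in batches for x in b.data],
                         label=[x for b in batches for x in b.label])
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        return [d for i in self.iters for d in i.provide_data]
    @property
    def provide_label(self):
        return [d for i in self.iters for d in i.provide_label]

# Schematic: constructor arguments such as data_shape and batch_size are omitted for brevity.
iter = MultiIter([mx.io.ImageRecordIter('image.rec'), mx.io.CSVIter('txt.csv')])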

Parsing and other pre-processing steps, such as augmentation, can be expensive. If performance is critical, the data iterator can be implemented in C++. Refer to src/io for examples.

Change batch layout

By default, the backend engine treats the first dimension of each data and label variable in data iterators as the batch size (e.g. the NCHW or NT layout). To override the batch axis, the provide_data property (and provide_label, if there are labels) should include the layouts. This is especially useful for RNNs, since TNC layouts are often more efficient. For example:

@property
def provide_data(self):
    return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]

The backend engine will recognize the index of N in the layout as the axis for batch size.
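
The lookup described above can be sketched in a few lines of plain Python. The helpers batch_axis and batch_size are hypothetical names used only for this illustration, and raising an error for a layout without N is an assumption of the sketch.

```python
def batch_axis(layout):
    """Return the index of the batch dimension 'N' in a layout string."""
    if layout is None:
        return 0  # default: the first dimension is the batch
    axis = layout.find("N")
    if axis < 0:
        # Assumption of this sketch: a layout without 'N' is rejected.
        raise ValueError("layout %r has no batch dimension 'N'" % layout)
    return axis


def batch_size(shape, layout=None):
    """Pick the batch size out of a shape, given its layout."""
    return shape[batch_axis(layout)]


seq_length, n = 35, 32
print(batch_axis("NCHW"))                   # 0
print(batch_axis("TNC"))                    # 1
print(batch_size((seq_length, n), "TN"))    # 32
```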

API Reference

Data iterators for common data formats and utility functions.

Read and write for the RecordIO data format.

class mxnet.recordio.MXRecordIO(uri, flag)

Reads/writes RecordIO data format, supporting sequential read and write.

>>> record = mx.recordio.MXRecordIO('tmp.rec', 'w')

>>> for i in range(5):
...    record.write('record_%d'%i)
>>> record.close()
>>> record = mx.recordio.MXRecordIO('tmp.rec', 'r')
>>> for i in range(5):
...    item = record.read()
...    print(item)
record_0
record_1
record_2
record_3
record_4
>>> record.close()
Parameters:
  • uri (string) – Path to the record file.
  • flag (string) – ‘w’ for write or ‘r’ for read.
open()

Opens the record file.

close()

Closes the record file.

reset()

Resets the pointer to the first item.

If the record is opened with ‘w’, this function will truncate the file to empty.

>>> record = mx.recordio.MXRecordIO('tmp.rec', 'r')
>>> for i in range(2):
...    item = record.read()
...    print(item)
record_0
record_1
>>> record.reset()  # Pointer is reset.
>>> print(record.read()) # Started reading from start again.
record_0
>>> record.close()
write(buf)

Inserts a string buffer as a record.

>>> record = mx.recordio.MXRecordIO('tmp.rec', 'w')
>>> for i in range(5):
...    record.write('record_%d'%i)
>>> record.close()
Parameters: buf (string (python2), bytes (python3)) – Buffer to write.
read()

Returns record as a string.

>>> record = mx.recordio.MXRecordIO('tmp.rec', 'r')
>>> for i in range(5):
...    item = record.read()
...    print(item)
record_0
record_1
record_2
record_3
record_4
>>> record.close()
Returns: buf – Buffer read.
Return type: string
class mxnet.recordio.MXIndexedRecordIO(idx_path, uri, flag, key_type=int)

Reads/writes RecordIO data format, supporting random access.

>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')
>>> for i in range(5):
...     record.write_idx(i, 'record_%d'%i)
>>> record.close()
>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'r')
>>> record.read_idx(3)
record_3
Parameters:
  • idx_path (str) – Path to the index file.
  • uri (str) – Path to the record file. Only supports seekable file types.
  • flag (str) – ‘w’ for write or ‘r’ for read.
  • key_type (type) – Data type for keys.
close()

Closes the record file.

seek(idx)

Sets the current read pointer position.

This function is internally called by read_idx(idx) to find the current reader pointer position. It doesn’t return anything.
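
One way to picture how an index enables random access is the in-memory sketch below. ToyIndexedLog is a hypothetical stand-in, not the MXNet implementation: it assumes a simple 4-byte length prefix per record and keeps the key-to-offset index in a dict, whereas the real class stores its index in a separate file.

```python
import io
import struct


class ToyIndexedLog:
    """Length-prefixed records in a byte buffer plus a key -> offset index."""

    def __init__(self):
        self.buf = io.BytesIO()
        self.idx = {}

    def write_idx(self, key, payload):
        # Always append, and remember where this record starts.
        self.buf.seek(0, io.SEEK_END)
        self.idx[key] = self.buf.tell()
        self.buf.write(struct.pack("<I", len(payload)))
        self.buf.write(payload)

    def seek(self, key):
        # Move the read pointer to the start of the requested record.
        self.buf.seek(self.idx[key])

    def read_idx(self, key):
        self.seek(key)
        (length,) = struct.unpack("<I", self.buf.read(4))
        return self.buf.read(length)


log = ToyIndexedLog()
for i in range(5):
    log.write_idx(i, ("record_%d" % i).encode())
third = log.read_idx(3)
first = log.read_idx(0)
print(third, first)
```

Because every key maps directly to a byte offset, records can be fetched in any order without scanning the preceding ones.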

tell()

Returns the current position of the write head.

>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')
>>> print(record.tell())
0
>>> for i in range(5):
...     record.write_idx(i, 'record_%d'%i)
...     print(record.tell())
16
32
48
64
80
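
The offsets printed above are consistent with each record occupying an 8-byte header followed by the payload padded to a multiple of 4 bytes. That framing is an assumption inferred from the output, not something this page states; the helper record_size below is hypothetical and only reproduces the arithmetic.

```python
def record_size(payload_len, header_bytes=8, align=4):
    """Bytes used by one record under the assumed framing."""
    padded = (payload_len + align - 1) // align * align  # round up to 4 bytes
    return header_bytes + padded


offsets = []
pos = 0
for i in range(5):
    payload = ("record_%d" % i).encode()  # 8 bytes each
    pos += record_size(len(payload))
    offsets.append(pos)

print(offsets)  # [16, 32, 48, 64, 80]
```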
read_idx(idx)

Returns the record at the given index.

>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')
>>> for i in range(5):
...     record.write_idx(i, 'record_%d'%i)
>>> record.close()
>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'r')
>>> record.read_idx(3)
record_3
write_idx(idx, buf)

Inserts the input record at the given index.

>>> record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')
>>> for i in range(5):
...     record.write_idx(i, 'record_%d'%i)
>>> record.close()
Parameters:
  • idx (int) – Index of a file.
  • buf – Record to write.
mxnet.recordio.IRHeader

An alias for HEADER. Used to store metadata (e.g. labels) accompanying a record. See mxnet.recordio.pack and mxnet.recordio.pack_img for example uses.

Parameters:
  • flag (int) – Available for convenience, can be set arbitrarily.
  • label (float or an array of float) – Typically used to store label(s) for a record.
  • id (int) – Usually a unique id representing record.
  • id2 (int) – Higher order bits of the unique id, should be set to 0 (in most cases).

mxnet.recordio.pack(header, s)

Pack a string into MXImageRecord.

Parameters:
  • header (IRHeader) – Header of the image record. header.label can be a number or an array. See more detail in IRHeader.
  • s (str) – Raw image string to be packed.
Returns: s – The packed string.
Return type: str

Examples

>>> label = 4 # label can also be a 1-D array, for example: label = [1,2,3]
>>> id = 2574
>>> header = mx.recordio.IRHeader(0, label, id, 0)
>>> with open(path, 'rb') as file:  # read the raw image bytes in binary mode
...     s = file.read()
>>> packed_s = mx.recordio.pack(header, s)
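
To give a feel for what packing means here, the sketch below prepends a fixed-size binary header to a payload and splits it off again. The field names come from IRHeader, but the struct format "IfQQ" is an assumption about the layout, and toy_pack and toy_unpack are hypothetical helpers that handle a scalar label only. They are not the library functions.

```python
import struct
from collections import namedtuple

Header = namedtuple("Header", ["flag", "label", "id", "id2"])

# Assumed layout: uint32 flag, float32 label, uint64 id, uint64 id2.
HEADER_FORMAT = "IfQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def toy_pack(header, payload):
    """Prepend the binary header to the raw payload bytes."""
    return struct.pack(HEADER_FORMAT, *header) + payload


def toy_unpack(buf):
    """Split a packed buffer back into (header, payload)."""
    header = Header(*struct.unpack(HEADER_FORMAT, buf[:HEADER_SIZE]))
    return header, buf[HEADER_SIZE:]


packed = toy_pack(Header(0, 4.0, 2574, 0), b"raw image bytes")
header, payload = toy_unpack(packed)
print(header, payload)
```

Keeping the header at a fixed size is what lets the unpacking side separate metadata from payload without any delimiter.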
mxnet.recordio.unpack(s)

Unpack a MXImageRecord to string.

Parameters: s (str) – String buffer from MXRecordIO.read.
Returns:
  • header (IRHeader) – Header of the image record.
  • s (str) – Unpacked string.

Examples

>>> record = mx.recordio.MXRecordIO('test.rec', 'r')
>>> item = record.read()
>>> header, s = mx.recordio.unpack(item)
>>> header
HEADER(flag=0, label=14.0, id=20129312, id2=0)
mxnet.recordio.unpack_img(s, iscolor=-1)

Unpack a MXImageRecord to image.

Parameters:
  • s (str) – String buffer from MXRecordIO.read.
  • iscolor (int) – Image format option for cv2.imdecode.
Returns:
  • header (IRHeader) – Header of the image record.
  • img (numpy.ndarray) – Unpacked image.

Examples

>>> record = mx.recordio.MXRecordIO('test.rec', 'r')
>>> item = record.read()
>>> header, img = mx.recordio.unpack_img(item)
>>> header
HEADER(flag=0, label=14.0, id=20129312, id2=0)
>>> img
array([[[ 23,  27,  45],
        [ 28,  32,  50],
        ...,
        [ 36,  40,  59],
        [ 35,  39,  58]],
       ...,
       [[ 91,  92, 113],
        [ 97,  98, 119],
        ...,
        [168, 169, 167],
        [166, 167, 165]]], dtype=uint8)
mxnet.recordio.pack_img(header, img, quality=95, img_fmt='.jpg')

Pack an image into MXImageRecord.

Parameters:
  • header (IRHeader) – Header of the image record. header.label can be a number or an array. See more detail in IRHeader.
  • img (numpy.ndarray) – Image to be packed.
  • quality (int) – Quality for JPEG encoding in range 1-100, or compression for PNG encoding in range 1-9.
  • img_fmt (str) – Encoding of the image (.jpg for JPEG, .png for PNG).
Returns: s – The packed string.
Return type: str

Examples

>>> label = 4 # label can also be a 1-D array, for example: label = [1,2,3]
>>> id = 2574
>>> header = mx.recordio.IRHeader(0, label, id, 0)
>>> img = cv2.imread('test.jpg')
>>> packed_s = mx.recordio.pack_img(header, img)