Building Batches

Networks are trained with mini-batches of samples, e.g. a stack of images with their corresponding class labels. BuildBatch(batchsize) is used to build these batches. Note that constructing a batch of the correct format is often tricky, since the required format depends on the network architecture and the deep learning framework, and error messages are often not informative.

We start with an extremely simple toy example. Our data samples are single integer numbers. We build batches of size 2 and print them out:

>>> samples = [[1], [2], [3]]
>>> build_batch = BuildBatch(2).input(0, 'number', int)
>>> samples >> build_batch >> Print() >> Consume()
[array([1, 2])]
[array([3])]

where input(column, format, dtype) specifies which sample column to extract the data from, which format the data is in (e.g. numbers, vectors, images), and which data type to use when creating the NumPy arrays.
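For the 'number' format, building a batch conceptually amounts to stacking the selected sample column into a NumPy array. The following plain-NumPy sketch is purely illustrative (it is not how the library is implemented internally) and reproduces the first batch printed above:

import numpy as np

samples = [[1], [2], [3]]
column, batchsize = 0, 2
first_batch = [np.array([s[column] for s in samples[:batchsize]], dtype=int)]
# first_batch == [array([1, 2])], matching the first batch printed above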

Since the number of samples is not divisible by the batch size of 2, the last batch is shorter. If this is problematic, either ensure that the sample set size is divisible by the batch size or filter out the incomplete batches. Most network libraries, however, allow you to specify one dimension of the input tensor as None and can handle variable batch sizes.
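If you do need fixed-size batches, one option is to drop incomplete batches after BuildBatch. A minimal sketch, assuming nutsflow's Filter nut and checking the length of the first input array in each batch:

from nutsflow import Filter

is_full = Filter(lambda batch: len(batch[0]) == 2)   # keep only full batches of size 2
samples >> build_batch >> is_full >> Print() >> Consume()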

Note

BuildBatch prefetches data and builds the next batch on the CPU while the network processes the current batch on the GPU. This parallelism can cause the pipeline to hang if there is no network to consume the batches. If the code example above does not work for you, use BuildBatch(2, prefetch=0) instead!

Training batches contain inputs and possibly outputs/targets. The general format of a training batch generated by BuildBatch is a list composed of two sublists of NumPy arrays. The first sublist contains the input data and the second sublist contains the output data for the network:

[[<in_ndarray>, ...], [<out_ndarray>, ...]]

In the next example we generate batches with inputs and outputs. Each sample of the (training) data set contains two numbers, the first as input and the second as output (e.g. class label):

>>> samples = [[10,1], [20,2], [30,3]]
>>> build_batch = (BuildBatch(batchsize=2)
...                .input(0, 'number', float)
...                .output(1, 'number', int))
>>> samples >> build_batch >> Print() >> Consume()
[[array([10., 20.])], [array([1, 2])]]
[[array([30.])], [array([3])]]

We build the batch by extracting the number in column 0 as input and converting it to float; the number in sample column 1 becomes the output. input() copies data into the first sublist of the batch and output() copies data into the second. Multiple inputs (e.g. BuildBatch().input(...).input(...)) extend the first sublist and multiple outputs similarly extend the second sublist of the batch.
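Whatever consumes these batches simply indexes into the two sublists. A hypothetical training function (train_step, x and y are placeholder names, not part of the library) might unpack a batch like this:

def train_step(batch):
    inputs, outputs = batch     # [[<in_ndarray>, ...], [<out_ndarray>, ...]]
    x = inputs[0]               # first (and here only) input array, shape (batchsize, ...)
    y = outputs[0]              # first (and here only) output array
    # feed x and y to the network here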

Note that we can easily use the same number as input and output (e.g. to train an autoencoder), use both numbers as input, flip input and output or ignore sample columns when creating batches:

BuildBatch(2).input(0, 'number', int).output(0, 'number', int)  # Autoencoder
BuildBatch(2).input(0, 'number', int).input(1, 'number', int)   # Two inputs
BuildBatch(2).input(1, 'number', int).output(0, 'number', int)  # Flipped columns
BuildBatch(2).input(1, 'number', int)                           # Input only

Sample data can be of different formats such as numbers, vectors, tensors or images. Run help(BuildBatch.input) for an overview of the different formats supported.
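The formats used throughout this section already cover the most common cases. As a quick recap (illustrative only; the dtypes and the class count of 10 are placeholders):

BuildBatch(2).input(0, 'number', float)            # scalar numbers
BuildBatch(2).input(0, 'vector', 'float32')        # 1D feature vectors
BuildBatch(2).input(0, 'image', 'float32')         # images (rows x cols x channels)
BuildBatch(2).output(1, 'one_hot', 'uint8', 10)    # class index -> one-hot vector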

Let us try a slightly more complex example, where our samples are vectors with a class index. We will construct batches of size 2 containing float32 vectors as inputs and one-hot encoded outputs for the class indices:

>>> from numpy import array
>>> N_CLASSES = 2
>>> samples = [(array([1, 2, 3]), 0),
...            (array([4, 5, 6]), 1),
...            (array([7, 8, 9]), 1)]
>>> build_batch = (BuildBatch(batchsize=2)
...                .input(0, 'vector', 'float32')
...                .output(1, 'one_hot', 'uint8', N_CLASSES))
>>> samples >> build_batch >> Print() >> Consume()
[[array([[1., 2., 3.],
         [4., 5., 6.]], dtype=float32)],
 [array([[1, 0],
         [0, 1]], dtype=uint8)]]
[[array([[7., 8., 9.]], dtype=float32)],
 [array([[0, 1]], dtype=uint8)]]

As you can see, the class index is converted into a one-hot encoded vector of length two and the input data is converted to float vectors. For larger data, printing out batches for debugging is not informative. We can use PrintType() to print the shape and data type of the generated NumPy arrays within the batch data structure. The same pipeline, with Print replaced by PrintType, produces much more readable output:

>>> build_batch = (BuildBatch(2, verbose=True)
...                .input(0, 'vector', 'float32')
...                .output(1, 'one_hot', 'uint8', N_CLASSES))
>>> samples >> build_batch >> PrintType() >> Consume()
[[<ndarray> 2x3:float32], [<ndarray> 2x2:uint8]]
[[<ndarray> 1x3:float32], [<ndarray> 1x2:uint8]]

As a last example, let us work with some image data. We create a sample set with only three images, labeled 'good' or 'bad'. We read these images, convert the string labels in sample column 1 to class indices, and build batches with one-hot encoded outputs:

>>> LABELS = ['good', 'bad']
>>> N_CLASSES = len(LABELS)
>>> samples = [('nut_color.gif', 'good'),
...            ('nut_grayscale.gif', 'good'),
...            ('nut_monochrome.gif', 'bad')]
>>> read_image = ReadImage(0, 'tests/data/img_formats/*')
>>> to_rgb = TransformImage(0).by('gray2rgb')
>>> convert_label = ConvertLabel(1, LABELS)
>>> build_batch = (BuildBatch(2)
...                .input(0, 'image', 'float32')
...                .output(1, 'one_hot', 'uint8', N_CLASSES))
>>> samples >> read_image >> to_rgb >> convert_label >> build_batch >> PrintType() >> Consume()
[[<ndarray> 2x213x320x3:float32], [<ndarray> 2x2:uint8]]
[[<ndarray> 1x213x320x3:float32], [<ndarray> 1x2:uint8]]

Note that we are reading a mixture of RGB and grayscale images with differing numbers of (color) channels, which cannot be combined in a batch. We therefore use the gray2rgb transformation to convert the single-channel grayscale images to three-channel images.
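If batching fails because of mismatched shapes, it can help to inspect the samples before they reach BuildBatch. A small debugging sketch that applies PrintType to the loaded and converted samples instead of the batches:

samples >> read_image >> to_rgb >> convert_label >> PrintType() >> Consume()
# prints the shape and dtype of each loaded image, so channel or size
# mismatches become visible before the batch is built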

The input array of the first batch is of shape 2x213x320x3, where the individual dimensions are batchsize x image-rows x image-cols x image-channels. The output array contains two one-hot vectors of length two. Some deep learning frameworks require the channel axis of image data to come first. The image format function of BuildBatch has a flag to add or move a channel axis (for details run help(batcher.build_image_batch)). If we run the same code but with channelfirst=True, the printout of the batch shows the channel axis right after the batch axis and before the image row and column axes:

>>> build_batch = (BuildBatch(2, verbose=True)
...                .input(0, 'image', 'float32', channelfirst=True)
...                .output(1, 'one_hot', 'uint8', N_CLASSES))
>>> samples >> read_image >> to_rgb >> convert_label >> build_batch >> PrintType() >> Consume()
[[<ndarray> 2x3x213x320:float32], [<ndarray> 2x2:uint8]]
[[<ndarray> 1x3x213x320:float32], [<ndarray> 1x2:uint8]]

For more complex scenarios (e.g. 3D input data) have a look at the tensor formatter (help(batcher.build_tensor_batch)), which allows you to construct batches from arbitrary tensors and to reorder axes. To wrap things up, here is the schematic of a typical training pipeline:

train_samples, test_samples = read_samples >> SplitRandom(ratio=0.7)

EPOCHS = 100
for epoch in range(EPOCHS):
    (train_samples >> read_image >> transform >> augment >>
     Shuffle(100) >> build_batch >> network.train() >> Consume())

Note that we shuffle the data after augmentation to ensure that each mini-batch contains a good distribution of different class examples. How to plug in a network for training and inference is the topic of the next section.