Introduction

Deep-learning code is characterized by

  • data pre-processing on CPU and training on GPU

  • mix of common and task-specific pre-processing steps

  • training in epochs

  • mini-batches of training data

  • data transformation such as scaling, cropping and others

  • data augmentation to increase amount of training data

  • check-pointing of network weights

  • logging of training progress

nuts-ml provides generic components, so called nuts that implement these functions and can be arranged in task-specific data processing pipelines. This leads to more readable and robust code that can be modified easily, which is in contrast to the monolithic, unstructured code often found for data pre-processing.

Below an (incomplete) example of a typical pipeline for network training with nuts-ml. The >> operator defines the flow of data through the pipeline

t_loss = (train_samples >> PrintProgress(train_samples) >>
                           augment >> rerange >> Shuffle(100) >>
                           build_batch >> network.train() >> Mean())
print("training loss  :", t_loss)

Canonical pipeline

A canonical pipeline for the processing of image data can be described by the following components

_images/pipeline.gif
  • Reader: sample data stored in CSV files, Pandas tables, databases or other data sources is read,

  • Splitter: samples are split into training, validation and sets, and stratified if necessary,

  • Loader: image data is loaded for each sample when needed,

  • Transformer: images are transformed, e.g. cropped or resized,

  • Augmenter: images are augmented to increase data size by random rotations, flipping, changes to contrast, or others,

  • Batcher: the transformed and augmented images are organized in mini-batches for GPU processing,

  • Network: a neural network is trained and evaluated on the GPU,

  • Logger: the network performance (loss, accuracy, …) is logged or plotted.

Depending on the specific application, the mode (training, testing, evaluation, …) or data type (image, video, text) some of the processing steps will differ, others will remain the same, but in general many components (nuts) can be shared.

Library

nuts-ml is a library that provides common data-processing and machine learning components as nuts. It is based on nuts-flow, which itself is largely based on Python iterators and the itertools library.

_images/architecture.png

nuts-flow wraps iterators and itertool functions into nuts that provide a >> operator, which enables the composition of iterators in pipelines. For instance, a nested itertool expression such as the following

>>> list(islice(ifilter(lambda x: x > 5, range(10)), 3))
[6, 7, 8]

can be flattened and more clearly written with nuts-flow as

>>> Range(10) >> Filter(_ > 5) >> Take(3) >> Collect()
[6, 7, 8]

nuts-ml extends nuts-flow with nuts specifically designed for machine learning and (image) data processing. The following code example shows a nuts-ml pipeline with image transformation, image augmentation, batching, network training, evaluation and check-pointing.

Example

rerange = TransformImage(0).by('rerange', 0, 255, 0, 1, 'float32')
build_batch = (BuildBatch(BATCH_SIZE)
               .input(0, 'image', 'float32')
               .output(1, 'one_hot', 'uint8', NUM_CLASSES))
p = 0.1
augment = (AugmentImage(0)
           .by('identical', 1.0)
           .by('brightness', p, [0.7, 1.3])
           .by('color', p, [0.7, 1.3])
           .by('shear', p, [0, 0.1])
           .by('fliplr', p)
           .by('rotate', p, [-10, 10]))

network = create_network()
train_samples, test_samples = load_samples()

for epoch in range(NUM_EPOCHS):
    t_loss, t_acc = (train_samples >>
                     PrintProgress(train_samples) >>
                     augment >>
                     rerange >>
                     Shuffle(100) >>
                     build_batch >>
                     network.train() >>
                     Unzip())
    print("train loss :", t_loss >> Mean())
    print("train acc  :", 100 * (t_acc >> Mean()))

e_acc = test_samples >> rerange >> build_batch >> network.evaluate([categorical_accuracy])
print("test acc   :", 100 * e_acc)

The complete code and more examples can be found under nutsml/examples . See the tutorial for a detailed explanation of the code.