nuts-ml does not have a specific type for data samples but most functions operate on `tuples`, `lists` or `numpy arrays`, with a preference for tuples. For instance, a sample of the Iris data set could be represented as the tuple `(4.9, 3.1, 1.5, 0.2, 'Iris-setosa')` and the entire data set as a list of tuples.

## Basics¶

nuts-ml is designed to read data in an iterative fashion to allow the processing of arbitrarily large data sets. It largely relies on nuts-flow, which is documented here. In the following a short introduction of the basic principles.

We start by importing nuts-flow and nuts-ml

```>>> from nutsflow import *
>>> from nutsml import *
```

and create a tiny, in-memory example data set:

```>>> data = [(1,'odd'), (2, 'even'), (3, 'odd')]
```

Data pipelines in nuts-ml require a sink that pulls the data. The two most common ones are Consume and Collect but there are many others.

`Consume()` consumes all data and returns nothing, while `Collect()` collects all data in a list. As an example we take the first two samples of the data set. Without a sink the pipeline is not processing anything at all and only the generator (stemming from `Take()`) is returned.

```>>> data >> Take(2)
itertools.islice at 0xbf160e8>
```

Adding a `Collect()` results in the processing of the data and gives us what we want:

```>>> data >> Take(2) >> Collect()
[(1, 'odd'), (2, 'even')]
```

The same pipeline using `Consume()` returns nothing

```>>> data >> Take(2) >> Consume()
```

but we can verify that samples are processed by inserting a Print nut:

```>>> data >> Print() >> Take(2) >> Consume()
(1, 'odd')
(2, 'even')
```

A broken pipeline or a pipeline without sink is a common problem that can be debugged easily by inserting `Print()` functions. Two other very common and important functions are Filter and Map.

As the name indicates, `Filter` is used to filter samples based on a provided boolean function:

```>>> data >> Filter(lambda s: s == 'odd') >> Print() >> Consume()
(1, 'odd')
(3, 'odd')
```

or maybe more clearly with additional printing

```>>> def is_odd(sample):
...     return sample == 'odd'
>>> data >> Print('before: {},{}') >> Filter(is_odd) >> Print('after : {},{}') >> Consume()
before: 1,odd
after : 1,odd
before: 2,even
before: 3,odd
after : 3,odd
```

`Map` applies a function to the samples of a data set, e.g.

```>>> def add_two(sample):
...     number, label = sample
...     return number + 2, label
>>> data >> Map(add_two) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]
```

There is a convenience nut MapCol that maps a function to a specific column (or columns) of a sample. This allows us to write more succinctly

```>>> add_two = lambda number: number + 2
>>> data >> MapCol(0, add_two) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]
```

For simple expressions a Scala like syntax can be used to further shorten the code:

```>>> data >> MapCol(0, _ + 2) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]
```

Let’s combine what we learned and construct a pipeline that extracts the first number in the data set that is even and converts the labels to upper case.

```>>> to_upper = lambda label: label.upper()
>>> is_even = lambda number: number % 2 == 0
>>> first_even = (data >> FilterCol(0, is_even) >>
... MapCol(1, to_upper) >> Take(1) >> Collect())
[(2, 'EVEN')]
```

Here we used FilterCol instead of `Filter` to filter for the contents in column `0` (the numbers) of the samples. Note that we wrap the pipeline into brackets allowing it to run over multiple lines. Alternatively, we could refactor the code as follows to shorten the pipeline:

```>>> to_upper = MapCol(1, lambda label: label.upper())
>>> is_even = FilterCol(0, lambda number: number % 2 == 0)
>>> first_even = data >> is_even >> to_upper >> Head(1)
[(2, 'EVEN')]
```

This concludes the basics. In the following examples we will read data sets in different formats from the file system and the web.

## TXT files¶

Let us start with reading data from a simple text file. Here a tiny example file `tests/data/and.txt`

```x1,x2,y
0,0,no
0,1,no
1,0,no
1,1,yes
```

We can loads the file content with Python’s `open` function that returns an iterator over the lines and collect them in a `list`

```>>> from nutsflow import *
>>> open('tests/data/and.txt') >> Collect()
['x1,x2,y\n', '0,0,no\n', '0,1,no\n', '1,0,no\n', '1,1,yes']
```

Of course, `open('tests/data/and.txt').readlines()` would have achieved the same. However, samples as strings are not very useful. We would like samples to be represented as tuples or lists containing column values. First, we therefore define a nut function that strips white spaces from lines and splits a line into its components:

```>>> split = Map(lambda line : line.strip().split(','))
```

This as a `Map` because it will be applied to each line of the file. Let us try it out by reading the header of the file

```>>> lines = open('tests/data/and.txt')
>>> lines >> split >> Head(1)
[['x1', 'x2', 'y']]
```

where `Head(n)` is a sink that collects the first `n` lines in a list (here only one line). As expected, we get the header with the column names. Since `open` returns an iterator `lines` is ready to deliver the remaining lines of the file. For instance, we could now write

```>>> lines >> split >> Print() >> Consume()
['0', '0', 'no']
['0', '1', 'no']
['1', '0', 'no']
['1', '1', 'yes']
```

which prints out the samples following the header. Note that `Consume` does not collect the samples - it just consumes them and returns nothing. Good for debugging but not suitable for further processing. We therefore rerun the code and collect the samples in a list. But careful! The `lines` iterator has been consumed. We have to reopen the file to restart the iterator:

```>>> lines = open('tests/data/and.txt')
>>> lines >> Drop(1) >> split >> Collect()
[['0', '0', 'no'], ['0', '1', 'no'], ['1', '0', 'no'], ['1', '1', 'yes']]
```

We use `Collect` to collect the samples and `Drop(1)` means that we skip the header line when reading the file.

Next we need to convert the strings containing numbers to actual numbers. `MapCol` can be used to map Python’s `int` function on specific columns of the samples; here columns `0` and `1` of the samples contain integers:

```>>> lines = open('tests/data/and.txt')
>>> to_int = MapCol((0, 1), int)
>>> samples = lines >> skip_header >> split >> to_int >> Collect()
>>> print(samples)
[(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]
```

Of course we had to reload `lines` again and just for readability gave the `Drop(1)` function a meaningful name (`skip_header`). We end up with a nice pipeline that lazily processes individual lines, is modular and easy to understand: `lines >> skip_header >> split >> to_int >> Collect()` The equivalent Python code without using nuts-flow/ml or `itertools` would be

```def split(line):
return line.strip().split(',')

def to_int(sample):
x1, x2, label = sample
return [int(x1), int(x2), label]

lines = open('tests/data/and.txt')
next(lines)
samples = [to_int(split(line)) for line in lines]
```

If you prefer Python functions but still want to use pipelining, the functions can be converted into nuts:

```@nut_function
def Split(line):
return line.strip().split(',')

@nut_function
def ToInt(sample):
x1, x2, label = sample
return [int(x1), int(x2), label]

lines = open('tests/data/and.txt')
samples = lines >> Drop(1) >> Split() >> ToInt() >> Collect()
```

As a final example, we will convert the class labels that are currently strings to integer numbers – usually needed for training a machine learning classifier. We could define the following nut and add it to the pipeline:

```>>> label2int = MapCol(2, lambda label: 1 if label=='yes' else 0)
>>> open('tests/data/and.txt') >> skip_header >> split >> to_int >> label2int >> Collect()
[(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```

However, nutsml already has the ConvertLabel nut and we can simply write instead:

```>>> labels = ['no', 'yes']
>>> convert = ConvertLabel(2, labels)
>>> samples = (open('tests/data/and.txt') >> skip_header >> split >> to_int >>
... convert >> Print() >> Collect())
(0, 0, 0)
(0, 1, 0)
(1, 0, 0)
(1, 1, 1)
```

Using ConvertLabel has the additional advantage that the conversion back from integers to strings is trivial:

```>>> samples >> convert >> Print() >> Consume()
(0, 0, 'no')
(0, 1, 'no')
(1, 0, 'no')
(1, 1, 'yes')
```

ConvertLabel(column, labels) takes as parameter the column in a sample that contains the class label (here column 2) and a list of labels. If the class label is a strings it converts to an integer and vice versa. ConvertLabel can also convert to one-hot-encoded vectors and back:

```>>> convert = ConvertLabel(2, labels, onehot=True)
>>> samples = (open('tests/data/and.txt') >> skip_header >> split >> to_int >>
... convert >> Print() >> Collect())
(0, 0, [1, 0])
(0, 1, [1, 0])
(1, 0, [1, 0])
(1, 1, [0, 1])
```
```>>> samples >> convert >> Print() >> Consume()
(0, 0, 'no')
(0, 1, 'no')
(1, 0, 'no')
(1, 1, 'yes')
```

## CSV files¶

You will have noticed that the `tests/data/and.txt` file used above is actually a file in CSV (Comma Separated Values) format. Reading of CSV files is so common that Python has a dedicated CSV library for it. Similarily, nuts-flow provides a ReadCSV nut, and nuts-ml has the even more powerful ReadPandas nut. For instance, we could write

```>>> filepath = 'tests/data/and.csv'
>>>    samples = reader >> Collect()
>>> print(samples)
[(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]
```

which also properly closes the data file – a detail we have neglected before. The code becomes even simpler with the `ReadPandas` nut but note that this nut reads all data in memory:

```>>> from nutsml import ReadPandas
>>> samples = ReadPandas('tests/data/and.csv') >> Collect()
>>> print(samples)
[(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]
```

The advantage is that it drops the header, splits the lines and converts numbers to integers automatically. `ReadPandas` furthermore can read TSV (Tab Separated Values) files and other format. Finally, `ReadPandas` can easily extract or reorder columns or filter rows:

```>>> columns = ['y', 'x1']
>>> ReadPandas('tests/data/and.csv', columns=columns) >> Print() >> Consume()
('no', 0)
('no', 0)
('no', 1)
('yes', 1)
```
```>>> rows = 'x1>0'
>>> ReadPandas('tests/data/and.csv', rows, columns) >> Print() >> Consume()
('no', 1)
('yes', 1)
```

## NumPy arrays¶

To use NumPy arrays as data sources we need to wrap them into an iterator. In the following example we create an identity matrix, iterate over the rows, and print them:

```>>> import numpy as np
>>> data = np.eye(4)
>>> iter(data) >> Print() >> Consume()
[1.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 0.0]
[0.0, 0.0, 1.0, 0.0]
[0.0, 0.0, 0.0, 1.0]
```

Note that NumPy arrays larger than memory can be loaded and then processed with np.load(filename, mmap_mode=’r’).

## Web files¶

nuts-flow/ml allows us to download and process data files from the web on the fly. Alternatively you can download the file and the process its content as described above. In the following example, however, we download and process the Iris data set line by line. First, we open the URL to the data set located on the UCI machine learning server:

```>>> import urllib
>>> url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
>>> lines = urllib.request.urlopen
```

We now can inspect the first two lines of the data set:

```>>> lines(url) >> Head(2)
[b'5.1,3.5,1.4,0.2,Iris-setosa\n',
b'4.9,3.0,1.4,0.2,Iris-setosa\n']
```

Here, `lines` is just a renaming of the `urllib.request.urlopen` function and `Head(2)` collects the first two lines. You will notice that (since Py3K) the lines are in binary (b) format. The following code convert lines to strings, strips the the newline, and splits at comma to give as samples with columns:

```>>> to_columns = Map(lambda l: l.decode('utf-8').strip().split(','))
>>> lines(url) >> to_columns >> Head(2)
[['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']]
```

The four numeric features in columns 0 to 3 of the samples are still strings but we want floats. Mapping the `float` function on those columns will do it:

```>>> to_float = MapCol((0,1,2,3), float)
>>> lines(url) >> to_columns >> to_float >> Head(2)
[(5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
(4.9, 3.0, 1.4, 0.2, 'Iris-setosa')]
```

Finally, we are going to replace the class labels (e.g. `'Iris-setosa'`) by numeric class indices. We could look the the names of the classes up, but being lazy we extract them directly via

```>>> skip_empty = Filter(lambda cols: len(cols) == 5)
>>> labels = lines(url) >> to_columns >> skip_empty >> Get(4) >> Dedupe() >> Collect()
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

where `Get(4)` gets the elements in the fourth column of the sample and `Dedupe()` removes all duplicate labels. We need `skip_empty`, since the data set contains an empty line at the end.

We now can use the extracted `labels` and the `ConvertLabel` nut to convert the class labels in column 4 from strings to class indices. For showcasing, we download the entire data set put print only every 20-th sample.

```>>> (lines(url) >> to_columns >> skip_empty >> to_float >>
...  ConvertLabel(4, labels) >> Print(every_n=20) >> Consume())
(5.1, 3.8, 1.5, 0.3, 0)
(5.1, 3.4, 1.5, 0.2, 0)
(5.2, 2.7, 3.9, 1.4, 1)
(5.7, 2.6, 3.5, 1.0, 1)
(5.7, 2.8, 4.1, 1.3, 1)
(6.0, 2.2, 5.0, 1.5, 2)
(6.9, 3.1, 5.4, 2.1, 2)
```

## Label directories¶

A common method to organize data and assign labels to large data objects such as text files, audio recordings or images is to create directories with labels as names and to store the data objects in the corresponding directories.

For an example let us assume two classes (`0` and `1`) and three text files that are arranged in the following file structure

```- books
- 0
- text0.txt
- 1
- text1.txt
- text11.txt
```

nuts-ml supports the reading of such file structures via the ReadLabelDirs() nut. The following code demonstrates its usage:

```>>> samples = ReadLabelDirs('books', '*.txt')
>>> samples >> Take(3) >> Print() >> Consume()
('books/0/text0.txt', '0')
('books/1/text1.txt', '1')
('books/1/text11.txt', '1')
```

Note that this code does not load the actual text data but the file paths only. However, we could easily implement a `Process` nut that loads and processes the text files individually without loading all texts in memory at once. For instance, converting text files into word count dictionaries.

```@nut_function
def Process(sample):
filepath, label = sample
with open(filepath) as f:
counts = f.read().split(' ') >> CountValues()
return counts, labels