Reading data samples¶
nuts-ml does not have a specific type for data samples but most functions operate on tuples, lists or numpy arrays, with a preference for tuples. For instance, a sample of the Iris data set could be represented as the tuple (4.9, 3.1, 1.5, 0.2, 'Iris-setosa') and the entire data set as a list of tuples.
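For illustration, a minimal sketch of such a data set as a list of sample tuples (the second tuple contains illustrative values):

>>> sample = (4.9, 3.1, 1.5, 0.2, 'Iris-setosa')
>>> data = [sample, (7.0, 3.2, 4.7, 1.4, 'Iris-versicolor')]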
Basics¶
nuts-ml is designed to read data in an iterative fashion to allow the processing of arbitrarily large data sets. It largely relies on nuts-flow, which has its own documentation. The following is a short introduction to the basic principles.
We start by importing nuts-ml
>>> from nutsml import *
and create a tiny, in-memory example data set:
>>> data = [(1,'odd'), (2, 'even'), (3, 'odd')]
Data pipelines in nuts-ml require a sink that pulls the data. The two most common sinks are Consume and Collect, but there are many others. Consume() consumes all data and returns nothing, while Collect() collects all data in a list.
As an example we take the first two samples of the data set. Without a sink the pipeline does not process anything at all and only the generator (stemming from Take()) is returned:
>>> data >> Take(2)
<itertools.islice at 0xbf160e8>
Adding a Collect() results in the processing of the data and gives us what we want:
>>> data >> Take(2) >> Collect()
[(1, 'odd'), (2, 'even')]
The same pipeline using Consume() returns nothing:
>>> data >> Take(2) >> Consume()
but we can verify that samples are processed by inserting a Print nut:
>>> data >> Print() >> Take(2) >> Consume()
(1, 'odd')
(2, 'even')
A broken pipeline or a pipeline without a sink is a common problem that can be debugged by inserting Print() nuts. Also very useful is PrintType, which prints type information in addition to data values:
>>> data >> PrintType() >> Take(2) >> Consume()
(<int> 1, <str> odd)
(<int> 2, <str> even)
Two other commonly used functions are Filter and Map. As the name indicates, Filter is used to filter samples based on a provided boolean function:
>>> data >> Filter(lambda s: s[1] == 'odd') >> Print() >> Consume()
(1, 'odd')
(3, 'odd')
or maybe more clearly with additional printing:
>>> def is_odd(sample):
...     return sample[1] == 'odd'
>>> data >> Print('before: {},{}') >> Filter(is_odd) >> Print('after : {},{}') >> Consume()
before: 1,odd
after : 1,odd
before: 2,even
before: 3,odd
after : 3,odd
Map applies a function to the individual samples of a data set, e.g.
>>> def add_two(sample):
...     number, label = sample
...     return number + 2, label
>>> data >> Map(add_two) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]
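The same transformation can also be written inline with a lambda:

>>> data >> Map(lambda s: (s[0] + 2, s[1])) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]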
There is a convenience nut MapCol that maps a function to a specific column (or columns) of a sample. This allows us to write more succinctly:
>>> add_two = lambda number: number + 2
>>> data >> MapCol(0, add_two) >> Collect()
[(3, 'odd'), (4, 'even'), (5, 'odd')]
Let’s combine what we have learned and construct a pipeline that extracts the first number in the data set that is even and converts the labels to upper case.
>>> to_upper = lambda label: label.upper()
>>> is_even = lambda number: number % 2 == 0
>>> first_even = (data >> FilterCol(0, is_even) >>
...               MapCol(1, to_upper) >> Take(1) >> Collect())
>>> first_even
[(2, 'EVEN')]
Here we used FilterCol instead of Filter to filter on the contents of column 0 (the numbers) of the samples. Note that we wrap the pipeline in parentheses, allowing it to run over multiple lines.
Alternatively, we could refactor the code as follows to shorten the pipeline:
>>> to_upper = MapCol(1, lambda label: label.upper())
>>> is_even = FilterCol(0, lambda number: number % 2 == 0)
>>> first_even = data >> is_even >> to_upper >> Head(1)
>>> first_even
[(2, 'EVEN')]
Here Head(1) is a sink that collects the first sample in a list, combining Take(1) and Collect(). This concludes the basics. In the following examples we will read data sets in different formats from the file system and the web.
TXT files¶
Let us start with reading data from a simple text file. Here is a tiny example file tests/data/and.txt:
x1,x2,y
0,0,no
0,1,no
1,0,no
1,1,yes
We can load the file content with Python’s open function, which returns an iterator over the lines, and collect the lines in a list:
>>> open('tests/data/and.txt') >> Collect()
['x1,x2,y\n', '0,0,no\n', '0,1,no\n', '1,0,no\n', '1,1,yes']
Of course, open('tests/data/and.txt').readlines() would have achieved the same.
However, samples as strings are not very useful. We would like samples to be represented as tuples or lists containing column values. We therefore first define a nut function that strips whitespace from a line and splits it into its components:
>>> split = Map(lambda line : line.strip().split(','))
This is a Map because it will be applied to each line of the file. Let us try it out by reading the header of the file:
>>> lines = open('tests/data/and.txt')
>>> lines >> split >> Head(1)
[['x1', 'x2', 'y']]
where Head(n) is a sink that collects the first n lines in a list (here only one line). As expected, we get the header with the column names.
Since open returns an iterator, lines is ready to deliver the remaining lines of the file. For instance, we could now write
>>> lines >> split >> Print() >> Consume()
['0', '0', 'no']
['0', '1', 'no']
['1', '0', 'no']
['1', '1', 'yes']
which prints out the samples following the header.
Note that Consume does not collect the samples - it just consumes them and returns nothing. Good for debugging but not suitable for further processing.
We therefore rerun the code and collect the samples in a list. But careful! The lines iterator has been consumed, and we have to reopen the file to restart the iterator:
>>> lines = open('tests/data/and.txt')
>>> lines >> Drop(1) >> split >> Collect()
[['0', '0', 'no'], ['0', '1', 'no'], ['1', '0', 'no'], ['1', '1', 'yes']]
We use Collect to collect the samples, while Drop(1) skips the header line when reading the file.
Next we need to convert the strings containing numbers to actual numbers. MapCol can be used to map Python’s int function on specific columns of the samples; here columns 0 and 1 of the samples contain integers:
>>> lines = open('tests/data/and.txt')
>>> to_int = MapCol((0, 1), int)
>>> skip_header = Drop(1)
>>> samples = lines >> skip_header >> split >> to_int >> Collect()
>>> print(samples)
[(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]
Of course, we had to reload lines again, and just for readability we gave the Drop(1) function a meaningful name (skip_header). Note that MapCol returns the samples as tuples, which is why the output now contains tuples instead of lists. We end up with a short pipeline that lazily processes individual lines, is modular and easy to understand: lines >> skip_header >> split >> to_int >> Collect()
The equivalent Python code without using nuts-flow/ml or itertools would be

def split(line):
    return line.strip().split(',')

def to_int(sample):
    x1, x2, label = sample
    return int(x1), int(x2), label

lines = open('tests/data/and.txt')
next(lines)  # skip the header line
samples = [to_int(split(line)) for line in lines]
If you prefer Python functions but still want to use pipelining, the Python functions can be converted into nuts and then piped together as before:

@nut_function
def Split(line):
    return line.strip().split(',')

@nut_function
def ToInt(sample):
    x1, x2, label = sample
    return int(x1), int(x2), label

lines = open('tests/data/and.txt')
samples = lines >> Drop(1) >> Split() >> ToInt() >> Collect()
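As before, samples contains [(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')].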
As a final example, we will convert the class labels, which are currently strings, to integer numbers, as usually needed for training a machine learning classifier. We could define the following nut and add it to the pipeline:
>>> label2int = MapCol(2, lambda label: 1 if label=='yes' else 0)
>>> open('tests/data/and.txt') >> skip_header >> split >> to_int >> label2int >> Collect()
[(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
However, nutsml already has the ConvertLabel nut and we can simply write instead:
>>> labels = ['no', 'yes']
>>> convert = ConvertLabel(2, labels)
>>> samples = (open('tests/data/and.txt') >> skip_header >> split >> to_int >>
... convert >> Print() >> Collect())
(0, 0, 0)
(0, 1, 0)
(1, 0, 0)
(1, 1, 1)
Using ConvertLabel has the additional advantage that the conversion back from integers to strings is trivial:
>>> samples >> convert >> Print() >> Consume()
(0, 0, 'no')
(0, 1, 'no')
(1, 0, 'no')
(1, 1, 'yes')
ConvertLabel(column, labels) takes as parameters the column in a sample that contains the class label (here column 2) and a list of labels. If the class label is a string, it is converted to an integer and vice versa. ConvertLabel can also convert to one-hot encoded vectors and back:
>>> convert = ConvertLabel(2, labels, onehot=True)
>>> samples = (open('tests/data/and.txt') >> skip_header >> split >> to_int >>
... convert >> Print() >> Collect())
(0, 0, [1, 0])
(0, 1, [1, 0])
(1, 0, [1, 0])
(1, 1, [0, 1])
>>> samples >> convert >> Print() >> Consume()
(0, 0, 'no')
(0, 1, 'no')
(1, 0, 'no')
(1, 1, 'yes')
CSV files¶
You will have noticed that the file tests/data/and.txt used above is actually a text file in CSV (Comma Separated Values) format.
x1,x2,y
0,0,no
0,1,no
1,0,no
1,1,yes
Reading of CSV files is so common that Python has a dedicated CSV library for it. Similarly, nuts-ml provides ReadCSV, ReadNamedCSV and ReadPandas to read CSV and similar file formats directly.
For instance, we could use ReadCSV() to read the file contents as follows:
>>> filepath = 'tests/data/and.csv'
>>> with ReadCSV(filepath, skipheader=1, fmtfunc=(int,int,str)) as reader:
...     samples = reader >> Collect()
>>> print(samples)
[(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]
This code also properly closes the data file, a detail we have neglected before. Note that we skip the header (skipheader=1) and convert the strings in the file to integers for the first two columns (fmtfunc=(int,int,str)).
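For comparison, the same result could be achieved with Python’s standard csv module alone; a minimal sketch:

import csv

with open('tests/data/and.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    samples = [(int(x1), int(x2), y) for x1, x2, y in reader]
# samples == [(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'no'), (1, 1, 'yes')]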
Provided a CSV file has a header, we could also use ReadNamedCSV(), which returns the more informative named tuples instead of plain tuples:
>>> with ReadNamedCSV(filepath, fmtfunc=(int,int,str)) as reader:
...     reader >> Print() >> Consume()
Row(x1=0, x2=0, y='no')
Row(x1=0, x2=1, y='no')
Row(x1=1, x2=0, y='no')
Row(x1=1, x2=1, y='yes')
The code becomes even simpler with ReadPandas, which picks good data types for the columns automatically. Note, however, that this nut is not lazy and reads all data into memory:
>>> from nutsml import ReadPandas
>>> ReadPandas(filepath) >> Print() >> Consume()
Row(x1=0, x2=0, y='no')
Row(x1=0, x2=1, y='no')
Row(x1=1, x2=0, y='no')
Row(x1=1, x2=1, y='yes')
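Since the rows are named tuples, their columns can also be accessed by name, e.g.

>>> rows = ReadPandas(filepath) >> Collect()
>>> rows[0].x1, rows[0].y
(0, 'no')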
ReadPandas furthermore can read TSV (Tab Separated Values) files and other similar formats, and can easily extract or reorder columns or filter rows:
>>> colnames = ['y', 'x1']
>>> ReadPandas(filepath, colnames=colnames) >> Print() >> Consume()
Row(y='no', x1=0)
Row(y='no', x1=0)
Row(y='no', x1=1)
Row(y='yes', x1=1)
>>> rows = 'y == "no"'
>>> ReadPandas(filepath, rows, colnames) >> Print() >> Consume()
Row(y='no', x1=0)
Row(y='no', x1=0)
Row(y='no', x1=1)
NumPy arrays¶
To use NumPy arrays as data sources we need to wrap them into an iterator. In the following example we create an identity matrix, iterate over the rows, and print them:
>>> import numpy as np
>>> data = np.eye(4)
>>> iter(data) >> Print() >> Consume()
[1.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 0.0]
[0.0, 0.0, 1.0, 0.0]
[0.0, 0.0, 0.0, 1.0]
Note that NumPy arrays larger than memory can be loaded with np.load(filename, mmap_mode='r') and then processed in the same fashion.
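A minimal sketch of this approach (the file name big.npy is just an example):

>>> np.save('big.npy', np.eye(1000))          # write an array to disk
>>> data = np.load('big.npy', mmap_mode='r')  # memory-mapped, not loaded into RAM
>>> iter(data) >> Take(2) >> Print() >> Consume()  # process rows lazily as before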
Web files¶
nuts-ml allows us to download and process data files from the web on the fly. Alternatively, you can download a file first and then process its contents as described above. In the following example, however, we download and process the Iris data set line by line. First, we open the URL of the data set located on the UCI machine learning server:
>>> import urllib.request
>>> url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
>>> lines = urllib.request.urlopen
We now can inspect the first two lines of the data set:
>>> lines(url) >> Head(2)
[b'5.1,3.5,1.4,0.2,Iris-setosa\n',
b'4.9,3.0,1.4,0.2,Iris-setosa\n']
Here, lines is just a renaming of the urllib.request.urlopen function, and Head(2) collects the first two lines. You will notice that the lines are in binary (b) format. The following code converts the lines to strings, strips the newline, and splits at commas to give us samples with columns:
>>> to_columns = Map(lambda l: l.decode('utf-8').strip().split(','))
>>> lines(url) >> to_columns >> Head(2)
[['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']]
The four numeric features in columns 0 to 3 of the samples are still strings but we want floats. Mapping the float function on those columns will do it:
>>> to_float = MapCol((0,1,2,3), float)
>>> lines(url) >> to_columns >> to_float >> Head(2)
[(5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
(4.9, 3.0, 1.4, 0.2, 'Iris-setosa')]
Finally, we are going to replace the class labels (e.g. 'Iris-setosa') by numeric class indices. We could look up the names of the classes, but being lazy we extract them directly via
>>> skip_empty = Filter(lambda cols: len(cols) == 5)
>>> labels = lines(url) >> to_columns >> skip_empty >> Get(4) >> Dedupe() >> Collect()
>>> labels
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
where Get(4) extracts the element in column 4 (the label) of each sample and Dedupe() removes all duplicate labels. We need skip_empty, since the data set contains an empty line at the end.
We now can use the extracted labels and the ConvertLabel nut to convert the class labels in column 4 from strings to class indices. For showcasing, we download the entire data set but print only every 20th sample.
>>> (lines(url) >> to_columns >> skip_empty >> to_float >>
... ConvertLabel(4, labels) >> Print(every_n=20) >> Consume())
(5.1, 3.8, 1.5, 0.3, 0)
(5.1, 3.4, 1.5, 0.2, 0)
(5.2, 2.7, 3.9, 1.4, 1)
(5.7, 2.6, 3.5, 1.0, 1)
(5.7, 2.8, 4.1, 1.3, 1)
(6.0, 2.2, 5.0, 1.5, 2)
(6.9, 3.1, 5.4, 2.1, 2)
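For training a classifier one typically separates features and labels; a minimal sketch, assuming NumPy (the Iris data set has 150 samples):

>>> import numpy as np
>>> samples = (lines(url) >> to_columns >> skip_empty >> to_float >>
...            ConvertLabel(4, labels) >> Collect())
>>> X = np.array([s[:4] for s in samples])
>>> y = np.array([s[4] for s in samples])
>>> X.shape, y.shape
((150, 4), (150,))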
Label directories¶
A common method to organize data and assign labels to large data objects such as text files, audio recordings or images is to create directories with labels as names and to store the data objects in the corresponding directories.
For an example, let us assume two classes (0 and 1) and three text files that are arranged in the following file structure:
- books
  - 0
    - text0.txt
  - 1
    - text1.txt
    - text11.txt
nuts-ml supports the reading of such file structures via ReadLabelDirs(). The following code demonstrates its usage:
>>> samples = ReadLabelDirs('books', '*.txt')
>>> samples >> Take(3) >> Print() >> Consume()
('books/0/text0.txt', '0')
('books/1/text1.txt', '1')
('books/1/text11.txt', '1')
Note that this code does not load the actual text data but only the file paths. However, we could easily implement a Process nut that loads and processes the text files individually, without loading all texts in memory at once. For instance, the following nut converts text files into word count dictionaries:
@nut_function
def Process(sample):
    filepath, label = sample
    with open(filepath) as f:
        counts = f.read().split(' ') >> CountValues()
    return counts, label

samples = ReadLabelDirs('books', '*.txt')
word_counts = samples >> Process() >> Collect()
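Here CountValues() is a nuts-flow sink that counts the occurrences of values, so each text file is reduced to a dictionary of word counts and word_counts becomes a list of (dictionary, label) tuples, one per file.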