Data transformation

Element-wise transformation of data is, together with filtering, at the very core of data flows, and nuts-flow provides various nuts for this purpose.

Elementwise transformations

The most common transformation is mapping a function over the elements of a flow.

Map

The Map(func) nut takes a function and applies it to each element of the input iterable. See the following examples:

>>> from nutsflow import *
>>> from nutsflow import _
>>> Range(5) >> Map(lambda x: x * x) >> Collect()
[0, 1, 4, 9, 16]
>>> Range(5) >> Map(_ * 2) >> Collect()
[0, 2, 4, 6, 8]
>>> Range(5) >> Map(_ > 2) >> Collect()
[False, False, False, True, True]
>>> Range(5) >> Map(str) >> Collect()
['0', '1', '2', '3', '4']

Note that Map can transform elements of the flow in arbitrary ways but cannot change the number of elements in the flow.
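For readers more familiar with plain Python, the examples above behave much like the built-in map function. The following sketch uses only the standard library (no nuts-flow) and checks that the number of elements is preserved; the variable names are chosen for illustration only.

```python
# Plain-Python analogue of Range(5) >> Map(...) >> Collect()
squares = list(map(lambda x: x * x, range(5)))
as_text = list(map(str, range(5)))

# A mapping changes the elements, never how many there are.
same_length = len(squares) == len(range(5))
```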

MapMulti

Occasionally, it is necessary to apply different, independent mappings to the same data. One way is to process the data for each mapping individually, for example:

>>> times2 = Range(5) >> Map(_ * 2) >> Collect()
>>> greater3 = Range(5) >> Map(_ > 3) >> Collect()

However, if generating or reading the input data is computationally expensive, it is more efficient to use MapMulti and avoid reading the input more than once.

>>> times2, greater3 = Range(5) >> MapMulti(_ * 2, _ > 3)
>>> times2 >> Collect()
[0, 2, 4, 6, 8]
>>> greater3 >> Collect()
[False, False, False, False, True]

Note that MapMulti performs an arbitrary number of mappings at the same time and returns one iterator per mapping.
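For intuition, the idea of reading the input once and feeding several mappings can be approximated in plain Python with itertools.tee. This is only a sketch of the concept and not necessarily how nuts-flow implements MapMulti; the helper name map_multi is hypothetical.

```python
from itertools import tee

def map_multi(iterable, *funcs):
    """Return one lazy iterator per function; the input is read only once."""
    # tee splits one iterator into independent branches and buffers
    # elements that some branches have not consumed yet.
    branches = tee(iterable, len(funcs))
    return tuple(map(func, branch) for func, branch in zip(funcs, branches))

times2, greater3 = map_multi(range(5), lambda x: x * 2, lambda x: x > 3)
times2_list = list(times2)
greater3_list = list(greater3)
```

In this tee-based sketch, fully consuming one iterator before touching the other buffers the whole input in memory, so the approach pays off mainly when producing the input is the expensive part.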

Tabular data

Often input data is organized in rows (records) and columns, and transformations for selected columns only are needed.

MapCol

MapCol(columns, func) maps a function to the specified columns of the input data and leaves other columns unchanged.

Given the following table with tuples as records

>>> table = [ (1, 2),
...           (3, 4) ]

the example flow below negates all numbers in column 0:

>>> negate = lambda x: -x
>>> table >> MapCol(0, negate) >> Print() >> Consume()
(-1, 2)
(-3, 4)

Similarly, the following flow converts each number in the second column (column 1) to a string:

>>> table >> MapCol(1, str) >> Collect()
[(1, '2'), (3, '4')]

MapCol can apply the same mapping to multiple columns at the same time. For instance, checking if numbers in columns 0 and 1 are greater than two:

>>> table >> MapCol((0, 1), _ > 2) >> Collect()
[(False, False), (True, True)]

Note that input data must be an iterable of tuples or other indexable objects and the flow iterates over these records. To iterate over all elements of a table individually use Flatten.
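The column-wise behavior described above can be sketched in plain Python as follows. The function map_col is a hypothetical stand-in written for illustration and assumes rows are tuples; it is not the library's implementation.

```python
def map_col(rows, columns, func):
    """Apply func to the given column(s) of each row; keep other columns."""
    # Accept either a single column index or a collection of indices.
    cols = {columns} if isinstance(columns, int) else set(columns)
    for row in rows:
        yield tuple(func(v) if i in cols else v for i, v in enumerate(row))

table = [(1, 2), (3, 4)]
negated = list(map_col(table, 0, lambda x: -x))
checked = list(map_col(table, (0, 1), lambda x: x > 2))
```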

Append

Append(items) appends a single item or a sequence of items to the rows of the input data. For instance, given the table above, the following code adds an 'x' to each row:

>>> table >> Append('x') >> Print() >> Consume()
(1, 2, 'x')
(3, 4, 'x')

Appending (or merging) a column or table is equally easy:

>>> new_col = ['a', 'b']
>>> table >> Append(new_col) >> Print() >> Consume()
(1, 2, 'a')
(3, 4, 'b')
>>> table2 = [ ('a', 'c'),
...            ('b', 'd') ]
>>> table >> Append(table2) >> Print() >> Consume()
(1, 2, 'a', 'c')
(3, 4, 'b', 'd')

Insert

Insert(column, items) operates just like Append but lets you specify the column at which the new data is inserted:

>>> table >> Insert(1, 'x') >> Print() >> Consume()
(1, 'x', 2)
(3, 'x', 4)
>>> table >> Insert(0, table2) >> Print() >> Consume()
('a', 'c', 1, 2)
('b', 'd', 3, 4)

Insert() and Append() are often useful for enumerating rows:

>>> table2 >> Insert(0, Enumerate()) >> Print() >> Consume()
(0, 'a', 'c')
(1, 'b', 'd')

Note the difference from using Zip, which nests the data:

>>> table2 >> Zip(Enumerate()) >> Print() >> Consume()
(('a', 'c'), 0)
(('b', 'd'), 1)
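The flat versus nested results shown above can be reproduced in plain Python, which may help clarify the distinction. This is an analogy built on the built-in enumerate, not nuts-flow code; the variable names are illustrative.

```python
table2 = [('a', 'c'), ('b', 'd')]

# Flat rows: the index becomes a new first column, as with Insert(0, Enumerate()).
flat = [(i,) + row for i, row in enumerate(table2)]

# Nested rows: the original row stays a tuple inside a pair, as with Zip(Enumerate()).
nested = [(row, i) for i, row in enumerate(table2)]
```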

Get

Get(start, end, step) operates similarly to Python’s slicing [start:end:step] and extracts individual elements or slices from table records. For instance, given the following table

>>> table = [ (1, 2, 3),
...           (4, 5, 6) ]

Get(1) extracts all elements in column 1 of the table:

>>> table >> Get(1) >> Collect()
[2, 5]

Note that, since a single column was extracted, the output is a list of numbers and not a list of tuples anymore.

Get(0, 2) extracts columns 0 to 1:

>>> table >> Get(0, 2) >> Print() >> Consume()
(1, 2)
(4, 5)

and Get(0, 3, 2) extracts columns 0 to 2 with a stride of 2:

>>> table >> Get(0, 3, 2) >> Collect()
[(1, 3), (4, 6)]

Note that, in agreement with Python’s slicing, the index of the end column is exclusive.
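Since Get is described in terms of Python's slicing, the three examples above correspond to ordinary indexing and slicing of each row. The following plain-Python sketch shows that correspondence without using nuts-flow.

```python
table = [(1, 2, 3), (4, 5, 6)]

single = [row[1] for row in table]       # like Get(1): plain values, no tuples
span = [row[0:2] for row in table]       # like Get(0, 2): end index is exclusive
strided = [row[0:3:2] for row in table]  # like Get(0, 3, 2): every second column
```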

GetCols

The Get nut described above can extract only consecutive table columns in order. GetCols(*columns) extracts arbitrary columns in arbitrary order. Given the following table

>>> table = [ (1, 2, 3),
...           (4, 5, 6) ]

GetCols(1) extracts column 1 of the table:

>>> table >> GetCols(1) >> Collect()
[(2,), (5,)]

Note that, in contrast to Get(1), a list of single-element tuples is returned.

The following example extracts columns 2, 1, and 0, and effectively reverses the column order of the table:

>>> table >> GetCols(2, 1, 0) >> Print() >> Consume()
(3, 2, 1)
(6, 5, 4)

GetCols can even duplicate columns, e.g. duplicating column 1 and removing column 0 can be achieved as follows:

>>> table >> GetCols(1, 1, 2) >> Print() >> Consume()
(2, 2, 3)
(5, 5, 6)

Flatten data

Hierarchical data structures such as lists of lists frequently need to be converted to flat structures. Flatten and FlatMap are two nuts for flattening data.

Flatten

Flatten flattens all iterables within the input and returns an iterator over the result. For instance:

>>> [(1, 2), (3, 4, 5), 6] >> Flatten() >> Collect()
[1, 2, 3, 4, 5, 6]

Note that only one level is flattened. Deeper structures remain unchanged

>>> [(1, 2), ((3, 4), 5), 6] >> Flatten() >> Collect()
[1, 2, (3, 4), 5, 6]

but can, of course, be flattened by successive calls of Flatten:

>>> [(1, 2), ((3, 4), 5), 6] >> Flatten() >> Flatten() >> Collect()
[1, 2, 3, 4, 5, 6]
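One-level flattening can be sketched in plain Python as shown below. The function flatten_once is a hypothetical helper, and its treatment of strings as single items is a choice made for this sketch only; it says nothing about how nuts-flow handles strings.

```python
from collections.abc import Iterable

def flatten_once(items):
    """Yield the elements of nested iterables, removing exactly one level."""
    for item in items:
        # Choice made for this sketch only: strings and bytes count as atoms.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from item
        else:
            yield item  # non-iterables pass through unchanged

data = [(1, 2), ((3, 4), 5), 6]
once = list(flatten_once(data))
twice = list(flatten_once(flatten_once(data)))
```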

FlatMap

A common operation is a Map followed by a Flatten, and FlatMap is a nut that provides this operation in one call. The following examples duplicate all numbers in a list of numbers:

>>> dup = lambda x: (x, x)
>>> [0, 1, 2] >> Map(dup) >> Collect()
[(0, 0), (1, 1), (2, 2)]
>>> [0, 1, 2] >> Map(dup) >> Flatten() >> Collect()
[0, 0, 1, 1, 2, 2]
>>> [0, 1, 2] >> FlatMap(dup) >> Collect()
[0, 0, 1, 1, 2, 2]
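The same map-then-flatten pattern exists in the Python standard library as itertools.chain.from_iterable applied to a map. The sketch below is a plain-Python analogue with a hypothetical helper name flat_map, not nuts-flow's implementation, and it assumes the mapped function always returns an iterable.

```python
from itertools import chain

def flat_map(func, iterable):
    """Map func over iterable, then concatenate the resulting iterables."""
    return chain.from_iterable(map(func, iterable))

dup = lambda x: (x, x)
result = list(flat_map(dup, [0, 1, 2]))
```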