Writing to Sinks

Sinks typically sit at the end of a data flow and are needed to drive it. Without a sink, the flow does not process any data.
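The same lazy behaviour can be observed with an ordinary Python generator. The following is a minimal plain-Python sketch that does not use nuts-flow; the names square_all and log are hypothetical and chosen only for illustration.

```python
log = []  # records which elements were actually processed


def square_all(iterable):
    """Lazily square each element and log every element that is processed."""
    for x in iterable:
        log.append(x)
        yield x * x


flow = square_all(range(3))   # builds the flow but processes nothing yet
processed_before = list(log)  # still empty: nothing has pulled any data

result = list(flow)           # list() acts as the sink and drives the flow
processed_after = list(log)   # now every element has been processed
```

Only the call to list() causes the generator body to run, which is the role a sink plays in a flow.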

Python functions

All Python functions that accept iterators or iterables can serve as sinks, e.g. list, set, dict, sum, file.writelines, …, but since they are not nuts, they do not support the >> operator and need to be called as functions. Here are some examples:

>>> from nutsflow import *
>>> from nutsflow import _
>>> list(Range(10) >> Filter(_ < 4) >> Square())
[0, 1, 4, 9]
>>> set([1, 2, 1, 3] >> Square())
{1, 4, 9}
>>> dict([('one', 1), ('four', 2)] >> MapCol(1, Square()))
{'four': 4, 'one': 1}
with open(filepath, 'w') as f:
    f.writelines(Range(4) >> Square() >> Format('{}'))

Nuts

Collect

The most commonly used sink is Collect, which collects all elements of the input iterable in a list.

>>> Range(10) >> Filter(_ < 4) >> Collect()
[0, 1, 2, 3]

Collect(container) also accepts a container in which to collect the data. Any Python function that accepts iterators or iterables is a valid container, e.g.

>>> [1, 2, 1, 3] >> Square() >> Collect(set)
{1, 4, 9}
>>> [('one', 1), ('four', 2)] >> MapCol(1, Square()) >> Collect(dict)
{'four': 4, 'one': 1}
>>> Range(10) >> Square() >> Collect(sum)
285
>>> Range(5) >> Map(str) >> Collect(':'.join)
'0:1:2:3:4'

Collect stores all data in memory and is therefore not suitable for large data sets. In such cases, use WriteCSV to write the data to the file system.
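Conceptually, a container is just a callable that receives the whole iterable. The sketch below illustrates that idea in plain Python; collect is a hypothetical stand-in written for this example and is not nuts-flow's implementation.

```python
def collect(iterable, container=list):
    """Hypothetical stand-in for Collect: hand the whole iterable to container."""
    return container(iterable)


as_list = collect(x for x in range(10) if x < 4)         # default container: list
unique = collect([1, 4, 1, 9], set)                      # duplicates are dropped
total = collect((x * x for x in range(10)), sum)         # sum of the squares 0..81
joined = collect((str(x) for x in range(5)), ':'.join)   # bound method as container
```

Because the container receives every element at once, containers such as list or set hold the complete result in memory.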

Head and Tail

Collect collects all elements of the input, but often only the first or last n elements are needed. Head(n) collects the first n elements and Tail(n) the last n elements:

>>> Range(10) >> Head(4)
[0, 1, 2, 3]
>>> Range(10) >> Tail(4)
[6, 7, 8, 9]

Similar to Collect, Head and Tail accept a container in which to store the result:

>>> [1, 2, 1, 3, 2] >> Head(3, set)
{1, 2}
>>> Range(10) >> Tail(3, sum)
24
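For comparison, the same results can be obtained with the standard library. This is a plain-Python sketch under the assumption that only the first or last n elements need to be kept; head and tail are hypothetical helper names, not the nuts-flow implementation.

```python
from collections import deque
from itertools import islice


def head(iterable, n, container=list):
    """Pass the first n elements to container; the rest is never read."""
    return container(islice(iterable, n))


def tail(iterable, n, container=list):
    """Pass the last n elements to container, keeping at most n in memory."""
    return container(deque(iterable, maxlen=n))


first_four = head(range(10), 4)
last_four = tail(range(10), 4)
head_set = head([1, 2, 1, 3, 2], 3, set)
tail_sum = tail(range(10), 3, sum)
```

Note that tail still has to read the entire input, whereas head stops after n elements.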

Common nuts

nuts-flow provides nuts for common aggregator functions such as Sum, Min, Max, ArgMax, ArgMin, and Join. For instance, instead of writing

>>> Range(10) >> Collect(sum)
45

one can simply write

>>> Range(10) >> Sum()
45

Join is the nuts equivalent of Python’s join method but automatically converts numbers to strings, e.g.

>>> Range(5) >> Join(':')
'0:1:2:3:4'

in contrast to:

>>> Range(5) >> Map(str) >> Collect(':'.join)
'0:1:2:3:4'

Min and Max return the minimum or maximum element of a data flow. Both accept a key function and a default value that is returned for an empty data stream. For instance, find the longest string

>>> ['1', '123', '12'] >> Max(key=len)
'123'

and return the empty string if there is no data

>>> [] >> Max(len, default='')
''

ArgMin and ArgMax return the index of the smallest or largest element and, optionally, the element itself. For example, the index of the longest string

>>> ['12', '1', '123'] >> ArgMax(key=len)
2

or the index and the string itself

>>> ['12', '1', '123'] >> ArgMax(len, retvalue=True)
(2, '123')

A default value is also supported to deal with empty input data

>>> [] >> ArgMax(default=(0, None), retvalue=True)
(0, None)
>>> [] >> ArgMax(default='empty')
'empty'
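The behaviour described for Max and ArgMax can be approximated with Python's built-in max (Python 3.4 or later for the default argument). The following is a sketch only; arg_max is a hypothetical helper whose parameter names mirror the examples above, and how nuts-flow resolves ties is not stated in this text.

```python
def arg_max(iterable, key=lambda x: x, default=None, retvalue=False):
    """Return the index of the largest element, or (index, element) if retvalue."""
    # Pair every element with its index and compare pairs by the keyed element.
    best = max(enumerate(iterable), key=lambda pair: key(pair[1]), default=None)
    if best is None:          # empty input: fall back to the default
        return default
    index, value = best
    return (index, value) if retvalue else index


longest = max(['1', '123', '12'], key=len)
longest_or_empty = max([], key=len, default='')

index_only = arg_max(['12', '1', '123'], key=len)
index_and_value = arg_max(['12', '1', '123'], key=len, retvalue=True)
empty_input = arg_max([], default='empty')
```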

Count and CountValues

Count and CountValues count the number of elements in a flow and the number of occurrences of each distinct element, respectively.

Count simply consumes the data flow and counts the number of elements

>>> [1, 2, 1, 3, 2] >> Count()
5
>>> 'abaacc' >> Count()
6

while CountValues counts the frequencies of the different values and returns a dictionary

>>> 'abaacc' >> CountValues()
{'a': 3, 'c': 2, 'b': 1}
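A plain-Python counterpart of these two sinks can be sketched with collections.Counter. The helper names count and count_values are hypothetical, and the sketch covers absolute counts only.

```python
from collections import Counter


def count(iterable):
    """Consume the iterable and return how many elements it produced."""
    return sum(1 for _ in iterable)


def count_values(iterable):
    """Return a dictionary mapping each distinct value to its frequency."""
    return dict(Counter(iterable))


n_items = count([1, 2, 1, 3, 2])
n_chars = count('abaacc')
frequencies = count_values('abaacc')
```

Both helpers read the input exactly once, so they also work on iterators that cannot be restarted.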

CountValues can also return the relative frequencies instead of the absolute counts

>>> 'aabaab' >> CountValues(True)
{'a': 1.0, 'b': 0.5}

Reduce

Reduce(func [,initializer]) reduces a flow of data elements to a single element, using a given function. Reduce is a thin wrapper around Python’s reduce function (functools.reduce in Python 3).

The following example computes the product of a list of numbers

>>> [1, 2, 3] >> Reduce(lambda a, b: a * b)
6

Reduce can be called with an initializer, which specifies the first element used in the reduction

>>> ['one', 'two'] >> Reduce(lambda a, b: a + b, 'start')
'startonetwo'
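Since Reduce wraps Python's reduce function, the same reductions can be written directly with functools.reduce. The sketch below also shows why an initializer matters for empty input; that Reduce itself behaves identically for empty input is an assumption not confirmed here.

```python
from functools import reduce

# Product of a list of numbers, as in the first Reduce example.
product = reduce(lambda a, b: a * b, [1, 2, 3])

# With an initializer, the reduction starts from that value.
labelled = reduce(lambda a, b: a + b, ['one', 'two'], 'start')

# An initializer also makes reducing an empty input well defined ...
empty_with_init = reduce(lambda a, b: a + b, [], 'start')

# ... whereas functools.reduce raises TypeError for empty input without one.
try:
    reduce(lambda a, b: a + b, [])
    raised = False
except TypeError:
    raised = True
```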

Consume

If a data flow has side effects (e.g. printing, writing to a file) but no interesting result itself, the Consume nut can be used. It drives a data flow but does not collect any of its results; they are discarded. For instance, the following flow has the side effect of printing numbers:

>>> Range(3) >> Print() >> Consume()
0
1
2

In contrast, the following flow processes data but returns nothing

>>> Range(3) >> Square() >> Consume()

while the next flow has no sink and therefore only returns an iterator object but does not process any data

>>> Range(3) >> Square() >> Print()
<itertools.imap object at ...>

Neither of these two flows prints anything: the former because it has no side effect, and the latter because there is no sink that drives the flow.
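Draining an iterator purely for its side effects is a common pattern in plain Python as well. The sketch below uses a zero-length deque for this; consume and record are hypothetical names, and this is not necessarily how the Consume nut is implemented.

```python
from collections import deque


def consume(iterable):
    """Run the iterable to exhaustion without storing any element."""
    deque(iterable, maxlen=0)  # a deque of length 0 discards every element


seen = []  # stands in for a side effect such as printing


def record(iterable):
    """Pass elements through unchanged while recording them as a side effect."""
    for x in iterable:
        seen.append(x)
        yield x


result = consume(record(range(3)))  # side effects happen, nothing is returned
```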

WriteCSV

WriteCSV(filepath, cols, skipheader, fmtfunc, **kwargs) writes data in Comma Separated Values format (CSV) to the specified file. For instance,

[(1, 2), (3, 4)] >> WriteCSV('data.csv')

would create the file data.csv with the following content

1,2
3,4

However, to ensure that files are closed safely, it is preferable to use WriteCSV together with the with statement

with WriteCSV('data.csv') as writer:
   [(1, 2), (3, 4)] >> writer

It is possible to select the columns to write and to skip a given number of header lines if needed. For example,

with WriteCSV('data.csv', cols=(1,0), skipheader=1) as writer:
   [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer

will write the following data to data.csv:

2,1
5,4

while

with WriteCSV('data.csv') as writer:
   [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer

will write

a,b,c
1,2,3
4,5,6

In addition to CSV other formats such as Tab Separated Values (TSV) can be written by providing the appropriate delimiter

with WriteCSV('data.csv', delimiter='\t') as writer:
   [(1,2), (3,4)] >> writer

and values can be formatted using fmtfunc. For example,

with WriteCSV('data.csv', fmtfunc=lambda x: 'num:'+str(x)) as writer:
   [(1, 2, 3), (4, 5, 6)] >> writer

will output

num:1,num:2,num:3
num:4,num:5,num:6

Note that data does not need to be organized in tuples. Simple data streams can be written as well:

with WriteCSV('data.csv') as writer:
    Range(10) >> writer

WriteCSV is a thin wrapper around Python's csv.writer, and the kwargs of WriteCSV are passed on to csv.writer. See https://docs.python.org/2/library/csv.html for more details.
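To make the roles of cols, skipheader, fmtfunc and the forwarded keyword arguments concrete, here is a rough sketch built directly on csv.writer (Python 3). The function write_csv is hypothetical and only mirrors the behaviour described above; in particular, wrapping single values into one-column rows is an assumption about how simple data streams are handled.

```python
import csv
import io


def write_csv(stream, rows, cols=None, skipheader=0, fmtfunc=None, **kwargs):
    """Write rows to stream with csv.writer; extra kwargs go to csv.writer."""
    writer = csv.writer(stream, **kwargs)
    for i, row in enumerate(rows):
        if i < skipheader:                      # drop leading header rows
            continue
        if not isinstance(row, (list, tuple)):  # assumption: single value = 1 column
            row = (row,)
        if cols is not None:                    # select and reorder columns
            row = [row[c] for c in cols]
        if fmtfunc is not None:                 # format each value before writing
            row = [fmtfunc(v) for v in row]
        writer.writerow(row)


data = [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)]

# Column selection and header skipping, written to an in-memory buffer.
selected = io.StringIO(newline='')
write_csv(selected, data, cols=(1, 0), skipheader=1, lineterminator='\n')

# Tab-separated output via the delimiter keyword of csv.writer.
tabbed = io.StringIO(newline='')
write_csv(tabbed, [(1, 2), (3, 4)], delimiter='\t', lineterminator='\n')

# Formatting every value with fmtfunc.
formatted = io.StringIO(newline='')
write_csv(formatted, [(1, 2, 3)], fmtfunc=lambda x: 'num:' + str(x),
          lineterminator='\n')
```

The lineterminator argument is set only to make the buffer contents easy to compare; csv.writer uses '\r\n' by default.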