Writing to Sinks
Sinks typically sit at the end of a data flow and are needed to drive it. Without a sink, the flow does not process any data.
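The role of a sink can be illustrated with a plain Python generator. This is only a sketch of lazy evaluation in general, not of how nuts-flow is implemented; the names squares and log are made up for the illustration.

```python
def squares(numbers, log):
    """Lazily square numbers, recording each processed value in `log`."""
    for n in numbers:
        log.append(n)  # side effect: shows when an element is actually processed
        yield n * n

log = []
flow = squares(range(3), log)      # building the flow processes nothing yet
processed_before_sink = list(log)  # still empty at this point

result = list(flow)                # the sink (here: list) pulls the data through
processed_after_sink = list(log)   # now every element has been processed
```

Until list is called, no element has passed through squares; the sink is what pulls the data through the flow.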
Python functions
All Python functions that accept iterators or iterables can serve as sinks, e.g. list, set, dict, sum, file.writelines, …. However, since they are not nuts, they do not support the >> operator and need to be called as functions. Here are some examples:
>>> from nutsflow import *
>>> from nutsflow import _
>>> list(Range(10) >> Filter(_ < 4) >> Square())
[0, 1, 4, 9]
>>> set([1, 2, 1, 3] >> Square())
{1, 4, 9}
>>> dict([('one', 1), ('four', 2)] >> MapCol(1, Square()))
{'four': 4, 'one': 1}
with open(filepath, 'w') as f:
    f.writelines(Range(4) >> Square() >> Format('{}'))
Nuts
Collect
The most commonly used sink is Collect, which collects all elements of the input iterable in a list.
>>> Range(10) >> Filter(_ < 4) >> Collect()
[0, 1, 2, 3]
Collect(container) also accepts a container to collect the data in. Any Python function that accepts iterators or iterables is a valid container, e.g.
>>> [1, 2, 1, 3] >> Square() >> Collect(set)
{1, 4, 9}
>>> [('one', 1), ('four', 2)] >> MapCol(1, Square()) >> Collect(dict)
{'four': 4, 'one': 1}
>>> Range(10) >> Square() >> Collect(sum)
285
>>> Range(5) >> Map(str) >> Collect(':'.join)
'0:1:2:3:4'
Collect stores all data in memory and is not suitable for large data sets. In such a case use WriteCSV to write data to the file system.
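The container argument can be pictured as "call this function on the flow". The following plain-Python sketch uses a hypothetical collect function as a stand-in; it is an assumption for illustration, not the library's actual code.

```python
def collect(iterable, container=list):
    """Hypothetical stand-in for a Collect-style sink: hand the iterable to `container`."""
    return container(iterable)

squares = [x * x for x in [1, 2, 1, 3]]

as_list = collect(squares)                       # default container is list
as_set = collect(squares, set)                   # duplicates removed
total = collect(squares, sum)                    # any function taking an iterable works
joined = collect(map(str, range(5)), ':'.join)   # bound methods work as well
```

Because the container receives the whole flow, the memory use depends on the container: list and set keep every element, while sum only keeps a running total.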
Head and Tail
Collect collects all elements of the input. Often only the first or last n elements are needed. Head(n) collects the first n elements and Tail(n) the last n elements
>>> Range(10) >> Head(4)
[0, 1, 2, 3]
>>> Range(10) >> Tail(4)
[6, 7, 8, 9]
Similar to Collect, Head and Tail allow specifying a container to store the result in
>>> [1, 2, 1, 3, 2] >> Head(3, set)
{1, 2}
>>> Range(10) >> Tail(3, sum)
24
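For comparison, the same behaviour can be approximated with the standard library alone. The functions head and tail below are hypothetical helpers written for this sketch and are not claimed to match the nuts-flow implementation.

```python
from collections import deque
from itertools import islice

def head(iterable, n, container=list):
    """First n elements; stops reading the input after n items."""
    return container(islice(iterable, n))

def tail(iterable, n, container=list):
    """Last n elements; a bounded deque keeps at most n items in memory."""
    return container(deque(iterable, maxlen=n))

first = head(range(10), 4)
last = tail(range(10), 4)
unique_first = head([1, 2, 1, 3, 2], 3, set)
tail_sum = tail(range(10), 3, sum)
```

Note the asymmetry in this sketch: head can stop early, whereas tail has to read the entire input before it knows which elements are last.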
Common nuts
nuts-flow provides nuts for common aggregator functions such as Sum, Min, Max, ArgMax, ArgMin, and Join. For instance, instead of writing
>>> Range(10) >> Collect(sum)
45
one can simply write
>>> Range(10) >> Sum()
45
Join is the nuts equivalent of Python’s join method but automatically converts numbers to strings, e.g.
>>> Range(5) >> Join(':')
'0:1:2:3:4'
in contrast to:
>>> Range(5) >> Map(str) >> Collect(':'.join)
'0:1:2:3:4'
Min and Max return the minimum or the maximum element of a data flow and allow specifying a key function and a default value in case of an empty data stream. For instance, to find the longest string
>>> ['1', '123', '12'] >> Max(key=len)
'123'
and return the empty string if there is no data
>>> [] >> Max(len, default='')
''
ArgMin and ArgMax return the index of the smallest or largest element and, optionally, the element itself. For example, the index of the longest string
>>> ['12', '1', '123'] >> ArgMax(key=len)
2
or the index and the string itself
>>> ['12', '1', '123'] >> ArgMax(len, retvalue=True)
(2, '123')
A default value is also supported to deal with empty input data
>>> [] >> ArgMax(default=(0, None), retvalue=True)
(0, None)
>>> [] >> ArgMax(default='empty')
'empty'
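The behaviour shown above can be reproduced in plain Python with max and enumerate. The function arg_max below is a hypothetical helper for illustration only; its parameter names follow the examples above, but it is not the nuts-flow source.

```python
def arg_max(items, key=lambda x: x, default=None, retvalue=False):
    """Index of the largest element by `key`; `default` is returned for empty input."""
    # enumerate yields (index, value) pairs, so None is a safe "empty" sentinel
    best = max(enumerate(items), key=lambda pair: key(pair[1]), default=None)
    if best is None:
        return default
    index, value = best
    return (index, value) if retvalue else index

words = ['12', '1', '123']
idx = arg_max(words, key=len)
idx_and_value = arg_max(words, key=len, retvalue=True)
empty_plain = arg_max([], default='empty')
empty_pair = arg_max([], default=(0, None), retvalue=True)
```

As in the examples above, the default is returned unchanged for empty input, so it should already have the shape the caller expects (a single value, or an index and value pair when retvalue is used).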
Count and CountValues
To count the number of elements in a flow, or the number of occurrences of each distinct element, Count and CountValues are provided. Count simply consumes the data flow and counts the number of elements
>>> [1, 2, 1, 3, 2] >> Count()
5
>>> 'abaacc' >> Count()
6
while CountValues counts the frequencies of the different values and returns a dictionary
>>> 'abaacc' >> CountValues()
{'a': 3, 'c': 2, 'b': 1}
CountValues can also return the relative frequencies instead of the absolute counts
>>> 'aabaab' >> CountValues(True)
{'a': 1.0, 'b': 0.5}
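The absolute counts correspond closely to what the standard library offers. The sketch below uses hypothetical helpers count and count_values and covers absolute counts only; it does not attempt to reproduce the normalisation used for relative frequencies.

```python
from collections import Counter

def count(iterable):
    """Consume the iterable and return its length (also works for generators)."""
    return sum(1 for _ in iterable)

def count_values(iterable):
    """Absolute frequency of each distinct value, as a plain dictionary."""
    return dict(Counter(iterable))

n_chars = count('abaacc')
n_items = count(x for x in [1, 2, 1, 3, 2])
frequencies = count_values('abaacc')
```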
Reduce
Reduce(func [, initializer]) reduces a flow of data elements to a single element, using a given function. Reduce is a thin wrapper around Python’s reduce function.
The following example computes the product of a list of numbers
>>> [1, 2, 3] >> Reduce(lambda a, b: a * b)
6
Reduce can be called with an initializer, which specifies the first element used in the reduction
>>> ['one', 'two'] >> Reduce(lambda a, b: a + b, 'start')
'startonetwo'
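Since Reduce wraps Python's reduce, the semantics of the underlying function are worth a quick look. The sketch below uses functools.reduce directly; whether Reduce propagates the empty-input error in exactly the same way is an assumption that follows from the "thin wrapper" description rather than something verified here.

```python
from functools import reduce

# Same reductions as above, expressed with the standard library function
product = reduce(lambda a, b: a * b, [1, 2, 3])
started = reduce(lambda a, b: a + b, ['one', 'two'], 'start')

# With an initializer, an empty input simply returns the initializer
empty_with_init = reduce(lambda a, b: a + b, [], 0)

# Without an initializer, functools.reduce raises TypeError on empty input
try:
    reduce(lambda a, b: a + b, [])
    empty_raises = False
except TypeError:
    empty_raises = True
```

Passing an initializer is therefore a simple way to make a reduction safe for flows that may be empty.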
Consume
If a data flow has side effects (e.g. printing, writing to a file) but no interesting result itself, the Consume nut can be used. It drives a data flow but neither collects nor returns any of its results. For instance, the following flow has the side effect of printing numbers:
>>> Range(3) >> Print() >> Consume()
0
1
2
In contrast, the following flow processes data but returns nothing
>>> Range(3) >> Square() >> Consume()
while the next flow has no sink and therefore only returns an iterator object but does not process any data
>>> Range(3) >> Square() >> Print()
<itertools.imap object at ...>
The first of these two flows produces no output because it has no side effect; the second produces none because there is no sink that drives the flow.
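A common plain-Python idiom for the same purpose is to feed an iterator into a zero-length deque. The helper names consume and record below are hypothetical; this is a sketch of the idea, not the nuts-flow implementation.

```python
from collections import deque

def consume(iterable):
    """Drain the iterable for its side effects and discard every element."""
    deque(iterable, maxlen=0)  # maxlen=0 stores nothing but still iterates

seen = []

def record(numbers):
    """Generator with a side effect: remember each element that passes through."""
    for n in numbers:
        seen.append(n)
        yield n

result = consume(record(range(3)))  # drives the generator, returns None
```

After the call, seen contains every element even though nothing was collected or returned.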
WriteCSV
WriteCSV(filepath, cols, skipheader, fmtfunc, **kwargs) writes data in Comma Separated Values (CSV) format to the specified file. For instance,
[(1, 2), (3, 4)] >> WriteCSV('data.csv')
would create the file data.csv with the following content
1,2
3,4
However, to ensure that files are closed safely, it is preferable to use WriteCSV in conjunction with the with statement
with WriteCSV('data.csv') as writer:
    [(1, 2), (3, 4)] >> writer
It is possible to select the columns to write and to skip a given number of header lines if needed. For example,
with WriteCSV('data.csv', cols=(1,0), skipheader=1) as writer:
    [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer
will write the following data to data.csv:
2,1
5,4
while
with WriteCSV('data.csv') as writer:
    [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer
will write
a,b,c
1,2,3
4,5,6
In addition to CSV, other formats such as Tab Separated Values (TSV) can be written by providing the appropriate delimiter
with WriteCSV('data.csv', delimiter='\t') as writer:
    [(1,2), (3,4)] >> writer
and values can be formatted using fmtfunc. For example,
with WriteCSV('data.csv', fmtfunc=lambda x: 'num:'+str(x)) as writer:
    [(1, 2, 3), (4, 5, 6)] >> writer
will output
num:1,num:2,num:3
num:4,num:5,num:6
Note that data does not need to be organized in tuples. Simple data streams can be written as well:
with WriteCSV('data.csv') as writer:
    Range(10) >> writer
WriteCSV is a thin wrapper around Python’s csv.writer and the kwargs of WriteCSV are passed on to csv.writer.
See https://docs.python.org/2/library/csv.html for more details.
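To make the relationship to csv.writer concrete, the following sketch reimplements the options described above on top of the standard library. The function write_rows is a hypothetical helper written for this illustration; the order in which it applies skipheader, cols and fmtfunc is an assumption, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io

def write_rows(out, rows, cols=None, skipheader=0, fmtfunc=None, **kwargs):
    """Hypothetical helper mirroring the options above; kwargs go to csv.writer."""
    writer = csv.writer(out, **kwargs)
    for i, row in enumerate(rows):
        if i < skipheader:
            continue                      # drop leading header rows
        if not isinstance(row, (tuple, list)):
            row = (row,)                  # simple values become one-column rows
        if cols is not None:
            row = [row[c] for c in cols]  # select and reorder columns
        if fmtfunc is not None:
            row = [fmtfunc(v) for v in row]
        writer.writerow(row)

data = [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)]

buf = io.StringIO()
write_rows(buf, data, cols=(1, 0), skipheader=1, lineterminator='\n')
csv_text = buf.getvalue()

tsv = io.StringIO()
write_rows(tsv, [(1, 2), (3, 4)], delimiter='\t', lineterminator='\n')
tsv_text = tsv.getvalue()

fmt = io.StringIO()
write_rows(fmt, [(1, 2, 3)], fmtfunc=lambda x: 'num:' + str(x), lineterminator='\n')
fmt_text = fmt.getvalue()
```

The lineterminator is set explicitly because csv.writer defaults to '\r\n'; delimiter and lineterminator are ordinary csv.writer formatting parameters, which is the kind of keyword argument the kwargs pass-through is meant for.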