.. _sinks: Writing to Sinks ================ Sinks are typically at the end of a data flow and needed to *drive* the flow. Without a sink the flow is not processing any data. Python functions ---------------- All Python functions that accept iterators or iterables can serve as sinks, e.g. ``list``, ``set``, ``dict``, ``sum``, ``file.writelines``, ..., but since they are not nuts they do not support the ``>>`` operator and need to be called as functions. Here some examples >>> from nutsflow import * >>> from nutsflow import _ >>> list(Range(10) >> Filter(_ < 4) >> Square()) [0, 1, 4, 9] >>> set([1, 2, 1, 3] >> Square()) {1, 4, 9} >>> dict([('one', 1), ('four', 2)] >> MapCol(1, Square())) {'four': 4, 'one': 1} .. code:: with open(filepath) as f: f.writelines(Range(4) >> Square() >> Format('{}')) Nuts ---- Collect ^^^^^^^ The most commonly used sink is ``Collect``, which collects all elements of the input iterable in a list. >>> Range(10) >> Filter(_ < 4) >> Collect() [0, 1, 2, 3] ``Collect(container)`` also allows to specify a container to collect the data in. Any Python function that accept iterators or iterables is a valid container, e.g. >>> [1, 2, 1, 3] >> Square() >> Collect(set) {1, 4, 9} >>> [('one', 1), ('four', 2)] >> MapCol(1, Square()) >> Collect(dict) {'four': 4, 'one': 1} >>> Range(10) >> Square() >> Collect(sum) 285 >>> Range(5) >> Map(str) >> Collect(':'.join) '0:1:2:3:4' ``Collect`` stores all data in memory and is not suitable for large data sets. In such a case use :ref:`WriteCSV` to write data to the file system. Head and Tail ^^^^^^^^^^^^^ :ref:`Collect` collects **all** elements of the input. Often only the first or last *n* elements are needed. ``Head(n)`` collects the first *n* elements and ``Tail(n)`` the last *n* elements >>> Range(10) >> Head(4) [0, 1, 2, 3] >>> Range(10) >> Tail(4) [6, 7, 8, 9] Similar to :ref:`Collect`, ``Head`` and ``Tail`` allow to specify a container to store the result in >>> [1, 2, 1, 3, 2] >> Head(3, set) {1, 2} >>> Range(10) >> Tail(3, sum) 24 Common nuts ^^^^^^^^^^^ **nuts-flow** provides nuts for common aggregator functions such as ``Sum``, ``Min``, ``Max``, ``ArgMax``, ``ArgMin``, and ``Join``. For instance, instead of writing >>> Range(10) >> Collect(sum) 45 one can simply write >>> Range(10) >> Sum() 45 ``Join`` is the nuts equivalent of Python's ``join`` method but automatically converts numbers to strings, e.g. >>> Range(5) >> Join(':') '0:1:2:3:4' in contrast to: >>> Range(5) >> Map(str) >> Collect(':'.join) '0:1:2:3:4' ``Min`` and ``Max`` return the minimum or the maximum element of a data flow and allow to specify a key function and a default value in case of an empty data stream. For instance, find the longest string >>> ['1', '123', '12'] >> Max(key=len) '123' and return the empty string if there is no data >>> [] >> Max(len, default='') '' ``ArgMin`` and ``ArgMax`` return the **index** of the smallest or largest element and possibly the element itself. For example, the index of the longest string >>> ['12', '1', '123'] >> ArgMax(key=len) 2 or the index and the string itself >>> ['12', '1', '123'] >> ArgMax(len, retvalue=True) (2, '123') A default value is also supported to deal with empty input data >>> [] >> ArgMax(default=(0, None), retvalue=True) (0, None) >>> [] >> ArgMax(default='empty') 'empty' Count and CountValues ^^^^^^^^^^^^^^^^^^^^^ To count the number of elements in a flow or the numbers of different elements in a flow ``Count`` and ``CountValues`` are provided. ``Count`` simply consumes the data flow and counts the number of elements >>> [1, 2, 1, 3, 2] >> Count() 5 >>> 'abaacc' >> Count() 6 while ``CountValues`` counts the frequencies of the different values and returns a dictionary >>> 'abaacc' >> CountValues() {'a': 3, 'c': 2, 'b': 1} ``CountValues`` can also return the *relative frequencies* instead of the *absolute counts* >>> 'aabaab' >> CountValues(True) {'a': 1.0, 'b': 0.5} Reduce ^^^^^^ ``Reduce(func [,initiaizer])`` reduces a flow of data elements to a single element, using a given function. ``Reduce`` is a thin wrapper around Python's `reduce `_ function. The following example computes the product of a list of numbers >>> [1, 2, 3] >> Reduce(lambda a, b: a * b) 6 ``Reduce`` can be called with an initalizer, which specifies the first element used in the reduction >>> ['one', 'two'] >> Reduce(lambda a, b: a + b, 'start') 'startonetwo' Consume ^^^^^^^ If a data flow has side effects (e.g. printing, writing to a file) but no interesting result itself the ``Consume`` nut can be used. It drives a data flow but does not collect or discards any of its results. For instance, the following flow has the side effect of printing numbers: >>> Range(3) >> Print() >> Consume() 0 1 2 In contrast, the following flow processes data but returns nothing >>> Range(3) >> Square() >> Consume() while the next flow has no sink and therefore only returns an iterator object but does not process any data >>> Range(3) >> Square() >> Print() The former because there is no side effect and the latter because there is no sink that drives the flow. WriteCSV ^^^^^^^^ ``WriteCSV(filepath, cols, skipheader, fmtfunc, **kwargs)`` writes data in *Comma Separated Values format* (CSV) to the specified file. For instance, .. code:: [(1, 2), (3, 4)] >> WriteCSV('data.csv') would create the file ``data.csv`` with the following content :: 1,2 3,4 However, to ensure that files are closed safely it is preferable to use ``WriteCSV`` in conjunction with the ``with`` statement .. code:: with WriteCSV('data.csv') as writer: [(1, 2), (3, 4)] >> writer It is possible to select the columns to write and to skip a given number of header lines if needed. For example, .. code:: with WriteCSV('data.csv', cols=(1,0), skipheader=1) as writer: [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer will write the following data to ``data.csv``: :: 2,1 5,4 while .. code:: with WriteCSV('data.csv') as writer: [('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)] >> writer will write :: a,b,c 1,2,3 4,5,6 In addition to CSV other formats such as *Tab Separated Values* (TSV) can be written by providing the appropriate delimiter .. code:: with WriteCSV('data.csv', delimiter='\t') as writer: [(1,2), (3,4)] >> writer and values can be formatted using ``fmtfunc``. For example, .. code:: with WriteCSV('data.csv', fmtfunc=lambda x: 'num:'+str(x)) as writer: [(1, 2, 3), (4, 5, 6)] >> writer will output :: num:1,num:2,num:3 num:4,num:5,num:6 Note that data does not need to be organized in tuples. Simple data streams can be written as well: .. code:: with WriteCSV('data.csv') as writer: Range(10) >> writer ``WriteCSV`` is a thin wrapper around Pythons ``csv.writer`` and the ``kwargs`` of ``WriteCSV`` are passed on to ``csv.writer``. See https://docs.python.org/2/library/csv.html for more details.