.. _sources:

Reading from Sources
====================

All data flows start with a *source*. Sources are Python iterables and a
small set of specific nuts. As a general rule, sources must appear on the
left side of the ``>>`` operator and can never appear on the right side.


Iterables
---------

Some examples of Python iterables and iterators that can be used as sources:

>>> from nutsflow import *

>>> range(5) >> Collect()
[0, 1, 2, 3, 4]

>>> ['a', 'ab', 'abc'] >> Map(len) >> Collect()
[1, 2, 3]

>>> 'text' >> Map(lambda c: c.upper()) >> Join()
'TEXT'

>>> {1:'one', 2:'two'} >> Collect()
[1, 2]

>>> {1:'one', 2:'two'}.items() >> Collect()
[(1, 'one'), (2, 'two')]

.. code::

  with open(filepath) as lines:
      lines >> Filter(lambda l: l.startswith('ERR')) >> Print() >> Consume()


Source nuts
-----------

**nuts-flow** has a few special source nuts.


Range
^^^^^

``Range(start [,end [, step]])`` essentially operates the same as ``range``
but depletes. The following example demonstrates the difference:

>>> numbers = Range(5)
>>> numbers >> Head(3)
[0, 1, 2]
>>> numbers >> Head(3)
[3, 4]
>>> numbers >> Head(3)
[]

Each call consumes elements from the single iterator created with ``Range``,
whereas ``range`` yields a fresh iterator every time it is iterated over and
therefore does not deplete:

>>> numbers = range(5)
>>> numbers >> Head(3)
[0, 1, 2]
>>> numbers >> Head(3)
[0, 1, 2]


Enumerate
^^^^^^^^^

``Enumerate(start=0 [, step])`` returns an iterator over increasing integer
numbers. In contrast to :ref:`Range` it does not have an upper limit and
iterates indefinitely.

>>> Enumerate(1) >> Zip('abc') >> Collect()
[(1, 'a'), (2, 'b'), (3, 'c')]

Often ``Enumerate`` is used to add line numbers to the lines of a file:

.. code::

  # Collect line numbers of empty lines.
  # Lines read from a file keep their trailing newline, so strip before testing.
  with open(filepath) as lines:
      (Enumerate() >> Zip(lines) >> Filter(lambda il: not il[1].strip()) >>
       Get(0) >> Collect())


Product
^^^^^^^

``Product`` is the functional equivalent of a nested loop. It generates the
Cartesian product of the input iterables.
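
The nested-loop equivalence can be illustrated without **nuts-flow**. The
sketch below uses ``itertools.product`` from the Python standard library as a
stand-in for ``Product``. This is an assumption made for illustration only and
says nothing about how ``Product`` is actually implemented. The names
``letters``, ``digits``, ``nested`` and ``pairs`` are hypothetical.

.. code:: python

  from itertools import product

  letters = 'ab'
  digits = [1, 2, 3]

  # Explicit nested loop: the outer iterable varies slowest.
  nested = []
  for letter in letters:
      for digit in digits:
          nested.append((letter, digit))

  # Functional form (stand-in for Product): same tuples in the same order.
  pairs = list(product(letters, digits))

  assert nested == pairs
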
For instance, the following example returns the coordinates of a 2x3 grid:

>>> Product(Range(2), Range(3)) >> Collect()
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

Each element of each input iterable is combined with each element of the
other input iterables.


Repeat
^^^^^^

The ``Repeat(value [, times])`` nut returns the specified value the given
number of times, or indefinitely if ``times`` is not specified:

>>> Repeat('a', 3) >> Collect()
['a', 'a', 'a']

>>> Repeat(1) >> Take(4) >> Collect()
[1, 1, 1, 1]


ReadNamedCSV
^^^^^^^^^^^^

**nuts-flow** supports reading from Comma Separated Values (CSV) files with
header names via the
``ReadNamedCSV(filepath, colnames, fmtfunc, rowname, **kwargs)`` nut.
Given the correct delimiter, files in Tab Separated Values (TSV) format or
other column formats can be read as well.

Given a CSV file with the following content

.. code::

  A,B,C
  1,2,3
  4,5,6

the code below reads the rows as named tuples and converts the elements of
each row into integers (``fmtfunc=int``):

>>> filepath = 'tests/data/data.csv'
>>> with ReadNamedCSV(filepath, fmtfunc=int) as reader:
...     reader >> Print() >> Consume()
Row(A=1, B=2, C=3)
Row(A=4, B=5, C=6)

A different conversion function can be given for each column:

>>> fmtfuncs = (int, str, float)
>>> with ReadNamedCSV(filepath, fmtfunc=fmtfuncs) as reader:
...     reader >> Print() >> Consume()
Row(A=1, B='2', C=3.0)
Row(A=4, B='5', C=6.0)

``ReadNamedCSV`` can also read a subset of the columns in a chosen order.
Here we read only columns 'B' and 'C', in swapped order:

>>> with ReadNamedCSV(filepath, ('C', 'B')) as reader:
...     reader >> Print() >> Consume()
Row(C='3', B='2')
Row(C='6', B='5')

Finally, if 'Row' is not a good tuple name, it can be changed:

>>> with ReadNamedCSV(filepath, rowname='Sample') as reader:
...     reader >> Print() >> Consume()
Sample(A='1', B='2', C='3')
Sample(A='4', B='5', C='6')


ReadCSV
^^^^^^^

``ReadCSV()`` is very similar to ``ReadNamedCSV`` but can read CSV files
without header information and returns (unnamed) tuples.

>>> filepath = 'tests/data/data.csv'
>>> with ReadCSV(filepath, skipheader=1, fmtfunc=int) as reader:
...     reader >> Print() >> Consume()
...
(1, 2, 3)
(4, 5, 6)
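
For comparison, the behaviour of the ``ReadCSV`` example can be approximated
with the standard ``csv`` module. This is only a sketch under stated
assumptions: the helper name ``read_csv_rows`` is hypothetical, the file
content is supplied inline through ``io.StringIO`` instead of being read from
``tests/data/data.csv``, and nothing here reflects how **nuts-flow** implements
``ReadCSV`` internally.

.. code:: python

  import csv
  import io

  def read_csv_rows(lines, skipheader=0, fmtfunc=None):
      """Yield each CSV row as a plain tuple (hypothetical helper).

      skipheader: number of leading rows to skip.
      fmtfunc:    optional function applied to every field of a row.
      """
      reader = csv.reader(lines)
      for _ in range(skipheader):
          next(reader, None)  # tolerate inputs shorter than the header
      for row in reader:
          if fmtfunc is None:
              yield tuple(row)
          else:
              yield tuple(fmtfunc(field) for field in row)

  # Same content as the CSV file shown in the ReadNamedCSV section.
  data = io.StringIO("A,B,C\n1,2,3\n4,5,6\n")
  rows = list(read_csv_rows(data, skipheader=1, fmtfunc=int))
  print(rows)  # [(1, 2, 3), (4, 5, 6)]

Unlike this helper, ``ReadCSV`` is used as a context manager and feeds
directly into a flow via ``>>``, as in the example above.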