Reading from Sources

Every data flow starts with a source. A source is either a Python iterable or one of a small set of dedicated source nuts. As a general rule, a source must appear on the left side of the >> operator and can never appear on the right side.

Iterables

Some examples of Python iterables and iterators that can be used as sources:

>>> from nutsflow import *
>>> range(5) >> Collect()
[0, 1, 2, 3, 4]
>>> ['a', 'ab', 'abc'] >> Map(len) >> Collect()
[1, 2, 3]
>>> 'text' >> Map(lambda c: c.upper()) >> Join()
'TEXT'
>>> {1:'one', 2:'two'} >> Collect()
[1, 2]
>>> {1:'one', 2:'two'}.items() >> Collect()
[(1, 'one'), (2, 'two')]
with open(filepath) as lines:
  lines >> Filter(lambda l: l.startswith('ERR')) >> Print() >> Consume()

Source nuts

nuts-flow has a few special source nuts.

Range

Range(start [, end [, step]]) essentially operates the same as range but depletes. The following example demonstrates the difference:

>>> numbers = Range(5)
>>> numbers >> Head(3)
[0, 1, 2]
>>> numbers >> Head(3)
[3, 4]
>>> numbers >> Head(3)
[]

Each flow consumes more of the numbers iterator created with Range until it is exhausted. In contrast, range provides a new iterator every time it is used as a source and therefore does not deplete:

>>> numbers = range(5)
>>> numbers >> Head(3)
[0, 1, 2]
>>> numbers >> Head(3)
[0, 1, 2]
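The same distinction between a depleting iterator and a re-iterable object can be reproduced with the standard library alone. The sketch below is only an analogy and does not show how Range is implemented; it uses iter() and itertools.islice in place of Range and Head.

```python
from itertools import islice

# An iterator remembers its position, so each read continues where the last stopped.
numbers = iter(range(5))
first = list(islice(numbers, 3))   # consumes 0, 1, 2
second = list(islice(numbers, 3))  # only 3 and 4 are left
third = list(islice(numbers, 3))   # nothing is left

# A range object can be iterated repeatedly: every pass starts from the beginning.
fresh = range(5)
again_1 = list(islice(fresh, 3))
again_2 = list(islice(fresh, 3))

print(first, second, third)  # [0, 1, 2] [3, 4] []
print(again_1, again_2)      # [0, 1, 2] [0, 1, 2]
```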

Enumerate

Enumerate(start=0 [, step]) returns an iterator over increasing integer numbers. In contrast to Range it does not have an upper limit and iterates indefinitely.

>>> Enumerate(1) >> Zip('abc') >> Collect()
[(1, 'a'), (2, 'b'), (3, 'c')]

Often Enumerate is used to add line numbers to the lines of a file:

# Collect line numbers of empty lines
# (lines read from a file keep their newline, so strip before testing)
with open(filepath) as lines:
  (Enumerate() >> Zip(lines) >> Filter(lambda t: not t[1].strip()) >>
   Get(0) >> Collect())
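For comparison, here is a minimal standard-library sketch of the same idea, with itertools.count standing in for Enumerate and the built-in zip for Zip. The in-memory text is made up for illustration and replaces a real file so that the example runs on its own.

```python
import io
from itertools import count

# In-memory stand-in for an open text file (made-up content for illustration).
lines = io.StringIO("first\n\nthird\n\n")

# count() is unbounded; zip stops as soon as the file runs out of lines.
# Each line keeps its trailing newline, so strip it before testing for emptiness.
empty_line_numbers = [i for i, line in zip(count(), lines) if not line.strip()]

print(empty_line_numbers)  # [1, 3]
```

Because the unbounded counter is paired with a finite input, the flow terminates without an explicit upper limit.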

Product

Product is the functional equivalent of a nested loop. It generates the Cartesian product of the input iterables. For instance, the following example returns the coordinates of a 2x3 grid:

>>> Product(Range(2), Range(3)) >> Collect()
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

Each element of each input iterable is combined with each element of the other input iterables.
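To make the nested-loop equivalence concrete, the sketch below builds the 2x3 grid twice: once with explicit loops and once with itertools.product from the standard library, which computes the same kind of Cartesian product. It is an illustration of the concept, not a statement about how the Product nut is implemented.

```python
from itertools import product

# Explicit nested loops over a 2x3 grid ...
nested = []
for row in range(2):
    for col in range(3):
        nested.append((row, col))

# ... and the same coordinates as a Cartesian product.
grid = list(product(range(2), range(3)))

print(grid)            # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(nested == grid)  # True
```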

Repeat

The Repeat(value [, times]) nut returns the specified value the given number of times, or indefinitely if times is not specified:

>>> Repeat('a', 3) >> Collect()
['a', 'a', 'a']
>>> Repeat(1) >> Take(4) >> Collect()
[1, 1, 1, 1]

ReadNamedCSV

nuts-flow supports reading from Comma Separated Values (CSV) files with header names via the ReadNamedCSV(filepath, colnames, fmtfunc, rowname, **kwargs) nut. With the appropriate delimiter, files in Tab Separated Values (TSV) or other column formats can be read as well. Given a CSV file with the following content

A,B,C
1,2,3
4,5,6

the code below reads the rows as named tuples, and converts the elements of the row into integers (fmtfunc=int):

>>> filepath = 'tests/data/data.csv'
>>> with ReadNamedCSV(filepath, fmtfunc=int) as reader:
...     reader >> Print() >> Consume()
Row(A=1, B=2, C=3)
Row(A=4, B=5, C=6)

Different conversion functions per column are supported:

>>> fmtfuncs = (int, str, float)
>>> with ReadNamedCSV(filepath, fmtfunc=fmtfuncs) as reader:
...     reader >> Print() >> Consume()
Row(A=1, B='2', C=3.0)
Row(A=4, B='5', C=6.0)

ReadNamedCSV can also read a subset of the columns in a given order. Here we read only columns ‘B’ and ‘C’, in swapped order:

>>> with ReadNamedCSV(filepath, ('C', 'B')) as reader:
...     reader >> Print() >> Consume()
Row(C='3', B='2')
Row(C='6', B='5')

Finally, if ‘Row’ is not a good tuple name, it can be changed:

>>> with ReadNamedCSV(filepath, rowname='Sample') as reader:
...     reader >> Print() >> Consume()
Sample(A='1', B='2', C='3')
Sample(A='4', B='5', C='6')
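The behaviour described in this section (header names becoming field names, per-column conversion functions, a configurable tuple name, and a custom delimiter) can be approximated with the standard library's csv module and collections.namedtuple. The helper read_named_rows below is hypothetical and written for illustration only; it is not part of nuts-flow and makes no claim about how ReadNamedCSV is implemented. It reads from in-memory text so that it runs without a data file.

```python
import csv
import io
from collections import namedtuple


def read_named_rows(stream, fmtfuncs=None, rowname='Row', **kwargs):
    """Yield one named tuple per data row; the header row supplies the field names.

    Hypothetical helper for illustration, not part of nuts-flow.
    Extra keyword arguments (e.g. delimiter) are passed on to csv.reader.
    """
    reader = csv.reader(stream, **kwargs)
    header = next(reader)
    Row = namedtuple(rowname, header)
    for values in reader:
        if fmtfuncs is not None:
            # Apply one conversion function per column.
            values = [func(value) for func, value in zip(fmtfuncs, values)]
        yield Row(*values)


csv_text = "A,B,C\n1,2,3\n4,5,6\n"
rows = list(read_named_rows(io.StringIO(csv_text), fmtfuncs=(int, str, float)))
print(rows)     # [Row(A=1, B='2', C=3.0), Row(A=4, B='5', C=6.0)]

# Tab-separated input with a different tuple name and no conversion.
tsv_text = "A\tB\tC\n1\t2\t3\n"
samples = list(read_named_rows(io.StringIO(tsv_text), rowname='Sample', delimiter='\t'))
print(samples)  # [Sample(A='1', B='2', C='3')]
```

Without conversion functions every value stays a string, which matches the quoted values in the outputs above.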

ReadCSV

ReadCSV() is very similar to ReadNamedCSV but can read CSV files without header information and returns (unnamed) tuples.

>>> filepath = 'tests/data/data.csv'
>>> with ReadCSV(filepath, skipheader=1, fmtfunc=int) as reader:
...     reader >> Print() >> Consume()
...
(1, 2, 3)
(4, 5, 6)