.. _transforming:

Data transformation
===================

The (element-wise) transformation of data is, together with
:ref:`filtering`, at the very core of data flows, and **nuts-flow** provides
various nuts for this purpose.


Elementwise transformations
---------------------------

The most common transformation is mapping a function over a flow.


Map
^^^

The ``Map(func)`` nut takes a function and applies it to each element of the
input iterable. See the following examples:

  >>> from nutsflow import *
  >>> from nutsflow import _

  >>> Range(5) >> Map(lambda x : x * x) >> Collect()
  [0, 1, 4, 9, 16]

  >>> Range(5) >> Map(_ * 2) >> Collect()
  [0, 2, 4, 6, 8]

  >>> Range(5) >> Map(_ > 2) >> Collect()
  [False, False, False, True, True]

  >>> Range(5) >> Map(str) >> Collect()
  ['0', '1', '2', '3', '4']

Note that ``Map`` can transform elements of the flow in arbitrary ways but
cannot change the number of elements in the flow.


MapMulti
^^^^^^^^

Occasionally, it is necessary to apply different, independent mappings to the
same data. One way is to process the data for each mapping individually, e.g.

  >>> times2 = Range(5) >> Map(_ * 2) >> Collect()
  >>> greater3 = Range(5) >> Map(_ > 3) >> Collect()

However, if generating or reading the input data is computationally expensive,
it is more efficient to use ``MapMulti`` and avoid reading the input multiple
times:

  >>> times2, greater3 = Range(5) >> MapMulti(_ * 2, _ > 3)
  >>> times2 >> Collect()
  [0, 2, 4, 6, 8]
  >>> greater3 >> Collect()
  [False, False, False, False, True]

Note that ``MapMulti`` performs an arbitrary number of mappings at the same
time and returns an iterator for each mapping.


Tabular data
------------

Often input data is organized in rows (records) and columns, and
transformations are needed for selected columns only.


MapCol
^^^^^^

``MapCol(columns, func)`` maps a function to the specified columns of the
input data and leaves other columns unchanged.

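
As a rough mental model, the column-wise mapping described for ``MapCol`` can
be mimicked in plain Python. This is only a sketch and not how **nuts-flow**
implements it: ``map_col`` is a hypothetical helper introduced here for
illustration, and it assumes that every record is a tuple.

```python
def map_col(rows, columns, func):
    """Hypothetical stand-in for MapCol: apply func to selected columns."""
    # Accept either a single column index or a collection of indices.
    cols = {columns} if isinstance(columns, int) else set(columns)
    for row in rows:
        # Transform values in the selected columns, pass the others through.
        yield tuple(func(v) if i in cols else v for i, v in enumerate(row))


rows = [(1, 2), (3, 4)]
negated = list(map_col(rows, 0, lambda x: -x))
compared = list(map_col(rows, (0, 1), lambda x: x > 2))
```

Like a nut, the helper is a generator, so records are transformed lazily, one
at a time.
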
Given the following table with tuples as records

  >>> table = [ (1, 2),
  ...           (3, 4) ]

the example flow below negates all numbers in column 0:

  >>> negate = lambda x: -x
  >>> table >> MapCol(0, negate) >> Print() >> Consume()
  (-1, 2)
  (-3, 4)

Similarly, each number in the second column can be converted to a string:

  >>> table >> MapCol(1, str) >> Collect()
  [(1, '2'), (3, '4')]

``MapCol`` can apply the same mapping to multiple columns at the same time.
For instance, checking if numbers in columns 0 and 1 are greater than two:

  >>> table >> MapCol((0, 1), _ > 2) >> Collect()
  [(False, False), (True, True)]

Note that the input data must be an iterable of tuples or other indexable
objects, and that the flow iterates over these records. To iterate over all
elements of a table individually, use :ref:`Flatten`.


Append
^^^^^^

``Append(items)`` appends a single item or a sequence of items to the rows of
the input data. For instance, given the table above, the following code adds
an ``'x'`` to each row:

  >>> table >> Append('x') >> Print() >> Consume()
  (1, 2, 'x')
  (3, 4, 'x')

Appending (or merging) a column or table is equally easy:

  >>> new_col = ['a', 'b']
  >>> table >> Append(new_col) >> Print() >> Consume()
  (1, 2, 'a')
  (3, 4, 'b')

  >>> table2 = [ ('a', 'c'),
  ...            ('b', 'd') ]
  >>> table >> Append(table2) >> Print() >> Consume()
  (1, 2, 'a', 'c')
  (3, 4, 'b', 'd')


Insert
^^^^^^

``Insert(column, items)`` operates just like ``Append`` but additionally takes
the column at which the new data is to be inserted:

  >>> table >> Insert(1, 'x') >> Print() >> Consume()
  (1, 'x', 2)
  (3, 'x', 4)

  >>> table >> Insert(0, table2) >> Print() >> Consume()
  ('a', 'c', 1, 2)
  ('b', 'd', 3, 4)

``Insert()`` and ``Append()`` are often useful to enumerate rows:

  >>> table2 >> Insert(0, Enumerate()) >> Print() >> Consume()
  (0, 'a', 'c')
  (1, 'b', 'd')

Note the difference from using ``Zip``, which nests the data:

  >>> table2 >> Zip(Enumerate()) >> Print() >> Consume()
  (('a', 'c'), 0)
  (('b', 'd'), 1)


Get
^^^

``Get(start, end, step)`` operates similarly to Python's slicing
``[start:end:step]`` and extracts individual elements or slices from table
records. For instance, given the following table

  >>> table = [ (1, 2, 3),
  ...           (4, 5, 6) ]

``Get(1)`` extracts all elements in column 1 of the table:

  >>> table >> Get(1) >> Collect()
  [2, 5]

Note that, since a single column was extracted, the output is a list of
numbers and no longer a list of tuples. ``Get(0, 2)`` extracts columns 0 to 1:

  >>> table >> Get(0, 2) >> Print() >> Consume()
  (1, 2)
  (4, 5)

and ``Get(0, 3, 2)`` extracts columns 0 to 2 with stride 2:

  >>> table >> Get(0, 3, 2) >> Collect()
  [(1, 3), (4, 6)]

Note that, in agreement with Python's slicing, the index of the ``end`` column
is *exclusive*.


GetCols
^^^^^^^

The ``Get`` nut described above can extract only a single column or a slice of
columns in their original order. ``GetCols(*columns)`` extracts arbitrary
columns in arbitrary order. Given the following table

  >>> table = [ (1, 2, 3),
  ...           (4, 5, 6) ]

``GetCols(1)`` extracts column 1 of the table:

  >>> table >> GetCols(1) >> Collect()
  [(2,), (5,)]

Note that, in contrast to ``Get(1)``, a list of (single element) tuples is
returned.

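
The different return shapes of ``Get`` and ``GetCols`` can be illustrated with
plain Python indexing. The snippet is a sketch of the behaviour described in
this section, not the library's implementation, and ``get_cols`` is a
hypothetical helper used only for this illustration.

```python
table = [(1, 2, 3), (4, 5, 6)]

# Like Get(1): a single index yields the bare element of each row.
single = [row[1] for row in table]

# Like Get(0, 3, 2): start, end and step behave like a Python slice.
sliced = [row[0:3:2] for row in table]


def get_cols(rows, *columns):
    """Hypothetical stand-in for GetCols: pick listed columns as a tuple."""
    for row in rows:
        yield tuple(row[c] for c in columns)


# Like GetCols(1): explicitly listed columns always produce tuples.
picked = list(get_cols(table, 1))
```

Here ``single`` holds plain numbers, while ``picked`` holds one-element
tuples, which mirrors the difference noted above.
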
The following example extracts columns 2, 1, and 0, and effectively reverses
the column order of the table:

  >>> table >> GetCols(2, 1, 0) >> Print() >> Consume()
  (3, 2, 1)
  (6, 5, 4)

``GetCols`` can even duplicate columns, e.g. duplicating column 1 and removing
column 0 can be achieved as follows:

  >>> table >> GetCols(1, 1, 2) >> Print() >> Consume()
  (2, 2, 3)
  (5, 5, 6)


Flatten data
------------

Hierarchical data structures such as lists of lists frequently need to be
converted to flat structures. ``Flatten`` and ``FlatMap`` are two nuts for
flattening data.


Flatten
^^^^^^^

``Flatten`` flattens all iterables within the input and returns an iterator
over the result. For instance:

  >>> [(1, 2), (3, 4, 5), 6] >> Flatten() >> Collect()
  [1, 2, 3, 4, 5, 6]

Note that only one level is flattened. Deeper structures remain unchanged

  >>> [(1, 2), ((3, 4), 5), 6] >> Flatten() >> Collect()
  [1, 2, (3, 4), 5, 6]

but can, of course, be flattened by successive calls of ``Flatten``:

  >>> [(1, 2), ((3, 4), 5), 6] >> Flatten() >> Flatten() >> Collect()
  [1, 2, 3, 4, 5, 6]


FlatMap
^^^^^^^

A common operation is a ``Map`` followed by a ``Flatten``, and ``FlatMap`` is
a nut that provides this operation in one call. The following examples
duplicate all numbers in a list of numbers:

  >>> dup = lambda x: (x, x)
  >>> [0, 1, 2] >> Map(dup) >> Collect()
  [(0, 0), (1, 1), (2, 2)]

  >>> [0, 1, 2] >> Map(dup) >> Flatten() >> Collect()
  [0, 0, 1, 1, 2, 2]

  >>> [0, 1, 2] >> FlatMap(dup) >> Collect()
  [0, 0, 1, 1, 2, 2]
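
For readers who want to relate these two nuts to the standard library, the
one-level flattening and the map-then-flatten combination can be sketched in
plain Python. ``flatten_once`` and ``flat_map`` are hypothetical helpers, not
part of **nuts-flow**, and the sketch assumes that only lists and tuples count
as nested containers; the library's own rules may differ.

```python
from itertools import chain


def flatten_once(items):
    """Hypothetical helper: flatten exactly one level of nesting."""
    for item in items:
        if isinstance(item, (list, tuple)):
            # Unpack one level; deeper containers are yielded unchanged.
            yield from item
        else:
            # Non-container elements pass through as they are.
            yield item


def flat_map(func, items):
    """Hypothetical helper: map func over items, then flatten the results."""
    return chain.from_iterable(map(func, items))


nested = [(1, 2), ((3, 4), 5), 6]
once = list(flatten_once(nested))
twice = list(flatten_once(flatten_once(nested)))
duplicated = list(flat_map(lambda x: (x, x), [0, 1, 2]))
```

Applying ``flatten_once`` twice removes two levels of nesting, in the same way
as chaining two ``Flatten`` nuts.
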