Filtering¶
Apart from reading and writing, filtering and transforming are the most common operations within data flows. This sections presents various nuts used to filter, partition or group data.
Filter¶
A common task is to remove elements from a data flow. nuts-flow
provides Filter
and FilterFalse
for this purpose. In the
following example all number greater than five are extracted:
>>> from nutsflow import *
>>> Range(10) >> Filter(lambda x: x > 5) >> Collect()
[6, 7, 8, 9]
FilterFalse
is simply the negation of Filter
and extracts
number smaller or equal to five:
>>> Range(10) >> FilterFalse(lambda x: x > 5) >> Collect()
[0, 1, 2, 3, 4, 5]
Filter
and FilterFalse
take a predicate (Lambda) function that
must return a boolean value. If the predicate function is very simple
it can be written shorter using underscore syntax:
>>> from nutsflow import _
>>> Range(10) >> Filter(_ > 5) >> Collect()
[6, 7, 8, 9]
>>> Range(10) >> FilterFalse(_ > 5) >> Collect()
[0, 1, 2, 3, 4, 5]
Partition¶
If both ‘sides’ of a filter, the elements accepted and the elements
rejected, are wanted the Partition
nut can be used:
>>> greater, smaller = Range(10) >> Partition(_ > 5)
>>> greater >> Collect()
[6, 7, 8, 9]
>>> smaller >> Collect()
[0, 1, 2, 3, 4, 5]
>>> odd, even = Range(10) >> Partition(_ % 2)
>>> odd >> Collect()
[1, 3, 5, 7, 9]
>>> even >> Collect()
[0, 2, 4, 6, 8]
Note that Partition
returns a tuple containing two iterators.
GroupBy¶
Similar, but more powerful than Partition
is GroupBy
, which allows
to group the elements of the flow according to a key function:
>>> Range(10) >> GroupBy(_ > 5) >> Collect()
[(False, [0, 1, 2, 3, 4, 5]), (True, [6, 7, 8, 9])]
GroupBy
returns an iterator over the groups, where each group is
a tuple with the result of the key function first and the elements of
the group second. If the result of the key function is not required
the nokey flag can be set to True
:
>>> Range(10) >> GroupBy(_ > 5, nokey=True) >> Collect()
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]
In contrast to Partition
, GroupBy
is not limited to a boolean
key function. For instance, to group by the remainder of the division
by 3 simply call
>>> Range(10) >> GroupBy(_ % 3) >> Collect()
[(0, [0, 3, 6, 9]), (1, [1, 4, 7]), (2, [2, 5, 8])]
GroupBy
loads all data in memory and should be avoided for large data sets.
If the data is sorted GroupBySorted
can be used instead.
TakeWhile and DropWhile¶
Occasionally, it is necessary to run a data flow until a certain
condition is met. TakeWhile(func)
takes elements from the
iterable as long as the predicate function is true.
In the following example all number are collected until
the first negative number is encountered:
>>> [2, 1, -1, 3, 4, -1] >> TakeWhile(_ > 0) >> Collect()
[2, 1]
Similarily, DropWhile(func)
skips all elements while the predicate function
is true and returns the remainder of the iterable:
>>> [2, 1, -1, 3, 4, -1] >> DropWhile(_ > 0) >> Collect()
[-1, 3, 4, -1]