Module row_filter

Utilities for pushing down DataFusion filter predicates (any DataFusion PhysicalExpr that evaluates to a [BooleanArray]) to the parquet decoder level in arrow-rs.

DataFusion will use a ParquetRecordBatchStream to read data from parquet into [RecordBatch]es.

The ParquetRecordBatchStream takes an optional RowFilter, which is itself a Vec of Box<dyn ArrowPredicate>. During decoding, the predicates are evaluated in order to generate a mask that is used to skip decoding rows in the projected columns that do not pass the filter. Skipping those rows can significantly reduce the amount of compute required for decoding and thus improve query performance.
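
For reference, this is how a RowFilter plugs into the arrow-rs reader. The following is a minimal sketch, not this module's code: it uses the synchronous ParquetRecordBatchReaderBuilder to stay self-contained (the async ParquetRecordBatchStreamBuilder accepts a RowFilter the same way), and the file name data.parquet and the Int64 first leaf column are illustrative assumptions.

```rust
use std::fs::File;

use arrow::array::Int64Array;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative file; the first leaf column is assumed to be Int64.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Only the columns named in the ProjectionMask are decoded in order to
    // evaluate the predicate.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    // An ArrowPredicate that keeps rows whose value is greater than 10.
    let predicate =
        ArrowPredicateFn::new(mask, |batch| gt(batch.column(0), &Int64Array::new_scalar(10)));

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} rows passed the filter", batch?.num_rows());
    }
    Ok(())
}
```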

Since the predicates are applied serially in the order defined in the RowFilter, the optimal ordering depends on the exact filters. The best filters to execute first have two properties:

  1. They are relatively inexpensive to evaluate (e.g. they read column chunks which are relatively small)

  2. They filter many (contiguous) rows, reducing the amount of decoding required for subsequent filters and projected columns
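
For example, given small_int_col = 5 AND large_str_col LIKE '%foo%' (illustrative column names), evaluating small_int_col = 5 first is usually preferable: its column chunk is cheap to decode, and if it is selective it spares the LIKE predicate from decoding most of the string data.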

If requested, this code will reorder the filters based on heuristics to try to reduce the evaluation cost.

The basic algorithm for constructing the RowFilter is as follows:

  1. Break conjunctions into separate predicates. An expression like a = 1 AND (b = 2 AND c = 3) would be separated into the expressions a = 1, b = 2, and c = 3.
  2. Determine whether each predicate can be evaluated as an ArrowPredicate.
  3. Determine, for each predicate, the total compressed size of all columns required to evaluate the predicate.
  4. Determine, for each predicate, whether all columns required to evaluate the expression are sorted.
  5. Re-order the predicates by total size (from step 3).
  6. Partition the predicates according to whether they are sorted (from step 4).
  7. “Compile” each predicate Expr to a DatafusionArrowPredicate.
  8. Build the RowFilter with the sorted predicates followed by the unsorted predicates. Within each partition, predicates are still sorted by size (see the sketch below).
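
The ordering and partitioning steps (5, 6, and 8) amount to a stable sort by size followed by a partition on sortedness. A minimal sketch, using a hypothetical Candidate struct in place of this module's FilterCandidate:

```rust
/// Hypothetical stand-in for this module's FilterCandidate, reduced to the
/// attributes the ordering heuristic uses.
struct Candidate {
    name: &'static str,
    required_bytes: usize, // total compressed size of required columns (step 3)
    all_sorted: bool,      // whether all required columns are sorted (step 4)
}

/// Steps 5, 6, and 8: stable-sort by size, then run predicates over sorted
/// columns first, preserving the size order within each group.
fn order_candidates(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by_key(|c| c.required_bytes); // step 5 (stable sort)
    let (sorted, unsorted): (Vec<_>, Vec<_>) =
        candidates.into_iter().partition(|c| c.all_sorted); // step 6
    sorted.into_iter().chain(unsorted).collect() // step 8
}

fn main() {
    let ordered = order_candidates(vec![
        Candidate { name: "c = 3", required_bytes: 4_000, all_sorted: false },
        Candidate { name: "a = 1", required_bytes: 1_000, all_sorted: true },
        Candidate { name: "b = 2", required_bytes: 2_000, all_sorted: false },
    ]);
    let names: Vec<_> = ordered.iter().map(|c| c.name).collect();
    println!("{}", names.join(", ")); // prints: a = 1, b = 2, c = 3
}
```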

StructsΒ§

DatafusionArrowPredicate πŸ”’
A β€œcompiled” predicate passed to ParquetRecordBatchStream to perform row-level filtering during parquet decoding.
FilterCandidate πŸ”’
A candidate expression for creating a RowFilter.
FilterCandidateBuilder πŸ”’
Helper to build a FilterCandidate.
PushdownChecker 🔒
Helper to determine whether a predicate expression can be pushed down to the parquet decoder.

FunctionsΒ§

build_row_filter
Build a [RowFilter] from the given predicate Expr if possible.
can_expr_be_pushed_down_with_schemas
Recurses through expr as a tree, finds all of its columns, and checks whether any of them would prevent the expression from being pushed down as a predicate. If any would, this returns false; otherwise, true. Note that the schema passed in here is not the physical file schema (which is not available at that point in time); it is the schema of the table that this expression is being evaluated against, minus any projected columns and partition columns.
columns_sorted πŸ”’
For a given set of Columns required for a predicate Expr, determine whether all columns are sorted.
pushdown_columns πŸ”’
size_of_columns πŸ”’
Calculate the total compressed size of all Columns required for the predicate Expr.