Module joins

DataFusion Join implementations

Modules

cross_join 🔒
Defines the cross join plan for loading the left side of the cross join and producing batches in parallel for the right partitions
hash_join 🔒
HashJoinExec: partitioned hash join operator
join_filter 🔒
join_hash_map 🔒
This file contains the implementation of the JoinHashMap struct, which stores the mapping from hash values, computed from the build-side “on” values, to the list of row indices that have that key value.
nested_loop_join 🔒
NestedLoopJoinExec: joins without equijoin (equality predicates).
piecewise_merge_join 🔒
PiecewiseMergeJoin is currently experimental
sort_merge_join 🔒
Sort Merge Join Execution Plan Operator
stream_join_utils 🔒
This file contains common subroutines for symmetric hash join related functionality, used both in join calculations and optimization rules.
symmetric_hash_join 🔒
This file implements the symmetric hash join algorithm with range-based data pruning to join two (potentially infinite) streams.
utils
Join related functionality used both on logical and physical plans
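
The join_hash_map entry above describes a mapping from hash values to lists of build-side row indices. One compact way to store such a mapping is a chained index list. The following is a minimal, standard-library-only sketch of that idea; the name `ChainedIndexMap` and its fields are hypothetical and are not the DataFusion API, and the real JoinHashMap may differ in layout and details.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a join hash map (not the DataFusion type).
/// It maps a hash value to a chain of build-side row indices.
struct ChainedIndexMap {
    /// hash value -> (index of the most recently inserted row with that hash) + 1
    head: HashMap<u64, u64>,
    /// next[i] = (index of the previous row with the same hash) + 1, or 0 at the end of a chain
    next: Vec<u64>,
}

impl ChainedIndexMap {
    /// Builds the map from the hash of each build-side row's "on" values.
    fn build(hashes: &[u64]) -> Self {
        let mut head = HashMap::new();
        let mut next = vec![0u64; hashes.len()];
        for (row, &hash) in hashes.iter().enumerate() {
            // `insert` returns the previous chain head for this hash, if any.
            if let Some(prev) = head.insert(hash, row as u64 + 1) {
                next[row] = prev;
            }
        }
        Self { head, next }
    }

    /// Returns the build-side row indices stored under `hash`, most recent first.
    /// Rows with different keys can share a hash, so a caller would still
    /// need to compare the actual key values.
    fn indices(&self, hash: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head.get(&hash).copied().unwrap_or(0);
        while cur != 0 {
            let row = (cur - 1) as usize;
            out.push(row);
            cur = self.next[row];
        }
        out
    }
}
```

Storing `index + 1` lets 0 act as the end-of-chain marker, so all chains fit in one flat vector rather than one allocation per key.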

Structs

CrossJoinExec
Cross Join Execution Plan
HashJoinExec
Join execution plan: Evaluates equijoin predicates in parallel on multiple partitions using a hash table and an optional filter list to apply post join.
NestedLoopJoinExec
NestedLoopJoinExec is a build-probe join operator designed for joins that do not have equijoin keys in their ON clause.
PiecewiseMergeJoinExec
PiecewiseMergeJoinExec is a join execution plan that evaluates only a single range filter and shows much better performance for these workloads than NestedLoopJoin.
SortMergeJoinExec
Join execution plan that executes equi-join predicates on multiple partitions using Sort-Merge join algorithm and applies an optional filter post join. Can be used to join arbitrarily large inputs where one or both of the inputs don't fit in the available memory.
SymmetricHashJoinExec
A symmetric hash join with range conditions hashes both streams on the join key and uses the resulting hash tables to join the streams. The join is considered symmetric because a hash table is built on the join keys of both streams, and rows are matched on the values of the join keys in both streams. This type of join is efficient in a streaming context because it uses fast hash table lookups rather than scanning one or both of the streams to find matching rows. It also considers only the elements of each stream that fall within a certain sliding window (given by the range conditions), which makes it more efficient and less likely to store stale data. This enables operating on unbounded streaming data without unbounded memory growth.
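
The symmetric hash join description above can be illustrated with a small, standard-library-only sketch. Everything here is an assumption made for illustration: the `Row`, `Side`, and `SymmetricJoin` names are hypothetical, the range condition is taken to be `|left.ts - right.ts| <= window`, and each stream is assumed to arrive in non-decreasing `ts` order. It is not how SymmetricHashJoinExec is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical row: a join key plus a timestamp used by the range condition.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Row {
    key: u32,
    ts: i64,
}

#[derive(Clone, Copy)]
enum Side {
    Left,
    Right,
}

/// Illustrative symmetric hash join: one hash table per input stream.
struct SymmetricJoin {
    window: i64,
    left: HashMap<u32, Vec<Row>>,
    right: HashMap<u32, Vec<Row>>,
}

impl SymmetricJoin {
    fn new(window: i64) -> Self {
        Self { window, left: HashMap::new(), right: HashMap::new() }
    }

    /// Accepts one row from `side` and returns the (left, right) pairs it produces.
    fn push(&mut self, side: Side, row: Row) -> Vec<(Row, Row)> {
        let window = self.window;
        let (own, other) = match side {
            Side::Left => (&mut self.left, &mut self.right),
            Side::Right => (&mut self.right, &mut self.left),
        };
        // Prune: with ordered streams, rows on the other side older than
        // `row.ts - window` can never match this or any later row from this side.
        // (A full scan keeps the sketch short; it is not efficient.)
        for rows in other.values_mut() {
            rows.retain(|r| r.ts >= row.ts - window);
        }
        other.retain(|_, rows| !rows.is_empty());
        // Probe the other side's hash table with the new row.
        let mut out = Vec::new();
        if let Some(candidates) = other.get(&row.key) {
            for &c in candidates {
                if (c.ts - row.ts).abs() <= window {
                    out.push(match side {
                        Side::Left => (row, c),
                        Side::Right => (c, row),
                    });
                }
            }
        }
        // Insert into this side's table so later rows from the other side can find it.
        own.entry(row.key).or_default().push(row);
        out
    }

    /// Number of rows currently buffered across both hash tables.
    fn buffered(&self) -> usize {
        self.left.values().chain(self.right.values()).map(Vec::len).sum()
    }
}
```

In this sketch, each arriving row first prunes the opposite table, then probes it, then is inserted into its own table. Pruning is what keeps the buffered state bounded by the window rather than by the length of the streams.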

Enums

PartitionMode
Hash join Partitioning mode
StreamJoinPartitionMode
Partitioning mode to use for symmetric hash join

Type Aliases

JoinOn
The on clause of the join, as a vector of (left, right) columns.
JoinOnRef
Reference for JoinOn.
SharedBitmapBuilder 🔒
Shared bitmap for visited left-side indices
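
A shared bitmap of visited left-side indices is typically useful when several probe partitions run concurrently and the join must later emit the left rows that never matched (for example in a left or full outer join). The sketch below shows that pattern with the standard library only. The `VisitedBitmap` alias and the helper functions are hypothetical stand-ins, and a plain `Vec<bool>` behind a mutex is an assumption rather than the type behind SharedBitmapBuilder.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a shared visited bitmap: one flag per left-side row.
type VisitedBitmap = Arc<Mutex<Vec<bool>>>;

/// Marks the left-side rows that one probe partition matched.
fn mark_visited(bitmap: &VisitedBitmap, matched_left: &[usize]) {
    let mut bits = bitmap.lock().unwrap();
    for &i in matched_left {
        bits[i] = true;
    }
}

/// Returns the left-side row indices that no partition matched, in ascending order.
fn unmatched_left(bitmap: &VisitedBitmap) -> Vec<usize> {
    let bits = bitmap.lock().unwrap();
    bits.iter()
        .enumerate()
        .filter_map(|(i, &visited)| (!visited).then_some(i))
        .collect()
}

/// Runs one thread per probe partition, then collects the unmatched left rows.
fn run_probes(left_rows: usize, matches_per_partition: Vec<Vec<usize>>) -> Vec<usize> {
    let bitmap: VisitedBitmap = Arc::new(Mutex::new(vec![false; left_rows]));
    let handles: Vec<_> = matches_per_partition
        .into_iter()
        .map(|matched| {
            let shared = Arc::clone(&bitmap);
            thread::spawn(move || mark_visited(&shared, &matched))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    unmatched_left(&bitmap)
}
```

The unmatched rows are only read after every partition has finished, because a left row counts as unmatched only if no partition marked it.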