This file implements the symmetric hash join algorithm with range-based data pruning to join two (potentially infinite) streams.
A SymmetricHashJoinExec plan takes two child plans (with appropriate
output ordering) and produces the join output according to the given join
type and other options.
This plan uses a OneSideHashJoiner object for each of its children to
facilitate the join calculations.
Structs
- OneSideHashJoiner
- SymmetricHashJoinExec - In a symmetric hash join with range conditions, both streams are hashed on the join key and the resulting hash tables are used to join the streams. The join is considered symmetric because a hash table is built on the join keys of both streams, and rows are matched based on the values of the join keys in both streams. This type of join is efficient in a streaming context for two reasons. First, it allows fast lookups in the hash tables instead of scanning one or both streams to find matching rows. Second, because of the range conditions, it only considers the elements of each stream that fall within a certain sliding window, which makes it less likely to store stale data. Together, these properties make it possible to operate on unbounded streaming data without running into memory issues.
- SymmetricHashJoinStream 🔒 - A stream that issues [RecordBatch]es as they arrive from the right of the join.
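The symmetric probing and range-based pruning described above can be illustrated with a minimal, self-contained sketch. It is not the SymmetricHashJoinExec implementation: the Row, Side, SideBuffer and SymmetricJoin types are hypothetical, rows arrive one at a time instead of in [RecordBatch]es, and the range condition is assumed to be |left.ts - right.ts| <= window, with both inputs ordered ascending by ts.

```rust
use std::collections::HashMap;

/// A row in this sketch: a join key plus an ordered column (e.g. a timestamp).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    key: u32,
    ts: i64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    Left,
    Right,
}

/// Hypothetical stand-in for the state buffered for one input side.
#[derive(Default)]
struct SideBuffer {
    table: HashMap<u32, Vec<Row>>,
}

impl SideBuffer {
    fn insert(&mut self, row: Row) {
        self.table.entry(row.key).or_default().push(row);
    }

    /// Drop every buffered row whose `ts` is below `min_ts`.
    fn prune(&mut self, min_ts: i64) {
        self.table.retain(|_, rows| {
            rows.retain(|r| r.ts >= min_ts);
            !rows.is_empty()
        });
    }

    fn len(&self) -> usize {
        self.table.values().map(Vec::len).sum()
    }
}

/// Assumed join condition: equal keys and |left.ts - right.ts| <= window.
struct SymmetricJoin {
    window: i64,
    left: SideBuffer,
    right: SideBuffer,
}

impl SymmetricJoin {
    fn new(window: i64) -> Self {
        Self { window, left: SideBuffer::default(), right: SideBuffer::default() }
    }

    /// Process one incoming row and return the (left, right) pairs it produces.
    fn push(&mut self, side: Side, row: Row) -> Vec<(Row, Row)> {
        let window = self.window;
        let (own, other) = match side {
            Side::Left => (&mut self.left, &mut self.right),
            Side::Right => (&mut self.right, &mut self.left),
        };

        // 1. Probe the opposite side's hash table with the incoming row.
        let mut out = Vec::new();
        if let Some(candidates) = other.table.get(&row.key) {
            for c in candidates {
                if (c.ts - row.ts).abs() <= window {
                    out.push(match side {
                        Side::Left => (row, *c),
                        Side::Right => (*c, row),
                    });
                }
            }
        }

        // 2. Buffer the row so later rows from the opposite side can find it.
        own.insert(row);

        // 3. Prune the opposite side. Because this side is ordered by `ts`,
        //    its future rows have ts >= row.ts, so opposite-side rows with
        //    ts < row.ts - window can never match again.
        other.prune(row.ts - window);
        out
    }
}
```

With a window of 5, pushing a left row at ts 10 and then a right row with the same key at ts 12 yields one pair. A later left row at ts 30 evicts every buffered right row with ts below 25, so the buffers stay bounded as long as both inputs keep advancing.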
Enums
- SHJStreamState - Represents the various states of a symmetric hash join stream operation.
Constants
Functions
- build_side_determined_results 🔒 - This function produces unmatched record results based on the build side, join type and other parameters.
- calculate_indices_by_join_type 🔒 - Calculate indices by join type.
- determine_prune_length 🔒 - Determine the pruning length for buffer.
- join_with_probe_batch 🔒 - This method performs a join between the build side input buffer and the probe side batch.
- lookup_join_hashmap 🔒 - This method performs lookups against JoinHashMap by hash values of join-key columns, and handles potential hash collisions.
- need_to_produce_result_in_final 🔒 - This method determines if the result of the join should be produced in the final step or not.
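The idea behind a hash-value lookup that also handles collisions can be sketched as follows. This is an illustration only, not the JoinHashMap API: weak_hash, build_hashmap and lookup are hypothetical names, keys are plain strings, and the hash function is deliberately weak so that collisions are easy to trigger.

```rust
use std::collections::HashMap;

/// Hypothetical, deliberately weak hash: keys of equal length collide.
fn weak_hash(key: &str) -> u64 {
    key.len() as u64
}

/// Map each hash value to the indices of the build-side rows that produced it.
fn build_hashmap(build_keys: &[&str]) -> HashMap<u64, Vec<usize>> {
    let mut map: HashMap<u64, Vec<usize>> = HashMap::new();
    for (idx, key) in build_keys.iter().enumerate() {
        map.entry(weak_hash(key)).or_default().push(idx);
    }
    map
}

/// Return (build_idx, probe_idx) pairs whose keys are truly equal.
fn lookup(
    map: &HashMap<u64, Vec<usize>>,
    build_keys: &[&str],
    probe_keys: &[&str],
) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for (probe_idx, probe_key) in probe_keys.iter().enumerate() {
        // Step 1: find candidate build rows by hash value only.
        if let Some(candidates) = map.get(&weak_hash(probe_key)) {
            for &build_idx in candidates {
                // Step 2: compare the actual key values to discard rows that
                // share a hash value but not a key (hash collisions).
                if build_keys[build_idx] == *probe_key {
                    matches.push((build_idx, probe_idx));
                }
            }
        }
    }
    matches
}
```

For example, with build keys "ab", "cd", "xyz" and probe keys "cd", "zz", "xyz", the probe key "zz" collides with both "ab" and "cd" on hash value but is rejected by the equality check, so only the true matches remain.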