Module symmetric_hash_join


This file implements the symmetric hash join algorithm with range-based data pruning to join two (potentially infinite) streams.

A SymmetricHashJoinExec plan takes two child plans (with appropriate output ordering) and produces the join output according to the given join type and other options.

This plan uses the OneSideHashJoiner object to facilitate join calculations for both its children.
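
The idea of a symmetric hash join with range-based pruning can be illustrated with a small, self-contained sketch. It is not the implementation in this module: `Row`, `Side`, `SideBuffer`, `SymmetricJoin` and the `window` parameter are hypothetical names chosen for illustration, rows are plain structs rather than record batches, and the range condition is assumed to be "timestamps differ by at most `window`" on inputs that arrive ordered by timestamp.

```rust
use std::collections::HashMap;

/// One streamed row: a join key plus an ordered column (a timestamp here).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    key: u32,
    ts: i64,
}

/// Which input a row arrived on.
#[derive(Debug, Clone, Copy)]
enum Side {
    Left,
    Right,
}

/// Hypothetical per-side state: buffered rows indexed by join key.
#[derive(Default)]
struct SideBuffer {
    by_key: HashMap<u32, Vec<Row>>,
}

impl SideBuffer {
    fn insert(&mut self, row: Row) {
        self.by_key.entry(row.key).or_default().push(row);
    }

    /// Buffered rows with the same key whose timestamp is within `window` of the probe row.
    fn probe(&self, probe: Row, window: i64) -> Vec<Row> {
        match self.by_key.get(&probe.key) {
            Some(rows) => rows
                .iter()
                .copied()
                .filter(|r| (r.ts - probe.ts).abs() <= window)
                .collect(),
            None => Vec::new(),
        }
    }

    /// Drop rows older than `min_ts`; they can no longer satisfy the range condition.
    fn prune(&mut self, min_ts: i64) {
        self.by_key.retain(|_, rows| {
            rows.retain(|r| r.ts >= min_ts);
            !rows.is_empty()
        });
    }

    fn len(&self) -> usize {
        self.by_key.values().map(Vec::len).sum()
    }
}

/// Hypothetical join state: one buffer per input, plus the window size.
struct SymmetricJoin {
    left: SideBuffer,
    right: SideBuffer,
    window: i64,
}

impl SymmetricJoin {
    fn new(window: i64) -> Self {
        Self { left: SideBuffer::default(), right: SideBuffer::default(), window }
    }

    /// Handle one incoming row: buffer it on its own side, probe the opposite
    /// side, then prune the opposite side. Assumes each input is ordered by `ts`,
    /// so later rows on `side` cannot match anything older than `row.ts - window`.
    fn push(&mut self, side: Side, row: Row) -> Vec<(Row, Row)> {
        let window = self.window;
        let (own, other) = match side {
            Side::Left => (&mut self.left, &mut self.right),
            Side::Right => (&mut self.right, &mut self.left),
        };
        own.insert(row);
        let matches = other.probe(row, window);
        other.prune(row.ts - window);
        // Always emit pairs in (left, right) order.
        matches
            .into_iter()
            .map(|m| match side {
                Side::Left => (row, m),
                Side::Right => (m, row),
            })
            .collect()
    }
}
```

In this sketch each side acts as the build side for its own rows and as the probe side for the other input, which is what makes the join symmetric. Pruning relies entirely on the ordering assumption: without ordered inputs, a discarded row could still have matched a later arrival.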

Structs

OneSideHashJoiner
SymmetricHashJoinExec
A symmetric hash join with range conditions hashes both streams on the join key and uses the resulting hash tables to join them. The join is symmetric because a hash table is built on the join keys of each stream, and rows are matched on the join-key values of both streams. This is efficient in a streaming context for two reasons: matches are found by fast hash-table lookups rather than by scanning one or both streams, and only elements that fall within a certain sliding window (given by the range conditions) are considered, so stale data is less likely to be retained. Pruning keeps the buffered state from growing indefinitely, which makes it possible to operate on unbounded streaming data.
SymmetricHashJoinStream 🔒
A stream that issues [RecordBatch]es as they arrive from the right of the join.

Enums

SHJStreamState
Represents the various states of a symmetric hash join stream operation.

Constants

HASHMAP_SHRINK_SCALE_FACTOR 🔒

Functions

build_side_determined_results 🔒
This function produces unmatched record results based on the build side, join type and other parameters.
calculate_indices_by_join_type 🔒
Calculate indices by join type.
determine_prune_length 🔒
Determine the pruning length for buffer.
join_with_probe_batch 🔒
This method performs a join between the build side input buffer and the probe side batch.
lookup_join_hashmap 🔒
This method performs lookups against JoinHashMap by hash values of join-key columns, and handles potential hash collisions.
need_to_produce_result_in_final 🔒
This method determines if the result of the join should be produced in the final step or not.
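
The description of lookup_join_hashmap mentions looking up by hash value and handling hash collisions. A minimal sketch of that general technique follows; it does not use JoinHashMap or any type from this module. `weak_hash`, `build_index` and `lookup` are hypothetical names, keys are plain strings, and the hash function is deliberately poor (string length) so that collisions are easy to see.

```rust
use std::collections::HashMap;

/// Deliberately weak hash (string length), used only to force collisions.
fn weak_hash(s: &str) -> u64 {
    s.len() as u64
}

/// Index build-side rows by hash value only; the keys themselves are not stored.
fn build_index(keys: &[&str], hash: fn(&str) -> u64) -> HashMap<u64, Vec<usize>> {
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        index.entry(hash(key)).or_default().push(row);
    }
    index
}

/// Return the build-side row indices whose key equals `probe_key`.
/// Candidates found by hash are re-checked for real equality, which
/// discards rows that merely collide with the probe key's hash.
fn lookup(
    build_keys: &[&str],
    index: &HashMap<u64, Vec<usize>>,
    probe_key: &str,
    hash: fn(&str) -> u64,
) -> Vec<usize> {
    match index.get(&hash(probe_key)) {
        Some(candidates) => candidates
            .iter()
            .copied()
            .filter(|&row| build_keys[row] == probe_key)
            .collect(),
        None => Vec::new(),
    }
}
```

With build keys `["ab", "cd", "ab", "xyz"]`, all three two-character keys share a hash bucket, yet probing with `"ab"` yields only rows 0 and 2 because the equality check filters out `"cd"`.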