pub trait GroupValues: Send {
    // Required methods
    fn intern(
        &mut self,
        cols: &[Arc<dyn Array>],
        groups: &mut Vec<usize>,
    ) -> Result<(), DataFusionError>;
    fn size(&self) -> usize;
    fn is_empty(&self) -> bool;
    fn len(&self) -> usize;
    fn emit(
        &mut self,
        emit_to: EmitTo,
    ) -> Result<Vec<Arc<dyn Array>>, DataFusionError>;
    fn clear_shrink(&mut self, batch: &RecordBatch);
}
Stores the group values during hash aggregation.
§Background
In a query such as SELECT a, b, count(*) FROM t GROUP BY a, b, the group values
identify each group, and correspond to all the distinct values of (a,b).
-- Input has 4 rows with 3 distinct combinations of (a,b) ("groups")
create table t(a int, b varchar)
as values (1, 'a'), (2, 'b'), (1, 'a'), (3, 'c');
select a, b, count(*) from t group by a, b;
----
1 a 2
2 b 1
3 c 1
§Design
Managing group values is a performance critical operation in hash aggregation. The major operations are:
- Intern: Quickly finding existing and adding new group values
- Emit: Returning the group values as an array
There are multiple specialized implementations of this trait, each optimized
for these operations on particular data types and numbers of columns.
See new_group_values for details.
§Group Ids
Each distinct group in a hash aggregation is identified by a unique group id (usize), which is assigned by instances of this trait. Group ids are contiguous, starting from 0, with no gaps.
§Required Methods
fn intern(
    &mut self,
    cols: &[Arc<dyn Array>],
    groups: &mut Vec<usize>,
) -> Result<(), DataFusionError>
Calculates the group id for each input row of cols, assigning new
group ids as necessary.
When the function returns, groups must contain the group id for each
row in cols.
If a row has the same value as a previous row, the same group id is assigned. If a row has a new value, the next available group id is assigned.
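The id assignment rule above can be sketched with a plain hash map. This is only a toy model, not one of the actual implementations: `ToyGroupValues` is a hypothetical type, rows are Rust tuples rather than Arrow arrays, and clearing `groups` on entry is an assumption made so that it ends up holding exactly one id per input row.

```rust
use std::collections::HashMap;

/// Toy stand-in for a `GroupValues` implementation over rows of `(i32, String)`.
/// Hypothetical: real implementations operate on Arrow arrays, not Rust tuples.
#[derive(Default)]
pub struct ToyGroupValues {
    /// Maps each distinct group value to its group id.
    map: HashMap<(i32, String), usize>,
    /// Distinct group values, indexed by group id.
    values: Vec<(i32, String)>,
}

impl ToyGroupValues {
    /// Writes the group id of every row into `groups`, assigning the next
    /// available id to values that have not been seen before.
    pub fn intern(&mut self, rows: &[(i32, String)], groups: &mut Vec<usize>) {
        // Assumption: start from an empty output so it holds one id per row.
        groups.clear();
        for row in rows {
            // The next free id equals the number of groups stored so far,
            // which keeps ids gap-free and starting from 0.
            let next_id = self.values.len();
            let id = *self.map.entry(row.clone()).or_insert(next_id);
            if id == next_id {
                // First time this value is seen: remember it under the new id.
                self.values.push(row.clone());
            }
            groups.push(id);
        }
    }

    /// Number of distinct group values stored so far.
    pub fn len(&self) -> usize {
        self.values.len()
    }
}
```

Feeding the four rows from the Background example, (1, 'a'), (2, 'b'), (1, 'a'), (3, 'c'), through this sketch gives the group ids 0, 1, 0, 2 and three stored groups. Ids persist across calls, so a later batch containing (3, 'c') maps to id 2 again.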
fn size(&self) -> usize
Returns the number of bytes of memory used by this GroupValues.
fn is_empty(&self) -> bool
Returns true if this GroupValues is empty.
fn len(&self) -> usize
Returns the number of distinct group values stored in this GroupValues.
fn emit(
    &mut self,
    emit_to: EmitTo,
) -> Result<Vec<Arc<dyn Array>>, DataFusionError>
Emits the group values selected by emit_to, returned as a vector of arrays.
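As a rough illustration of emitting, the sketch below models a single-column store that hands back either all group values or only a leading prefix. Everything here is an assumption for illustration: the text does not describe the variants of EmitTo, so `ToyEmitTo::All` and `ToyEmitTo::First(n)` are hypothetical stand-ins, and shifting the remaining values down after a partial emit is simply one way to keep group ids gap-free and starting from 0.

```rust
/// Hypothetical emit request: everything, or only the first `n` groups.
pub enum ToyEmitTo {
    All,
    First(usize),
}

/// Toy store of distinct group values for a single `i32` column,
/// indexed by group id. Hypothetical: not an Arrow-backed implementation.
#[derive(Default)]
pub struct ToyGroupValues {
    values: Vec<i32>,
}

impl ToyGroupValues {
    /// Stores a value that is assumed to be new and returns its group id.
    pub fn push(&mut self, value: i32) -> usize {
        self.values.push(value);
        self.values.len() - 1
    }

    /// Removes and returns the requested group values.
    pub fn emit(&mut self, emit_to: ToyEmitTo) -> Vec<i32> {
        match emit_to {
            // Hand back every stored value and leave the store empty.
            ToyEmitTo::All => std::mem::take(&mut self.values),
            // Hand back the first `n` values; the rest shift down, so the
            // value that had group id `n` now has group id 0.
            ToyEmitTo::First(n) => {
                let n = n.min(self.values.len());
                self.values.drain(..n).collect()
            }
        }
    }

    /// Number of group values still stored.
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// True if no group values are stored.
    pub fn is_empty(&self) -> bool {
        self.values.is_empty()
    }
}
```

In this model a partial emit changes the ids of the groups that remain, so any per-group state kept alongside the group values would need to be shifted by the same amount.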
fn clear_shrink(&mut self, batch: &RecordBatch)
Clears the contents and shrinks the capacity to the size of the batch, freeing memory.
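A minimal sketch of the clear-then-shrink idea, assuming the store is backed by a `Vec` and that "the size of the batch" means its row count. `ToyGroupValues` and the `batch_rows` parameter are hypothetical; the real method receives a `&RecordBatch`.

```rust
/// Toy store of group values backed by a `Vec`.
/// Hypothetical: stands in for whatever buffers a real implementation owns.
#[derive(Default)]
pub struct ToyGroupValues {
    values: Vec<i64>,
}

impl ToyGroupValues {
    /// Stores one more group value.
    pub fn push(&mut self, value: i64) {
        self.values.push(value);
    }

    /// Drops all stored values, then releases capacity beyond `batch_rows`.
    /// `batch_rows` is assumed to be the row count of the incoming batch.
    pub fn clear_shrink(&mut self, batch_rows: usize) {
        self.values.clear();
        // `shrink_to` keeps at least `batch_rows` slots allocated, so the
        // next batch of that size can be stored without reallocating.
        self.values.shrink_to(batch_rows);
    }

    /// Number of stored values.
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Currently allocated capacity, in elements.
    pub fn capacity(&self) -> usize {
        self.values.capacity()
    }
}
```

Shrinking to the batch size rather than to zero is a trade-off: memory held for a large number of groups is returned, while enough room is kept for the next batch.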