pub struct GroupValuesColumn<const STREAMING: bool> {
schema: SchemaRef,
map: HashTable<(u64, GroupIndexView)>,
map_size: usize,
group_index_lists: Vec<Vec<usize>>,
emit_group_index_list_buffer: Vec<usize>,
vectorized_operation_buffers: VectorizedOperationBuffers,
group_values: Vec<Box<dyn GroupColumn>>,
hashes_buffer: Vec<u64>,
random_state: RandomState,
}
A GroupValues that stores multiple columns of group values
and supports vectorized operations on them
Fields
schema: SchemaRef
The output schema
map: HashTable<(u64, GroupIndexView)>
Logically maps group values to a group_index in Self::group_values and in each accumulator.
It is a hashtable based on hashbrown.
Key and value in the hashtable:
- The key is the hash value (u64) of the group value
- The value is the group values with the same hash value
We don't really store the actual group values in the hashtable;
instead we store the group indices pointing to values in GroupValues,
and we use GroupIndexView to represent such group indices in the table.
map_size: usize
The size of map in bytes
group_index_lists: Vec<Vec<usize>>
The lists of group indices with the same hash value.
Hash value collisions are possible, and we chain the group indices with the same hash value.
The chained indices look like:
latest group index -> older group index -> even older group index -> ...
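To make the map/chaining split concrete, here is a minimal, illustrative single-column sketch. All names are hypothetical: std's HashMap stands in for hashbrown's HashTable, a Vec<String> stands in for the column-wise GroupColumn storage, and the chain of same-hash group indices is modeled as a Vec per hash entry rather than a GroupIndexView.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Illustrative single-column model: the table maps a hash to the group
/// indices chained under it, while the actual values live outside the table.
struct GroupTable {
    map: HashMap<u64, Vec<usize>>, // hash -> chained group indices
    group_values: Vec<String>,     // group index -> actual group value
}

impl GroupTable {
    fn new() -> Self {
        Self { map: HashMap::new(), group_values: Vec::new() }
    }

    fn hash(value: &str) -> u64 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        h.finish()
    }

    /// Return the group index for `value`, creating a new group if needed.
    fn intern(&mut self, value: &str) -> usize {
        let hash = Self::hash(value);
        let chain = self.map.entry(hash).or_default();
        // Walk the group indices sharing this hash and compare the stored
        // values; the table itself never stores the group values.
        for &idx in chain.iter() {
            if self.group_values[idx] == value {
                return idx;
            }
        }
        // Not found: create a new group and chain its index under this hash.
        let idx = self.group_values.len();
        self.group_values.push(value.to_string());
        chain.push(idx);
        idx
    }
}

fn main() {
    let mut table = GroupTable::new();
    let groups: Vec<usize> =
        ["a", "b", "a", "c"].iter().map(|v| table.intern(v)).collect();
    println!("{groups:?}"); // → [0, 1, 0, 2]
}
```

The real implementation avoids one Vec per hash by inlining the common single-index case into the GroupIndexView and only spilling to group_index_lists on collision.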
emit_group_index_list_buffer: Vec<usize>
When emitting the first n groups, we need to decrease/erase group indices in map and group_index_lists.
This buffer is used to temporarily store the remaining group indices of a specific list in group_index_lists.
vectorized_operation_buffers: VectorizedOperationBuffers
Buffers for vectorized_append and vectorized_equal_to
group_values: Vec<Box<dyn GroupColumn>>
The actual group by values, stored column-wise. Comparisons proceed from left to right; each column is stored as a GroupColumn.
Performance tests showed that this design is faster than the more general purpose GroupValuesRows. See the ticket for details:
https://github.com/apache/datafusion/pull/12269
hashes_buffer: Vec<u64>
Reused buffer to store hashes
random_state: RandomState
Random state for creating hashes
Implementations
impl<const STREAMING: bool> GroupValuesColumn<STREAMING>
pub fn try_new(schema: SchemaRef) -> Result<Self>
Create a new instance of GroupValuesColumn if supported for the specified schema
fn scalarized_intern(
    &mut self,
    cols: &[ArrayRef],
    groups: &mut Vec<usize>,
) -> Result<()>
Scalarized intern
This is used only for streaming aggregation, because streaming aggregation
depends on the order between input rows and their corresponding group indices.
For example, assume cols contains 4 new rows (not equal to existing rows in
group_values, so new groups need to be created for them):
row1 (hash collision with existing rows)
row2
row3 (hash collision with existing rows)
row4
In scalarized_intern, their group indices will be:
row1 --> 0
row2 --> 1
row3 --> 2
row4 --> 3
The group index order agrees with the input order, and streaming aggregation
depends on this.
However, in vectorized_intern, their group indices may be:
row1 --> 2
row2 --> 0
row3 --> 3
row4 --> 1
The group index order disagrees with the input order, which would lead to
errors in streaming aggregation.
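The ordering guarantee of the scalarized path can be sketched as follows. This is a hypothetical simplification (a HashMap over single string values instead of hashed multi-column rows), but it shows why processing rows one by one assigns the i-th newly seen group the index i, in input order:

```rust
use std::collections::HashMap;

/// Illustrative sketch: interning rows one by one assigns new group
/// indices strictly in first-occurrence order.
fn scalarized_intern_sketch(rows: &[&str]) -> Vec<usize> {
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut groups = Vec::with_capacity(rows.len());
    for row in rows {
        // The next new group index is the number of groups created so far.
        let next = seen.len();
        let idx = *seen.entry(row.to_string()).or_insert(next);
        groups.push(idx);
    }
    groups
}

fn main() {
    // Four new rows get group indices 0..=3 in input order.
    let groups = scalarized_intern_sketch(&["row1", "row2", "row3", "row4"]);
    println!("{groups:?}"); // → [0, 1, 2, 3]
}
```

A vectorized pass that appends all hash-miss rows first and resolves hash-hit rows later cannot give this per-row ordering, which is why the streaming path must stay scalar.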
fn vectorized_intern(
    &mut self,
    cols: &[ArrayRef],
    groups: &mut Vec<usize>,
) -> Result<()>
Vectorized intern
This is used in non-streaming aggregation, which does not require any order
between rows in cols and the corresponding groups in group_values.
The vectorized approach can offer higher performance by avoiding row-by-row
downcasts of cols and by enabling further optimizations (like SIMD).
fn collect_vectorized_process_context(
    &mut self,
    batch_hashes: &[u64],
    groups: &mut [usize],
)
Collect the vectorized processing context by checking the hash values of cols in map:
- If the bucket is not found:
  - Build the new inlined group index view and insert it with its hash value into map
  - Add the row index to vectorized_append_row_indices
  - Set the group index for the row in groups
- If the bucket is found:
  - Add the row index to vectorized_equal_to_row_indices
  - Check whether the group index view is inlined or non_inlined: if it is inlined, add it to vectorized_equal_to_group_indices directly; otherwise get all group indices from group_index_lists and add them.
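The bucket-found/not-found split above can be sketched as a partition of the batch. This is an illustrative simplification (a HashSet of hashes stands in for the real map, and the group-index bookkeeping is omitted); the function name and return shape are hypothetical:

```rust
use std::collections::HashSet;

/// Illustrative sketch of the context-collection pass: rows whose hash is
/// not yet in the map are scheduled for a vectorized append, while rows
/// whose hash already exists are scheduled for a vectorized equal_to check.
fn collect_context(
    batch_hashes: &[u64],
    map: &mut HashSet<u64>,
) -> (Vec<usize>, Vec<usize>) {
    let mut append_rows = Vec::new();
    let mut equal_to_rows = Vec::new();
    for (row, &hash) in batch_hashes.iter().enumerate() {
        if map.insert(hash) {
            // Bucket not found: a new group will be appended for this row.
            append_rows.push(row);
        } else {
            // Bucket found: the actual values must still be compared.
            equal_to_rows.push(row);
        }
    }
    (append_rows, equal_to_rows)
}

fn main() {
    let mut map = HashSet::new();
    // Hashes for 4 input rows; rows 0 and 2 share a hash.
    let (append_rows, equal_to_rows) = collect_context(&[10, 20, 10, 30], &mut map);
    println!("append: {append_rows:?}, equal_to: {equal_to_rows:?}");
    // → append: [0, 1, 3], equal_to: [2]
}
```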
fn vectorized_append(&mut self, cols: &[ArrayRef]) -> Result<()>
Perform vectorized_append for the rows in vectorized_append_row_indices
fn vectorized_equal_to(&mut self, cols: &[ArrayRef], groups: &mut [usize])
Perform vectorized_equal_to:
- Perform vectorized_equal_to for the rows in vectorized_equal_to_row_indices and the group indices in vectorized_equal_to_group_indices.
- Check equal_to_results:
  - For rows found equal, set their group_indices in groups.
  - For rows found not equal, add them to scalarized_indices and perform scalarized_intern for them afterwards. Usually, such rows (having the same hash as but different values from existing rows) are very few.
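Consuming equal_to_results can be sketched as follows. The function name, parameter names, and the flat candidate-group list are illustrative simplifications (the real code walks inlined/non_inlined group index views):

```rust
/// Illustrative sketch: rows that matched get their group index written
/// into `groups`; rows that collided on hash but differ in value fall
/// back to `scalarized_indices` for row-by-row interning.
fn apply_equal_to_results(
    equal_to_rows: &[usize],      // row indices that had a hash match
    candidate_groups: &[usize],   // candidate group index per such row
    equal_to_results: &[bool],    // result of the vectorized value compare
    groups: &mut [usize],         // output: group index per input row
) -> Vec<usize> {
    let mut scalarized_indices = Vec::new();
    for ((&row, &group), &eq) in equal_to_rows
        .iter()
        .zip(candidate_groups)
        .zip(equal_to_results)
    {
        if eq {
            groups[row] = group;
        } else {
            scalarized_indices.push(row);
        }
    }
    scalarized_indices
}

fn main() {
    let mut groups = vec![usize::MAX; 4];
    // Rows 1 and 3 had hash matches; row 1 was truly equal, row 3 was not.
    let scalarized =
        apply_equal_to_results(&[1, 3], &[5, 7], &[true, false], &mut groups);
    println!("groups: {groups:?}, scalarized: {scalarized:?}");
}
```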
fn scalarized_intern_remaining(
    &mut self,
    cols: &[ArrayRef],
    batch_hashes: &[u64],
    groups: &mut [usize],
) -> Result<()>
It is possible that some input rows have the same hash values as existing
rows but different actual values.
We can find them in vectorized_equal_to and put them into scalarized_indices.
For these input rows, we perform scalarized_intern similar to what is done
in GroupValuesColumn.
This design keeps the process simple while still being efficient enough:
About making the process simple
Some corner cases become really easy to solve, like the following:
input row1 (same hash value as existing rows, but different value)
input row1
...
input row1
After performing vectorized_equal_to, we find multiple input rows that are
not equal to the existing rows. However, such input rows are repeated, and
only one new group should be created for them.
If we don't fall back to scalarized_intern, it is really hard to distinguish
such repeated rows in the input. If we just fall back, it is really easy to
solve, and the performance is at least not worse than before.
About performance
Hash collisions should not be frequent, so the fallback will rarely happen.
In most situations, scalarized_indices will turn out to be empty after
vectorized_equal_to finishes.
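The repeated-row corner case can be sketched as follows. This is an illustrative simplification (single string column, linear scan instead of a hash table; the function name is hypothetical), showing why the row-by-row fallback naturally creates only one new group for repeated unmatched rows:

```rust
/// Illustrative sketch of the scalar fallback: each row is compared
/// against all existing groups, including groups created earlier in this
/// very loop. That is what dedupes repeated input rows.
fn scalarized_fallback(rows: &[&str], group_values: &mut Vec<String>) -> Vec<usize> {
    let mut groups = Vec::with_capacity(rows.len());
    for row in rows {
        let idx = match group_values.iter().position(|v| v == row) {
            Some(idx) => idx,
            None => {
                // New value: create one new group for it.
                group_values.push(row.to_string());
                group_values.len() - 1
            }
        };
        groups.push(idx);
    }
    groups
}

fn main() {
    let mut group_values = vec!["x".to_string()]; // one pre-existing group
    // Three repeated rows that all failed the vectorized equal_to check.
    let groups = scalarized_fallback(&["row1", "row1", "row1"], &mut group_values);
    println!("{groups:?}"); // → [1, 1, 1]: one new group shared by all repeats
}
```

A purely vectorized pass would see all three rows as "not equal to existing groups" simultaneously and would need extra intra-batch deduplication to avoid creating three groups.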
fn scalarized_equal_to_remaining(
    &self,
    group_index_view: &GroupIndexView,
    cols: &[ArrayRef],
    row: usize,
    groups: &mut [usize],
) -> bool
Trait Implementations
impl<const STREAMING: bool> GroupValues for GroupValuesColumn<STREAMING>
fn intern(&mut self, cols: &[ArrayRef], groups: &mut Vec<usize>) -> Result<()>
Calculates the group id for each input row of cols, assigning new group ids as necessary.
fn size(&self) -> usize
Returns the memory used by this GroupValues.
fn is_empty(&self) -> bool
Returns true if this GroupValues is empty.
fn len(&self) -> usize
The number of values stored in this GroupValues.
fn clear_shrink(&mut self, batch: &RecordBatch)
Auto Trait Implementations
impl<const STREAMING: bool> Freeze for GroupValuesColumn<STREAMING>
impl<const STREAMING: bool> !RefUnwindSafe for GroupValuesColumn<STREAMING>
impl<const STREAMING: bool> Send for GroupValuesColumn<STREAMING>
impl<const STREAMING: bool> Sync for GroupValuesColumn<STREAMING>
impl<const STREAMING: bool> Unpin for GroupValuesColumn<STREAMING>
impl<const STREAMING: bool> !UnwindSafe for GroupValuesColumn<STREAMING>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true; otherwise converts self into a Right variant.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true; otherwise converts self into a Right variant.