ArrowBytesMap

Struct ArrowBytesMap 

Source
pub struct ArrowBytesMap<O, V>
where O: OffsetSizeTrait, V: Debug + PartialEq + Eq + Clone + Copy + Default,
{ output_type: OutputType, map: HashTable<Entry<O, V>>, map_size: usize, buffer: BufferBuilder<u8>, offsets: Vec<O>, random_state: RandomState, hashes_buffer: Vec<u64>, null: Option<(V, usize)>, }
Expand description

Optimized map for storing Arrow “bytes” types (String, LargeString, Binary, and LargeBinary) values that can produce the set of keys on output as GenericBinaryArray without copies.

Equivalent to HashSet<String, V> but with better performance if you need to emit the keys as an Arrow StringArray / BinaryArray. For other purposes it is the same as a HashMap<String, V>

§Generic Arguments

  • O: OffsetSize (String/LargeString)
  • V: payload type

§Description

This is a specialized HashMap with the following properties:

  1. Optimized for storing and emitting Arrow byte types (e.g. StringArray / BinaryArray) very efficiently by minimizing copying of the string values themselves, both when inserting and when emitting the final array.

  2. Retains the insertion order of entries in the final array. The values are in the same order as they were inserted.

Note this structure can be used as a HashSet by specifying the value type as (), as is done by ArrowBytesSet.

This map is used by the special COUNT DISTINCT aggregate function to store the distinct values, and by the GROUP BY operator to store group values when they are a single string array.

§Example

The following diagram shows how the map would store the four strings “Foo”, NULL, “Bar”, “TheQuickBrownFox”:

  • hashtable stores entries for each distinct string that has been inserted. The entries contain the payload as well as information about the value (either an offset or the actual bytes, see Entry docs for more details)

  • offsets stores offsets into buffer for each distinct string value, following the same convention as the offsets in a StringArray or LargeStringArray.

  • buffer stores the actual byte data

  • null: stores the index and payload of the null value, in this case the second value (index 1)

┌───────────────────────────────────┐    ┌─────┐    ┌────┐
│                ...                │    │  0  │    │FooB│
│ ┌──────────────────────────────┐  │    │  0  │    │arTh│
│ │      <Entry for "Bar">       │  │    │  3  │    │eQui│
│ │            len: 3            │  │    │  3  │    │ckBr│
│ │   offset_or_inline: "Bar"    │  │    │  6  │    │ownF│
│ │         payload:...          │  │    │     │    │ox  │
│ └──────────────────────────────┘  │    │     │    │    │
│                ...                │    └─────┘    └────┘
│ ┌──────────────────────────────┐  │
│ │<Entry for "TheQuickBrownFox">│  │    offsets    buffer
│ │           len: 16            │  │
│ │     offset_or_inline: 6      │  │    ┌───────────────┐
│ │         payload: ...         │  │    │    Some(1)    │
│ └──────────────────────────────┘  │    │ payload: ...  │
│                ...                │    └───────────────┘
└───────────────────────────────────┘
                                             null
              HashTable

§Entry Format

Entries stored in a ArrowBytesMap represents a value that is either stored inline or in the buffer

This helps the case where there are many short (less than 8 bytes) strings that are the same (e.g. “MA”, “CA”, “NY”, “TX”, etc)

                                                               ┌──────────────────┐
                                                 ─ ─ ─ ─ ─ ─ ─▶│...               │
                                                │              │TheQuickBrownFox  │
                                                               │...               │
                                                │              │                  │
                                                               └──────────────────┘
                                                │               buffer of u8

                                                │
                       ┌────────────────┬───────────────┬───────────────┐
 Storing               │                │ starting byte │  length, in   │
 "TheQuickBrownFox"    │   hash value   │   offset in   │  bytes (not   │
 (long string)         │                │    buffer     │  characters)  │
                       └────────────────┴───────────────┴───────────────┘
                             8 bytes          8 bytes       4 or 8


                        ┌───────────────┬─┬─┬─┬─┬─┬─┬─┬─┬───────────────┐
Storing "foobar"        │               │ │ │ │ │ │ │ │ │  length, in   │
(short string)          │  hash value   │?│?│f│o│o│b│a│r│  bytes (not   │
                        │               │ │ │ │ │ │ │ │ │  characters)  │
                        └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┴───────────────┘
                             8 bytes         8 bytes        4 or 8

Fields§

§output_type: OutputType

Should the output be String or Binary?

§map: HashTable<Entry<O, V>>

Underlying hash set for each distinct value

§map_size: usize

Total size of the map in bytes

§buffer: BufferBuilder<u8>

In progress arrow Buffer containing all values

§offsets: Vec<O>

Offsets into buffer for each distinct value. These offsets as used directly to create the final GenericBinaryArray. The ith string is stored in the range offsets[i]..offsets[i+1] in buffer. Null values are stored as a zero length string.

§random_state: RandomState

random state used to generate hashes

§hashes_buffer: Vec<u64>

buffer that stores hash values (reused across batches to save allocations)

§null: Option<(V, usize)>

(payload, null_index) for the ‘null’ value, if any NOTE null_index is the logical index in the final array, not the index in the buffer

Implementations§

Source§

impl<O: OffsetSizeTrait, V> ArrowBytesMap<O, V>
where V: Debug + PartialEq + Eq + Clone + Copy + Default,

Source

pub fn new(output_type: OutputType) -> Self

Source

pub fn take(&mut self) -> Self

Return the contents of this map and replace it with a new empty map with the same output type

Source

pub fn insert_if_new<MP, OP>( &mut self, values: &ArrayRef, make_payload_fn: MP, observe_payload_fn: OP, )
where MP: FnMut(Option<&[u8]>) -> V, OP: FnMut(V),

Inserts each value from values into the map, invoking payload_fn for each value if not already present, deferring the allocation of the payload until it is needed.

Note that this is different than a normal map that would replace the existing entry

§Arguments:

values: array whose values are inserted

make_payload_fn: invoked for each value that is not already present to create the payload, in order of the values in values

observe_payload_fn: invoked once, for each value in values, that was already present in the map, with corresponding payload value.

§Returns

The payload value for the entry, either the existing value or the newly inserted value

§Safety:

Note that make_payload_fn and observe_payload_fn are only invoked with valid values from values, not for the NULL value.

Source

fn insert_if_new_inner<MP, OP, B>( &mut self, values: &ArrayRef, make_payload_fn: MP, observe_payload_fn: OP, )
where MP: FnMut(Option<&[u8]>) -> V, OP: FnMut(V), B: ByteArrayType,

Generic version of Self::insert_if_new that handles ByteArrayType (both String and Binary)

Note this is the only function that is generic on [ByteArrayType], which avoids having to template the entire structure, making the code simpler and understand and reducing code bloat due to duplication.

See comments on insert_if_new for more details

Source

pub fn into_state(self) -> ArrayRef

Converts this set into a StringArray, LargeStringArray, BinaryArray, or LargeBinaryArray containing each distinct value that was inserted. This is done without copying the values.

The values are guaranteed to be returned in the same order in which they were first seen.

Source

pub fn len(&self) -> usize

Total number of entries (including null, if present)

Source

pub fn is_empty(&self) -> bool

Is the set empty?

Source

pub fn non_null_len(&self) -> usize

Number of non null entries

Source

pub fn size(&self) -> usize

Return the total size, in bytes, of memory used to store the data in this set, not including self

Trait Implementations§

Source§

impl<O: OffsetSizeTrait, V> Debug for ArrowBytesMap<O, V>
where V: Debug + PartialEq + Eq + Clone + Copy + Default,

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<O, V> Freeze for ArrowBytesMap<O, V>
where V: Freeze,

§

impl<O, V> RefUnwindSafe for ArrowBytesMap<O, V>

§

impl<O, V> Send for ArrowBytesMap<O, V>
where V: Send,

§

impl<O, V> Sync for ArrowBytesMap<O, V>
where V: Sync,

§

impl<O, V> Unpin for ArrowBytesMap<O, V>
where V: Unpin, O: Unpin,

§

impl<O, V> UnwindSafe for ArrowBytesMap<O, V>
where V: UnwindSafe, O: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

§

impl<T> Instrument for T

§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided [Span], returning an Instrumented wrapper. Read more
§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
§

impl<T> PolicyExt for T
where T: ?Sized,

§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] only if self and other return Action::Follow. Read more
§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V

§

impl<T> WithSubscriber for T

§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,

§

impl<T> ErasedDestructor for T
where T: 'static,