ParquetFormat: Parquet FileFormat abstractions
Structs
- ObjectStoreFetch - [MetadataFetch] adapter for reading bytes from an [ObjectStore]
- ParallelParquetWriterOptions - Settings related to writing parquet files in parallel
- ParquetFormat - The Apache Parquet FileFormat implementation (see the usage sketch after this list)
- ParquetFormatFactory - Factory struct used to create ParquetFormat
- ParquetSink - Implements DataSink for writing to a parquet file.
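A minimal sketch of wiring ParquetFormat into a listing table. The table name and path are placeholders, and builder methods such as `with_enable_pruning` have changed signatures across DataFusion versions, so treat this as an illustration rather than a version-exact recipe:

```rust
use std::sync::Arc;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // ParquetFormat is the FileFormat implementation for Parquet files.
    let file_format = ParquetFormat::default().with_enable_pruning(true);

    let listing_options = ListingOptions::new(Arc::new(file_format))
        .with_file_extension(".parquet");

    // "my_table" and the path are placeholders for this sketch.
    ctx.register_listing_table("my_table", "file:///data/", listing_options, None, None)
        .await?;

    let df = ctx.sql("SELECT count(*) FROM my_table").await?;
    df.show().await?;
    Ok(())
}
```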
Constants
- BUFFER_FLUSH_BYTES - When writing parquet files in parallel, if the buffered Parquet data exceeds this size, it is flushed to the object store (see the sketch after this list)
- INITIAL_BUFFER_BYTES - Initial writing buffer size. Note this is just a size hint for efficiency; it will grow beyond the set value if needed.
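These two constants describe a common buffered-write pattern: accumulate serialized bytes in memory and flush to the object store whenever the buffer crosses a threshold. A self-contained sketch of that pattern follows; the constant values and the `flush`/`write_chunk` helpers are illustrative stand-ins, not the crate's actual internals:

```rust
// Illustrative values only; the real constants live in this module.
const INITIAL_BUFFER_BYTES: usize = 1024 * 1024; // size hint; may grow
const BUFFER_FLUSH_BYTES: usize = 8 * 1024 * 1024; // flush threshold

// Hypothetical sink standing in for an ObjectStore multipart upload.
fn flush(buffer: &mut Vec<u8>) {
    println!("flushing {} bytes", buffer.len());
    buffer.clear();
}

fn write_chunk(buffer: &mut Vec<u8>, chunk: &[u8]) {
    buffer.extend_from_slice(chunk);
    // Flush once the buffered Parquet data exceeds the threshold.
    if buffer.len() >= BUFFER_FLUSH_BYTES {
        flush(buffer);
    }
}

fn main() {
    let mut buffer = Vec::with_capacity(INITIAL_BUFFER_BYTES);
    for _ in 0..20 {
        write_chunk(&mut buffer, &[0u8; 1024 * 1024]);
    }
    flush(&mut buffer); // flush the remainder
}
```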
Functions
- apply_file_schema_type_coercions - Apply necessary schema type coercions to make the file schema match the table schema.
- clear_metadata - Clears all metadata (schema level and field level) on an iterator of Schemas
- coerce_file_schema_to_string_type (Deprecated) - If the table schema uses a string type, coerce the file schema to use a string type.
- coerce_file_schema_to_view_type (Deprecated) - Coerces the file schema if the table schema uses a view type.
- coerce_int96_to_resolution - Coerces the file schema's Timestamps to the provided TimeUnit if the Parquet schema contains INT96.
- column_serializer_task - Consumes a stream of [ArrowLeafColumn] via a channel and serializes them using an [ArrowColumnWriter]. Once the channel is exhausted, returns the ArrowColumnWriter.
- concatenate_parallel_row_groups - Consumes RowGroups serialized by other parallel tasks and concatenates them into the final parquet file, while flushing finalized bytes to an [ObjectStore]
- fetch_parquet_metadata (Deprecated) - Fetches parquet metadata from the ObjectStore for the given object
- fetch_statistics (Deprecated) - Read and parse the statistics of the Parquet file at location path
- field_with_new_type - Create a new field with the specified data type, copying the other properties from the input field
- get_file_decryption_properties
- output_single_parquet_file_parallelized - Parallelizes the serialization of a single parquet file by first serializing N independent RecordBatch streams in parallel to RowGroups in memory. Another task then stitches these independent RowGroups together and streams this large single parquet file to an ObjectStore in multiple parts.
- send_arrays_to_col_writers - Sends the ArrowArrays in the passed [RecordBatch] through the channels to their respective parallel column serializers.
- set_writer_encryption_properties
- spawn_column_parallel_row_group_writer - Spawns a parallel serialization task for each column. Returns join handles for each column's serialization task along with a send channel to send arrow arrays to each serialization task.
- spawn_parquet_parallel_serialization_task - This task coordinates the serialization of a parquet file in parallel. As the query produces RecordBatches, these are written to a RowGroup via parallel [ArrowColumnWriter] tasks. Once the desired max rows per row group is reached, the parallel tasks are joined on another separate task and sent to a concatenation task. This task immediately continues to work on the next row group in parallel, so parquet serialization is parallelized across both columns and row groups, with a theoretical max number of parallel tasks given by n_columns * num_row_groups. (See the sketch after this list.)
- spawn_rg_join_and_finalize_task - Spawns a tokio task which joins the parallel column writer tasks and finalizes the row group
- statistics_from_parquet_meta_calc (Deprecated)
- transform_binary_to_string - Transform a schema so that any binary types are strings (see the example after this list)
- transform_schema_to_view - Transform a schema to use view types for Utf8 and Binary
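The parallel-serialization functions above (spawn_column_parallel_row_group_writer, column_serializer_task, send_arrays_to_col_writers, spawn_rg_join_and_finalize_task) together form a fan-out/fan-in pipeline. The following is a simplified sketch of that coordination pattern using tokio channels; the types and task bodies are hypothetical stand-ins, not the crate's actual internals:

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Hypothetical stand-ins for ArrowLeafColumn and an encoded column chunk.
type LeafColumn = Vec<u8>;
type EncodedColumn = Vec<u8>;

// One task per column: drain the channel, "encode" each piece, and return
// the encoded column once the channel is exhausted (mirroring
// column_serializer_task returning its ArrowColumnWriter).
fn spawn_column_task(mut rx: mpsc::Receiver<LeafColumn>) -> JoinHandle<EncodedColumn> {
    tokio::spawn(async move {
        let mut encoded = Vec::new();
        while let Some(col) = rx.recv().await {
            encoded.extend(col); // placeholder for ArrowColumnWriter::write
        }
        encoded // placeholder for ArrowColumnWriter::close
    })
}

#[tokio::main]
async fn main() {
    let n_columns = 3;

    // Fan out: one sender plus one task per column
    // (cf. spawn_column_parallel_row_group_writer).
    let (senders, handles): (Vec<_>, Vec<_>) = (0..n_columns)
        .map(|_| {
            let (tx, rx) = mpsc::channel::<LeafColumn>(8);
            (tx, spawn_column_task(rx))
        })
        .unzip();

    // As each "RecordBatch" arrives, send each column's data to its writer
    // (cf. send_arrays_to_col_writers).
    for batch_id in 0..4u8 {
        for tx in &senders {
            tx.send(vec![batch_id; 4]).await.unwrap();
        }
    }
    drop(senders); // close the channels so the column tasks finish

    // Fan in: join the column tasks and "finalize" the row group
    // (cf. spawn_rg_join_and_finalize_task / concatenate_parallel_row_groups).
    let mut row_group = Vec::new();
    for handle in handles {
        row_group.push(handle.await.unwrap());
    }
    println!("row group with {} encoded columns", row_group.len());
}
```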
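The schema-transform helpers rewrite field data types while keeping other field properties. Assuming they are exported from this module with `&Schema -> Schema` signatures (as their one-line descriptions suggest), a usage sketch:

```rust
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::datasource::file_format::parquet::{
    transform_binary_to_string, transform_schema_to_view,
};

fn main() {
    let schema = Schema::new(vec![
        Field::new("name", DataType::Utf8, true),
        Field::new("payload", DataType::Binary, true),
    ]);

    // Utf8 -> Utf8View (and Binary -> BinaryView), per the description above.
    let view_schema = transform_schema_to_view(&schema);
    assert_eq!(view_schema.field(0).data_type(), &DataType::Utf8View);

    // Binary -> string type; other field properties are preserved.
    let string_schema = transform_binary_to_string(&schema);
    assert_eq!(string_schema.field(1).data_type(), &DataType::Utf8);
}
```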
Type Aliases
- ColSender
- ColumnWriterTask
- RBStreamSerializeResult - This is the return type of calling [ArrowColumnWriter].close() on each column, i.e. the Vec of encoded columns which can be appended to a row group