Module file_format

ParquetFormat: Parquet FileFormat abstractions

Structs

ObjectStoreFetch
[MetadataFetch] adapter for reading bytes from an [ObjectStore]
ParallelParquetWriterOptions 🔒
Settings related to writing parquet files in parallel
ParquetFormat
The Apache Parquet FileFormat implementation (see the usage sketch below)
ParquetFormatFactory
Factory struct used to create ParquetFormat
ParquetSink
Implements DataSink for writing to a parquet file.
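As a quick orientation, here is a minimal usage sketch showing how a `ParquetFormat` instance plugs into a listing table via `ListingOptions`. It assumes the crate is consumed through the `datafusion` facade; the table name `example` and the `./data/` path are placeholders.

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // ParquetFormat is the FileFormat implementation for Parquet files.
    let format = ParquetFormat::default();

    // Describe a directory of .parquet files as a listing table.
    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".parquet");

    // "example" and "./data/" are placeholders for this sketch.
    ctx.register_listing_table("example", "./data/", options, None, None)
        .await?;

    ctx.sql("SELECT COUNT(*) FROM example").await?.show().await?;
    Ok(())
}
```

Roughly speaking, ParquetFormatFactory plays the equivalent role when the format is created from configuration (for example, a table declared as stored-as-Parquet) rather than constructed directly.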

Constants

BUFFER_FLUSH_BYTES 🔒
When writing parquet files in parallel, if the buffered Parquet data exceeds this size, it is flushed to the object store
INITIAL_BUFFER_BYTES 🔒
Initial writing buffer size. Note this is just a size hint for efficiency. It will grow beyond the set value if needed.

Functions

apply_file_schema_type_coercions
Apply necessary schema type coercions to make file schema match table schema.
clear_metadata 🔒
Clears all metadata (Schema level and field level) on an iterator of Schemas
coerce_file_schema_to_string_type (Deprecated)
If the table schema uses a string type, coerce the file schema to use a string type.
coerce_file_schema_to_view_type (Deprecated)
Coerces the file schema if the table schema uses a view type.
coerce_int96_to_resolution
Coerces the file schema's Timestamps to the provided TimeUnit if the Parquet schema contains INT96.
column_serializer_task 🔒
Consumes a stream of [ArrowLeafColumn] via a channel and serializes them using an [ArrowColumnWriter]. Once the channel is exhausted, returns the ArrowColumnWriter.
concatenate_parallel_row_groups 🔒
Consume RowGroups serialized by other parallel tasks and concatenate them into the final parquet file, while flushing finalized bytes to an [ObjectStore]
fetch_parquet_metadata (Deprecated)
Fetches parquet metadata from ObjectStore for the given object
fetch_statistics (Deprecated)
Read and parse the statistics of the Parquet file at location path
field_with_new_type 🔒
Create a new field with the specified data type, copying the other properties from the input field
get_file_decryption_properties 🔒
output_single_parquet_file_parallelized 🔒
Parallelizes the serialization of a single parquet file, by first serializing N independent RecordBatch streams in parallel to RowGroups in memory. Another task then stitches these independent RowGroups together and streams this large single parquet file to an ObjectStore in multiple parts.
send_arrays_to_col_writers 🔒
Sends the ArrowArrays in the passed [RecordBatch] through the channels to their respective parallel column serializers.
set_writer_encryption_properties 🔒
spawn_column_parallel_row_group_writer 🔒
Spawns a parallel serialization task for each column. Returns join handles for each column's serialization task, along with a send channel to send arrow arrays to each serialization task.
spawn_parquet_parallel_serialization_task 🔒
This task coordinates the serialization of a parquet file in parallel. As the query produces RecordBatches, these are written to a RowGroup via parallel [ArrowColumnWriter] tasks. Once the desired max rows per row group is reached, the parallel tasks are joined on another separate task and sent to a concatenation task. This task immediately continues to work on the next row group in parallel. So, parquet serialization is parallelized across both columns and row_groups, with a theoretical max number of parallel tasks given by n_columns * num_row_groups. (A conceptual sketch of this pattern follows the function list below.)
spawn_rg_join_and_finalize_task 🔒
Spawns a tokio task which joins the parallel column writer tasks and finalizes the row group
statistics_from_parquet_meta_calc (Deprecated)
transform_binary_to_string
Transform a schema so that any binary types are strings
transform_schema_to_view
Transform a schema to use view types for Utf8 and Binary (see the sketch below)
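To illustrate the public schema helpers listed above, here is a small sketch of transform_schema_to_view. The signature is assumed to be `fn(&Schema) -> Schema` and the import path may differ depending on how this module is re-exported; treat it as illustrative rather than definitive.

```rust
use arrow_schema::{DataType, Field, Schema};
// Assumed re-export path; adjust to match how this module is exposed.
use datafusion::datasource::file_format::parquet::transform_schema_to_view;

fn main() {
    // A "classic" file schema using Utf8 and Binary columns.
    let file_schema = Schema::new(vec![
        Field::new("name", DataType::Utf8, true),
        Field::new("payload", DataType::Binary, true),
    ]);

    // Rewrite string/binary columns to their view-type equivalents
    // (Utf8View / BinaryView) so scans can use the view representations.
    let view_schema = transform_schema_to_view(&file_schema);
    assert_eq!(view_schema.field(0).data_type(), &DataType::Utf8View);
    assert_eq!(view_schema.field(1).data_type(), &DataType::BinaryView);
}
```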
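The private functions above (output_single_parquet_file_parallelized, spawn_parquet_parallel_serialization_task, and friends) implement a fan-out/fan-in pipeline: per-column serializer tasks fed over channels, a join step per row group, and a concatenation task that streams bytes to the object store. The sketch below models that pattern with plain tokio channels and hypothetical byte-vector stand-ins; it does not call the module's private helpers or the parquet crate's writers.

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Hypothetical stand-ins: the real code moves ArrowLeafColumn values in and
// collects finalized ArrowColumnWriter output back.
type ColumnData = Vec<u8>;
struct EncodedColumn(Vec<u8>);

#[tokio::main]
async fn main() {
    let n_columns = 3;

    // Fan out: one serializer task per column, each fed by its own channel
    // (analogous to spawn_column_parallel_row_group_writer / column_serializer_task).
    let mut senders: Vec<mpsc::Sender<ColumnData>> = Vec::new();
    let mut col_tasks: Vec<JoinHandle<EncodedColumn>> = Vec::new();
    for _ in 0..n_columns {
        let (tx, mut rx) = mpsc::channel::<ColumnData>(8);
        senders.push(tx);
        col_tasks.push(tokio::spawn(async move {
            let mut encoded = Vec::new();
            // Consume this column's data until the channel closes.
            while let Some(chunk) = rx.recv().await {
                encoded.extend(chunk);
            }
            EncodedColumn(encoded)
        }));
    }

    // The coordinator splits each incoming batch by column and sends the
    // pieces to the per-column tasks (analogous to send_arrays_to_col_writers).
    for _batch in 0..2 {
        for tx in &senders {
            tx.send(vec![0u8; 16]).await.unwrap();
        }
    }
    // Dropping the senders closes the channels; once a row group is "full",
    // the real code does the same and moves on to the next row group.
    drop(senders);

    // Fan in: join the column tasks to finalize one row group
    // (analogous to spawn_rg_join_and_finalize_task). A further task would
    // concatenate finished row groups and stream bytes to the object store
    // (analogous to concatenate_parallel_row_groups).
    let mut row_group: Vec<EncodedColumn> = Vec::new();
    for task in col_tasks {
        row_group.push(task.await.unwrap());
    }
    let total: usize = row_group.iter().map(|c| c.0.len()).sum();
    println!("finalized a row group: {} columns, {} encoded bytes", row_group.len(), total);
}
```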

Type Aliases

ColSender 🔒
ColumnWriterTask 🔒
RBStreamSerializeResult 🔒
This is the return type of calling [ArrowColumnWriter].close() on each column, i.e. the Vec of encoded columns which can be appended to a row group