Module file_format

ParquetFormat: Parquet FileFormat abstractions

Structs

ObjectStoreFetch
[MetadataFetch] adapter for reading bytes from an [ObjectStore]
ParallelParquetWriterOptions 🔒
Settings related to writing parquet files in parallel
ParquetFormat
The Apache Parquet FileFormat implementation (see the usage sketch below)
ParquetFormatFactory
Factory struct used to create ParquetFormat
ParquetSink
Implements DataSink for writing to a parquet file.
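As a quick orientation, here is a minimal usage sketch showing how a `ParquetFormat` instance plugs into a listing table via `ListingOptions`. It assumes the crate is consumed through the `datafusion` facade; the table name `example` and the `./data/` path are placeholders.

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // ParquetFormat is the FileFormat implementation for Parquet files.
    let format = ParquetFormat::default();

    // Describe a directory of .parquet files as a listing table.
    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".parquet");

    // "example" and "./data/" are placeholders for this sketch.
    ctx.register_listing_table("example", "./data/", options, None, None)
        .await?;

    ctx.sql("SELECT COUNT(*) FROM example").await?.show().await?;
    Ok(())
}
```

Roughly speaking, ParquetFormatFactory plays the equivalent role when the format is created from configuration (for example, a table declared as stored-as-Parquet) rather than constructed directly.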

Constants

BUFFER_FLUSH_BYTES 🔒
When writing parquet files in parallel, if the buffered Parquet data exceeds this size, it is flushed to the object store
INITIAL_BUFFER_BYTES 🔒
Initial writing buffer size. Note this is just a size hint for efficiency. It will grow beyond the set value if needed.

Functions

apply_file_schema_type_coercions
Apply necessary schema type coercions to make file schema match table schema.
clear_metadata 🔒
Clears all metadata (Schema level and field level) on an iterator of Schemas
coerce_file_schema_to_string_type (Deprecated)
If the table schema uses a string type, coerce the file schema to use a string type.
coerce_file_schema_to_view_type (Deprecated)
Coerces the file schema if the table schema uses a view type.
coerce_int96_to_resolution
Coerces the file schema's Timestamps to the provided TimeUnit if the Parquet schema contains INT96.
column_serializer_task 🔒
Consumes a stream of [ArrowLeafColumn] via a channel and serializes them using an [ArrowColumnWriter]. Once the channel is exhausted, returns the ArrowColumnWriter.
concatenate_parallel_row_groups 🔒
Consume RowGroups serialized by other parallel tasks and concatenate them into the final parquet file, while flushing finalized bytes to an [ObjectStore]
fetch_parquet_metadata (Deprecated)
Fetches parquet metadata from ObjectStore for the given object
fetch_statistics (Deprecated)
Read and parse the statistics of the Parquet file at location path
field_with_new_type 🔒
Create a new field with the specified data type, copying the other properties from the input field
get_file_decryption_properties 🔒
output_single_parquet_file_parallelized 🔒
Parallelizes the serialization of a single parquet file, by first serializing N independent RecordBatch streams in parallel to RowGroups in memory. Another task then stitches these independent RowGroups together and streams this large single parquet file to an ObjectStore in multiple parts.
send_arrays_to_col_writers 🔒
Sends the ArrowArrays in the passed [RecordBatch] through the channels to their respective parallel column serializers.
set_writer_encryption_properties 🔒
spawn_column_parallel_row_group_writer 🔒
Spawns a parallel serialization task for each column. Returns join handles for each column's serialization task, along with a send channel to send arrow arrays to each serialization task.
spawn_parquet_parallel_serialization_task 🔒
This task coordinates the serialization of a parquet file in parallel. As the query produces RecordBatches, these are written to a RowGroup via parallel [ArrowColumnWriter] tasks. Once the desired max rows per row group is reached, the parallel tasks are joined on another separate task and sent to a concatenation task. This task immediately continues to work on the next row group in parallel. So, parquet serialization is parallelized across both columns and row_groups, with a theoretical max number of parallel tasks given by n_columns * num_row_groups. (A conceptual sketch of this pattern follows the function list below.)
spawn_rg_join_and_finalize_task 🔒
Spawns a tokio task which joins the parallel column writer tasks and finalizes the row group
statistics_from_parquet_meta_calc (Deprecated)
transform_binary_to_string
Transform a schema so that any binary types are strings
transform_schema_to_view
Transform a schema to use view types for Utf8 and Binary (see the sketch below)
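To illustrate the public schema helpers listed above, here is a small sketch of transform_schema_to_view. The signature is assumed to be `fn(&Schema) -> Schema` and the import path may differ depending on how this module is re-exported; treat it as illustrative rather than definitive.

```rust
use arrow_schema::{DataType, Field, Schema};
// Assumed re-export path; adjust to match how this module is exposed.
use datafusion::datasource::file_format::parquet::transform_schema_to_view;

fn main() {
    // A "classic" file schema using Utf8 and Binary columns.
    let file_schema = Schema::new(vec![
        Field::new("name", DataType::Utf8, true),
        Field::new("payload", DataType::Binary, true),
    ]);

    // Rewrite string/binary columns to their view-type equivalents
    // (Utf8View / BinaryView) so scans can use the view representations.
    let view_schema = transform_schema_to_view(&file_schema);
    assert_eq!(view_schema.field(0).data_type(), &DataType::Utf8View);
    assert_eq!(view_schema.field(1).data_type(), &DataType::BinaryView);
}
```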
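The private functions above (output_single_parquet_file_parallelized, spawn_parquet_parallel_serialization_task, and friends) implement a fan-out/fan-in pipeline: per-column serializer tasks fed over channels, a join step per row group, and a concatenation task that streams bytes to the object store. The sketch below models that pattern with plain tokio channels and hypothetical byte-vector stand-ins; it does not call the module's private helpers or the parquet crate's writers.

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Hypothetical stand-ins: the real code moves ArrowLeafColumn values in and
// collects finalized ArrowColumnWriter output back.
type ColumnData = Vec<u8>;
struct EncodedColumn(Vec<u8>);

#[tokio::main]
async fn main() {
    let n_columns = 3;

    // Fan out: one serializer task per column, each fed by its own channel
    // (analogous to spawn_column_parallel_row_group_writer / column_serializer_task).
    let mut senders: Vec<mpsc::Sender<ColumnData>> = Vec::new();
    let mut col_tasks: Vec<JoinHandle<EncodedColumn>> = Vec::new();
    for _ in 0..n_columns {
        let (tx, mut rx) = mpsc::channel::<ColumnData>(8);
        senders.push(tx);
        col_tasks.push(tokio::spawn(async move {
            let mut encoded = Vec::new();
            // Consume this column's data until the channel closes.
            while let Some(chunk) = rx.recv().await {
                encoded.extend(chunk);
            }
            EncodedColumn(encoded)
        }));
    }

    // The coordinator splits each incoming batch by column and sends the
    // pieces to the per-column tasks (analogous to send_arrays_to_col_writers).
    for _batch in 0..2 {
        for tx in &senders {
            tx.send(vec![0u8; 16]).await.unwrap();
        }
    }
    // Dropping the senders closes the channels; once a row group is "full",
    // the real code does the same and moves on to the next row group.
    drop(senders);

    // Fan in: join the column tasks to finalize one row group
    // (analogous to spawn_rg_join_and_finalize_task). A further task would
    // concatenate finished row groups and stream bytes to the object store
    // (analogous to concatenate_parallel_row_groups).
    let mut row_group: Vec<EncodedColumn> = Vec::new();
    for task in col_tasks {
        row_group.push(task.await.unwrap());
    }
    let total: usize = row_group.iter().map(|c| c.0.len()).sum();
    println!("finalized a row group: {} columns, {} encoded bytes", row_group.len(), total);
}
```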

Type Aliases

ColSender 🔒
ColumnWriterTask 🔒
RBStreamSerializeResult 🔒
This is the return type of calling [ArrowColumnWriter].close() on each column, i.e. the Vec of encoded columns which can be appended to a row group