pub struct ListingOptions {
    pub file_extension: String,
    pub format: Arc<dyn FileFormat>,
    pub table_partition_cols: Vec<(String, DataType)>,
    pub collect_stat: bool,
    pub target_partitions: usize,
    pub file_sort_order: Vec<Vec<Sort>>,
}
Options for creating a crate::ListingTable
Fields§
file_extension: String
A suffix on which files should be filtered (leave empty to keep all files on the path).
format: Arc<dyn FileFormat>
The file format.
table_partition_cols: Vec<(String, DataType)>
The expected partition column names in the folder structure. See Self::with_table_partition_cols for details.
collect_stat: bool
Set to true to try to guess statistics from the files. This can add a lot of overhead, as it usually requires files to be opened and at least partially parsed.
target_partitions: usize
Group files so that the number of partitions does not exceed this limit.
file_sort_order: Vec<Vec<Sort>>
Optional pre-known sort order(s). Must be SortExprs.
DataFusion may take advantage of this ordering to omit sorts or use more efficient algorithms. Currently sortedness must be provided if it is known by some external mechanism, but may in the future be automatically determined, for example using parquet metadata.
See https://github.com/apache/datafusion/issues/4177
NOTE: This attribute stores all equivalent orderings (the outer Vec), where each ordering consists of an individual lexicographic ordering (encapsulated by a Vec<Sort>). If there aren’t multiple equivalent orderings, the outer Vec will have a single element.
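The nested shape of file_sort_order can be easier to see in a small model. The following is a minimal sketch using only the standard library: SortKey and satisfies are hypothetical names invented for illustration, not DataFusion types or functions, and the prefix check is a deliberate simplification of how a planner might decide that a sort can be omitted.

```rust
/// A simplified stand-in for a single sort expression.
/// NOTE: hypothetical type for illustration; it is not DataFusion's `Sort`.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: String,
    ascending: bool,
    nulls_first: bool,
}

impl SortKey {
    /// Ascending, nulls-first key on `column`.
    fn asc(column: &str) -> Self {
        SortKey {
            column: column.to_string(),
            ascending: true,
            nulls_first: true,
        }
    }
}

/// Returns true if any of the known equivalent orderings (outer slice)
/// starts with the required lexicographic keys (inner Vec).
/// In that case, data in that order would need no additional sort.
fn satisfies(known_orderings: &[Vec<SortKey>], required: &[SortKey]) -> bool {
    known_orderings
        .iter()
        .any(|ordering| ordering.starts_with(required))
}
```

With two equivalent orderings, [a, b] and [ts], a requirement of [a] or [ts] is satisfied under this model, while [b] alone is not, because b is only the second key of a lexicographic ordering.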
Implementations§
impl ListingOptions
pub fn new(format: Arc<dyn FileFormat>) -> ListingOptions
Creates an options instance with the given format. Default values:
- use default file extension filter
- no input partition to discover
- one target partition
- do not collect statistics
pub fn with_session_config_options(self, config: &SessionConfig) -> ListingOptions
Sets options from SessionConfig and returns self.
Currently this sets target_partitions and collect_stat, but if more options that need to be coordinated are added in the future, they will be synchronized through this method.
pub fn with_file_extension(self, file_extension: impl Into<String>) -> ListingOptions
Sets the file extension on ListingOptions and returns self.
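The file_extension field is described as a suffix filter, where an empty value keeps every file. A minimal sketch of that behavior, using only the standard library, might look like the following. The function filter_by_extension is a hypothetical helper, and details such as case sensitivity are assumptions; DataFusion's own listing logic may differ.

```rust
/// Keeps only the paths that end with `file_extension`.
/// An empty extension keeps every path, since every string ends with "".
/// NOTE: hypothetical helper for illustration; not part of DataFusion.
fn filter_by_extension<'a>(paths: &[&'a str], file_extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        // Plain, case-sensitive suffix match (an assumption of this sketch).
        .filter(|path| path.ends_with(file_extension))
        .collect()
}
```

Because the match is on the suffix rather than on a parsed extension, a value such as ".parquet" and a value such as "data.parquet" are both valid filters in this model.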
§Example
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_file_extension(".parquet");
assert_eq!(listing_options.file_extension, ".parquet");
pub fn with_file_extension_opt<S>(self, file_extension: Option<S>) -> ListingOptions
Optionally sets the file extension on ListingOptions and returns self.
If file_extension is None, the file extension will not be changed.
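The "leave unchanged on None" rule is a common shape for optional builder setters. The sketch below reproduces it on a stand-in type so it can be run without any dependencies. SketchOptions is hypothetical, and the bound S: Into<String> is an assumption, since the signature above does not show its bounds.

```rust
/// Stand-in for an options struct with a single field.
/// NOTE: hypothetical type for illustration; it is not `ListingOptions`.
#[derive(Debug, Clone, PartialEq)]
struct SketchOptions {
    file_extension: String,
}

impl SketchOptions {
    fn new() -> Self {
        SketchOptions {
            file_extension: String::new(),
        }
    }

    /// Overwrites the extension only when a value is supplied.
    /// The `Into<String>` bound is assumed for this sketch.
    fn with_file_extension_opt<S: Into<String>>(mut self, file_extension: Option<S>) -> Self {
        if let Some(ext) = file_extension {
            self.file_extension = ext.into();
        }
        self
    }
}
```

This shape lets a caller thread an Option from configuration straight into the builder chain without an if/else around the call.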
§Example
let extension = Some(".parquet");
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_file_extension_opt(extension);
assert_eq!(listing_options.file_extension, ".parquet");
pub fn with_table_partition_cols(self, table_partition_cols: Vec<(String, DataType)>) -> ListingOptions
Sets the table partition columns on ListingOptions and returns self.
“Partition columns,” used to support Hive partitioning, are columns added to the data that is read, based on the folder structure where the data resides.
For example, given the following files in your filesystem:
/mnt/nyctaxi/year=2022/month=01/tripdata.parquet
/mnt/nyctaxi/year=2021/month=12/tripdata.parquet
/mnt/nyctaxi/year=2021/month=11/tripdata.parquet
A crate::ListingTable created at /mnt/nyctaxi/ with partition columns “year” and “month” will include new year and month columns while reading the files. The year column would have value 2022 and the month column would have value 01 for the rows read from /mnt/nyctaxi/year=2022/month=01/tripdata.parquet
§Notes
- If only one level (e.g. year in the example above) is specified, the other levels are ignored but the files are still read.
- Files that don’t follow this partitioning scheme will be ignored.
- Since the columns have the same value for all rows read from each individual file (such as dates), they are typically dictionary encoded for efficiency. You may use wrap_partition_type_in_dict to request a dictionary-encoded type.
- The partition columns are extracted solely from the file path. In particular, they are NOT part of the parquet files themselves.
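To make the path-based extraction concrete, here is a minimal sketch of pulling partition values out of a Hive-style path with the standard library only. The function parse_partition_values is a hypothetical helper written for this illustration; it is not DataFusion's implementation, and it assumes that directory levels appear in the same order as the declared columns.

```rust
/// Extracts the values of the requested partition columns from a
/// Hive-style path such as `/mnt/nyctaxi/year=2022/month=01/tripdata.parquet`.
///
/// Returns `None` when the path does not follow the expected scheme.
/// NOTE: hypothetical helper for illustration; not DataFusion's implementation.
fn parse_partition_values<'a>(
    table_root: &str,
    file_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    // Only the part of the path below the table root carries partition information.
    let relative = file_path.strip_prefix(table_root)?;
    let mut segments = relative.split('/').filter(|s| !s.is_empty());

    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        // Each level must look like `<column>=<value>`, in declaration order.
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value);
    }
    // Any deeper levels are left unexamined, so the file is still usable
    // when only a leading subset of the levels is declared.
    Some(values)
}
```

In this model the values come back as strings taken from the path; converting them to the declared DataType is left out of the sketch.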
§Example
// listing options for files with paths such as `/mnt/data/col_a=x/col_b=y/data.parquet`
// `col_a` and `col_b` will be included in the data read from those files
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_table_partition_cols(vec![
        ("col_a".to_string(), DataType::Utf8),
        ("col_b".to_string(), DataType::Utf8),
    ]);
assert_eq!(
    listing_options.table_partition_cols,
    vec![
        ("col_a".to_string(), DataType::Utf8),
        ("col_b".to_string(), DataType::Utf8),
    ]
);
pub fn with_collect_stat(self, collect_stat: bool) -> ListingOptions
Sets statistics collection on ListingOptions and returns self.
let listing_options =
ListingOptions::new(Arc::new(ParquetFormat::default())).with_collect_stat(true);
assert_eq!(listing_options.collect_stat, true);
pub fn with_target_partitions(self, target_partitions: usize) -> ListingOptions
Sets the number of target partitions on ListingOptions and returns self.
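The target_partitions field is described as a limit on the number of partitions that files are grouped into. One simple way to respect such a limit is to chunk the file list with a ceiling-divided chunk size, as sketched below with the standard library only. The function group_files is a hypothetical helper, and this chunking strategy is an assumption of the sketch rather than a statement of how DataFusion groups files.

```rust
/// Splits `files` into at most `target_partitions` groups of near-equal size.
/// NOTE: hypothetical helper for illustration; not part of DataFusion.
fn group_files(files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    if files.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    // Ceiling division: the smallest chunk size that yields
    // no more than `target_partitions` chunks.
    let chunk_size = (files.len() + target_partitions - 1) / target_partitions;
    files
        .chunks(chunk_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Five files with a target of two produce groups of three and two files, while a target larger than the number of files simply yields one file per group.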
let listing_options =
ListingOptions::new(Arc::new(ParquetFormat::default())).with_target_partitions(8);
assert_eq!(listing_options.target_partitions, 8);
pub fn with_file_sort_order(self, file_sort_order: Vec<Vec<Sort>>) -> ListingOptions
Sets the file sort order on ListingOptions and returns self.
// Tell datafusion that the files are sorted by column "a"
let file_sort_order = vec![vec![col("a").sort(true, true)]];
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_file_sort_order(file_sort_order.clone());
assert_eq!(listing_options.file_sort_order, file_sort_order);
pub async fn infer_schema<'a>(&'a self, state: &dyn Session, table_path: &'a ListingTableUrl) -> Result<Arc<Schema>, DataFusionError>
Infer the schema of the files at the given path on the provided object store.
If table_path contains one or more files (i.e. it is a directory or prefix of files), their schemas are merged by calling FileFormat::infer_schema.
Note: The inferred schema does not include any partitioning columns.
This method is called as part of creating a crate::ListingTable.
pub async fn validate_partitions(&self, state: &dyn Session, table_path: &ListingTableUrl) -> Result<(), DataFusionError>
Infers the partition columns stored in LOCATION and compares
them with the columns provided in PARTITIONED BY to help prevent
accidental corruption of partitioned tables.
Allows specifying partial partitions.
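The comparison described above can be sketched as a prefix check. The code below is a minimal illustration under one stated assumption: "partial partitions" is taken to mean that the declared columns must form a leading prefix of the inferred ones. The function check_declared_partitions and its error text are hypothetical, and the real method may treat edge cases (such as no inferred partitions) differently.

```rust
/// Checks that the declared partition columns are a leading prefix of
/// the columns inferred from the directory layout.
/// NOTE: hypothetical helper for illustration; not DataFusion's implementation.
fn check_declared_partitions(inferred: &[&str], declared: &[&str]) -> Result<(), String> {
    if inferred.starts_with(declared) {
        Ok(())
    } else {
        Err(format!(
            "declared partition columns {:?} do not match inferred columns {:?}",
            declared, inferred
        ))
    }
}
```

Under this assumption, declaring only year for a layout partitioned by year and month is accepted, while declaring month alone, or a column that does not exist in the layout, is rejected.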
pub async fn infer_partitions(&self, state: &dyn Session, table_path: &ListingTableUrl) -> Result<Vec<String>, DataFusionError>
Infer the partitioning at the given path on the provided object store. For performance reasons, it doesn’t read all the files on disk and therefore may fail to detect invalid partitioning.
Trait Implementations§
impl Clone for ListingOptions
fn clone(&self) -> ListingOptions
fn clone_from(&mut self, source: &Self)
Auto Trait Implementations§
impl Freeze for ListingOptions
impl !RefUnwindSafe for ListingOptions
impl Send for ListingOptions
impl Sync for ListingOptions
impl Unpin for ListingOptions
impl !UnwindSafe for ListingOptions
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.