ListingOptions

Struct ListingOptions 

pub struct ListingOptions {
    pub file_extension: String,
    pub format: Arc<dyn FileFormat>,
    pub table_partition_cols: Vec<(String, DataType)>,
    pub collect_stat: bool,
    pub target_partitions: usize,
    pub file_sort_order: Vec<Vec<Sort>>,
}

Options for creating a crate::ListingTable

Fields§

§file_extension: String

A suffix used to filter files: only files whose path ends with this suffix are kept (leave empty to keep all files on the path).
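As a rough illustration of such a suffix filter, here is a minimal std-only sketch. It is not DataFusion's implementation: the helper name `filter_by_extension` is hypothetical, and the sketch assumes a plain, case-sensitive suffix match on the whole path.

```rust
/// Hypothetical helper (not part of DataFusion): keep only the paths that
/// end with `file_extension`. An empty suffix matches every path, which
/// mirrors "leave empty to keep all files on the path".
fn filter_by_extension<'a>(paths: &[&'a str], file_extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        // `str::ends_with("")` is always true, so an empty suffix keeps everything.
        .filter(|path| path.ends_with(file_extension))
        .collect()
}
```

Because the match is on the suffix rather than on a parsed extension, a value such as ".parquet" would skip a file named "c.parquet.tmp" in this model.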

§format: Arc<dyn FileFormat>

The file format

§table_partition_cols: Vec<(String, DataType)>

The expected partition column names in the folder structure. See Self::with_table_partition_cols for details.

§collect_stat: bool

Set to true to try to collect statistics from the files. This can add significant overhead, because it usually requires each file to be opened and at least partially parsed.

§target_partitions: usize

Group files so that the number of partitions does not exceed this limit.
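The grouping can be pictured as splitting the file list into at most target_partitions chunks. The sketch below is a simplified std-only model under that assumption. `group_files` is a hypothetical helper, and DataFusion's actual grouping may also consider file sizes or byte ranges.

```rust
/// Hypothetical helper (not part of DataFusion): split `files` into at most
/// `target_partitions` contiguous groups of roughly equal length.
fn group_files(files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    if files.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    // Ceiling division: the smallest chunk size that yields no more than
    // `target_partitions` groups.
    let chunk_size = (files.len() + target_partitions - 1) / target_partitions;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}
```

With fewer files than target partitions, this model simply produces one group per file, so the limit acts as an upper bound rather than an exact count.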

§file_sort_order: Vec<Vec<Sort>>

Optional sort order(s) known in advance. Each element must be a Sort expression.

DataFusion may take advantage of this ordering to omit sorts or use more efficient algorithms. Currently, a sort order known through some external mechanism must be provided explicitly; in the future it may be determined automatically, for example from Parquet metadata.

See https://github.com/apache/datafusion/issues/4177

NOTE: This attribute stores all equivalent orderings (the outer Vec), where each ordering is an individual lexicographic ordering (encapsulated by a Vec<Sort>). If there aren’t multiple equivalent orderings, the outer Vec will have a single element.
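To make the nested structure concrete, the following std-only sketch models the outer Vec as a list of equivalent orderings and each inner Vec as one lexicographic ordering. `SortKey` and `sort_is_redundant` are hypothetical names introduced for illustration. The prefix check is a simplification, since data sorted by (a, b) is also sorted by (a), and DataFusion's real ordering analysis is more involved.

```rust
/// Hypothetical stand-in for a sort expression (column, direction, null placement).
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: &'static str,
    ascending: bool,
    nulls_first: bool,
}

/// Returns true when `required` is a prefix of at least one known ordering,
/// in which case an explicit sort on `required` would be unnecessary.
fn sort_is_redundant(known_orderings: &[Vec<SortKey>], required: &[SortKey]) -> bool {
    required.is_empty()
        || known_orderings
            .iter()
            .any(|ordering| ordering.starts_with(required))
}
```

In this model a file set sorted by (a, b) and, equivalently, by (ts) would be described with two inner vectors, and a requirement on either a or ts could skip the sort.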

Implementations§

impl ListingOptions

pub fn new(format: Arc<dyn FileFormat>) -> ListingOptions

Creates an options instance with the given format.

Default values:

  • use default file extension filter
  • no input partition to discover
  • one target partition
  • do not collect statistics

pub fn with_session_config_options( self, config: &SessionConfig, ) -> ListingOptions

Sets options from SessionConfig and returns self.

Currently this sets target_partitions and collect_stat. If more options that need to be coordinated are added in the future, they will be synchronized through this method.

pub fn with_file_extension( self, file_extension: impl Into<String>, ) -> ListingOptions

Sets the file extension on ListingOptions and returns self.

§Example

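// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};

let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_file_extension(".parquet");

assert_eq!(listing_options.file_extension, ".parquet");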

pub fn with_file_extension_opt<S>( self, file_extension: Option<S>, ) -> ListingOptions
where S: Into<String>,

Optionally sets the file extension on ListingOptions and returns self.

If file_extension is None, the file extension is left unchanged.

§Example

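// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};

let extension = Some(".parquet");
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_file_extension_opt(extension);

assert_eq!(listing_options.file_extension, ".parquet");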

pub fn with_table_partition_cols( self, table_partition_cols: Vec<(String, DataType)>, ) -> ListingOptions

Sets table partition columns on ListingOptions and returns self.

Partition columns, used to support Hive partitioning, are columns added to the data that is read, based on the folder structure in which the data resides.

For example, given the following files in your filesystem:

/mnt/nyctaxi/year=2022/month=01/tripdata.parquet
/mnt/nyctaxi/year=2021/month=12/tripdata.parquet
/mnt/nyctaxi/year=2021/month=11/tripdata.parquet

A crate::ListingTable created at /mnt/nyctaxi/ with partition columns “year” and “month” will include new year and month columns when reading the files. For rows read from /mnt/nyctaxi/year=2022/month=01/tripdata.parquet, the year column has the value 2022 and the month column has the value 01.

§Notes
  • If only one level (e.g. year in the example above) is specified, the other levels are ignored but the files are still read.

  • Files that don’t follow this partitioning scheme will be ignored.

  • Since the columns have the same value for all rows read from each individual file (such as dates), they are typically dictionary encoded for efficiency. You may use wrap_partition_type_in_dict to request a dictionary-encoded type.

  • The partition columns are extracted solely from the file path. In particular, they are NOT part of the Parquet files themselves.
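The path-based extraction described in the notes above can be sketched with the standard library alone. `parse_partition_values` is a hypothetical helper, not a DataFusion function. It assumes `col=value` directory names directly below the table root and returns None for paths that do not follow the scheme.

```rust
/// Hypothetical helper (not part of DataFusion): extract the values of
/// `partition_cols` from `file_path`, which is expected to sit below
/// `table_root` in `col=value` directories, in the given column order.
fn parse_partition_values<'a>(
    table_root: &str,
    file_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    let relative = file_path.strip_prefix(table_root)?;
    let mut segments = relative.trim_start_matches('/').split('/');
    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        // Each declared column must match the next path segment as `col=value`.
        let (name, value) = segments.next()?.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value);
    }
    // Any deeper levels are not inspected, so they are effectively ignored.
    Some(values)
}
```

For /mnt/nyctaxi/year=2022/month=01/tripdata.parquet with columns year and month, this sketch yields the values 2022 and 01. With only year declared, it yields 2022 and leaves the month level unexamined.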

§Example

// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::arrow::datatypes::DataType;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};

// Listing options for files with paths such as `/mnt/data/col_a=x/col_b=y/data.parquet`.
// `col_a` and `col_b` will be included in the data read from those files.
let partition_cols = vec![
    ("col_a".to_string(), DataType::Utf8),
    ("col_b".to_string(), DataType::Utf8),
];
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_table_partition_cols(partition_cols.clone());

assert_eq!(listing_options.table_partition_cols, partition_cols);

pub fn with_collect_stat(self, collect_stat: bool) -> ListingOptions

Sets statistics collection on ListingOptions and returns self.


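// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};

let listing_options =
    ListingOptions::new(Arc::new(ParquetFormat::default())).with_collect_stat(true);

assert!(listing_options.collect_stat);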

pub fn with_target_partitions(self, target_partitions: usize) -> ListingOptions

Sets the number of target partitions on ListingOptions and returns self.


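// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};

let listing_options =
    ListingOptions::new(Arc::new(ParquetFormat::default())).with_target_partitions(8);

assert_eq!(listing_options.target_partitions, 8);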

pub fn with_file_sort_order( self, file_sort_order: Vec<Vec<Sort>>, ) -> ListingOptions

Sets the file sort order on ListingOptions and returns self.


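// Import paths assume the `datafusion` umbrella crate.
use std::sync::Arc;
use datafusion::datasource::{file_format::parquet::ParquetFormat, listing::ListingOptions};
use datafusion::prelude::col;

// Tell DataFusion that the files are sorted by column "a"
// (ascending, nulls first).
let file_sort_order = vec![vec![col("a").sort(true, true)]];

let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_file_sort_order(file_sort_order.clone());

assert_eq!(listing_options.file_sort_order, file_sort_order);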

pub async fn infer_schema<'a>( &'a self, state: &dyn Session, table_path: &'a ListingTableUrl, ) -> Result<Arc<Schema>, DataFusionError>

Infer the schema of the files at the given path on the provided object store.

If the table_path contains one or more files (i.e. it is a directory / prefix of files), their schemas are merged by calling FileFormat::infer_schema.

Note: The inferred schema does not include any partitioning columns.

This method is called as part of creating a crate::ListingTable.
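The idea of merging per-file schemas can be modeled roughly as a union of fields by name that fails on conflicting types. The sketch below is a std-only simplification, with data types represented as plain strings. `merge_schemas` is a hypothetical helper, and the actual merge rules depend on the FileFormat in use.

```rust
/// Hypothetical helper (not part of DataFusion): merge per-file schemas,
/// given as (column name, type name) pairs, into one schema. Columns keep
/// the order of first appearance; a type conflict is reported as an error.
fn merge_schemas(schemas: &[Vec<(&str, &str)>]) -> Result<Vec<(String, String)>, String> {
    let mut merged: Vec<(String, String)> = Vec::new();
    for schema in schemas {
        for &(name, data_type) in schema {
            // Look up the type already recorded for this column, if any.
            let existing = merged
                .iter()
                .find(|(n, _)| n.as_str() == name)
                .map(|(_, t)| t.clone());
            match existing {
                Some(t) if t != data_type => {
                    return Err(format!(
                        "conflicting types for column `{name}`: {t} vs {data_type}"
                    ));
                }
                Some(_) => {}
                None => merged.push((name.to_string(), data_type.to_string())),
            }
        }
    }
    Ok(merged)
}
```

Partition columns do not appear in this model either: they come from the path, not from the files whose schemas are merged.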

pub async fn validate_partitions( &self, state: &dyn Session, table_path: &ListingTableUrl, ) -> Result<(), DataFusionError>

Infers the partition columns stored in LOCATION and compares them with the columns provided in PARTITIONED BY to help prevent accidental corruption of partitioned tables.

Allows specifying partial partitions.
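One way to read "partial partitions" is that the declared columns must form a prefix of the columns found on disk. The std-only sketch below encodes that reading. It is an assumption about the behavior rather than the library's code, and `check_declared_partitions` is a hypothetical helper.

```rust
/// Hypothetical helper (not part of DataFusion): accept `declared` partition
/// columns when they are a prefix of the `inferred` ones, so that declaring
/// only the leading levels (a partial partitioning) is allowed.
fn check_declared_partitions(inferred: &[&str], declared: &[&str]) -> Result<(), String> {
    if inferred.starts_with(declared) {
        Ok(())
    } else {
        Err(format!(
            "inferred partitions {inferred:?} do not match declared partitions {declared:?}"
        ))
    }
}
```

Under this reading, declaring only year for a year/month layout passes, while declaring month alone, or more levels than exist on disk, is rejected.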

pub async fn infer_partitions( &self, state: &dyn Session, table_path: &ListingTableUrl, ) -> Result<Vec<String>, DataFusionError>

Infer the partitioning at the given path on the provided object store. For performance reasons, it doesn’t read all the files on disk and therefore may fail to detect invalid partitioning.
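The trade-off described above, sampling a few files instead of listing them all, can be sketched as follows. This is a std-only illustration, not DataFusion's implementation. The helpers `partition_names` and `infer_partition_names` and the `sample_limit` parameter are all hypothetical.

```rust
/// Hypothetical helper: collect the `name` part of every `name=value`
/// directory segment between `table_root` and the file name.
fn partition_names(table_root: &str, path: &str) -> Vec<String> {
    let relative = path.strip_prefix(table_root).unwrap_or(path);
    // Drop the file name; only directory segments can be partition levels.
    let dirs = relative.rsplit_once('/').map(|(dirs, _file)| dirs).unwrap_or("");
    dirs.split('/')
        .filter_map(|segment| segment.split_once('=').map(|(name, _value)| name.to_string()))
        .collect()
}

/// Hypothetical helper: infer partition column names from at most
/// `sample_limit` paths, failing if the sampled paths disagree.
fn infer_partition_names(
    table_root: &str,
    paths: &[&str],
    sample_limit: usize,
) -> Result<Vec<String>, String> {
    let mut sampled = paths
        .iter()
        .take(sample_limit)
        .map(|path| partition_names(table_root, path));
    let first = match sampled.next() {
        Some(names) => names,
        None => return Ok(Vec::new()),
    };
    for names in sampled {
        if names != first {
            return Err(format!("mixed partitioning on disk: {first:?} vs {names:?}"));
        }
    }
    Ok(first)
}
```

A file with a different layout that falls outside the sample goes unnoticed in this model, which is the kind of missed detection the description warns about.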

Trait Implementations§

impl Clone for ListingOptions

fn clone(&self) -> ListingOptions

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ListingOptions

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter.
