Struct ListingTable

Source
pub struct ListingTable {
    table_paths: Vec<ListingTableUrl>,
    file_schema: SchemaRef,
    table_schema: SchemaRef,
    schema_source: SchemaSource,
    options: ListingOptions,
    definition: Option<String>,
    collected_statistics: FileStatisticsCache,
    constraints: Constraints,
    column_defaults: HashMap<String, Expr>,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
    expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>,
}

Built-in TableProvider that reads data from one or more files as a single table.

The files are read using an [ObjectStore] instance, for example from local files or objects from AWS S3.

§Features:

  • Reading multiple files as a single table
  • Hive style partitioning (e.g., directories named date=2024-06-01)
  • Merges schemas from files with compatible but not identical schemas (see ListingTableConfig::file_schema)
  • limit, filter and projection pushdown for formats that support it (e.g., Parquet)
  • Statistics collection and pruning based on file metadata
  • Pre-existing sort order (see ListingOptions::file_sort_order)
  • Metadata caching to speed up repeated queries (see FileMetadataCache)
  • Statistics caching (see FileStatisticsCache)

§Reading Directories and Hive Style Partitioning

For example, given the table1 directory (or object store prefix)

table1
 ├── file1.parquet
 └── file2.parquet

A ListingTable would read the files file1.parquet and file2.parquet as a single table, merging the schemas if the files have compatible but not identical schemas.

Given the table2 directory (or object store prefix)

table2
 ├── date=2024-06-01
 │    ├── file3.parquet
 │    └── file4.parquet
 └── date=2024-06-02
      └── file5.parquet

A ListingTable would read the files file3.parquet, file4.parquet, and file5.parquet as a single table, again merging schemas if necessary.

Given the hive style partitioning structure (e.g., directories named date=2024-06-01 and date=2024-06-02), ListingTable also adds a date column when reading the table:

  • The files in table2/date=2024-06-01 will have the value 2024-06-01
  • The files in table2/date=2024-06-02 will have the value 2024-06-02

If the query has a predicate like WHERE date = '2024-06-01' only the corresponding directory will be read.
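
For example, a partitioned directory like table2 can be registered through a SessionContext so that the date predicate prunes directories. A minimal sketch, assuming a Parquet layout; the table name, path, and partition column type are illustrative:

use std::sync::Arc;
use datafusion::arrow::datatypes::DataType;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

async fn register_partitioned_table(ctx: &SessionContext) -> Result<()> {
    // Declare the Hive partition column so `date` appears in the table schema
    let listing_options = ListingOptions::new(Arc::new(ParquetFormat::new()))
        .with_file_extension(".parquet")
        .with_table_partition_cols(vec![("date".to_string(), DataType::Utf8)]);

    // "/path/to/table2" is a placeholder path
    ctx.register_listing_table("table2", "/path/to/table2", listing_options, None, None)
        .await?;

    // Only the table2/date=2024-06-01 directory is read for this query
    ctx.sql("SELECT * FROM table2 WHERE date = '2024-06-01'")
        .await?
        .show()
        .await?;
    Ok(())
}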

§See Also

  1. ListingTableConfig: Configuration options
  2. DataSourceExec: ExecutionPlan used by ListingTable

§Caching Metadata

Some formats, such as Parquet, use the FileMetadataCache to cache file metadata that is needed to execute but expensive to read, such as row groups and statistics. The cache is scoped to the SessionContext and can be configured via the runtime config options.

§Example: Read a directory of parquet files using a ListingTable

use std::sync::Arc;
use datafusion::catalog::Session;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::datasource::TableProvider;
use datafusion::error::Result;

async fn get_listing_table(session: &dyn Session) -> Result<Arc<dyn TableProvider>> {
    let table_path = "/path/to/parquet";

    // Parse the path
    let table_path = ListingTableUrl::parse(table_path)?;

    // Create default parquet options
    let file_format = ParquetFormat::new();
    let listing_options = ListingOptions::new(Arc::new(file_format))
        .with_file_extension(".parquet");

    // Resolve the schema
    let resolved_schema = listing_options
        .infer_schema(session, &table_path)
        .await?;

    let config = ListingTableConfig::new(table_path)
        .with_listing_options(listing_options)
        .with_schema(resolved_schema);

    // Create a new TableProvider
    let provider = Arc::new(ListingTable::try_new(config)?);

    Ok(provider)
}

Fields§

§table_paths: Vec<ListingTableUrl>

§file_schema: SchemaRef

file_schema contains only the columns physically stored in the data files themselves:

  • Represents the actual fields found in files like Parquet, CSV, etc.
  • Used when reading the raw data from files

§table_schema: SchemaRef

table_schema combines file_schema + partition columns:

  • Partition columns are derived from directory paths (not stored in the files)
  • These are columns like year and month, encoded in path segments such as /data/year=2022/month=01/file.parquet

§schema_source: SchemaSource

Indicates how the schema was derived (inferred or explicitly specified)

§options: ListingOptions

Options used to configure the listing table such as the file format and partitioning information

§definition: Option<String>

The SQL definition for this table, if any

§collected_statistics: FileStatisticsCache

Cache for collected file statistics

§constraints: Constraints

Constraints applied to this table

§column_defaults: HashMap<String, Expr>

Column default expressions for columns that are not physically present in the data files

§schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>

Optional SchemaAdapterFactory for creating schema adapters

§expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>

Optional PhysicalExprAdapterFactory for creating physical expression adapters

Implementations§

Source§

impl ListingTable

Source

pub fn try_new(config: ListingTableConfig) -> Result<Self>

Create new ListingTable

See documentation and example on ListingTable and ListingTableConfig

Source

pub fn with_constraints(self, constraints: Constraints) -> Self

Assign constraints
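
For example, a primary key constraint on the first column can be attached as in the following sketch (the column index is an assumption; Constraints::new_unverified records the constraint without checking the data):

use datafusion::common::{Constraint, Constraints};
use datafusion::datasource::listing::ListingTable;

fn with_primary_key(table: ListingTable) -> ListingTable {
    // Declare column 0 as a primary key; DataFusion trusts this rather than verifying it
    table.with_constraints(Constraints::new_unverified(vec![
        Constraint::PrimaryKey(vec![0]),
    ]))
}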

Source

pub fn with_column_defaults(self, column_defaults: HashMap<String, Expr>) -> Self

Assign column defaults
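
A sketch of supplying a default expression for a column that is not present in the data files (the column name and literal value are illustrative):

use std::collections::HashMap;
use datafusion::datasource::listing::ListingTable;
use datafusion::prelude::lit;

fn with_defaults(table: ListingTable) -> ListingTable {
    // Inserts that omit `priority` fall back to the literal 0
    table.with_column_defaults(HashMap::from([("priority".to_string(), lit(0i64))]))
}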

Source

pub fn with_cache(self, cache: Option<FileStatisticsCache>) -> Self

Set the FileStatisticsCache used to cache parquet file statistics.

Setting a statistics cache on the SessionContext can avoid refetching statistics multiple times in the same session.

If None, creates a new DefaultFileStatisticsCache scoped to this query.
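
A sketch of sharing one statistics cache across scans, assuming the DefaultFileStatisticsCache re-exported under datafusion::execution::cache::cache_unit (the exact path may vary between versions):

use std::sync::Arc;
use datafusion::datasource::listing::ListingTable;
use datafusion::execution::cache::cache_unit::DefaultFileStatisticsCache;

fn with_shared_stats_cache(table: ListingTable) -> ListingTable {
    // Reuse the same cache for every scan instead of re-deriving file statistics each time
    table.with_cache(Some(Arc::new(DefaultFileStatisticsCache::default())))
}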

Source

pub fn with_definition(self, definition: Option<String>) -> Self

Specify the SQL definition for this table, if any

Source

pub fn table_paths(&self) -> &Vec<ListingTableUrl>

Get paths ref

Source

pub fn options(&self) -> &ListingOptions

Get options ref

Source

pub fn schema_source(&self) -> SchemaSource

Get the schema source

Source

pub fn with_schema_adapter_factory(self, schema_adapter_factory: Arc<dyn SchemaAdapterFactory>) -> Self

Set the SchemaAdapterFactory for this ListingTable

The schema adapter factory is used to create schema adapters that can handle schema evolution and type conversions when reading files with different schemas than the table schema.

§Example: Adding Schema Evolution Support
let table_with_evolution = table
    .with_schema_adapter_factory(Arc::new(DefaultSchemaAdapterFactory));

See ListingTableConfig::with_schema_adapter_factory for an example of custom SchemaAdapterFactory.

Source

pub fn schema_adapter_factory(&self) -> Option<&Arc<dyn SchemaAdapterFactory>>

Get the SchemaAdapterFactory for this table

Source

fn create_schema_adapter(&self) -> Box<dyn SchemaAdapter>

Creates a schema adapter for mapping between file and table schemas

Uses the configured schema adapter factory if available, otherwise falls back to the default implementation.

Source

fn create_file_source_with_schema_adapter(&self) -> Result<Arc<dyn FileSource>>

Creates a file source and applies schema adapter factory if available

Source

pub fn try_create_output_ordering(&self, execution_props: &ExecutionProps) -> Result<Vec<LexOrdering>>

If file_sort_order is specified, creates the appropriate physical expressions
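
The sort order itself is declared via ListingOptions::with_file_sort_order; a brief sketch (the column name is an assumption):

use std::sync::Arc;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::col;

// Declare that every file is already sorted by `event_time` ascending (nulls last),
// so DataFusion can avoid re-sorting when this ordering is required
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::new()))
    .with_file_sort_order(vec![vec![col("event_time").sort(true, false)]]);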

Source§

impl ListingTable

Source

pub async fn list_files_for_scan<'a>(&'a self, ctx: &'a dyn Session, filters: &'a [Expr], limit: Option<usize>) -> Result<(Vec<FileGroup>, Statistics)>

Get the list of files for a scan as well as the file level statistics. The list is grouped to let the execution plan know how the files should be distributed to different threads / executors.

Source

async fn do_collect_statistics(&self, ctx: &dyn Session, store: &Arc<dyn ObjectStore>, part_file: &PartitionedFile) -> Result<Arc<Statistics>>

Collects statistics for a given partitioned file.

This method first checks if the statistics for the given file are already cached. If they are, it returns the cached statistics. If they are not, it infers the statistics from the file and stores them in the cache.

Trait Implementations§

Source§

impl Clone for ListingTable

Source§

fn clone(&self) -> ListingTable

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for ListingTable

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl TableProvider for ListingTable

Source§

fn as_any(&self) -> &dyn Any

Returns the table provider as Any so that it can be downcast to a specific implementation.
Source§

fn schema(&self) -> SchemaRef

Get a reference to the schema for this table
Source§

fn constraints(&self) -> Option<&Constraints>

Get a reference to the constraints of the table. Returns: Read more
Source§

fn table_type(&self) -> TableType

Get the type of this table for metadata/catalog purposes.
Source§

fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>( &'life0 self, state: &'life1 dyn Session, projection: Option<&'life2 Vec<usize>>, filters: &'life3 [Expr], limit: Option<usize>, ) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait, 'life2: 'async_trait, 'life3: 'async_trait,

Create an ExecutionPlan for scanning the table with optionally specified projection, filter and limit, described below. Read more
Source§

fn scan_with_args<'a, 'life0, 'life1, 'async_trait>( &'life0 self, state: &'life1 dyn Session, args: ScanArgs<'a>, ) -> Pin<Box<dyn Future<Output = Result<ScanResult>> + Send + 'async_trait>>
where Self: 'async_trait, 'a: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Create an ExecutionPlan for scanning the table using structured arguments. Read more
Source§

fn supports_filters_pushdown(&self, filters: &[&Expr]) -> Result<Vec<TableProviderFilterPushDown>>

Specify if DataFusion should provide filter expressions to the TableProvider to apply during the scan. Read more
Source§

fn get_table_definition(&self) -> Option<&str>

Get the create statement used to create this table, if available.
Source§

fn insert_into<'life0, 'life1, 'async_trait>( &'life0 self, state: &'life1 dyn Session, input: Arc<dyn ExecutionPlan>, insert_op: InsertOp, ) -> Pin<Box<dyn Future<Output = Result<Arc<dyn ExecutionPlan>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Return an ExecutionPlan to insert data into this table, if supported. Read more
Source§

fn get_column_default(&self, column: &str) -> Option<&Expr>

Get the default value for a column, if available.
Source§

fn get_logical_plan(&self) -> Option<Cow<'_, LogicalPlan>>

Get the LogicalPlan of this table, if available.
Source§

fn statistics(&self) -> Option<Statistics>

Get statistics for this table, if available. Although not presently used in mainline DataFusion, this allows implementation-specific behavior for downstream repositories, in conjunction with specialized optimizer rules, to perform operations such as re-ordering of joins.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

§

impl<T> Instrument for T

§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided [Span], returning an Instrumented wrapper. Read more
§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
§

impl<T> PolicyExt for T
where T: ?Sized,

§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] only if self and other return Action::Follow. Read more
§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V

§

impl<T> WithSubscriber for T

§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

impl<T> ErasedDestructor for T
where T: 'static,