get_projected_output_ordering

Function get_projected_output_ordering 

Source
fn get_projected_output_ordering(
    base_config: &FileScanConfig,
    projected_schema: &SchemaRef,
) -> Vec<LexOrdering>
Expand description

The various listing tables does not attempt to read all files concurrently, instead they will read files in sequence within a partition. This is an important property as it allows plans to run against 1000s of files and not try to open them all concurrently.

However, it means if we assign more than one file to a partition the output sort order will not be preserved as illustrated in the following diagrams:

When only 1 file is assigned to each partition, each partition is correctly sorted on (A, B, C)

┏ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┓
  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐
┃   ┌───────────────┐     ┌──────────────┐ │   ┌──────────────┐ │   ┌─────────────┐   ┃
  │ │   1.parquet   │ │ │ │  2.parquet   │   │ │  3.parquet   │   │ │  4.parquet  │ │
┃   │ Sort: A, B, C │     │Sort: A, B, C │ │   │Sort: A, B, C │ │   │Sort: A, B, C│   ┃
  │ └───────────────┘ │ │ └──────────────┘   │ └──────────────┘   │ └─────────────┘ │
┃                                          │                    │                     ┃
  │                   │ │                    │                    │                 │
┃                                          │                    │                     ┃
  │                   │ │                    │                    │                 │
┃                                          │                    │                     ┃
  │                   │ │                    │                    │                 │
┃  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃
     DataFusion           DataFusion           DataFusion           DataFusion
┃    Partition 1          Partition 2          Partition 3          Partition 4       ┃
 ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━

                                     DataSourceExec

However, when more than 1 file is assigned to each partition, each partition is NOT correctly sorted on (A, B, C). Once the second file is scanned, the same values for A, B and C can be repeated in the same sorted stream

 ┏ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━
   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃
 ┃   ┌───────────────┐     ┌──────────────┐ │
   │ │   1.parquet   │ │ │ │  2.parquet   │   ┃
 ┃   │ Sort: A, B, C │     │Sort: A, B, C │ │
   │ └───────────────┘ │ │ └──────────────┘   ┃
 ┃   ┌───────────────┐     ┌──────────────┐ │
   │ │   3.parquet   │ │ │ │  4.parquet   │   ┃
 ┃   │ Sort: A, B, C │     │Sort: A, B, C │ │
   │ └───────────────┘ │ │ └──────────────┘   ┃
 ┃                                          │
   │                   │ │                    ┃
 ┃  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
      DataFusion           DataFusion         ┃
 ┃    Partition 1          Partition 2
  ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┛

              DataSourceExec