fn get_projected_output_ordering(
base_config: &FileScanConfig,
projected_schema: &SchemaRef,
) -> Vec<LexOrdering>Expand description
The various listing tables does not attempt to read all files concurrently, instead they will read files in sequence within a partition. This is an important property as it allows plans to run against 1000s of files and not try to open them all concurrently.
However, it means if we assign more than one file to a partition the output sort order will not be preserved as illustrated in the following diagrams:
When only 1 file is assigned to each partition, each partition is
correctly sorted on (A, B, C)
┏ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┓
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐
┃ ┌───────────────┐ ┌──────────────┐ │ ┌──────────────┐ │ ┌─────────────┐ ┃
│ │ 1.parquet │ │ │ │ 2.parquet │ │ │ 3.parquet │ │ │ 4.parquet │ │
┃ │ Sort: A, B, C │ │Sort: A, B, C │ │ │Sort: A, B, C │ │ │Sort: A, B, C│ ┃
│ └───────────────┘ │ │ └──────────────┘ │ └──────────────┘ │ └─────────────┘ │
┃ │ │ ┃
│ │ │ │ │ │
┃ │ │ ┃
│ │ │ │ │ │
┃ │ │ ┃
│ │ │ │ │ │
┃ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┃
DataFusion DataFusion DataFusion DataFusion
┃ Partition 1 Partition 2 Partition 3 Partition 4 ┃
━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━
DataSourceExecHowever, when more than 1 file is assigned to each partition, each
partition is NOT correctly sorted on (A, B, C). Once the second
file is scanned, the same values for A, B and C can be repeated in
the same sorted stream
┏ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┃
┃ ┌───────────────┐ ┌──────────────┐ │
│ │ 1.parquet │ │ │ │ 2.parquet │ ┃
┃ │ Sort: A, B, C │ │Sort: A, B, C │ │
│ └───────────────┘ │ │ └──────────────┘ ┃
┃ ┌───────────────┐ ┌──────────────┐ │
│ │ 3.parquet │ │ │ │ 4.parquet │ ┃
┃ │ Sort: A, B, C │ │Sort: A, B, C │ │
│ └───────────────┘ │ │ └──────────────┘ ┃
┃ │
│ │ │ ┃
┃ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
DataFusion DataFusion ┃
┃ Partition 1 Partition 2
━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ┛
DataSourceExec