Skip to content

Commit

Permalink
Merge remote-tracking branch 'origin/develop' into wm/write-stats
Browse files Browse the repository at this point in the history
  • Loading branch information
lwwmanning committed Oct 21, 2024
2 parents 1301771 + 36a7a94 commit a68f003
Show file tree
Hide file tree
Showing 70 changed files with 1,505 additions and 1,311 deletions.
198 changes: 99 additions & 99 deletions Cargo.lock

Large diffs are not rendered by default.

33 changes: 20 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,12 @@
Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
in-memory, on-disk, and over-the-wire.

Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster)
and scans (2-10x faster), while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
It will also support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.

Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
extremely fast, batteries-included.
Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
extremely fast, & batteries-included.

> [!CAUTION]
> This library is still under rapid development and is a work in progress!
Expand Down Expand Up @@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).

In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.

## Components
Expand Down Expand Up @@ -120,10 +125,10 @@ in-memory array implementation, allowing us to defer decompression. Currently, t
Vortex's default compression strategy is based on the
[BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (
recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is
possible to cheaply prune many encodings and ensure the search space does not explode in size.
Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted
(recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
the entire chunk. This sounds like it would be very expensive, but given the logical types and basic statistics about a
chunk, it is possible to cheaply prune many encodings and ensure the search space does not explode in size.

### Compute

Expand Down Expand Up @@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers
and OSS developers.

In particular, the following academic papers greatly influenced the development:
In particular, the following academic papers have strongly influenced development:

* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
Expand All @@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
* Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical
data at YouTube](https://dl.acm.org/citation.cfm?id=3360438). PVLDB, 12(12): 2022-2034, 2019.
* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.


Additionally, we benefited greatly from:

* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).
Expand Down
2 changes: 1 addition & 1 deletion docs/encoding.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,4 @@ Arrays
.. automodule:: vortex.encoding
:members:
:imported-members:
:special-members: __len__
:special-members: __len__, __lt__, __le__, __eq__, __ne__, __ge__, __gt__
2 changes: 1 addition & 1 deletion encodings/alp/src/alp/array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@ use std::fmt::{Debug, Display};
use std::sync::Arc;

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::PrimitiveArray;
use vortex::encoding::ids;
use vortex::iter::{Accessor, AccessorRef};
use vortex::stats::ArrayStatisticsCompute;
use vortex::validity::{ArrayValidity, LogicalValidity, Validity};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoCanonical};
use vortex_dtype::{DType, PType};
use vortex_error::{vortex_bail, vortex_panic, VortexExpect as _, VortexResult};
Expand Down
2 changes: 1 addition & 1 deletion encodings/alp/src/alp_rd/array.rs
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
use std::fmt::{Debug, Display};

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::{PrimitiveArray, SparseArray};
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoCanonical};
use vortex_dtype::{DType, Nullability, PType};
use vortex_error::{vortex_bail, VortexExpect, VortexResult};
Expand Down
2 changes: 1 addition & 1 deletion encodings/bytebool/src/array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,12 @@ use std::mem::ManuallyDrop;

use arrow_buffer::BooleanBuffer;
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::BoolArray;
use vortex::encoding::ids;
use vortex::stats::StatsSet;
use vortex::validity::{ArrayValidity, LogicalValidity, Validity, ValidityMetadata};
use vortex::variants::{ArrayVariants, BoolArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, ArrayTrait, Canonical, IntoCanonical, TypedArray};
use vortex_buffer::Buffer;
use vortex_dtype::DType;
Expand Down
2 changes: 1 addition & 1 deletion encodings/datetime-parts/src/array.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
use std::fmt::{Debug, Display};

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::StructArray;
use vortex::compute::unary::try_cast;
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity};
use vortex::variants::{ArrayVariants, ExtensionArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoCanonical};
use vortex_dtype::{DType, PType};
use vortex_error::{vortex_bail, VortexExpect as _, VortexResult, VortexUnwrap};
Expand Down
2 changes: 1 addition & 1 deletion encodings/dict/src/array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@ use std::fmt::{Debug, Display};

use arrow_buffer::BooleanBuffer;
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::BoolArray;
use vortex::compute::take;
use vortex::compute::unary::scalar_at;
use vortex::encoding::ids;
use vortex::stats::StatsSet;
use vortex::validity::{ArrayValidity, LogicalValidity};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{
impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoArrayVariant,
IntoCanonical,
Expand Down
13 changes: 12 additions & 1 deletion encodings/dict/src/compute.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
use vortex::compute::unary::{scalar_at, scalar_at_unchecked, ScalarAtFn};
use vortex::compute::{slice, take, ArrayCompute, SliceFn, TakeFn};
use vortex::compute::{filter, slice, take, ArrayCompute, FilterFn, SliceFn, TakeFn};
use vortex::{Array, IntoArray};
use vortex_error::{VortexExpect, VortexResult};
use vortex_scalar::Scalar;
Expand All @@ -18,6 +18,10 @@ impl ArrayCompute for DictArray {
fn take(&self) -> Option<&dyn TakeFn> {
Some(self)
}

fn filter(&self) -> Option<&dyn FilterFn> {
Some(self)
}
}

impl ScalarAtFn for DictArray {
Expand Down Expand Up @@ -46,6 +50,13 @@ impl TakeFn for DictArray {
}
}

impl FilterFn for DictArray {
fn filter(&self, predicate: &Array) -> VortexResult<Array> {
let codes = filter(self.codes(), predicate)?;
Self::try_new(codes, self.values()).map(|a| a.into_array())
}
}

impl SliceFn for DictArray {
// TODO(robert): Add function to trim the dictionary
fn slice(&self, start: usize, stop: usize) -> VortexResult<Array> {
Expand Down
2 changes: 1 addition & 1 deletion encodings/fastlanes/src/bitpacking/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,12 @@ use std::fmt::{Debug, Display};
use ::serde::{Deserialize, Serialize};
pub use compress::*;
use fastlanes::BitPacking;
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::{PrimitiveArray, SparseArray};
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity, ValidityMetadata};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoCanonical, TypedArray};
use vortex_buffer::Buffer;
use vortex_dtype::{DType, NativePType, Nullability, PType};
Expand Down
2 changes: 1 addition & 1 deletion encodings/fastlanes/src/delta/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,12 @@ use std::fmt::{Debug, Display};

pub use compress::*;
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::PrimitiveArray;
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity, ValidityMetadata};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoCanonical};
use vortex_dtype::{match_each_unsigned_integer_ptype, NativePType};
use vortex_error::{vortex_bail, vortex_panic, VortexExpect as _, VortexResult};
Expand Down
2 changes: 1 addition & 1 deletion encodings/fastlanes/src/for/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@ use std::fmt::{Debug, Display};

pub use compress::*;
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoCanonical};
use vortex_dtype::{DType, PType};
use vortex_error::{vortex_bail, vortex_panic, VortexExpect as _, VortexResult};
Expand Down
4 changes: 2 additions & 2 deletions encodings/fsst/src/array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,12 @@ use std::sync::Arc;

use fsst::{Decompressor, Symbol};
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::VarBinArray;
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity};
use vortex::variants::{ArrayVariants, BinaryArrayTrait, Utf8ArrayTrait};
use vortex::visitor::AcceptArrayVisitor;
use vortex::{impl_encoding, Array, ArrayDType, ArrayTrait, IntoCanonical};
use vortex_dtype::{DType, Nullability, PType};
use vortex_error::{vortex_bail, VortexExpect, VortexResult};
Expand Down Expand Up @@ -183,7 +183,7 @@ impl FSSTArray {
}

impl AcceptArrayVisitor for FSSTArray {
fn accept(&self, visitor: &mut dyn vortex::visitor::ArrayVisitor) -> VortexResult<()> {
fn accept(&self, visitor: &mut dyn ArrayVisitor) -> VortexResult<()> {
visitor.visit_child("symbols", &self.symbols())?;
visitor.visit_child("symbol_lengths", &self.symbol_lengths())?;
visitor.visit_child("codes", &self.codes())?;
Expand Down
2 changes: 1 addition & 1 deletion encodings/roaring/src/boolean/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,12 @@ pub use compress::*;
use croaring::Native;
pub use croaring::{Bitmap, Portable};
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::BoolArray;
use vortex::encoding::ids;
use vortex::stats::{Stat, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity};
use vortex::variants::{ArrayVariants, BoolArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{impl_encoding, Array, ArrayTrait, Canonical, IntoArray, IntoCanonical, TypedArray};
use vortex_buffer::Buffer;
use vortex_dtype::DType;
Expand Down
2 changes: 1 addition & 1 deletion encodings/roaring/src/integer/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,13 @@ use std::fmt::{Debug, Display};
pub use compress::*;
use croaring::{Bitmap, Portable};
use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::PrimitiveArray;
use vortex::compute::unary::try_cast;
use vortex::encoding::ids;
use vortex::stats::{ArrayStatistics, ArrayStatisticsCompute, Stat, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{
impl_encoding, Array, ArrayDType as _, ArrayTrait, Canonical, IntoArray, IntoCanonical,
TypedArray,
Expand Down
2 changes: 1 addition & 1 deletion encodings/runend-bool/src/array.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
use std::fmt::{Debug, Display};

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::compute::unary::scalar_at;
use vortex::compute::{search_sorted, SearchSortedSide};
use vortex::encoding::ids;
use vortex::stats::{ArrayStatistics, ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity, ValidityMetadata};
use vortex::variants::{ArrayVariants, BoolArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{
impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArrayVariant, IntoCanonical,
};
Expand Down
2 changes: 1 addition & 1 deletion encodings/runend/src/array.rs
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
use std::fmt::{Debug, Display};

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::PrimitiveArray;
use vortex::compute::unary::scalar_at;
use vortex::compute::{search_sorted, search_sorted_u64_many, SearchSortedSide};
use vortex::encoding::ids;
use vortex::stats::{ArrayStatistics, ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity, Validity, ValidityMetadata};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{
impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoArrayVariant,
IntoCanonical,
Expand Down
2 changes: 1 addition & 1 deletion encodings/zigzag/src/array.rs
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
use std::fmt::Display;

use serde::{Deserialize, Serialize};
use vortex::array::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::array::PrimitiveArray;
use vortex::encoding::ids;
use vortex::stats::{ArrayStatisticsCompute, StatsSet};
use vortex::validity::{ArrayValidity, LogicalValidity};
use vortex::variants::{ArrayVariants, PrimitiveArrayTrait};
use vortex::visitor::{AcceptArrayVisitor, ArrayVisitor};
use vortex::{
impl_encoding, Array, ArrayDType, ArrayTrait, Canonical, IntoArray, IntoArrayVariant,
IntoCanonical,
Expand Down
Loading

0 comments on commit a68f003

Please sign in to comment.