In the realm of local-scale data pipelines, where data processing occurs on a single machine or a small cluster, choosing the right tool is paramount. Each tool brings unique strengths and weaknesses to the table, influencing factors such as ease of use, performance, flexibility, and scalability.
In this article we’ll explore three options for small-to-medium-scale aggregation: Spark, chDB, and DuckDB.
Our product, DT Exchange, is a programmatic advertising platform that facilitates the buying and selling of ad inventory between publishers and advertisers, optimizing ad targeting and delivery.
As part of training our internal models, we needed to produce an hourly report about bids that were excluded. To do that, we take an hourly anonymized aggregation of our bid events and extract all the bids that have an “Excluded” status.
Our criteria for delivery are:
Our main tool for this kind of job has always been Spark, but upon examining the size of the hourly anonymized aggregated data (~40 GiB), we’ve decided to explore other options: chDB and DuckDB.
While benchmarks can be a valuable tool for comparing the performance of different systems, it's important to keep in mind that they can also be very use-case specific. This is especially true for broad targets like SQL databases, which can be used for a wide variety of workloads.
The results of this benchmark are relevant to the use case described in the article. In other cases, we may have obtained different results.
It's also important to note that benchmarks are often run on specific hardware and software configurations. The results may not be the same on different hardware or with different software versions.
You may want to run your own benchmarks to see how the different databases perform on your data and queries.
The input data was downloaded from GCS in advance. All the solutions that we’re exploring have good integration with Amazon S3 and GCS, but for the purposes of this article, we’ll use pre-downloaded local data.
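For completeness, reading directly from object storage takes only a few extra statements. Below is a minimal DuckDB sketch for S3 via its httpfs extension; the bucket name, region, and credentials are placeholders, not our real setup:
import duckdb

# Load the httpfs extension so DuckDB can read parquet files from S3-compatible storage.
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
duckdb.sql("SET s3_region = 'us-east-1'")  # placeholder region
# Credentials would normally come from the environment or a secrets manager.
duckdb.sql("SET s3_access_key_id = '...'")
duckdb.sql("SET s3_secret_access_key = '...'")
duckdb.sql("SELECT count(*) FROM 's3://my-bucket/data/*.parquet'").show()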
The input data is 40 GiB over 52 parquet files. Their schema contains 94 columns.
The tests were run on a c2d-standard-16 machine (16 cores, 64 GB of memory, x86).
Our required data can be extracted with the following query (the real field names changed for simplicity):
select fieldA, fieldB, fieldC, fieldD, fieldE
from '../data/*.parquet' where bidStatus ='Excluded'
group by 1, 2, 3, 4, 5
The result should be saved into a single parquet file, with gzip compression and a row_group_size of 15000.
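Once a job has run, it is easy to sanity-check that the output matches these requirements with pyarrow (a rough sketch; output.parquet stands in for whichever file the job produced):
import pyarrow.parquet as pq

# Inspect the parquet footer: number of row groups, rows per group, and compression codec.
meta = pq.ParquetFile("output.parquet").metadata
print("row groups:", meta.num_row_groups)
print("rows in first group:", meta.row_group(0).num_rows)
print("compression:", meta.row_group(0).column(0).compression)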
The DuckDB implementation is very simple and has very good integration with Python. Due to its nature as an embedded database, it supports many other languages as well, but Python is probably the easiest. The implementation requires pip install duckdb and the following script:
import duckdb
query = """copy(
select fieldA, fieldB, fieldC, fieldD, fieldE
from '../data/*.parquet' where bidStatus ='Excluded'
group by 1, 2, 3, 4, 5
)
to './duckdb_local.parquet' (FORMAT 'parquet', COMPRESSION 'gzip', ROW_GROUP_SIZE 15000);"""
duckdb.sql(query)
chDB has a very similar implementation:
import chdb
query = """
select fieldA, fieldB, fieldC, fieldD, fieldE
from '../data/*.parquet' where bidStatus ='Excluded'
group by 1, 2, 3, 4, 5
INTO OUTFILE './chdb_local.parquet'
FORMAT Parquet
SETTINGS
output_format_parquet_compression_method='gzip',
output_format_parquet_row_group_size=15000"""
chdb.query(query)
Now, Spark is a different story. Even if you have (as we do) all the templates and settings for creating a new Spark project quickly, the implementation is still quite verbose. First of all, instead of a single Python script, we must have a project:
spark
├── project
│ ├── build.properties
│ └── plugins.sbt
├── src
│ └── main
│ ├── resources
│ │ └── log4j.properties
│ └── scala
│ └── com
│ └── dt
│ └── spark
│ └── Main.scala
└── build.sbt
The build.sbt alone is roughly 100 lines of code, containing battle-tested exclusions, assembly settings, merge rules, and shading rules. Not all of them are needed for this use case, but figuring out which ones are safe to remove is usually not worth the effort.
Now for the code:
package com.dt.spark
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
object Main {
def localSparkSession(name: String, sparkConf: SparkConf): SparkSession = {
SparkSession
.builder()
.master("local[*]")
.appName(name)
.config(sparkConf)
.getOrCreate()
}
def main(args: Array[String]): Unit = {
val spark = localSparkSession("spark_agg", new SparkConf())
val df = spark.read.parquet("../data/*.parquet")
df.createOrReplaceTempView("df")
val result = spark.sql(
s"""select fieldA, fieldB, fieldC, fieldD, fieldE
|from df where bidStatus ='Excluded'
|group by 1, 2, 3, 4, 5""".stripMargin)
result.coalesce(1).write // we must have it as a single file
.option("compression", "gzip")
.option("parquet.row.group.size", 15000)
.mode(SaveMode.Overwrite)
.parquet("spark_local.parquet")
}
}
This is probably the most difficult to predict, but it could be argued that less code is always easier to maintain than more code. When new requirements arrive, we might find that we have to fall back to the Spark solution, but we should pay that price (and the price of debugging Spark dependency issues) only when we must.
As mentioned above, benchmarks can be misleading. In this case we’ve optimized for developer experience and used mostly default settings. Spark could be tuned further for this specific task, but that would take some effort. Even without tuning, we can still learn a few things.
Running perf stat and /usr/bin/time -v on the code above resulted in the following:
We see that DuckDB crushed the competition, finishing in about half the time that chDB took, while Spark lags far behind at roughly 2.5× DuckDB’s processing time. Let’s try to analyze those numbers:
CPU Utilization & Lower Task Clock Time
Context Switches & CPU Migrations
Page Faults & Memory
To get a rough estimate of the image size, we’ll look at the dependencies required to run each job, such as pyarrow, pandas, and numpy. It’s quite easy to see that DuckDB takes this category, as it is designed as an embedded, lightweight, standalone database.
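A quick way to compare the Python-side footprint is to list each package’s declared requirements (a sketch; the exact output depends on the versions you have installed, and it ignores Spark’s JVM-side dependencies and assembly jar):
from importlib.metadata import requires

# Assumes both packages are installed; requires() returns the dependencies
# declared in the package metadata, or None if none are declared.
for pkg in ("duckdb", "chdb"):
    print(pkg, "->", requires(pkg) or "no declared dependencies")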
Spark's rich ecosystem and support for diverse data sources and formats grant it unparalleled flexibility. Its ability to integrate with various storage systems, stream processing engines, and machine learning libraries makes it a versatile tool for building complex data pipelines. Its strength lies in its code: it is not restricted to SQL, and can be written natively in Scala or Python.
chDB offers moderate flexibility, with its SQL-based interface and support for user-defined functions enabling customization and extension. It’s obviously not as powerful as Spark, but it does support an immense range of capabilities.
DuckDB, being more specialized, exhibits lower flexibility. While it excels at in-memory processing and SQL queries, its support for external data sources and advanced analytics is less extensive than that of Spark and ClickHouse. It’s important to note that DuckDB is a newer product than the others, and its ecosystem of plugins is still growing.
Spark's distributed computing capabilities make it the undisputed champion of scalability. By distributing data and computations across a cluster of machines, Spark can handle massive datasets that exceed the capacity of a single node.
chDB also demonstrates impressive scalability, particularly for analytical workloads. Its ability to partition data, distribute queries, and leverage columnar storage enables efficient processing of large datasets on a single machine. For a clustered solution, ClickHouse is very powerful, though it can be tricky to get working.
DuckDB, being single-node, has inherent limitations in scalability. While it can handle substantial datasets that fit in memory, its performance may degrade as data volumes grow beyond the capacity of a single machine.
For small to medium-sized datasets and straightforward queries, DuckDB's ease of use and speed make it a compelling choice. Its minimal setup and fast performance can significantly accelerate time-to-market, allowing developers to quickly deploy fast solutions.
When dealing with more complex queries or larger datasets that still reside on a single machine, chDB (or clickhouse-local) emerges as a strong contender. chDB offers a mature and extensive set of capabilities, including support for complex data types, advanced indexing techniques, and efficient query processing. It has more capabilities than DuckDB, at a similar or better performance level. A good example of chDB’s capabilities is its array functions (e.g., arrayJoin), which perform significantly better than DuckDB’s equivalents.
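As a toy illustration of the syntax (not a benchmark), the same array expansion looks like this in both engines:
import chdb
import duckdb

# ClickHouse dialect: arrayJoin() expands an array into one row per element.
print(chdb.query("SELECT arrayJoin([1, 2, 3]) AS n", "CSV"))

# DuckDB's closest equivalent is unnest().
print(duckdb.sql("SELECT unnest([1, 2, 3]) AS n"))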
For truly massive datasets that surpass the capacity of a single machine, Spark's distributed computing prowess provides the most robust solution. Its ability to scale horizontally across a cluster enables processing of petabytes of data, making it suitable for large-scale data warehousing, machine learning, and real-time analytics.
For our stated problem we ended up using DuckDB. It was quick to implement, and due to its low overhead and startup time, could be scheduled to run every few minutes with a Kubernetes CronJob.
For other pipelines we ended up using clickhouse-local, due to its amazing array handling functions.
That said, for most of our pipelines we use Spark as it's the only solution that can easily handle the vast amount of data we process. Even the input for the job in this article is generated by a Spark job.
Generally speaking, in the realm of local-scale data pipelines, there is no one-size-fits-all solution. The optimal tool hinges on your specific requirements, data volumes, and query complexity. For many use cases, DuckDB can be a surprisingly powerful and efficient option, delivering rapid results with minimal setup. As your needs evolve and data volumes grow, ClickHouse and Spark offer avenues for scaling up without compromising performance. By carefully evaluating your requirements and understanding the strengths and weaknesses of each tool, you can confidently choose the right solution for your local-scale data pipeline, unlocking valuable insights and driving data-driven decision-making.