Polars & DuckDB: The New Data Science Stack

Beyond Pandas: How Polars and DuckDB are Rewriting the Data Science Toolchain

For the better part of a decade, the Python data science stack has rested on three sturdy pillars: NumPy for numbers, Pandas for tabular data, and Scikit-learn for modeling. If you were a data scientist in 2018, your import statements were almost a ritual: import pandas as pd.

But the landscape has shifted. Data isn’t getting smaller; it’s exploding into the terabytes. The “out-of-core” tricks we used to play (like chunking CSVs) are becoming unsustainable. Enter a new generation of tools designed not for 2015 laptops, but for modern multi-core processors and the cloud.

Two names dominate this evolution: Polars (the lightning-fast DataFrame library) and DuckDB (the in-process analytical database). Together, they represent the most significant shift in the data science workflow since the invention of the Jupyter Notebook.

This deep dive explores why the toolchain is evolving, how Polars and DuckDB solve the “big data on a laptop” problem, and—crucially—how to use them together to reclaim your weekends from slow-running scripts.


Part 1: The Old Guard – Why Pandas Became a Bottleneck

To understand the future, we must respect the past. Pandas is not “bad.” It is a masterpiece of API design for interactive data analysis. It allowed analysts to write df[df['column'] > 5].groupby('other').mean() in a way that felt like magic.

The Vectorized Mirage

Pandas relies heavily on vectorized operations via NumPy. While far faster than Python loops, it suffers from a persistent flaw: eager copying (copy-on-write semantics only arrived in recent releases) and memory bloat. Every time you filter or transform a large DataFrame, Pandas often materializes an entirely new copy in RAM.

The String Garbage Collector Problem

If you have a column of text (strings), Pandas stores them as Python objects. Every string operation forces Python’s memory management to wake up, slowing everything down.

The “No Free Lunch” Syntax

The .apply() function with a lambda? It’s a Python loop in disguise. For 10 million rows, that loop takes minutes.

The Result

Data scientists developed coping habits: sampling and filtering aggressively just to fit in RAM, writing convoluted groupby-apply chains, and eventually giving up, moving to Spark, and crying over cluster configuration logs.

Pandas hit a wall. It was time for a rewrite—but not a minor patch. A paradigm rewrite.


Part 2: The New Pillars – The Rise of Polars and DuckDB

The industry realized two things:

  1. Moore’s Law is now about cores, not clock speed. We need tools that use all CPU cores (parallelism).
  2. RAM is not infinite. We need tools that can spill to disk (out-of-core processing) without manual chunking.

Two distinct philosophies emerged to solve these problems.

Polars: The “Rust-powered, Blazing Fast” DataFrame

Polars is not a Pandas fork. It is a complete rewrite in Rust, using the Apache Arrow memory format as its foundation.

Key Innovations:

  • Apache Arrow: A columnar memory format that is cache-efficient and language-agnostic. Polars doesn’t have to parse strings repeatedly; it stores them as Arrow arrays.
  • Lazy Execution: Polars has an eager API (like Pandas) and a lazy API. With lazy execution, you build a query plan. Polars optimizes it (predicate pushdown, projection pushdown) before executing.
  • Parallelism: By default, Polars uses all your CPU cores. A groupby operation is automatically split across cores.

DuckDB: The “SQLite for Analytics”

DuckDB is an embedded columnar database. You install it via pip, and it runs inside your Python process. No server. No cluster. No JVM overhead.

Key Innovations:

  • Columnar Storage: Unlike SQLite (row-based), DuckDB stores data column by column. Queries that aggregate one column (e.g., SELECT SUM(sales)) only touch that specific column on disk, making it screaming fast.
  • Vectorized Execution Engines: DuckDB processes data in batches (vectors) rather than row-by-row.
  • Zero Serialization: DuckDB can query Pandas DataFrames, Polars DataFrames, and Parquet files directly without importing them into a database first (see the sketch below).
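
Here is a minimal sketch of what “zero serialization” means in practice: DuckDB can see a DataFrame defined in your Python session by name and scan it in place (the df variable and its columns below are illustrative):

import duckdb
import pandas as pd

# A plain Pandas DataFrame living in your Python session
df = pd.DataFrame({"region": ["EU", "US", "US"], "sales": [100, 250, 175]})

# DuckDB finds 'df' by name and scans it in place; nothing is copied into a database
duckdb.sql("SELECT region, SUM(sales) AS total FROM df GROUP BY region").show()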

Part 3: Deep Dive – Polars Syntax and Philosophy

If you are moving from Pandas to Polars, you will feel uncomfortable for about one hour. Then you will never want to go back.

The “Ownership” Model (Rust influences)

Polars is strict about mutability. You cannot modify a DataFrame in place. Every operation returns a new DataFrame. This sounds memory-heavy, but because Polars shares the underlying Arrow buffers between DataFrames, a “new” DataFrame reuses existing memory wherever possible.

Bad Pandas (in-place):

df['new_col'] = df['old_col'] * 2

Good Polars (expression-based):

df.with_columns(
    (pl.col("old_col") * 2).alias("new_col")
)

The Power of Expressions

Polars’ superpower is that expressions are first-class objects. You can store them in variables and reuse them.

# Pandas way (hard to reuse)
df['total'] = df['price'] * df['qty']
df['tax'] = df['total'] * 0.1

# Polars way (composable)
price_expr = pl.col("price") * pl.col("qty")
tax_expr = price_expr * 0.1

df.with_columns([
    price_expr.alias("total"),
    tax_expr.alias("tax")
])

Lazy Execution in Practice

This is where Polars beats Pandas for large files (e.g., 50GB of CSV).

import polars as pl

# Lazy: No data is read yet.
q = (
    pl.scan_csv("huge_sales_*.csv")  # Scan multiple files
    .filter(pl.col("date") > "2023-01-01")
    .group_by("product_id")
    .agg([
        pl.col("revenue").sum(),
        pl.col("units").mean()
    ])
    .sort("revenue", descending=True)
)

# Execute: Only now does the work happen.
result = q.collect()

What happens in the background? Polars rewrites your query. It pushes the date filter down to the CSV reader (so it never materializes rows from 2022). It reads only the columns the query actually needs (projection pushdown). And it splits the files across threads.
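
You don’t have to take the optimizer’s word for it. A LazyFrame exposes an explain() method that prints the plan it will actually run (shown here for the query q built above):

# Print the optimized plan: the date filter and the column selection
# should appear inside the CSV scan node instead of as separate steps.
print(q.explain())

# Compare with the naive, unoptimized plan
print(q.explain(optimized=False))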

Polars vs Pandas: The Speed Test

  • Pandas on 10GB CSV: Likely crashes (MemoryError) or swaps to disk and runs for 10 minutes.
  • Polars on 10GB CSV (Lazy): Streams through memory, uses cores, finishes in 1-2 minutes.



Part 4: Deep Dive – DuckDB, The SQL Renaissance

For years, data scientists looked down on SQL as “not real programming.” Then they tried to do a complex window function in Pandas and cried. DuckDB is leading a SQL revival within Python.

The “No Data Movement” Heaven

The biggest crime in data science is unnecessary data copying. Moving a DataFrame from Pandas to SQLite to CSV back to Pandas is wasteful.

DuckDB lets you query anything, anywhere, without moving it.

import duckdb
import polars as pl

# Create a Polars DataFrame
df_polars = pl.DataFrame({"a": [1,2,3], "b": ["x","y","z"]})

# Query it directly with DuckDB
result = duckdb.sql("""
    SELECT a, b, a * 2 as a_doubled
    FROM df_polars
    WHERE a > 1
""").pl()  # Convert result back to Polars

print(result)

DuckDB + Parquet = The Ultimate Data Lakehouse on a Laptop

Parquet is a compressed, columnar file format. DuckDB can read Parquet files as if they were database tables.

Imagine you have 10 Parquet files on S3 (or your local SSD). With DuckDB, you don’t need to load them; for S3 paths, DuckDB’s httpfs extension handles the remote reads.

# No ingestion step. No database server.
duckdb.sql("""
    SELECT 
        year, 
        SUM(sales) as total_sales
    FROM 's3://my-bucket/sales_*.parquet'
    GROUP BY year
    ORDER BY year DESC
""")

Performance: Because both DuckDB and Parquet are columnar, DuckDB only reads the year and sales columns from the files. It skips the other 200 columns entirely. This is magical.
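
If you want to see the column pruning yourself, DuckDB’s EXPLAIN shows which columns the Parquet scan will actually touch (the file name below is illustrative):

# The Parquet scan node in the plan should list only 'year' and 'sales' as projected columns
duckdb.sql("""
    EXPLAIN
    SELECT year, SUM(sales) AS total_sales
    FROM 'sales_2023.parquet'
    GROUP BY year
""").show()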

DuckDB’s Killer Feature: The WITH Clause and Window Functions

Complex data transformations that require 50 lines of Pandas .merge() and .groupby().transform() become a 10-line SQL query.

Pandas (Messy):

# Rank sales within each region
df['rank'] = df.groupby('region')['sales'].rank(method='dense', ascending=False)

DuckDB (Elegant):

SELECT 
    *,
    DENSE_RANK() OVER (PARTITION BY region ORDER BY sales DESC) as rank
FROM df



Part 5: The Combined Workflow – Polars + DuckDB Stack

Here is the secret sauce. The modern data scientist should not choose one. They should use Polars for complex transformations and DuckDB for SQL-friendly aggregations.

The ETL Pattern: DuckDB ingests, Polars transforms

Scenario: You have 1TB of IoT sensor logs in JSON lines format.

Step 1: Ingest and Filter with DuckDB (Disk-conscious)
DuckDB is excellent at reading JSON and writing Parquet.

import duckdb

# Filter down the 1TB to 10GB of interesting data
duckdb.sql("""
    COPY (
        SELECT sensor_id, timestamp, temperature, humidity
        FROM 'raw_data/*.json'
        WHERE temperature > 100  -- Anomaly detection
        AND timestamp >= '2024-01-01'
    ) TO 'filtered_data.parquet' (FORMAT PARQUET)
""")

Step 2: Complex Wrangling with Polars (CPU-bound)
Now you have 10GB in Parquet. You need to calculate a rolling median per sensor and detect seasonality. Polars shines here.

import polars as pl

df = pl.scan_parquet("filtered_data.parquet")

# Complex rolling window
result = (
    df
    .sort("timestamp")
    .group_by("sensor_id")
    .agg([
        pl.col("temperature").rolling_median(window_size=10).alias("temp_median"),
        pl.col("humidity").mean().alias("avg_humidity")
    ])
    .collect()
)

Step 3: Final Reporting with DuckDB (SQL friendly)
The final aggregated result (now 50MB) needs to be joined with a product catalog for a BI dashboard. You could use Polars, but SQL is cleaner.

catalog = pl.read_csv("product_catalog.csv")

final_report = duckdb.sql("""
    SELECT 
        c.product_name,
        r.temp_median,
        r.avg_humidity
    FROM result r
    JOIN catalog c ON r.sensor_id = c.sensor_id
""").pl()

final_report.write_csv("dashboard_ready.csv")

Why does this beat the “Everything in Spark” approach?

Spark is powerful, but it requires a cluster. For 80% of medium-data problems (1GB to 500GB), Polars + DuckDB runs faster on a single $100/month cloud instance than Spark runs on a $1000/month cluster. No serialization to Java. No shuffle overhead.


Part 6: Practical Migration – From Pandas to the New Stack

If you are reading this, you likely have a legacy Pandas script that fails every Friday because it runs out of memory. Here is how to migrate it.

Rule 1: Replace .apply() with .map_batches() or Expressions

Pandas:

def clean_text(x):
    return x.lower().strip()

df['clean'] = df['messy_column'].apply(clean_text)

Polars:

# Polars runs this on Rust strings, not Python objects
df.with_columns(
    pl.col("messy_column").str.to_lowercase().str.strip_chars().alias("clean")
)

Typical speedup: on the order of 100x

Rule 2: Replace Looping Group-bys with Window Expressions

Pandas (Slow):

groups = []
for name, group in df.groupby('user_id'):
    group['shifted'] = group['value'].shift(1)
    groups.append(group)
result = pd.concat(groups)

Polars (Fast):

result = df.with_columns(
    pl.col("value").shift(1).over("user_id")
)

Typical speedup: on the order of 1000x

Rule 3: When to say “No” to Polars (and use DuckDB)

Polars struggles with huge joins (e.g., joining a 2GB table with a 200GB table). It will try to bring everything into memory.
Solution: Do the join in DuckDB, which spills to disk.

# DuckDB handles the 200GB join gracefully, spilling to disk when needed
# ('large_table' and 'small_table' stand in for Parquet files or registered DataFrames)
duckdb.sql("""
    SELECT a.*, b.metadata
    FROM large_table a
    JOIN small_table b ON a.id = b.id
""")

Part 7: Real-World Benchmarks and Use Cases

Let’s get concrete. Here are three scenarios where the new toolchain excels.

Use Case 1: The “NYC Taxi Data” (200GB of Parquet)

  • Pandas: Impossible. Crashes on load.
  • Spark (Local mode): Slow setup. 45 minutes.
  • DuckDB: 4 minutes to query “Total tips by hour of day.”
  • Polars: 6 minutes to compute “Rolling average fare by pickup zone.”

Winner: DuckDB for SQL, Polars for time-series.
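
For reference, here is a minimal sketch of the kind of DuckDB query behind the “Total tips by hour of day” number above, assuming the standard TLC trip-record schema (tip_amount, tpep_pickup_datetime) and local Parquet files:

import duckdb

duckdb.sql("""
    SELECT
        date_part('hour', tpep_pickup_datetime) AS hour_of_day,
        SUM(tip_amount) AS total_tips
    FROM 'nyc_taxi/*.parquet'
    GROUP BY hour_of_day
    ORDER BY hour_of_day
""").show()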

Use Case 2: Real-time Log Analysis (Streaming 10k events/sec)

You need to calculate a 5-minute sliding window.

  • Pandas: Not built for streaming.
  • Polars: Excellent. Using .group_by_dynamic() with every="1s" and period="5m" processes streams efficiently (sketched below).
  • DuckDB: Works, but requires re-running queries; not designed for incremental updates.

Winner: Polars.
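
A minimal sketch of that sliding-window pattern, assuming a hypothetical events DataFrame with a ts timestamp column and a latency_ms metric:

import polars as pl
from datetime import datetime, timedelta

# Hypothetical event stream: one row per event
events = pl.DataFrame({
    "ts": [datetime(2024, 1, 1) + timedelta(seconds=i) for i in range(600)],
    "latency_ms": [float(i % 50) for i in range(600)],
}).sort("ts")

# A 5-minute window, advanced every second
windows = events.group_by_dynamic("ts", every="1s", period="5m").agg(
    pl.col("latency_ms").mean().alias("avg_latency")
)
print(windows)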

Use Case 3: The Data Scientists’ Notebook (Interactive Exploration)

You have a 5GB CSV. You want to check df.columns and df.describe().

  • Pandas: Loads 5GB into RAM. Takes 30 seconds.
  • Polars (Lazy): scan_csv returns instantly. collect() only when needed.
  • DuckDB: SELECT * FROM 'file.csv' LIMIT 10 returns almost instantly, because DuckDB streams the file and stops after the first 10 rows.

Winner: Tie. Use Polars for EDA with lazy frames; use DuckDB for ad-hoc SQL.
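
A quick sketch of the lazy EDA pattern on that 5GB CSV (the file name is illustrative):

import polars as pl

lf = pl.scan_csv("big_file.csv")        # returns instantly; nothing is read yet
print(lf.collect_schema().names())      # column names without loading the data
print(lf.head(10).collect())            # materialize only the first 10 rows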


Part 8: The Future – A Unified Arrow Ecosystem

The most exciting development is that Polars, DuckDB, and Pandas 2.0 are all converging on the Apache Arrow memory format.

Zero-Copy Sharing

Already, you can:

  1. Read a Parquet file with DuckDB (and materialize the result as an Arrow table).
  2. Pass the Arrow table to Polars (no copy).
  3. Convert to Pandas 2.0 (zero copy, if dtype_backend='pyarrow').

This means the “toolchain” is no longer a series of painful conversions (df.to_csv() -> load again). It is a seamless pipeline.
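
A minimal sketch of that pipeline (the file name is illustrative), moving one result from DuckDB through Arrow into Polars and on to Arrow-backed Pandas:

import duckdb
import polars as pl

# DuckDB query result materialized as an Arrow table (no CSV round-trip)
arrow_tbl = duckdb.sql("SELECT * FROM 'events.parquet' LIMIT 1000").arrow()

# Arrow table -> Polars DataFrame (zero-copy where the types allow it)
df_pl = pl.from_arrow(arrow_tbl)

# Polars -> Pandas with Arrow-backed dtypes (Pandas 2.x)
df_pd = df_pl.to_pandas(use_pyarrow_extension_array=True)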

What about Dask and Ray?

Dask and Ray solve distributed computing. Polars and DuckDB solve single-node efficiency. They are complementary. In fact, you can use Dask to launch many Polars or DuckDB workers. The industry is moving toward “distributed Polars” experiments, but for now, a single powerful machine with Polars/DuckDB beats a cluster of weak machines for 95% of jobs.


Part 9: Getting Started – A 30-Day Roadmap

Ready to upgrade your toolchain? Here is a painless migration path.

Week 1: Replace read_csv with pl.read_csv

Just swap the import. Many everyday operations (selecting columns, filtering rows, head(), describe()) have near-identical equivalents on pl.DataFrame. Learn the .with_columns syntax (a minimal before/after follows).
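
A minimal before/after for Week 1 (file and column names are illustrative):

import polars as pl

# Was: df = pd.read_csv("sales.csv")
df = pl.read_csv("sales.csv")

# Was: df["total"] = df["price"] * df["qty"]
df = df.with_columns((pl.col("price") * pl.col("qty")).alias("total"))
print(df.head())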

Week 2: Install DuckDB and run duckdb.sql("SELECT * FROM df")

Start by using DuckDB as a “calculator” for complex aggregations on your existing Pandas DataFrames. You don’t need to change your ETL yet.

Week 3: Rewrite one slow .groupby().apply() using Polars expressions.

Find the slowest cell in your notebook. Rewrite it using pl.col().over(). Measure the speedup. Show your boss.

Week 4: Migrate a whole pipeline to DuckDB + Parquet.

Take a weekly job that takes 2 hours. Write it as a DuckDB SQL script that reads from CSV and writes to Parquet. Then use Polars for the final model features. Watch it run in 4 minutes.
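
A minimal sketch of such a weekly job, assuming the raw exports live under a hypothetical weekly_exports/ folder:

import duckdb

# Read every CSV in the folder and rewrite it as a single Parquet file in one statement
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('weekly_exports/*.csv'))
    TO 'weekly_exports.parquet' (FORMAT PARQUET)
""")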

Pro Tip: Use polars.Config for debugging.

pl.Config.set_verbose(True)  # Verbose logging while queries run (use .explain() to see the plan)
pl.Config.set_streaming_chunk_size(1000000)  # Control memory usage

Conclusion: The End of the “Big Data” Excuse

For years, data scientists used “big data” as an excuse for slow models and messy code. “We can’t use Python; we need Spark.” “My laptop can’t handle it.”

Polars and DuckDB have demolished that excuse.

They prove that single-node performance is not dead—it was just poorly implemented. By leveraging Rust (Polars) and vectorized execution (DuckDB), these tools bring supercomputer-level I/O optimization to a pip install.

The evolution of the data science toolchain is not about bigger clusters. It is about smarter algorithms that respect the hardware. It is about moving from copying data to pointing to data. It is about writing expressions, not loops.

The Stack for 2025 and beyond:

  • Storage: Parquet (via Hugging Face datasets or S3)
  • Processing: Polars (complex wrangling) + DuckDB (SQL aggregations)
  • Visualization: Plotly / Altair (which now read Arrow directly)
  • Modeling: Scikit-learn / XGBoost (which accept Pandas 2.0 Arrow-backed data)

Stop fighting your memory limits. Stop waiting for CSV parses. Install Polars, install DuckDB, and take back your afternoon.

Call to Action:
Open your terminal. Run pip install polars duckdb. Go to your nastiest, slowest notebook. Rewrite the first cell. You will never look back.
