Data Engineering with Python and Pandas: Production Patterns
Pandas is powerful and easy to misuse. Here are the patterns that separate notebooks that work from pipelines that survive production.
Pandas is the lingua franca of data manipulation in Python. Almost every data engineer knows it. Far fewer know how to use it in a way that does not collapse under production data volumes, consume 10x the necessary memory, or run so slowly that the pipeline misses its window.
This post covers the patterns that matter for production pandas work: memory management, chunked processing, method chaining, vectorization, type discipline, and the clear-eyed view of when to stop using pandas entirely.
Memory Optimization: Read Less, Use Less
Pandas' default dtype inference is conservative in the wrong direction: it uses int64 for integers and float64 for floats regardless of the actual value range. A column with values 0-100 stored as int64 uses 8x more memory than it needs.
import pandas as pd
import numpy as np
def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and categorize low-cardinality strings."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    # Convert low-cardinality strings to category
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.1:  # < 10% unique
            df[col] = df[col].astype('category')
    return df
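As a quick sanity check (the column names here are hypothetical), compare memory_usage(deep=True) before and after downcasting:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: small-range ints and a low-cardinality string column
df = pd.DataFrame({
    'score': np.random.randint(0, 100, 1_000_000),        # values fit in int8
    'city': np.random.choice(['NYC', 'LA', 'SF'], 1_000_000),
})

before = df.memory_usage(deep=True).sum()
df['score'] = pd.to_numeric(df['score'], downcast='integer')
df['city'] = df['city'].astype('category')
after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.0f}MB -> {after / 1e6:.0f}MB")
```

The string column dominates: object columns store a full Python string per row, while category stores small integer codes plus one copy of each label.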
# Specify dtypes at read time (fastest approach)
dtypes = {
    'user_id': 'int32',
    'event_type': 'category',
    'amount': 'float32',
    'is_active': 'bool'
}
df = pd.read_csv('events.csv', dtype=dtypes)

Only read the columns you need. If a CSV has 50 columns and your pipeline uses 8, pass usecols to read_csv. This reduces memory and parse time proportionally.
df = pd.read_csv(
    'events.csv',
    usecols=['event_id', 'user_id', 'event_type', 'occurred_at', 'amount'],
    dtype={'event_type': 'category', 'amount': 'float32'},
    parse_dates=['occurred_at']
)

Chunked Processing for Large Files
When a file does not fit in memory, process it in chunks. The pandas chunksize parameter returns an iterator over DataFrame chunks rather than loading the full file.
def process_large_csv(filepath: str, output_path: str) -> None:
    """Aggregate a large CSV chunk by chunk, then combine and write once."""
    chunk_size = 100_000
    results = []
    for chunk in pd.read_csv(filepath, chunksize=chunk_size):
        # Aggregate within the chunk
        processed = (
            chunk
            .query("amount > 0")
            .assign(
                amount_usd=lambda df: df['amount'] / 100,
                occurred_date=lambda df: pd.to_datetime(df['occurred_at']).dt.date
            )
            .groupby(['user_id', 'occurred_date'])
            .agg(
                daily_spend=('amount_usd', 'sum'),
                transaction_count=('event_id', 'count')
            )
            .reset_index()
        )
        results.append(processed)
    # Combine all chunks and re-aggregate across chunk boundaries
    final = (
        pd.concat(results, ignore_index=True)
        .groupby(['user_id', 'occurred_date'])
        .sum()
        .reset_index()
    )
    final.to_parquet(output_path, index=False)

Note the two-stage aggregation: aggregate within each chunk, then re-aggregate across chunks. This handles the case where a user's rows span multiple chunks and their daily totals would otherwise be split.
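A minimal sketch of why the second pass matters, with a chunk boundary that deliberately splits one user's rows (toy data, hypothetical column names):

```python
import pandas as pd

full = pd.DataFrame({
    'user_id': [1, 1, 2],
    'occurred_date': ['2024-01-01'] * 3,
    'amount_usd': [5.0, 7.0, 3.0],
})

# Simulate a chunk boundary that splits user 1's rows across chunks
chunks = [full.iloc[:1], full.iloc[1:]]
partials = [
    c.groupby(['user_id', 'occurred_date'], as_index=False)['amount_usd'].sum()
    for c in chunks
]

# Without the second groupby, user 1 would appear twice (5.0 and 7.0)
combined = (
    pd.concat(partials, ignore_index=True)
    .groupby(['user_id', 'occurred_date'], as_index=False)
    .sum()
)
print(combined)
```

This works because sums compose: a sum of per-chunk sums equals the sum over all rows. The same trick does not work directly for means or medians, which need to carry counts or raw values across chunks.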
Method Chaining
Method chaining produces readable, debuggable pipelines. Instead of creating intermediate variables for each transformation step, chain operations with a consistent structure:
# Instead of this:
df1 = df[df['status'] == 'active']
df2 = df1.dropna(subset=['email'])
df3 = df2.rename(columns={'user_id': 'id'})
df4 = df3.assign(full_name=df3['first_name'] + ' ' + df3['last_name'])

# Write this:
result = (
    df
    .query("status == 'active'")
    .dropna(subset=['email'])
    .rename(columns={'user_id': 'id'})
    .assign(full_name=lambda df: df['first_name'] + ' ' + df['last_name'])
    .reset_index(drop=True)
)
# Debug a chain step without breaking it
def log_shape(df, label=''):
    print(f"{label}: {df.shape}")
    return df
result = (
    df
    .query("status == 'active'")
    .pipe(log_shape, 'after filter')
    .dropna(subset=['email'])
    .pipe(log_shape, 'after dropna')
    .assign(full_name=lambda df: df['first_name'] + ' ' + df['last_name'])
)

The .pipe() method is the key to keeping chains clean: it lets you call any function that takes a DataFrame as its first argument from within the chain, including logging helpers and custom transformations.
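pipe also forwards extra arguments to the function, which makes parameterized helpers chainable. A sketch (winsorize is a made-up helper for illustration, not a pandas method):

```python
import pandas as pd

def winsorize(df: pd.DataFrame, col: str, upper: float) -> pd.DataFrame:
    # Returns a new frame with extreme values clipped, so the chain stays pure
    return df.assign(**{col: df[col].clip(upper=upper)})

df = pd.DataFrame({'amount': [10, 20, 9_999]})
result = df.pipe(winsorize, col='amount', upper=100)
print(result['amount'].tolist())  # → [10, 20, 100]
```

Returning a new DataFrame from each helper (rather than mutating the input) is what keeps these chains safe to reorder and debug.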
Vectorization vs. Apply
The most common pandas performance mistake is using .apply() where a vectorized operation exists. apply runs your Python function once per row or element, which is slow. Vectorized operations dispatch to NumPy under the hood and run at C speed.
# Slow: Python loop via apply
df['revenue_tier'] = df['revenue'].apply(
    lambda x: 'high' if x > 10000 else 'medium' if x > 1000 else 'low'
)

# Fast: vectorized with np.select
conditions = [df['revenue'] > 10000, df['revenue'] > 1000]
choices = ['high', 'medium']
df['revenue_tier'] = np.select(conditions, choices, default='low')

# Slow: string manipulation via apply
df['domain'] = df['email'].apply(lambda x: x.split('@')[1])

# Fast: vectorized string method
df['domain'] = df['email'].str.split('@').str[1]

# Rough benchmark on 1M rows:
# apply: ~3-5 seconds
# vectorized: ~50-100ms

When you genuinely need row-level Python logic, apply is sometimes the right call. But check for a vectorized equivalent first. NumPy's where and select, plus the pandas str and dt accessors, cover a large portion of apply use cases.
Merge Patterns and Performance
Pandas merge is a full in-memory join. For large DataFrames, the memory requirement can be 2-3x the size of both inputs combined. A few patterns that help:
# Sort before merge if you will iterate the result in order
# (sorting the smaller inputs is cheaper than sorting the merged result)
left = left.sort_values('join_key')
right = right.sort_values('join_key')

# For lookups where the right side is small, use map instead of merge
lookup = dict(zip(dim_df['id'], dim_df['name']))
df['name'] = df['entity_id'].map(lookup)
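Two merge parameters worth adding to that list (shown on hypothetical frames): validate raises immediately if the join cardinality is not what you expected, and indicator=True exposes unmatched rows for auditing:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'id': [1, 2], 'category': ['a', 'b']})

# validate='many_to_one' raises MergeError if right has duplicate keys,
# catching an accidental fan-out before it explodes memory
result = pd.merge(left, right, on='id', how='left', validate='many_to_one')

# indicator=True adds a _merge column so unmatched rows can be audited
audited = pd.merge(left, right, on='id', how='left', indicator=True)
unmatched = audited[audited['_merge'] == 'left_only']
```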
# Reduce memory before merge: keep only the columns you need
result = pd.merge(
    left[['id', 'amount', 'date']],
    right[['id', 'category', 'segment']],
    on='id',
    how='left'
)

When to Stop Using Pandas
Pandas is single-threaded and eager: every DataFrame lives fully in memory. For datasets above a few hundred megabytes, or for transformations that are more naturally expressed as SQL, reach for a different tool.
Polars: a pandas replacement with a Rust backend, lazy evaluation, and multi-threaded execution. The API is expression-based and differs from pandas in the details, but the concepts map over directly, and lazy mode adds a query optimizer. Strongly typed and faster on most benchmarks.
import polars as pl

# Lazy evaluation: Polars optimizes the full query before executing
result = (
    pl.scan_csv("large_events.csv")  # lazy, no data loaded yet
    .filter(pl.col("amount") > 0)
    .with_columns([
        (pl.col("amount") / 100).alias("amount_usd"),
        pl.col("occurred_at").str.to_date().alias("event_date")
    ])
    .group_by(["user_id", "event_date"])
    .agg([
        pl.col("amount_usd").sum().alias("daily_spend"),
        pl.col("event_id").count().alias("tx_count")
    ])
    .collect()  # execution happens here
)

DuckDB: an in-process SQL analytics engine that can query pandas DataFrames, Parquet files, and CSV files directly via SQL. Ideal when your transformation logic is naturally SQL-shaped.
import duckdb

# Query a pandas DataFrame directly with SQL
result = duckdb.sql("""
    SELECT
        user_id,
        DATE_TRUNC('day', occurred_at) AS event_date,
        SUM(amount / 100.0) AS daily_spend,
        COUNT(*) AS tx_count
    FROM df  -- df is the pandas DataFrame in scope
    WHERE amount > 0
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()  # returns a pandas DataFrame

The practical heuristic: pandas for datasets under 500MB where Python logic is needed, Polars for larger datasets or performance-critical paths, DuckDB when the transformation is SQL-shaped and the data lives in files or existing DataFrames, and PySpark when you need distributed processing across a cluster.
Pandas is not going away. It is too deeply embedded in the ecosystem and too familiar to too many practitioners. But knowing its limits and the alternatives that address those limits is what separates a data engineer who uses pandas from one who can make an informed choice about when not to.