Chapter 32  ·  Part VIII: Data Systems

Stream Processing vs. Batch Processing

Every data system is either processing a bounded dataset that already exists, or an unbounded stream of data that keeps arriving. The choice between these two mental models shapes your architecture more than any tool you pick.



Key Learnings

Read this if you have 3 minutes

01

Batch and stream are not opposites — batch is just stream with a beginning and an end. The same data, the same logic, just different assumptions about when the dataset stops arriving. Understanding this unifies how you think about both.

02

Event time is when something happened; processing time is when your system sees it. These are almost never the same. Every correctness bug in stream processing comes from confusing the two.

03

Watermarks are a bet, not a guarantee. A watermark says "I'm confident I've seen all events up to time T." That confidence can be wrong — late data will arrive — and your system must decide what to do when it does.

04

The Lambda architecture's core insight is sound; its operational cost is not. Maintaining two separate code paths — one for batch, one for streaming — that must produce identical results is a maintenance trap. Kappa collapses them into one.

05

Stateful stream processing is what separates toy pipelines from production systems. Filtering and mapping are easy. Aggregations, joins, and sessionization require state — and that state must be fault-tolerant, replayable, and often enormous.

06

Exactly-once is not "each event is processed once." It means the observable effect on your output is as if each event had been processed exactly once, even if the system internally retried. This requires both idempotent writes and transactional coordination.

07

Batch is still the right answer for many workloads. If your business question can tolerate an hour of delay, batch is simpler, cheaper, and easier to debug. Reach for streaming when low latency is genuinely required — not just because it feels more modern.

The Core Idea: Bounded vs. Unbounded Data

Let's start with a simple distinction that most introductions bury or skip entirely.

Bounded data is a dataset that has a defined beginning and end. A log file from last Tuesday. The orders placed in November. The results of a database query you ran at 9am. When you process bounded data, you know exactly how much there is. You can scan it all, sort it, and produce a final answer.

Unbounded data is a dataset that keeps arriving indefinitely. User click events. Sensor readings from a factory floor. Transactions flowing into a payment system. There is no end — or at least no end you can see ahead of time. Processing unbounded data means you have to produce results before you've seen everything, which means your results are always, in some sense, provisional.

Batch processing is the traditional tool for bounded data. Stream processing is the tool for unbounded data. But here's the thing: they're not as different as they look. A batch job is just a stream pipeline that happens to run after all the data has arrived. And a stream pipeline can, in principle, do everything a batch job can do — it just has to decide when it has enough data to produce an answer.

This realization — that batch and stream are two points on a spectrum, not two fundamentally different things — is the key mental model for this chapter.

Core Insight

A batch job is a stream job that runs once after all the data has arrived. A stream job is a batch job that runs continuously as data arrives and never waits for the end. The underlying computation can be identical — what differs is the input model and the latency requirement.

Why Batch Processing Ruled for So Long

Batch processing is old. MapReduce, introduced by Google in 2004, is the most famous example, but the idea goes back decades to mainframe jobs that would process the day's transactions overnight. The model is simple and powerful: gather all the input data, run a transformation on it, write the output. Repeat tomorrow.

The appeal is real. Batch jobs are:

  • Deterministic: the same frozen input produces the same output on every run.
  • Easy to debug: you can rerun a failing job against the exact input that broke it, as many times as you need.
  • Easy to reprocess: fix the code and run it again over the same data.
  • Forgiving of failure: if a job dies, rerun it from the start; nothing downstream saw partial output.

The problem is latency. A nightly batch job can only tell you what was true yesterday. A job that runs hourly can tell you what was true an hour ago. For many use cases — fraud detection, recommendation systems, dashboards that business teams watch during the day — that lag is too long.

The desire to get fresh results faster is what pushed the industry toward stream processing.

The Lambda Architecture: A Pragmatic Compromise

Around 2011, Nathan Marz proposed the Lambda architecture as a way to get the best of both worlds: the accuracy and reprocessing capability of batch, combined with the low latency of stream processing.

The architecture has three layers:

┌─────────────────────────────────────────────────────────────────┐
│                         Raw Data Stream                         │
└────────────────────────┬────────────────────────────────────────┘
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
  ┌─────────────────┐         ┌─────────────────┐
  │   Batch Layer   │         │   Speed Layer   │
  │                 │         │  (Stream proc)  │
  │ Recomputes full │         │ Low latency,    │
  │ results every   │         │ approximate or  │
  │ few hours from  │         │ recent-only     │
  │ raw data store  │         │ results         │
  └────────┬────────┘         └────────┬────────┘
           │                           │
           │   Batch views             │  Real-time views
           └──────────────┬────────────┘
                          ▼
               ┌──────────────────┐
               │   Serving Layer  │
               │                  │
               │ Merges batch +   │
               │ speed layer      │
               │ results at query │
               │ time             │
               └──────────────────┘
    
Lambda Architecture: batch layer for accuracy, speed layer for freshness, serving layer merges both at query time.

The batch layer periodically re-runs a full computation over all historical data. This produces perfectly correct results but with high latency (hours).

The speed layer processes the incoming stream in real time, producing approximate or recent-only results with low latency (seconds).

The serving layer takes a query, looks up results from both layers, and merges them. For example: total sales = (batch total up to 3 hours ago) + (stream total for the last 3 hours).
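That merge is simple arithmetic. A minimal sketch in Python; the view shapes and names are illustrative, not any framework's API:

```python
from datetime import datetime

def merge_total_sales(batch_view, speed_view, batch_horizon):
    """Serving-layer merge: the batch view is authoritative up to
    batch_horizon; the speed view covers everything after it.
    Both views map hour bucket -> sales total."""
    batch_part = sum(v for t, v in batch_view.items() if t < batch_horizon)
    speed_part = sum(v for t, v in speed_view.items() if t >= batch_horizon)
    return batch_part + speed_part

horizon = datetime(2024, 1, 1, 12)           # end of the last batch run
batch = {datetime(2024, 1, 1, 10): 100.0,    # batch totals (correct, stale)
         datetime(2024, 1, 1, 11): 250.0}
speed = {datetime(2024, 1, 1, 12): 40.0,     # stream totals (fresh, provisional)
         datetime(2024, 1, 1, 13): 10.0}

merge_total_sales(batch, speed, horizon)     # 350.0 + 50.0 = 400.0
```

The query-time merge is what lets the speed layer be approximate: the next batch run advances the horizon and overwrites the provisional numbers.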

The Insight Behind Lambda

Lambda's insight is important: the batch layer is the source of truth. It re-runs the full computation regularly, which means any bugs in the speed layer get corrected eventually when the next batch run overwrites its output. You tolerate some inaccuracy in the short term (speed layer) in exchange for guaranteed correctness in the long term (batch layer).

This is a real benefit. In practice, stream processors are harder to get exactly right — especially around edge cases, late data, and state management. Knowing the batch job will clean things up gives you a safety net.

The Pain of Lambda

The problem with Lambda is that you have written the same pipeline twice. The batch layer and the speed layer must produce consistent results — if they diverge, your serving layer will return wrong answers. But they are implemented in completely different frameworks, possibly different languages, definitely with different operational properties.

Over time, these two code paths drift. A bug gets fixed in the batch layer but not in the speed layer. A new business requirement gets added to the speed layer but the batch layer is six months behind. The team that owns the batch code and the team that owns the stream code start to diverge.

You've created a system that has twice the code, twice the bugs, and twice the operational burden, yet it still has to produce the same results as a single system would.

The Lambda Trap

Lambda architectures tend to start as a "temporary" solution while the team figures out streaming. They tend to stay in place for years. Every engineer who joins the team has to understand both layers. Every new feature has to be implemented in both. This is not a technical debt you can pay back easily — it's baked into the architecture.

The Kappa Architecture: One Pipeline to Rule Them All

In 2014, Jay Kreps (co-creator of Kafka) proposed a simpler alternative: throw away the batch layer. Run only a single stream processing pipeline. When you need to reprocess historical data — because you found a bug or want to add a new feature — replay the raw event log through the same stream pipeline.

┌─────────────────────────────────────────────────────────────────┐
│              Immutable Log (e.g. Kafka, retained indefinitely)  │
│                                                                 │
│   event₁  event₂  event₃  event₄  event₅  event₆  event₇ ...  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
              ┌────────────┴─────────────┐
              │                          │
              ▼                          ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  Stream Job v1      │   │  Stream Job v2      │
    │  (current, live)    │   │  (new version,      │
    │                     │   │  replaying from     │
    │  Reading from       │   │  beginning of log)  │
    │  current offset     │   │                     │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  Output table A     │   │  Output table B     │
    │  (current results)  │   │  (new results,      │
    │                     │   │  catching up)       │
    └─────────────────────┘   └─────────────────────┘

   When v2 catches up to v1, swap serving to table B, decommission v1.
    
Kappa Architecture: one stream job, reprocessing done by replaying the log into a new job version, then swapping.

The key enabling technology is the immutable, replayable log. Kafka, for example, can retain events indefinitely (or for years). If you need to reprocess, you spin up a new version of your stream job, point it at the beginning of the log, and let it catch up. Once it reaches the current position, you cut over serving to the new output and shut down the old job.

What Kappa Gives You

  • One codebase: the same logic processes fresh and historical data, so there is nothing to keep in sync.
  • Reprocessing with production code: fix a bug once, replay the log once.
  • A simpler operational story: one framework to run, monitor, and staff.

What Kappa Costs You

Kappa is not free. The trade-offs are real:

  • Log retention: you must keep raw events long enough to replay them, and storing months or years of events costs money.
  • Reprocessing speed: replaying weeks of history through a stream job can take hours or days; a batch engine over the same data is often far faster.
  • Replay load: the single pipeline must handle both live traffic and full-speed replay, which stresses state backends and downstream sinks.

When to Use Which

Lambda makes sense when: your batch and stream logic are genuinely complex in different ways (e.g., the batch layer does expensive ML model training), or when reprocessing speed is critical and batch can do it 100x faster. Kappa makes sense when: the logic is the same regardless of whether you're processing fresh or historical data, and you can tolerate slower reprocessing. Most teams should start with Kappa and only add a batch layer if they hit a concrete problem that streaming alone can't solve.

Event Time vs. Processing Time: The Most Important Distinction in Streaming

This is where most streaming systems go wrong, and where most streaming tutorials gloss over the hard part.

Consider a mobile app that tracks how long users spend reading an article. The app fires an event when the user finishes reading. But the user's phone is on a plane with no internet. The event is buffered on the device. An hour later, when the plane lands, the event is sent to your servers.

Your server receives this event at 3pm. But the reading actually happened at 1:30pm. Which time matters?

Event time is when the thing actually happened — 1:30pm. This is what you care about for business analytics: "How much reading happened between 1pm and 2pm?"

Processing time is when your system saw the event — 3pm. This is what your system's clock says. It's easy to use, but it's wrong for the above question.

REALITY (Event Time)
──────────────────────────────────────────────────────►
  1:00pm          1:30pm          2:00pm          3:00pm
                    │
                User reads article (event happens here)
                    │
                    │  ← Phone is offline
                    │
YOUR SYSTEM (Processing Time)
──────────────────────────────────────────────────────►
  1:00pm          1:30pm          2:00pm          3:00pm
                                                    │
                                         System receives event
                                                   HERE

  Gap = 90 minutes of skew between event time and processing time
    
Event time skew: the same event appears at very different positions depending on which clock you use.

If you use processing time for your "articles read per hour" metric, you'll count this event in the 3pm hour, not the 1pm hour. Your 1pm–2pm metric will be understated. Your 3pm metric will be overstated. And you'll never know — there's no error, just silently wrong numbers.
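The difference is mechanical: bucket the same event by each clock and you get two different answers. A small sketch, with timestamps mirroring the example above:

```python
from datetime import datetime

def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

event = {
    "event_time": datetime(2024, 1, 1, 13, 30),       # when the user read
    "processing_time": datetime(2024, 1, 1, 15, 0),   # when the server saw it
}

by_event_time = hour_bucket(event["event_time"])            # 1pm bucket: correct
by_processing_time = hour_bucket(event["processing_time"])  # 3pm bucket: silently wrong
```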

This sounds like an edge case, but it's the norm for mobile apps, IoT sensors, and any system where clients can go offline. It also happens in distributed systems: a service under load might take 30 seconds to forward an event to your pipeline. That 30-second delay means every event you receive is already 30 seconds old.

The Practical Consequences

Working in event time is harder but correct. The challenge: if your stream processor is computing "events per minute in event time," it needs to know when it can close the window for minute X and emit a result. How does it know it has seen all the events for 1:30pm? It can't — new events for 1:30pm might still be in transit.

This is the problem that watermarks solve.

Watermarks: Making a Decision Under Uncertainty

A watermark is a statement from your stream processor: "I am confident that I have seen all events with an event time earlier than T."

Once the watermark passes time T, the system can close all windows that end at or before T and emit their results. It doesn't wait forever. It makes a bet.

How Watermarks Are Generated

The simplest approach is a heuristic watermark: take the maximum event time seen so far, subtract a fixed lag estimate, and use that as the watermark. For example:

Events arriving (with their event times):
  [1:30:05] [1:30:12] [1:30:08] [1:29:55] [1:30:20] [1:30:18] ...

Maximum event time seen so far: 1:30:20
Assumed max lag (late arrival buffer): 60 seconds

Watermark = 1:30:20 - 60s = 1:29:20

Interpretation: "I'm confident I've seen all events from before 1:29:20"

So: the window [1:28:00 – 1:29:00] can be closed and emitted.
    The window [1:29:00 – 1:30:00] must stay open (watermark hasn't passed it yet).
    
Watermark calculation: max event time minus lag estimate. Advance the watermark as new events arrive with later timestamps.

The 60-second buffer is a judgment call. Too small, and you'll close windows before all late events arrive — your results will be wrong. Too large, and your output latency increases by that same amount — you wait longer before emitting results.
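The heuristic itself is simple enough to sketch directly. This is an illustrative generator, not any particular framework's watermark API; timestamps are seconds since midnight, matching the 1:30:20 example:

```python
class HeuristicWatermark:
    """Heuristic watermark: max event time seen so far minus a fixed lag."""
    def __init__(self, max_lag_seconds):
        self.max_lag = max_lag_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.max_lag

    def can_close(self, window_end):
        """A window [.., window_end) may be emitted once the watermark passes it."""
        return self.current() >= window_end

wm = HeuristicWatermark(max_lag_seconds=60)
for t in (5405, 5412, 5408, 5395, 5420, 5418):   # 1:30:05 ... 1:30:18
    wm.observe(t)

wm.current()          # 5420 - 60 = 5360, i.e. 1:29:20
wm.can_close(5340)    # window ending 1:29:00 -> True, emit it
wm.can_close(5400)    # window ending 1:30:00 -> False, stay open
```

Note that an out-of-order event (1:29:55 after 1:30:12) does not move the watermark backward; only the maximum matters.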

Late Data: It Will Arrive

No matter how generous your watermark lag, some events will arrive after their window has already closed. These are called late events. You have three choices:

  1. Drop them. Fast and simple. Acceptable if late events are rare and approximate answers are fine.
  2. Reopen the window and recompute. Correct, but complex. The downstream system now receives a correction for a result it already processed. It must handle this gracefully.
  3. Route them to a side output. The late events go to a separate stream that a separate process handles — maybe a daily reconciliation batch job. This is often the pragmatic choice.

Apache Flink and Apache Beam both support all three approaches. The choice depends on how often late events arrive, how much they matter to your accuracy requirements, and how tolerant your downstream systems are to corrections.
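One way to encode these choices is a small routing function. This is an illustrative policy sketch with made-up names and thresholds, not Flink's or Beam's actual API; choice 1 (dropping) would simply replace either late branch:

```python
def route_late_event(event_time, watermark, allowed_lateness):
    """Illustrative lateness policy for a windowed aggregation."""
    if event_time >= watermark:
        return "main"           # window still open: normal path
    if event_time >= watermark - allowed_lateness:
        return "correction"     # choice 2: reopen the window, emit a correction
    return "side_output"        # choice 3: hand off to reconciliation

route_late_event(100.0, watermark=90.0, allowed_lateness=30.0)  # "main"
route_late_event(80.0,  watermark=90.0, allowed_lateness=30.0)  # "correction"
route_late_event(40.0,  watermark=90.0, allowed_lateness=30.0)  # "side_output"
```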

The Silent Accuracy Problem

If you drop late events silently and never measure how many you're dropping, you have no idea how accurate your results are. Before deploying any windowed aggregation pipeline in production, instrument the late event rate. If more than 0.1% of events are arriving late, investigate the source of delay before deciding your latency tolerance.

Windows: How You Slice an Infinite Stream Into Finite Pieces

To compute an aggregation over a stream — a count, a sum, an average — you need to define a finite chunk of data to aggregate over. This is a window.

Tumbling Windows

Fixed-size, non-overlapping windows. "One-minute buckets." The window [1:00, 1:01) is closed, the window [1:01, 1:02) opens. Every event belongs to exactly one window.

Simple and efficient. Good for: periodic metrics (requests per minute, errors per hour), batch-like summaries over a stream.

Sliding Windows

Windows of fixed size that slide forward by a step smaller than their size. A "5-minute window updated every minute" means any given event appears in multiple windows. Good for: smoothed metrics, moving averages.

More expensive than tumbling windows because each event is processed multiple times.

Session Windows

Windows defined by activity, not time. A session window groups events from the same user until there's a gap of more than X minutes with no events from that user — then the session closes.

Session windows have variable and unpredictable size. They require per-key state (what's the last event time for this user?). They're powerful for user behavior analysis — "how long was this user's visit?" — but significantly more complex to implement correctly.

TUMBLING (1 min)
  ├────────────┤────────────┤────────────┤────────────┤
  1:00       1:01        1:02        1:03        1:04

SLIDING (2 min window, 1 min slide)
  ├──────────────────┤
       ├──────────────────┤
            ├──────────────────┤
  1:00     1:01     1:02     1:03     1:04

SESSION (30-second gap)
  User A:  ●●●          ●●    │gap│  ●●●●
           └── session 1 ──┘        └─ session 2 ─┘

           (sessions close when user is idle >30s)
    
Three window types: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based).
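Window assignment itself is simple arithmetic plus, for sessions, per-key grouping. A sketch of all three, with illustrative function names (times are plain seconds):

```python
def tumbling_window(t, size):
    """The single [start, start + size) window containing event time t."""
    start = (t // size) * size
    return (start, start + size)

def sliding_windows(t, size, slide):
    """Every [start, start + size) window containing t."""
    start = (t // slide) * slide
    windows = []
    while start > t - size:
        windows.append((start, start + size))
        start -= slide
    return windows

def session_windows(event_times, gap):
    """Group one key's sorted event times into sessions split by gaps > gap."""
    sessions, current = [], [event_times[0]]
    for t in event_times[1:]:
        if t - current[-1] > gap:
            sessions.append(current)   # idle too long: close the session
            current = [t]
        else:
            current.append(t)
    sessions.append(current)
    return sessions

tumbling_window(130, 60)                         # (120, 180): exactly one window
sliding_windows(130, 120, 60)                    # [(120, 240), (60, 180)]: two windows
session_windows([0, 10, 20, 100, 110], gap=30)   # [[0, 10, 20], [100, 110]]
```

The sliding case makes the cost visible: with size/slide = 2, every event lands in two windows and is aggregated twice.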

Stateful Stream Processing: Where It Gets Hard

Filtering events is easy. Mapping one event to another event is easy. The moment you need to correlate multiple events — count them, join them, detect a pattern across them — you need state.

State is what makes stream processing fundamentally different from simple message passing.

What State Looks Like

Consider: "Count the number of failed login attempts per user in the last 10 minutes. If it exceeds 5, emit a fraud alert."

To answer this, your stream processor must maintain, for each user, a rolling count of failed attempts with their timestamps. As new events arrive, it adds to the count. As old events age out of the 10-minute window, it removes them. It checks the count and emits an alert if the threshold is crossed.

This state is not in the event. It exists in the processor's memory, and it accumulates over time. For a system processing millions of users, this state can be gigabytes or terabytes.
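A sketch of that rolling-count state, with illustrative names; a real processor would keep this in a fault-tolerant state backend, not a plain Python dict:

```python
from collections import defaultdict, deque

class FailedLoginDetector:
    """Per-user rolling count of failed logins in a 10-minute window."""
    def __init__(self, window_seconds=600, threshold=5):
        self.window = window_seconds
        self.threshold = threshold
        self.failures = defaultdict(deque)      # user_id -> failure timestamps

    def on_failed_login(self, user_id, ts):
        q = self.failures[user_id]
        q.append(ts)
        while q and q[0] <= ts - self.window:   # age old failures out of the window
            q.popleft()
        return len(q) > self.threshold          # True -> emit a fraud alert

detector = FailedLoginDetector()
alerts = [detector.on_failed_login("u1", float(t)) for t in range(6)]
# alerts[:5] are False; the sixth failure pushes the count past 5
```

Multiply the deque by millions of users and the "state can be gigabytes" claim stops being abstract.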

State Backends

Stream processors like Flink let you choose where state lives:

  • In memory, on the heap: fastest access, but the state must fit in RAM and large snapshots get expensive. Fine for small state.
  • In an embedded key-value store (RocksDB) on local disk: slower per access, but state can grow far beyond memory. The usual choice once state reaches tens of gigabytes.
  • Either way, durable snapshots go to external storage (S3, HDFS) at checkpoint time; the backend choice affects speed and capacity, not durability.

Checkpointing: Fault Tolerance for State

If a stream processor crashes mid-processing, you need to restart it without losing or double-counting events. This is checkpointing.

Flink uses a mechanism called distributed snapshots (based on the Chandy-Lamport algorithm). Periodically, it injects a checkpoint barrier into the data stream. When each operator processes the barrier, it saves its current state to durable storage (S3, HDFS). If the job fails, it restores from the last checkpoint and replays events from that point.

Kafka (source)
  event₁  event₂  [barrier₁]  event₃  event₄  [barrier₂]  event₅
                       │                             │
                       ▼                             ▼
  Operator saves state to S3 when         Next checkpoint
  it receives the barrier

On failure: restart from last checkpoint, re-read events from Kafka
            starting from the offset recorded in that checkpoint.

  → event₁ and event₂ are "committed" (won't be reprocessed)
  → event₃ onward replayed from Kafka
    
Flink checkpointing: barriers flow with data, operators snapshot state when they see a barrier, enabling recovery to the last consistent snapshot.

The checkpoint interval is a trade-off: short intervals mean less data to replay on failure (smaller recovery time), but more I/O overhead and slight throughput reduction. Long intervals mean faster normal operation but longer recovery after failure.
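The recovery behavior can be modeled in a few lines. This is a toy simulation assuming a single operator and a single source partition; real checkpointing coordinates many operators via barriers:

```python
class CheckpointedCounter:
    """Toy checkpoint-and-replay model: `durable` holds a consistent
    snapshot of (source offset, operator state). On failure, restore
    both and re-read the log from the saved offset."""
    def __init__(self):
        self.offset = 0                         # next log position to read
        self.count = 0                          # operator state
        self.durable = {"offset": 0, "count": 0}

    def process(self, log, upto):
        for i in range(self.offset, upto):
            self.count += log[i]
            self.offset = i + 1

    def checkpoint(self):
        self.durable = {"offset": self.offset, "count": self.count}

    def crash_and_restore(self):
        self.offset = self.durable["offset"]
        self.count = self.durable["count"]

log = [1] * 10
job = CheckpointedCounter()
job.process(log, 4)           # events 0-3
job.checkpoint()              # snapshot: offset=4, count=4
job.process(log, 7)           # events 4-6, not yet checkpointed
job.crash_and_restore()       # in-flight progress lost, snapshot restored
job.process(log, len(log))    # replay from offset 4: no loss, no double count
job.count                     # 10
```

The key invariant is that the offset and the state are saved together; snapshotting one without the other is exactly how events get lost or double-counted.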

Exactly-Once Semantics: What It Actually Means

You will often hear stream processing systems advertise "exactly-once semantics." This phrase is frequently misunderstood, and the misunderstanding has caused real production bugs.

Exactly-once does not mean each event is processed exactly once internally.

When a stream processor fails and recovers from a checkpoint, it will re-process events from that checkpoint forward. Some events will be processed twice internally. Exactly-once means: the observable effect on the output is as if each event had been processed exactly once, even though the system may have internally processed it multiple times.

To achieve this, two things must both be true:

  1. The processor must be able to recover to a consistent state — achieved by checkpointing.
  2. The output writes must be idempotent or transactional — so that replaying an event produces the same output, not a duplicate output.

If you're writing to Kafka as your output, Flink can use Kafka's transactional producer API to ensure that even if a message is written multiple times internally, only one copy is committed to the output Kafka topic. If you're writing to a database, you need upsert semantics (write based on a unique key, so re-writing gives the same row, not a duplicate row).
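The upsert idea is small enough to show directly. A sketch with an in-memory dict standing in for the database table:

```python
def upsert(table, key, row):
    """Idempotent write: a keyed overwrite, so replaying the same event
    after a recovery rewrites the same row instead of adding a duplicate."""
    table[key] = row

orders = {}
for _ in range(2):                     # simulate a retry after recovery
    upsert(orders, "order-42", {"status": "paid", "amount": 99.0})

len(orders)                            # 1: one row, not two
```

An append-only insert in the same situation would leave two rows, and your totals would silently drift upward with every recovery.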

The Exactly-Once Trap

Exactly-once from Kafka source to Kafka sink is well-supported in Flink and Kafka Streams. Exactly-once to an external database or REST API is on you. If your pipeline ends with a call to an external service that has no idempotency guarantee, you do not have exactly-once semantics end-to-end, regardless of what Flink's documentation says. The guarantee only holds as far as the infrastructure supports it.

The Cost of Exactly-Once

Exactly-once processing has real overhead:

  • Checkpoint barriers flow through every operator, and aligned checkpoints can stall fast inputs while slower ones catch up.
  • Transactional sinks hold output in uncommitted transactions until the next checkpoint completes, so end-to-end latency grows by roughly the checkpoint interval.
  • The two-phase commit between the processor and the sink adds coordination work on every checkpoint.

In practice, the overhead is 5–15% throughput reduction compared to at-least-once processing. For many pipelines this is fine. For high-throughput, latency-sensitive pipelines, consider whether at-least-once with idempotent downstream writes gives you the same correctness guarantee at lower cost.

Joins in Stream Processing

Joining two streams is conceptually simple. It's operationally complex.

Stream-Stream Join

Join two streams on a key, within a time window. For example: join a "click" event with the corresponding "impression" event, where both events have the same ad_id, and the click must arrive within 30 seconds of the impression.

To do this, the processor must buffer one stream while waiting for matching events from the other stream. How long do you buffer? Until the time window expires. But events can be out of order. So you buffer longer to be safe. This buffering requires state. For high-cardinality keys (millions of ad IDs), this state can be enormous.
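A simplified sketch of that buffering, with illustrative names; it keeps only the latest impression per ad_id and, unlike a real join operator, never expires its state:

```python
class ClickImpressionJoin:
    """Stream-stream join sketch: buffer impressions keyed by ad_id,
    emit a joined record when a matching click arrives within `window`
    seconds of the impression."""
    def __init__(self, window=30.0):
        self.window = window
        self.impressions = {}             # ad_id -> impression timestamp (the state)

    def on_impression(self, ad_id, ts):
        self.impressions[ad_id] = ts      # buffered per-key state

    def on_click(self, ad_id, ts):
        imp_ts = self.impressions.get(ad_id)
        if imp_ts is not None and 0 <= ts - imp_ts <= self.window:
            return (ad_id, imp_ts, ts)    # joined result
        return None                       # no impression inside the window

join = ClickImpressionJoin(window=30.0)
join.on_impression("ad-1", 100.0)
join.on_click("ad-1", 110.0)   # ("ad-1", 100.0, 110.0): within 30s
join.on_click("ad-1", 150.0)   # None: outside the window
```

The missing piece, expiring impressions once the window (plus lateness allowance) has passed, is precisely where the state-size problem lives.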

Stream-Table Join

Join a stream against a slowly-changing table. For example: enrich each purchase event with the current user's profile (name, tier, location).

The user profile table is loaded into the stream processor's state (as a changelog stream from the database). Each incoming purchase event is looked up against this in-memory table. This is efficient and the most common type of stream join in practice.

The complexity: what is "current" for the user profile? If the user updated their profile 1 second before the purchase, did you get the new profile or the old one? This is the stream-table join consistency problem, and the answer depends on how quickly the changelog reaches your processor.
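A minimal model of the stream-table join and its consistency problem, with illustrative names:

```python
class PurchaseEnricher:
    """Stream-table join sketch: a changelog stream maintains the
    profiles table in processor state; each purchase is enriched with
    whatever profile version has arrived so far."""
    def __init__(self):
        self.profiles = {}                # user_id -> latest profile seen

    def on_profile_change(self, user_id, profile):
        self.profiles[user_id] = profile  # apply a changelog record

    def on_purchase(self, user_id, amount):
        # If the changelog lags the purchase stream, this lookup returns
        # the *old* profile: the consistency problem described above.
        profile = self.profiles.get(user_id, {"tier": "unknown"})
        return {"user": user_id, "amount": amount, "tier": profile["tier"]}

enricher = PurchaseEnricher()
enricher.on_purchase("u1", 10.0)              # tier "unknown": changelog lagging
enricher.on_profile_change("u1", {"tier": "gold"})
enricher.on_purchase("u1", 10.0)              # tier "gold": changelog caught up
```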

Apache Flink vs. Kafka Streams vs. Spark Streaming

A quick comparison of the major frameworks to orient you:

Property                | Apache Flink                     | Kafka Streams                    | Spark Structured Streaming
------------------------|----------------------------------|----------------------------------|------------------------------------
Processing model        | True streaming (event at a time) | True streaming (event at a time) | Micro-batch (default) or continuous
Latency                 | Milliseconds                     | Milliseconds                     | Seconds (micro-batch)
State management        | First-class, RocksDB backend,    | First-class, RocksDB backend,    | Good, but less flexible than Flink
                        | very mature                      | simpler API                      |
Exactly-once            | Yes (end-to-end with Kafka)      | Yes (end-to-end with Kafka)      | Yes (with idempotent sinks)
Event time / watermarks | Excellent, fine-grained control  | Good, improving                  | Good, less flexible
Deployment model        | Standalone cluster or Kubernetes | Library inside your app —        | Spark cluster
                        |                                  | no cluster needed                |
Learning curve          | Steep                            | Gentle (if you know Kafka)       | Moderate (if you know Spark)
Best for                | Complex, stateful, low-latency   | Kafka-centric, moderate          | Teams already using Spark,
                        | pipelines                        | complexity                       | batch + stream unification
Real-World Guidance

If your team is Kafka-native and you need streaming that's operationally simple, start with Kafka Streams. If you need sub-second latency, complex stateful logic (session windows, pattern detection, complex joins), or very high throughput, invest in Flink. If your team already lives in Spark and batch processing is your primary workload with streaming as a secondary need, Spark Structured Streaming is the lowest friction choice.

When Batch Is Still the Right Answer

Streaming is fashionable. This has led to a generation of pipelines being rewritten as streaming jobs when the original batch version was simpler, cheaper, and equally good for the use case.

Ask yourself: what is the actual latency requirement?

If the answer is "a business analyst refreshes this dashboard once in the morning," you don't need sub-second streaming. A job that runs every hour is fine. If the answer is "we need to detect fraud within 5 seconds of a transaction," you genuinely need streaming.

There are several cases where batch remains the superior choice:

  • Large backfills and historical reprocessing: a batch engine can re-aggregate months of data far faster than replaying it through a stream job.
  • Whole-dataset computations, such as training an ML model, that need to see all the data before producing anything.
  • Ad-hoc analytics: an analyst iterating on queries wants a frozen snapshot, not a moving target.
  • Cost-sensitive workloads: a job that runs 20 minutes a day is cheaper than a pipeline that runs around the clock.

A Practical Latency Ladder

Start at the top of this ladder and move down only when you have a concrete latency requirement that forces you to.

  1. Daily batch: the default for reporting and offline analytics.
  2. Hourly batch: dashboards that business teams watch during the day.
  3. Micro-batch every few minutes: near-real-time freshness without a full streaming stack.
  4. Streaming, seconds of latency: alerting, fraud checks, operational monitoring.
  5. Streaming, sub-second latency: only when the product genuinely demands it.

Reprocessing: The Practical Test of Your Architecture

Here is a test for your data architecture: you discover that a bug in your pipeline has been producing incorrect metrics for the past three weeks. How do you fix the historical data?

With a batch architecture: fix the code, rerun the job over the same input data. You get the correct output in minutes or hours. Relatively painless.

With a Kappa/streaming architecture: if you retained the raw events in Kafka or an object store, spin up a new version of the job, replay from three weeks ago, and wait for it to catch up. This could take hours or days depending on data volume and job throughput.

With a Lambda architecture: the next batch run will correct the output. You just have to wait for the next batch window.

The ability to reprocess historical data is not a nice-to-have. Bugs happen, requirements change, new features need backfilled data. If your architecture makes reprocessing painful, you will live with wrong data for longer than you should.

The Reprocessing Checklist

Before shipping a streaming pipeline, answer these questions:

  • Are the raw events retained, and for how long? If the log keeps only seven days, seven days is the most you can ever reprocess.
  • At your job's throughput, how long would replaying three weeks of data take: hours, or days?
  • Can you run two versions of the job side by side and swap serving to the new output, or does the pipeline overwrite a single table in place?
  • What do downstream consumers see while the replay catches up, and do they handle corrections?

A Decision Framework

When you're choosing between batch and stream for a new data pipeline, work through these questions in order:

What latency do you need?
│
├── Hours or more? ──────────────────────────► Use batch.
│                                              Simpler, cheaper, easier to reprocess.
│
└── Minutes or seconds?
    │
    ├── Is the computation stateful?
    │   (joins, aggregations, session detection)
    │   │
    │   ├── No (filter/enrich only) ──────────► Kafka Streams or Flink.
    │   │                                        Low operational overhead.
    │   │
    │   └── Yes
    │       │
    │       ├── Complex state, sub-second latency ──► Flink.
    │       │                                          Accept the operational cost.
    │       │
    │       └── Moderate state, seconds ok ─────────► Kafka Streams or
    │                                                   Spark Structured Streaming.
    │
    └── Do you need exactly-once end-to-end?
        │
        ├── Yes, including external sinks ────────────► Verify your sink supports it.
        │                                               Idempotent writes or
        │                                               transactional API required.
        │
        └── At-least-once with idempotent writes ok ─► Simpler and faster.
    
Decision tree for batch vs. stream and framework selection.

Putting It Together: A Concrete Example

Let's say you're building a real-time dashboard for an e-commerce platform. It shows: revenue per minute (last 30 minutes), active users right now, and a flag if any product page has an error rate above 5% in the last 5 minutes.

Revenue per minute: Tumbling 1-minute windows on purchase events, keyed by nothing (global sum). Use event time. Watermark lag of 30 seconds (purchases from mobile apps can be delayed). Late events (beyond 30s) are dropped — the error is small and the metric is approximate anyway.

Active users right now: Session windows with a 5-minute inactivity gap on click/view events, keyed by user_id. Count distinct sessions open at any moment. This requires significant state (one entry per active user), RocksDB backend. Use processing time here — "right now" is inherently about when the server sees the event, not when the user clicked.

Error rate by product page: Sliding 5-minute window, 1-minute slide, on page-view and error events, keyed by product_id. Compute error_count / view_count. Alert if ratio > 0.05. Use event time, watermark 10 seconds. Low tolerance for late events because this is an alerting use case — a false negative (missed alert) is worse than slightly delayed alerting.

All three jobs can run in Flink from the same Kafka topics. They have different window types, different keys, and different latency tolerances — but the infrastructure is shared.
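The error-rate job, for instance, reduces to counting two event kinds per window slice. A naive sketch that recomputes one slice from scratch; a real processor maintains incremental counts per sliding window instead:

```python
def error_rate(events, product_id, window_start, window_end):
    """Error rate for one product over one [window_start, window_end) slice.
    events: (event_time, product_id, kind) with kind "view" or "error"."""
    views = errors = 0
    for ts, pid, kind in events:
        if pid == product_id and window_start <= ts < window_end:
            views += kind == "view"
            errors += kind == "error"
    return errors / views if views else 0.0

events = ([(t, "p1", "view") for t in range(0, 200, 10)]
          + [(15, "p1", "error"), (95, "p1", "error")])
error_rate(events, "p1", 0, 300)   # 2 errors / 20 views = 0.1: above the 5% threshold
```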

Chapter Summary

The One-Sentence Principle

Batch and stream processing are the same computation with different assumptions about when the data ends — choose the simplest model that meets your latency requirement, and resist the pull of streaming until you genuinely need it.

The Most Common Mistake

Using processing time instead of event time for any aggregation that a user or business analyst will interpret as "what happened at time X." The numbers look plausible, are wrong in systematic ways, and the error is invisible until someone cross-checks against another source.

Three Questions for Your Next Design Review

  • What is the actual latency requirement, and who confirmed it? Would an hourly batch job satisfy the stated need?
  • If we discover a bug that corrupts three weeks of output, how do we reprocess? How long will it take?
  • For every windowed aggregation: are we using event time or processing time, and is that the right choice for this specific metric?