Chapter 32  ·  Part VIII: Data Systems

Stream Processing vs. Batch Processing

Every data system is either processing a bounded dataset that already exists, or an unbounded stream of data that keeps arriving. The choice between these two mental models shapes your architecture more than any tool you pick.



Key Learnings

Read this if you have 3 minutes

01

Batch and stream are not opposites — batch is just stream with a beginning and an end. The same data, the same logic, just different assumptions about when the dataset stops arriving. Understanding this unifies how you think about both.

02

Event time is when something happened; processing time is when your system sees it. These are almost never the same. Every correctness bug in stream processing comes from confusing the two.

03

Watermarks are a bet, not a guarantee. A watermark says "I'm confident I've seen all events up to time T." That confidence can be wrong — late data will arrive — and your system must decide what to do when it does.

04

The Lambda architecture's core insight is sound; its operational cost is not. Maintaining two separate code paths — one for batch, one for streaming — that must produce identical results is a maintenance trap. Kappa collapses them into one.

05

Stateful stream processing is what separates toy pipelines from production systems. Filtering and mapping are easy. Aggregations, joins, and sessionization require state — and that state must be fault-tolerant, replayable, and often enormous.

06

Exactly-once is not "each event is processed once." It means the observable effect on your output is as if each event had been processed exactly once, even if the system internally retried. This requires both idempotent writes and transactional coordination.

07

Batch is still the right answer for many workloads. If your business question can tolerate an hour of delay, batch is simpler, cheaper, and easier to debug. Reach for streaming when low latency is genuinely required — not just because it feels more modern.

The Core Idea: Bounded vs. Unbounded Data

Let's start with a simple distinction that most introductions bury or skip entirely.

Bounded data is a dataset that has a defined beginning and end. A log file from last Tuesday. The orders placed in November. The results of a database query you ran at 9am. When you process bounded data, you know exactly how much there is. You can scan it all, sort it, and produce a final answer.

Unbounded data is a dataset that keeps arriving indefinitely. User click events. Sensor readings from a factory floor. Transactions flowing into a payment system. There is no end — or at least no end you can see ahead of time. Processing unbounded data means you have to produce results before you've seen everything, which means your results are always, in some sense, provisional.

Batch processing is the traditional tool for bounded data. Stream processing is the tool for unbounded data. But here's the thing: they're not as different as they look. A batch job is just a stream pipeline that happens to run after all the data has arrived. And a stream pipeline can, in principle, do everything a batch job can do — it just has to decide when it has enough data to produce an answer.

This realization — that batch and stream are two points on a spectrum, not two fundamentally different things — is the key mental model for this chapter.

Core Insight

A batch job is a stream job that runs once after all the data has arrived. A stream job is a batch job that runs continuously as data arrives and never waits for the end. The underlying computation can be identical — what differs is the input model and the latency requirement.

Why Batch Processing Ruled for So Long

Batch processing is old. MapReduce, introduced by Google in 2004, is the most famous example, but the idea goes back decades to mainframe jobs that would process the day's transactions overnight. The model is simple and powerful: gather all the input data, run a transformation on it, write the output. Repeat tomorrow.

The appeal is real. Batch jobs are:

  • Deterministic: the same frozen input produces the same output on every run.
  • Easy to debug: you can rerun a failing job against the exact input that broke it, as many times as you need.
  • Easy to reprocess: fix the code and run it again over the same data.
  • Forgiving of failure: if a job dies, rerun it from the start; nothing downstream saw partial output.

The problem is latency. A nightly batch job can only tell you what was true yesterday. A job that runs hourly can tell you what was true an hour ago. For many use cases — fraud detection, recommendation systems, dashboards that business teams watch during the day — that lag is too long.

The desire to get fresh results faster is what pushed the industry toward stream processing.

The Lambda Architecture: A Pragmatic Compromise

Around 2011, Nathan Marz proposed the Lambda architecture as a way to get the best of both worlds: the accuracy and reprocessing capability of batch, combined with the low latency of stream processing.

The architecture has three layers:

┌─────────────────────────────────────────────────────────────────┐
│                         Raw Data Stream                         │
└────────────────────────┬────────────────────────────────────────┘
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
  ┌─────────────────┐         ┌─────────────────┐
  │   Batch Layer   │         │   Speed Layer   │
  │                 │         │  (Stream proc)  │
  │ Recomputes full │         │ Low latency,    │
  │ results every   │         │ approximate or  │
  │ few hours from  │         │ recent-only     │
  │ raw data store  │         │ results         │
  └────────┬────────┘         └────────┬────────┘
           │                           │
           │   Batch views             │  Real-time views
           └──────────────┬────────────┘
                          ▼
               ┌──────────────────┐
               │   Serving Layer  │
               │                  │
               │ Merges batch +   │
               │ speed layer      │
               │ results at query │
               │ time             │
               └──────────────────┘
    
Lambda Architecture: batch layer for accuracy, speed layer for freshness, serving layer merges both at query time.

The batch layer periodically re-runs a full computation over all historical data. This produces perfectly correct results but with high latency (hours).

The speed layer processes the incoming stream in real time, producing approximate or recent-only results with low latency (seconds).

The serving layer takes a query, looks up results from both layers, and merges them. For example: total sales = (batch total up to 3 hours ago) + (stream total for the last 3 hours).
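That merge is simple arithmetic. A minimal sketch in Python; the view shapes and names are illustrative, not any framework's API:

```python
from datetime import datetime

def merge_total_sales(batch_view, speed_view, batch_horizon):
    """Serving-layer merge: the batch view is authoritative up to
    batch_horizon; the speed view covers everything after it.
    Both views map hour bucket -> sales total."""
    batch_part = sum(v for t, v in batch_view.items() if t < batch_horizon)
    speed_part = sum(v for t, v in speed_view.items() if t >= batch_horizon)
    return batch_part + speed_part

horizon = datetime(2024, 1, 1, 12)           # end of the last batch run
batch = {datetime(2024, 1, 1, 10): 100.0,    # batch totals (correct, stale)
         datetime(2024, 1, 1, 11): 250.0}
speed = {datetime(2024, 1, 1, 12): 40.0,     # stream totals (fresh, provisional)
         datetime(2024, 1, 1, 13): 10.0}

merge_total_sales(batch, speed, horizon)     # 350.0 + 50.0 = 400.0
```

The query-time merge is what lets the speed layer be approximate: the next batch run advances the horizon and overwrites the provisional numbers.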

The Insight Behind Lambda

Lambda's insight is important: the batch layer is the source of truth. It re-runs the full computation regularly, which means any bugs in the speed layer get corrected eventually when the next batch run overwrites its output. You tolerate some inaccuracy in the short term (speed layer) in exchange for guaranteed correctness in the long term (batch layer).

This is a real benefit. In practice, stream processors are harder to get exactly right — especially around edge cases, late data, and state management. Knowing the batch job will clean things up gives you a safety net.

The Pain of Lambda

The problem with Lambda is that you have written the same pipeline twice. The batch layer and the speed layer must produce consistent results — if they diverge, your serving layer will return wrong answers. But they are implemented in completely different frameworks, possibly different languages, definitely with different operational properties.

Over time, these two code paths drift. A bug gets fixed in the batch layer but not in the speed layer. A new business requirement gets added to the speed layer but the batch layer is six months behind. The team that owns the batch code and the team that owns the stream code start to diverge.

You've created a system that has twice the code, twice the bugs, and twice the operational burden, yet it still has to produce the same results as a single system would.

The Lambda Trap

Lambda architectures tend to start as a "temporary" solution while the team figures out streaming. They tend to stay in place for years. Every engineer who joins the team has to understand both layers. Every new feature has to be implemented in both. This is not a technical debt you can pay back easily — it's baked into the architecture.

The Kappa Architecture: One Pipeline to Rule Them All

In 2014, Jay Kreps (co-creator of Kafka) proposed a simpler alternative: throw away the batch layer. Run only a single stream processing pipeline. When you need to reprocess historical data — because you found a bug or want to add a new feature — replay the raw event log through the same stream pipeline.

┌─────────────────────────────────────────────────────────────────┐
│              Immutable Log (e.g. Kafka, retained indefinitely)  │
│                                                                 │
│   event₁  event₂  event₃  event₄  event₅  event₆  event₇ ...  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
              ┌────────────┴─────────────┐
              │                          │
              ▼                          ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  Stream Job v1      │   │  Stream Job v2      │
    │  (current, live)    │   │  (new version,      │
    │                     │   │  replaying from     │
    │  Reading from       │   │  beginning of log)  │
    │  current offset     │   │                     │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  Output table A     │   │  Output table B     │
    │  (current results)  │   │  (new results,      │
    │                     │   │  catching up)       │
    └─────────────────────┘   └─────────────────────┘

   When v2 catches up to v1, swap serving to table B, decommission v1.
    
Kappa Architecture: one stream job, reprocessing done by replaying the log into a new job version, then swapping.

The key enabling technology is the immutable, replayable log. Kafka, for example, can retain events indefinitely (or for years). If you need to reprocess, you spin up a new version of your stream job, point it at the beginning of the log, and let it catch up. Once it reaches the current position, you cut over serving to the new output and shut down the old job.

What Kappa Gives You

  • One codebase: the same logic processes fresh and historical data, so there is nothing to keep in sync.
  • Reprocessing with production code: fix a bug once, replay the log once.
  • A simpler operational story: one framework to run, monitor, and staff.

What Kappa Costs You

Kappa is not free. The trade-offs are real:

  • Log retention: you must keep raw events long enough to replay them, and storing months or years of events costs money.
  • Reprocessing speed: replaying weeks of history through a stream job can take hours or days; a batch engine over the same data is often far faster.
  • Replay load: the single pipeline must handle both live traffic and full-speed replay, which stresses state backends and downstream sinks.

When to Use Which

Lambda makes sense when: your batch and stream logic are genuinely complex in different ways (e.g., the batch layer does expensive ML model training), or when reprocessing speed is critical and batch can do it 100x faster. Kappa makes sense when: the logic is the same regardless of whether you're processing fresh or historical data, and you can tolerate slower reprocessing. Most teams should start with Kappa and only add a batch layer if they hit a concrete problem that streaming alone can't solve.

Event Time vs. Processing Time: The Most Important Distinction in Streaming

This is where most streaming systems go wrong, and where most streaming tutorials gloss over the hard part.

Consider a mobile app that tracks how long users spend reading an article. The app fires an event when the user finishes reading. But the user's phone is on a plane with no internet. The event is buffered on the device. An hour later, when the plane lands, the event is sent to your servers.

Your server receives this event at 3pm. But the reading actually happened at 1:30pm. Which time matters?

Event time is when the thing actually happened — 1:30pm. This is what you care about for business analytics: "How much reading happened between 1pm and 2pm?"

Processing time is when your system saw the event — 3pm. This is what your system's clock says. It's easy to use, but it's wrong for the above question.

REALITY (Event Time)
──────────────────────────────────────────────────────►
  1:00pm          1:30pm          2:00pm          3:00pm
                    │
                User reads article (event happens here)
                    │
                    │  ← Phone is offline
                    │
YOUR SYSTEM (Processing Time)
──────────────────────────────────────────────────────►
  1:00pm          1:30pm          2:00pm          3:00pm
                                                    │
                                         System receives event
                                                   HERE

  Gap = 90 minutes of skew between event time and processing time
    
Event time skew: the same event appears at very different positions depending on which clock you use.

If you use processing time for your "articles read per hour" metric, you'll count this event in the 3pm hour, not the 1pm hour. Your 1pm–2pm metric will be understated. Your 3pm metric will be overstated. And you'll never know — there's no error, just silently wrong numbers.
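The difference is mechanical: bucket the same event by each clock and you get two different answers. A small sketch, with timestamps mirroring the example above:

```python
from datetime import datetime

def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

event = {
    "event_time": datetime(2024, 1, 1, 13, 30),       # when the user read
    "processing_time": datetime(2024, 1, 1, 15, 0),   # when the server saw it
}

by_event_time = hour_bucket(event["event_time"])            # 1pm bucket: correct
by_processing_time = hour_bucket(event["processing_time"])  # 3pm bucket: silently wrong
```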

This sounds like an edge case, but it's the norm for mobile apps, IoT sensors, and any system where clients can go offline. It also happens in distributed systems: a service under load might take 30 seconds to forward an event to your pipeline. That 30-second delay means every event you receive is already 30 seconds old.

The Practical Consequences

Working in event time is harder but correct. The challenge: if your stream processor is computing "events per minute in event time," it needs to know when it can close the window for minute X and emit a result. How does it know it has seen all the events for 1:30pm? It can't — new events for 1:30pm might still be in transit.

This is the problem that watermarks solve.

Watermarks: Making a Decision Under Uncertainty

A watermark is a statement from your stream processor: "I am confident that I have seen all events with an event time earlier than T."

Once the watermark passes time T, the system can close all windows that end at or before T and emit their results. It doesn't wait forever. It makes a bet.

How Watermarks Are Generated

The simplest approach is a heuristic watermark: take the maximum event time seen so far, subtract a fixed lag estimate, and use that as the watermark. For example:

Events arriving (with their event times):
  [1:30:05] [1:30:12] [1:30:08] [1:29:55] [1:30:20] [1:30:18] ...

Maximum event time seen so far: 1:30:20
Assumed max lag (late arrival buffer): 60 seconds

Watermark = 1:30:20 - 60s = 1:29:20

Interpretation: "I'm confident I've seen all events from before 1:29:20"

So: the window [1:28:00 – 1:29:00] can be closed and emitted.
    The window [1:29:00 – 1:30:00] must stay open (watermark hasn't passed it yet).
    
Watermark calculation: max event time minus lag estimate. Advance the watermark as new events arrive with later timestamps.

The 60-second buffer is a judgment call. Too small, and you'll close windows before all late events arrive — your results will be wrong. Too large, and your output latency increases by that same amount — you wait longer before emitting results.
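The heuristic itself is simple enough to sketch directly. This is an illustrative generator, not any particular framework's watermark API; timestamps are seconds since midnight, matching the 1:30:20 example:

```python
class HeuristicWatermark:
    """Heuristic watermark: max event time seen so far minus a fixed lag."""
    def __init__(self, max_lag_seconds):
        self.max_lag = max_lag_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.max_lag

    def can_close(self, window_end):
        """A window [.., window_end) may be emitted once the watermark passes it."""
        return self.current() >= window_end

wm = HeuristicWatermark(max_lag_seconds=60)
for t in (5405, 5412, 5408, 5395, 5420, 5418):   # 1:30:05 ... 1:30:18
    wm.observe(t)

wm.current()          # 5420 - 60 = 5360, i.e. 1:29:20
wm.can_close(5340)    # window ending 1:29:00 -> True, emit it
wm.can_close(5400)    # window ending 1:30:00 -> False, stay open
```

Note that an out-of-order event (1:29:55 after 1:30:12) does not move the watermark backward; only the maximum matters.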

Late Data: It Will Arrive

No matter how generous your watermark lag, some events will arrive after their window has already closed. These are called late events. You have three choices:

  1. Drop them. Fast and simple. Acceptable if late events are rare and approximate answers are fine.
  2. Reopen the window and recompute. Correct, but complex. The downstream system now receives a correction for a result it already processed. It must handle this gracefully.
  3. Route them to a side output. The late events go to a separate stream that a separate process handles — maybe a daily reconciliation batch job. This is often the pragmatic choice.

Apache Flink and Apache Beam both support all three approaches. The choice depends on how often late events arrive, how much they matter to your accuracy requirements, and how tolerant your downstream systems are to corrections.
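One way to encode these choices is a small routing function. This is an illustrative policy sketch with made-up names and thresholds, not Flink's or Beam's actual API; choice 1 (dropping) would simply replace either late branch:

```python
def route_late_event(event_time, watermark, allowed_lateness):
    """Illustrative lateness policy for a windowed aggregation."""
    if event_time >= watermark:
        return "main"           # window still open: normal path
    if event_time >= watermark - allowed_lateness:
        return "correction"     # choice 2: reopen the window, emit a correction
    return "side_output"        # choice 3: hand off to reconciliation

route_late_event(100.0, watermark=90.0, allowed_lateness=30.0)  # "main"
route_late_event(80.0,  watermark=90.0, allowed_lateness=30.0)  # "correction"
route_late_event(40.0,  watermark=90.0, allowed_lateness=30.0)  # "side_output"
```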

The Silent Accuracy Problem

If you drop late events silently and never measure how many you're dropping, you have no idea how accurate your results are. Before deploying any windowed aggregation pipeline in production, instrument the late event rate. If more than 0.1% of events are arriving late, investigate the source of delay before deciding your latency tolerance.

Windows: How You Slice an Infinite Stream Into Finite Pieces

To compute an aggregation over a stream — a count, a sum, an average — you need to define a finite chunk of data to aggregate over. This is a window.

Tumbling Windows

Fixed-size, non-overlapping windows. "One-minute buckets." The window [1:00, 1:01) is closed, the window [1:01, 1:02) opens. Every event belongs to exactly one window.

Simple and efficient. Good for: periodic metrics (requests per minute, errors per hour), batch-like summaries over a stream.

Sliding Windows

Windows of fixed size that slide forward by a step smaller than their size. A "5-minute window updated every minute" means any given event appears in multiple windows. Good for: smoothed metrics, moving averages.

More expensive than tumbling windows because each event is processed multiple times.

Session Windows

Windows defined by activity, not time. A session window groups events from the same user until there's a gap of more than X minutes with no events from that user — then the session closes.

Session windows have variable and unpredictable size. They require per-key state (what's the last event time for this user?). They're powerful for user behavior analysis — "how long was this user's visit?" — but significantly more complex to implement correctly.

TUMBLING (1 min)
  ├────────────┤────────────┤────────────┤────────────┤
  1:00       1:01        1:02        1:03        1:04

SLIDING (2 min window, 1 min slide)
  ├──────────────────┤
       ├──────────────────┤
            ├──────────────────┤
  1:00     1:01     1:02     1:03     1:04

SESSION (30-second gap)
  User A:  ●●●          ●●    │gap│  ●●●●
           └── session 1 ──┘        └─ session 2 ─┘

           (sessions close when user is idle >30s)
    
Three window types: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based).
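Window assignment itself is simple arithmetic plus, for sessions, per-key grouping. A sketch of all three, with illustrative function names (times are plain seconds):

```python
def tumbling_window(t, size):
    """The single [start, start + size) window containing event time t."""
    start = (t // size) * size
    return (start, start + size)

def sliding_windows(t, size, slide):
    """Every [start, start + size) window containing t."""
    start = (t // slide) * slide
    windows = []
    while start > t - size:
        windows.append((start, start + size))
        start -= slide
    return windows

def session_windows(event_times, gap):
    """Group one key's sorted event times into sessions split by gaps > gap."""
    sessions, current = [], [event_times[0]]
    for t in event_times[1:]:
        if t - current[-1] > gap:
            sessions.append(current)   # idle too long: close the session
            current = [t]
        else:
            current.append(t)
    sessions.append(current)
    return sessions

tumbling_window(130, 60)                         # (120, 180): exactly one window
sliding_windows(130, 120, 60)                    # [(120, 240), (60, 180)]: two windows
session_windows([0, 10, 20, 100, 110], gap=30)   # [[0, 10, 20], [100, 110]]
```

The sliding case makes the cost visible: with size/slide = 2, every event lands in two windows and is aggregated twice.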

Stateful Stream Processing: Where It Gets Hard

Filtering events is easy. Mapping one event to another event is easy. The moment you need to correlate multiple events — count them, join them, detect a pattern across them — you need state.

State is what makes stream processing fundamentally different from simple message passing.

What State Looks Like

Consider: "Count the number of failed login attempts per user in the last 10 minutes. If it exceeds 5, emit a fraud alert."

To answer this, your stream processor must maintain, for each user, a rolling count of failed attempts with their timestamps. As new events arrive, it adds to the count. As old events age out of the 10-minute window, it removes them. It checks the count and emits an alert if the threshold is crossed.

This state is not in the event. It exists in the processor's memory, and it accumulates over time. For a system processing millions of users, this state can be gigabytes or terabytes.
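A sketch of that rolling-count state, with illustrative names; a real processor would keep this in a fault-tolerant state backend, not a plain Python dict:

```python
from collections import defaultdict, deque

class FailedLoginDetector:
    """Per-user rolling count of failed logins in a 10-minute window."""
    def __init__(self, window_seconds=600, threshold=5):
        self.window = window_seconds
        self.threshold = threshold
        self.failures = defaultdict(deque)      # user_id -> failure timestamps

    def on_failed_login(self, user_id, ts):
        q = self.failures[user_id]
        q.append(ts)
        while q and q[0] <= ts - self.window:   # age old failures out of the window
            q.popleft()
        return len(q) > self.threshold          # True -> emit a fraud alert

detector = FailedLoginDetector()
alerts = [detector.on_failed_login("u1", float(t)) for t in range(6)]
# alerts[:5] are False; the sixth failure pushes the count past 5
```

Multiply the deque by millions of users and the "state can be gigabytes" claim stops being abstract.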

State Backends

Stream processors like Flink let you choose where state lives:

  • In memory, on the heap: fastest access, but the state must fit in RAM and large snapshots get expensive. Fine for small state.
  • In an embedded key-value store (RocksDB) on local disk: slower per access, but state can grow far beyond memory. The usual choice once state reaches tens of gigabytes.
  • Either way, durable snapshots go to external storage (S3, HDFS) at checkpoint time; the backend choice affects speed and capacity, not durability.

Checkpointing: Fault Tolerance for State

If a stream processor crashes mid-processing, you need to restart it without losing or double-counting events. This is checkpointing.

Flink uses a mechanism called distributed snapshots (based on the Chandy-Lamport algorithm). Periodically, it injects a checkpoint barrier into the data stream. When each operator processes the barrier, it saves its current state to durable storage (S3, HDFS). If the job fails, it restores from the last checkpoint and replays events from that point.

Kafka (source)
  event₁  event₂  [barrier₁]  event₃  event₄  [barrier₂]  event₅
                       │                             │
                       ▼                             ▼
  Operator saves state to S3 when         Next checkpoint
  it receives the barrier

On failure: restart from last checkpoint, re-read events from Kafka
            starting from the offset recorded in that checkpoint.

  → event₁ and event₂ are "committed" (won't be reprocessed)
  → event₃ onward replayed from Kafka
    
Flink checkpointing: barriers flow with data, operators snapshot state when they see a barrier, enabling recovery to the last consistent snapshot.

The checkpoint interval is a trade-off: short intervals mean less data to replay on failure (smaller recovery time), but more I/O overhead and slight throughput reduction. Long intervals mean faster normal operation but longer recovery after failure.
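The recovery behavior can be modeled in a few lines. This is a toy simulation assuming a single operator and a single source partition; real checkpointing coordinates many operators via barriers:

```python
class CheckpointedCounter:
    """Toy checkpoint-and-replay model: `durable` holds a consistent
    snapshot of (source offset, operator state). On failure, restore
    both and re-read the log from the saved offset."""
    def __init__(self):
        self.offset = 0                         # next log position to read
        self.count = 0                          # operator state
        self.durable = {"offset": 0, "count": 0}

    def process(self, log, upto):
        for i in range(self.offset, upto):
            self.count += log[i]
            self.offset = i + 1

    def checkpoint(self):
        self.durable = {"offset": self.offset, "count": self.count}

    def crash_and_restore(self):
        self.offset = self.durable["offset"]
        self.count = self.durable["count"]

log = [1] * 10
job = CheckpointedCounter()
job.process(log, 4)           # events 0-3
job.checkpoint()              # snapshot: offset=4, count=4
job.process(log, 7)           # events 4-6, not yet checkpointed
job.crash_and_restore()       # in-flight progress lost, snapshot restored
job.process(log, len(log))    # replay from offset 4: no loss, no double count
job.count                     # 10
```

The key invariant is that the offset and the state are saved together; snapshotting one without the other is exactly how events get lost or double-counted.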

Exactly-Once Semantics: What It Actually Means

You will often hear stream processing systems advertise "exactly-once semantics." This phrase is frequently misunderstood, and the misunderstanding has caused real production bugs.

Exactly-once does not mean each event is processed exactly once internally.

When a stream processor fails and recovers from a checkpoint, it will re-process events from that checkpoint forward. Some events will be processed twice internally. Exactly-once means: the observable effect on the output is as if each event had been processed exactly once, even though the system may have internally processed it multiple times.

To achieve this, two things must both be true:

  1. The processor must be able to recover to a consistent state — achieved by checkpointing.
  2. The output writes must be idempotent or transactional — so that replaying an event produces the same output, not a duplicate output.

If you're writing to Kafka as your output, Flink can use Kafka's transactional producer API to ensure that even if a message is written multiple times internally, only one copy is committed to the output Kafka topic. If you're writing to a database, you need upsert semantics (write based on a unique key, so re-writing gives the same row, not a duplicate row).
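The upsert idea is small enough to show directly. A sketch with an in-memory dict standing in for the database table:

```python
def upsert(table, key, row):
    """Idempotent write: a keyed overwrite, so replaying the same event
    after a recovery rewrites the same row instead of adding a duplicate."""
    table[key] = row

orders = {}
for _ in range(2):                     # simulate a retry after recovery
    upsert(orders, "order-42", {"status": "paid", "amount": 99.0})

len(orders)                            # 1: one row, not two
```

An append-only insert in the same situation would leave two rows, and your totals would silently drift upward with every recovery.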

The Exactly-Once Trap

Exactly-once from Kafka source to Kafka sink is well-supported in Flink and Kafka Streams. Exactly-once to an external database or REST API is on you. If your pipeline ends with a call to an external service that has no idempotency guarantee, you do not have exactly-once semantics end-to-end, regardless of what Flink's documentation says. The guarantee only holds as far as the infrastructure supports it.

The Cost of Exactly-Once

Exactly-once processing has real overhead:

  • Checkpoint barriers flow through every operator, and aligned checkpoints can stall fast inputs while slower ones catch up.
  • Transactional sinks hold output in uncommitted transactions until the next checkpoint completes, so end-to-end latency grows by roughly the checkpoint interval.
  • The two-phase commit between the processor and the sink adds coordination work on every checkpoint.

In practice, the overhead is 5–15% throughput reduction compared to at-least-once processing. For many pipelines this is fine. For high-throughput, latency-sensitive pipelines, consider whether at-least-once with idempotent downstream writes gives you the same correctness guarantee at lower cost.

Joins in Stream Processing

Joining two streams is conceptually simple. It's operationally complex.

Stream-Stream Join

Join two streams on a key, within a time window. For example: join a "click" event with the corresponding "impression" event, where both events have the same ad_id, and the click must arrive within 30 seconds of the impression.

To do this, the processor must buffer one stream while waiting for matching events from the other stream. How long do you buffer? Until the time window expires. But events can be out of order. So you buffer longer to be safe. This buffering requires state. For high-cardinality keys (millions of ad IDs), this state can be enormous.
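A simplified sketch of that buffering, with illustrative names; it keeps only the latest impression per ad_id and, unlike a real join operator, never expires its state:

```python
class ClickImpressionJoin:
    """Stream-stream join sketch: buffer impressions keyed by ad_id,
    emit a joined record when a matching click arrives within `window`
    seconds of the impression."""
    def __init__(self, window=30.0):
        self.window = window
        self.impressions = {}             # ad_id -> impression timestamp (the state)

    def on_impression(self, ad_id, ts):
        self.impressions[ad_id] = ts      # buffered per-key state

    def on_click(self, ad_id, ts):
        imp_ts = self.impressions.get(ad_id)
        if imp_ts is not None and 0 <= ts - imp_ts <= self.window:
            return (ad_id, imp_ts, ts)    # joined result
        return None                       # no impression inside the window

join = ClickImpressionJoin(window=30.0)
join.on_impression("ad-1", 100.0)
join.on_click("ad-1", 110.0)   # ("ad-1", 100.0, 110.0): within 30s
join.on_click("ad-1", 150.0)   # None: outside the window
```

The missing piece, expiring impressions once the window (plus lateness allowance) has passed, is precisely where the state-size problem lives.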

Stream-Table Join

Join a stream against a slowly-changing table. For example: enrich each purchase event with the current user's profile (name, tier, location).

The user profile table is loaded into the stream processor's state (as a changelog stream from the database). Each incoming purchase event is looked up against this in-memory table. This is efficient and the most common type of stream join in practice.

The complexity: what is "current" for the user profile? If the user updated their profile 1 second before the purchase, did you get the new profile or the old one? This is the stream-table join consistency problem, and the answer depends on how quickly the changelog reaches your processor.
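A minimal model of the stream-table join and its consistency problem, with illustrative names:

```python
class PurchaseEnricher:
    """Stream-table join sketch: a changelog stream maintains the
    profiles table in processor state; each purchase is enriched with
    whatever profile version has arrived so far."""
    def __init__(self):
        self.profiles = {}                # user_id -> latest profile seen

    def on_profile_change(self, user_id, profile):
        self.profiles[user_id] = profile  # apply a changelog record

    def on_purchase(self, user_id, amount):
        # If the changelog lags the purchase stream, this lookup returns
        # the *old* profile: the consistency problem described above.
        profile = self.profiles.get(user_id, {"tier": "unknown"})
        return {"user": user_id, "amount": amount, "tier": profile["tier"]}

enricher = PurchaseEnricher()
enricher.on_purchase("u1", 10.0)              # tier "unknown": changelog lagging
enricher.on_profile_change("u1", {"tier": "gold"})
enricher.on_purchase("u1", 10.0)              # tier "gold": changelog caught up
```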

Apache Flink vs. Kafka Streams vs. Spark Streaming

A quick comparison of the major frameworks to orient you:

Property                | Apache Flink                     | Kafka Streams                    | Spark Structured Streaming
------------------------|----------------------------------|----------------------------------|------------------------------------
Processing model        | True streaming (event at a time) | True streaming (event at a time) | Micro-batch (default) or continuous
Latency                 | Milliseconds                     | Milliseconds                     | Seconds (micro-batch)
State management        | First-class, RocksDB backend,    | First-class, RocksDB backend,    | Good, but less flexible than Flink
                        | very mature                      | simpler API                      |
Exactly-once            | Yes (end-to-end with Kafka)      | Yes (end-to-end with Kafka)      | Yes (with idempotent sinks)
Event time / watermarks | Excellent, fine-grained control  | Good, improving                  | Good, less flexible
Deployment model        | Standalone cluster or Kubernetes | Library inside your app —        | Spark cluster
                        |                                  | no cluster needed                |
Learning curve          | Steep                            | Gentle (if you know Kafka)       | Moderate (if you know Spark)
Best for                | Complex, stateful, low-latency   | Kafka-centric, moderate          | Teams already using Spark,
                        | pipelines                        | complexity                       | batch + stream unification
Real-World Guidance

If your team is Kafka-native and you need streaming that's operationally simple, start with Kafka Streams. If you need sub-second latency, complex stateful logic (session windows, pattern detection, complex joins), or very high throughput, invest in Flink. If your team already lives in Spark and batch processing is your primary workload with streaming as a secondary need, Spark Structured Streaming is the lowest friction choice.

When Batch Is Still the Right Answer

Streaming is fashionable. This has led to a generation of pipelines being rewritten as streaming jobs when the original batch version was simpler, cheaper, and equally good for the use case.

Ask yourself: what is the actual latency requirement?

If the answer is "a business analyst refreshes this dashboard once in the morning," you don't need sub-second streaming. A job that runs every hour is fine. If the answer is "we need to detect fraud within 5 seconds of a transaction," you genuinely need streaming.

There are several cases where batch remains the superior choice:

  • Large backfills and historical reprocessing: a batch engine can re-aggregate months of data far faster than replaying it through a stream job.
  • Whole-dataset computations, such as training an ML model, that need to see all the data before producing anything.
  • Ad-hoc analytics: an analyst iterating on queries wants a frozen snapshot, not a moving target.
  • Cost-sensitive workloads: a job that runs 20 minutes a day is cheaper than a pipeline that runs around the clock.

A Practical Latency Ladder

Start at the top of this ladder and move down only when you have a concrete latency requirement that forces you to.

  1. Daily batch: the default for reporting and offline analytics.
  2. Hourly batch: dashboards that business teams watch during the day.
  3. Micro-batch every few minutes: near-real-time freshness without a full streaming stack.
  4. Streaming, seconds of latency: alerting, fraud checks, operational monitoring.
  5. Streaming, sub-second latency: only when the product genuinely demands it.

Reprocessing: The Practical Test of Your Architecture

Here is a test for your data architecture: you discover that a bug in your pipeline has been producing incorrect metrics for the past three weeks. How do you fix the historical data?

With a batch architecture: fix the code, rerun the job over the same input data. You get the correct output in minutes or hours. Relatively painless.

With a Kappa/streaming architecture: if you retained the raw events in Kafka or an object store, spin up a new version of the job, replay from three weeks ago, and wait for it to catch up. This could take hours or days depending on data volume and job throughput.

With a Lambda architecture: the next batch run will correct the output. You just have to wait for the next batch window.

The ability to reprocess historical data is not a nice-to-have. Bugs happen, requirements change, new features need backfilled data. If your architecture makes reprocessing painful, you will live with wrong data for longer than you should.

The Reprocessing Checklist

Before shipping a streaming pipeline, answer these questions:

  • Are the raw events retained, and for how long? If the log keeps only seven days, seven days is the most you can ever reprocess.
  • At your job's throughput, how long would replaying three weeks of data take: hours, or days?
  • Can you run two versions of the job side by side and swap serving to the new output, or does the pipeline overwrite a single table in place?
  • What do downstream consumers see while the replay catches up, and do they handle corrections?

A Decision Framework

When you're choosing between batch and stream for a new data pipeline, work through these questions in order:

What latency do you need?
│
├── Hours or more? ──────────────────────────► Use batch.
│                                              Simpler, cheaper, easier to reprocess.
│
└── Minutes or seconds?
    │
    ├── Is the computation stateful?
    │   (joins, aggregations, session detection)
    │   │
    │   ├── No (filter/enrich only) ──────────► Kafka Streams or Flink.
    │   │                                        Low operational overhead.
    │   │
    │   └── Yes
    │       │
    │       ├── Complex state, sub-second latency ──► Flink.
    │       │                                          Accept the operational cost.
    │       │
    │       └── Moderate state, seconds ok ─────────► Kafka Streams or
    │                                                   Spark Structured Streaming.
    │
    └── Do you need exactly-once end-to-end?
        │
        ├── Yes, including external sinks ────────────► Verify your sink supports it.
        │                                               Idempotent writes or
        │                                               transactional API required.
        │
        └── At-least-once with idempotent writes ok ─► Simpler and faster.
    
Decision tree for batch vs. stream and framework selection.

Putting It Together: A Concrete Example

Let's say you're building a real-time dashboard for an e-commerce platform. It shows: revenue per minute (last 30 minutes), active users right now, and a flag if any product page has an error rate above 5% in the last 5 minutes.

Revenue per minute: Tumbling 1-minute windows on purchase events, keyed by nothing (global sum). Use event time. Watermark lag of 30 seconds (purchases from mobile apps can be delayed). Late events (beyond 30s) are dropped — the error is small and the metric is approximate anyway.

Active users right now: Session windows with a 5-minute inactivity gap on click/view events, keyed by user_id. Count distinct sessions open at any moment. This requires significant state (one entry per active user), RocksDB backend. Use processing time here — "right now" is inherently about when the server sees the event, not when the user clicked.

Error rate by product page: Sliding 5-minute window, 1-minute slide, on page-view and error events, keyed by product_id. Compute error_count / view_count. Alert if ratio > 0.05. Use event time, watermark 10 seconds. Low tolerance for late events because this is an alerting use case — a false negative (missed alert) is worse than slightly delayed alerting.

All three jobs can run in Flink from the same Kafka topics. They have different window types, different keys, and different latency tolerances — but the infrastructure is shared.
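The error-rate job, for instance, reduces to counting two event kinds per window slice. A naive sketch that recomputes one slice from scratch; a real processor maintains incremental counts per sliding window instead:

```python
def error_rate(events, product_id, window_start, window_end):
    """Error rate for one product over one [window_start, window_end) slice.
    events: (event_time, product_id, kind) with kind "view" or "error"."""
    views = errors = 0
    for ts, pid, kind in events:
        if pid == product_id and window_start <= ts < window_end:
            views += kind == "view"
            errors += kind == "error"
    return errors / views if views else 0.0

events = ([(t, "p1", "view") for t in range(0, 200, 10)]
          + [(15, "p1", "error"), (95, "p1", "error")])
error_rate(events, "p1", 0, 300)   # 2 errors / 20 views = 0.1: above the 5% threshold
```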

Chapter Summary

The One-Sentence Principle

Batch and stream processing are the same computation with different assumptions about when the data ends — choose the simplest model that meets your latency requirement, and resist the pull of streaming until you genuinely need it.

The Most Common Mistake

Using processing time instead of event time for any aggregation that a user or business analyst will interpret as "what happened at time X." The numbers look plausible, are wrong in systematic ways, and the error is invisible until someone cross-checks against another source.

Three Questions for Your Next Design Review

  • What is the actual latency requirement, and who confirmed it? Would an hourly batch job satisfy the stated need?
  • If we discover a bug that corrupts three weeks of output, how do we reprocess? How long will it take?
  • For every windowed aggregation: are we using event time or processing time, and is that the right choice for this specific metric?