Study Notes 6.15: Kafka & Flink Streaming
Pizofreude

Publish Date: Mar 18

1. Introduction to Kafka Streaming with PyFlink

  • Streaming Data Processing:
    • Involves the continuous ingestion, processing, and movement of data in real time.
    • Critical for use cases where immediate reaction is needed (e.g., fraud detection, real-time analytics, surge pricing).
  • Key Technologies Covered:
    • Kafka (or Kafka-compatible systems like Redpanda):
      • Acts as a high-throughput, low-latency messaging system.
      • Uses topics (like tables) to which data producers send events.
    • Apache Flink:
      • Provides a distributed processing framework for both streaming and batch workloads (batch is treated as a bounded stream).
      • Excels in stateful processing, windowing, and fault tolerance.
    • PostgreSQL:
      • Serves as the sink (destination) to store processed data, enabling query and analysis.

2. Architecture and Environment Setup

  • Containerization with Docker Compose:
    • Multiple containers (or “machines”) are spun up for the following (a minimal Compose sketch follows this list):
      • Redpanda: A Kafka-compatible broker standing in for Kafka during local development.
      • Flink Components:
        • Job Manager: Orchestrates job submission and scheduling.
        • Task Manager: Executes parallel tasks.
      • PostgreSQL: Acts as the landing zone for processed events.
    • Local Setup Tips:
      • Use database tools like DataGrip, DBeaver, or PGAdmin to connect to PostgreSQL.
      • Verify each container’s status via the Docker CLI and dashboards (e.g., Apache Flink’s dashboard on localhost:8081).
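
The course’s actual docker-compose.yml isn’t reproduced here; the sketch below is a minimal illustrative version of the four-service layout, with image tags, ports, and credentials as assumptions:

```yaml
# Minimal sketch of the four services; image tags, ports, and
# credentials are illustrative assumptions, not the course's file.
services:
  redpanda:
    image: redpandadata/redpanda:latest
    command: redpanda start --overprovisioned --smp 1
    ports:
      - "9092:9092"        # Kafka API
  jobmanager:
    image: flink:1.17
    command: jobmanager
    ports:
      - "8081:8081"        # Flink dashboard
  taskmanager:
    image: flink:1.17
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: streaming
    ports:
      - "5432:5432"
```

After `docker compose up -d`, the Flink dashboard should be reachable at localhost:8081 and PostgreSQL at localhost:5432.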

3. Kafka Producers, Topics, and Data Serialization

  • Kafka Producer Role:
    • A Python-based producer script sends test data into Kafka (simulated by Redpanda); a minimal sketch follows this list.
    • Data is serialized into JSON for interoperability among different languages and systems.
    • Alternatives:
      • Formats like Thrift or Protobuf can be used to reduce message size and improve efficiency.
  • Kafka Topics:
    • Analogous to tables in a relational database.
    • Producers write data to a topic, and consumers (like Flink jobs) subscribe to these topics.
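
A minimal version of such a producer, sketched with the kafka-python library; the broker address, topic name, and event fields are illustrative assumptions:

```python
import json
import time

from kafka import KafkaProducer  # kafka-python package

# Broker address matches the local Redpanda container; topic name and
# event fields are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each event dict to JSON bytes for interoperability.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(100):
    event = {"event_id": i, "event_time": time.time(), "amount": i * 1.5}
    producer.send("test-topic", value=event)

producer.flush()  # ensure all buffered messages reach the broker
```

Swapping the value_serializer is where a Protobuf or Thrift encoding would plug in.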

4. Flink’s Role and Modes of Operation

  • Flink as a Stream Processor:
    • Reads data from Kafka and writes processed results to PostgreSQL.
    • Supports both continuous streaming (staying active to process new events) and batch-like (micro-batch) modes.
  • Job Lifecycle and Checkpointing:
    • Checkpointing:
      • Periodically snapshots the job’s state (e.g., every 10 seconds).
      • Enables job recovery after failures without reprocessing all data from the beginning.
      • Must be carefully configured to balance resilience with processing overhead.
    • Offset Management:
      • Earliest Offset: Reads all available data from Kafka.
      • Latest Offset: Reads only data that arrives after the job starts.
      • Custom Timestamp: Allows restarting processing from a specific point in time. (A source-definition sketch with these options follows this list.)
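
A hedged PyFlink Table API sketch of these knobs: checkpointing every 10 seconds and the Kafka connector’s startup mode. The schema, topic, and addresses are assumptions, and the Kafka SQL connector JAR is presumed to be on the Flink classpath:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # snapshot job state every 10 seconds

t_env = StreamTableEnvironment.create(env)

# Source table over the Kafka (Redpanda) topic; schema and names are
# illustrative assumptions.
t_env.execute_sql("""
    CREATE TABLE events (
        event_id   BIGINT,
        amount     DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-consumer',
        -- 'earliest-offset', 'latest-offset', or 'timestamp' (paired with
        -- 'scan.startup.timestamp-millis') select where reading begins
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")
```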

5. Windowing and Watermarking in Flink

  • Why Windowing?
    • Helps group events into finite chunks for aggregation (e.g., counts, sums).
    • Especially important when dealing with unbounded (continuous) data streams.
  • Types of Windows:
    • Tumbling Windows:
      • Non-overlapping, fixed-size windows (e.g., one-minute windows).
      • Best for batch-like processing where events are grouped by time intervals.
    • Sliding Windows:
      • Overlapping windows that slide over time.
      • Useful for capturing trends at finer granularity (e.g., a one-minute window that advances every 30 seconds).
    • Session Windows:
      • Group events based on periods of activity separated by gaps (determined by a “session gap”).
      • Ideal for modeling user sessions or bursts of activity.
  • Watermarking:
    • A mechanism to tolerate out-of-order events by declaring how long to wait for them (e.g., 15 seconds); see the windowed-aggregation sketch after this list.
    • Allows late-arriving events to be incorporated into the correct window.
    • Can be paired with “allowed lateness” settings to update results if events arrive beyond the typical delay.
    • Side Outputs:
      • An option to divert extremely late data to a separate stream for later processing.
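
Putting windows and watermarks together, a hedged Flink SQL sketch (continuing the hypothetical t_env from the previous section): the source declares a 15-second watermark, and the query counts events per one-minute tumbling window.

```python
# Same topic as before, now with an event-time column and watermark.
t_env.execute_sql("""
    CREATE TABLE events_wm (
        event_id   BIGINT,
        event_time TIMESTAMP(3),
        -- tolerate events arriving up to 15 seconds out of order
        WATERMARK FOR event_time AS event_time - INTERVAL '15' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-window-demo',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling windows: non-overlapping, fixed-size groups.
result = t_env.sql_query("""
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS event_count
    FROM events_wm
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```

Sliding and session windows use HOP(...) and SESSION(...) in the same GROUP BY position.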

6. Fault Tolerance and Checkpointing Strategies

  • Resilience in Streaming Jobs:
    • Checkpointing not only captures Kafka offsets but also the state of active windows (even if a job fails in the middle of a window).
    • On job restart, Flink can resume processing from the last checkpoint to avoid duplicate work.
    • Potential Pitfalls:
      • Restarting a job from scratch (using the “earliest” offset) can lead to duplicate data.
      • Best practice is to redeploy by restoring from the checkpoint rather than starting a new job entirely (see the retention sketch below).
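
One hedged way to set this up in PyFlink is to retain checkpoints when a job is cancelled, then resubmit pointing at the retained checkpoint (method names may differ slightly across Flink versions):

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.checkpoint_config import ExternalizedCheckpointCleanup

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # every 10 seconds, as above

# Keep checkpoint files when the job is cancelled, so a redeploy can
# restore from them instead of replaying from the earliest offset.
env.get_checkpoint_config().enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
```

Restoring then happens at submission time, e.g. `flink run -s <checkpoint-path> ...`, where `-s` points at the retained checkpoint.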

7. When to Use Streaming Versus Batch Processing

  • Streaming Use Cases:
    • Real-time fraud detection.
    • Dynamic pricing models (e.g., Uber surge pricing).
    • Systems that require near-immediate responses (e.g., alert systems).
  • Batch (Micro-Batch) Use Cases:
    • When a slight delay is acceptable (e.g., hourly data aggregation).
    • Scenarios where processing overhead must be minimized.
    • Many analytical workloads that do not require instantaneous reaction.
  • Key Consideration:
    • The decision to use streaming over batch processing depends on the business need for real-time insights versus the complexity and maintenance overhead associated with streaming systems.

8. Best Practices and Additional Tips

  • Connector Libraries:
    • Use Flink’s built-in connectors (e.g., the Kafka and JDBC connectors) to simplify data ingestion and output; a JDBC sink sketch follows this list.
  • Schema Management:
    • Since Kafka topics do not enforce schemas, extra care is needed to manage data consistency (e.g., using schema registries or defining conventions for producers and consumers).
  • Scaling and Parallelism:
    • Flink distributes records across parallel subtasks based on the keys used in processing (e.g., grouping by a particular column); the parallelism itself is configured at the job or cluster level.
    • Properly keying your streams can help balance workload across available Task Managers.
  • Managing Complexity:
    • Recognize that streaming pipelines have more moving parts than batch jobs (offsets, state management, watermarking).
    • It’s important for teams to understand the additional operational complexities and invest in monitoring, alerting, and clear documentation.
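
For the JDBC side, a hedged sink sketch writing into the local PostgreSQL container (continuing the hypothetical t_env and windowed result from earlier; URL, credentials, and schema are assumptions, and the JDBC connector JAR plus PostgreSQL driver must be on the classpath):

```python
# Sink table backed by PostgreSQL via Flink's JDBC connector.
# URL, credentials, and schema are illustrative assumptions.
t_env.execute_sql("""
    CREATE TABLE event_counts_sink (
        window_start TIMESTAMP(3),
        event_count  BIGINT,
        PRIMARY KEY (window_start) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/streaming',
        'table-name' = 'event_counts',
        'username' = 'postgres',
        'password' = 'postgres'
    )
""")

# Stream the windowed aggregation into PostgreSQL.
result.execute_insert("event_counts_sink")
```

Declaring a primary key makes the JDBC sink write in upsert mode, which is one way to absorb duplicates after a restart (see the Q&A section below).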

9. Spark Streaming vs. Flink Streaming

  • Spark Streaming:
    • Operates on the micro-batch principle (processing data in small, time-based batches).
    • Can introduce a slight delay due to batch intervals (e.g., 15–30 seconds).
  • Flink Streaming:
    • Implements true continuous processing (push architecture), processing events as they arrive.
    • Generally offers lower latency and more granular control over windowing and state management.
  • Choosing Between Them:
    • For real-time, low-latency applications where every millisecond counts, Flink’s continuous processing is often preferred.
    • For use cases where micro-batch latency is acceptable, Spark Streaming might be simpler to implement and maintain.

10. Q&A and Practical Insights

  • Job Recovery and Duplicate Handling:
    • It is essential to correctly configure checkpointing and offset management to prevent duplicate records when a job is restarted.
    • Some production environments handle duplicate records by using “upsert” semantics in the sink (e.g., PostgreSQL’s INSERT … ON CONFLICT … DO UPDATE; see the sketch after this list).
  • Skill Set and Team Organization:
    • Streaming data engineering requires specialized skills due to the operational and development complexities involved.
    • In some organizations, roles are split between batch data engineers and streaming (or “real-time”) engineers to ensure expertise in each area.
  • Real-World Examples:
    • Netflix Fraud Detection:
      • Streaming is used to identify anomalies and immediately trigger security measures.
    • Uber Surge Pricing:
      • Real-time data is crucial to dynamically adjust pricing based on supply and demand fluctuations.
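
The upsert pattern mentioned above, sketched directly against PostgreSQL with psycopg2; connection details, table, and columns are illustrative assumptions:

```python
import psycopg2  # assumes the psycopg2 driver is installed

# Connection details, table, and columns are illustrative assumptions.
conn = psycopg2.connect(
    host="localhost", dbname="streaming",
    user="postgres", password="postgres",
)

with conn, conn.cursor() as cur:
    # Re-inserting the same window after a job restart updates the
    # existing row instead of creating a duplicate.
    cur.execute(
        """
        INSERT INTO event_counts (window_start, event_count)
        VALUES (%s, %s)
        ON CONFLICT (window_start) DO UPDATE
        SET event_count = EXCLUDED.event_count
        """,
        ("2025-03-18 10:00:00", 42),
    )
```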

Supplementary Information

  • Best Practice Tips:
    • Regularly monitor checkpoint intervals and state size to avoid performance bottlenecks.
    • Test different watermark strategies to balance latency and completeness of results.
    • Consider using a schema registry when working with evolving data schemas in Kafka topics.
