Study Notes 6.15: Kafka & Flink Streaming
Pizofreude

Publish Date: Mar 18

1. Introduction to Kafka Streaming with PyFlink

  • Streaming Data Processing:
    • Involves the continuous ingestion, processing, and movement of data in real time.
    • Critical for use cases where immediate reaction is needed (e.g., fraud detection, real-time analytics, surge pricing).
  • Key Technologies Covered:
    • Kafka (or Kafka-compatible systems like Redpanda):
      • Acts as a high-throughput, low-latency messaging system.
      • Uses topics (like tables) to which data producers send events.
    • Apache Flink:
      • Provides a distributed processing framework for both streaming and batch workloads (batch is treated as a bounded stream).
      • Excels in stateful processing, windowing, and fault tolerance.
    • PostgreSQL:
      • Serves as the sink (destination) to store processed data, enabling query and analysis.

2. Architecture and Environment Setup

  • Containerization with Docker Compose:
    • Multiple containers (or “machines”) are spun up for the following (a minimal Compose sketch follows this list):
      • Redpanda: A Kafka-compatible broker standing in for Kafka during local development.
      • Flink Components:
        • Job Manager: Orchestrates job submission and scheduling.
        • Task Manager: Executes parallel tasks.
      • PostgreSQL: Acts as the landing zone for processed events.
    • Local Setup Tips:
      • Use database tools like DataGrip, DBeaver, or PGAdmin to connect to PostgreSQL.
      • Verify each container’s status via the Docker CLI and dashboards (e.g., Apache Flink’s dashboard on localhost:8081).
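
The course’s actual docker-compose.yml isn’t reproduced here; the sketch below is a minimal illustrative version of the four-service layout, with image tags, ports, and credentials as assumptions:

```yaml
# Minimal sketch of the four services; image tags, ports, and
# credentials are illustrative assumptions, not the course's file.
services:
  redpanda:
    image: redpandadata/redpanda:latest
    command: redpanda start --overprovisioned --smp 1
    ports:
      - "9092:9092"        # Kafka API
  jobmanager:
    image: flink:1.17
    command: jobmanager
    ports:
      - "8081:8081"        # Flink dashboard
  taskmanager:
    image: flink:1.17
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: streaming
    ports:
      - "5432:5432"
```

After `docker compose up -d`, the Flink dashboard should be reachable at localhost:8081 and PostgreSQL at localhost:5432.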

3. Kafka Producers, Topics, and Data Serialization

  • Kafka Producer Role:
    • A Python-based producer script sends test data into Kafka (simulated by Redpanda); a minimal sketch follows this list.
    • Data is serialized into JSON for interoperability among different languages and systems.
    • Alternatives:
      • Formats like Thrift or Protobuf can be used to reduce message size and improve efficiency.
  • Kafka Topics:
    • Analogous to tables in a relational database.
    • Producers write data to a topic, and consumers (like Flink jobs) subscribe to these topics.
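
A minimal version of such a producer, sketched with the kafka-python library; the broker address, topic name, and event fields are illustrative assumptions:

```python
import json
import time

from kafka import KafkaProducer  # kafka-python package

# Broker address matches the local Redpanda container; topic name and
# event fields are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each event dict to JSON bytes for interoperability.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(100):
    event = {"event_id": i, "event_time": time.time(), "amount": i * 1.5}
    producer.send("test-topic", value=event)

producer.flush()  # ensure all buffered messages reach the broker
```

Swapping the value_serializer is where a Protobuf or Thrift encoding would plug in.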

4. Flink’s Role and Modes of Operation

  • Flink as a Stream Processor:
    • Reads data from Kafka and writes processed results to PostgreSQL.
    • Supports both continuous streaming (staying active to process new events) and batch-like (micro-batch) modes.
  • Job Lifecycle and Checkpointing:
    • Checkpointing:
      • Periodically snapshots the job’s state (e.g., every 10 seconds).
      • Enables job recovery after failures without reprocessing all data from the beginning.
      • Must be carefully configured to balance resilience with processing overhead.
    • Offset Management:
      • Earliest Offset: Reads all available data from Kafka.
      • Latest Offset: Reads only data that arrives after the job starts.
      • Custom Timestamp: Allows restarting processing from a specific point in time. (A source-definition sketch with these options follows this list.)
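
A hedged PyFlink Table API sketch of these knobs: checkpointing every 10 seconds and the Kafka connector’s startup mode. The schema, topic, and addresses are assumptions, and the Kafka SQL connector JAR is presumed to be on the Flink classpath:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # snapshot job state every 10 seconds

t_env = StreamTableEnvironment.create(env)

# Source table over the Kafka (Redpanda) topic; schema and names are
# illustrative assumptions.
t_env.execute_sql("""
    CREATE TABLE events (
        event_id   BIGINT,
        amount     DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-consumer',
        -- 'earliest-offset', 'latest-offset', or 'timestamp' (paired with
        -- 'scan.startup.timestamp-millis') select where reading begins
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")
```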

5. Windowing and Watermarking in Flink

  • Why Windowing?
    • Helps group events into finite chunks for aggregation (e.g., counts, sums).
    • Especially important when dealing with unbounded (continuous) data streams.
  • Types of Windows:
    • Tumbling Windows:
      • Non-overlapping, fixed-size windows (e.g., one-minute windows).
      • Best for batch-like processing where events are grouped by time intervals.
    • Sliding Windows:
      • Overlapping windows that slide over time.
      • Useful for capturing trends at finer granularity (e.g., a one-minute window that advances every 30 seconds).
    • Session Windows:
      • Group events based on periods of activity separated by gaps (determined by a “session gap”).
      • Ideal for modeling user sessions or bursts of activity.
  • Watermarking:
    • A mechanism to tolerate out-of-order events by declaring how long to wait for them (e.g., 15 seconds); see the windowed-aggregation sketch after this list.
    • Allows late-arriving events to be incorporated into the correct window.
    • Can be paired with “allowed lateness” settings to update results if events arrive beyond the typical delay.
    • Side Outputs:
      • An option to divert extremely late data to a separate stream for later processing.
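
Putting windows and watermarks together, a hedged Flink SQL sketch (continuing the hypothetical t_env from the previous section): the source declares a 15-second watermark, and the query counts events per one-minute tumbling window.

```python
# Same topic as before, now with an event-time column and watermark.
t_env.execute_sql("""
    CREATE TABLE events_wm (
        event_id   BIGINT,
        event_time TIMESTAMP(3),
        -- tolerate events arriving up to 15 seconds out of order
        WATERMARK FOR event_time AS event_time - INTERVAL '15' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-window-demo',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling windows: non-overlapping, fixed-size groups.
result = t_env.sql_query("""
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS event_count
    FROM events_wm
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```

Sliding and session windows use HOP(...) and SESSION(...) in the same GROUP BY position.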

6. Fault Tolerance and Checkpointing Strategies

  • Resilience in Streaming Jobs:
    • Checkpointing not only captures Kafka offsets but also the state of active windows (even if a job fails in the middle of a window).
    • On job restart, Flink can resume processing from the last checkpoint to avoid duplicate work.
    • Potential Pitfalls:
      • Restarting a job from scratch (using the “earliest” offset) can lead to duplicate data.
      • Best practice is to redeploy by restoring from the checkpoint rather than starting a new job entirely (see the retention sketch below).
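
One hedged way to set this up in PyFlink is to retain checkpoints when a job is cancelled, then resubmit pointing at the retained checkpoint (method names may differ slightly across Flink versions):

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.checkpoint_config import ExternalizedCheckpointCleanup

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # every 10 seconds, as above

# Keep checkpoint files when the job is cancelled, so a redeploy can
# restore from them instead of replaying from the earliest offset.
env.get_checkpoint_config().enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
```

Restoring then happens at submission time, e.g. `flink run -s <checkpoint-path> ...`, where `-s` points at the retained checkpoint.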

7. When to Use Streaming Versus Batch Processing

  • Streaming Use Cases:
    • Real-time fraud detection.
    • Dynamic pricing models (e.g., Uber surge pricing).
    • Systems that require near-immediate responses (e.g., alert systems).
  • Batch (Micro-Batch) Use Cases:
    • When a slight delay is acceptable (e.g., hourly data aggregation).
    • Scenarios where processing overhead must be minimized.
    • Many analytical workloads that do not require instantaneous reaction.
  • Key Consideration:
    • The decision to use streaming over batch processing depends on the business need for real-time insights versus the complexity and maintenance overhead associated with streaming systems.

8. Best Practices and Additional Tips

  • Connector Libraries:
    • Use Flink’s built-in connectors (e.g., the Kafka and JDBC connectors) to simplify data ingestion and output; a JDBC sink sketch follows this list.
  • Schema Management:
    • Since Kafka topics do not enforce schemas, extra care is needed to manage data consistency (e.g., using schema registries or defining conventions for producers and consumers).
  • Scaling and Parallelism:
    • Flink distributes records across parallel subtasks based on the keys used in processing (e.g., grouping by a particular column); the parallelism itself is configured at the job or cluster level.
    • Properly keying your streams can help balance workload across available Task Managers.
  • Managing Complexity:
    • Recognize that streaming pipelines have more moving parts than batch jobs (offsets, state management, watermarking).
    • It’s important for teams to understand the additional operational complexities and invest in monitoring, alerting, and clear documentation.
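
For the JDBC side, a hedged sink sketch writing into the local PostgreSQL container (continuing the hypothetical t_env and windowed result from earlier; URL, credentials, and schema are assumptions, and the JDBC connector JAR plus PostgreSQL driver must be on the classpath):

```python
# Sink table backed by PostgreSQL via Flink's JDBC connector.
# URL, credentials, and schema are illustrative assumptions.
t_env.execute_sql("""
    CREATE TABLE event_counts_sink (
        window_start TIMESTAMP(3),
        event_count  BIGINT,
        PRIMARY KEY (window_start) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/streaming',
        'table-name' = 'event_counts',
        'username' = 'postgres',
        'password' = 'postgres'
    )
""")

# Stream the windowed aggregation into PostgreSQL.
result.execute_insert("event_counts_sink")
```

Declaring a primary key makes the JDBC sink write in upsert mode, which is one way to absorb duplicates after a restart (see the Q&A section below).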

9. Spark Streaming vs. Flink Streaming

  • Spark Streaming:
    • Operates on the micro-batch principle (processing data in small, time-based batches).
    • Can introduce a slight delay due to batch intervals (e.g., 15–30 seconds).
  • Flink Streaming:
    • Implements true continuous processing (push architecture), processing events as they arrive.
    • Generally offers lower latency and more granular control over windowing and state management.
  • Choosing Between Them:
    • For real-time, low-latency applications where every millisecond counts, Flink’s continuous processing is often preferred.
    • For use cases where micro-batch latency is acceptable, Spark Streaming might be simpler to implement and maintain.

10. Q&A and Practical Insights

  • Job Recovery and Duplicate Handling:
    • It is essential to correctly configure checkpointing and offset management to prevent duplicate records when a job is restarted.
    • Some production environments handle duplicate records by using “upsert” semantics in the sink (e.g., PostgreSQL’s INSERT … ON CONFLICT … DO UPDATE; see the sketch after this list).
  • Skill Set and Team Organization:
    • Streaming data engineering requires specialized skills due to the operational and development complexities involved.
    • In some organizations, roles are split between batch data engineers and streaming (or “real-time”) engineers to ensure expertise in each area.
  • Real-World Examples:
    • Netflix Fraud Detection:
      • Streaming is used to identify anomalies and immediately trigger security measures.
    • Uber Surge Pricing:
      • Real-time data is crucial to dynamically adjust pricing based on supply and demand fluctuations.
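
The upsert pattern mentioned above, sketched directly against PostgreSQL with psycopg2; connection details, table, and columns are illustrative assumptions:

```python
import psycopg2  # assumes the psycopg2 driver is installed

# Connection details, table, and columns are illustrative assumptions.
conn = psycopg2.connect(
    host="localhost", dbname="streaming",
    user="postgres", password="postgres",
)

with conn, conn.cursor() as cur:
    # Re-inserting the same window after a job restart updates the
    # existing row instead of creating a duplicate.
    cur.execute(
        """
        INSERT INTO event_counts (window_start, event_count)
        VALUES (%s, %s)
        ON CONFLICT (window_start) DO UPDATE
        SET event_count = EXCLUDED.event_count
        """,
        ("2025-03-18 10:00:00", 42),
    )
```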

Supplementary Information

  • Best Practice Tips:
    • Regularly monitor checkpoint intervals and state size to avoid performance bottlenecks.
    • Test different watermark strategies to balance latency and completeness of results.
    • Consider using a schema registry when working with evolving data schemas in Kafka topics.
