A presentation at the Raleigh Apache Kafka® Meetup in Raleigh, NC, USA by Viktor Gamov
Streams Must Flow: fault-tolerant stream processing apps on Kubernetes. April 2019, Raleigh, NC. @gamussa | #raleighKafka | @ConfluentInc
Special thanks! @gwenshap @MatthiasJSax
Agenda
• Kafka Streams 101
• How do Kafka Streams applications scale?
• Kubernetes 101
• Recommendations for Kafka Streams
https://gamov.dev/kstreams-k8s-pr
Kafka Streams 101
[Diagram: Other Systems → Kafka Connect → Kafka → Your App (Kafka Streams) → Kafka → Kafka Connect → Other Systems]
Stock Trade Stats Example

KStream<String, Trade> source = builder.stream(STOCK_TOPIC);
KStream<Windowed<String>, TradeStats> stats = source
    .groupByKey()
    // 5-second windows, advancing every 1 second
    .windowedBy(TimeWindows.of(5000).advanceBy(1000))
    .aggregate(TradeStats::new,
        (k, v, tradestats) -> tradestats.add(v),
        Materialized.<String, TradeStats, WindowStore<Bytes, byte[]>>as("trade-aggregates")
            .withValueSerde(new TradeStatsSerde()))
    .toStream()
    .mapValues(TradeStats::computeAvgPrice);
stats.to(STATS_OUT_TOPIC,
    Produced.keySerde(WindowedSerdes.timeWindowedSerdeFrom(String.class)));
Topologies
[Diagram: processor topology]
• builder.stream(…): Source Node
• groupByKey().windowedBy(…).aggregate(…): Processor Node, backed by state stores
• mapValues(): Processor Node
• to(…): Sink Node
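You can inspect the topology Kafka Streams builds from the DSL. A minimal sketch, assuming the StreamsBuilder from the stats example above:

import org.apache.kafka.streams.Topology;

// describe() lists the source, processor, and sink nodes, plus the
// state stores attached to each processor, matching the diagram above.
Topology topology = builder.build();
System.out.println(topology.describe());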
How Do Kafka Streams Applications Scale?
Partitions, Tasks, and Consumer Groups
• A task executes the processor topology
• 4 input topic partitions => 4 tasks
• One consumer group: can be executed with 1 - 4 threads on 1 - 4 machines
[Diagram: input topic partitions fan out to tasks, which write to the result topic]
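How many threads each instance runs is plain configuration. A sketch, with an illustrative application id and broker address:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trade-stats");       // doubles as the consumer group id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
// With 4 input partitions there are 4 tasks; they can be spread across
// 1 - 4 threads per instance and 1 - 4 instances in total.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);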
Scaling with State ("no state")
[Diagram: the Trade Stats App scales out from Instance 1, to Instances 1 and 2, to Instances 1, 2, and 3]
Scaling and Fault-Tolerance: Two Sides of the Same Coin
Fault-Tolerance
[Diagram: Trade Stats App running as Instances 1, 2, and 3]
Fault-Tolerant State
[Diagram: records flow from the Input Topic through the app to the Result Topic; state updates are written to a Changelog Topic]
Migrate State
[Diagram: a task moves from Trade Stats App Instance 1 to Instance 2, which restores its state from the Changelog Topic]
Recovery Time
• Changelog topics are log-compacted
• Size of the changelog topic is linear in the size of the state
• Large state implies long recovery times
Recovery Overhead
• Recovery overhead is proportional to segment-size / state-size
• Segment size smaller than state size => reduced overhead
• Update the changelog topic segment size accordingly
  ○ topic config: segment.bytes
  ○ log cleaner interval matters, too
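Kafka Streams can pass topic configs through to the internal topics it creates. A sketch of shrinking the changelog segment size this way (the values are illustrative, not recommendations):

import java.util.Properties;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// StreamsConfig.topicPrefix() prepends "topic.", and any such property is
// applied to the internal changelog/repartition topics Streams creates.
props.put(StreamsConfig.topicPrefix(TopicConfig.SEGMENT_BYTES_CONFIG), "52428800"); // 50 MB segments
// How eagerly the log cleaner compacts the changelog matters too:
props.put(StreamsConfig.topicPrefix(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG), "0.1");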
Kubernetes Fundamentals
https://twitter.com/sahrizv/status/1018184792611827712
Orchestration
● Compute
● Networking
● Storage
● Service Discovery
Kubernetes
● Schedules and allocates resources
● Networking between Pods
● Storage
● Service Discovery
Refresher: Kubernetes Architecture
[Architecture diagram, including kubectl: https://thenewstack.io/kubernetes-an-overview/]
Pod
• Basic unit of deployment in Kubernetes
• A collection of containers sharing:
  • Namespace
  • Network
  • Volumes
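A minimal sketch of a pod with two containers sharing the pod network and a volume (all names and images here are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: trade-stats-pod
spec:
  volumes:
    - name: state                     # shared by both containers below
      emptyDir: {}
  containers:
    - name: app
      image: trade-stats-app:latest   # hypothetical image
      volumeMounts:
        - name: state
          mountPath: /var/lib/kafka-streams
    - name: sidecar
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: state
          mountPath: /data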
Storage
• Persistent Volume (PV) & Persistent Volume Claim (PVC)
• Both PV and PVC are 'resources'
• A PV is a piece of storage, provisioned statically or dynamically, whose lifecycle is independent of any individual pod that uses it
• A PVC is a request for storage by a user
• PVCs consume PVs
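A sketch of a claim (the storage class name is an assumption; it depends on your cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trade-stats-state
spec:
  accessModes:
    - ReadWriteOnce            # mountable read-write by a single node
  storageClassName: standard   # assumed; cluster-specific
  resources:
    requests:
      storage: 10Gi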
Stateful Workloads
StatefulSet
● Relies on a Headless Service to provide network identity
● Ideal for highly available stateful workloads
[Diagram: a Headless Service in front of Pod-0, Pod-1, and Pod-2, each with its own containers and volumes]
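A minimal StatefulSet sketch with its headless service and a volume claim template, so each pod keeps its own volume across restarts (image and sizes are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: trade-stats
spec:
  clusterIP: None              # headless: gives each pod a stable DNS name
  selector:
    app: trade-stats
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trade-stats
spec:
  serviceName: trade-stats
  replicas: 3
  selector:
    matchLabels:
      app: trade-stats
  template:
    metadata:
      labels:
        app: trade-stats
    spec:
      containers:
        - name: app
          image: trade-stats-app:latest   # hypothetical image
          volumeMounts:
            - name: state
              mountPath: /var/lib/kafka-streams
  volumeClaimTemplates:        # one PVC per pod, reattached on restart
    - metadata:
        name: state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi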
Recommendations for Kafka Streams
[Diagram: Stock Stats App, three instances, each embedding the Kafka Streams library]
[Diagram: WordCount App, three instances, each embedding the Kafka Streams library]
StatefulSets are new and complicated. We don't need them.
Recovering state takes time. Stateful is faster.
But I'll want to scale out and back anyway.
I don't really trust my storage admin anyway.
Recommendations
● Keep changelog shards small
● If you trust your storage: use StatefulSets
● Use anti-affinity when possible
● Use "parallel" pod management (see the sketch below)
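A sketch of how the last two recommendations look as an excerpt of the StatefulSet spec above (label and topology key values are assumptions; fields shown earlier are omitted):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trade-stats
spec:
  podManagementPolicy: Parallel   # start (and replace) all pods at once, not one by one
  template:
    spec:
      affinity:
        podAntiAffinity:          # prefer spreading instances across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: trade-stats
                topologyKey: kubernetes.io/hostname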
🛑 Stop! Demo time!
Summary
Kafka Streams has recoverable state, which gives streams apps easy elasticity and high availability.
Kubernetes makes it easy to scale applications. It also has StatefulSets for applications with state.
Summary
Now you know how to deploy Kafka Streams on Kubernetes and take advantage of all the scalability and high-availability capabilities.
But what about Kafka itself?
https://www.confluent.io/resources/kafka-summit-new-york-2019/
Confluent Operator
● Automate provisioning
● Scale your Kafka and CP clusters elastically
● Monitor with Confluent Control Center or Prometheus
● Operate at scale with enterprise support from Confluent
Resources and Next Steps
● https://cnfl.io/helm_video
● https://cnfl.io/cp-helm
● https://cnfl.io/k8s
● #kubernetes channel: https://slackpass.io/confluentcommunity
Thanks! @gamussa | viktor@confluent.io | We are hiring! https://www.confluent.io/careers/
All things change constantly, and we need to get on board with streams! Kafka Streams, Apache Kafka's stream processing library, allows developers to build sophisticated stateful stream processing applications that can be deployed in an environment of your choice. Kafka Streams is not only scalable but fully elastic, allowing for dynamic scale-in and scale-out as the library handles state migration transparently in the background. By running Kafka Streams applications on Kubernetes, you can use Kubernetes' powerful control plane to standardize and simplify application management, from deployment to dynamic scaling. In this talk, Viktor explains the essentials of dynamic scaling and state migration in Kafka Streams. You will see a live demo of a Kafka Streams application running in a Docker container, and of dynamically scaling an application running in Kubernetes.
Here’s what was said about this presentation on social media.
Checking out #KafkaStreams and #k8s with @gAmUssA #RaleighKafka pic.twitter.com/0mne5JP2Q0
— Sandon Jacobs (@SandonLeeJacobs) April 18, 2019
@klarr_io Spending the evening with @confluentinc #raleighkafka pic.twitter.com/YE3r5FTN3R
— Jim Smith (@hymie123) April 18, 2019
It was very nice to see @mesutcelik from @hazelcast at #RaleighKafka meetup. pic.twitter.com/BIGBeUlqIJ
— Viktor Gamov (@gAmUssA) April 19, 2019