Building an E2E Analytics Pipeline with PyFlink Marta Paes (@morsapaes) Developer Advocate © 2020 Ververica

This Talk Motivation & Evolution of PyFlink 2 @morsapaes Demo Looking ahead

Why Flink + Python? Java Scala 3 @morsapaes

Why Flink + Python? Java Python Scala Expose the functionality of Flink beyond the JVM 4 @morsapaes

Why Flink + Python? Mature and intuitive analytics stack… 5 @morsapaes

Why Flink + Python? Mature and intuitive analytics stack… 2007 1995 2001 2008 …that doesn’t really scale. 6 @morsapaes

Why Flink + Python? 2007 1995 2001 Distribute and scale the functionality of Python through Flink 2008 7 @morsapaes

PyFlink Over Time PyFlink (Table API) Beta release Flink 1.9 Aug’19 8 @morsapaes

PyFlink Over Time PyFlink (Table API) Beta release Flink 1.10 Feb’20 Flink 1.9 Aug’19 Scalar UDFs apache-flink available on PyPi 9 @morsapaes

PyFlink Over Time Python UDFs in SQL DDL & SQL Client Tabular UDFs UDF metrics Pandas UDFs & Cython UDF Optimization PyFlink (Table API) Beta release Table conversion fromPandas/toPandas Flink 1.10 Flink 1.12 Feb’20 Nov/Dec’20 Flink 1.9 Flink 1.11 Aug’19 Jul’20 Scalar UDFs apache-flink available on PyPi 10 @morsapaes

Can we use PyFlink to identify the most frequent topics in the User Mailing List? 11 @morsapaes

The Demo Environment CDC Source Connector PyFlink * Postgres Submit job UDF Visualization JobManager Assign & monitor tasks

  • (Awfully) Trained LDA Model TaskManager JDBC Connector Exec. Query Tasks Postgres 12 @morsapaes

DEMO Step 1. Create the source table. 13 @morsapaes

DEMO Step 2. Write and register a UDF to clean and classify the messages. 14 @morsapaes

DEMO Step 3. Build your query and create a sink table to where you’ll be inserting the results! 15 @morsapaes

DEMO Step 4. Submit the job (and dependencies) to the cluster docker-compose exec jobmanager ./bin/flink run -py /opt/pyflink-ff2020/pipeline.py \ —pyArchives /opt/pyflink-ff2020/lda_model.zip#model \ —pyFiles /opt/pyflink-ff2020/tokenizer.py -d 16 @morsapaes

DEMO Step 5. Visualize in Superset! (9, ‘0.521*”flink” + 0.065*”issue” + 0.049*”source” + 0.041*”yarn” + 0.033*”build” + 0.028*”latency” + 0.027*”access” + 0.023*”progress” + 0.016*”run” + 0.015*”org”’) 17 @morsapaes

And soon! Python UDFs in SQL DDL & SQL Client Tabular UDFs UDF metrics Pandas UDFs & Cython UDF Optimization PyFlink (Table API) beta release Table conversion fromPandas/toPandas Flink 1.10 Flink 1.12 Feb’20 Nov/Dec’20 Flink 1.9 Flink 1.11 Aug’19 Jul’20 Scalar UDFs Python DataStream API (Stateless) [FLIP-130] apache-flink available on PyPi PyFlink on (native) Kubernetes [FLINK-17480] UDAFs & Pandas UDAFs [FLIP-139, FLIP-137] Python Table API DSL [FLINK-19091] 18 @morsapaes

PyFlink in a Nutshell ● Unified APIs for batch and streaming ● Native SQL integration + Pandas integration ● Support for a large set of operations (incl. complex joins, windowing, pattern matching/CEP) Execution Streaming Native Connectors Formats Batch FileSystems Apache Kafka ML Library (WIP) FLIP-39 Notebooks UDF Support Kinesis Python UD(T)F Pandas UDF +UDAF (WIP) +UDAF (WIP) HBase JDBC Elasticsearch Apache Zeppelin

  • 19 @morsapaes Try the PyFlink playground: https://github.com/apache/flink-playgrounds

Thank you, Flink Forward! Follow me on Twitter: @morsapaes Learn more about Flink: https://flink.apache.org/ © 2020 Ververica