Building an End-to-End Analytics Pipeline with PyFlink

A presentation at Flink Forward Global 2020 in October 2020 by Marta Paes

Slide 1

Building an E2E Analytics Pipeline with PyFlink
Marta Paes (@morsapaes), Developer Advocate
© 2020 Ververica

Slide 2

This Talk
  • Motivation & Evolution of PyFlink
  • Demo
  • Looking ahead

Slide 3

Why Flink + Python?
Java, Scala

Slide 4

Why Flink + Python?
Java, Python, Scala
Expose the functionality of Flink beyond the JVM

Slide 5

Why Flink + Python?
Mature and intuitive analytics stack…

Slide 6

Why Flink + Python?
Mature and intuitive analytics stack… (timeline: 1995, 2001, 2007, 2008)
…that doesn’t really scale.

Slide 7

Why Flink + Python?
(timeline: 1995, 2001, 2007, 2008)
Distribute and scale the functionality of Python through Flink

Slide 8

PyFlink Over Time
Flink 1.9 (Aug’19): PyFlink (Table API) beta release

Slide 9

PyFlink Over Time
Flink 1.9 (Aug’19): PyFlink (Table API) beta release
Flink 1.10 (Feb’20): Scalar UDFs; apache-flink available on PyPI

Slide 10

PyFlink Over Time
Flink 1.9 (Aug’19): PyFlink (Table API) beta release
Flink 1.10 (Feb’20): Scalar UDFs; apache-flink available on PyPI
Flink 1.11 (Jul’20): Python UDFs in SQL DDL & SQL Client; Tabular UDFs; UDF metrics; Pandas UDFs & Cython UDF optimization; Table conversion fromPandas/toPandas
Flink 1.12 (Nov/Dec’20)

Slide 11

Can we use PyFlink to identify the most frequent topics in the User Mailing List?

Slide 12

The Demo Environment (architecture diagram)
  • Data flow: Postgres, CDC Source Connector, PyFlink (UDF with an (awfully) trained LDA model), JDBC Connector, Postgres, Visualization
  • Cluster: the job is submitted to the JobManager, which assigns and monitors tasks; the TaskManager executes the query tasks

Slide 13

DEMO Step 1. Create the source table.
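
The slide does not show the table definition itself. As a minimal sketch of what this step could look like, the snippet below declares a change-data-capture source over Postgres and registers it on a PyFlink TableEnvironment. The table name, columns, and connection values are hypothetical placeholders, and the 'postgres-cdc' connector is assumed to come from the separate flink-cdc-connectors project rather than from Flink itself.

```python
# Hypothetical table name, columns, and connection values (not from the talk).
# Assumes the 'postgres-cdc' connector jar from flink-cdc-connectors is available.
SOURCE_DDL = """
CREATE TABLE mailing_list_messages (
    message_id STRING,
    subject    STRING,
    body       STRING
) WITH (
    'connector'     = 'postgres-cdc',
    'hostname'      = 'postgres',
    'port'          = '5432',
    'username'      = 'postgres',
    'password'      = 'postgres',
    'database-name' = 'postgres',
    'schema-name'   = 'public',
    'table-name'    = 'mailing_list_messages'
)
"""


def create_source_table(t_env):
    """Register the CDC-backed source table on a PyFlink TableEnvironment."""
    return t_env.execute_sql(SOURCE_DDL)
```

Here t_env is expected to be a streaming TableEnvironment; execute_sql is available from Flink 1.11 onwards.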

Slide 14

DEMO Step 2. Write and register a UDF to clean and classify the messages.
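
The UDF from the demo is not reproduced on the slide. The sketch below covers only the cleaning half, using a plain Python function and a hypothetical stopword list, and then wraps it as a scalar Python UDF. The classification half (scoring against the LDA model) is left out because the model code is not part of the slides. The function and UDF names are assumptions, and create_temporary_function assumes Flink 1.12 or later (Flink 1.11 used register_function for the same purpose).

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"re", "fwd", "the", "a", "an", "on", "in", "to", "of", "and", "is"})
_TOKEN = re.compile(r"[a-z]+")


def clean_message(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords and very short tokens."""
    if text is None:
        return ""
    tokens = _TOKEN.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS and len(t) > 2)


def register_clean_udf(t_env, name="clean_message"):
    """Wrap clean_message as a scalar Python UDF and register it under `name`."""
    # Imported lazily so the cleaning logic can be unit-tested without PyFlink installed.
    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    clean_udf = udf(
        clean_message,
        input_types=[DataTypes.STRING()],
        result_type=DataTypes.STRING(),
    )
    t_env.create_temporary_function(name, clean_udf)
    return clean_udf
```

Keeping the cleaning logic in an ordinary function means it can be tested locally before it is shipped to the cluster as a UDF.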

Slide 15

DEMO Step 3. Build your query and create a sink table where you’ll insert the results!
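
One possible shape for this step, assuming the results go back to Postgres through Flink's JDBC connector: a sink table keyed by topic, plus an INSERT that counts messages per topic. Every name here (topic_counts, mailing_list_messages, classify_topic) is hypothetical, and classify_topic stands in for whatever UDF was registered in the previous step.

```python
# Hypothetical sink table; connection values are placeholders.
SINK_DDL = """
CREATE TABLE topic_counts (
    topic         STRING,
    message_count BIGINT,
    PRIMARY KEY (topic) NOT ENFORCED
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:postgresql://postgres:5432/postgres',
    'table-name' = 'topic_counts',
    'username'   = 'postgres',
    'password'   = 'postgres'
)
"""

# Hypothetical query: classify each message, then count messages per topic.
INSERT_SQL = """
INSERT INTO topic_counts
SELECT topic, COUNT(*) AS message_count
FROM (
    SELECT classify_topic(subject) AS topic
    FROM mailing_list_messages
) AS classified
GROUP BY topic
"""


def build_pipeline(t_env):
    """Create the sink table, then submit the continuous INSERT query."""
    t_env.execute_sql(SINK_DDL)
    return t_env.execute_sql(INSERT_SQL)
```

Declaring a primary key on the sink lets the JDBC connector write in upsert mode, so each topic's count is updated in place as new messages arrive.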

Slide 16

DEMO Step 4. Submit the job (and dependencies) to the cluster.

    docker-compose exec jobmanager ./bin/flink run -py /opt/pyflink-ff2020/pipeline.py \
        --pyArchives /opt/pyflink-ff2020/lda_model.zip#model \
        --pyFiles /opt/pyflink-ff2020/tokenizer.py -d
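
For scripted submissions, the same command can be assembled as an argument list. This is only a sketch built from the paths on the slide; the helper name build_submit_command is hypothetical. The comments note what each Flink CLI option does: -py names the job's entry script, --pyArchives ships an archive that is extracted on the workers under the directory name given after "#", --pyFiles adds extra Python files to the workers' Python path, and -d runs the job in detached mode.

```python
import shlex


def build_submit_command(job_dir="/opt/pyflink-ff2020"):
    """Assemble the submission command from the slide as an argv list."""
    return [
        "docker-compose", "exec", "jobmanager",
        "./bin/flink", "run",
        "-py", f"{job_dir}/pipeline.py",                    # entry point of the PyFlink job
        "--pyArchives", f"{job_dir}/lda_model.zip#model",   # extracted into a directory named "model"
        "--pyFiles", f"{job_dir}/tokenizer.py",             # made importable inside the UDF
        "-d",                                               # detached: return once the job is submitted
    ]


command = build_submit_command()
# Print a shell-safe rendering of the command (shlex.join needs Python 3.8+).
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would execute it without going through a shell.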

Slide 17

DEMO Step 5. Visualize in Superset!

    (9, '0.521*"flink" + 0.065*"issue" + 0.049*"source" + 0.041*"yarn" + 0.033*"build" + 0.028*"latency" + 0.027*"access" + 0.023*"progress" + 0.016*"run" + 0.015*"org"')
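
The tuple on the slide pairs a topic id with a string of weighted terms, each written as weight*"word". If the terms need to be charted individually, a small parser along these lines could split them apart. This is an illustrative helper, not part of the demo, and the name parse_topic is hypothetical.

```python
import re

# Matches one weighted term such as 0.521*"flink".
_TERM = re.compile(r'([0-9]*\.?[0-9]+)\*"([^"]+)"')


def parse_topic(topic):
    """Split a (topic_id, 'w1*"t1" + w2*"t2" + ...') tuple into (topic_id, [(term, weight), ...])."""
    topic_id, terms = topic
    return topic_id, [(word, float(weight)) for weight, word in _TERM.findall(terms)]


# Shortened version of the tuple shown on the slide.
example = (9, '0.521*"flink" + 0.065*"issue" + 0.049*"source"')
topic_id, weighted_terms = parse_topic(example)
```

The resulting list of (term, weight) pairs maps directly onto rows of a table that a dashboard tool can read.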

Slide 18

And soon!
Flink 1.9 (Aug’19): PyFlink (Table API) beta release
Flink 1.10 (Feb’20): Scalar UDFs; apache-flink available on PyPI
Flink 1.11 (Jul’20): Python UDFs in SQL DDL & SQL Client; Tabular UDFs; UDF metrics; Pandas UDFs & Cython UDF optimization; Table conversion fromPandas/toPandas
Flink 1.12 (Nov/Dec’20): Python DataStream API (Stateless) [FLIP-130]; PyFlink on (native) Kubernetes [FLINK-17480]; UDAFs & Pandas UDAFs [FLIP-139, FLIP-137]; Python Table API DSL [FLINK-19091]

Slide 19

PyFlink in a Nutshell
● Unified APIs for batch and streaming
● Native SQL integration + Pandas integration
● Support for a large set of operations (incl. complex joins, windowing, pattern matching/CEP)

Execution: Streaming, Batch
Native Connectors: FileSystems, Apache Kafka, Kinesis, HBase, JDBC, Elasticsearch
Formats
UDF Support: Python UD(T)F + UDAF (WIP), Pandas UDF + UDAF (WIP)
ML Library (WIP): FLIP-39
Notebooks: Apache Zeppelin

Try the PyFlink playground: https://github.com/apache/flink-playgrounds

Slide 20

Thank you, Flink Forward!
Follow me on Twitter: @morsapaes
Learn more about Flink: https://flink.apache.org/