Data lake: how Red Hat maintains data quality across multiple Drupal sites

A presentation at Florida Drupal Camp in February 2023 in Orlando, FL, USA by April Sides

Slide 1

Data lake: how Red Hat maintains data quality across multiple Drupal sites
Florida Drupalcamp 2023
Melissa Bent, Senior Software Engineer
April Sides, Senior Software Engineer

Slide 2

● Problem
● Discovery
● Solution
● Integration

Slide 3

Problem

Slide 4

Problem
Redhat.com is built on Drupal and single-page applications.

Slide 5

Problem
Organization-level data should be consistent and accurate across redhat.com.

Slide 6

Problem
Organization-level data is duplicated and managed by different teams.

Slide 7

Problem
Managing data quality across redhat.com is a manual process.

Slide 8

Problem
There has to be a better way to share data across the redhat.com ecosystem.

Slide 9

Discovery

Slide 10

Discovery
Requirements
▸ Share data in a scalable and maintainable way
▸ Connect with different tech stacks
▸ Serve as a single source of truth
▸ Provide flexible data model/schema

Slide 11

Discovery
Data Repository Types
▸ Relational database
▸ Data warehouse
  ・ Data mart
  ・ Operational data store
▸ Data lake

Slide 12

Discovery
Relational database
- Structured transactional data
- Main features:
  - Data normalization
  - Compliancy
Example: Drupal database

Slide 13

Discovery
Data warehouse
- Business intelligence focus
- Structured data
- Schema-on-write

Slide 14

Discovery
Data Lake
- Flexible structure
- Scalable
- Schema-on-read
- Low maintenance
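
The difference between schema-on-write (data warehouse) and schema-on-read (data lake) can be illustrated with a minimal Python sketch. This is not Red Hat's implementation: the stores are plain in-memory lists, and `PRODUCT_SCHEMA`, `write_with_schema`, `write_raw`, and `read_with_schema` are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from typing import Any

# Hypothetical fixed schema, used only by the schema-on-write path.
PRODUCT_SCHEMA = {"id": str, "name": str}


def write_with_schema(store: list[dict[str, Any]], record: dict[str, Any]) -> None:
    """Schema-on-write: validate and trim the record before it is stored."""
    for field, expected in PRODUCT_SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    store.append({field: record[field] for field in PRODUCT_SCHEMA})


def write_raw(store: list[dict[str, Any]], record: dict[str, Any]) -> None:
    """Schema-on-read: store the record as-is, whatever its shape."""
    store.append(dict(record))


def read_with_schema(
    store: list[dict[str, Any]], fields: dict[str, Any]
) -> list[dict[str, Any]]:
    """Apply the reader's own view at query time, filling in defaults."""
    return [
        {name: doc.get(name, default) for name, default in fields.items()}
        for doc in store
    ]
```

With schema-on-read, documents of different shapes can sit in the same store, and each consumer decides which fields it cares about when it queries.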

Slide 15

Discovery
Data Lake Challenges
▸ Data rot
▸ Data governance
▸ Data compliance
▸ Data security
▸ Data availability

Slide 16

Discovery
Data Lake Advantages
▸ Multiple data sources, consistent access
▸ Protection against downtime
▸ Provides an additional query-level caching layer for production

Slide 17

Solution

Slide 18

Solution
Tech Stack
▸ Database Backend: MongoDB
▸ Indexing: Search API + Custom module
▸ Retrieval
  ・ GraphQL
  ・ Direct query

Slide 19

Solution
Indexing
▸ Custom Search API Backend
▸ Search API’s index management
▸ Flexible schema for each data source
▸ Drupal provides access control, the data model, and the editorial experience
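
The indexing idea above, where each data source pushes documents with its own shape into a shared store, might be sketched as follows. This is a Python stand-in under stated assumptions, not the custom Search API backend from the talk: `InMemoryIndex` and its methods are hypothetical, and a dictionary takes the place of the MongoDB collection.

```python
from __future__ import annotations

from typing import Any


class InMemoryIndex:
    """Hypothetical stand-in for a document collection in the data lake."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def index_item(self, source: str, item_id: str, fields: dict[str, Any]) -> None:
        """Upsert: re-indexing the same item replaces the earlier document."""
        key = f"{source}/{item_id}"
        self._docs[key] = {"_id": key, "source": source, **fields}

    def delete_item(self, source: str, item_id: str) -> None:
        """Remove an item; deleting a missing item is a no-op."""
        self._docs.pop(f"{source}/{item_id}", None)

    def find(self, source: str) -> list[dict[str, Any]]:
        """Return every document that came from one data source."""
        return [doc for doc in self._docs.values() if doc["source"] == source]
```

Keying each document by source and item ID means a second indexing run updates the record in place instead of duplicating it, and each source is free to carry different fields.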

Slide 20

Solution
Retrieval
▸ Single-page applications: GraphQL
▸ Drupal: PHP MongoDB Driver
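
For the GraphQL retrieval path, a request is commonly a JSON body with `query` and `variables` keys sent by HTTP POST. The sketch below builds such a request in Python without sending it. The endpoint URL, the `products` field, and its `id` and `name` subfields are assumptions made for illustration; the talk does not describe the actual schema.

```python
import json
import urllib.request

# Hypothetical endpoint; the real data lake URL is not given in the slides.
GRAPHQL_ENDPOINT = "https://example.com/graphql"

# Hypothetical query; field names are placeholders, not the real schema.
PRODUCTS_QUERY = """
query Products($limit: Int!) {
  products(limit: $limit) {
    id
    name
  }
}
"""


def build_graphql_request(query: str, variables: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; it is left out here because the endpoint is a placeholder.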

Slide 21

Integration

Slide 22

Integration: Products

Slide 23

Integration: Products
Product Experience
access.redhat.com/products

Slide 24

Integration: Products
Customer Portal Products
▸ Customer Portal helps our customers get the most out of their subscriptions
▸ Product information is core to our data organization
▸ Multiple teams and sites combine to make what is the Customer Portal

Slide 25

Integration

Slide 26

Integration

Slide 27

Integration

Slide 28

Integration: Products
Strategy
▸ Product data managed in Drupal
▸ Indexed to the Data Lake
▸ Queried via GraphQL
▸ Page built via GitLab pipeline (statically generated)
▸ Refreshed every 30 minutes
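
The last two steps of the strategy above, static generation and a 30-minute refresh, can be sketched in Python. This is only an illustration under assumptions: the record fields `name` and `url`, the functions `render_products_page` and `needs_rebuild`, and the HTML shape are all hypothetical, and in the talk the schedule is handled by the GitLab pipeline itself.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from html import escape

# Refresh interval taken from the slide; everything else is illustrative.
REFRESH_INTERVAL = timedelta(minutes=30)


def render_products_page(products: list[dict[str, str]]) -> str:
    """Render queried product records into a static HTML fragment."""
    items = "\n".join(
        f'  <li><a href="{escape(p["url"])}">{escape(p["name"])}</a></li>'
        for p in products
    )
    return f"<ul>\n{items}\n</ul>\n"


def needs_rebuild(last_built: datetime, now: datetime) -> bool:
    """Return True once a full refresh interval has elapsed since the last build."""
    return now - last_built >= REFRESH_INTERVAL
```

Because the page is generated ahead of time, visitors are served a static file and the data lake is queried only when the page is rebuilt.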

Slide 29

Integration: Learning Paths

Slide 30

Integration: Learning Paths
Learning Path
A curated collection of content, directing users to learn more about a particular topic or product.

Slide 31

Integration: Learning Paths
developers.redhat.com

Slide 32

Integration: Learning Paths
developers.redhat.com
Article
Cheat Sheet

Slide 33

Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 34

Integration: Learning Paths
Learning Path content type

Slide 35

Integration: Learning Paths
Resource content type

Slide 36

Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 37

Integration: Learning Paths
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 38

Integration: Learning Paths
Resource displays
hook_preprocess_node()

Slide 39

Integration: Learning Paths
Shared module
Learning Paths shared module
- Data Lake Learning Paths schema
- Data Lake query service
- Reusable code:
  - Services
  - Controller
  - EventSubscriber
  - Blocks
  - Constraints/Validators

Slide 40

Integration: Learning Paths

Slide 41

Integration: Future Plans

Slide 42

Integration
Content Syndicated Patterns
▸ “Shared patterns” with embedded content for banners, footers, marketing content, etc.

Slide 43

Integration
Customer Portal
▸ Autobuilding product pages from the Data Lake (these are currently Drupal nodes)
▸ Standardizing Product taxonomy across Customer Portal microsites
▸ Integrating with external systems (product life cycle, case management, developers.redhat.com, etc.)

Slide 44

Thank you
Questions?
Melissa Bent: linkedin.com/in/melissabent, twitter.com/merauluka
April Sides: linkedin.com/in/aprilsides, twitter.com/weekbeforenext
Red Hat: linkedin.com/company/red-hat, youtube.com/user/RedHatVideos, facebook.com/redhatinc, twitter.com/RedHat