Data lake: how Red Hat maintains data quality across multiple Drupal sites

A presentation at DrupalCon Pittsburgh 2023 in June 2023 in Pittsburgh, PA, USA by April Sides

Slide 1

Data lake: how Red Hat maintains data quality across multiple Drupal sites
DrupalCon Pittsburgh 2023
Melissa Bent, Senior Software Engineer
April Sides, Senior Software Engineer

Slide 2

Hi Everyone 👋
Melissa Bent (she/her), Senior Software Engineer
April Sides (she/they), Senior Software Engineer

Slide 3

July 7-9, 2023 DrupalAsheville.com

Slide 4

Problem
Discovery
Solution
Implementations
Future Plans

Slide 5

Problem

Slide 6

Problem
Red Hat is big. Really big.
19,000+ employees
35+ countries

Slide 7

Problem
We use a lot of different technologies across our company to serve our customers.

Slide 8

Problem
Redhat.com is built on Drupal and single-page applications.

Slide 9

Problem
Organizational data
Manual maintenance
Duplication
Different teams

Slide 10

Problem
There has to be a better way to share data across the redhat.com ecosystem.

Slide 11

Discovery

Slide 12

Discovery
Product Experience
access.redhat.com/products

Slide 13

Discovery
Customer Portal
Helps our customers get the most out of their subscriptions
Product information is core to our data organization
Multiple teams and sites combine to make the Customer Portal

Slide 14

Discovery
Product Experience
Manual content management process
Duplication of content across platforms
Risk of outdated, inaccurate product information across redhat.com
Lost time maintaining an imperfect system

Slide 15

Discovery
Requirements
Share data in a scalable/maintainable way
Provide flexible data model/schema
Connect with current tech stacks
Serve as a single source of truth

Slide 16

Solution

Slide 17

Solution
Data Lake

Slide 18

Solution
Data Lake
Share data in a scalable/maintainable way: scalable, low-maintenance
Provide flexible data model/schema: flexible structure, schema-on-read option
Connect with current tech stacks: Drupal module, GraphQL, PHP driver
Serve as a single source of truth: bring your own governance
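The schema-on-read option can be illustrated with a minimal Python sketch. It is not Red Hat's implementation: the documents, the field names (`name`, `slug`, `lifecycle`) and the `PRODUCT_SCHEMA` mapping are hypothetical. The idea is only that documents are stored as they arrive and a schema is applied when a consumer reads them.

```python
from typing import Any, Callable

# Hypothetical raw documents, stored without a fixed schema.
RAW_DOCS: list[dict[str, Any]] = [
    {"name": "Example Product A", "slug": "product-a", "lifecycle": "supported"},
    {"name": "Example Product B", "slug": "product-b", "extra": {"team": "docs"}},
]

# Hypothetical read-time schema: field name -> (converter, default value).
PRODUCT_SCHEMA: dict[str, tuple[Callable[[Any], Any], Any]] = {
    "name": (str, ""),
    "slug": (str, ""),
    "lifecycle": (str, "unknown"),
}


def read_with_schema(doc: dict[str, Any], schema) -> dict[str, Any]:
    """Project a raw document onto a schema at read time.

    Fields missing from the document get the schema default;
    fields not named in the schema are ignored by this reader.
    """
    result = {}
    for field, (convert, default) in schema.items():
        result[field] = convert(doc.get(field, default))
    return result


products = [read_with_schema(doc, PRODUCT_SCHEMA) for doc in RAW_DOCS]
```

A different consumer could read the same stored documents with a different schema, which is what keeps the stored structure flexible.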

Slide 19

Solution
Challenges
Protect against “data rot”
Establish governance plan
Security and privacy compliance

Slide 20

Solution
Advantages
Scalability
Simple architecture
Additional “caching” layer

Slide 21

Solution

Slide 22

Solution
Indexing
Custom Search API backend
Search API’s index management
Flexible schema for each data source
Drupal’s access control, data model, and editorial experience
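As a language-neutral illustration of "flexible schema for each data source", here is a minimal Python sketch. The real backend is a Drupal Search API plugin written in PHP and is not shown in the slides; the source names, field names and the in-memory `store` below are all hypothetical stand-ins.

```python
# Hypothetical per-source schemas: target document field -> source entity field.
SOURCE_SCHEMAS = {
    "product_bundle": {"title": "label", "products": "field_products"},
    "learning_path": {"title": "title", "resources": "field_resources"},
}


def build_document(source, entity):
    """Build the document to index for one entity from one data source."""
    schema = SOURCE_SCHEMAS[source]
    doc = {"_source": source, "_id": f"{source}:{entity['id']}"}
    for target_field, entity_field in schema.items():
        doc[target_field] = entity.get(entity_field)
    return doc


def index_items(store, source, entities):
    """Upsert documents into a dict that stands in for the Data Lake."""
    for entity in entities:
        doc = build_document(source, entity)
        store[doc["_id"]] = doc  # re-indexing the same entity overwrites it
    return store


store = index_items(
    {},
    "product_bundle",
    [{"id": 7, "label": "Example Bundle", "field_products": ["product-a", "product-b"]}],
)
```

Keying each document by source and entity ID means re-indexing an edited entity replaces its earlier copy instead of duplicating it.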

Slide 23

Solution
Retrieval
Single-page applications: GraphQL
Drupal: PHP MongoDB driver
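For the GraphQL retrieval path, a client request might look roughly like the following Python sketch. The endpoint URL, the `products` query and its fields are assumptions made for illustration; the slides do not show the actual GraphQL schema. The request is built but not sent, so the example runs without network access.

```python
import json
import urllib.request

# Hypothetical endpoint and query; the real schema is not given in the slides.
GRAPHQL_ENDPOINT = "https://datalake.example.com/graphql"
PRODUCTS_QUERY = """
query Products($language: String!) {
  products(language: $language) {
    name
    url
  }
}
"""


def build_graphql_request(query, variables, endpoint=GRAPHQL_ENDPOINT):
    """Build (but do not send) a GraphQL POST request with a JSON body."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(PRODUCTS_QUERY, {"language": "en"})
# Against a real endpoint, the request would be sent with:
#   urllib.request.urlopen(request)
```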

Slide 24

Implementations

Slide 25

Implementations
Product Experience
access.redhat.com/products

Slide 26

Index page
▸ All data pulled via GraphQL from the Data Lake
▸ Statically generated via a GitLab build pipeline
▸ No more manual maintenance by a developer in a Drupal node
▸ Prevents accidental removal of the page in the Drupal UI

Slide 27

Index page
▸ All data pulled via GraphQL from the Data Lake
▸ Statically generated via a GitLab build pipeline
▸ No more manual maintenance by a developer in a Drupal node
▸ Prevents accidental removal of the page in the Drupal UI

Slide 28

Index page
▸ Translations are managed through Drupal and indexed into the Data Lake

Slide 29

Product page
Launched April 3, 2023! 🎉
▸ Product names ingested from our Product Life Cycle API
▸ Canonical links are managed by Drupal
▸ Displays data from other systems (non-Drupal) such as:
  ・ Documentation
  ・ Security advisories

Slide 30

Product page
▸ Product bundles managed in Drupal
▸ Indexed into the Data Lake using predefined schema
▸ All data (including resource links) pulled from child product data in the Data Lake
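One way to picture a bundle page pulling its data from child products is the Python sketch below. The document shapes (`children`, `resources`, `url`) and the sample data are hypothetical; the predefined schema used at Red Hat is not shown in the slides.

```python
# Hypothetical product documents as they might be read from the Data Lake.
PRODUCTS = {
    "product-a": {
        "name": "Example Product A",
        "resources": [
            {"title": "Documentation", "url": "https://example.com/a/docs"},
            {"title": "Shared guide", "url": "https://example.com/shared/guide"},
        ],
    },
    "product-b": {
        "name": "Example Product B",
        "resources": [
            {"title": "Shared guide", "url": "https://example.com/shared/guide"},
        ],
    },
}

# Hypothetical bundle document: it stores only references to its children.
BUNDLE = {"name": "Example Bundle", "children": ["product-a", "product-b"]}


def resolve_bundle(bundle, products):
    """Collect child product names and de-duplicated resource links."""
    children = [products[slug] for slug in bundle["children"] if slug in products]
    links, seen = [], set()
    for child in children:
        for resource in child.get("resources", []):
            if resource["url"] not in seen:
                seen.add(resource["url"])
                links.append(resource)
    return {
        "name": bundle["name"],
        "products": [child["name"] for child in children],
        "resources": links,
    }


resolved = resolve_bundle(BUNDLE, PRODUCTS)
```

Because the bundle holds references only, a change to a child product's links shows up on the bundle page without editing the bundle itself.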

Slide 31

Drupal backend
▸ Standard UI using Claro
▸ Custom Product Bundles entity

Slide 32

Drupal backend
▸ Standard UI using Claro
▸ Custom Product Bundles entity

Slide 33

Drupal backend
▸ Standard UI using Claro
▸ Custom Product Bundles entity

Slide 34

Drupal backend
▸ Order of product bundles controlled by Drupal

Slide 35

After all that. Now you…
Manage data in Drupal
Index the data to the Data Lake
Query the Data Lake with GraphQL
Build the page via GitLab pipeline
Set your refresh every 30 minutes
Enjoy your newly automated life
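The build and refresh steps above can be sketched in Python as follows. This is an assumption-laden illustration, not the pipeline Red Hat runs: the `render_index` and `needs_rebuild` helpers and the product fields are hypothetical, and only the 30-minute interval comes from the slide.

```python
import html
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(minutes=30)  # refresh interval stated on the slide


def needs_rebuild(last_built: datetime, now: datetime) -> bool:
    """Return True once the last build is at least one interval old."""
    return now - last_built >= REFRESH_INTERVAL


def render_index(products) -> str:
    """Render a static HTML list from already-queried product data."""
    items = []
    for product in products:
        url = html.escape(product["url"], quote=True)
        name = html.escape(product["name"])
        items.append(f'  <li><a href="{url}">{name}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"


now = datetime(2023, 6, 6, 12, 0, tzinfo=timezone.utc)
page = render_index([{"name": "Example A & B", "url": "/products/example"}])
```

In a CI setup the interval check is usually unnecessary, since a scheduled pipeline can simply run the build on a timer; the helper only makes the refresh rule explicit.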

Slide 36


Slide 37

Implementations
developers.redhat.com

Slide 38

Implementations
A Learning Path is a curated collection of content, directing users to learn more about a particular topic or product.

Slide 39

Implementations
developers.redhat.com
Article
Cheat Sheet

Slide 40

Implementations
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 41

Implementations
Learning Path content type

Slide 42

Implementations
Resource content type

Slide 43

Implementations
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 44

Implementations
developers.redhat.com
Article in Learning Path
Cheat Sheet in Learning Path

Slide 45

Implementations
Resource displays
hook_preprocess_node()

Slide 46

Implementations
Shared module
Learning Paths shared module
  • Data Lake Learning Paths schema
  • Data Lake query service
  • Reusable code:
    - Services
    - Controller
    - EventSubscriber
    - Blocks
    - Constraints/Validators

Slide 47

Implementations

Slide 48

Future Plans

Slide 49

Future Plans
Customer Portal
Standardizing Product taxonomy across Customer Portal microsites and beyond
Integrating with other systems at Red Hat

Slide 50

Future Plans
Content
Syndicated Patterns
“Shared patterns” with embedded content for banners, footers, marketing content, etc.

Slide 51

Future Plans
Learning Paths
Learning Path discovery

Slide 52

Wrap up

Slide 53

Wrap up
Code Repository
MongoDB Data Lake
https://red.ht/dc2023
URL is shortened to make it easier to find the sandbox project.
Full URL: https://www.drupal.org/sandbox/merauluka/3363610

Slide 54

Resources
Drupal Contrib Modules
Allow Only One: https://www.drupal.org/project/allow_only_one
Automatic Entity Label: https://www.drupal.org/project/auto_entitylabel
Entity Browser: https://www.drupal.org/project/entity_browser

Slide 55

Resources
Drupal Contrib Modules
External Data Source: https://www.drupal.org/project/external_data_source
MongoDB: https://www.drupal.org/project/mongodb
Search API: https://www.drupal.org/project/search_api

Slide 56

Thank you
Questions?
Melissa Bent: linkedin.com/in/melissabent, twitter.com/merauluka
April Sides: linkedin.com/in/aprilsides, twitter.com/weekbeforenext
Red Hat: linkedin.com/company/red-hat, youtube.com/user/RedHatVideos, facebook.com/redhatinc, twitter.com/RedHat