Introduction into full-text search with distributed search engines

A presentation at unKonf in October 2019 in Mannheim, Germany by Alexander Reelsen

Slide 1

Slide 1

Full-text search with distributed search engines Alexander Reelsen @spinscale alex@elastic.co

Slide 2

Slide 2

Agenda • Overview • Indexing: Analysis, Tokenization, Filtering, on disk data structures • Searching: Scoring, Algorithms & Optimization • Aggregations • Distributed systems and search •Q&A

Slide 3

Slide 3

Overview Full text search introduction

Slide 4

Slide 4

SELECT * FROM products WHERE name LIKE = ‘%topf%’

Slide 5

Slide 5

grep “topf” my_dataset.txt

Slide 6

Slide 6

Problem • Scales linearly with the data set size • Relevancy • Spell correction • Synonyms • Phrases

Slide 7

Slide 7

Inverted Index

Slide 8

Slide 8

The quick brown fox jumped over the lazy dog

Slide 9

Slide 9

Tokenize The quick brown fox jumped over the lazy dog 1 1 1 1 1 1 1 1 1

Slide 10

Slide 10

Sort The brown dog fox jumped lazy over quick the 1 1 1 1 1 1 1 1 1

Slide 11

Slide 11

Quick brown foxes leap over lazy dogs in summer

Slide 12

Slide 12

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Slide 13

Slide 13

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy dog

Slide 14

Slide 14

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy AND dog [1,2] AND [1] = [1]

Slide 15

Slide 15

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy OR dog [1,2] OR [1] = [1,2]

Slide 16

Slide 16

Technologies used today • Apache Lucene (search library) • Elasticsearch (distributed search engine built on top of Apache Lucene)

Slide 17

Slide 17

Indexing Analysis, Tokenization, Filtering Data structures

Slide 18

Slide 18

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quick 1 hit

Slide 19

Slide 19

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quicK 0 hits

Slide 20

Slide 20

Analysis: Tokenizer & Token Filters

Slide 21

Slide 21

Tokenization

Slide 22

Slide 22

Tokenization quick brown fox

Slide 23

Slide 23

Tokenization quick_brown_fox

Slide 24

Slide 24

Tokenization quick_brown_fox the lazy, white dog.

Slide 25

Slide 25

Tokenization quick_brown_fox the_lazy_white_dog

Slide 26

Slide 26

Tokenization quick_brown_fox the_lazy_white_dog https://unicode.org/reports/tr29/

Slide 27

Slide 27

Tokenization quick_brown_fox the_lazy_white_dog https://www.jade-hs.de

Slide 28

Slide 28

Tokenization quick_brown_fox the_lazy_white_dog https_www.jade_hs.de

Slide 29

Slide 29

Token Filter

Slide 30

Slide 30

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Slide 31

Slide 31

Token filter The Quick brown fox

Slide 32

Slide 32

Token filter Lowercase The Quick brown fox the quick brown fox

Slide 33

Slide 33

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox

Slide 34

Slide 34

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox

Slide 35

Slide 35

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Tokens can be changed, added, removed

Slide 36

Slide 36

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Queries need to be processed as well!

Slide 37

Slide 37

More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass

Slide 38

Slide 38

(On-Disk) Data structures

Slide 39

Slide 39

What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data

Slide 40

Slide 40

Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!

Slide 41

Slide 41

Read-only data structures • Pro: Write-once, sequentially • Pro: Lock-free reading • Pro: File system cache • Contra: in-place updates & deletes • Contra: Housekeeping • Contra: Transactions

Slide 42

Slide 42

Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed! 1|2|3|4|5 3 6|7|8

Slide 43

Slide 43

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8

Slide 44

Slide 44

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8 1|2|4|5|6|7|8

Slide 45

Slide 45

Searching Precision vs. recall Scoring Algorithms and optimizations

Slide 46

Slide 46

Relevancy

Slide 47

Slide 47

Relevancy • Textbook answer: How well matches a document a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website

Slide 48

Slide 48

Scoring

Slide 49

Slide 49

Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?

Slide 50

Slide 50

Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are lesser common terms more relevant? • news paper: “dieselgate news”

Slide 51

Slide 51

Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs

Slide 52

Slide 52

Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°

Slide 53

Slide 53

Scoring: TF-IDF in Lucene score(q,d) = ∑ ( tf(t in d) · 2 idf(t) · t.getBoost() · norm(t,d) )

Slide 54

Slide 54

Slide 55

Slide 55

BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)

Slide 56

Slide 56

BM25 https://www.elastic.co/guide/en/elasticsearch/guide/2.x/pluggable-similarites.html

Slide 57

Slide 57

Precision vs. recall

Slide 58

Slide 58

Precision and Recall

Slide 59

Slide 59

Precision and Recall relevant documents irerelevant documents

Slide 60

Slide 60

Precision and Recall relevant documents irerelevant documents

Slide 61

Slide 61

True positives relevant documents irerelevant documents

Slide 62

Slide 62

True negatives relevant documents irerelevant documents

Slide 63

Slide 63

False positives relevant documents irerelevant documents

Slide 64

Slide 64

False negatives relevant documents irerelevant documents

Slide 65

Slide 65

Precision and recall • Precision: How many selected documents are relevant? • Recall: How many relevant documents are selected

Slide 66

Slide 66

Under the hood

Slide 67

Slide 67

Optimizations everywhere • leap frogging, skip lists • top-k • two phase iterations • integer compression

Slide 68

Slide 68

Query two phase iteration

Slide 69

Slide 69

Two phase iteration: Phrase query • Phrase query: “quick fox” • Approximation phase: document contains terms quick and fox • Verification phase: read positions of terms

Slide 70

Slide 70

Two phase iteration: Geo distance query • Geo distance query: Distance from reference point • Approximation phase: bbox around point • Verification phase: exact distance calculation

Slide 71

Slide 71

Two phase iteration: Geo distance query GET /my_locations/_search { “query”: { “bool” : { “filter” : { “geo_distance” : { “distance” : “200km”, “pin.location” : { “lat” : 40, “lon” : -70 } } } } } }

Slide 72

Slide 72

Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits

Slide 73

Slide 73

Skip lists & leap frogging

Slide 74

Slide 74

Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward https://en.wikipedia.org/wiki/Skip_list

Slide 75

Slide 75

Leap frogging elasticsearch AND kibana AND logstash

Slide 76

Slide 76

Leap frogging elasticsearch AND kibana AND logstash 266 102 568 98 302 266 60 102 199 18 59 150 5 5 102 1 3 5

Slide 77

Slide 77

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 78

Slide 78

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 79

Slide 79

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 80

Slide 80

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 81

Slide 81

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 82

Slide 82

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 83

Slide 83

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 84

Slide 84

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 85

Slide 85

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 86

Slide 86

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 87

Slide 87

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 88

Slide 88

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 89

Slide 89

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 90

Slide 90

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 91

Slide 91

Leap frogging logstash AND kibana AND elasticsearch 266 Done! 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 92

Slide 92

Top-k retrieval

Slide 93

Slide 93

Top-k retrieval • elasticsearch OR kibana • top 10 results wanted • maximum score for kibana is 3.0 • maximum score for elasticsearch is 5.0 • collecting documents: when 10th hit has score > 3, then only documents with elasticsearch need to be collected • total hit count is not accurate

Slide 94

Slide 94

Top-k retrieval

Slide 95

Slide 95

Index sorting

Slide 96

Slide 96

Order index by field values • each segment is sorted before write • criteria can be chosen by the user 5|2|3|1|4 retrieve sort top 2 5|2|3|1|4 5|4|3|2|1 5|4

Slide 97

Slide 97

Order index by field values • each segment is sorted before write • criteria can be chosen by the user early termination 5|4|3|2|1 5|4

Slide 98

Slide 98

Aggregations Reducing data

Slide 99

Slide 99

Aggregations

Slide 100

Slide 100

aggregations documents

Slide 101

Slide 101

aggregations documents

Slide 102

Slide 102

Bucketing documents Pumps Sneakers Oxfords Sneakers Boots

Slide 103

Slide 103

Bucketing documents Sneakers Pumps Boots Oxfords

Slide 104

Slide 104

Bucketing documents Sneakers Pumps 2 bucket agg 1 metric agg doc_count Boots Oxfords 1 1

Slide 105

Slide 105

Bucketing documents Sneakers Pumps 50 bucket agg 95 metric agg avg price Boots Oxfords 90 23

Slide 106

Slide 106

Aggregations • bucket: terms, histogram, geo, range, sampler , significant text, nested • metric: value_count, avg, min, max, sum, stats, median deviation, geo, percentile, cardinality, • pipeline: min, max, sum, avg, derivative, stats, percentiles, cumulative sum, moving average, moving function, serial differencing

Slide 107

Slide 107

Distributed systems & search Fanning out a search, reducing the results

Slide 108

Slide 108

Elasticsearch

Slide 109

Slide 109

Slide 110

Slide 110

Elasticsearch in 10 seconds • Search Engine (FTS, Analytics, Geo), near real-time • Distributed, scalable, highly available, resilient • Interface: HTTP & JSON • Centrepiece of the Elastic Stack • Uneducated conservative guess: Tens of thousands of clusters worldwide, hundreds of thousands of instances

Slide 111

Slide 111

Distributed systems

Slide 112

Slide 112

Distributed systems • How do nodes communicate with each other? • Who is taking and executing decisions? • Failure detection? • Replication strategy? • Consistency? • Enter consensus algorithms…

Slide 113

Slide 113

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation https://en.wikipedia.org/wiki/Consensus_(computer_science) 113

Slide 114

Slide 114

Consensus algorithms • Leader based: Paxos, Raft • Non leader based: BTC, gossip

Slide 115

Slide 115

Consensus in Elasticsearch • Custom consensus algorithm, improving the existing one • Formally verified • Optimized for Elasticsearch use-case (rolling restarts, growing/shrinking clusters, log-ofoperations vs. cluster state)

Slide 116

Slide 116

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 117

Slide 117

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 118

Slide 118

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 119

Slide 119

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 120

Slide 120

Master node tasks • Deciding where data should be stored • Pinging other nodes • Reacting on node leaves/joins • Updating cluster state • Distributing cluster state

Slide 121

Slide 121

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Slide 122

Slide 122

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Slide 123

Slide 123

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Slide 124

Slide 124

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Slide 125

Slide 125

Consensus in Elasticsearch node 1 cs1 node 2 cs2 node 3 cs2 node 4 cs2

Slide 126

Slide 126

Distributed search

Slide 127

Slide 127

Distributed search in Elasticsearch node 1 node 2 node 3 p0 Primary Shard (Lucene Index) node 4

Slide 128

Slide 128

Distributed search in Elasticsearch node 1 p0 node 2 r0 Replica shard (copy) node 3 node 4

Slide 129

Slide 129

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1 2nd Primary Shard Replica shard (copy)

Slide 130

Slide 130

Distributed search in Elasticsearch • Shard: Lucene index, unit of scale • Primary shard: Write scalability • Replica shard: Read scalability, availability

Slide 131

Slide 131

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 132

Slide 132

Distributed search in Elasticsearch 2. Execute query against shards node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 133

Slide 133

Distributed search in Elasticsearch 3. top-k search results are returned to coordinating node node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 134

Slide 134

Distributed search in Elasticsearch 4. Create real top-k result list node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 135

Slide 135

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 136

Slide 136

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1 6. Return data to the client

Slide 137

Slide 137

Aggregations

Slide 138

Slide 138

Aggregations - cardinality node 1 p0 node 2 p1 POST /sales/_search?size=0 { “aggs” : { “type_count” : { “cardinality” : { “field” : “type” } } } }

Slide 139

Slide 139

Aggregations - cardinality node 1 p0 25 node 2 p1 40 How many distinct elements are in my index? What is the total? 40? 65? Naive solution: merge data to single dataset and count. Doesn’t scale! Solution: Use HyperLogLog++

Slide 140

Slide 140

HyperLogLog++ • Hash based counting • Trades in memory for accuracy • Fixed memory usage, based on configurable precision • Result: Small mergeable data structure, can easily be sent over the network

Slide 141

Slide 141

Aggregations - percentile node 1 p0 node 2 p1 GET latency/_search { “size”: 0, “aggs” : { “load_time_outlier” : { “percentiles” : { “field” : “load_time” } } } }

Slide 142

Slide 142

T-Digest • Extreme percentiles are more accurate than the Median • Percentiles are divided into buckets • When buckets grow over a boundary, approximation kicks in, saving memory in the process • The exact level of inaccuracy is difficult to generalize • Alternative: HDR histograms

Slide 143

Slide 143

Probabilistic data structures • bloom/cuckoo/quotient filters (membership check) • HyperLogLog++ (cardinality) • T-Digest, DDSketch, HDR histogram (percentile) • Count-Min sketch (frequency, top-k) • Hashing (similarity)

Slide 144

Slide 144

Demo Try it out yourself! https://ela.st/jade-hochschule-samples

Slide 145

Slide 145

Elastic Cloud Free 30 day trial https://ela.st/university-wilhelmshaven

Slide 146

Slide 146

Upcoming trends & summary … or why you should take a closer look at search

Slide 147

Slide 147

Search is not just google… • “Just google it” does not cut it • Enterprise search: Intranet/G-Drive/Dropbox • Ecommerce search • SIEM • Observability: Logging, APM & Metrics

Slide 148

Slide 148

Search is not ‘done’ • Constant improvement • Data structures & algorithms (BKD tree for geo shapes) • Academic research moves to industry thanks to Apache Lucene

Slide 149

Slide 149

Search is still tough • Language specific analysis • Smart query parsing (nike red hoodie xl) • Geo based search • Anomaly detection • Incoporating feedback loops

Slide 150

Slide 150

Upcoming trends • Learning-to-Rank • Deep Learning • Feedback loop

Slide 151

Slide 151

Summary • Everything is a search problem! • Search is hard… and interesting • Distributed systems are hard… and interesting • Domain knowledge required • Data keeps exploding, good job chances!

Slide 152

Slide 152

Literature Books, books, books

Slide 153

Slide 153

Slide 154

Slide 154

Resources Links, links, links

Slide 155

Slide 155

Links https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/search/similarities/ TFIDFSimilarity.html https://www.elastic.co/blog/whats-new-in-lucene-8 https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand https://speakerdeck.com/elastic/amusing-algorithms-and-data-structures https://www.elastic.co/blog/index-sorting-elasticsearch-6-0 https://raft.github.io/ https://github.com/elastic/elasticsearch-formal-models https://gist.github.com/spinscale/b62c8b357fae7db3f14b7d3127758951

Slide 156

Slide 156

Links - probabilistic data structures https://github.com/addthis/stream-lib https://github.com/DataDog/sketches-java https://github.com/HdrHistogram/HdrHistogram https://github.com/JohnStarich/java-skip-list https://github.com/addthis/stream-lib https://static.googleusercontent.com/media/research.google.com/fr/ pubs/archive/40671.pdf

Slide 157

Slide 157

Q&A Alexander Reelsen alex@elastic.co @spinscale