Introduction into Full Text Search

A presentation at HYF Copenhagen in July 2020 in by Alexander Reelsen

Slide 1

Slide 1

Full-text search with distributed search engines Alexander Reelsen @spinscale alex@elastic.co

Slide 2

Slide 2

Agenda • Overview • Indexing: Analysis, Tokenization, Filtering, on disk data structures • Searching: Scoring, Algorithms & Optimization • Aggregations • Distributed systems and search •Q&A

Slide 3

Slide 3

Overview Full text search introduction

Slide 4

Slide 4

Why is search so important?

Slide 5

Slide 5

Slide 6

Slide 6

Slide 7

Slide 7

Slide 8

Slide 8

Slide 9

Slide 9

Slide 10

Slide 10

Slide 11

Slide 11

Slide 12

Slide 12

Slide 13

Slide 13

SELECT * FROM products WHERE name LIKE = ‘%topf%’

Slide 14

Slide 14

grep “topf” my_dataset.txt

Slide 15

Slide 15

Problem • Scales linearly with the data set size • Relevancy • Spell correction • Synonyms • Phrases

Slide 16

Slide 16

Inverted Index

Slide 17

Slide 17

The quick brown fox jumped over the lazy dog

Slide 18

Slide 18

Tokenize The quick brown fox jumped over the lazy dog 1 1 1 1 1 1 1 1 1

Slide 19

Slide 19

Sort The brown dog fox jumped lazy over quick the 1 1 1 1 1 1 1 1 1

Slide 20

Slide 20

Quick brown foxes leap over lazy dogs in summer

Slide 21

Slide 21

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Slide 22

Slide 22

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy dog

Slide 23

Slide 23

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy AND dog [1,2] AND [1] = [1]

Slide 24

Slide 24

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy OR dog [1,2] OR [1] = [1,2]

Slide 25

Slide 25

Technologies used today • Apache Lucene (search library) • Elasticsearch (distributed search engine built on top of Apache Lucene)

Slide 26

Slide 26

Indexing Analysis, Tokenization, Filtering Data structures

Slide 27

Slide 27

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quick 1 hit

Slide 28

Slide 28

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quicK 0 hits

Slide 29

Slide 29

Analysis: Tokenizer & Token Filters

Slide 30

Slide 30

Tokenization

Slide 31

Slide 31

Tokenization quick brown fox

Slide 32

Slide 32

Tokenization quick_brown_fox

Slide 33

Slide 33

Tokenization quick_brown_fox the lazy, white dog.

Slide 34

Slide 34

Tokenization quick_brown_fox the_lazy_white_dog

Slide 35

Slide 35

Tokenization quick_brown_fox the_lazy_white_dog https://unicode.org/reports/tr29/

Slide 36

Slide 36

Tokenization quick_brown_fox the_lazy_white_dog https://www.jade-hs.de

Slide 37

Slide 37

Tokenization quick_brown_fox the_lazy_white_dog https_www.jade_hs.de

Slide 38

Slide 38

Token Filter

Slide 39

Slide 39

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Slide 40

Slide 40

Token filter The Quick brown fox

Slide 41

Slide 41

Token filter Lowercase The Quick brown fox the quick brown fox

Slide 42

Slide 42

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox

Slide 43

Slide 43

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox

Slide 44

Slide 44

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Tokens can be changed, added, removed

Slide 45

Slide 45

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Queries need to be processed as well!

Slide 46

Slide 46

More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass

Slide 47

Slide 47

(On-Disk) Data structures

Slide 48

Slide 48

What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data

Slide 49

Slide 49

Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!

Slide 50

Slide 50

Read-only data structures • Pro: Write-once, sequentially • Pro: Lock-free reading • Pro: File system cache • Contra: in-place updates & deletes • Contra: Housekeeping • Contra: Transactions

Slide 51

Slide 51

Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed! 1|2|3|4|5 3 6|7|8

Slide 52

Slide 52

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8

Slide 53

Slide 53

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8 1|2|4|5|6|7|8

Slide 54

Slide 54

Searching Precision vs. recall Scoring Algorithms and optimizations

Slide 55

Slide 55

Relevancy

Slide 56

Slide 56

Relevancy • Textbook answer: How well matches a document a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website

Slide 57

Slide 57

Scoring

Slide 58

Slide 58

Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?

Slide 59

Slide 59

Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are lesser common terms more relevant? • news paper: “dieselgate news”

Slide 60

Slide 60

Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs

Slide 61

Slide 61

Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°

Slide 62

Slide 62

Scoring: TF-IDF in Lucene score(q,d) = ∑ ( tf(t in d) · 2 idf(t) · t.getBoost() · norm(t,d) )

Slide 63

Slide 63

Slide 64

Slide 64

BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)

Slide 65

Slide 65

BM25 https://www.elastic.co/guide/en/elasticsearch/guide/2.x/pluggable-similarites.html

Slide 66

Slide 66

Precision vs. recall

Slide 67

Slide 67

Precision and Recall

Slide 68

Slide 68

Precision and Recall relevant documents irerelevant documents

Slide 69

Slide 69

Precision and Recall relevant documents irerelevant documents

Slide 70

Slide 70

True positives relevant documents irerelevant documents

Slide 71

Slide 71

True negatives relevant documents irerelevant documents

Slide 72

Slide 72

False positives relevant documents irerelevant documents

Slide 73

Slide 73

False negatives relevant documents irerelevant documents

Slide 74

Slide 74

Precision and recall • Precision: How many selected documents are relevant? • Recall: How many relevant documents are selected

Slide 75

Slide 75

Under the hood

Slide 76

Slide 76

Optimizations everywhere • leap frogging, skip lists • top-k • two phase iterations • integer compression

Slide 77

Slide 77

Query two phase iteration

Slide 78

Slide 78

Two phase iteration: Phrase query • Phrase query: “quick fox” • Approximation phase: document contains terms quick and fox • Verification phase: read positions of terms

Slide 79

Slide 79

Two phase iteration: Geo distance query • Geo distance query: Distance from reference point • Approximation phase: bbox around point • Verification phase: exact distance calculation

Slide 80

Slide 80

Two phase iteration: Geo distance query GET /my_locations/_search { “query”: { “bool” : { “filter” : { “geo_distance” : { “distance” : “200km”, “pin.location” : { “lat” : 40, “lon” : -70 } } } } } }

Slide 81

Slide 81

Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits

Slide 82

Slide 82

Skip lists & leap frogging

Slide 83

Slide 83

Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward https://en.wikipedia.org/wiki/Skip_list

Slide 84

Slide 84

Leap frogging elasticsearch AND kibana AND logstash

Slide 85

Slide 85

Leap frogging elasticsearch AND kibana AND logstash 266 102 568 98 302 266 60 102 199 18 59 150 5 5 102 1 3 5

Slide 86

Slide 86

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 87

Slide 87

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 88

Slide 88

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 89

Slide 89

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 90

Slide 90

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 91

Slide 91

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Slide 92

Slide 92

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 93

Slide 93

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 94

Slide 94

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 95

Slide 95

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Slide 96

Slide 96

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 97

Slide 97

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 98

Slide 98

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 99

Slide 99

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 100

Slide 100

Leap frogging logstash AND kibana AND elasticsearch 266 Done! 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Slide 101

Slide 101

Top-k retrieval

Slide 102

Slide 102

Top-k retrieval • elasticsearch OR kibana • top 10 results wanted • maximum score for kibana is 3.0 • maximum score for elasticsearch is 5.0 • collecting documents: when 10th hit has score > 3, then only documents with elasticsearch need to be collected • total hit count is not accurate

Slide 103

Slide 103

Top-k retrieval

Slide 104

Slide 104

Index sorting

Slide 105

Slide 105

Order index by field values • each segment is sorted before write • criteria can be chosen by the user 5|2|3|1|4 retrieve sort top 2 5|2|3|1|4 5|4|3|2|1 5|4

Slide 106

Slide 106

Order index by field values • each segment is sorted before write • criteria can be chosen by the user early termination 5|4|3|2|1 5|4

Slide 107

Slide 107

Aggregations Reducing data

Slide 108

Slide 108

Aggregations

Slide 109

Slide 109

aggregations documents

Slide 110

Slide 110

aggregations documents

Slide 111

Slide 111

Bucketing documents Pumps Sneakers Oxfords Sneakers Boots

Slide 112

Slide 112

Bucketing documents Sneakers Pumps Boots Oxfords

Slide 113

Slide 113

Bucketing documents Sneakers Pumps 2 bucket agg 1 metric agg doc_count Boots Oxfords 1 1

Slide 114

Slide 114

Bucketing documents Sneakers Pumps 50 bucket agg 95 metric agg avg price Boots Oxfords 90 23

Slide 115

Slide 115

Aggregations • bucket: terms, histogram, geo, range, sampler , significant text, nested • metric: value_count, avg, min, max, sum, stats, median deviation, geo, percentile, cardinality, • pipeline: min, max, sum, avg, derivative, stats, percentiles, cumulative sum, moving average, moving function, serial differencing

Slide 116

Slide 116

Distributed systems & search Fanning out a search, reducing the results

Slide 117

Slide 117

Elasticsearch

Slide 118

Slide 118

Slide 119

Slide 119

Elasticsearch in 10 seconds • Search Engine (FTS, Analytics, Geo), near real-time • Distributed, scalable, highly available, resilient • Interface: HTTP & JSON • Centrepiece of the Elastic Stack • Uneducated conservative guess: Tens of thousands of clusters worldwide, hundreds of thousands of instances

Slide 120

Slide 120

Distributed systems

Slide 121

Slide 121

Distributed systems • How do nodes communicate with each other? • Who is taking and executing decisions? • Failure detection? • Replication strategy? • Consistency? • Enter consensus algorithms…

Slide 122

Slide 122

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation https://en.wikipedia.org/wiki/Consensus_(computer_science) 122

Slide 123

Slide 123

Consensus algorithms • Leader based: Paxos, Raft • Non leader based: BTC, gossip

Slide 124

Slide 124

Consensus in Elasticsearch • Custom consensus algorithm, improving the existing one • Formally verified • Optimized for Elasticsearch use-case (rolling restarts, growing/shrinking clusters, log-ofoperations vs. cluster state)

Slide 125

Slide 125

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 126

Slide 126

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 127

Slide 127

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 128

Slide 128

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Slide 129

Slide 129

Master node tasks • Deciding where data should be stored • Pinging other nodes • Reacting on node leaves/joins • Updating cluster state • Distributing cluster state

Slide 130

Slide 130

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Slide 131

Slide 131

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Slide 132

Slide 132

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Slide 133

Slide 133

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Slide 134

Slide 134

Consensus in Elasticsearch node 1 cs1 node 2 cs2 node 3 cs2 node 4 cs2

Slide 135

Slide 135

Distributed search

Slide 136

Slide 136

Distributed search in Elasticsearch node 1 node 2 node 3 p0 Primary Shard (Lucene Index) node 4

Slide 137

Slide 137

Distributed search in Elasticsearch node 1 p0 node 2 r0 Replica shard (copy) node 3 node 4

Slide 138

Slide 138

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1 2nd Primary Shard Replica shard (copy)

Slide 139

Slide 139

Distributed search in Elasticsearch • Shard: Lucene index, unit of scale • Primary shard: Write scalability • Replica shard: Read scalability, availability

Slide 140

Slide 140

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 141

Slide 141

Distributed search in Elasticsearch 2. Execute query against shards node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 142

Slide 142

Distributed search in Elasticsearch 3. top-k search results are returned to coordinating node node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 143

Slide 143

Distributed search in Elasticsearch 4. Create real top-k result list node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 144

Slide 144

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1

  1. Client connects any node with search request

Slide 145

Slide 145

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1 6. Return data to the client

Slide 146

Slide 146

Aggregations

Slide 147

Slide 147

Aggregations - cardinality node 1 p0 node 2 p1 POST /sales/_search?size=0 { “aggs” : { “type_count” : { “cardinality” : { “field” : “type” } } } }

Slide 148

Slide 148

Aggregations - cardinality node 1 p0 25 node 2 p1 40 How many distinct elements are in my index? What is the total? 40? 65? Naive solution: merge data to single dataset and count. Doesn’t scale! Solution: Use HyperLogLog++

Slide 149

Slide 149

HyperLogLog++ • Hash based counting • Trades in memory for accuracy • Fixed memory usage, based on configurable precision • Result: Small mergeable data structure, can easily be sent over the network

Slide 150

Slide 150

Aggregations - percentile node 1 p0 node 2 p1 GET latency/_search { “size”: 0, “aggs” : { “load_time_outlier” : { “percentiles” : { “field” : “load_time” } } } }

Slide 151

Slide 151

T-Digest • Extreme percentiles are more accurate than the Median • Percentiles are divided into buckets • When buckets grow over a boundary, approximation kicks in, saving memory in the process • The exact level of inaccuracy is difficult to generalize • Alternative: HDR histograms

Slide 152

Slide 152

Probabilistic data structures • bloom/cuckoo/quotient filters (membership check) • HyperLogLog++ (cardinality) • T-Digest, DDSketch, HDR histogram (percentile) • Count-Min sketch (frequency, top-k) • Hashing (similarity)

Slide 153

Slide 153

Demo Try it out yourself! https://ela.st/jade-hochschule-samples

Slide 154

Slide 154

Elastic Cloud Free 30 day trial https://ela.st/hack-your-future-2020

Slide 155

Slide 155

Upcoming trends & summary … or why you should take a closer look at search

Slide 156

Slide 156

Search is not just google… • “Just google it” does not cut it • Enterprise search: Intranet/G-Drive/Dropbox • Ecommerce search • SIEM • Observability: Logging, APM & Metrics

Slide 157

Slide 157

Search is not ‘done’ • Constant improvement • Data structures & algorithms (BKD tree for geo shapes) • Academic research moves to industry thanks to Apache Lucene

Slide 158

Slide 158

Search is still tough • Language specific analysis • Smart query parsing (nike red hoodie xl) • Geo based search • Anomaly detection • Incoporating feedback loops

Slide 159

Slide 159

Upcoming trends • Learning-to-Rank • Deep Learning • Feedback loop

Slide 160

Slide 160

Summary • Everything is a search problem! • Search is hard… and interesting • Distributed systems are hard… and interesting • Domain knowledge required • Data keeps exploding, good job chances!

Slide 161

Slide 161

Literature Books, books, books

Slide 162

Slide 162

Slide 163

Slide 163

Resources Links, links, links

Slide 164

Slide 164

Links https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/search/similarities/ TFIDFSimilarity.html https://www.elastic.co/blog/whats-new-in-lucene-8 https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand https://speakerdeck.com/elastic/amusing-algorithms-and-data-structures https://www.elastic.co/blog/index-sorting-elasticsearch-6-0 https://raft.github.io/ https://github.com/elastic/elasticsearch-formal-models https://gist.github.com/spinscale/b62c8b357fae7db3f14b7d3127758951

Slide 165

Slide 165

Links - probabilistic data structures https://github.com/addthis/stream-lib https://github.com/DataDog/sketches-java https://github.com/HdrHistogram/HdrHistogram https://github.com/JohnStarich/java-skip-list https://github.com/addthis/stream-lib https://static.googleusercontent.com/media/research.google.com/fr/ pubs/archive/40671.pdf

Slide 166

Slide 166

Q&A Alexander Reelsen alex@elastic.co @spinscale