Introduction to full-text search with distributed search engines

A presentation at unKonf in October 2019 in Mannheim, Germany by Alexander Reelsen

Slide 1

Full-text search with distributed search engines Alexander Reelsen @spinscale alex@elastic.co

Slide 2

Agenda • Overview • Indexing: Analysis, Tokenization, Filtering, on-disk data structures • Searching: Scoring, Algorithms & Optimization • Aggregations • Distributed systems and search • Q&A

Slide 3

Overview Full text search introduction

Slide 4

SELECT * FROM products WHERE name LIKE '%topf%'

Slide 5

grep "topf" my_dataset.txt

Slide 6

Problem • Scales linearly with the data set size • Relevancy • Spell correction • Synonyms • Phrases

Slide 7

Inverted Index

Slide 8

The quick brown fox jumped over the lazy dog

Slide 9

Tokenize (term ⇾ document ID): The ⇾ 1 | quick ⇾ 1 | brown ⇾ 1 | fox ⇾ 1 | jumped ⇾ 1 | over ⇾ 1 | the ⇾ 1 | lazy ⇾ 1 | dog ⇾ 1

Slide 10

Sort (term ⇾ document ID): The ⇾ 1 | brown ⇾ 1 | dog ⇾ 1 | fox ⇾ 1 | jumped ⇾ 1 | lazy ⇾ 1 | over ⇾ 1 | quick ⇾ 1 | the ⇾ 1

Slide 11

Quick brown foxes leap over lazy dogs in summer

Slide 12

Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1

Slide 13

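Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1
Query: lazy dog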

Slide 14

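Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1
Query: lazy AND dog ⇾ [1,2] AND [1] = [1]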

Slide 15

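Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1
Query: lazy OR dog ⇾ [1,2] OR [1] = [1,2]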
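The two example sentences and the boolean queries from the preceding slides can be reproduced in a few lines of Python. This is a minimal sketch and not how Lucene stores its index: the function names (`build_index`, `search_and`, `search_or`) are made up for illustration, tokens are split on whitespace only, and nothing is lowercased, so `Quick` and `quick` stay separate terms.

```python
from collections import defaultdict

# The two example documents from the slides, keyed by document ID.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_index(documents):
    """Map every whitespace-separated token to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            postings[token].add(doc_id)
    # Sort the terms so the dictionary resembles the sorted term list on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search_and(index, *terms):
    """Intersect the postings lists of all terms."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


def search_or(index, *terms):
    """Union the postings lists of all terms."""
    result = set()
    for term in terms:
        result |= set(index.get(term, []))
    return sorted(result)


index = build_index(docs)
print(index["lazy"])                     # [1, 2]
print(search_and(index, "lazy", "dog"))  # [1]
print(search_or(index, "lazy", "dog"))   # [1, 2]
```

Because a lookup only touches the postings lists of the query terms, the cost no longer grows with the total amount of text the way the LIKE and grep examples do.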

Slide 16

Technologies used today • Apache Lucene (search library) • Elasticsearch (distributed search engine built on top of Apache Lucene)

Slide 17

Indexing Analysis, Tokenization, Filtering Data structures

Slide 18

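Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1
Query: quick ⇾ 1 hit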

Slide 19

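Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1
Query: quicK ⇾ 0 hits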

Slide 20

Analysis: Tokenizer & Token Filters

Slide 21

Tokenization

Slide 22

Tokenization quick brown fox

Slide 23

Tokenization quick_brown_fox

Slide 24

Tokenization quick_brown_fox the lazy, white dog.

Slide 25

Tokenization quick_brown_fox the_lazy_white_dog

Slide 26

Tokenization quick_brown_fox the_lazy_white_dog https://unicode.org/reports/tr29/

Slide 27

Tokenization quick_brown_fox the_lazy_white_dog https://www.jade-hs.de

Slide 28

Tokenization quick_brown_fox the_lazy_white_dog https_www.jade_hs.de

Slide 29

Token Filter

Slide 30

Inverted index (term ⇾ document IDs): Quick ⇾ 2 | The ⇾ 1 | brown ⇾ 1,2 | dog ⇾ 1 | dogs ⇾ 2 | fox ⇾ 1 | foxes ⇾ 2 | in ⇾ 2 | jumped ⇾ 1 | lazy ⇾ 1,2 | leap ⇾ 2 | over ⇾ 1,2 | quick ⇾ 1 | summer ⇾ 2 | the ⇾ 1

Slide 31

Token filter The Quick brown fox

Slide 32

Token filter
Input: The Quick brown fox
Lowercase: the quick brown fox

Slide 33

Token filter
Input: The Quick brown fox
Lowercase: the quick brown fox
Stopwords: quick brown fox

Slide 34

Token filter
Input: The Quick brown fox
Lowercase: the quick brown fox
Stopwords: quick brown fox
Synonyms: quick,fast brown fox

Slide 35

Token filter
Input: The Quick brown fox
Lowercase: the quick brown fox
Stopwords: quick brown fox
Synonyms: quick,fast brown fox
Tokens can be changed, added, removed

Slide 36

Token filter
Input: The Quick brown fox
Lowercase: the quick brown fox
Stopwords: quick brown fox
Synonyms: quick,fast brown fox
Queries need to be processed as well!
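The filter chain from these slides can be sketched as one function that is applied both to documents at index time and to queries at search time. The stopword set and the synonym mapping below are tiny, assumed examples; real analyzers ship much larger, language-specific lists, and the whitespace tokenizer stands in for a proper Unicode-aware one.

```python
# Assumed, deliberately tiny configuration for illustration only.
STOPWORDS = {"the", "a", "an"}
SYNONYMS = {"quick": ["fast"]}


def analyze(text):
    """Tokenize on whitespace, then lowercase, drop stopwords, add synonyms."""
    tokens = text.split()
    # Token filter 1: lowercase (changes tokens).
    tokens = [token.lower() for token in tokens]
    # Token filter 2: stopwords (removes tokens).
    tokens = [token for token in tokens if token not in STOPWORDS]
    # Token filter 3: synonyms (adds tokens next to the original one).
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


print(analyze("The Quick brown fox"))  # ['quick', 'fast', 'brown', 'fox']
print(analyze("quicK"))                # ['quick', 'fast']
```

Running the query `quicK` through the same chain is what turns the earlier "0 hits" case into a match: both sides end up as the term `quick`.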

Slide 37

More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass

Slide 38

(On-Disk) Data structures

Slide 39

What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data

Slide 40

Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!

Slide 41

Read-only data structures • Pro: Write-once, sequentially • Pro: Lock-free reading • Pro: File system cache • Contra: in-place updates & deletes • Contra: Housekeeping • Contra: Transactions

Slide 42

Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed!
Segments: 1|2|3|4|5 (3 marked as deleted) and 6|7|8

Slide 43

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents
Segments: 1|2|3|4|5 (3 marked as deleted) and 6|7|8

Slide 44

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents
Before merge: 1|2|3|4|5 (3 marked as deleted) and 6|7|8
After merge: 1|2|4|5|6|7|8
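The delete-then-merge behaviour shown on the segment slides can be modelled with a small toy class. This is only a sketch under simplifying assumptions: the `Segment` class and `merge` function are hypothetical names, a segment holds bare document IDs instead of a full inverted index, and the set of delete markers plays the role of the "special file".

```python
class Segment:
    """Toy segment: the document list is written once, deletes are only markers."""

    def __init__(self, doc_ids):
        self.doc_ids = tuple(doc_ids)  # immutable after creation
        self.deleted = set()           # stands in for the separate deletes file

    def delete(self, doc_id):
        # Nothing is removed from doc_ids, so no space is freed.
        if doc_id in self.doc_ids:
            self.deleted.add(doc_id)

    def live_docs(self):
        # Searches skip documents that carry a delete marker.
        return [d for d in self.doc_ids if d not in self.deleted]


def merge(*segments):
    """Write one new segment containing only the live documents."""
    merged = []
    for segment in segments:
        merged.extend(segment.live_docs())
    return Segment(merged)


first = Segment([1, 2, 3, 4, 5])
second = Segment([6, 7, 8])
first.delete(3)

print(len(first.doc_ids))   # 5, the deleted document still occupies space
print(first.live_docs())    # [1, 2, 4, 5]
print(merge(first, second).doc_ids)  # (1, 2, 4, 5, 6, 7, 8)
```

Only the merge actually reclaims the space of document 3, which is why merging doubles as housekeeping.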

Slide 45

Searching Precision vs. recall Scoring Algorithms and optimizations

Slide 46

Relevancy

Slide 47

Relevancy • Textbook answer: How well does a document match a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website

Slide 48

Scoring

Slide 49

Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?

Slide 50

Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are less common terms more relevant? • newspaper: “dieselgate news”

Slide 51

Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs

Slide 52

Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°
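A small sketch of the vector space idea: documents and queries become sparse vectors with one dimension per term, and the cosine of the angle between them serves as similarity. The weights below are invented numbers standing in for tf-idf values, and `cosine_similarity` is an illustrative helper, not a library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    shared_terms = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed example weights; each key is one dimension of the vector space.
query = {"lazy": 1.0, "dog": 2.0}
same_direction = {"lazy": 2.0, "dog": 4.0}   # angle 0 degrees, best match
partial_match = {"lazy": 2.0, "cat": 4.0}
no_overlap = {"cat": 3.0}

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, partial_match))
print(cosine_similarity(query, no_overlap))
```

The first value is 1 up to floating-point rounding, the last is exactly 0, and the partial match lands in between.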

Slide 53

Scoring: TF-IDF in Lucene
score(q,d) = ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
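To make the formula concrete, here is a sketch that scores the three "lazy dog" example texts from the naive-scoring slide. The concrete definitions are assumptions chosen to resemble Lucene's classic similarity (square root of the frequency for tf, a logarithmic idf, one over the square root of the field length for norm); the exact functions differ between Lucene versions, so treat the numbers as illustrative only.

```python
import math

# Example texts from the naive-scoring slide, already tokenized.
corpus = {
    1: "the lazy dog".split(),
    2: "the lazy frog".split(),
    3: "the lazy lazy lazy lazy cat".split(),
}


def tf(term, doc):
    # Assumed: dampened term frequency.
    return math.sqrt(doc.count(term))


def idf(term, corpus):
    # Assumed: rarer terms get a higher weight.
    doc_freq = sum(1 for doc in corpus.values() if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def norm(doc):
    # Assumed: shorter fields count more than longer ones.
    return 1.0 / math.sqrt(len(doc))


def score(query_terms, doc, corpus, boost=1.0):
    """Sum of tf * idf^2 * boost * norm over all query terms."""
    return sum(
        tf(t, doc) * idf(t, corpus) ** 2 * boost * norm(doc)
        for t in query_terms
    )


query = ["lazy", "dog"]
scores = {doc_id: score(query, doc, corpus) for doc_id, doc in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 3, 2]
```

With these assumptions the rare term `dog` dominates, so document 1 wins clearly, and repeating `lazy` four times helps document 3 only a little.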

Slide 54

Slide 55

BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)
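Term frequency saturation and length normalization can be seen in the term-frequency part of BM25. The sketch below uses the textbook form of that factor with the commonly cited defaults k1 = 1.2 and b = 0.75; the function name `bm25_tf` is made up, the idf part is left out, and Lucene's actual implementation differs in details.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (textbook form, idf omitted)."""
    # b controls how strongly the document length is taken into account.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly additional occurrences stop adding score.
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# Saturation: each further occurrence adds less, the value approaches k1 + 1.
for freq in (1, 2, 5, 10, 100):
    print(freq, round(bm25_tf(freq, doc_len=10, avg_doc_len=10), 3))

# Length normalization: the same frequency counts less in a longer document.
print(bm25_tf(1, doc_len=20, avg_doc_len=10) < bm25_tf(1, doc_len=10, avg_doc_len=10))
```

This saturation is why very frequent terms such as stopwords cannot dominate a BM25 score the way they can with an unbounded term frequency.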

Slide 56

BM25 https://www.elastic.co/guide/en/elasticsearch/guide/2.x/pluggable-similarites.html

Slide 57

Precision vs. recall

Slide 58

Precision and Recall

Slide 59

Precision and Recall relevant documents irrelevant documents

Slide 60

Precision and Recall relevant documents irrelevant documents

Slide 61

True positives relevant documents irrelevant documents

Slide 62

True negatives relevant documents irrelevant documents

Slide 63

False positives relevant documents irrelevant documents

Slide 64

False negatives relevant documents irrelevant documents

Slide 65

Precision and recall • Precision: How many selected documents are relevant? • Recall: How many relevant documents are selected?
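Both measures follow directly from the true positives, false positives and false negatives introduced on the previous slides. A minimal sketch, with invented document IDs and a hypothetical helper name:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected and relevant doc IDs."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    # Precision: share of the selected documents that are relevant.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall: share of the relevant documents that were selected.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}   # assumed ground truth
selected = {3, 4, 5}      # assumed search result

precision, recall = precision_recall(selected, relevant)
print(precision)  # 2 of 3 selected documents are relevant
print(recall)     # 2 of 4 relevant documents were found: 0.5
```

Document 5 is a false positive and lowers precision; documents 1 and 2 are false negatives and lower recall.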

Slide 66

Under the hood

Slide 67

Optimizations everywhere • leap frogging, skip lists • top-k • two phase iterations • integer compression
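As one example of integer compression, sorted document IDs are commonly stored as gaps between neighbours, and small gaps are then written with fewer bytes. The sketch below combines delta encoding with a simple variable-length byte encoding. It illustrates the general idea only; the function names are made up, and Lucene's real postings formats are more elaborate.

```python
def delta_encode(doc_ids):
    """Replace sorted doc IDs by the gap to the previous ID."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the doc IDs as a running sum of the gaps."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def varint_encode(numbers):
    """7 payload bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def varint_decode(data):
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


doc_ids = [1000, 1003, 1010, 1300]
encoded = varint_encode(delta_encode(doc_ids))
print(delta_encode(doc_ids))  # [1000, 3, 7, 290]
print(len(encoded))           # 6 bytes instead of 16 for four 32-bit integers
print(delta_decode(varint_decode(encoded)) == doc_ids)  # True
```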

Slide 68

Query two phase iteration

Slide 69

Two phase iteration: Phrase query • Phrase query: “quick fox” • Approximation phase: document contains terms quick and fox • Verification phase: read positions of terms
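The two phases of a phrase query can be sketched with a tiny positional index. The cheap approximation only checks that all terms occur in a document; the more expensive verification reads the positions and checks that the terms are adjacent and in order. The three documents and all function names are assumptions for illustration.

```python
docs = {
    1: "the quick fox",
    2: "the quick brown fox",
    3: "a lazy dog",
}


def positions(text):
    """Map each term of one document to the list of its token positions."""
    result = {}
    for position, token in enumerate(text.split()):
        result.setdefault(token, []).append(position)
    return result


# Per-document positional data: doc ID -> term -> positions.
index = {doc_id: positions(text) for doc_id, text in docs.items()}


def approximate(index, terms):
    """Phase 1: documents that contain every term, positions are not read."""
    return [d for d, pos in index.items() if all(t in pos for t in terms)]


def verify(doc_positions, terms):
    """Phase 2: is there a start position where the terms follow each other?"""
    return any(
        all(start + offset in doc_positions[t] for offset, t in enumerate(terms))
        for start in doc_positions[terms[0]]
    )


def phrase_query(index, phrase):
    terms = phrase.split()
    candidates = approximate(index, terms)
    return [d for d in candidates if verify(index[d], terms)]


print(approximate(index, ["quick", "fox"]))  # [1, 2]
print(phrase_query(index, "quick fox"))      # [1]
```

Document 2 passes the approximation but fails verification, because `brown` sits between the two terms.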

Slide 70

Two phase iteration: Geo distance query • Geo distance query: Distance from reference point • Approximation phase: bbox around point • Verification phase: exact distance calculation

Slide 71

Two phase iteration: Geo distance query
GET /my_locations/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "pin.location": { "lat": 40, "lon": -70 }
        }
      }
    }
  }
}
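The same two-phase idea, applied to the 200 km query around lat 40, lon -70, might look like the sketch below: a cheap bounding-box comparison first, then an exact haversine distance only for the candidates. The sample points, the helper names and the simple degree-based box are assumptions; the box ignores the poles and the date line, and this is not a description of how Elasticsearch implements the query.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE_LAT = 111.32  # rough average, assumed for this sketch


def bounding_box(lat, lon, distance_km):
    """Approximate box around the point, as (min_lat, max_lat, min_lon, max_lon)."""
    dlat = distance_km / KM_PER_DEGREE_LAT
    dlon = distance_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_distance_query(points, lat, lon, distance_km):
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, distance_km)
    # Approximation phase: plain comparisons, no trigonometry per document.
    candidates = [
        name for name, (plat, plon) in points.items()
        if min_lat <= plat <= max_lat and min_lon <= plon <= max_lon
    ]
    # Verification phase: exact distance, only for the candidates.
    hits = [n for n in candidates if haversine_km(lat, lon, *points[n]) <= distance_km]
    return candidates, hits


# Invented sample points (name -> (lat, lon)).
points = {
    "near": (40.5, -70.5),
    "box_corner": (41.5, -72.0),  # inside the box, but farther than 200 km
    "far": (45.0, -70.0),
}

candidates, hits = geo_distance_query(points, lat=40, lon=-70, distance_km=200)
print(candidates)  # ['near', 'box_corner']
print(hits)        # ['near']
```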

Slide 72

Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits

Slide 73

Skip lists & leap frogging

Slide 74

Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward https://en.wikipedia.org/wiki/Skip_list

Slide 75

Leap frogging elasticsearch AND kibana AND logstash

Slide 76

Leap frogging elasticsearch AND kibana AND logstash 266 102 568 98 302 266 60 102 199 18 59 150 5 5 102 1 3 5
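Leap frogging for a three-term AND query can be sketched as follows: every postings list is advanced to the current candidate document, and whenever a list jumps past it, the document it landed on becomes the new candidate. The postings lists below are invented for illustration and are not the numbers from the slide, and a binary search (`bisect_left`) stands in for the skip list that provides the "express lanes" in a real index.

```python
from bisect import bisect_left


def advance(postings, start, target):
    """Index of the first entry >= target, searching from position start."""
    return bisect_left(postings, target, start)


def leapfrog_intersect(*lists):
    """Intersect sorted lists of integer doc IDs without scanning every entry."""
    if not lists or any(not postings for postings in lists):
        return []
    cursors = [0] * len(lists)
    result = []
    candidate = max(postings[0] for postings in lists)
    while True:
        agreed = True
        for i, postings in enumerate(lists):
            cursors[i] = advance(postings, cursors[i], candidate)
            if cursors[i] == len(postings):
                return result  # one list is exhausted, no more matches
            if postings[cursors[i]] != candidate:
                # This list leapt past the candidate: retry with its doc ID.
                candidate = postings[cursors[i]]
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1  # continue after the match


# Assumed example postings lists (sorted doc IDs per term).
elasticsearch = [1, 5, 18, 60, 98, 266]
kibana = [3, 5, 59, 102, 266, 302]
logstash = [5, 102, 150, 199, 266, 568]

print(leapfrog_intersect(elasticsearch, kibana, logstash))  # [5, 266]
```

Most entries are never compared with each other: after matching document 5, the lists jump from 18 to 59 to 60 to 102 to 266 instead of walking through every ID.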