Introduction to Full-Text Search

A presentation at HYF Copenhagen in July 2020 by Alexander Reelsen

Slide 1

Full-text search with distributed search engines Alexander Reelsen @spinscale alex@elastic.co

Slide 2

Agenda
• Overview
• Indexing: Analysis, Tokenization, Filtering, on-disk data structures
• Searching: Scoring, Algorithms & Optimization
• Aggregations
• Distributed systems and search
• Q&A

Slide 3

Overview: Full-text search introduction

Slide 4

Why is search so important?

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

SELECT * FROM products WHERE name LIKE '%topf%'

Slide 14

grep "topf" my_dataset.txt

Slide 15

Problem
• Scales linearly with the data set size
• Relevancy
• Spell correction
• Synonyms
• Phrases

Slide 16

Inverted Index

Slide 17

The quick brown fox jumped over the lazy dog

Slide 18

Tokenize (term: document ids)
The: 1 | quick: 1 | brown: 1 | fox: 1 | jumped: 1 | over: 1 | the: 1 | lazy: 1 | dog: 1

Slide 19

Sort (term: document ids)
The: 1 | brown: 1 | dog: 1 | fox: 1 | jumped: 1 | lazy: 1 | over: 1 | quick: 1 | the: 1

Slide 20

Quick brown foxes leap over lazy dogs in summer

Slide 21

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1

Slide 22

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1
Query: lazy dog

Slide 23

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1
Query: lazy AND dog ⇾ [1,2] AND [1] = [1]

Slide 24

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1
Query: lazy OR dog ⇾ [1,2] OR [1] = [1,2]
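The inverted index and the AND/OR lookups from the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, assuming plain whitespace tokenization and no analysis; the function names `build_inverted_index`, `search_and` and `search_or` are made up for this example and are not part of Lucene or Elasticsearch.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # whitespace tokenization only, no analysis
            postings[token].add(doc_id)
    # Sort the terms and each postings list, as on the "Sort" slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search_and(index, *terms):
    """Documents containing every term: intersection of the postings lists."""
    sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def search_or(index, *terms):
    """Documents containing at least one term: union of the postings lists."""
    return sorted(set().union(*(index.get(term, []) for term in terms)))


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)

print(search_and(index, "lazy", "dog"))  # [1]
print(search_or(index, "lazy", "dog"))   # [1, 2]
```

Because no analysis is applied, "Quick" and "quick" end up as two separate terms, which is the problem the indexing section picks up next.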

Slide 25

Technologies used today
• Apache Lucene (search library)
• Elasticsearch (distributed search engine built on top of Apache Lucene)

Slide 26

Indexing: Analysis, Tokenization, Filtering, Data structures

Slide 27

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1
Query: quick ⇾ 1 hit

Slide 28

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1
Query: quicK ⇾ 0 hits

Slide 29

Analysis: Tokenizer & Token Filters

Slide 30

Tokenization

Slide 31

Tokenization quick brown fox

Slide 32

Tokenization quick_brown_fox

Slide 33

Tokenization quick_brown_fox the lazy, white dog.

Slide 34

Tokenization quick_brown_fox the_lazy_white_dog

Slide 35

Tokenization quick_brown_fox the_lazy_white_dog https://unicode.org/reports/tr29/

Slide 36

Tokenization quick_brown_fox the_lazy_white_dog https://www.jade-hs.de

Slide 37

Tokenization quick_brown_fox the_lazy_white_dog https_www.jade_hs.de
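The token boundaries shown on the tokenization slides can be approximated with a single regular expression. This is only a rough sketch that happens to reproduce the three examples above; it is not an implementation of the Unicode text segmentation rules linked on the earlier slide, and the pattern is an assumption made for illustration.

```python
import re

# Approximation only: runs of word characters, where a single dot between
# two word characters does not end the token (so "hs.de" stays together).
TOKEN_PATTERN = re.compile(r"\w+(?:\.\w+)*")


def tokenize(text):
    """Split text into tokens, dropping whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("quick brown fox"))         # ['quick', 'brown', 'fox']
print(tokenize("the lazy, white dog."))    # ['the', 'lazy', 'white', 'dog']
print(tokenize("https://www.jade-hs.de"))  # ['https', 'www.jade', 'hs.de']
```

The URL example shows why a general-purpose tokenizer is a poor fit for some inputs: the hyphen splits the host name, while the dots do not.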

Slide 38

Token Filter

Slide 39

Inverted index (term: document ids)
Quick: 2 | The: 1 | brown: 1,2 | dog: 1 | dogs: 2 | fox: 1 | foxes: 2 | in: 2 | jumped: 1 | lazy: 1,2 | leap: 2 | over: 1,2 | quick: 1 | summer: 2 | the: 1

Slide 40

Token filter The Quick brown fox

Slide 41

Token filter
The Quick brown fox
⇾ Lowercase: the quick brown fox

Slide 42

Token filter
The Quick brown fox
⇾ Lowercase: the quick brown fox
⇾ Stopwords: quick brown fox

Slide 43

Token filter
The Quick brown fox
⇾ Lowercase: the quick brown fox
⇾ Stopwords: quick brown fox
⇾ Synonyms: quick,fast brown fox

Slide 44

Token filter
The Quick brown fox
⇾ Lowercase: the quick brown fox
⇾ Stopwords: quick brown fox
⇾ Synonyms: quick,fast brown fox
Tokens can be changed, added, removed

Slide 45

Token filter
The Quick brown fox
⇾ Lowercase: the quick brown fox
⇾ Stopwords: quick brown fox
⇾ Synonyms: quick,fast brown fox
Queries need to be processed as well!
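The filter chain from these slides can be sketched as a list of functions applied one after another. The stopword list and the synonym map below are hypothetical and chosen only to reproduce the slide example; real engines also keep synonyms at the same token position, which this flat list does not model.

```python
STOPWORDS = {"the", "a", "an"}   # hypothetical stopword list
SYNONYMS = {"quick": ["fast"]}   # hypothetical synonym map


def lowercase(tokens):
    """Change tokens: normalize case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Remove tokens: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def add_synonyms(tokens):
    """Add tokens: emit each synonym right after the original token."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


def analyze(text):
    """Whitespace tokenizer followed by the three token filters, in order."""
    tokens = text.split()
    for token_filter in (lowercase, remove_stopwords, add_synonyms):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Quick brown fox"))  # ['quick', 'fast', 'brown', 'fox']
print(analyze("QUICK"))                # ['quick', 'fast']
```

Running the same `analyze` function over the query text as over the indexed text is what makes a search for "QUICK" find a document that contained "Quick".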

Slide 46

More analysis strategies
• Phonetic analysis: Meyer vs. Meier
• Stemming: foxes ⇾ fox
• Compounding: Blumentopf ⇾ blumen topf
• Folding: Spaß ⇾ Spass
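Three of these strategies can be illustrated with toy functions. They are deliberately naive sketches under stated assumptions: the stemmer only strips plural endings, the decompounder only tries a split into two words from a hypothetical dictionary, and phonetic analysis is left out. Production analyzers use proper algorithms and language-specific dictionaries.

```python
def fold(token):
    """Folding: str.casefold() lowercases and expands the German sharp s to 'ss'."""
    return token.casefold()


def naive_stem(token):
    """Toy stemmer: strips a plural 'es' or 's'. Real stemmers are far more careful."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def decompound(token, dictionary):
    """Split a compound into two dictionary words, if such a split exists."""
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in dictionary and tail in dictionary:
            return [head, tail]
    return [token]


german_words = {"blumen", "topf"}  # hypothetical mini dictionary

print(fold("Spaß"))                                    # spass
print(naive_stem("foxes"))                             # fox
print(decompound(fold("Blumentopf"), german_words))    # ['blumen', 'topf']
```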

Slide 47

(On-Disk) Data structures

Slide 48

What else is in an inverted index?
• Documents: Find documents
• Term frequencies: Relevancy
• Positions: Positional Queries
• Offsets: Highlighting
• Stored fields: The original data

Slide 49

Segment: Unit of work
• A fully self-sufficient inverted index
• An index consists of a number of segments
• New segments are created for newly added documents
• Segments are immutable!

Slide 50

Read-only data structures
• Pro: Write-once, sequentially
• Pro: Lock-free reading
• Pro: File system cache
• Contra: in-place updates & deletes
• Contra: Housekeeping
• Contra: Transactions

Slide 51

Segment: Deletes
• Mark a document as deleted in a special file
• Exclude it from searches
• No space is freed!
Segments: 1|2|3|4|5 (3 marked as deleted) and 6|7|8

Slide 52

Segment: Merging
• Number of segments needs to be kept reasonable
• Merge multiple segments into one (smaller index)
• Delete expired documents
Segments: 1|2|3|4|5 (3 marked as deleted) and 6|7|8

Slide 53

Segment: Merging
• Number of segments needs to be kept reasonable
• Merge multiple segments into one (smaller index)
• Delete expired documents
Segments: 1|2|3|4|5 (3 marked as deleted) and 6|7|8
Merged: 1|2|4|5|6|7|8
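The delete-then-merge behaviour of segments can be modelled with a small toy class. This is a conceptual sketch only: the `Segment` class and `merge` function are invented for this example and say nothing about how Lucene lays out segment files on disk.

```python
class Segment:
    """Toy segment: the documents are written once, deletes are tracked separately."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> text, never changed after creation
        self.deleted = set()    # tombstones, standing in for the "special file"

    def delete(self, doc_id):
        """Mark a document as deleted; the stored data stays in place."""
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def live_docs(self):
        """Documents that searches are still allowed to see."""
        return {d: t for d, t in self.docs.items() if d not in self.deleted}


def merge(segments):
    """Write one new segment from the live documents of several old ones."""
    merged = {}
    for segment in segments:
        merged.update(segment.live_docs())
    return Segment(merged)


first = Segment({i: f"doc {i}" for i in range(1, 6)})   # 1|2|3|4|5
second = Segment({i: f"doc {i}" for i in range(6, 9)})  # 6|7|8
first.delete(3)

print(len(first.docs))        # 5, no space is freed by the delete
print(sorted(first.live_docs()))  # [1, 2, 4, 5]

merged = merge([first, second])
print(sorted(merged.docs))    # [1, 2, 4, 5, 6, 7, 8]
```

Only the merge actually reclaims the space of document 3, because it is the only step that writes new data.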

Slide 54

Searching: Precision vs. recall, Scoring, Algorithms and optimizations

Slide 55

Relevancy

Slide 56

Relevancy
• Textbook answer: How well does a document match a query?
• Business answer: Are the top search results those that make me the most money?
  • marketplace
  • hotel booking website
  • newspaper website

Slide 57

Scoring

Slide 58

Scoring: lazy dog
• Naive: increase a counter if a term is matched
• “the lazy dog” => score 2
• “the lazy frog” => score 1
• “the lazy lazy lazy lazy cat” => score 4 or 1?
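The open question on this slide (score 4 or 1?) comes down to what the counter counts. A minimal sketch of both readings, with function names made up for this example:

```python
def score_matched_terms(query, doc):
    """One point per distinct query term that occurs in the document."""
    doc_tokens = set(doc.split())
    return sum(1 for term in set(query.split()) if term in doc_tokens)


def score_occurrences(query, doc):
    """One point per occurrence of any query term in the document."""
    query_terms = set(query.split())
    return sum(1 for token in doc.split() if token in query_terms)


query = "lazy dog"
for doc in ("the lazy dog", "the lazy frog", "the lazy lazy lazy lazy cat"):
    print(doc, score_matched_terms(query, doc), score_occurrences(query, doc))
# the lazy dog 2 2
# the lazy frog 1 1
# the lazy lazy lazy lazy cat 1 4
```

Counting occurrences ranks the repetitive third document above the one that matches both terms, which is one reason plain term counting is not enough.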

Slide 59

Scoring: More than term frequency
• How about incorporating information about the whole document corpus in scoring?
• Are less common terms more relevant?
• Newspaper: “dieselgate news”

Slide 60

Scoring: TF-IDF
• Term frequency: number of times a term occurs in a field
• Inverse document frequency: inverse function of the number of documents in which it occurs
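A small worked example of the two ingredients, using the "dieselgate news" query from the previous slide. The formula `idf = ln(N / df)` is one common textbook variant and is an assumption here; search engines use smoothed versions. The tiny corpus is made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Term frequency in one document times a simple inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for tokens in corpus if term in tokens)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0     # assumed variant: ln(N / df)
    return tf * idf


# Hypothetical newspaper corpus: every article contains "news".
corpus = [
    ["dieselgate", "news"],
    ["sports", "news"],
    ["weather", "news"],
]

print(tf_idf("news", corpus[0], corpus))        # 0.0, the term is in every document
print(tf_idf("dieselgate", corpus[0], corpus))  # ln(3), the rare term carries the score
```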

Slide 61

Scoring: Vector space model
• Each term is a dimension
• The length is based on tf-idf calculation
• Similarity is the angle between vectors
• Cosine similarity: best match == angle 0°
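Cosine similarity between two sparse term vectors can be sketched as follows. The vectors are plain dicts from term to weight; the weights in the example are arbitrary numbers rather than real tf-idf values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


query = {"lazy": 1.0, "dog": 1.0}
same_direction = {"lazy": 2.0, "dog": 2.0}  # angle 0 degrees: best match
partial = {"lazy": 1.0, "frog": 1.0}        # shares one dimension with the query
unrelated = {"summer": 3.0}                 # no shared dimension

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, partial))
print(cosine_similarity(query, unrelated))
```

The three results are close to 1.0, close to 0.5 and exactly 0.0. Only the direction of the vectors matters, not their length, so a longer document is not favoured just for being long.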

Slide 62

Scoring: TF-IDF in Lucene
score(q,d) = ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ), summed over all terms t in q

Slide 63

Slide 64

BM25
• Default in Apache Lucene/Elasticsearch
• Works better with stopwords (high TF)
• Term frequency saturation
• Improved field length normalization (per document)
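Term frequency saturation and length normalization are easiest to see in the formula itself. The sketch below implements the textbook BM25 score for a single term with the commonly used defaults k1 = 1.2 and b = 0.75; it is not Lucene's code, and the parameter names are chosen for this example.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    freq: occurrences of the term in the document
    doc_len / avg_doc_len: field length of this document and the corpus average
    doc_count / doc_freq: documents in the corpus and documents containing the term
    """
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * (freq * (k1 + 1)) / (freq + k1 * length_norm)


# Saturation: each additional occurrence adds less than the previous one.
for freq in (1, 2, 10, 100):
    print(freq, round(bm25_term_score(freq, 10, 10, 100, 5), 3))

# Length normalization: the same single match counts less in a longer field.
print(bm25_term_score(1, 100, 10, 100, 5) < bm25_term_score(1, 10, 10, 100, 5))  # True
```

However often the term repeats, its score stays below `idf * (k1 + 1)`, which is why very frequent terms such as stopwords cannot dominate the ranking.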

Slide 65

BM25 https://www.elastic.co/guide/en/elasticsearch/guide/2.x/pluggable-similarites.html

Slide 66

Precision vs. recall

Slide 67

Precision and Recall

Slide 68

Precision and Recall relevant documents irrelevant documents

Slide 69

Precision and Recall relevant documents irrelevant documents

Slide 70

True positives relevant documents irrelevant documents

Slide 71

True negatives relevant documents irrelevant documents

Slide 72

False positives relevant documents irrelevant documents

Slide 73

False negatives relevant documents irrelevant documents

Slide 74

Precision and recall
• Precision: How many selected documents are relevant?
• Recall: How many relevant documents are selected?
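Both measures follow directly from the sets on the preceding slides. A minimal sketch, with made-up document ids:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a result set against the relevant set."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


selected = {1, 2, 3, 4}       # what the search returned (hypothetical)
relevant = {2, 4, 6, 8, 10}   # what should have been returned (hypothetical)

precision, recall = precision_recall(selected, relevant)
print(precision)  # 0.5, two of the four returned documents are relevant
print(recall)     # 0.4, two of the five relevant documents were returned
```

Documents 1 and 3 are the false positives here, and documents 6, 8 and 10 are the false negatives.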

Slide 75

Under the hood

Slide 76

Optimizations everywhere
• leap frogging, skip lists
• top-k
• two phase iterations
• integer compression

Slide 77

Query two phase iteration

Slide 78

Two phase iteration: Phrase query
• Phrase query: “quick fox”
• Approximation phase: document contains terms quick and fox
• Verification phase: read positions of terms
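The two phases of a phrase query can be sketched on top of a positional index. This is a simplified illustration that assumes lowercasing and whitespace tokenization; the function names and the three sample documents are invented for this example.

```python
def build_positional_index(docs):
    """term -> {doc_id -> [positions]}"""
    index = {}
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]

    # Phase 1 (approximation): cheap check that a document contains all terms.
    candidates = set(postings[0])
    for term_postings in postings[1:]:
        candidates &= set(term_postings)

    # Phase 2 (verification): read positions only for the surviving candidates.
    hits = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "the quick fox",
    2: "the fox is quick",
    3: "quick brown fox",
}
index = build_positional_index(docs)
print(phrase_query(index, "quick fox"))  # [1]
```

All three documents pass the approximation phase, but only document 1 has the two terms at adjacent positions.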

Slide 79

Two phase iteration: Geo distance query
• Geo distance query: Distance from reference point
• Approximation phase: bbox around point
• Verification phase: exact distance calculation

Slide 80

Two phase iteration: Geo distance query
GET /my_locations/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "pin.location": { "lat": 40, "lon": -70 }
        }
      }
    }
  }
}
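The bounding box followed by an exact distance check can be sketched as below, using the same reference point and radius as the query above. This is a conceptual illustration, not how Elasticsearch implements `geo_distance`: it assumes a spherical earth, ignores the poles and the antimeridian, and the sample points are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, distance_km):
    """Box that fully contains the circle (not valid near poles or the antimeridian)."""
    angular = distance_km / EARTH_RADIUS_KM
    d_lat = math.degrees(angular)
    d_lon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return lat - d_lat, lat + d_lat, lon - d_lon, lon + d_lon


def geo_distance_query(points, lat, lon, distance_km):
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, distance_km)
    hits = []
    for name, (p_lat, p_lon) in points.items():
        # Phase 1 (approximation): cheap comparison against the bounding box.
        if not (min_lat <= p_lat <= max_lat and min_lon <= p_lon <= max_lon):
            continue
        # Phase 2 (verification): exact distance, only for points inside the box.
        if haversine_km(lat, lon, p_lat, p_lon) <= distance_km:
            hits.append(name)
    return hits


points = {
    "near": (40.5, -70.5),    # well inside the circle
    "corner": (41.7, -72.2),  # inside the box, but outside the circle
    "far": (50.0, -70.0),     # outside the box
}
print(geo_distance_query(points, 40, -70, 200))  # ['near']
```

The "corner" point shows why the verification phase is needed: the box is a superset of the circle, so it lets through points that the exact calculation then rejects.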

Slide 81

Two phase iteration: several queries
• Powerful when several queries are used
• “quick fox” AND brown
• Approximation: quick AND fox AND brown
• Verification: “quick fox” position check for hits

Slide 82

Skip lists & leap frogging

Slide 83

Skip lists
• Term dictionary is a sorted skip list
• Skip list is a linked list with ‘express lanes’ to leap forward
https://en.wikipedia.org/wiki/Skip_list

Slide 84

Leap frogging elasticsearch AND kibana AND logstash

Slide 85

Leap frogging elasticsearch AND kibana AND logstash 266 102 568 98 302 266 60 102 199 18 59 150 5 5 102 1 3 5
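Leap frogging over sorted postings lists can be sketched as follows. Two assumptions to note: a binary search (`bisect_left`) stands in for the skip list that lets a real engine jump ahead, and the document ids are illustrative values chosen for this example, not a reconstruction of the slide's diagram.

```python
from bisect import bisect_left


def advance(postings, start, target):
    """Index of the first entry >= target, searching from start onwards."""
    return bisect_left(postings, target, start)


def leapfrog_intersect(*lists):
    """Intersect sorted lists of integer doc ids by leaping to the largest candidate."""
    if not lists or any(not postings for postings in lists):
        return []
    positions = [0] * len(lists)
    result = []
    candidate = max(postings[0] for postings in lists)
    while True:
        agreed = True
        for i, postings in enumerate(lists):
            positions[i] = advance(postings, positions[i], candidate)
            if positions[i] == len(postings):
                return result  # one list is exhausted, no more matches possible
            if postings[positions[i]] != candidate:
                # This list leapt past the candidate: adopt its value and restart.
                candidate = postings[positions[i]]
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1  # look for the next doc id after the match


# Illustrative postings lists (assumed values).
elasticsearch = [1, 5, 18, 60, 98, 266]
kibana = [3, 5, 59, 102, 266, 302]
logstash = [5, 102, 150, 199, 266, 568]

print(leapfrog_intersect(elasticsearch, kibana, logstash))  # [5, 266]
```

After the match on 5, the candidate jumps 18, 59, 60, 102, 266, so most entries in the three lists are never compared one by one.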