Introduction into Full Text Search

A presentation at HYF Copenhagen by Alexander Reelsen

Full-text search with distributed search engines Alexander Reelsen @spinscale alex@elastic.co

Agenda • Overview • Indexing: Analysis, Tokenization, Filtering, on disk data structures • Searching: Scoring, Algorithms & Optimization • Aggregations • Distributed systems and search •Q&A

Overview Full text search introduction

Why is search so important?

SELECT * FROM products WHERE name LIKE = ‘%topf%’

grep “topf” my_dataset.txt

Problem • Scales linearly with the data set size • Relevancy • Spell correction • Synonyms • Phrases

Inverted Index

The quick brown fox jumped over the lazy dog

Tokenize The quick brown fox jumped over the lazy dog 1 1 1 1 1 1 1 1 1

Sort The brown dog fox jumped lazy over quick the 1 1 1 1 1 1 1 1 1

Quick brown foxes leap over lazy dogs in summer

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy dog

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy AND dog [1,2] AND [1] = [1]

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 lazy OR dog [1,2] OR [1] = [1,2]

Technologies used today • Apache Lucene (search library) • Elasticsearch (distributed search engine built on top of Apache Lucene)

Indexing Analysis, Tokenization, Filtering Data structures

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quick 1 hit

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1 quicK 0 hits

Analysis: Tokenizer & Token Filters

Tokenization

Tokenization quick brown fox

Tokenization quick_brown_fox

Tokenization quick_brown_fox the lazy, white dog.

Tokenization quick_brown_fox the_lazy_white_dog

Tokenization quick_brown_fox the_lazy_white_dog https://unicode.org/reports/tr29/

Tokenization quick_brown_fox the_lazy_white_dog https://www.jade-hs.de

Tokenization quick_brown_fox the_lazy_white_dog https_www.jade_hs.de

Token Filter

Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the 2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1

Token filter The Quick brown fox

Token filter Lowercase The Quick brown fox the quick brown fox

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Tokens can be changed, added, removed

Token filter The Quick brown fox Lowercase Stopwords the quick brown fox quick brown fox Synonyms quick,fast brown fox Queries need to be processed as well!

More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass

(On-Disk) Data structures

What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data

Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!

Read-only data structures • Pro: Write-once, sequentially • Pro: Lock-free reading • Pro: File system cache • Contra: in-place updates & deletes • Contra: Housekeeping • Contra: Transactions

Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed! 1|2|3|4|5 3 6|7|8

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8

Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3 6|7|8 1|2|4|5|6|7|8

Searching Precision vs. recall Scoring Algorithms and optimizations

Relevancy

Relevancy • Textbook answer: How well matches a document a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website

Scoring

Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?

Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are lesser common terms more relevant? • news paper: “dieselgate news”

Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs

Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°

Scoring: TF-IDF in Lucene score(q,d) = ∑ ( tf(t in d) · 2 idf(t) · t.getBoost() · norm(t,d) )

BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)

BM25 https://www.elastic.co/guide/en/elasticsearch/guide/2.x/pluggable-similarites.html

Precision vs. recall

Precision and Recall

Precision and Recall relevant documents irerelevant documents

Precision and Recall relevant documents irerelevant documents

True positives relevant documents irerelevant documents

True negatives relevant documents irerelevant documents

False positives relevant documents irerelevant documents

False negatives relevant documents irerelevant documents

Precision and recall • Precision: How many selected documents are relevant? • Recall: How many relevant documents are selected

Under the hood

Optimizations everywhere • leap frogging, skip lists • top-k • two phase iterations • integer compression

Query two phase iteration

Two phase iteration: Phrase query • Phrase query: “quick fox” • Approximation phase: document contains terms quick and fox • Verification phase: read positions of terms

Two phase iteration: Geo distance query • Geo distance query: Distance from reference point • Approximation phase: bbox around point • Verification phase: exact distance calculation

Two phase iteration: Geo distance query GET /my_locations/_search { “query”: { “bool” : { “filter” : { “geo_distance” : { “distance” : “200km”, “pin.location” : { “lat” : 40, “lon” : -70 } } } } } }

Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits

Skip lists & leap frogging

Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward https://en.wikipedia.org/wiki/Skip_list

Leap frogging elasticsearch AND kibana AND logstash

Leap frogging elasticsearch AND kibana AND logstash 266 102 568 98 302 266 60 102 199 18 59 150 5 5 102 1 3 5

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Leap frogging logstash AND kibana AND elasticsearch 266 Done! 568 102 266 302 98 199 102 60 150 59 18 102 5 5 5 3 1 Hit! Hit!

Top-k retrieval

Top-k retrieval • elasticsearch OR kibana • top 10 results wanted • maximum score for kibana is 3.0 • maximum score for elasticsearch is 5.0 • collecting documents: when 10th hit has score > 3, then only documents with elasticsearch need to be collected • total hit count is not accurate

Top-k retrieval

Index sorting

Order index by field values • each segment is sorted before write • criteria can be chosen by the user 5|2|3|1|4 retrieve sort top 2 5|2|3|1|4 5|4|3|2|1 5|4

Order index by field values • each segment is sorted before write • criteria can be chosen by the user early termination 5|4|3|2|1 5|4

Aggregations Reducing data

Aggregations

aggregations documents

aggregations documents

Bucketing documents Pumps Sneakers Oxfords Sneakers Boots

Bucketing documents Sneakers Pumps Boots Oxfords

Bucketing documents Sneakers Pumps 2 bucket agg 1 metric agg doc_count Boots Oxfords 1 1

Bucketing documents Sneakers Pumps 50 bucket agg 95 metric agg avg price Boots Oxfords 90 23

Aggregations • bucket: terms, histogram, geo, range, sampler , significant text, nested • metric: value_count, avg, min, max, sum, stats, median deviation, geo, percentile, cardinality, • pipeline: min, max, sum, avg, derivative, stats, percentiles, cumulative sum, moving average, moving function, serial differencing

Distributed systems & search Fanning out a search, reducing the results

Elasticsearch

Elasticsearch in 10 seconds • Search Engine (FTS, Analytics, Geo), near real-time • Distributed, scalable, highly available, resilient • Interface: HTTP & JSON • Centrepiece of the Elastic Stack • Uneducated conservative guess: Tens of thousands of clusters worldwide, hundreds of thousands of instances

Distributed systems

Distributed systems • How do nodes communicate with each other? • Who is taking and executing decisions? • Failure detection? • Replication strategy? • Consistency? • Enter consensus algorithms…

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation https://en.wikipedia.org/wiki/Consensus_(computer_science) 122

Consensus algorithms • Leader based: Paxos, Raft • Non leader based: BTC, gossip

Consensus in Elasticsearch • Custom consensus algorithm, improving the existing one • Formally verified • Optimized for Elasticsearch use-case (rolling restarts, growing/shrinking clusters, log-ofoperations vs. cluster state)

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Consensus in Elasticsearch node 1 node 2 node 3 node 4

Master node tasks • Deciding where data should be stored • Pinging other nodes • Reacting on node leaves/joins • Updating cluster state • Distributing cluster state

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs1 node 4 cs1

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Consensus in Elasticsearch node 1 cs1 node 2 cs1 node 3 cs2 node 4 cs1

Consensus in Elasticsearch node 1 cs1 node 2 cs2 node 3 cs2 node 4 cs2

Distributed search

Distributed search in Elasticsearch node 1 node 2 node 3 p0 Primary Shard (Lucene Index) node 4

Distributed search in Elasticsearch node 1 p0 node 2 r0 Replica shard (copy) node 3 node 4

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1 2nd Primary Shard Replica shard (copy)

Distributed search in Elasticsearch • Shard: Lucene index, unit of scale • Primary shard: Write scalability • Replica shard: Read scalability, availability

Distributed search in Elasticsearch node 1 p0 node 2 r0 node 3 node 4 p1 r1
Client connects any node with search request

Distributed search in Elasticsearch 2. Execute query against shards node 1 p0 node 2 r0 node 3 node 4 p1 r1
Client connects any node with search request

Distributed search in Elasticsearch 3. top-k search results are returned to coordinating node node 1 p0 node 2 r0 node 3 node 4 p1 r1
Client connects any node with search request

Distributed search in Elasticsearch 4. Create real top-k result list node 1 p0 node 2 r0 node 3 node 4 p1 r1
Client connects any node with search request

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1
Client connects any node with search request

Distributed search in Elasticsearch 5. Fetch original documents node 1 p0 node 2 r0 node 3 node 4 p1 r1 6. Return data to the client

Aggregations

Aggregations - cardinality node 1 p0 node 2 p1 POST /sales/_search?size=0 { “aggs” : { “type_count” : { “cardinality” : { “field” : “type” } } } }

Aggregations - cardinality node 1 p0 25 node 2 p1 40 How many distinct elements are in my index? What is the total? 40? 65? Naive solution: merge data to single dataset and count. Doesn’t scale! Solution: Use HyperLogLog++

HyperLogLog++ • Hash based counting • Trades in memory for accuracy • Fixed memory usage, based on configurable precision • Result: Small mergeable data structure, can easily be sent over the network

Aggregations - percentile node 1 p0 node 2 p1 GET latency/_search { “size”: 0, “aggs” : { “load_time_outlier” : { “percentiles” : { “field” : “load_time” } } } }

T-Digest • Extreme percentiles are more accurate than the Median • Percentiles are divided into buckets • When buckets grow over a boundary, approximation kicks in, saving memory in the process • The exact level of inaccuracy is difficult to generalize • Alternative: HDR histograms

Probabilistic data structures • bloom/cuckoo/quotient filters (membership check) • HyperLogLog++ (cardinality) • T-Digest, DDSketch, HDR histogram (percentile) • Count-Min sketch (frequency, top-k) • Hashing (similarity)

Demo Try it out yourself! https://ela.st/jade-hochschule-samples

Elastic Cloud Free 30 day trial https://ela.st/hack-your-future-2020

Upcoming trends & summary … or why you should take a closer look at search

Search is not just google… • “Just google it” does not cut it • Enterprise search: Intranet/G-Drive/Dropbox • Ecommerce search • SIEM • Observability: Logging, APM & Metrics

Search is not ‘done’ • Constant improvement • Data structures & algorithms (BKD tree for geo shapes) • Academic research moves to industry thanks to Apache Lucene

Search is still tough • Language specific analysis • Smart query parsing (nike red hoodie xl) • Geo based search • Anomaly detection • Incoporating feedback loops

Upcoming trends • Learning-to-Rank • Deep Learning • Feedback loop

Summary • Everything is a search problem! • Search is hard… and interesting • Distributed systems are hard… and interesting • Domain knowledge required • Data keeps exploding, good job chances!

Literature Books, books, books

Resources Links, links, links

Links https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/search/similarities/ TFIDFSimilarity.html https://www.elastic.co/blog/whats-new-in-lucene-8 https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand https://speakerdeck.com/elastic/amusing-algorithms-and-data-structures https://www.elastic.co/blog/index-sorting-elasticsearch-6-0 https://raft.github.io/ https://github.com/elastic/elasticsearch-formal-models https://gist.github.com/spinscale/b62c8b357fae7db3f14b7d3127758951

Links - probabilistic data structures https://github.com/addthis/stream-lib https://github.com/DataDog/sketches-java https://github.com/HdrHistogram/HdrHistogram https://github.com/JohnStarich/java-skip-list https://github.com/addthis/stream-lib https://static.googleusercontent.com/media/research.google.com/fr/ pubs/archive/40671.pdf

Q&A Alexander Reelsen alex@elastic.co @spinscale

Alexander Reelsen
@spinscale

1 / 166

This presentation introduces the basic concept behind full-text search aka an inverted index, analysis, tokenization, token filtering, explains common terms like precision & recall, the basics of scoring, deep dives into some Lucene optimizations, and then moves on to aggregations and distributed systems.