Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the
2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1
Slide 40
Token filter The Quick brown fox
Slide 41
Token filter Lowercase
The Quick brown fox
the quick brown fox
Slide 42
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Slide 43
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Slide 44
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Tokens can be changed, added, removed
Slide 45
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Queries need to be processed as well!
Slide 46
More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass
Slide 47
(On-Disk) Data structures
Slide 48
What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data
Slide 49
Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!
Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed! 1|2|3|4|5 3
6|7|8
Slide 52
Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3
6|7|8
Slide 53
Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3
6|7|8
1|2|4|5|6|7|8
Slide 54
Searching Precision vs. recall Scoring Algorithms and optimizations
Slide 55
Relevancy
Slide 56
Relevancy • Textbook answer: How well matches a document a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website
Slide 57
Scoring
Slide 58
Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?
Slide 59
Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are lesser common terms more relevant? • news paper: “dieselgate news”
Slide 60
Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs
Slide 61
Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°
Slide 62
Scoring: TF-IDF in Lucene
score(q,d) = ∑ ( tf(t in d) ·
2 idf(t)
· t.getBoost() · norm(t,d) )
Slide 63
Slide 64
BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)
Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits
Slide 82
Skip lists & leap frogging
Slide 83
Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward
https://en.wikipedia.org/wiki/Skip_list
Slide 84
Leap frogging elasticsearch AND kibana AND logstash
Top-k retrieval • elasticsearch OR kibana • top 10 results wanted • maximum score for kibana is 3.0 • maximum score for elasticsearch is 5.0 • collecting documents: when 10th hit has score > 3, then only documents with elasticsearch need to be collected • total hit count is not accurate
Slide 103
Top-k retrieval
Slide 104
Index sorting
Slide 105
Order index by field values • each segment is sorted before write • criteria can be chosen by the user
5|2|3|1|4
retrieve
sort
top 2
5|2|3|1|4
5|4|3|2|1
5|4
Slide 106
Order index by field values • each segment is sorted before write • criteria can be chosen by the user early termination 5|4|3|2|1
5|4
Distributed systems & search Fanning out a search, reducing the results
Slide 117
Elasticsearch
Slide 118
Slide 119
Elasticsearch in 10 seconds • Search Engine (FTS, Analytics, Geo), near real-time • Distributed, scalable, highly available, resilient • Interface: HTTP & JSON • Centrepiece of the Elastic Stack • Uneducated conservative guess: Tens of thousands of clusters worldwide, hundreds of thousands of instances
Slide 120
Distributed systems
Slide 121
Distributed systems • How do nodes communicate with each other? • Who is taking and executing decisions? • Failure detection? • Replication strategy? • Consistency? • Enter consensus algorithms…
Slide 122
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation https://en.wikipedia.org/wiki/Consensus_(computer_science) 122
Master node tasks • Deciding where data should be stored • Pinging other nodes • Reacting on node leaves/joins • Updating cluster state • Distributing cluster state
Aggregations - cardinality
node 1
p0
25
node 2
p1
40
How many distinct elements are in my index? What is the total? 40? 65? Naive solution: merge data to single dataset and count. Doesn’t scale! Solution: Use HyperLogLog++
Slide 149
HyperLogLog++ • Hash based counting • Trades in memory for accuracy • Fixed memory usage, based on configurable precision • Result: Small mergeable data structure, can easily be sent over the network
T-Digest • Extreme percentiles are more accurate than the Median • Percentiles are divided into buckets • When buckets grow over a boundary, approximation kicks in, saving memory in the process • The exact level of inaccuracy is difficult to generalize • Alternative: HDR histograms
Demo Try it out yourself! https://ela.st/jade-hochschule-samples
Slide 154
Elastic Cloud Free 30 day trial
https://ela.st/hack-your-future-2020
Slide 155
Upcoming trends & summary … or why you should take a closer look at search
Slide 156
Search is not just google… • “Just google it” does not cut it • Enterprise search: Intranet/G-Drive/Dropbox • Ecommerce search • SIEM • Observability: Logging, APM & Metrics
Slide 157
Search is not ‘done’ • Constant improvement • Data structures & algorithms (BKD tree for geo shapes) • Academic research moves to industry thanks to Apache Lucene
Slide 158
Search is still tough • Language specific analysis • Smart query parsing (nike red hoodie xl) • Geo based search • Anomaly detection • Incoporating feedback loops
Slide 159
Upcoming trends • Learning-to-Rank • Deep Learning • Feedback loop
Slide 160
Summary • Everything is a search problem! • Search is hard… and interesting • Distributed systems are hard… and interesting • Domain knowledge required • Data keeps exploding, good job chances!