Quick The brown dog dogs fox foxes in jumped lazy leap over quick summer the
2 1 1,2 1 2 1 2 2 1 1,2 2 1,2 1 2 1
Slide 31
Token filter The Quick brown fox
Slide 32
Token filter Lowercase
The Quick brown fox
the quick brown fox
Slide 33
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Slide 34
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Slide 35
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Tokens can be changed, added, removed
Slide 36
Token filter The Quick brown fox
Lowercase
Stopwords
the quick brown fox
quick brown fox
Synonyms
quick,fast brown fox
Queries need to be processed as well!
Slide 37
More analysis strategies • Phonetic analysis: Meyer vs. Meier • Stemming: foxes ⇾ fox • Compounding: Blumentopf ⇾ blumen topf • Folding: Spaß ⇾ Spass
Slide 38
(On-Disk) Data structures
Slide 39
What else is in an inverted index? • Documents: Find documents • Term frequencies: Relevancy • Positions: Positional Queries • Offsets: Highlighting • Stored fields: The original data
Slide 40
Segment: Unit of work • A fully self sufficient inverted index • An index consists of a number of segments • New segments are created for newly added documents • Segments are immutable!
Segment: Deletes • Mark a document as deleted in a special file • Exclude it from searches • No space is freed! 1|2|3|4|5 3
6|7|8
Slide 43
Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3
6|7|8
Slide 44
Segment: Merging • Number of segments needs to be kept reasonable • Merge multiple segments into one (smaller index) • Delete expired documents 1|2|3|4|5 3
6|7|8
1|2|4|5|6|7|8
Slide 45
Searching Precision vs. recall Scoring Algorithms and optimizations
Slide 46
Relevancy
Slide 47
Relevancy • Textbook answer: How well matches a document a query? • Business answer: Are the top search results those that make me the most money? • marketplace • hotel booking website • newspaper website
Slide 48
Scoring
Slide 49
Scoring: lazy dog • Naive: increase a counter if a term is matched • “the lazy dog” => score 2 • “the lazy frog” => score 1 • “the lazy lazy lazy lazy cat” => score 4 or 1?
Slide 50
Scoring: More than term frequency • How about incorporating information about the whole document corpus in scoring? • Are lesser common terms more relevant? • news paper: “dieselgate news”
Slide 51
Scoring: TF-IDF • Term frequency: number of times a term occurs in a field • Inverse document frequency: inverse function of the number of documents in which it occurs
Slide 52
Scoring: Vector space model • Each term is a dimension • The length is based on tf-idf calculation • Similarity is the angle between vectors • Cosine similarity: best match == angle 0°
Slide 53
Scoring: TF-IDF in Lucene
score(q,d) = ∑ ( tf(t in d) ·
2 idf(t)
· t.getBoost() · norm(t,d) )
Slide 54
Slide 55
BM25 • Default in Apache Lucene/Elasticsearch • Works better with stopwords (high TF) • Term frequency saturation • Improved field length normalization (per document)
Two phase iteration: several queries • Powerful when several queries are used • “quick fox” AND brown • Approximation: quick AND fox AND brown • Verification: “quick fox” position check for hits
Slide 73
Skip lists & leap frogging
Slide 74
Skip lists • Term dictionary is a sorted skip list • Skip list is a linked list with ‘express lanes’ to leap forward
https://en.wikipedia.org/wiki/Skip_list
Slide 75
Leap frogging elasticsearch AND kibana AND logstash
Top-k retrieval • elasticsearch OR kibana • top 10 results wanted • maximum score for kibana is 3.0 • maximum score for elasticsearch is 5.0 • collecting documents: when 10th hit has score > 3, then only documents with elasticsearch need to be collected • total hit count is not accurate
Slide 94
Top-k retrieval
Slide 95
Index sorting
Slide 96
Order index by field values • each segment is sorted before write • criteria can be chosen by the user
5|2|3|1|4
retrieve
sort
top 2
5|2|3|1|4
5|4|3|2|1
5|4
Slide 97
Order index by field values • each segment is sorted before write • criteria can be chosen by the user early termination 5|4|3|2|1
5|4
Distributed systems & search Fanning out a search, reducing the results
Slide 108
Elasticsearch
Slide 109
Slide 110
Elasticsearch in 10 seconds • Search Engine (FTS, Analytics, Geo), near real-time • Distributed, scalable, highly available, resilient • Interface: HTTP & JSON • Centrepiece of the Elastic Stack • Uneducated conservative guess: Tens of thousands of clusters worldwide, hundreds of thousands of instances
Slide 111
Distributed systems
Slide 112
Distributed systems • How do nodes communicate with each other? • Who is taking and executing decisions? • Failure detection? • Replication strategy? • Consistency? • Enter consensus algorithms…
Slide 113
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation https://en.wikipedia.org/wiki/Consensus_(computer_science) 113
Master node tasks • Deciding where data should be stored • Pinging other nodes • Reacting on node leaves/joins • Updating cluster state • Distributing cluster state
Aggregations - cardinality
node 1
p0
25
node 2
p1
40
How many distinct elements are in my index? What is the total? 40? 65? Naive solution: merge data to single dataset and count. Doesn’t scale! Solution: Use HyperLogLog++
Slide 140
HyperLogLog++ • Hash based counting • Trades in memory for accuracy • Fixed memory usage, based on configurable precision • Result: Small mergeable data structure, can easily be sent over the network
T-Digest • Extreme percentiles are more accurate than the Median • Percentiles are divided into buckets • When buckets grow over a boundary, approximation kicks in, saving memory in the process • The exact level of inaccuracy is difficult to generalize • Alternative: HDR histograms
Demo Try it out yourself! https://ela.st/jade-hochschule-samples
Slide 145
Elastic Cloud Free 30 day trial
https://ela.st/university-wilhelmshaven
Slide 146
Upcoming trends & summary … or why you should take a closer look at search
Slide 147
Search is not just google… • “Just google it” does not cut it • Enterprise search: Intranet/G-Drive/Dropbox • Ecommerce search • SIEM • Observability: Logging, APM & Metrics
Slide 148
Search is not ‘done’ • Constant improvement • Data structures & algorithms (BKD tree for geo shapes) • Academic research moves to industry thanks to Apache Lucene
Slide 149
Search is still tough • Language specific analysis • Smart query parsing (nike red hoodie xl) • Geo based search • Anomaly detection • Incoporating feedback loops
Slide 150
Upcoming trends • Learning-to-Rank • Deep Learning • Feedback loop
Slide 151
Summary • Everything is a search problem! • Search is hard… and interesting • Distributed systems are hard… and interesting • Domain knowledge required • Data keeps exploding, good job chances!