Digging deeper into full-text search with Elasticsearch

A presentation at JUG Augsburg in March 2020 by Alexander Reelsen

Slide 1

Elasticsearch - Digging deeper into full text search
Alexander Reelsen, Community Advocate
alex@elastic.co | @spinscale

Slide 2

TOC
- How to run the Elastic Stack
- Datatypes: range, date_nanos, search-as-you-type, flattened
- Search: Field collapsing/top_hits, distance_feature query, vector search, phonetic search
- Processors: dissect, enrich, inference
- Index lifecycle management

Slide 3

Elastic Stack

Slide 4

Elasticsearch in 10 seconds
- Search Engine (FTS, Analytics, Geo), near real-time
- Distributed, scalable, highly available, resilient
- Interface: HTTP & JSON
- Heart of the Elastic Stack (Kibana, Logstash, Beats)

Slide 5

Installation & Start

# https://www.elastic.co/downloads/elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.1-darwin-x86_64.tar.gz
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.1-linux-x86_64.tar.gz
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.1-windows-x86_64.zip
tar zxf elasticsearch-7.6.1-darwin-x86_64.tar.gz
cd elasticsearch-7.6.1
./bin/elasticsearch-plugin install analysis-phonetic
./bin/elasticsearch

# run the following in a second terminal, Elasticsearch stays in the foreground
wget https://artifacts.elastic.co/downloads/kibana/kibana-7.6.1-darwin-x86_64.tar.gz
# wget https://artifacts.elastic.co/downloads/kibana/kibana-7.6.1-linux-x86_64.tar.gz
# wget https://artifacts.elastic.co/downloads/kibana/kibana-7.6.1-windows-x86_64.zip
tar zxf kibana-7.6.1-darwin-x86_64.tar.gz
# the Kibana archive unpacks into a directory that includes the platform name
cd kibana-7.6.1-darwin-x86_64
./bin/kibana

Point your browser to http://localhost:5601/

Slide 6

Click Dev-Tools
- Samples in Kibana
- Samples in GitHub

Slide 7

Slide 8

Datatypes

Slide 9

range datatypes
- Search in ranges
- Supported value types: integer, float, long, double, date, ip (for example date_range)
- Example: Model hotel room availabilities

Slide 10

date_range datatype

# date_range datatype
PUT range_index
{
  "mappings": {
    "properties": {
      "time_frame": { "type": "date_range" }
    }
  }
}

PUT range_index/_doc/hotel_1_room_404
{
  "time_frame" : [
    { "gte" : "2015-10-31", "lte" : "2015-11-02" },
    { "gte" : "2015-11-04", "lte" : "2015-11-05" }
  ]
}

Slide 11

Search

GET range_index/_search
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2015-11-04",
        "lte": "2015-11-05",
        "relation": "contains"
      }
    }
  }
}
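What the relation parameter of the range query decides can be sketched in plain Python. This is a minimal model, not how Elasticsearch evaluates ranges internally; the function name matches and the tuple layout (gte, lte) are made up for illustration.

```python
from datetime import date

def matches(doc_range, query_range, relation="intersects"):
    """Compare one stored (gte, lte) range with a query (gte, lte) range."""
    d_lo, d_hi = doc_range
    q_lo, q_hi = query_range
    if relation == "contains":   # stored range fully covers the query range
        return d_lo <= q_lo and q_hi <= d_hi
    if relation == "within":     # stored range lies inside the query range
        return q_lo <= d_lo and d_hi <= q_hi
    return d_lo <= q_hi and q_lo <= d_hi  # intersects: any overlap is enough

# the two time frames stored for hotel_1_room_404 on the previous slide
time_frames = [
    (date(2015, 10, 31), date(2015, 11, 2)),
    (date(2015, 11, 4), date(2015, 11, 5)),
]

query = (date(2015, 11, 4), date(2015, 11, 5))
hit = any(matches(tf, query, "contains") for tf in time_frames)

# a query spanning the gap between both time frames overlaps, but is not contained
gap_query = (date(2015, 11, 2), date(2015, 11, 4))
gap_contains = any(matches(tf, gap_query, "contains") for tf in time_frames)
gap_intersects = any(matches(tf, gap_query, "intersects") for tf in time_frames)
```

A document with several ranges matches as soon as one of its ranges satisfies the relation, which is why any() is used here.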

Slide 12

date_nanos datatype

# date_nanos datatype
PUT nanos_index
{
  "mappings": {
    "properties": {
      "time_in_nanos": {
        "type": "date_nanos",
        "format" : "yyyy-MM-dd'T'HH:mm:ss.nX"
      }
    }
  }
}

PUT nanos_index/_bulk?refresh
{ "index" : {}}
{"time_in_nanos":"2019-12-31T23:59:59.999999999Z"}
{ "index" : {}}
{"time_in_nanos":"2019-12-31T23:59:59.999Z"}

Slide 13

date_nanos datatype

GET nanos_index/_search
{
  "query": {
    "range": {
      "time_in_nanos": { "gt": "2019-12-31T23:59:59.999Z" }
    }
  }
}

Slide 14

search-as-you-type datatype
- One of the most requested features
- Very fast, but requires maintenance: completion suggester
- Since Elasticsearch 7.0: Lucene Block-Max WAND

Slide 15

search-as-you-type datatype

# search_as_you_type datatype
PUT search_index
{
  "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type" }
    }
  }
}

Slide 16

search-as-you-type datatype

PUT search_index/_bulk?refresh
{ "index" : {} }
{ "title" : "This is it!" }
{ "index" : {} }
{ "title" : "This or that?" }
{ "index" : {} }
{ "title" : "Thin or thick?" }
{ "index" : {} }
{ "title" : "This is eval!" }
{ "index" : {} }
{ "title" : "Thick is not sick" }

Slide 17

search-as-you-type datatype

GET search_index/_search
{
  "query": {
    "match_phrase_prefix": { "title": "thi" }
  }
}

GET search_index/_search
{
  "query": {
    "match_phrase_prefix": { "title": "this i" }
  }
}

GET search_index/_search
{
  "query": {
    "match_phrase_prefix": { "title": "this is e" }
  }
}

Slide 18

search-as-you-type datatype

# no need for terms to be next to each other
GET search_index/_search
{
  "query": {
    "multi_match": {
      "query": "thick s",
      "type": "bool_prefix",
      "operator": "and",
      "fields": [ "title", "title._2gram", "title._3gram" ]
    }
  }
}
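The matching idea behind bool_prefix with operator "and" can be modelled roughly like this: every query term except the last must occur as a whole word, the last one only as a prefix. The tokenizer and the function bool_prefix_and are simplifications invented for this sketch; the real field also uses the _2gram and _3gram subfields for scoring.

```python
import re

def tokens(text):
    """Very rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def bool_prefix_and(query, title):
    """All terms but the last must match a word, the last one is a prefix."""
    *terms, last = tokens(query)
    words = tokens(title)
    return all(t in words for t in terms) and any(w.startswith(last) for w in words)

titles = [
    "This is it!",
    "This or that?",
    "Thin or thick?",
    "This is eval!",
    "Thick is not sick",
]

hits = [t for t in titles if bool_prefix_and("thick s", t)]
```

"Thin or thick?" contains thick but no word starting with s, so only "Thick is not sick" is left, even though thick and sick are not adjacent.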

Slide 19

flattened datatype
- Maps an entire object as a single field
- Prevents mapping explosion
- Allows only for some basic queries
- Searching: Think of a specialized keyword datatype

Slide 20

flattened datatype

# flattened datatype
PUT bug_reports
{
  "mappings": {
    "properties": {
      "labels": { "type": "flattened" }
    }
  }
}

Slide 21

flattened datatype

POST bug_reports/_doc/1?refresh
{
  "title": "Results are not sorted correctly.",
  "labels": {
    "priority": "urgent",
    "release": ["v1.2.5", "v1.3.0"],
    "timestamp": {
      "created": 1541458026,
      "closed": 1541457010
    }
  }
}

Slide 22

flattened datatype

POST bug_reports/_search
{
  "query": {
    "term": {"labels": "urgent"}
  }
}

POST bug_reports/_search
{
  "query": {
    "term": {"labels.release": "v1.3.0"}
  }
}
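Why both term queries hit can be illustrated with a small model: every leaf value of the object is treated as a keyword, searchable either through the root field or through its dotted key. The helpers flatten and term are hypothetical names for this sketch and say nothing about the actual index layout.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_key, value_as_string) for every leaf of a nested object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, prefix)
    else:
        yield prefix, str(obj)  # leaves are handled like keywords, numbers included

labels = {
    "priority": "urgent",
    "release": ["v1.2.5", "v1.3.0"],
    "timestamp": {"created": 1541458026, "closed": 1541457010},
}
leaves = list(flatten(labels, "labels"))

def term(field, value):
    """Root field: match any leaf value. Dotted key: match only that key."""
    if field == "labels":
        return any(v == value for _, v in leaves)
    return (field, value) in leaves
```

Since all values end up as keywords in this model, numeric range queries on labels.timestamp.created would compare strings, which is one reason only basic queries are available.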

Slide 23

Searching

Slide 24

Bulk indexing

# index some book data to play around with
PUT books/_bulk
{ "index" : { "_id" : "database-internals" } }
{"isbn13":"978-1492040347","author":"Alexander Petrov","title":"Database Internals: A deep-dive into how distributed data systems work","publisher":"O'Reilly","category":["databases","information systems"],"pages":350,"price":47.28,"format":"paperback","rating":4.5}
{ "index" : { "_id" : "designing-data-intensive-applications" } }
{"isbn13":"978-1449373320","author":"Martin Kleppmann","title":"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems","publisher":"O'Reilly","category":["databases"],"pages":590,"price":31.06,"format":"paperback","rating":4.4}
{ "index" : { "_id" : "kafka-the-definitive-guide" } }
{"isbn13":"978-1491936160","author":["Neha Narkhede","Gwen Shapira","Todd Palino"],"title":"Kafka: The Definitive Guide: Real-time data and stream processing at scale","publisher":"O'Reilly","category":["databases"],"pages":297,"price":37.31,"format":"paperback","rating":3.9}
{ "index" : { "_id" : "effective-java" } }
{"isbn13":"978-1491936160","author":"Joshua Bloch","title":"Effective Java","publisher":"Addison-Wesley","category":["programming languages","java"],"pages":412,"price":27.91,"format":"paperback","rating":4.2}
{ "index" : { "_id" : "daemon" } }
{"isbn13":"978-1847249616","author":"Daniel Suarez","title":"Daemon","publisher":"Quercus","category":["dystopia","novel"],"pages":448,"price":12.03,"format":"paperback","rating":4.0}
{ "index" : { "_id" : "cryptonomicon" } }
{"isbn13":"978-1847249616","author":"Neal Stephenson","title":"Cryptonomicon","publisher":"Avon","category":["thriller","novel"],"pages":1152,"price":6.99,"format":"paperback","rating":4.0}
{ "index" : { "_id" : "garbage-collection-handbook" } }
{"isbn13":"978-1420082791","author":["Richard Jones","Antony Hosking","Eliot Moss"],"title":"The Garbage Collection Handbook: The Art of Automatic Memory Management","publisher":"Taylor & Francis","category":["programming algorithms"],"pages":511,"price":87.85,"format":"paperback","rating":5.0}
{ "index" : { "_id" : "radical-candor" } }
{"isbn13":"978-1250258403","author":"Kim Scott","title":"Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity","publisher":"Macmillan","category":["human resources","management","new work"],"pages":404,"price":7.29,"format":"paperback","rating":4.0}
{ "index" : { "_id" : "never-split-the-difference" } }
{"isbn13":"978-1847941497","author":"Chris Voss","title":"Never Split the Difference: Negotiating as if Your Life Depended on It","publisher":"Random House Business","category":["negotiation","sales"],"pages":288,"price":10.49,"format":"paperback","rating":4.3}
{ "index" : { "_id" : "not-giving-a-fsck" } }
{"isbn13":"978-0062641540","author":"Mark Manson","title":"The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life","publisher":"Harper","category":["success","motivation"],"pages":224,"price":12.99,"format":"paperback","rating":4.4}
{ "index" : { "_id" : "permanent-record" } }
{"isbn13":"978-1250756541","author":"Edward Snowden","title":"Permanent Record","publisher":"Macmillan","category":["politics","biography"],"pages":339,"price":12.99,"format":"paperback","rating":4.7}

Slide 25

Field Collapsing

# field collapsing
GET books/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "category.keyword": "novel" } }
      ]
    }
  },
  "collapse": { "field": "publisher.keyword" },
  "sort": [
    { "rating": { "order": "desc" } }
  ]
}
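The effect of the collapse request on the result list, sketched with a handful of the books from the bulk request: sort by rating, then keep only the first hit per publisher. The function collapse is an illustration written for this sketch and not the Elasticsearch implementation.

```python
books = [
    {"title": "Database Internals", "publisher": "O'Reilly",
     "category": ["databases", "information systems"], "rating": 4.5},
    {"title": "Designing Data-Intensive Applications", "publisher": "O'Reilly",
     "category": ["databases"], "rating": 4.4},
    {"title": "Daemon", "publisher": "Quercus",
     "category": ["dystopia", "novel"], "rating": 4.0},
    {"title": "Radical Candor", "publisher": "Macmillan",
     "category": ["human resources", "management", "new work"], "rating": 4.0},
    {"title": "Permanent Record", "publisher": "Macmillan",
     "category": ["politics", "biography"], "rating": 4.7},
]

def collapse(docs, field, sort_key):
    """Keep only the best-sorted document for each distinct value of field."""
    seen, result = set(), []
    for doc in sorted(docs, key=lambda d: d[sort_key], reverse=True):
        if doc[field] not in seen:
            seen.add(doc[field])
            result.append(doc)
    return result

# must_not term category=novel, then collapse on publisher sorted by rating
non_novels = [b for b in books if "novel" not in b["category"]]
top_per_publisher = collapse(non_novels, "publisher", "rating")
titles = [b["title"] for b in top_per_publisher]
```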

Slide 26

top_hits Aggregation

# top_hits aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "by_format": {
      "terms": { "field": "format.keyword" },
      "aggs": {
        "by_rating": {
          "top_hits": {
            "size": 1,
            "sort": [ { "rating": "desc" } ]
          }
        }
      }
    }
  }
}

Slide 27

distance_feature datatype & query

# Add a new release_year field
PUT books/_mapping
{
  "properties" : {
    "release_year" : {
      "type" : "date",
      "format" : "strict_year"
    }
  }
}

Slide 28

distance_feature datatype & query

# update release_year of all books
PUT books/_bulk
{ "update" : { "_id" : "database-internals" } }
{ "doc" : { "release_year" : "2019" } }
{ "update" : { "_id" : "designing-data-intensive-applications" } }
{ "doc" : { "release_year" : "2017" } }
{ "update" : { "_id" : "kafka-the-definitive-guide" } }
{ "doc" : { "release_year" : "2017" } }
{ "update" : { "_id" : "effective-java" } }
{ "doc" : { "release_year" : "2017" } }
{ "update" : { "_id" : "daemon" } }
{ "doc" : { "release_year" : "2011" } }
{ "update" : { "_id" : "cryptonomicon" } }
{ "doc" : { "release_year" : "2002" } }
{ "update" : { "_id" : "garbage-collection-handbook" } }
{ "doc" : { "release_year" : "2011" } }
{ "update" : { "_id" : "radical-candor" } }
{ "doc" : { "release_year" : "2018" } }
{ "update" : { "_id" : "never-split-the-difference" } }
{ "doc" : { "release_year" : "2017" } }
{ "update" : { "_id" : "not-giving-a-fsck" } }
{ "doc" : { "release_year" : "2016" } }
{ "update" : { "_id" : "permanent-record" } }
{ "doc" : { "release_year" : "2019" } }

Slide 29

distance_feature datatype & query

# newer books are more relevant
# like function score, but waaaay faster
GET /books/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": { "pages": { "gte": 500 } }
      },
      "should": {
        "distance_feature": {
          "field": "release_year",
          "pivot": "2555d",
          "origin": "now"
        }
      }
    }
  }
}
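How pivot shapes the score, as far as the reference documentation describes it: the score is boost * pivot / (pivot + distance), so a document exactly one pivot away from the origin gets half the boost. A small sketch with distances in days; the function name and the default boost of 1.0 are assumptions for illustration.

```python
def distance_feature_score(distance_days, pivot_days=2555, boost=1.0):
    """Score decays with distance; at distance == pivot it is boost / 2."""
    return boost * pivot_days / (pivot_days + distance_days)

at_origin = distance_feature_score(0)       # released today
at_pivot = distance_feature_score(2555)     # released about seven years ago
one_year = distance_feature_score(365)
ten_years = distance_feature_score(3650)
```

With a pivot of 2555d, a one year old book still scores close to the maximum, while a ten year old book drops below half of it.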

Slide 30

Vector based scoring
- Scoring based on features
- dense_vector datatype
- Vectors can be used for scoring using vector field functions
- Query vector required

Slide 31

Feature modelling
- Prefers long books: range 0-10
- Prefers well-rated books: range 0-5
- Prefers cheaper books: range 0-10 (inverse, 0 is more than 100 EUR, 10 is less than 10 EUR)

Slide 32

Mapping update

# add vector field
PUT books/_mapping
{
  "properties": {
    "vector_recommendation": {
      "type": "dense_vector",
      "dims": 3
    }
  }
}

Slide 33

Add vectors to documents

# Add a vector for each document
PUT books/_bulk
{ "update" : { "_id" : "database-internals" } }
{ "doc" : { "vector_recommendation" : [3.5, 4.5, 5.2] } }
{ "update" : { "_id" : "designing-data-intensive-applications" } }
{ "doc" : { "vector_recommendation" : [5.9, 4.4, 6.8] } }
{ "update" : { "_id" : "kafka-the-definitive-guide" } }
{ "doc" : { "vector_recommendation" : [2.97, 3.9, 6.2] } }
{ "update" : { "_id" : "effective-java" } }
{ "doc" : { "vector_recommendation" : [4.12, 4.2, 7.2] } }
{ "update" : { "_id" : "daemon" } }
{ "doc" : { "vector_recommendation" : [4.48, 4.0, 8.7] } }
{ "update" : { "_id" : "cryptonomicon" } }
{ "doc" : { "vector_recommendation" : [10.0, 4.0, 9.3] } }
{ "update" : { "_id" : "garbage-collection-handbook" } }
{ "doc" : { "vector_recommendation" : [5.1, 5.0, 1.3] } }
{ "update" : { "_id" : "radical-candor" } }
{ "doc" : { "vector_recommendation" : [4.0, 4.0, 9.2] } }
{ "update" : { "_id" : "never-split-the-difference" } }
{ "doc" : { "vector_recommendation" : [4.0, 4.3, 8.9] } }
{ "update" : { "_id" : "not-giving-a-fsck" } }
{ "doc" : { "vector_recommendation" : [2.8, 4.4, 8.9] } }
{ "update" : { "_id" : "permanent-record" } }
{ "doc" : { "vector_recommendation" : [3.3, 4.7, 8.7] } }

Slide 34

Search for short, cheap books with a good rating

# short, good rating, cheap
GET books/_search
{
  "query": {
    "script_score": {
      "query" : { "match_all" : {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector_recommendation') + 1.0",
        "params": {
          "query_vector": [1.0, 5.0, 10.0]
        }
      }
    }
  }
}

Slide 35

Search for long books with a good rating at any price

# long, good rating, any price
GET books/_search
{
  "query": {
    "script_score": {
      "query" : { "match_all" : {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector_recommendation') + 1.0",
        "params": {
          "query_vector": [10.0, 5.0, 5.0]
        }
      }
    }
  }
}
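To see why the two query vectors rank the books differently, the same score can be computed by hand. This sketch reimplements cosine similarity in plain Python for three of the vectors from the bulk request; the helper names are made up, and the + 1.0 mirrors the script, which keeps scores non-negative.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of both vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# [length, rating, price] features, copied from the slides
vectors = {
    "cryptonomicon": [10.0, 4.0, 9.3],
    "garbage-collection-handbook": [5.1, 5.0, 1.3],
    "not-giving-a-fsck": [2.8, 4.4, 8.9],
}

def rank(query_vector):
    """Return document ids ordered by cosineSimilarity + 1.0, best first."""
    scores = {doc_id: cosine_similarity(query_vector, vec) + 1.0
              for doc_id, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

short_cheap = rank([1.0, 5.0, 10.0])   # short, good rating, cheap
long_any = rank([10.0, 5.0, 5.0])      # long, good rating, any price
```

Cosine similarity only compares directions, so a vector is judged by the ratio between its features and not by their absolute size.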

Slide 36

Phonetic search Find similar terms by converting terms to their phonetic representation

Slide 37

phonetic mapping

PUT phonetic_sample
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "metaphone": { "type": "text", "analyzer": "metaphone_analyzer" },
          "koelner": { "type": "text", "analyzer": "koelner_analyzer" },
          "soundex": { "type": "text", "analyzer": "soundex_analyzer" }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "metaphone_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "phonetic_filter" ]
          },
          "soundex_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "soundex_filter" ]
          },
          "koelner_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "koelner_filter" ]
          }
        },
        "filter": {
          "phonetic_filter": { "type": "phonetic", "encoder": "metaphone", "replace": false },
          "soundex_filter": { "type": "phonetic", "encoder": "soundex", "replace": false },
          "koelner_filter": { "type": "phonetic", "encoder": "koelnerphonetik", "replace": false }
        }
      }
    }
  }
}

Slide 38

Metaphone/Soundex

POST phonetic_sample/_analyze
{
  "field": "name.metaphone",
  "text": "Joe Blocks"
}

POST phonetic_sample/_analyze
{
  "field": "name.soundex",
  "text": "Joe Blocks"
}

Slide 39

Koelner phonetik

POST phonetic_sample/_analyze
{
  "field": "name.koelner",
  "text": "Aleksander"
}

POST phonetic_sample/_analyze
{
  "field": "name.koelner",
  "text": "Alexander"
}

Slide 40

Meier/Maier/Mayer/Meyer

PUT phonetic_sample/_bulk?refresh
{ "index" : { "_id" : 1 }}
{"name":"Peter Meyer"}
{ "index" : { "_id" : 2 }}
{"name":"Peter Meier"}
{ "index" : { "_id" : 3 }}
{"name":"Peter Maier"}
{ "index" : { "_id" : 4 }}
{"name":"Peter Mayer"}

GET phonetic_sample/_search
{
  "query": {
    "match": { "name.metaphone": "Maier" }
  }
}

GET phonetic_sample/_search
{
  "query": {
    "match": { "name.koelner": "Maier" }
  }
}
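What "converting terms to their phonetic representation" means can be shown with a hand-written version of classic Soundex. This is a sketch of the textbook algorithm, not the encoder shipped with the analysis-phonetic plugin, so edge cases may differ; the function soundex and the table CODES are written for this example only.

```python
# consonant groups of classic Soundex; vowels, h, w and y carry no code
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def soundex(name):
    """First letter plus up to three digits, padded with zeros."""
    name = name.lower()
    result = name[0].upper()
    previous = CODES.get(name[0], "")
    for char in name[1:]:
        code = CODES.get(char, "")
        if code and code != previous:
            result += code
        if char not in "hw":  # h and w do not separate letters with equal codes
            previous = code
    return (result + "000")[:4]

spellings = ["Meier", "Maier", "Mayer", "Meyer"]
encoded = {soundex(s) for s in spellings}
```

All four spellings collapse into a single code in this sketch, which is the property a phonetic subfield relies on when a search for one spelling should find the others.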

Slide 41

Ingest Processors

Slide 42

Dissect processor
- Grok processor is hard to configure for simple cases
- Regular expressions are complex and CPU heavy
- dissect does not use regexes, syntax is simpler

Slide 43

Dissect processor

# dissect processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "input",
          "pattern": "%{url}?%{param_string}"
        }
      },
      {
        "kv": {
          "field": "param_string",
          "target_field": "params",
          "field_split": "&",
          "value_split": "="
        }
      }
    ]
  },
  "docs": [
    { "_source" : { "input" : "https://example.org?foo=bar&spam=eggs" } }
  ]
}
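What the two processors do to the sample document can be mimicked with plain string splitting, which is the point of dissect: literal delimiters instead of regular expressions. The functions dissect_url and kv are hypothetical stand-ins that only handle this one pattern.

```python
def dissect_url(value):
    """Mimic the pattern %{url}?%{param_string}: split on the first literal '?'."""
    url, _, param_string = value.partition("?")
    return {"url": url, "param_string": param_string}

def kv(param_string, field_split="&", value_split="="):
    """Mimic the kv processor: split into pairs, then into key and value."""
    pairs = (item.split(value_split, 1) for item in param_string.split(field_split))
    return {key: value for key, value in pairs}

doc = {"input": "https://example.org?foo=bar&spam=eggs"}
doc.update(dissect_url(doc["input"]))      # adds url and param_string
doc["params"] = kv(doc["param_string"])    # target_field: params
```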

Slide 44

Enrich processor
- Enrich documents with data from another index
- Processor uses an enrich policy
- Since: 7.5

Slide 45

Enrich processor: zip to city lookup

# enrich processor
PUT cities/_bulk?refresh
{ "index" : { "_id" : "munich"} }
{"zip":"80331","city":"Munich"}
{ "index" : { "_id" : "berlin"} }
{"zip":"10965","city":"Berlin"}

PUT /_enrich/policy/zip-policy
{
  "match": {
    "indices": "cities",
    "match_field": "zip",
    "enrich_fields": [ "city" ]
  }
}

POST /_enrich/policy/zip-policy/_execute

GET _cat/indices/.enrich-*

Slide 46

Enrich processor: zip to city lookup

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "enrich": {
          "policy_name": "zip-policy",
          "field": "zip",
          "target_field": "city",
          "max_matches": "1"
        }
      }
    ]
  },
  "docs": [
    { "_id": "first", "_source" : { "zip" : "80331" } },
    { "_id": "second", "_source" : { "zip" : "50667" } }
  ]
}
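Conceptually the enrich processor is a lookup against a snapshot of the source index that is built when the policy is executed. A rough model in Python follows; the dictionary lookup and the function enrich are illustrative, and the shape of the target field (match field plus enrich fields) is an assumption based on my reading of the reference documentation.

```python
cities = [
    {"zip": "80331", "city": "Munich"},
    {"zip": "10965", "city": "Berlin"},
]

# "executing" the policy: snapshot keyed by match_field, holding the enrich_fields
lookup = {c["zip"]: {"zip": c["zip"], "city": c["city"]} for c in cities}

def enrich(doc, field="zip", target_field="city"):
    """Add the matching entry under target_field, leave the doc alone otherwise."""
    match = lookup.get(doc.get(field))
    return {**doc, target_field: match} if match else dict(doc)

first = enrich({"zip": "80331"})    # known zip code
second = enrich({"zip": "50667"})   # zip code missing from the cities index
```

Because the lookup is a snapshot, documents added to the cities index later only show up after the policy has been executed again.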

Slide 47

Inference processor
- Uses a pre-trained data frame analytics model to infer
- Built-in language identification

Slide 48

Inference processor

# inference processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": { "classification": {} },
          "field_mappings": {}
        }
      }
    ]
  },
  "docs": [
    { "_source": { "text": "This is an english text" } },
    { "_source": { "text": "Das ist ein deutscher Text" } },
    { "_source": { "text": "我爱北京天安门" } },
    { "_source": { "text": "Ceci est un texte en français" } }
  ]
}

Slide 49

Index Lifecycle Management

Slide 50

Index Lifecycle Management
- Control aging indices
- Configuration via a lifecycle policy
- Policy is split into phases, each phase has its own actions
- hot actions: set priority, unfollow, rollover
- warm actions: set priority, unfollow, read-only, allocate, shrink, force merge
- cold actions: set priority, allocate, freeze
- delete actions: delete

Slide 51

Sample policy

# index lifecycle management
PUT _ilm/policy/full_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d", "max_size": "50G" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 },
          "allocate": { "number_of_replicas": 2 }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "allocate": { "require": { "type": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
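How the min_age values of this policy translate into phases over time, as a deliberately simplified sketch: the phase is the last one whose min_age has been reached. Real ILM measures the age from rollover when rollover is used and also has to finish the actions of a phase before moving on; the table and function below are invented for illustration.

```python
# min_age per phase in days, taken from full_policy (hot has no min_age)
PHASE_MIN_AGE_DAYS = {"hot": 0, "warm": 30, "cold": 60, "delete": 90}

def phase_for(age_days):
    """Return the latest phase whose min_age is not larger than the index age."""
    current = "hot"
    for phase, min_age in PHASE_MIN_AGE_DAYS.items():
        if age_days >= min_age:
            current = phase
    return current

timeline = {age: phase_for(age) for age in (3, 45, 60, 120)}
```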

Slide 52

Summary

Slide 53

Summary
- Understanding search is hard
- Use the reference documentation
- Ask your users about expectations, do not guess!

Slide 54

More Aggregations…
- rare_terms: Find the long tail of rarely occurring terms
- auto_date_histogram: Specify the number of buckets instead of a fixed interval
- geotile_grid: Create map tile based buckets
- string_stats: Extract some string statistics like min/max/avg string length
- top_metrics (likely 7.7): Extract field values based on another field's sort value (top_hits for single values)

Slide 55

More datatypes
- histogram: Pre-aggregated data, can be used in percentiles aggregations
- wildcard (likely 7.8): Optimized field for wildcard queries (grep style)

Slide 56

More Queries & Data Transformations
- pinned query: Promote selected documents to rank at the top
- Rollups and Transforms

Slide 57

Elastic Cloud

Slide 58

Elastic Support Subscriptions

Slide 59

Getting more help

Slide 60

Discuss Forum https://discuss.elastic.co

Slide 61

Community & Meetups https://community.elastic.co

Slide 62

Official Elastic Training https://training.elastic.co

Slide 63

Thanks for listening
Q&A
Alexander Reelsen, Community Advocate
alex@elastic.co | @spinscale