
How to sort by title

· One min read

Analyzer

  • Char Mapping
  • remove special characters with the regex [^a-zA-Z0-9]
  • lowercase

Index

  • fielddata=true

Char Mapping

Ä -> a 
Ü -> u
Ö -> o
ä -> a
ü -> u
ö -> o
ß -> ss
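
A minimal sketch of how these building blocks could be combined in the index settings. The char mapping, the regex and fielddata=true are taken from the notes above; the index, analyzer and field names (products, title_sort_analyzer, title) and the keyword tokenizer (so the whole title stays one token) are assumptions:

PUT products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "umlaut_mapping": {
          "type": "mapping",
          "mappings": [ "Ä => a", "Ü => u", "Ö => o", "ä => a", "ü => u", "ö => o", "ß => ss" ]
        },
        "strip_special_chars": {
          "type": "pattern_replace",
          "pattern": "[^a-zA-Z0-9]",
          "replacement": ""
        }
      },
      "analyzer": {
        "title_sort_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "umlaut_mapping", "strip_special_chars" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "title_sort_analyzer",
        "fielddata": true
      }
    }
  }
}

With the keyword tokenizer the whole title ends up as a single normalized token, and fielddata: true makes this token available for sorting.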

How to sort by last purchased

· 2 min read

Feeding

For each product, an array of account ids with the last order date for this product is sent.

{
  "sku": "",
  "accountIdsSort": [
    {
      "accountId": "001O0Q0001vkeX7QXX",
      "lastOrderDate": "2021-10-04T06:11:13Z"
    },
    {
      "accountId": "00I0OQQQ01vkiX5ZXX",
      "lastOrderDate": "2021-09-04T06:33:25Z"
    }
  ]
}

Sort by JSON string

In the index we serialize every entry to a string.

{
  "accountIdsSort": [
    "{accountId=001O0Q0001vkeX7QXX, lastOrderDate=2021-10-04T06:11:13Z}",
    "{accountId=00I0OQQQ01vkiX5ZXX, lastOrderDate=2021-09-04T06:33:25Z}"
  ]
}

To sort the entries, a _script sort is used. The script filters all entries in the array by the account id prefix {accountId=001O0Q0001vkeX7QXX and returns the first matching entry. Elasticsearch uses this value for sorting.

[
  {
    "_script": {
      "script": {
        "source": "return doc['accountIdsSort'].stream().filter(x -> x.startsWith('{accountId=' + params.accountId)).findFirst().orElse('a');",
        "params": {
          "accountId": "$query.f.accountIds"
        }
      },
      "type": "string",
      "order": "desc"
    }
  }
]

Advantages:

  • easy to implement, because the JSON structure does not have to be parsed

Disadvantages:

  • for customers with many purchased products the performance is poor
  • for example, a request took 10 s for a customer with 13,000 products

Sort by a hash

Because of the poor performance and timeouts, another approach with hashed long values was implemented.

[
  {
    "_script": {
      "script": {
        "source": "return doc['accountIdsSortHashes'].stream().filter(x -> x > params.accountIdHash && x < params.accountIdHash + 1000000000).findFirst().orElse(0);",
        "params": {
          "accountIdHash": "$accountIdHash"
        }
      },
      "type": "number",
      "order": "desc"
    }
  }
]

Advantages:

  • good performance
  • for example, the request took 600 ms for a customer with 13,000 products

Disadvantages:

  • very complex to implement
  • the JSON must be parsed
  • the account id in combination with the last order date must be hashed with the same algorithm, both when feeding the index and for every search
  • the hash function must not produce duplicates
  • difficult to test

Hash function in Java

    public static long hashToLong(String input) throws NoSuchAlgorithmException {
        // SHA-256 hash of the input (e.g. the account id)
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hashBytes = digest.digest(input.getBytes());
        // interpret the hash as an unsigned number and reduce it to a long
        long value = new BigInteger(1, hashBytes).longValue();
        // truncate to a non-negative value below 10^10
        return Math.abs(value % 1000000000000000000L) / 100000000;
    }

    // feeding: combine the account id hash and the last order date into one long value per entry
    Long accountIdHash = hashToLong(accountIdDTO.getAccountId());
    ZonedDateTime zonedDateTime = ZonedDateTime.parse(accountIdDTO.getLastOrderDate(), DateTimeFormatter.ISO_DATE_TIME);
    Instant instant = zonedDateTime.toInstant();
    // account hash in the upper digits, last order date (epoch millis / 10000) in the lower nine digits
    long finalHash = accountIdHash * 1000000000 + instant.toEpochMilli() / 10000;
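
At search time the same hash has to be produced for the requested account. A minimal sketch of the query side, assuming accountId holds the requested account id and that, based on the range check in the sort script above, the parameter is the account hash shifted into the same range as the combined feed hash:

    // query side: only the account id is hashed and shifted, so that the script's
    // range check (x > hash && x < hash + 1000000000) matches all entries of this
    // account regardless of the last order date
    long queryAccountIdHash = hashToLong(accountId) * 1000000000L;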

elastic-mapping

· One min read

generation -> gen

Stemming

tokenizer: standard
char_filter: mapping
filter:

  • Synonym Graph
  • delimiter_search_index
  • lowercase
  • query_stopwords
  • Synonym Graph
  • stem_override
  • Snowball
  • unique
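
A minimal sketch of how the stem_override step could be defined, using the generation -> gen rule shown above. The Elasticsearch filter type is stemmer_override; the index name and the single inline rule are only an illustration, the real rules live in the production config:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "stem_override": {
          "type": "stemmer_override",
          "rules": [ "generation => gen" ]
        }
      }
    }
  }
}

Because stemmer_override marks matching tokens as keywords, the Snowball stemmer later in the chain does not stem them again and the pinned stem gen is used.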

Synonyms

  • the synonym graph filter is used in the first position to handle synonyms that are not split by the delimiter filter
  • the synonym filter in the first position is case sensitive
  • the lowercase filter must be placed after the delimiter filter, otherwise the delimiter filter does not work correctly

Token filter [delimiter_search_index] cannot be used to parse synonyms

suggest

· One min read

Current endpoints

Suggest keywords based on search data

  • TODO phrases

Advantages

  • the suggester always returns search results
  • after a new word is typed (whitespace), the next token is computed automatically

Disadvantages

  • sometimes the suggestion makes no sense
  • only one word is suggested

When to use

  • if no tracking data is available
  • as a fallback for long-tail searches

search-page-behaviour

· One min read

A search results page consists of several components.

  • search box
  • products
  • filters / facets
  • sorting
  • pagination
  • zero-results page

With these components the user can further refine the search result.

In the following table, the first column lists the actions with which the user can interact with the search results page. The remaining columns contain the components with their respective behaviour when an action is performed.

| action              | documents | filter             | pagination | sorting | search box   |
|---------------------|-----------|--------------------|------------|---------|--------------|
| select filter       | reload    | keep other filters | reset      | -       | -            |
| remove filter       | reload    | keep other filters | reset      | -       | -            |
| use slider          | reload    | keep other filters | reset      | -       | -            |
| change sorting      | reload    | -                  | reset      | change  | -            |
| type in suggest box | -         | -                  | -          | -       | show suggest |
| submit a search     | reload    | reset              | reset      | -       | hide suggest |
| click on document   | -         | -                  | -          | -       | -            |

QSC Elastic Upgrade

· 2 min read

In this blog post I will show our best practice for upgrading Elasticsearch.

There are different ways to upgrade Elasticsearch:

2.) New Cluster upgrade

Phase 1

  • create a new cluster with the new version 2.0
  • configure the feeding so that both clusters, version 1.0 and 2.0, are fed in parallel
  • trigger a full feed so that the new cluster 2.0 gets all data (is it possible to run a full feed for just one cluster?)
  • both clusters still receive all updates in parallel
  • run acceptance and performance tests against the new cluster
  • switch the search to the new cluster/index
  • run both clusters in parallel until you are certain that the new cluster works (e.g. 2 weeks)

Phase 2

  • remove the old cluster

Advantages

  • you can test the new version before switching to the new cluster
  • you can operate both versions in parallel until you are certain that the new version also works with production load
  • you can roll back in case of problems
  • almost no risk in the upgrade process

Disadvantages

  • the upgrade process is more complex
  • more hardware resources are necessary
  • the upgrade needs more time

How to use the elastic word delimiter graph filter with synonyms

· 2 min read

Approach 1

  • the lowercase filter cannot be applied before the word_delimiter_graph filter, because the word_delimiter_graph filter splits on lowercase/uppercase transitions

PUT my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_word_delimiter_filter",
            "lowercase",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_words": false
        },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}
  • The problem: the synonym_graph filter cannot be used after the word_delimiter_graph filter
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [my_word_delimiter_filter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [my_word_delimiter_filter] cannot be used to parse synonyms"
  },
  "status": 400
}

Approach 2

  • use the synonym filter before the word_delimiter filter
{
  "filter": [
    "my_synonym_filter",
    "my_word_delimiter_filter",
    "lowercase"
  ]
}

Problem:

  • synonym: gen,generation
  • word: Gen
  • the synonym is not applied because Gen is uppercase
  • maintaining all synonyms in both upper and lower case is not an option

Approach 3: use a multiplexer

PUT my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_multiplexer"
          ]
        }
      },
      "filter": {
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": [
            "my_word_delimiter_filter, lowercase",
            "lowercase, my_synonym_filter"
          ]
        },
        "my_word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_words": false
        },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}

The analyzer references only my_multiplexer. One branch applies the word delimiter and lowercase, the other branch lowercases the token and then applies the synonym filter, so the synonym file is parsed without the word delimiter in the chain and an uppercase token like Gen still matches the lowercase synonym gen. In the production analyzer the two branches contain "delimiter_search_index, lowercase" and "lowercase, Synonym Graph, query_stopwords, stem_override, Snowball, unique".

External price slider

· 2 min read

Prices in ecommerce are a complex topic. In B2B scenarios you can have a different price for each customer for one product. The price can also depend on the purchased quantity (tiered prices).

Simple approach

  • index all price variants in a separate field per document
{
  "title": "iphone x",
  "price_customer_1": 499.0,
  "price_customer_2": 599.0,
  "price_customer_n": 699.0
}

Advantages

  • simple logic
  • the query logic does not change much, only the price field is dynamic, depending on the variant
  • this approach works with all products in the index, even for large result sets

Disadvantages

  • there can be a lot of variants
    • the index gets cluttered with a lot of data that is never used
    • for each variant one index field is required
    • computing all variants can take a long time
    • the prices can change and must be updated frequently
  • it is hard to keep the data in sync

If the prices are stable and there are not too many variants (max. 100) per product, then the simple approach is a good choice.
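
A minimal sketch of how a search request could use the customer-specific field from the example document above; price_customer_1 comes from the example, while the index name products, the query and the range bounds are assumptions:

POST products/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "iphone" } },
      "filter": { "range": { "price_customer_1": { "gte": 100, "lte": 600 } } }
    }
  },
  "sort": [
    { "price_customer_1": "asc" }
  ],
  "aggs": {
    "price_slider": {
      "stats": { "field": "price_customer_1" }
    }
  }
}

Only the field name changes per variant; the stats aggregation delivers the min and max for the price slider directly from the index.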

External price service

  • use an external price service to compute the prices at runtime

img.png

    1. get all ids (see the sketch after this list)
       • retrieve all ids of the products that match the user's search query
       • to get all ids, set size: 1000
       • the result does not need to be sorted
       • 1000 is the upper limit; you can increase it, but run performance tests before using this in production
    2. get all prices by id
       • retrieve all prices by id from the external price service
       • the price parameters identify the variant of the price, e.g. customer id, contract number, ...
    3. compute the price slider
       • compute the min and max values
       • if the price slider is used, compute the ids for the filtered values
    4. get data for the ids
       • retrieve the data for the relevant ids
       • apply the paging, sorting and rows from the user query
    5. add the slider to the response
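
A minimal sketch of steps 1 and 4 as Elasticsearch requests; the index name products, the user query on title and the example ids are assumptions. Step 1 only collects the ids of all matching products:

POST products/_search
{
  "size": 1000,
  "_source": false,
  "query": {
    "match": { "title": "iphone" }
  }
}

Step 4 then restricts the same user query to the ids that survived the price filter and applies the paging from the user query:

POST products/_search
{
  "from": 0,
  "size": 24,
  "query": {
    "bool": {
      "must": { "match": { "title": "iphone" } },
      "filter": { "ids": { "values": [ "sku-1", "sku-2", "sku-3" ] } }
    }
  }
}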

Advantages

  • can be used with a lot of price variants

Disadvantages

  • only works with small result sets (< 1000 products)
  • complex logic

showcases

· One min read

Stemming

drucker -> drücker -> druck
spule -> spüle
deckel -> decke
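
To check which of these terms actually end up on the same stem, the _analyze API can be used. A sketch, assuming a German Snowball stemmer; the concrete language variant and the filter chain of the real search analyzer may differ:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "snowball", "language": "German2" }
  ],
  "text": "Drucker Drücker Druck Spule Spüle Deckel Decke"
}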

Decompounding

sachleistungsportfolio -> sachleistung sport folio