
How to sort by title

· One min read

Analyzer

  • Char Mapping
  • remove special characters with the regex [^a-zA-Z0-9]
  • lowercase

Index

  • fielddata=true

Char Mapping

Ä -> a 
Ü -> u
Ö -> o
ä -> a
ü -> u
ö -> o
ß -> ss
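
A minimal sketch of how these building blocks could be combined in the index settings. The char mapping, the regex and fielddata=true are taken from the notes above; the index, analyzer and field names (products, title_sort_analyzer, title) and the keyword tokenizer (so the whole title stays one token) are assumptions:

PUT products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "umlaut_mapping": {
          "type": "mapping",
          "mappings": [ "Ä => a", "Ü => u", "Ö => o", "ä => a", "ü => u", "ö => o", "ß => ss" ]
        },
        "strip_special_chars": {
          "type": "pattern_replace",
          "pattern": "[^a-zA-Z0-9]",
          "replacement": ""
        }
      },
      "analyzer": {
        "title_sort_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "umlaut_mapping", "strip_special_chars" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "title_sort_analyzer",
        "fielddata": true
      }
    }
  }
}

With the keyword tokenizer the whole title ends up as a single normalized token, and fielddata: true makes this token available for sorting.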

How to sort by last purchased

· 2 min read

Feeding

For each product, an array of account ids with the last order date for this product is sent.

{
  "sku": "",
  "accountIdsSort": [
    {
      "accountId": "001O0Q0001vkeX7QXX",
      "lastOrderDate": "2021-10-04T06:11:13Z"
    },
    {
      "accountId": "00I0OQQQ01vkiX5ZXX",
      "lastOrderDate": "2021-09-04T06:33:25Z"
    }
  ]
}

Sort by JSON string

In the index we serialize every entry to a string.

{
  "accountIdsSort": [
    "{accountId=001O0Q0001vkeX7QXX, lastOrderDate=2021-10-04T06:11:13Z}",
    "{accountId=00I0OQQQ01vkiX5ZXX, lastOrderDate=2021-09-04T06:33:25Z}"
  ]
}

To sort the entries, a _script sort is used. The script filters all entries in the array by the account id prefix {accountId=001O0Q0001vkeX7QXX and returns the first matching entry. Elasticsearch uses this value for sorting.

[
  {
    "_script": {
      "script": {
        "source": "return doc['accountIdsSort'].stream().filter(x -> x.startsWith('{accountId=' + params.accountId)).findFirst().orElse('a');",
        "params": {
          "accountId": "$query.f.accountIds"
        }
      },
      "type": "string",
      "order": "desc"
    }
  }
]

Advantages:

  • easy to implement, because the JSON structure does not have to be parsed

Disadvantages:

  • for customers with many purchased products the performance is poor
  • for example, a request took 10 s for a customer with 13,000 products

Sort by a hash

Because of the poor performance and timeouts, another approach with hashed long values was implemented.

[
  {
    "_script": {
      "script": {
        "source": "return doc['accountIdsSortHashes'].stream().filter(x -> x > params.accountIdHash && x < params.accountIdHash + 1000000000).findFirst().orElse(0);",
        "params": {
          "accountIdHash": "$accountIdHash"
        }
      },
      "type": "number",
      "order": "desc"
    }
  }
]

Advantages:

  • good performance
  • for example, the request took 600 ms for a customer with 13,000 products

Disadvantages:

  • very complex to implement
  • the JSON must be parsed
  • the account id in combination with the last order date must be hashed with the same algorithm, both when feeding the index and for every search
  • the hash function must not produce duplicates
  • difficult to test

Hash function in Java

    public static long hashToLong(String input) throws NoSuchAlgorithmException {
        // SHA-256 hash of the input (e.g. the account id)
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hashBytes = digest.digest(input.getBytes());
        // interpret the hash as an unsigned number and reduce it to a long
        long value = new BigInteger(1, hashBytes).longValue();
        // truncate to a non-negative value below 10^10
        return Math.abs(value % 1000000000000000000L) / 100000000;
    }

    // feeding: combine the account id hash and the last order date into one long value per entry
    Long accountIdHash = hashToLong(accountIdDTO.getAccountId());
    ZonedDateTime zonedDateTime = ZonedDateTime.parse(accountIdDTO.getLastOrderDate(), DateTimeFormatter.ISO_DATE_TIME);
    Instant instant = zonedDateTime.toInstant();
    // account hash in the upper digits, last order date (epoch millis / 10000) in the lower nine digits
    long finalHash = accountIdHash * 1000000000 + instant.toEpochMilli() / 10000;
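
At search time the same hash has to be produced for the requested account. A minimal sketch of the query side, assuming accountId holds the requested account id and that, based on the range check in the sort script above, the parameter is the account hash shifted into the same range as the combined feed hash:

    // query side: only the account id is hashed and shifted, so that the script's
    // range check (x > hash && x < hash + 1000000000) matches all entries of this
    // account regardless of the last order date
    long queryAccountIdHash = hashToLong(accountId) * 1000000000L;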

elastic-mapping

· One min read

generation -> gen

Stemming

tokenizer: standard
char_filter: mapping
filter:

  • Synonym Graph
  • delimiter_search_index
  • lowercase
  • query_stopwords
  • Synonym Graph
  • stem_override
  • Snowball
  • unique
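
A minimal sketch of how the stem_override step could be defined, using the generation -> gen rule shown above. The Elasticsearch filter type is stemmer_override; the index name and the single inline rule are only an illustration, the real rules live in the production config:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "stem_override": {
          "type": "stemmer_override",
          "rules": [ "generation => gen" ]
        }
      }
    }
  }
}

Because stemmer_override marks matching tokens as keywords, the Snowball stemmer later in the chain does not stem them again and the pinned stem gen is used.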

Synonyms

  • the synonym graph filter is used in the first position to handle synonyms that are not split by the delimiter filter
  • the synonym filter in the first position is case sensitive
  • the lowercase filter must be placed after the delimiter filter, otherwise the delimiter filter does not work correctly

Token filter [delimiter_search_index] cannot be used to parse synonyms

suggest

· One min read

Current endpoints

Suggest keywords based on search data

  • TODO phrases

Advantages

  • the suggester always returns search results
  • after a new word is typed (whitespace), the next token is computed automatically

Disadvantages

  • sometimes the suggestion makes no sense
  • only one word is suggested

When to use

  • if no tracking data is available
  • as a fallback for long-tail searches

search-page-behaviour

· One min read

A search results page consists of several components.

  • search box
  • products
  • filters / facets
  • sorting
  • pagination
  • zero-results page

With these components the user can further refine the search result.

In the following table, the first column lists the actions with which the user can interact with the search results page. The remaining columns contain the components with their respective behaviour when an action is performed.

| action              | documents | filter             | pagination | sorting | search box   |
|---------------------|-----------|--------------------|------------|---------|--------------|
| select filter       | reload    | keep other filters | reset      | -       | -            |
| remove filter       | reload    | keep other filters | reset      | -       | -            |
| use slider          | reload    | keep other filters | reset      | -       | -            |
| change sorting      | reload    | -                  | reset      | change  | -            |
| type in suggest box | -         | -                  | -          | -       | show suggest |
| submit a search     | reload    | reset              | reset      | -       | hide suggest |
| click on document   | -         | -                  | -          | -       | -            |

QSC Elastic Upgrade

· 2 min read

In this blog post I will show our best practice for upgrading Elasticsearch.

There are different ways to upgrade Elasticsearch:

2.) New Cluster upgrade

Phase 1

  • create a new cluster with the new version 2.0
  • configure the feeding so that both clusters, version 1.0 and 2.0, are fed in parallel
  • trigger a full feed so that the new cluster 2.0 gets all data (is it possible to run a full feed for just one cluster?)
  • both clusters still receive all updates in parallel
  • run acceptance and performance tests against the new cluster
  • switch the search to the new cluster/index
  • run both clusters in parallel until you are certain that the new cluster works (e.g. 2 weeks)

Phase 2

  • remove the old cluster

Advantages

  • you can test the new version before switching to the new cluster
  • you can operate both versions in parallel until you are certain that the new version also works with production load
  • you can roll back in case of problems
  • almost no risk in the upgrade process

Disadvantages

  • the upgrade process is more complex
  • more hardware resources are necessary
  • the upgrade needs more time

How to use the elastic word delimiter graph filter with synonyms

· 2 min read

Approach 1

  • the lowercase filter cannot be applied before the word_delimiter_graph filter, because the word_delimiter_graph filter splits on lowercase/uppercase transitions

PUT my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_word_delimiter_filter",
            "lowercase",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_words": false
        },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}
  • The problem: the synonym_graph filter cannot be used after the word_delimiter_graph filter
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [my_word_delimiter_filter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [my_word_delimiter_filter] cannot be used to parse synonyms"
  },
  "status": 400
}

Approach 2

  • use the synonym filter before the word_delimiter filter
{
  "filter": [
    "my_synonym_filter",
    "my_word_delimiter_filter",
    "lowercase"
  ]
}

Problem:

  • synonym: gen,generation
  • word: Gen
  • the synonym is not applied because Gen is uppercase
  • maintaining all synonyms in both upper and lower case is not an option

Approach 3: use a multiplexer

PUT my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_multiplexer"
          ]
        }
      },
      "filter": {
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": [
            "my_word_delimiter_filter, lowercase",
            "lowercase, my_synonym_filter"
          ]
        },
        "my_word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_words": false
        },
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}

The analyzer references only my_multiplexer. One branch applies the word delimiter and lowercase, the other branch lowercases the token and then applies the synonym filter, so the synonym file is parsed without the word delimiter in the chain and an uppercase token like Gen still matches the lowercase synonym gen. In the production analyzer the two branches contain "delimiter_search_index, lowercase" and "lowercase, Synonym Graph, query_stopwords, stem_override, Snowball, unique".

External price slider

· 2 min read

Prices in ecommerce are a complex topic. In B2B scenarios you can have a different price for each customer for one product. The price can also depend on the purchased quantity (tiered prices).

Simple approach

  • index all price variants in a separate field per document
{
  "title": "iphone x",
  "price_customer_1": 499.0,
  "price_customer_2": 599.0,
  "price_customer_n": 699.0
}

Advantages

  • simple logic
  • the query logic does not change much, only the price field is dynamic, depending on the variant
  • this approach works with all products in the index, even for large result sets

Disadvantages

  • there can be a lot of variants
    • the index gets cluttered with a lot of data that is never used
    • for each variant one index field is required
    • computing all variants can take a long time
    • the prices can change and must be updated frequently
  • it is hard to keep the data in sync

If the prices are stable and there are not too many variants (max. 100) per product, then the simple approach is a good choice.
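
A minimal sketch of how a search request could use the customer-specific field from the example document above; price_customer_1 comes from the example, while the index name products, the query and the range bounds are assumptions:

POST products/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "iphone" } },
      "filter": { "range": { "price_customer_1": { "gte": 100, "lte": 600 } } }
    }
  },
  "sort": [
    { "price_customer_1": "asc" }
  ],
  "aggs": {
    "price_slider": {
      "stats": { "field": "price_customer_1" }
    }
  }
}

Only the field name changes per variant; the stats aggregation delivers the min and max for the price slider directly from the index.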

External price service

  • use an external price service to compute the prices at runtime

img.png

    1. get all ids (see the sketch after this list)
       • retrieve all ids of the products that match the user's search query
       • to get all ids, set size: 1000
       • the result does not need to be sorted
       • 1000 is the upper limit; you can increase it, but run performance tests before using this in production
    2. get all prices by id
       • retrieve all prices by id from the external price service
       • the price parameters identify the variant of the price, e.g. customer id, contract number, ...
    3. compute the price slider
       • compute the min and max values
       • if the price slider is used, compute the ids for the filtered values
    4. get data for the ids
       • retrieve the data for the relevant ids
       • apply the paging, sorting and rows from the user query
    5. add the slider to the response
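
A minimal sketch of steps 1 and 4 as Elasticsearch requests; the index name products, the user query on title and the example ids are assumptions. Step 1 only collects the ids of all matching products:

POST products/_search
{
  "size": 1000,
  "_source": false,
  "query": {
    "match": { "title": "iphone" }
  }
}

Step 4 then restricts the same user query to the ids that survived the price filter and applies the paging from the user query:

POST products/_search
{
  "from": 0,
  "size": 24,
  "query": {
    "bool": {
      "must": { "match": { "title": "iphone" } },
      "filter": { "ids": { "values": [ "sku-1", "sku-2", "sku-3" ] } }
    }
  }
}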

Advantages

  • can be used with a lot of price variants

Disadvantages

  • only works with small result sets (< 1000 products)
  • complex logic

showcases

· One min read

Stemming

drucker -> drücker -> druck
spule -> spüle
deckel -> decke
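
To check which of these terms actually end up on the same stem, the _analyze API can be used. A sketch, assuming a German Snowball stemmer; the concrete language variant and the filter chain of the real search analyzer may differ:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "snowball", "language": "German2" }
  ],
  "text": "Drucker Drücker Druck Spule Spüle Deckel Decke"
}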

Decompounding

sachleistungsportfolio -> sachleistung sport folio