Elasticsearch - how does the edge ngram token filter differ from the ngram token filter?
As I'm new to Elasticsearch, I'm not able to identify the difference between the ngram token filter and the edge ngram token filter.
How do these two differ from each other when processing tokens?
I think the documentation is pretty clear on this:

"This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token."
And the best example for the nGram tokenizer again comes from the documentation:
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
with this tokenizer definition:

"type" : "ngram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
In short:

- The tokenizer, depending on its configuration, creates tokens. In this example: FC, Schalke, 04.
- nGram generates groups of characters of minimum min_gram size and maximum max_gram size from the input text. Basically, the tokens are split into small chunks, and each chunk is anchored on a character (it doesn't matter where that character is; all of them create chunks).
- edgeNGram generates the same kind of chunks, but they always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

For the same text as above, edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered, and for every "word" its first character is the starting point (F from FC, S from Schalke and 0 from 04).
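To see this side by side with the nGram call, a mirrored _analyze request for the edge ngram case might look like the following. The analyzer name is hypothetical, and the min_gram/max_gram values (2 and 5) are my assumptions, chosen so the output matches the token list above:

# Hypothetical analyzer name; min_gram/max_gram picked to reproduce the tokens above
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, Scha, Schal, 04

with this tokenizer definition:

"type" : "edge_ngram",
"min_gram" : "2",
"max_gram" : "5",
"token_chars": [ "letter", "digit" ]

This anchoring at the start of each token is why edge ngrams are the usual choice for search-as-you-type prefix matching, while plain ngrams also match fragments in the middle of words.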