Skip to content

Elasticsearch custom similarity plugin to calculate score based on TF * IDF and term recency. Plugin causes that terms with more recent timestamp have higher score.

License

Notifications You must be signed in to change notification settings

tisonet/elasticsearch-termrecencyboosting-plugin

Repository files navigation

Elasticsearch term recency boosting similarity plugin

Elasticsearch custom similarity plugin to calculate score based on TF * IDF and term recency. Plugin causes that terms with more recent timestamp have higher score. Similarity uses Elasticsearch TF * IDF (BM25) similarity and multiply given score with term recency score.

Plugin is inspired by https://github.com/sdauletau/elasticsearch-position-similarity

Build

./gradlew clean assemble

Plugin zip file is then located in build/distributions folder

Install / Remove plugin from Elasticsearch

elasticsearch-plugin install file:////tisonet-elasticsearch-termrecencyboosting-plugin-5.6.10.zip
elasticsearch-plugin remove BM25-recency

Plugin settings

decay_function - Decay functions score a term recency with a function that decays depending on the distance of current time. We have exp, linear and gauss. Default linear.

scale - Defines the number of hours from now at which the computed score will equal decay parameter. Default 24.

decay - The decay parameter defines how terms are scored at the distance given at scale. Default 0.5.

weight - The recency score booster to enhance recency weight. Default 1.0.

More about decay functions can be found on Elasticsearch page https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-decay

Run in Docker

cd docker
./rebuild.sh
docker-compose up

Open Kibana in web browser: http://localhost:5601/app/kibana Login: elastic:changeme

Create index

PUT /test_index
{
  "settings": {
    "similarity": {
      "recencySimilarity": {
        "type": "BM25-recency",
        "decay_function": "exp",
        "scale": "24",
        "decay": "0.5",
        "weight": "1"
      }
    },
    "analysis": {
      "analyzer": {
        "recencyPayloadAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "timestampPayloadFilter"
          ]
        }
      },
      "filter": {
        "timestampPayloadFilter": {
          "delimiter": "|",
          "encoding": "int",
          "type": "delimited_payload_filter"
        }
      }
    }
  }
}

Create mapping

PUT /test_index/test_type/_mapping
{
  "test_type": {
    "properties": {
      "field1": {
        "type": "text"
      },
      "field2": {
        "type": "text",
        "norms": false,
        "term_vector": "with_positions_offsets_payloads",
        "analyzer": "recencyPayloadAnalyzer",
        "similarity": "recencySimilarity"
      }
    }
  }
}

Add documents

Term timestamp is defined as a number of hours since epoch time.

Change term timestamp to something more recent:

Javascript

console.dir(parseInt(new Date().getTime() / 3600000))

Python

import time
print(int(time.time() / 3600))
PUT /test_index/test_type/1
{"field1" : "bar foo", "field2" : "bar|428192 foo|428192"}


PUT /test_index/test_type/2
{"field1" : "foo foo bar bar bar", "field2" : "foo|428191 foo|428190 bar|428191 bar|428190 bar|428189"}


PUT /test_index/test_type/3
{"field1" : "bar bar foo foo", "field2" : "bar|428150 bar|428150 foo|428150 foo|428150"}

POST /test_index/_refresh

Search

GET /test_index/test_type/_search?pretty=true
{
  "explain": true,
  "query": {
    "match": {
      "field2": "foo"
    }
  }
}

About

Elasticsearch custom similarity plugin to calculate score based on TF * IDF and term recency. Plugin causes that terms with more recent timestamp have higher score.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages