Commit: Update README

lewinfox committed Sep 30, 2023
1 parent 2fe3884 commit 11cfb39

Showing 2 changed files with 283 additions and 0 deletions.
120 changes: 120 additions & 0 deletions README.Rmd
@@ -143,6 +143,126 @@ lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently from the others. They always
tokenise their input, and they allow you to assign different weights to specific tokens. This lets
you increase or reduce the influence of the parts of the input strings that matter most to you.

For example, maybe you're comparing company names from different sources, trying to match them up.

``` {r weighted-tokens-1}
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
```

These strings score quite highly already, but the `"ltd"` in each name isn't very helpful. We can
use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.

**NOTE:** Because the tokenisation affects the score, the output of the `lev_weighted_*()`
functions is not directly comparable with that of the non-weighted versions. To get a baseline,
call the weighted function without supplying a `weights` argument.

``` {r weighted-tokens-2}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
```

De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate
impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to zero.

``` {r weighted-tokens-3}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
lev_weighted_token_ratio("united widgets", "utd widgets")
```

De-weighting also works the other way: if the token to be weighted appears in one string but not
the other, then de-weighting it _increases_ the similarity score:

``` {r weighted-token-4}
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: a token is only weighted if one of the
following holds:

* The token appears in the same position in both strings (i.e. it is the first, second, third,
  etc. token in both).
* The strings contain different numbers of tokens, and the corresponding token position in the
  other string is empty.

This is probably easiest to see by example.

``` {r weighted-token-5}
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
```

In this case the weighting has had no effect. Why not? Internally, the function has tokenised the
strings as follows:

| token_1 | token_2 | token_3 |
|-----------|-----------|-----------|
| "utd" | "widgets" | "limited" |
| "united" | "widgets" | "ltd" |

Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't
apply any weights.

This is a deliberate decision; while in the example above it's easy to say "well, clearly ltd and
limited are the same thing so we ought to weight them", how should we handle a less clear example?

``` {r weighted-token-6}
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
```

In this case it's hard to say what the "correct" approach would be. There isn't a meaningful way of
applying weights to dissimilar tokens. In situations like "ltd"/"limited", a pre-cleaning or
standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
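
As an illustration only, such a standardisation step could be as simple as a regex substitution.
The `standardise_suffixes()` helper below is hypothetical and not part of `levitate`:

``` {r pre-cleaning, eval=FALSE}
# Hypothetical helper (not part of levitate): collapse common company
# suffixes to one form so that "limited" and "ltd" tokenise identically.
standardise_suffixes <- function(x) {
  gsub("\\blimited\\b", "ltd", tolower(x))
}

lev_weighted_token_ratio(
  standardise_suffixes("utd widgets limited"),
  standardise_suffixes("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```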

I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()`, as
they may give more useful results for some problems. Remember: **weighting is most useful when
compared with the unweighted output of the same function**.
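
For example, a sketch of that exploration, assuming the sort and set variants accept the same
`weights` argument as `lev_weighted_token_ratio()` (outputs omitted):

``` {r weighted-token-explore, eval=FALSE}
# Sorting or set-matching tokens may align "ltd" differently than the
# positional matching above, so the weights can take effect here.
lev_weighted_token_sort_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
lev_weighted_token_set_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
```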


## Ranking functions

A common problem in this area is: given a string `x` and a set of candidate strings `y`, which
element of `y` is most (or least) similar to `x`? `levitate` provides two functions to help with
this: `lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring is first.

``` {r score-multiple}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
```

`lev_best_match()` returns the best matched string without any score information.

``` {r best-match}
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
```

Both functions take a `.fn` argument which allows you to select a different ranking function. The
default is `lev_ratio()` but you can pick another or write your own. See `?lev_score_multiple` for
details.
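
For instance, a minimal sketch, assuming `.fn` is called with the two strings being compared in
the same way as `lev_ratio()`:

``` {r custom-ranking-fn, eval=FALSE}
# Use a different built-in ranking function
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or supply your own, e.g. a case-insensitive wrapper around lev_ratio()
ci_ratio <- function(a, b, ...) lev_ratio(tolower(a), tolower(b), ...)
lev_best_match("Bilbo", c("gandalf", "frodo", "legolas"), .fn = ci_ratio)
```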

You can also reverse the sort direction with `decreasing = FALSE`, so that _lower_-scoring items
are preferred. This may be helpful if you're using a distance rather than a similarity measure, or
if you want to return the least similar strings.

``` {r best-match-reverse}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because
163 changes: 163 additions & 0 deletions README.md
@@ -179,6 +179,169 @@ lev_token_set_ratio(x, y)
#> [1] 0.7435897
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently
from the others. They always tokenise their input, and they allow you to
assign different weights to specific tokens. This lets you increase or
reduce the influence of the parts of the input strings that matter most
to you.

For example, maybe you’re comparing company names from different
sources, trying to match them up.

``` r
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
#> [1] 0.8421053
```

These strings score quite highly already, but the `"ltd"` in each name
isn’t very helpful. We can use `lev_weighted_token_ratio()` to reduce
the impact of `"ltd"`.

**NOTE:** Because the tokenisation affects the score, the output of the
`lev_weighted_*()` functions is not directly comparable with that of the
non-weighted versions. To get a baseline, call the weighted function
without supplying a `weights` argument.

``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
#> [1] 0.8125

lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7744361
```

De-weighting `"ltd"` has reduced the similarity score of the strings,
which gives a more accurate impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to
zero.

``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
#> [1] 0.7692308

lev_weighted_token_ratio("united widgets", "utd widgets")
#> [1] 0.7692308
```

De-weighting also works the other way: if the token to be weighted
appears in one string but not the other, then de-weighting it
*increases* the similarity score:

``` r
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
#> [1] 0.625

lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7518797
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: a token is only
weighted if one of the following holds:

- The token appears in the same position in both strings (i.e. it is
  the first, second, third, etc. token in both).
- The strings contain different numbers of tokens, and the
  corresponding token position in the other string is empty.

This is probably easiest to see by example.

``` r
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
#> [1] 0.65
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
#> [1] 0.65
```

In this case the weighting has had no effect. Why not? Internally, the
function has tokenised the strings as follows:

| token_1 | token_2 | token_3 |
|----------|-----------|-----------|
| “utd” | “widgets” | “limited” |
| “united” | “widgets” | “ltd” |

Because the token `"ltd"` doesn’t appear in the same position in both
strings, the function doesn’t apply any weights.

This is a deliberate decision; while in the example above it’s easy to
say “well, clearly ltd and limited are the same thing so we ought to
weight them”, how should we handle a less clear example?

``` r
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
#> [1] 0.1176471
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
#> [1] 0.1176471
```

In this case it’s hard to say what the “correct” approach would be.
There isn’t a meaningful way of applying weights to dissimilar tokens.
In situations like “ltd”/“limited”, a pre-cleaning or standardisation
process might be helpful, but that is outside the scope of what
`levitate` offers.
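
As an illustration only, such a standardisation step could be as simple
as a regex substitution. The `standardise_suffixes()` helper below is
hypothetical and not part of `levitate`:

``` r
# Hypothetical helper (not part of levitate): collapse common company
# suffixes to one form so that "limited" and "ltd" tokenise identically.
standardise_suffixes <- function(x) {
  gsub("\\blimited\\b", "ltd", tolower(x))
}

lev_weighted_token_ratio(
  standardise_suffixes("utd widgets limited"),
  standardise_suffixes("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```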

I recommend exploring `lev_weighted_token_sort_ratio()` and
`lev_weighted_token_set_ratio()`, as they may give more useful results
for some problems. Remember: **weighting is most useful when compared
with the unweighted output of the same function**.
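
For example, a sketch of that exploration, assuming the sort and set
variants accept the same `weights` argument as
`lev_weighted_token_ratio()` (outputs omitted):

``` r
# Sorting or set-matching tokens may align "ltd" differently than the
# positional matching above, so the weights can take effect here.
lev_weighted_token_sort_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
lev_weighted_token_set_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
```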

## Ranking functions

A common problem in this area is: given a string `x` and a set of
candidate strings `y`, which element of `y` is most (or least) similar
to `x`? `levitate` provides two functions to help with this:
`lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default
the highest-scoring is first.

``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
#> $frodo
#> [1] 0.2
#>
#> $legolas
#> [1] 0.1428571
#>
#> $gandalf
#> [1] 0
```

`lev_best_match()` returns the best matched string without any score
information.

``` r
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
#> [1] "frodo"
```

Both functions take a `.fn` argument which allows you to select a
different ranking function. The default is `lev_ratio()` but you can
pick another or write your own. See `?lev_score_multiple` for details.
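
For instance, a minimal sketch, assuming `.fn` is called with the two
strings being compared in the same way as `lev_ratio()`:

``` r
# Use a different built-in ranking function
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or supply your own, e.g. a case-insensitive wrapper around lev_ratio()
ci_ratio <- function(a, b, ...) lev_ratio(tolower(a), tolower(b), ...)
lev_best_match("Bilbo", c("gandalf", "frodo", "legolas"), .fn = ci_ratio)
```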

You can also reverse the sort direction with `decreasing = FALSE`, so
that *lower*-scoring items are preferred. This may be helpful if you’re
using a distance rather than a similarity measure, or if you want to
return the least similar strings.

``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
#> $gandalf
#> [1] 0
#>
#> $legolas
#> [1] 0.1428571
#>
#> $frodo
#> [1] 0.2
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because