diff --git a/README.Rmd b/README.Rmd
index 561130e..e217631 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -143,6 +143,126 @@ lev_token_sort_ratio(x, y)
 lev_token_set_ratio(x, y)
 ```
 
+### `lev_weighted_token_ratio()`
+
+The `lev_weighted_*()` family of functions works slightly differently from the others. These
+functions always tokenise their input, and they allow you to assign different weights to specific
+tokens. This gives you some influence over the parts of the input strings that are more or less
+interesting to you.
+
+For example, maybe you're comparing company names from different sources, trying to match them up.
+
+``` {r weighted-tokens-1}
+lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
+```
+
+These strings score quite highly already, but the `"ltd"` in each name isn't very helpful. We can
+use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.
+
+**NOTE** Because the tokenisation affects the score, we can't compare the output of the
+`lev_weighted_*()` functions with the non-weighted versions. To get a baseline, call the weighted
+function without supplying a `weights` argument.
+
+``` {r weighted-tokens-2}
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
+
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
+```
+
+De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate
+impression of their similarity.
+
+We can remove the effect of `"ltd"` altogether by setting its weight to zero.
+
+``` {r weighted-tokens-3}
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
+
+lev_weighted_token_ratio("united widgets", "utd widgets")
+```
+
+De-weighting also works the other way: if the weighted token appears in one string but not the
+other, then de-weighting it _increases_ the similarity score:
+
+``` {r weighted-tokens-4}
+lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
+
+lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
+```
+
+#### Limitations of token weighting
+
+`lev_weighted_token_ratio()` has a key limitation: tokens are only weighted if:
+
+* The token appears in the same position in both strings (i.e. it's the first/second/third, etc.
+  token in both)
+* OR the strings contain different numbers of tokens, and the corresponding token position in the
+  other string is empty.
+
+This is easiest to see with an example.
+
+``` {r weighted-tokens-5}
+lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
+lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
+```
+
+In this case the weighting has had no effect. Why not? Internally, the function has tokenised the
+strings as follows:
+
+| token_1   | token_2   | token_3   |
+|-----------|-----------|-----------|
+| "utd"     | "widgets" | "limited" |
+| "united"  | "widgets" | "ltd"     |
+
+Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't
+apply any weights.
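+
+For comparison, here's a hypothetical variant of the example above in which `"ltd"` occupies the
+same position in both strings, so the positional rule is satisfied and the weight should take
+effect (the scores are computed when this document is knitted):
+
+``` {r weighted-tokens-5b}
+# "ltd" is now the third token of both strings, so the weight applies.
+lev_weighted_token_ratio("utd widgets ltd", "united widgets, ltd", weights = list(ltd = 0.1))
+```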
+
+This is a deliberate decision; while in the example above it's easy to say "well, clearly ltd and
+limited are the same thing so we ought to weight them", how should we handle a less clear example?
+
+``` {r weighted-tokens-6}
+lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
+lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
+```
+
+In this case it's hard to say what the "correct" approach would be. There isn't a meaningful way of
+applying weights to dissimilar tokens. In situations like "ltd"/"limited", a pre-cleaning or
+standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
+
+I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()` as
+they may give more useful results for some problems. Remember, **weighting is going to be most
+useful when compared to the unweighted output of the same function**.
+
+## Ranking functions
+
+A common problem in this area is "given a string x and a set of strings y, which string in y is
+most (or least) similar to x?". `levitate` provides two functions to help with this:
+`lev_score_multiple()` and `lev_best_match()`.
+
+`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring
+candidate comes first.
+
+``` {r score-multiple}
+lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
+```
+
+`lev_best_match()` returns the best-matched string without any score information.
+
+``` {r best-match}
+lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
+```
+
+Both functions take a `.fn` argument which allows you to select a different ranking function. The
+default is `lev_ratio()` but you can pick another or write your own (there's a sketch at the end of
+this section). See `?lev_score_multiple` for details.
+
+You can reverse the sort direction with `decreasing = FALSE`, so that _lower_-scoring items are
+preferred. This may be helpful if you're using a distance rather than a similarity measure, or if
+you want to return the least similar strings.
+
+``` {r best-match-reverse}
+lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
+```
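+
+As a sketch of a custom ranking function (hypothetical: any function that takes two strings and
+returns a numeric similarity score should be usable as `.fn`), here is a scorer that compares
+lower-cased strings:
+
+``` {r best-match-custom-fn}
+# Hypothetical custom scorer: lev_ratio() on lower-cased input.
+ratio_ci <- function(a, b, ...) {
+  lev_ratio(tolower(a), tolower(b), ...)
+}
+
+lev_best_match("Bilbo", c("Gandalf", "Frodo", "Legolas"), .fn = ratio_ci)
+```
+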
 ## Porting code from `thefuzz` or `fuzzywuzzyR`
 
 Results differ between `levitate` and `thefuzz`, not least because
diff --git a/README.md b/README.md
index de16c35..c3b3ee7 100644
--- a/README.md
+++ b/README.md
@@ -179,6 +179,169 @@ lev_token_set_ratio(x, y)
 #> [1] 0.7435897
 
+### `lev_weighted_token_ratio()`
+
+The `lev_weighted_*()` family of functions works slightly differently
+from the others. These functions always tokenise their input, and they
+allow you to assign different weights to specific tokens. This gives
+you some influence over the parts of the input strings that are more or
+less interesting to you.
+
+For example, maybe you’re comparing company names from different
+sources, trying to match them up.
+
+``` r
+lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
+#> [1] 0.8421053
+```
+
+These strings score quite highly already, but the `"ltd"` in each name
+isn’t very helpful. We can use `lev_weighted_token_ratio()` to reduce
+the impact of `"ltd"`.
+
+**NOTE** Because the tokenisation affects the score, we can’t compare
+the output of the `lev_weighted_*()` functions with the non-weighted
+versions. To get a baseline, call the weighted function without
+supplying a `weights` argument.
+
+``` r
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
+#> [1] 0.8125
+
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
+#> [1] 0.7744361
+```
+
+De-weighting `"ltd"` has reduced the similarity score of the strings,
+which gives a more accurate impression of their similarity.
+
+We can remove the effect of `"ltd"` altogether by setting its weight to
+zero.
+
+``` r
+lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
+#> [1] 0.7692308
+
+lev_weighted_token_ratio("united widgets", "utd widgets")
+#> [1] 0.7692308
+```
+
+De-weighting also works the other way: if the weighted token appears in
+one string but not the other, then de-weighting it *increases* the
+similarity score:
+
+``` r
+lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
+#> [1] 0.625
+
+lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
+#> [1] 0.7518797
+```
+
+#### Limitations of token weighting
+
+`lev_weighted_token_ratio()` has a key limitation: tokens are only
+weighted if:
+
+- The token appears in the same position in both strings (i.e. it’s the
+  first/second/third, etc. token in both)
+- OR the strings contain different numbers of tokens, and the
+  corresponding token position in the other string is empty.
+
+This is easiest to see with an example.
+
+``` r
+lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
+#> [1] 0.65
+lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
+#> [1] 0.65
+```
+
+In this case the weighting has had no effect. Why not? Internally, the
+function has tokenised the strings as follows:
+
+| token_1  | token_2   | token_3   |
+|----------|-----------|-----------|
+| “utd”    | “widgets” | “limited” |
+| “united” | “widgets” | “ltd”     |
+
+Because the token `"ltd"` doesn’t appear in the same position in both
+strings, the function doesn’t apply any weights.
+
+This is a deliberate decision; while in the example above it’s easy to
+say “well, clearly ltd and limited are the same thing so we ought to
+weight them”, how should we handle a less clear example?
+
+``` r
+lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
+#> [1] 0.1176471
+lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
+#> [1] 0.1176471
+```
+
+In this case it’s hard to say what the “correct” approach would be.
+There isn’t a meaningful way of applying weights to dissimilar tokens.
+In situations like “ltd”/“limited”, a pre-cleaning or standardisation
+process might be helpful (there’s a sketch at the end of this section),
+but that is outside the scope of what `levitate` offers.
+
+I recommend exploring `lev_weighted_token_sort_ratio()` and
+`lev_weighted_token_set_ratio()` as they may give more useful results
+for some problems. Remember, **weighting is going to be most useful when
+compared to the unweighted output of the same function**.
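+
+As a sketch of the pre-cleaning idea mentioned above (hypothetical: the
+`standardise()` helper and its replacement rules are up to you), base
+R’s `gsub()` can normalise “limited” to “ltd” so that the tokens line
+up positionally before comparison:
+
+``` r
+# Hypothetical pre-cleaning step: standardise "limited" to "ltd".
+standardise <- function(x) gsub("\\blimited\\b", "ltd", x)
+
+standardise("utd widgets limited")
+#> [1] "utd widgets ltd"
+
+# After standardising, "ltd" occupies the same position in both strings,
+# so a weight on "ltd" takes effect (scores not reproduced here):
+# lev_weighted_token_ratio(standardise("utd widgets limited"),
+#                          standardise("united widgets, ltd"),
+#                          weights = list(ltd = 0.1))
+```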
+
+## Ranking functions
+
+A common problem in this area is “given a string x and a set of strings
+y, which string in y is most (or least) similar to x?”. `levitate`
+provides two functions to help with this: `lev_score_multiple()` and
+`lev_best_match()`.
+
+`lev_score_multiple()` returns a ranked list of candidates. By default
+the highest-scoring candidate comes first.
+
+``` r
+lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
+#> $frodo
+#> [1] 0.2
+#> 
+#> $legolas
+#> [1] 0.1428571
+#> 
+#> $gandalf
+#> [1] 0
+```
+
+`lev_best_match()` returns the best-matched string without any score
+information.
+
+``` r
+lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
+#> [1] "frodo"
+```
+
+Both functions take a `.fn` argument which allows you to select a
+different ranking function. The default is `lev_ratio()` but you can
+pick another or write your own. See `?lev_score_multiple` for details.
+
+You can reverse the sort direction with `decreasing = FALSE`, so that
+*lower*-scoring items are preferred. This may be helpful if you’re
+using a distance rather than a similarity measure, or if you want to
+return the least similar strings.
+
+``` r
+lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
+#> $gandalf
+#> [1] 0
+#> 
+#> $legolas
+#> [1] 0.1428571
+#> 
+#> $frodo
+#> [1] 0.2
+```
+
 ## Porting code from `thefuzz` or `fuzzywuzzyR`
 
 Results differ between `levitate` and `thefuzz`, not least because