Commit: Update README

lewinfox committed Sep 30, 2023
1 parent 2fe3884 commit 11cfb39

Showing 2 changed files with 283 additions and 0 deletions.
120 changes: 120 additions & 0 deletions README.Rmd
@@ -143,6 +143,126 @@ lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently from the others. They always
tokenise their input, and they allow you to assign different weights to specific tokens. This lets
you increase or reduce the influence of the parts of the input strings that matter most to you.

For example, maybe you're comparing company names from different sources, trying to match them up.

``` {r weighted-tokens-1}
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
```

These strings score quite highly already, but the `"ltd"` in each name isn't very helpful. We can
use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.

**NOTE:** Because the tokenisation affects the score, the output of the `lev_weighted_*()`
functions is not directly comparable with that of the non-weighted versions. To get a baseline,
call the weighted function without supplying a `weights` argument.

``` {r weighted-tokens-2}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
```

De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate
impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to zero.

``` {r weighted-tokens-3}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
lev_weighted_token_ratio("united widgets", "utd widgets")
```

De-weighting also works the other way: if the token to be weighted appears in one string but not
the other, then de-weighting it _increases_ the similarity score:

``` {r weighted-token-4}
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: a token is only weighted if one of the
following holds:

* The token appears in the same position in both strings (i.e. it is the first, second, third,
  etc. token in both).
* The strings contain different numbers of tokens, and the corresponding token position in the
  other string is empty.

This is probably easiest to see by example.

``` {r weighted-token-5}
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
```

In this case the weighting has had no effect. Why not? Internally, the function has tokenised the
strings as follows:

| token_1 | token_2 | token_3 |
|-----------|-----------|-----------|
| "utd" | "widgets" | "limited" |
| "united" | "widgets" | "ltd" |

Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't
apply any weights.

This is a deliberate decision; while in the example above it's easy to say "well, clearly ltd and
limited are the same thing so we ought to weight them", how should we handle a less clear example?

``` {r weighted-token-6}
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
```

In this case it's hard to say what the "correct" approach would be. There isn't a meaningful way of
applying weights to dissimilar tokens. In situations like "ltd"/"limited", a pre-cleaning or
standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
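
As an illustration only, such a standardisation step could be as simple as a regex substitution.
The `standardise_suffixes()` helper below is hypothetical and not part of `levitate`:

``` {r pre-cleaning, eval=FALSE}
# Hypothetical helper (not part of levitate): collapse common company
# suffixes to one form so that "limited" and "ltd" tokenise identically.
standardise_suffixes <- function(x) {
  gsub("\\blimited\\b", "ltd", tolower(x))
}

lev_weighted_token_ratio(
  standardise_suffixes("utd widgets limited"),
  standardise_suffixes("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```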

I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()`, as
they may give more useful results for some problems. Remember: **weighting is most useful when
compared with the unweighted output of the same function**.
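
For example, a sketch of that exploration, assuming the sort and set variants accept the same
`weights` argument as `lev_weighted_token_ratio()` (outputs omitted):

``` {r weighted-token-explore, eval=FALSE}
# Sorting or set-matching tokens may align "ltd" differently than the
# positional matching above, so the weights can take effect here.
lev_weighted_token_sort_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
lev_weighted_token_set_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
```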


## Ranking functions

A common problem in this area is: given a string `x` and a set of candidate strings `y`, which
element of `y` is most (or least) similar to `x`? `levitate` provides two functions to help with
this: `lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring is first.

``` {r score-multiple}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
```

`lev_best_match()` returns the best matched string without any score information.

``` {r best-match}
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
```

Both functions take a `.fn` argument which allows you to select a different ranking function. The
default is `lev_ratio()` but you can pick another or write your own. See `?lev_score_multiple` for
details.
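
For instance, a minimal sketch, assuming `.fn` is called with the two strings being compared in
the same way as `lev_ratio()`:

``` {r custom-ranking-fn, eval=FALSE}
# Use a different built-in ranking function
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or supply your own, e.g. a case-insensitive wrapper around lev_ratio()
ci_ratio <- function(a, b, ...) lev_ratio(tolower(a), tolower(b), ...)
lev_best_match("Bilbo", c("gandalf", "frodo", "legolas"), .fn = ci_ratio)
```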

You can also reverse the sort direction with `decreasing = FALSE`, so that _lower_-scoring items
are preferred. This may be helpful if you're using a distance rather than a similarity measure, or
if you want to return the least similar strings.

``` {r best-match-reverse}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because
163 changes: 163 additions & 0 deletions README.md
@@ -179,6 +179,169 @@ lev_token_set_ratio(x, y)
#> [1] 0.7435897
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently
from the others. They always tokenise their input, and they allow you to
assign different weights to specific tokens. This lets you increase or
reduce the influence of the parts of the input strings that matter most
to you.

For example, maybe you’re comparing company names from different
sources, trying to match them up.

``` r
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
#> [1] 0.8421053
```

These strings score quite highly already, but the `"ltd"` in each name
isn’t very helpful. We can use `lev_weighted_token_ratio()` to reduce
the impact of `"ltd"`.

**NOTE:** Because the tokenisation affects the score, the output of the
`lev_weighted_*()` functions is not directly comparable with that of the
non-weighted versions. To get a baseline, call the weighted function
without supplying a `weights` argument.

``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
#> [1] 0.8125

lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7744361
```

De-weighting `"ltd"` has reduced the similarity score of the strings,
which gives a more accurate impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to
zero.

``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
#> [1] 0.7692308

lev_weighted_token_ratio("united widgets", "utd widgets")
#> [1] 0.7692308
```

De-weighting also works the other way: if the token to be weighted
appears in one string but not the other, then de-weighting it
*increases* the similarity score:

``` r
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
#> [1] 0.625

lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7518797
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: a token is only
weighted if one of the following holds:

- The token appears in the same position in both strings (i.e. it is
  the first, second, third, etc. token in both).
- The strings contain different numbers of tokens, and the
  corresponding token position in the other string is empty.

This is probably easiest to see by example.

``` r
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
#> [1] 0.65
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
#> [1] 0.65
```

In this case the weighting has had no effect. Why not? Internally, the
function has tokenised the strings as follows:

| token_1 | token_2 | token_3 |
|----------|-----------|-----------|
| “utd” | “widgets” | “limited” |
| “united” | “widgets” | “ltd” |

Because the token `"ltd"` doesn’t appear in the same position in both
strings, the function doesn’t apply any weights.

This is a deliberate decision; while in the example above it’s easy to
say “well, clearly ltd and limited are the same thing so we ought to
weight them”, how should we handle a less clear example?

``` r
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
#> [1] 0.1176471
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
#> [1] 0.1176471
```

In this case it’s hard to say what the “correct” approach would be.
There isn’t a meaningful way of applying weights to dissimilar tokens.
In situations like “ltd”/“limited”, a pre-cleaning or standardisation
process might be helpful, but that is outside the scope of what
`levitate` offers.
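
As an illustration only, such a standardisation step could be as simple
as a regex substitution. The `standardise_suffixes()` helper below is
hypothetical and not part of `levitate`:

``` r
# Hypothetical helper (not part of levitate): collapse common company
# suffixes to one form so that "limited" and "ltd" tokenise identically.
standardise_suffixes <- function(x) {
  gsub("\\blimited\\b", "ltd", tolower(x))
}

lev_weighted_token_ratio(
  standardise_suffixes("utd widgets limited"),
  standardise_suffixes("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```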

I recommend exploring `lev_weighted_token_sort_ratio()` and
`lev_weighted_token_set_ratio()`, as they may give more useful results
for some problems. Remember: **weighting is most useful when compared
with the unweighted output of the same function**.
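
For example, a sketch of that exploration, assuming the sort and set
variants accept the same `weights` argument as
`lev_weighted_token_ratio()` (outputs omitted):

``` r
# Sorting or set-matching tokens may align "ltd" differently than the
# positional matching above, so the weights can take effect here.
lev_weighted_token_sort_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
lev_weighted_token_set_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1))
```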

## Ranking functions

A common problem in this area is: given a string `x` and a set of
candidate strings `y`, which element of `y` is most (or least) similar
to `x`? `levitate` provides two functions to help with this:
`lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default
the highest-scoring is first.

``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
#> $frodo
#> [1] 0.2
#>
#> $legolas
#> [1] 0.1428571
#>
#> $gandalf
#> [1] 0
```

`lev_best_match()` returns the best matched string without any score
information.

``` r
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
#> [1] "frodo"
```

Both functions take a `.fn` argument which allows you to select a
different ranking function. The default is `lev_ratio()` but you can
pick another or write your own. See `?lev_score_multiple` for details.
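
For instance, a minimal sketch, assuming `.fn` is called with the two
strings being compared in the same way as `lev_ratio()`:

``` r
# Use a different built-in ranking function
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or supply your own, e.g. a case-insensitive wrapper around lev_ratio()
ci_ratio <- function(a, b, ...) lev_ratio(tolower(a), tolower(b), ...)
lev_best_match("Bilbo", c("gandalf", "frodo", "legolas"), .fn = ci_ratio)
```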

You can also reverse the sort direction with `decreasing = FALSE`, so
that *lower*-scoring items are preferred. This may be helpful if you’re
using a distance rather than a similarity measure, or if you want to
return the least similar strings.

``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
#> $gandalf
#> [1] 0
#>
#> $legolas
#> [1] 0.1428571
#>
#> $frodo
#> [1] 0.2
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because