document the new analysis-phonenumber plugin #8469

Status: Open (wants to merge 6 commits into base: main)
11 changes: 10 additions & 1 deletion _analyzers/supported-analyzers/index.md
@@ -29,4 +29,13 @@ Analyzer | Analysis performed | Analyzer output

## Language analyzers

OpenSearch supports analyzers for various languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).
OpenSearch supports multiple language analyzers. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).

## Additional analyzers

The following table lists the additional analyzers that OpenSearch supports.

| Analyzer | Analysis performed |
|:---------------|:---------------------------------------------------------------------------------------------------------|
| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. |
| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
128 changes: 128 additions & 0 deletions _analyzers/supported-analyzers/phone-analyzers.md
@@ -0,0 +1,128 @@
---
layout: default
title: Phone number
parent: Analyzers
nav_order: 140
---

# Phone number analyzers

The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. A dedicated analyzer is required because parsing phone numbers is a non-trivial task, even though it might seem simple at first glance. For common misconceptions regarding phone number parsing, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md).

OpenSearch supports the following phone number analyzers:

* [`phone`](#the-phone-analyzer): An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) to use at indexing time.
* [`phone-search`](#the-phone-search-analyzer): A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) to use at search time.

Internally, the plugin uses the [`libphonenumber`](https://github.com/google/libphonenumber) library and follows its parsing rules.

The phone number analyzers are not meant to find phone numbers in larger texts. Instead, you should use them on fields that only contain phone numbers.
{: .note}

## Installing the plugin

Before you can use the phone number analyzers, you must install the `analysis-phonenumber` plugin by running the following command:

```sh
./bin/opensearch-plugin install analysis-phonenumber
```
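
Plugins are loaded when a node starts, so the node needs to be restarted after installation. To check that the plugin was loaded, you can list the installed plugins. The following request is a minimal sketch that uses the CAT plugins API; it assumes that you can send requests to the cluster, and the output is not shown because it depends on your nodes and installed plugins:

```json
GET _cat/plugins?v
```
{% include copy-curl.html %}

The response is expected to list `analysis-phonenumber` for each node on which the plugin is installed.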

## Specifying a default region

You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are represented by ISO 3166 country codes. For more information, see [List of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes).

Check warning on line 33 in _analyzers/supported-analyzers/phone-analyzers.md (GitHub Actions / style-job)

[vale] reported by reviewdog: [OpenSearch.AcronymParentheses] 'ISO': Spell out acronyms the first time that you use them on a page and follow them with the acronym in parentheses. Subsequently, use the acronym alone. Location: line 33, column 294. Severity: WARNING.

Contributor Author:

i'm not a native speaker, but i'd have expected "[..] the list [..]" here? (it probably was me who wrote it like this in the first place? 🫣)

Suggested change
You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are represented by ISO 3166 country codes. For more information, see [List of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes).
You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are represented by ISO 3166 country codes. For more information, see the [list of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes).

Collaborator:

@rursprung Either way is correct 😄


When tokenizing phone numbers containing the international calling prefix `+`, the default region is irrelevant. However, for phone numbers that use a national prefix for international numbers (for example, `001` instead of `+1` to dial Northern America from most European countries), the region needs to be provided. You can also properly index local phone numbers with no international prefix by specifying the region.

## Example

The following request creates an index containing one field that ingests phone numbers for Switzerland (region code `CH`):

```json
PUT /example-phone
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phoneNumber": {
        "type": "text",
        "analyzer": "phone-ch",
        "search_analyzer": "phone-search-ch"
      }
    }
  }
}
```
{% include copy-curl.html %}

Review thread on the `"phoneNumber": {` line:

Contributor Author:

btw: do examples usually use camelCase, snake_case or kebab-case? this one currently uses kebab-case for the analyzer names (the actual phone-search analyzer does so too) and index name but camelCase for the field here (sorry, don't know how i ended up with this mix in this example)

Collaborator:

In general, OpenSearch uses snake_case in requests.

Contributor Author:

so this should be phone_number? or is it not worth the effort to discuss this for the documentation? 😅

Collaborator:

Ideally, it should all be snake case, so yes, phone_number is preferred. That said, even our own alerting plugin uses camel case. I think it would be nice to change to snake case 😄
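
To see the index analyzer and the search analyzer working together, you can index a document and then search for it using a local number format. The following requests are a minimal sketch that assumes the `example-phone` index from the preceding request exists; the document ID `1` and the sample number are illustrative, and no response is shown because it depends on your cluster:

```json
PUT /example-phone/_doc/1
{
  "phoneNumber": "+41 60 555 12 34"
}
```
{% include copy-curl.html %}

```json
GET /example-phone/_search
{
  "query": {
    "match": {
      "phoneNumber": "060 555 12 34"
    }
  }
}
```
{% include copy-curl.html %}

Because the field's search analyzer is configured with the `CH` region, the local number in the query is expected to be interpreted as a Swiss number. If the region is applied as described, the query should match the indexed document once the index has been refreshed.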

## The phone analyzer

The `phone` analyzer generates n-grams based on the given phone number. A (fictional) Swiss phone number containing an international calling prefix can be parsed with or without the Swiss-specific phone region. Thus, the following two requests will produce the same result:

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone-ch",
  "text" : "+41 60 555 12 34"
}
```
{% include copy-curl.html %}

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone",
  "text" : "+41 60 555 12 34"
}
```
{% include copy-curl.html %}

The response contains the generated n-grams:

```json
["+41 60 555 12 34", "6055512", "41605551", "416055512", "6055", "41605551234", ...]
```

rursprung marked this conversation as resolved.

However, if you specify the phone number without the international calling prefix `+` (either by using `0041` or omitting the international calling prefix altogether), then only the analyzer configured with the correct phone region can parse the number:

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone-ch",
  "text" : "060 555 12 34"
}
```
{% include copy-curl.html %}
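
The `0041` form mentioned earlier can be checked in the same way. The following request is a sketch that reuses the `phone-ch` analyzer from the example index; only the input text differs, and the output is not shown here:

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone-ch",
  "text" : "0041 60 555 12 34"
}
```
{% include copy-curl.html %}

Sending the same text to the `phone` analyzer, which has no region configured, lets you compare how the number is tokenized when no default region is available.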

## The phone-search analyzer

In contrast, the `phone-search` analyzer does not create n-grams and only issues some basic tokens. Thus, the following request:

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone-search",
  "text" : "+41 60 555 12 34"
}
```
{% include copy-curl.html %}

Is parsed into the following tokens:

Contributor Author:

i had is lowercase as it's the continuation of line 113. now it feels like a new sentence but isn't one (lacking a subject)

Collaborator (@natebower, Oct 14, 2024):

@rursprung I'm not a huge fan of the split-sentence structure, generally, and would prefer that we not use it because it results in this type of issue (case in point: I'd prefer that new lines/sentences/phrases not begin with lowercase letters).


```json
["+41 60 555 12 34", "41 60 555 12 34", "41605551234", "605551234", "41"]
```
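
A region-specific search analyzer can be inspected in the same way. The following request is a sketch that uses the `phone-search-ch` analyzer defined in the example index with a local number; the resulting tokens are not shown here:

```json
GET /example-phone/_analyze
{
  "analyzer" : "phone-search-ch",
  "text" : "060 555 12 34"
}
```
{% include copy-curl.html %}

Comparing this output with the output of the `phone-ch` analyzer for the same number shows which tokens are available for matching at search time.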
47 changes: 24 additions & 23 deletions _install-and-configure/additional-plugins/index.md
@@ -9,29 +9,30 @@ nav_order: 10

There are many more plugins available in addition to those provided by the standard distribution of OpenSearch. These additional plugins have been built by OpenSearch developers or members of the OpenSearch community. It isn't possible to provide an exhaustive list because many plugins are not maintained in an OpenSearch GitHub repository. However, the following plugins are available in the [OpenSearch/plugins](https://github.com/opensearch-project/OpenSearch/tree/main/plugins) directory on GitHub and can be installed using one of the installation options, for example, by running the command `bin/opensearch-plugin install <plugin-name>`.

| Plugin name | Earliest available version |
| :--- | :--- |
| analysis-icu | 1.0.0 |
| analysis-kuromoji | 1.0.0 |
| analysis-nori | 1.0.0 |
| analysis-phonetic | 1.0.0 |
| analysis-smartcn | 1.0.0 |
| analysis-stempel | 1.0.0 |
| analysis-ukrainian | 1.0.0 |
| discovery-azure-classic | 1.0.0 |
| discovery-ec2 | 1.0.0 |
| discovery-gce | 1.0.0 |
| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 |
| mapper-annotated-text | 1.0.0 |
| mapper-murmur3 | 1.0.0 |
| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 |
| query-insights | 2.12.0 |
| repository-azure | 1.0.0 |
| repository-gcs | 1.0.0 |
| repository-hdfs | 1.0.0 |
| repository-s3 | 1.0.0 |
| store-smb | 1.0.0 |
| transport-nio | 1.0.0 |
| Plugin name | Earliest available version |
|:-----------------------------------------------------------------------------------------------------------------------|:---------------------------|
| analysis-icu | 1.0.0 |
| analysis-kuromoji | 1.0.0 |
| analysis-nori | 1.0.0 |
| [`analysis-phonenumber`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/) | 2.18.0 |
| analysis-phonetic | 1.0.0 |
| analysis-smartcn | 1.0.0 |
| analysis-stempel | 1.0.0 |
| analysis-ukrainian | 1.0.0 |
| discovery-azure-classic | 1.0.0 |
| discovery-ec2 | 1.0.0 |
| discovery-gce | 1.0.0 |
| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 |
| mapper-annotated-text | 1.0.0 |
| mapper-murmur3 | 1.0.0 |
| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 |
| query-insights | 2.12.0 |
| repository-azure | 1.0.0 |
| repository-gcs | 1.0.0 |
| repository-hdfs | 1.0.0 |
| repository-s3 | 1.0.0 |
| store-smb | 1.0.0 |
| transport-nio | 1.0.0 |

## Related articles
