From d81da265bbfaf93a1bf7c0fd2d64385352ee7a40 Mon Sep 17 00:00:00 2001 From: Ralph Ursprung Date: Fri, 4 Oct 2024 18:54:04 +0200 Subject: [PATCH 1/5] document the new `analysis-phonenumber` plugin this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves #8389 Co-authored-by: Fanit Kolchina Signed-off-by: Fanit Kolchina Signed-off-by: Ralph Ursprung --- _analyzers/supported-analyzers/index.md | 11 +- .../supported-analyzers/phone-analyzers.md | 121 ++++++++++++++++++ .../additional-plugins/index.md | 47 +++---- 3 files changed, 155 insertions(+), 24 deletions(-) create mode 100644 _analyzers/supported-analyzers/phone-analyzers.md diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md index af6ce6c3a6..f43e18e0f1 100644 --- a/_analyzers/supported-analyzers/index.md +++ b/_analyzers/supported-analyzers/index.md @@ -29,4 +29,13 @@ Analyzer | Analysis performed | Analyzer output ## Language analyzers -OpenSearch supports analyzers for various languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/). \ No newline at end of file +OpenSearch supports analyzers for various languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/). + +## Additional analyzers + +The following table lists the additional analyzers that OpenSearch supports. + +| Analyzer | Analysis performed | +|:---------------|:---------------------------------------------------------------------------------------------------------| +| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. | +| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. 
| diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md new file mode 100644 index 0000000000..e5951978f4 --- /dev/null +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -0,0 +1,121 @@ +--- +layout: default +title: Phone number +parent: Analyzers +nav_order: 140 +--- + +# Phone number analyzers + +The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. +A dedicated analyzer is required because parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance). For common pitfalls in parsing phone numbers, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md). + + +OpenSearch supports the following phone number analyzers: + +* `phone`: An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) to use at indexing time. +* `phone-search`: A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) to use at search time. + +Internally, the plugin uses the [`libphonenumber`](https://github.com/google/libphonenumber) library and follows its parsing rules. + +The phone number analyzers are not meant to find phone numbers in larger texts. Instead, you should use them on fields which contain phone numbers alone. +{: .note} + +## Installing the plugin + +Before you can use phone number analyzers, you must install the `analysis-phonenumber` plugin by running the following command: + +```sh +./bin/opensearch-plugin install analysis-phonenumber +``` + +## Specifying a default region + +You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are ISO 3166 country codes. For more information, see [List of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). 
+ +When tokenizing phone numbers containing the international calling prefix `+`, the default region is irrelevant. However, for phone numbers which either use a national prefix for international numbers (for example, `001` instead of `+1` to dial Northern America from most European countries), the region needs to be provided. You can also properly index local phone numbers with no international prefix if you specify the region. + +## Example + +The following request creates an index containing one field, which ingests phone numbers for Switzerland (region code `CH`): + +```json +PUT /example-phone +{ + "settings": { + "analysis": { + "analyzer": { + "phone-ch": { + "type": "phone", + "phone-region": "CH" + }, + "phone-search-ch": { + "type": "phone-search", + "phone-region": "CH" + } + } + } + }, + "mappings": { + "properties": { + "phoneNumber": { + "type": "text", + "analyzer": "phone-ch", + "search_analyzer": "phone-search-ch" + } + } + } +} +``` +{% include copy-curl.html %} + +Analysing a (fictional) Swiss phone number with an international calling prefix will work the same with either the Swiss-specific phone region or without: +```json +GET /example-phone/_analyze +{ + "analyzer" : "phone-ch", + "text" : "+41 60 555 12 34" +} +``` +{% include copy-curl.html %} + +and + +```json +GET /example-phone/_analyze +{ + "analyzer" : "phone", + "text" : "+41 60 555 12 34" +} +``` +{% include copy-curl.html %} + +will produce the same result: +```json +["+41 60 555 12 34", "6055512", "41605551", "416055512", "6055", "41605551234", ...] 
+``` + +If, however, the phone number is given without the international calling prefix `+` (either by using `0041` or omitting +the international calling prefix altogether) then only the analyzer with the correct phone region will be able to parse it: +```json +GET /example-phone/_analyze +{ + "analyzer" : "phone-ch", + "text" : "060 555 12 34" +} +``` +{% include copy-curl.html %} + +In contrast the `phone-search` analyzer does not create n-grams and only issues some basic tokens: +```json +GET /example-phone/_analyze +{ + "analyzer" : "phone-search", + "text" : "+41 60 555 12 34" +} +``` +{% include copy-curl.html %} + +```json +["+41 60 555 12 34", "41 60 555 12 34", "41605551234", "605551234", "41"] +``` diff --git a/_install-and-configure/additional-plugins/index.md b/_install-and-configure/additional-plugins/index.md index 87d0662442..afc17cd8b2 100644 --- a/_install-and-configure/additional-plugins/index.md +++ b/_install-and-configure/additional-plugins/index.md @@ -9,29 +9,30 @@ nav_order: 10 There are many more plugins available in addition to those provided by the standard distribution of OpenSearch. These additional plugins have been built by OpenSearch developers or members of the OpenSearch community. While it isn't possible to provide an exhaustive list (because many plugins are not maintained in an OpenSearch GitHub repository), the following plugins, available in the [OpenSearch/plugins](https://github.com/opensearch-project/OpenSearch/tree/main/plugins) directory on GitHub, are some of the plugins that can be installed using one of the installation options, for example, using the command `bin/opensearch-plugin install `. 
-| Plugin name | Earliest available version | -| :--- | :--- | -| analysis-icu | 1.0.0 | -| analysis-kuromoji | 1.0.0 | -| analysis-nori | 1.0.0 | -| analysis-phonetic | 1.0.0 | -| analysis-smartcn | 1.0.0 | -| analysis-stempel | 1.0.0 | -| analysis-ukrainian | 1.0.0 | -| discovery-azure-classic | 1.0.0 | -| discovery-ec2 | 1.0.0 | -| discovery-gce | 1.0.0 | -| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 | -| mapper-annotated-text | 1.0.0 | -| mapper-murmur3 | 1.0.0 | -| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 | -| query-insights | 2.12.0 | -| repository-azure | 1.0.0 | -| repository-gcs | 1.0.0 | -| repository-hdfs | 1.0.0 | -| repository-s3 | 1.0.0 | -| store-smb | 1.0.0 | -| transport-nio | 1.0.0 | +| Plugin name | Earliest available version | +|:-----------------------------------------------------------------------------------------------------------------------|:---------------------------| +| analysis-icu | 1.0.0 | +| analysis-kuromoji | 1.0.0 | +| analysis-nori | 1.0.0 | +| [`analysis-phonenumber`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/) | 2.18.0 | +| analysis-phonetic | 1.0.0 | +| analysis-smartcn | 1.0.0 | +| analysis-stempel | 1.0.0 | +| analysis-ukrainian | 1.0.0 | +| discovery-azure-classic | 1.0.0 | +| discovery-ec2 | 1.0.0 | +| discovery-gce | 1.0.0 | +| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 | +| mapper-annotated-text | 1.0.0 | +| mapper-murmur3 | 1.0.0 | +| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 | +| query-insights | 2.12.0 | +| repository-azure | 1.0.0 | +| repository-gcs | 1.0.0 | +| repository-hdfs | 1.0.0 | +| repository-s3 | 1.0.0 | +| store-smb | 1.0.0 | +| transport-nio | 1.0.0 | ## 
Related articles From 0c422c4bc8aaa80933b1a7dbdfa6812710d49d25 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Fri, 11 Oct 2024 11:02:07 -0400 Subject: [PATCH 2/5] Minor rewrites Signed-off-by: Fanit Kolchina --- .../supported-analyzers/phone-analyzers.md | 29 ++++++++++++------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index e5951978f4..f3f9bbd4e9 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -7,14 +7,13 @@ nav_order: 140 # Phone number analyzers -The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. -A dedicated analyzer is required because parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance). For common pitfalls in parsing phone numbers, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md). +The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. A dedicated analyzer is required because parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance). For common pitfalls in parsing phone numbers, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md). OpenSearch supports the following phone number analyzers: -* `phone`: An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) to use at indexing time. -* `phone-search`: A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) to use at search time. +* [`phone`](#the-phone-analyzer): An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) to use at indexing time. 
+* [`phone-search`](#the-phone-search-analyzer): A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) to use at search time. Internally, the plugin uses the [`libphonenumber`](https://github.com/google/libphonenumber) library and follows its parsing rules. @@ -69,7 +68,10 @@ PUT /example-phone ``` {% include copy-curl.html %} -Analysing a (fictional) Swiss phone number with an international calling prefix will work the same with either the Swiss-specific phone region or without: +## The phone analyzer + +The `phone` analyzer generates n-grams based on the given phone number. Analyzing a (fictional) Swiss phone number containing an international calling prefix can be parsed with or without the Swiss-specific phone region. Thus, the following two requests will produce the same result: + ```json GET /example-phone/_analyze { @@ -79,8 +81,6 @@ GET /example-phone/_analyze ``` {% include copy-curl.html %} -and - ```json GET /example-phone/_analyze { @@ -90,13 +90,15 @@ GET /example-phone/_analyze ``` {% include copy-curl.html %} -will produce the same result: +The response contains the generated n-grams: + ```json ["+41 60 555 12 34", "6055512", "41605551", "416055512", "6055", "41605551234", ...] 
``` -If, however, the phone number is given without the international calling prefix `+` (either by using `0041` or omitting -the international calling prefix altogether) then only the analyzer with the correct phone region will be able to parse it: +However, if you specify the phone number without the international calling prefix `+` (either by using `0041` or omitting +the international calling prefix altogether), then only the analyzer configured with the correct phone region can parse it: + ```json GET /example-phone/_analyze { @@ -106,7 +108,10 @@ GET /example-phone/_analyze ``` {% include copy-curl.html %} -In contrast the `phone-search` analyzer does not create n-grams and only issues some basic tokens: +## The phone-search analyzer + +In contrast, the `phone-search` analyzer does not create n-grams and only issues some basic tokens. Thus, the following request: + ```json GET /example-phone/_analyze { @@ -116,6 +121,8 @@ GET /example-phone/_analyze ``` {% include copy-curl.html %} +is parsed into the following tokens: + ```json ["+41 60 555 12 34", "41 60 555 12 34", "41605551234", "605551234", "41"] ``` From a197b13e67a004dd56ca559f9a5802309ae927dd Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Mon, 14 Oct 2024 09:33:13 -0400 Subject: [PATCH 3/5] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/supported-analyzers/index.md | 2 +- _analyzers/supported-analyzers/phone-analyzers.md | 14 +++++++------- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md index f43e18e0f1..5616936179 100644 --- a/_analyzers/supported-analyzers/index.md +++ b/_analyzers/supported-analyzers/index.md @@ -29,7 +29,7 @@ Analyzer | Analysis performed | Analyzer output ## Language analyzers -OpenSearch supports analyzers for various 
languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/). +OpenSearch supports multiple language analyzers. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/). ## Additional analyzers diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index f3f9bbd4e9..c7579a5ab0 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -22,7 +22,7 @@ The phone number analyzers are not meant to find phone numbers in larger texts. ## Installing the plugin -Before you can use phone number analyzers, you must install the `analysis-phonenumber` plugin by running the following command: +Before you can use the phone number analyzers, you must install the `analysis-phonenumber` plugin by running the following command: ```sh ./bin/opensearch-plugin install analysis-phonenumber @@ -30,13 +30,13 @@ Before you can use phone number analyzers, you must install the `analysis-phonen ## Specifying a default region -You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are ISO 3166 country codes. For more information, see [List of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). +You can optionally specify a default region for parsing phone numbers by providing the `phone-region` parameter within the analyzer. Valid phone regions are represented by ISO 3166 country codes. For more information, see [List of ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). -When tokenizing phone numbers containing the international calling prefix `+`, the default region is irrelevant. 
However, for phone numbers which either use a national prefix for international numbers (for example, `001` instead of `+1` to dial Northern America from most European countries), the region needs to be provided. You can also properly index local phone numbers with no international prefix if you specify the region. +When tokenizing phone numbers containing the international calling prefix `+`, the default region is irrelevant. However, for phone numbers that use a national prefix for international numbers (for example, `001` instead of `+1` to dial Northern America from most European countries), the region needs to be provided. You can also properly index local phone numbers with no international prefix by specifying the region. ## Example -The following request creates an index containing one field, which ingests phone numbers for Switzerland (region code `CH`): +The following request creates an index containing one field that ingests phone numbers for Switzerland (region code `CH`): ```json PUT /example-phone @@ -70,7 +70,7 @@ PUT /example-phone ## The phone analyzer -The `phone` analyzer generates n-grams based on the given phone number. Analyzing a (fictional) Swiss phone number containing an international calling prefix can be parsed with or without the Swiss-specific phone region. Thus, the following two requests will produce the same result: +The `phone` analyzer generates n-grams based on the given phone number. A (fictional) Swiss phone number containing an international calling prefix can be parsed with or without the Swiss-specific phone region. 
Thus, the following two requests will produce the same result: ```json GET /example-phone/_analyze @@ -97,7 +97,7 @@ The response contains the generated n-grams: ``` However, if you specify the phone number without the international calling prefix `+` (either by using `0041` or omitting -the international calling prefix altogether), then only the analyzer configured with the correct phone region can parse it: +the international calling prefix altogether), then only the analyzer configured with the correct phone region can parse the number: ```json GET /example-phone/_analyze @@ -121,7 +121,7 @@ GET /example-phone/_analyze ``` {% include copy-curl.html %} -is parsed into the following tokens: +Is parsed into the following tokens: ```json ["+41 60 555 12 34", "41 60 555 12 34", "41605551234", "605551234", "41"] From 584cde4730a2dbcf1042f1ff68f64ef06ba1dee9 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Mon, 14 Oct 2024 09:34:20 -0400 Subject: [PATCH 4/5] Update _analyzers/supported-analyzers/phone-analyzers.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/supported-analyzers/phone-analyzers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index c7579a5ab0..b85336c48e 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -7,7 +7,7 @@ nav_order: 140 # Phone number analyzers -The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. A dedicated analyzer is required because parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance). For common pitfalls in parsing phone numbers, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md). 
+The `analysis-phonenumber` plugin provides analyzers and tokenizers for parsing phone numbers. A dedicated analyzer is required because parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance). For common misconceptions regarding phone number parsing, see [Falsehoods programmers believe about phone numbers](https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md). OpenSearch supports the following phone number analyzers: From cc2ed9debe5da8bd30ea813f4ff25384033b3d74 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Mon, 14 Oct 2024 09:34:47 -0400 Subject: [PATCH 5/5] Update _analyzers/supported-analyzers/phone-analyzers.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/supported-analyzers/phone-analyzers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index b85336c48e..cba9ef34dd 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -17,7 +17,7 @@ OpenSearch supports the following phone number analyzers: Internally, the plugin uses the [`libphonenumber`](https://github.com/google/libphonenumber) library and follows its parsing rules. -The phone number analyzers are not meant to find phone numbers in larger texts. Instead, you should use them on fields which contain phone numbers alone. +The phone number analyzers are not meant to find phone numbers in larger texts. Instead, you should use them on fields that only contain phone numbers. {: .note} ## Installing the plugin
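
As a hedged illustration of how the two analyzers documented in this series are meant to work together, the following sketch queries the `example-phone` index created in the example. It assumes that a document whose `phoneNumber` field holds `+41 60 555 12 34` has already been indexed. That document and the choice of a `match` query are assumptions made for illustration and are not part of the patch. The query text is analyzed with `phone-search-ch` (the field's `search_analyzer`), so tokens such as `41605551234` should match the n-grams that `phone-ch` produced at indexing time:

```json
GET /example-phone/_search
{
  "query": {
    "match": {
      "phoneNumber": "+41 60 555 12 34"
    }
  }
}
```

Because the search analyzer in the example is configured with the region `CH`, writing the same number in its national form (`060 555 12 34`) would be expected to match as well, whereas an analyzer without a region could not parse it.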