Support associating characters/words with their pinyin when segmentation is enabled #389

Open
fuweichin opened this issue May 4, 2024 · 2 comments

@fuweichin

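With the API call below, segmentation groups the pinyin by word, but the result does not include the corresponding Chinese words (or their indexes).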

console.log(pinyin("我喜欢你", {
  segment: "segmentit",         // enable word segmentation
  group: true,                  // enable grouping by word
}));                            // [ [ 'wǒ' ], [ 'xǐhuān' ], [ 'nǐ' ] ]

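As a result, I cannot map the returned pinyin back to the Chinese text, so I cannot build per-character/per-word pinyin annotations like the following.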

<ruby>我<rt>wǒ</rt></ruby>
<ruby>喜欢<rt>xǐhuān</rt></ruby>
<ruby>你<rt>nǐ</rt></ruby>
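
A minimal sketch of how such annotations could be generated once each word is paired with its pinyin. The helper names toRuby and escapeHtml are hypothetical, and the input shape (segment plus candidates) is an assumption borrowed from the return format proposed in this issue, not something pinyin() returns today.

```javascript
// Hypothetical helpers, not part of the pinyin package.
const ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

// Escape text before placing it inside HTML markup.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => ESCAPES[ch]);
}

// Build one <ruby> element per word, using the first pinyin candidate.
function toRuby(pairs) {
  return pairs
    .map(({ segment, candidates }) =>
      `<ruby>${escapeHtml(segment)}<rt>${escapeHtml(candidates[0] || "")}</rt></ruby>`)
    .join("\n");
}

const html = toRuby([
  { segment: "我", candidates: ["wǒ"] },
  { segment: "喜欢", candidates: ["xǐhuān"] },
  { segment: "你", candidates: ["nǐ"] },
]);
console.log(html);
```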

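Solving this requires changing the format of the API's return value; otherwise the caller has to segment the text first and then call pinyin() once for each character/word.

If changing the return format of the existing pinyin() method is undesirable (to avoid a breaking change), I propose adding a new method, pinyin.segment(), used as follows: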

pinyin.segment("我喜欢你", {
  method: "segmentit",          // choose the segmentation implementation
  group: true,                  // enable grouping by word
  // ...other options stay the same as the options of pinyin()
})

// Return value format
[
  {segment: '我', index: 0, candidates: ['wǒ']},
  {segment: '喜欢', index: 1, candidates: ['xǐhuān']},
  {segment: '你', index: 3, candidates: ['nǐ']}
]

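Note: the original array of pinyin candidates moves into an object as the value of its candidates property, and the object must additionally carry at least one of the properties segment and index.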
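
The proposed shape can be sketched in user code with the fallback mentioned above (segment first, then convert each word). Everything here is an assumption for illustration: segmentWithPinyin is a hypothetical name, and the segmenter and converter are injected stubs standing in for a real segmenter and for pinyin(). The index is computed as a character offset into the input, which matches the 0, 1, 3 values in the proposed return value.

```javascript
// Hypothetical sketch of the proposed return shape; not an API of the pinyin package.
// `segmentText(text)` must return the words in order; `toPinyin(word)` returns candidates.
function segmentWithPinyin(text, segmentText, toPinyin) {
  let offset = 0;
  return segmentText(text).map((segment) => {
    // Offset (in UTF-16 code units) of this word within the original text.
    const index = text.indexOf(segment, offset);
    if (index === -1) throw new Error(`segment not found in text: ${segment}`);
    offset = index + segment.length;
    return { segment, index, candidates: toPinyin(segment) };
  });
}

// Stubs so the sketch runs without any dependency.
const stubSegmenter = () => ["我", "喜欢", "你"];
const stubDict = { "我": ["wǒ"], "喜欢": ["xǐhuān"], "你": ["nǐ"] };

const result = segmentWithPinyin("我喜欢你", stubSegmenter, (word) => stubDict[word]);
console.log(result);
```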

For reference, see similar API designs.

@fuweichin
Author

A few side notes:

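The code import PinyinBase, { getPinyinInstance } from "./PinyinBase"; has no file extension in its specifier, which makes loading fail in browsers. For the ESM build to be usable directly in a browser, import specifiers must carry the .js extension. One way to guarantee this is to compile the ESM output with the tsc-esm command instead of tsc.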
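
To illustrate what the fix amounts to, here is a rough sketch of a post-build rewrite that appends .js to relative specifiers in static import/export ... from statements. The function name addJsExtension is hypothetical, and this regex-based approach is only an approximation: it does not handle directory imports, side-effect imports, or dynamic import(), which a dedicated tool would need to cover.

```javascript
// Hypothetical helper: append ".js" to relative specifiers that lack an extension.
function addJsExtension(source) {
  return source.replace(
    /(from\s+["'])(\.{1,2}\/[^"']+?)(["'])/g,
    (match, head, specifier, tail) =>
      /\.(js|mjs|cjs|json)$/.test(specifier) ? match : `${head}${specifier}.js${tail}`
  );
}

const before = 'import PinyinBase, { getPinyinInstance } from "./PinyinBase";';
const after = addJsExtension(before);
console.log(after);
```

Bare specifiers such as "segmentit" are left untouched, since they are resolved by an import map or a bundler rather than by file path.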

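The code import { Segment, useDefault } from "segmentit"; couples the library to that dependency (segmentit.js is fairly large, at 3.65M). Consider switching to a plugin architecture that lets callers dynamically load a segmentation implementation, and the dictionary data, only when needed.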
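
One possible shape for such a plugin architecture, as a sketch only: segmenters are registered as async loader functions and loaded the first time they are requested. The names registerSegmenter and getSegmenter are hypothetical and do not exist in the pinyin package; the segmentit registration is shown as a comment so the example runs without that dependency.

```javascript
// Hypothetical plugin registry; not an API of the pinyin package.
const loaders = new Map(); // name -> async loader function
const loaded = new Map();  // name -> promise of a segmenter function

function registerSegmenter(name, loader) {
  loaders.set(name, loader);
}

function getSegmenter(name) {
  if (!loaded.has(name)) {
    const loader = loaders.get(name);
    if (!loader) return Promise.reject(new Error(`unknown segmenter: ${name}`));
    loaded.set(name, loader()); // cache the promise so the load happens only once
  }
  return loaded.get(name);
}

// With the real package, the heavy import would be deferred like this:
// registerSegmenter("segmentit", async () => {
//   const { Segment, useDefault } = await import("segmentit");
//   const instance = useDefault(new Segment());
//   return (text) => instance.doSegment(text, { simple: true });
// });

// Lightweight stand-in: split the text into single characters.
let loadCount = 0;
registerSegmenter("chars", async () => {
  loadCount += 1;
  return (text) => Array.from(text);
});

getSegmenter("chars").then((segment) => console.log(segment("你好")));
```

Callers that never request a segmenter never pay for loading it, which is the point of the suggestion above.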

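If segmentation is not needed and only the 2,500 most common characters are required, loading a 7.4MB script hardly seems worthwhile. I look forward to a lightweight build designed with online use in mind.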

@etuardu

etuardu commented Aug 24, 2024

I needed this feature as well. As a workaround, I segmented the text using the same command internally used by the pinyin library, and then combined the resulting arrays:

const { Segment, useDefault } = require('segmentit')
const segmentit = useDefault(new Segment())
const text = "我喜欢你"
const segments = segmentit.doSegment(text, { simple: true })
// [ '我', '喜欢', '你' ]
// `pinyin` is assumed to be imported already, as in the snippet at the top of this issue
const candidates = pinyin(text, { segment: "segmentit", group: true })
// [ [ 'wǒ' ], [ 'xǐhuān' ], [ 'nǐ' ] ]
const words = segments.map((segment, index) => ({
  segment,
  index, // position in the segment list, not a character offset
  candidate: candidates[index]
}))
// [
//   { segment: '我', index: 0, candidate: [ 'wǒ' ] },
//   { segment: '喜欢', index: 1, candidate: [ 'xǐhuān' ] },
//   { segment: '你', index: 2, candidate: [ 'nǐ' ] }
// ]
