phj (New Members) Posted January 10, 2024 at 09:53 AM

Does anyone know a good programming library or script to convert pinyin with tone marks to tone numbers? I need to convert multi-syllable words written without spaces, so the biggest challenge is the segmentation. For example, "bàngōngshì" should be converted to "ban4gong1shi4" and not something messed up like "bang4o1ngshi4". Suggestions for libraries or scripts that do just the segmentation are also welcome; adding the tone numbers will be easy once the word or sentence is properly segmented. PHP is preferred, as the project I need this for is in PHP, but other programming languages are also welcome. Thanks!
Demonic_Duck Posted January 10, 2024 at 06:21 PM

I'd approach it something like this in JavaScript: [snip]

Seems to give decent results: https://observablehq.com/@lionel-rowe/pinyin-tone-marks-to-numbers

Input: Bùjiǔ, Liú lǎoshī yòu huílai le, hòumian gēnzhe Shùyùn pàngpàng de wàipó. Wàipó jǔzhe làzhú, yīlù dàshēng de dūnangzhe shénme. Wǒ gēn Shùyùn xiàng liǎng ge mù'ǒu, bù gǎn chū yī shēng. Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”

Output: Bu4jiu3, Liu2 lao3shi1 you4 hui2lai le, hou4mian gen1zhe Shu4yun4 pang4pang4 de wai4po2. Wai4po2 ju3zhe la4zhu2, yi1lu4 da4sheng1 de du1nangzhe shen2me. Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”

It should be easy enough to port to PHP, as it looks like PHP also supports Unicode normalization forms, Unicode character properties in regexes, and regex splitting with capture groups (with PREG_SPLIT_DELIM_CAPTURE).
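A rough Python sketch of the normalization-based idea, for anyone who wants something concrete. It is not the snipped JavaScript above: the regex and the names `TONE_MARKS`, `_TONED`, and `marks_to_numbers` are illustrative assumptions. Python's `re` module has no `\p{M}`, so the four combining tone marks are matched directly. The sketch assumes standard orthography (an apostrophe before a syllable that starts with a, o, or e), leaves neutral-tone syllables unnumbered, and does not handle syllabic nasals such as ń or text mixed with other languages.

```python
import re
import unicodedata

# Combining macron, acute, caron, grave -> tone numbers 1 to 4.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}

# Applied to NFD text: a tone mark, then whatever remains of the syllable.
_TONED = re.compile(
    "([\u0304\u0301\u030c\u0300])"  # the tone mark itself
    "([iou]?"                        # second vowel of ai, ei, ao, ou
    # A final n, ng, or r belongs to this syllable only when no vowel
    # follows; otherwise that consonant starts the next syllable.
    "(?:ng(?![aeiou])|n(?![aeiou])|r(?![aeiou]))?)",
    re.IGNORECASE,
)


def marks_to_numbers(text):
    """Replace each tone mark with a digit at the end of its syllable."""
    decomposed = unicodedata.normalize("NFD", text)
    converted = _TONED.sub(
        lambda m: m.group(2) + TONE_MARKS[m.group(1)], decomposed
    )
    # Recompose so that u + diaeresis becomes a single ü again.
    return unicodedata.normalize("NFC", converted)


print(marks_to_numbers("bàngōngshì"))  # ban4gong1shi4
```

Because the look-aheads check the base letter in decomposed text, a following ü counts as a vowel, so the n in "fùnǚ" is treated as the start of the second syllable.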
phj (New Members, Author) Posted January 11, 2024 at 04:10 AM

Nice! Thanks! It does seem to have one omission: it does not support the u-umlaut, as in "shěnglüè" and "lǚyóu". I managed to fix it by changing these lines:

    // grab the last diacritic in the syllable
    const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

I also need neutral tones marked with a 5. I added this by modifying this line:

    if (!mark) return syllable+'5'

I will try to throw more test cases at it later on. If all works well, I will try to convert it to PHP and post the result here.
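To see why taking the first diacritic fails for ü, it helps to look at what NFD produces for a letter like ǚ. The short Python check below uses only the standard `unicodedata` module; the helper names `describe_nfd` and `combining_marks` are made up for illustration.

```python
import unicodedata


def describe_nfd(text):
    """Return the Unicode name of every code point in the NFD form of text."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", text)]


def combining_marks(text):
    """Return only the combining marks of the NFD form, in order."""
    return [
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.combining(ch)
    ]


print(describe_nfd("ǚ"))
# ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS', 'COMBINING CARON']

marks = combining_marks("lǚ")
print(unicodedata.name(marks[0]))   # COMBINING DIAERESIS (not a tone mark)
print(unicodedata.name(marks[-1]))  # COMBINING CARON (the third tone)
```

The diaeresis comes first and the tone mark second, which is why matching the first mark picks up the umlaut rather than the tone.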
Demonic_Duck Posted January 11, 2024 at 04:31 AM

Here's my attempt at a PHP port: https://replit.com/@lionel_rowe/convert-to-tone-numbers#index.php
Demonic_Duck Posted January 12, 2024 at 01:49 AM

On 1/11/2024 at 12:10 PM, phj said:

    It does seem to have one omission, it does not support the u-umlaut: shěnglüè lǚyóu I managed to fix it by changing these lines:

    // grab the last diacritic in the syllable
    const [mark] = syllable.match(/\p{M}(?!.*\p{M})/u) ?? []

    The added negative look-ahead makes it match the last diacritic instead of the first. This seems to work for the test cases I used.

Good catch. I think a more robust method would be matching all diacritics until one of the relevant 4 is found. If NFD always spits out the umlaut before any of the 4 tone marks, that's just an implementation detail.

On 1/11/2024 at 12:10 PM, phj said:

    I also need the 5th tone mark for neutral tones. I added this by modifying this line:

    if (!mark) return syllable+'5'

Should work fine as long as the input is guaranteed to only be Pinyin syllables (rather than e.g. Pinyin mixed with English).

Edit: updated both JS and PHP versions with this logic.
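The "scan until one of the relevant 4 is found" idea, combined with the neutral-tone fallback, might look roughly like this in Python. This is a sketch under assumptions, not the updated JS or PHP code: `TONE_NUMBERS` and `syllable_to_numbered` are hypothetical names, and the input is assumed to be a single, already segmented syllable.

```python
import unicodedata

# The four combining marks that carry tone: macron, acute, caron, grave.
TONE_NUMBERS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numbered(syllable, neutral="5"):
    """Drop the tone mark from one syllable and append its tone number.

    Other diacritics, such as the diaeresis on ü, are kept regardless of
    where they sit relative to the tone mark.
    """
    kept = []
    tone = None
    for ch in unicodedata.normalize("NFD", syllable):
        if tone is None and ch in TONE_NUMBERS:
            tone = TONE_NUMBERS[ch]  # first real tone mark wins
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)) + (tone or neutral)


print(syllable_to_numbered("lǚ"))    # lü3
print(syllable_to_numbered("de"))    # de5
print(syllable_to_numbered("Test"))  # Test5
```

The last call shows the caveat mentioned above: anything without a tone mark, including an English word, is treated as a neutral-tone syllable.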
calculatrix Posted January 16, 2024 at 07:50 AM

Here's one in Python. It uses zhon, which provides a quite impressive (1697 characters) regex for extracting accented Pinyin syllables, and unidecode, which strips the accents from all accented characters. Since that also removes the dots from the u-umlaut (ü), we have to take care of those separately. All non-pinyin substrings are kept unchanged.

    # -*- coding: utf-8 -*-
    import re

    import zhon.pinyin
    from unidecode import unidecode

    # Map each tone-marked letter to its tone number.
    acc2tone = {
        'ā': '1', 'ē': '1', 'ī': '1', 'ō': '1', 'ū': '1', 'ǖ': '1',
        'á': '2', 'é': '2', 'í': '2', 'ó': '2', 'ú': '2', 'ǘ': '2', 'ḿ': '2', 'ń': '2',
        'ǎ': '3', 'ě': '3', 'ǐ': '3', 'ǒ': '3', 'ǔ': '3', 'ǚ': '3', 'ň': '3',
        'à': '4', 'è': '4', 'ì': '4', 'ò': '4', 'ù': '4', 'ǜ': '4', 'ǹ': '4',
    }
    accented = "".join([a for a in acc2tone])

    def accented2tones(sentence):
        lastposition = 0
        retval = ""
        for match in re.finditer(zhon.pinyin.acc_syl, sentence, re.IGNORECASE):
            # Copy any non-pinyin text before this syllable unchanged.
            retval += sentence[lastposition:match.start()]
            pinyin = sentence[match.start():match.end()]
            m = re.findall('[%s]' % accented, pinyin)
            tone = acc2tone.get(m[0], "") if len(m) > 0 else ""
            # unidecode would turn ü into u, so handle the umlaut forms here.
            umlaut = re.findall("[üǖǘǚǜ]", pinyin)
            pinyin = pinyin.replace(umlaut[0], "ü") + tone if len(umlaut) > 0 else unidecode(pinyin) + tone
            retval += pinyin
            lastposition = match.end()
        retval += sentence[lastposition:]
        return retval

    teststring = 'Wǒ gēn Shùyùn xiàng liǎng ge mù\'ǒu, bù gǎn chū yī shēng. 我买了一辆车。 lǚyóu . Shùyùn de wàipó yòng Guǎngxīhuà duì wǒmen shuō, “Test Nǐmen yào sǐ a! Dàshuǐ bǎ shé chōng chūlai; nǐmen bù pà shé lái yǎosǐ nǐmen a?”'
    print(teststring)
    print(accented2tones(teststring))

The conversion returns:

    Wo3 gen1 Shu4yun4 xiang4 liang3 ge mu4'ou3, bu4 gan3 chu1 yi1 sheng1. 我买了一辆车。 lü3you2 . Shu4yun4 de wai4po2 yong4 Guang3xi1hua4 dui4 wo3men shuo1, “Test Ni3men yao4 si3 a! Da4shui3 ba3 she2 chong1 chu1lai; ni3men bu4 pa4 she2 lai2 yao3si3 ni3men a?”