Chinese-Forums

Tool to separate pinyin into syllables?



Posted

hello,

I have downloaded the HSK dictionary, which is good, but its pinyin column does not separate syllables. For example, "work" is given as gongzuo (with the tones). I would like to separate it into gong and zuo, keeping their tones.

Does anyone know how to do this in Excel?

thanks

Posted

I can't think of an easy way for Excel to break the words into individual syllables. However, it should be possible to write a script that will search for each possible pinyin syllable (since there are a finite number of them) and then insert a space when appropriate.

Is there a particular reason why you want to do this? There are rules as to when syllables should be linked together and when they should be separated by spaces; see http://www.pinyin.info/readings/zyg/rules.html. There are also times when, out of context, a character in a sentence can form a word with either the syllable before it or the syllable after it. Keeping the appropriate syllables together avoids confusion.

I used to like to have my syllables separated. However, these days, it annoys me when textbooks separate syllables that should be together.

Posted

1. If the pinyin is currently written with tone marks, run it through a converter to get tone numbers (Wenlin will do this, and there should be something online somewhere). I.e., you want to put in gōngzuò and get out gong1zuo4.

2. Do a bunch of search and replaces. '1' > '1 ', '2' > '2 ', etc to put in the spaces.

3. If necessary, run through the converter in the other direction to get the tone marks back.

You'll end up with unnecessary spaces at the end of words, but I can't see why that would be a problem. A more sophisticated text editor (actually perhaps just Word if you look for advanced settings) would be able to remove spaces at the end of lines if you need to. As above, I'm not sure why you'd want to do this, but that's how . . .
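Step 2 above can also be sketched in a few lines of Python instead of manual search-and-replace. This is only a minimal sketch, assuming the pinyin has already been converted to tone numbers 1 to 5; the function name split_numbered is made up for illustration. It adds a space only when another character follows the digit, so no trailing spaces are produced.

```python
import re

def split_numbered(pinyin):
    """Insert a space after each tone digit that is directly followed by more text."""
    # (?=\S) is a lookahead: the digit must be followed by a non-space character,
    # so already-separated syllables and word endings are left alone.
    return re.sub(r"([1-5])(?=\S)", r"\1 ", pinyin)

print(split_numbered("gong1zuo4"))  # gong1 zuo4
```

Syllables written without any tone digit (for example an unmarked neutral tone) would not be separated by this approach.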

Posted

hi, thanks for the response.

I don't have Wenlin, but the approach you suggested would work, so I will have to look for something else which does the job in both directions, as I would need to get the tones back, of course.

The reason why I want to separate syllables is that when doing a search for pinyin you must make a choice: either end up finding 'ying' when you are looking for 'yin' (which are two totally different words) or, if you don't want to get 'ying' in your search results, you must tell the search engine to spot spaces and stop the search there.

I prefer not to find every 'ying' or 'shuai' when looking for 'yin' or 'shu', and there are many other cases like these. Therefore, I need to place spaces between the pinyin syllables, or pin yin :)

I am happy to write an Excel macro that goes through all 400 or so possible pinyin syllables and makes the separation, but then I thought: if the macro finds something like yingai, how does it know whether it is yin gai or ying ai? (I am not sure this example exists in real life, but there may be other ambiguous break-ups, which, if they do occur, would make simple software too difficult to program.)

Hmm... maybe there aren't that many ambiguous bi- or trisyllables.
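One way to see how often this happens is to list every possible break-up of a string. The following Python is a rough sketch, not a finished tool: the function name segmentations and the tiny SYLLABLES set are assumptions for illustration, and a real run would need the full table of roughly 400 toneless syllables.

```python
def segmentations(text, syllables):
    """Return every way to split `text` into syllables from the given set."""
    if not text:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    # Toneless pinyin syllables are at most 6 letters long (e.g. zhuang)
    for end in range(1, min(len(text), 6) + 1):
        head = text[:end]
        if head in syllables:
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results

# Illustrative subset only, NOT the full syllable table
SYLLABLES = {"gong", "zuo", "yin", "ying", "gai", "ai", "xi", "an", "xian"}

print(segmentations("gongzuo", SYLLABLES))  # [['gong', 'zuo']]
print(segmentations("yingai", SYLLABLES))   # [['yin', 'gai'], ['ying', 'ai']]
```

Any entry that comes back with more than one result is ambiguous and would need checking by hand. Standard pinyin spelling uses an apostrophe for some of these cases (xi'an versus xian), so treating the apostrophe as a forced break should reduce the number of ambiguous entries.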

Posted

Does Excel allow regex searches? E.g., search for yin[^g] to get yin but not ying? But I guess if you do a lot of searches that would get old quickly.
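Outside Excel, the same idea might look like this in Python. It is a sketch under assumptions: the word list is made up, and the helper name find_yin is hypothetical. Note that yin[^g] needs some character after "yin", so it would miss a cell containing just "yin"; a negative lookahead, yin(?!g), avoids that.

```python
import re

# "yin" not immediately followed by "g"; also matches "yin" at the end of the text
YIN_NOT_YING = re.compile(r"yin(?!g)")

def find_yin(words):
    """Return the words that contain 'yin' without it being part of 'ying'."""
    return [w for w in words if YIN_NOT_YING.search(w)]

words = ["yin", "ying", "yinhang", "yingxiong"]  # made-up sample data
print(find_yin(words))  # ['yin', 'yinhang']
```

On unseparated pinyin this still cannot tell yin + g... from ying + ..., so it narrows the results rather than settling the syllable boundaries.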

Posted

I don't know if Excel does regex searches. I keep the data in Excel but then use Java to do the searches, and Java does do regex.

I would be surprised if Visual Basic could not do so as well, in which case you could create a VB macro that does the job, but I am not sure about VB searches.

Posted

Sir, VBA is not a proper programming language, it is a joke. Therefore it does not do something as useful as regexps.

That said, how do you search your Excel file with Java? Did you hack together something of your own? Why not load the data into a proper DB instead? MySQL supports regex searches.

Also, what OS are you on? On Linux, you could simply use grep to do your searches. Say:

for@lone:/tmp$ cat > file.txt <<EOF
> yin
> ying
> foo
> EOF
for@lone:/tmp$ cat file.txt 
yin
ying
foo
for@lone:/tmp$ grep 'yin\>' file.txt 
yin
for@lone:/tmp$ grep 'yin' file.txt 
yin
ying

Posted

hello,

yes, I have actually developed my own software in Java for practising Chinese characters and doing advanced searches.

I am just about to complete a new graphic output, which I think is very good, and will let you know how to download it.

I have asked this forum before for comments on the software and will do so again once it has been uploaded again, in just a couple of days.

Posted

Thanks, Chinesetools, for the link. It is useful, but it still does not separate syllables.

By the way, if anyone is interested in an Excel macro that takes out the tones, I have one and can post it on request.

  • 4 weeks later...
Posted

Hi ilprincipe,

I realize that this post is a month old, but if you are still stuck...

I know it's possible to add spaces after tones by breaking up each word into individual letters and differentiating the letters that have the tones so that you can add the spaces, before putting it all back together again.

=Example=

- Take a name like lù yǔpíng

- Break the letters into separate cells - use the LEFT, RIGHT, and LEN functions.

- Differentiate which letters have tones - many ways, a VLOOKUP or an IF statement from a table of tones is one way.

- Add a space in front of the letters that require it, using the '&' command e.g. =" "&A1

- Put it all back together again A1&A2&A3 etc
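For anyone doing this outside Excel, the "differentiate which letters have tones" step can be sketched in Python with Unicode decomposition instead of a lookup table. This is only a sketch of that one step, not a reproduction of the attached workbook; the names TONE_MARKS, has_tone_mark and flag_tones are made up for illustration.

```python
import unicodedata

# Combining marks used for the four tones: macron, acute, caron, grave
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def has_tone_mark(ch):
    """True if the letter carries a tone mark once decomposed (NFD)."""
    return any(c in TONE_MARKS for c in unicodedata.normalize("NFD", ch))

def flag_tones(text):
    """Pair each letter with a flag saying whether it carries a tone mark."""
    # Normalize to NFC first so each toned vowel is a single character
    text = unicodedata.normalize("NFC", text)
    return [(ch, has_tone_mark(ch)) for ch in text]

toned = [ch for ch, flagged in flag_tones("lù yǔpíng") if flagged]
print(toned)  # ['ù', 'ǔ', 'í']
```

The diaeresis on ü is deliberately not in TONE_MARKS, so a plain ü is not flagged while ǚ is.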

I realize that I would need to write a novel here to explain this in any great detail :), so there is a working example attached.

I hope that this helps someone.

By the way - the example is only up to a maximum of 50 characters; you can easily add to this by copy-dragging more lines in to the work area.

- Jesta

Pinyin2Gapped_Text.xls

Posted

MadJesta, thanks for your reply.

However, in the meantime I found another solution which may be simpler and does the job well. It assumes that the user already has two columns, one with the pinyin (not separated) and one with the corresponding Hanzi. This should not be a problem, as most word lists, such as the HSK list, have the Hanzi next to the pinyin.

1) Download an entire Chinese dictionary, for example CEDICT, which has the pinyin already separated.

2) Use a VLOOKUP function to look up the Hanzi across the whole of CEDICT (two columns only, Hanzi and pinyin).

and it works beautifully.
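The same lookup can be sketched in Python for anyone not using Excel. This is a minimal sketch, assuming the dictionary file follows the CC-CEDICT line layout (Traditional Simplified [pin1 yin1] /definition/); the two sample lines are hand-written for illustration rather than copied from the file, and build_lookup is a made-up name.

```python
import re

# Hand-written sample lines in the CC-CEDICT layout (illustrative only)
SAMPLE = """\
# comment lines start with a hash sign
工作 工作 [gong1 zuo4] /to work/job/
中國 中国 [Zhong1 guo2] /China/
"""

LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /")

def build_lookup(lines):
    """Map each Hanzi headword (traditional and simplified) to its spaced pinyin."""
    lookup = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        m = LINE_RE.match(line)
        if m:
            trad, simp, pinyin = m.groups()
            # setdefault keeps the first reading when a word has several entries
            lookup.setdefault(trad, pinyin)
            lookup.setdefault(simp, pinyin)
    return lookup

lookup = build_lookup(SAMPLE.splitlines())
print(lookup.get("工作"))  # gong1 zuo4
```

Like VLOOKUP, this returns only the first match, so words with more than one reading would need checking. CC-CEDICT gives the pinyin with tone numbers, so a converter is still needed if tone marks are wanted.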
