zhiming Posted July 25, 2008 at 04:11 AM
I am looking for a good regular expression for recognizing and validating Pin1Yin1, with numbers used to specify the tones. The best I have found is this: /[A-Za-z]+[A-Za-z1-5,'- ]*/ but it accepts quite a bit of non-pinyin input. Does anyone know of a better regex for doing this? I didn't want to spend hours working on one if someone else has already done it. It may turn out that a truly validating expression would be ridiculously long, with way too many '|' in it anyway.
lemur Posted July 25, 2008 at 11:27 AM
Yeah, a regular expression for that would probably end up looking ugly. It might be worth considering just building a lookup table of all possible pinyin combinations and using that. A table does not look cool or clever, but it wins in clarity and maintainability. (If there's a bug in a complicated regex, a bug fix is likely to introduce another bug.) There's always someone who's going to scream that a table is not as efficient, but it really depends on how the table is implemented. Also, early optimization is a bad idea.
Hmm... I just had an idea: if there's no table of valid pinyin combinations which you can copy, you could always take the Unihan database files and extract all pinyin from the Mandarin pronunciation field. This will probably cover all possibilities, but you could always check against other sources.
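A minimal sketch of the lookup-table approach, assuming the table is simply a set of lowercase syllables without tones. The name `VALID_SYLLABLES` and its ten entries are a hypothetical sample; a real table would be filled from a full syllable list such as one extracted from Unihan.

```python
# Hypothetical, deliberately tiny sample. A real table would list every
# valid syllable, taken from a published syllable chart or from Unihan.
VALID_SYLLABLES = frozenset(
    {"a", "ai", "an", "ang", "ao", "ba", "bai", "ban", "bang", "bao"}
)


def is_pinyin_syllable(text: str) -> bool:
    """Return True if text is a known syllable with an optional tone digit 1-5."""
    s = text.lower()
    # Strip one trailing tone digit, if present.
    if s and s[-1] in "12345":
        s = s[:-1]
    return s in VALID_SYLLABLES
```

Because the tone digit is stripped before the lookup, a bare "3" is reduced to the empty string and rejected, and capitalised input such as "BAI" is handled by the lowercasing step.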
jbradfor Posted July 25, 2008 at 02:24 PM
Yeah, it depends on how accurate you want it to be. If you want it to accept only valid pinyin, a table is the way to go. Or just do a regex with every valid pinyin syllable, separated by |, e.g.
(a|ai|an|ang|ao|ba|bai|ban|bang|bao....)
If you want something just a bit more accurate than what you have, the next step would be to separate the vowels from the consonants, e.g.
[^aeiou]?[h]?[aeiou]+[ng]*
[I think I forgot my regex -- those '?' are trying to be the rule for "zero or one"]
Beyond that, the next step up might be to separate the initials from the finals, i.e. list all the valid initials, then all the valid finals, e.g.
(b|c|ch|d|f|g|h|....|z|zh)?(a|ai|an|ang|ao|....)?[1-5]
This is still not accurate, as most combinations of initials crossed with finals are actually not used. [Note that this would also accept '3' as a valid pinyin, so there is room for improvement there....]
As I recall from the pinyin rules, there tend to be groups of initials that go with certain groups of finals, so one could use that fact to improve the accuracy. Hey, this is getting fun!
P.S. A Google search yields the following from http://www.krugle.org/examples/p-Oc0lNzFtMJQF9PvX/DictionaryView.java , but that seems a bit weird.
private boolean isPinyinSearch(String searchString) {
    boolean isPinyin = false;
    // see if search string matches pinyin regex pattern
    Pattern pat = Pattern.compile("[aeiou](\\w)*:?(\\d|_)");
    Matcher m = pat.matcher(searchString);
    if (m.find()) {
        isPinyin = true;
    }
    return isPinyin;
}
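The initials-times-finals idea can be sketched in Python as follows. This is an illustration under assumptions, not a complete validator: the lists `INITIALS` and `FINALS` are typed in by hand (with v standing for ü), the final is made mandatory so that a bare '3' is rejected, and the tone digit is made optional.

```python
import re

# Hand-typed lists, assumed for illustration; v stands for ü.
INITIALS = "b|p|m|f|d|t|n|l|g|k|h|j|q|x|zh|ch|sh|r|z|c|s|y|w"
FINALS = (
    "a|o|e|ai|ei|ao|ou|an|en|ang|eng|er|ong|"
    "i|ia|ie|iao|iu|ian|in|iang|ing|iong|"
    "u|ua|uo|uai|ui|uan|un|uang|ueng|v|ve|van|vn"
)

# Optional initial, mandatory final, optional tone digit.
SYLLABLE = re.compile(rf"(?:{INITIALS})?(?:{FINALS})[1-5]?", re.IGNORECASE)


def looks_like_syllable(text: str) -> bool:
    """True if text is some initial+final combination, real or not."""
    return SYLLABLE.fullmatch(text) is not None
```

As noted in the post, this still over-accepts: an unused combination such as "fai" passes, because every initial is allowed to pair with every final.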
imron Posted July 25, 2008 at 02:53 PM
I agree. You're unlikely to get a completely accurate regular expression that isn't an absolute monster - especially if you want it to be correct for tones also. It most likely wouldn't perform anywhere near as fast as a table either.
There are plenty of tables of valid syllables available. Here's a decent one: http://www.pinyin.info/rules/initials_finals.html
zhiming (Author) Posted July 25, 2008 at 05:44 PM
I guess maybe I should state why I am asking this question. I am trying to write a lexer that can differentiate between pinyin and any other text. I have a version that looks like this:
tone [1-5]
%%
"bai"{tone}? |
"bang"{tone}? |
"ban"{tone}? |
"bao"{tone}? |
"ba"{tone}? |
...
{ return PINYIN; }
My problem is that syllables could be entered with capitals. So 'bai' could be 'Bai' or 'BAI' ... I guess what I could do is turn "bai"{tone}? into [Bb][Aa][Ii]{tone}?. I am not aware of any lex/flex shortcuts for this.
zhiming (Author) Posted July 25, 2008 at 05:58 PM
By the way, I wrote a C# function that will tell you if a word is a syllable or not. ParseMandarinSyllable.txt is attached. It returns null if the input is not a Mandarin pinyin syllable, and a structure with the syllable parsed into its initial, final, and tone if it is a true Mandarin syllable. It doesn't handle strings with multiple syllables or with any non-Mandarin text intermingled.
I am wondering if this is faster than a hash map / dictionary with the key being the pinyin syllable in all lowercase without the tone and the value being the MandarinPinyinSyllable structure mentioned above.
lemur Posted July 25, 2008 at 07:21 PM
Hash tables of the sort you can get in Java, Python, and also I guess C#, are made to handle general cases, so they may not be as effective as a few switch statements. But there are always surprises. I've seen enough cases where people were foaming at the mouth that solution X was more effective, and then when real profiling was performed it was found to be less efficient than a solution which had initially been discarded as "obviously inefficient".
In the specific case you are trying to tackle, I think it would be possible to optimize the hashing algorithm. I've done that a number of times in Java. Instead of using the stock hash tables, I performed a simple transformation on my input so that I would be able to just use a plain old array to perform the lookup.
imron Posted July 26, 2008 at 01:04 AM
zhiming said: "I am not aware of any lex/flex shortcuts for this."
Convert the input string to lower case when doing the comparisons.
Edit: Just realised you were asking specifically about lex/flex. In this case, just specify -i when running flex, and it will generate a case-insensitive scanner.
Pinyinput basically uses C++'s std::map for checking valid syllables. It also has a bunch of switch statements for finding/replacing a vowel with its tonemark equivalent. It also has some rule checking for apostrophes and such, so that pin'gan will be marked as invalid and ping'an won't.
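The apostrophe check can be sketched as below. This is not Pinyinput's actual code; it assumes the standard orthographic rule that an apostrophe is only written before a syllable beginning with a, o, or e, and the function name `apostrophes_ok` is made up for the example.

```python
def apostrophes_ok(word: str) -> bool:
    """Check that every apostrophe follows a letter or tone digit
    and comes directly before a, o, or e."""
    w = word.lower()
    for i, ch in enumerate(w):
        if ch != "'":
            continue
        # An apostrophe cannot start or end the word.
        if i == 0 or i == len(w) - 1:
            return False
        prev_ok = w[i - 1].isalpha() or w[i - 1] in "12345"
        if not prev_ok or w[i + 1] not in "aoe":
            return False
    return True
```

With this rule ping'an passes and pin'gan is rejected, since the apostrophe in the latter comes before a g. A complete check would also verify that the pieces on either side are valid syllables.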
imron Posted July 26, 2008 at 02:07 AM
Ok, it's been ages since I played with flex, but I thought I'd have a play around with it again. Here is a flex file that should give you a good starting point. It defines:
Initial - a pinyin initial
Final - a pinyin final that can't stand alone
CompleteFinal - a pinyin final that can stand alone
Tone - a tone number
It uses v for ü.
In the grammar, I've manually specified a match for tone versus no tone (instead of just using the ? operator) in case you want the flexibility to differentiate between the two.
There are also some limitations, in that it will match pingan as ping an rather than pin gan, plus it also allows potentially invalid combinations of initials/finals, and doesn't account for erhua. I'm sure there are other problems with it too, but that's what you get for half an hour's work. Anyway, it should be enough to get you started, so the above problems are left as an exercise for the reader.
Compile with:
flex -i pscan.flex
gcc -o pscan lex.yy.c -lfl
This will produce a case-insensitive scanner. Type a line of text and then press enter, and it will split out the pinyin syllables. Use CTRL-D to exit.
Ok, so the forum doesn't let me upload a .flex file, so here it is inline. Copy this and save it to pscan.flex and you should be good to go.
%option noyywrap

INITIAL         b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|w|x|y|z|zh|sh|ch
FINAL           u|ua|uo|uai|ui|uan|uang|un|ueng|i|ia|ie|iao|iu|ian|iang|in|ing|iong|v|ve|van|vn
COMPLETE_FINAL  a|o|e|ai|ao|ou|an|ang|en|eng|er
TONE            [1-5]

%%

{INITIAL}{FINAL}                    { printf( "Found syllable: %s\n", yytext ); }
{INITIAL}{FINAL}{TONE}              { printf( "Found Syllable with a tone: %s\n", yytext ); }
{INITIAL}{COMPLETE_FINAL}           { printf( "Found syllable: %s\n", yytext ); }
{INITIAL}{COMPLETE_FINAL}{TONE}     { printf( "Found Syllable with a tone: %s\n", yytext ); }
{COMPLETE_FINAL}                    { printf( "Found syllable: %s\n", yytext ); }
{COMPLETE_FINAL}{TONE}              { printf( "Found Syllable with a tone: %s\n", yytext ); }
[ \t\n]+                            /*eat whitespace*/
.                                   printf( "Unrecognized character: %s\n", yytext );

%%
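The "pingan becomes ping an" limitation comes from longest-match tokenising, which can be illustrated outside flex with a small Python sketch. It assumes a hypothetical six-entry syllable set, `SYLLABLES`, in place of the full initial/final grammar, and it never backtracks once it has taken a syllable.

```python
# Hypothetical sample set standing in for the full grammar.
SYLLABLES = frozenset({"pin", "ping", "an", "gan", "ni", "hao"})
MAX_LEN = max(map(len, SYLLABLES))


def split_greedy(text):
    """Split text into syllables, always taking the longest match first.

    Returns None if some part of the text cannot be matched.
    """
    s = text.lower()
    out = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(s) - i), 0, -1):
            if s[i:i + n] in SYLLABLES:
                out.append(s[i:i + n])
                i += n
                break
        else:
            return None  # no syllable starts at position i
    return out
```

Here `split_greedy("pingan")` takes "ping" before it ever considers "pin", so the result is ping + an. Getting pin + gan would need either an explicit apostrophe or a way to rank alternative splits.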
roddy Posted July 26, 2008 at 08:39 AM
Can't really cater for all possible file extensions, but anything that isn't recognized can be bundled up in a .rar or .zip file for uploading if necessary.
imron Posted July 26, 2008 at 01:56 PM
Yeah, which I would have done if it were more than, say, a 40-line file. As it was, zipping it up and posting it was more hassle than just copying and pasting.
toddwaze Posted July 26, 2008 at 03:47 PM
There's a trade-off between the length of the regex and its accuracy. An expression that matches all "phonetically possible" syllables, even if a few don't actually exist in standard Mandarin (e.g. "fai", "shong", etc.), would probably be reasonably compact. A regex exactly equivalent to one of those tables of "possible initials and finals" is far less pretty. I tried, and came up with this (newlines added for clarity):
((b|p|m)(i|ie|iao|ian|in|ing|u)
|(()|b|p|m)(a|o|ai|ei|ao|an|en|ang|eng)
|m(e|ou|iu)
|f(a|o|ei|ou|an|en|ang|eng|u)
|(d|t|n|l|g|k|h)(a|e|ai|ao|ou|an|ang|eng|u|uo|uan|ong)
|(d|n|l|g|h)ei
|(n|g|k|h)en
|(d|t|n|l)(i|ie|iao|ian|ing)
|(d|n|l)iu
|(n|l)(in|iang|ü|üe)
|(g|k|h)(ua|uai|ui|un|uang)
|(d|t)ui
|(d|t|l)un
|l(ia|üan|ün)
|(j|q|x)(i|ia|ie|iao|iu|ian|in|iang|ing|u|ue|uan|un|iong)
|(zh|ch|sh|z|c|s)(i|a|e|ai|ao|ou|an|en|ang|eng|u|uo|ui|uan|un)
|r(i|e|ao|ou|an|en|ang|eng|u|uo|ui|uan|un)
|(zh|sh|z)ei
|(zh|ch|sh)(ua|uai|uang)
|(zh|ch|r|z|c|s)ong
|y(i|a|o|e|ai|ao|ou|an|in|ang|ing|u|ue|uan|un|ong)
|w(u|a|o|ai|ei|an|en|ang|eng)
|e|er|ou|pou)[1-5]?
Of course that's only one possible solution, and I don't claim that it's the best, nor have I done any tricks like rewriting "zh|ch|sh|z|c|s" as "(z|c|s)h?". Also, this doesn't account for erhuayin or multi-syllable words.
If your aim is to distinguish pinyin from other text, identifying *valid* pinyin may not be enough though, because some words which are valid pinyin might not actually *be* pinyin (unless tone marks are mandatory). For example, some English words such as "women", "fang", etc. are valid pinyin, as are some words in other languages.
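To try the expression out, here is a sketch that loads it into Python's `re` module in verbose mode, so the newlines can stay, and repeats it to cover multi-syllable words. The wrapper `could_be_pinyin` and the choice to repeat the pattern with `+` are assumptions made for this example, not part of the post.

```python
import re

# The single-syllable pattern from the post, kept on separate lines;
# re.VERBOSE makes the regex engine ignore the newlines.
SYLLABLE = r"""
((b|p|m)(i|ie|iao|ian|in|ing|u)
|(()|b|p|m)(a|o|ai|ei|ao|an|en|ang|eng)
|m(e|ou|iu)
|f(a|o|ei|ou|an|en|ang|eng|u)
|(d|t|n|l|g|k|h)(a|e|ai|ao|ou|an|ang|eng|u|uo|uan|ong)
|(d|n|l|g|h)ei
|(n|g|k|h)en
|(d|t|n|l)(i|ie|iao|ian|ing)
|(d|n|l)iu
|(n|l)(in|iang|ü|üe)
|(g|k|h)(ua|uai|ui|un|uang)
|(d|t)ui
|(d|t|l)un
|l(ia|üan|ün)
|(j|q|x)(i|ia|ie|iao|iu|ian|in|iang|ing|u|ue|uan|un|iong)
|(zh|ch|sh|z|c|s)(i|a|e|ai|ao|ou|an|en|ang|eng|u|uo|ui|uan|un)
|r(i|e|ao|ou|an|en|ang|eng|u|uo|ui|uan|un)
|(zh|sh|z)ei
|(zh|ch|sh)(ua|uai|uang)
|(zh|ch|r|z|c|s)ong
|y(i|a|o|e|ai|ao|ou|an|in|ang|ing|u|ue|uan|un|ong)
|w(u|a|o|ai|ei|an|en|ang|eng)
|e|er|ou|pou)[1-5]?
"""

# One or more syllables back to back (an assumption for this sketch).
WORD = re.compile("(?:" + SYLLABLE + ")+", re.VERBOSE | re.IGNORECASE)


def could_be_pinyin(word: str) -> bool:
    """True if the whole word splits into syllables the pattern accepts."""
    return WORD.fullmatch(word) is not None
```

This makes the false-positive problem concrete: `could_be_pinyin("women")` is true because the word splits as wo + men, and "fang" matches as a single syllable, while "hello" is rejected.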