character Posted February 11, 2008 at 05:27 PM
The online and offline tools I've looked at are for simplified characters only, can handle only a small amount of text at a time, or both. (By a large amount I mean anywhere from several hundred characters to, ideally, an entire book's worth.)
renzhe Posted February 11, 2008 at 06:19 PM
You want to generate vocabulary lists based on a corpus (text collection) you specify? I don't know of any tools, but it should be possible to program something like that with the help of CEDICT and a scripting language.
character Posted February 11, 2008 at 06:26 PM
"You want to generate vocabulary lists based on a corpus (text collection) you specify?"
Exactly. My current process is partly manual and therefore slow.
"I don't know of any tools, but it should be possible to program something like that with the help of CEDICT and a scripting language."
Yeah, I can program something to do it. I just hope there is some existing tool I missed.
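A rough sketch of the CEDICT-plus-script idea, in Python. It assumes the dictionary file uses the CC-CEDICT line layout (`Traditional Simplified [pinyin] /gloss/gloss/`) and it splits text with greedy longest-match, which is only one simple segmentation strategy. The function names are made up for illustration and are not part of any existing tool.

```python
from collections import Counter


def load_cedict(lines):
    """Parse CC-CEDICT-style lines into {traditional: (pinyin, glosses)}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumed layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
        head, _, rest = line.partition(" [")
        parts = head.split(" ")
        if len(parts) != 2 or "] /" not in rest:
            continue  # not a well-formed entry
        pinyin, _, glosses = rest.partition("] /")
        # Keep the first entry seen for each traditional headword.
        entries.setdefault(
            parts[0], (pinyin, glosses.rstrip("/").replace("/", "; "))
        )
    return entries


def segment(text, entries, max_len=8):
    """Greedy longest-match segmentation; characters with no entry are skipped."""
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in entries:
                yield word
                i += n
                break
        else:
            i += 1  # nothing in the dictionary starts at this position


def vocab_list(text, entries):
    """Return (word, count, pinyin, glosses) rows, most frequent first."""
    counts = Counter(segment(text, entries))
    return [(w, c, *entries[w]) for w, c in counts.most_common()]
```

For a book-length text, something like `vocab_list(open("book.txt", encoding="utf-8").read(), load_cedict(open("cedict.txt", encoding="utf-8")))` would give a frequency-sorted list; both file names here are placeholders. Longest-match will sometimes pick the wrong split, so the result still needs a manual pass.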
LaoZhang Posted February 11, 2008 at 07:25 PM
One idea, and I have no idea if this would work: import the text into an Access database as a space-delimited text file, then export the data table. One big problem, though, is that you'll miss out on the multi-character words (binomes, trinomes, idioms, etc.).
PangPang Posted February 11, 2008 at 08:14 PM
Have you tried adsotrans? The creator of the software is on this forum and there is a forum group dedicated to it. For example, I ran the following text through it:

香港警方春节前针对网上流传欲照照片,大举拘捕疑犯,并搜获一千多张淫亵照片。然而在警方高调宣布已侦破欲照源头后,照片仍源源不绝流出,至今网上流传的艺人欲照已达四百多张,涉及七名女性,其中大部分为艺人

And got back:

香港 香港 Hong Kong Unit:Noun
警方春节 警方春節 police Spring Festival Unit:Noun
前 前 before Unit:Other
针对 針對 in connection with Unit:Preposition
网上 網上 Internet Unit:Noun
流传 流傳 to transmit Unit:Verb
欲 欲 desire Unit:Noun
照 照 according to Unit:Noun
照片 照片 photograph Unit:Noun
, , , Unit:Punctuation:Comma
大 大 big Unit:Adjective
举 舉 to lift Unit:Verb
拘捕 拘捕 arrest Unit:Noun
疑犯 疑犯 criminal suspect Unit:Noun
, , , Unit:Punctuation:Comma
并 並 to combine Unit:Verb
搜 搜 to search Unit:Verb
获 獲 to get Unit:Verb
一 一 one Unit:Noun
千 千 thousand Unit:Noun
多张淫亵 多張淫褻 DuoZhangYinxie Unit:Phonetic:Place:ProperNoun
照片 照片 photograph Unit:Noun
。 . Unit:Punctuation:Terminal:Period
然而 然而 however Unit:Noun
在 在 to be at Unit:Verb
警方 警方 police Unit:Noun
高调 high-pitched Unit:Adjective
宣布 宣布 announcement Unit:Noun
已 已 to stop Unit:Verb
侦破 偵破 to solve Unit:Verb
欲 欲 desire Unit:Noun
照 照 according to Unit:Noun
源头 源頭 source Unit:Noun
后 後 after Unit:Temporal
, , , Unit:Punctuation:Comma
照片 照片 photograph Unit:Noun
仍 仍 to remain Unit:Verb
源源 源源 root root Unit:Noun
不绝 nonstop Unit:Adverb
流出 流出 to flow out Unit:Verb
, , , Unit:Punctuation:Comma
至今 至今 up to now Unit:Other
网上 網上 Internet Unit:Noun
流传 流傳 to transmit Unit:Verb
的 的 Unit:Special:De01
艺人 藝人 performing artist Unit:Noun
欲 欲 desire Unit:Noun
照 照 according to Unit:Noun
已 已 to stop Unit:Verb
达 達 to reach Unit:Verb
四 四 four Unit:Noun
百多 百多 more than 100 Unit:Number:Plural
张 張 Zhang Unit:Noun:Name
, , , Unit:Punctuation:Comma
涉及 涉及 to involve Unit:Verb
七 七 7 Unit:Number:Plural
名 名 name Unit:Noun
女性 女性 female Unit:Adjective
, , , Unit:Punctuation:Comma
其中 其中 among Unit:Noun
大部分 大部分 on the large part Unit:Other
为 為 to be Unit:Verb
艺人 藝人 performing artist Unit:Noun

The example uses simplified characters, but I believe it should work with traditional as well.
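Output like the above repeats words (照片 shows up three times) and includes punctuation, so a vocabulary list needs a clean-up pass. Here is a minimal Python sketch of one. It assumes the output has been saved with one entry per line, fields separated by whitespace, the headword first and the `Unit:` tag last. That layout is a guess based on the pasted text, not a documented Adso format, and `unique_vocab` is a hypothetical helper name.

```python
def unique_vocab(lines):
    """Return (headword, tag) pairs: first occurrence only, punctuation dropped.

    Assumes one entry per line, with the headword as the first
    whitespace-separated field and a 'Unit:...' tag as the last.
    """
    seen = set()
    result = []
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2 or not tokens[-1].startswith("Unit:"):
            continue  # not an entry line under the assumed layout
        head, tag = tokens[0], tokens[-1]
        if tag.startswith("Unit:Punctuation") or head in seen:
            continue  # skip punctuation and repeated words
        seen.add(head)
        result.append((head, tag))
    return result
```

Filtering on other tags (for example dropping `Unit:Number` entries) would work the same way.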
trevelyan Posted February 11, 2008 at 08:32 PM
Adso can take care of this quite easily. If you have a large corpus of texts you can also use it to generate statistical data and be more selective about the sorts of words you print out.

Download site: http://adsotrans.com/downloads/

Once the software is compiled/installed, the vocab list PangPang pasted in above can be generated on the command line with this:

./adso -f [input file] --vocab

An updated version is going online tomorrow that includes a command-line binary for use on Windows. Otherwise you're likely stuck with either using the precompiled Debian package or compiling it yourself from source on Linux.
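For a book split across several files, the command above can be wrapped in a short script. This is only a sketch: it uses the `-f` and `--vocab` options exactly as given in the post, and it assumes the binary sits at `./adso` and writes UTF-8 to standard output, neither of which is confirmed here. The function names are hypothetical.

```python
import subprocess


def adso_command(input_file, adso_path="./adso"):
    """Build the argument list for the command line quoted in the post."""
    return [adso_path, "-f", str(input_file), "--vocab"]


def run_adso(input_file, adso_path="./adso"):
    """Run adso on one file and return whatever it prints.

    Assumes UTF-8 output; raises CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(
        adso_command(input_file, adso_path),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling `run_adso` once per chapter file and concatenating the results would give one combined list to de-duplicate afterwards.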
character Posted February 13, 2008 at 04:18 PM
Thanks, I'll try to get Adso working.
character Posted February 22, 2008 at 07:17 PM
An update: It doesn't look like Adso is a real solution yet for traditional characters.