Chinese-Forums

Best way to generate vocab. list for large amount of traditional characters?



You want to generate vocabulary lists based on a corpus (text collection) you specify?
Exactly. My current process is partly manual and therefore slow.
I don't know of any tools, but it should be possible to program something like that with the help of CEDICT and a scripting language.
Yeah, I can program something to do it. I just hope there is some existing tool I missed.
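
A rough sketch of what "CEDICT plus a scripting language" could look like in Python. It assumes the standard CC-CEDICT line layout (traditional, simplified, pinyin in brackets, slash-separated glosses) and uses greedy forward longest-match segmentation, which is a simple stand-in and not a proper word segmenter. The function names and the three sample entries are made up for illustration.

```python
import re
from collections import Counter

# CC-CEDICT entry layout: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")

def load_cedict(lines):
    """Map each traditional headword to (pinyin, glosses); the first entry wins."""
    entries = {}
    for line in lines:
        if line.startswith("#"):  # comment/header lines
            continue
        m = ENTRY_RE.match(line)
        if m:
            trad, _simp, pinyin, glosses = m.groups()
            entries.setdefault(trad, (pinyin, glosses.split("/")))
    return entries

def segment(text, entries, max_len=8):
    """Greedy forward longest-match; characters not in the dictionary are skipped."""
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in entries:
                words.append(candidate)
                i += n
                break
        else:
            i += 1  # punctuation or a character the dictionary lacks
    return words

def vocab_list(text, entries):
    """Return (word, count, pinyin, glosses) tuples, most frequent first."""
    counts = Counter(segment(text, entries))
    return [(w, c, *entries[w]) for w, c in counts.most_common()]

# Illustrative sample entries only; a real run would read the full dictionary file.
SAMPLE_LINES = [
    "# sample subset",
    "香港 香港 [Xiang1 gang3] /Hong Kong/",
    "警方 警方 [jing3 fang1] /police/",
    "照片 照片 [zhao4 pian4] /photograph/picture/",
]
sample_entries = load_cedict(SAMPLE_LINES)
for row in vocab_list("香港警方,照片照片", sample_entries):
    print(row)
```

For a real corpus, `load_cedict` could be fed an open file handle for the downloaded dictionary (the file is usually named `cedict_ts.u8`, but treat that as an assumption) opened with `encoding="utf-8"`. Because the lookup is keyed on the traditional headword, it works directly on traditional text.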

One idea, and I have no idea if this would work: import the text into an Access database as a space-delimited text file, then export the data table. One big problem, though, is that you'll miss out on the multi-character items (binoms, trinoms, idioms, etc.).
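
To make that limitation concrete, here is a small Python sketch of the same per-character idea, with no database involved. It assumes the text is split into single characters and only counts code points in the main CJK Unified Ideographs block; the helper names are hypothetical.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def char_frequencies(text):
    """Count individual Chinese characters, ignoring punctuation and spaces."""
    return Counter(ch for ch in text if is_cjk(ch))

sample = "照片仍源源不绝流出,照片"
freq = char_frequencies(sample)
print(freq.most_common(3))
# Multi-character words never show up as keys, only their component characters.
print("照片" in freq)
```

This gives a character list quickly, but a word such as 照片 is only visible as 照 and 片 counted separately, which is the problem described above.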


Have you tried adsotrans? The creator of the software is on this forum and there is a forum group dedicated to it.

For example, I ran the following text through it:

香港警方春节前针对网上流传欲照照片,大举拘捕疑犯,并搜获一千多张淫亵照片。然而在警方高调宣布已侦破欲照源头后,照片仍源源不绝流出,至今网上流传的艺人欲照已达四百多张,涉及七名女性,其中大部分为艺人

And got back:

香港    香港    Hong Kong       Unit:Noun
警方春节        警方春節        police Spring Festival  Unit:Noun
前      前      before  Unit:Other
针对    針對    in connection with      Unit:Preposition
网上    網上    Internet        Unit:Noun
流传    流傳    to transmit     Unit:Verb
欲      欲      desire  Unit:Noun
照      照      according to    Unit:Noun
照片    照片    photograph      Unit:Noun
,      ,      ,       Unit:Punctuation:Comma
大      大      big     Unit:Adjective
举      舉      to lift         Unit:Verb
拘捕    拘捕    arrest  Unit:Noun
疑犯    疑犯    criminal suspect        Unit:Noun
,      ,      ,       Unit:Punctuation:Comma
并      並      to combine      Unit:Verb
搜      搜      to search       Unit:Verb
获      獲      to get  Unit:Verb
一      一      one     Unit:Noun
千      千      thousand        Unit:Noun
多张淫亵        多張淫褻        DuoZhangYinxie  Unit:Phonetic:Place:ProperNoun
照片    照片    photograph      Unit:Noun
。              .       Unit:Punctuation:Terminal:Period
然而    然而    however         Unit:Noun
在      在      to be at        Unit:Verb
警方    警方    police  Unit:Noun
高调            high-pitched    Unit:Adjective
宣布    宣布    announcement    Unit:Noun
已      已      to stop         Unit:Verb
侦破    偵破    to solve        Unit:Verb
欲      欲      desire  Unit:Noun
照      照      according to    Unit:Noun
源头    源頭    source  Unit:Noun
后      後      after   Unit:Temporal
,      ,      ,       Unit:Punctuation:Comma
照片    照片    photograph      Unit:Noun
仍      仍      to remain       Unit:Verb
源源    源源    root root       Unit:Noun
不绝            nonstop         Unit:Adverb
流出    流出    to flow out     Unit:Verb
,      ,      ,       Unit:Punctuation:Comma
至今    至今    up to now       Unit:Other
网上    網上    Internet        Unit:Noun
流传    流傳    to transmit     Unit:Verb
的      的              Unit:Special:De01
艺人    藝人    performing artist       Unit:Noun
欲      欲      desire  Unit:Noun
照      照      according to    Unit:Noun
已      已      to stop         Unit:Verb
达      達      to reach        Unit:Verb
四      四      four    Unit:Noun
百多    百多    more than 100   Unit:Number:Plural
张      張      Zhang   Unit:Noun:Name
,      ,      ,       Unit:Punctuation:Comma
涉及    涉及    to involve      Unit:Verb
七      七      7       Unit:Number:Plural
名      名      name    Unit:Noun
女性    女性    female  Unit:Adjective
,      ,      ,       Unit:Punctuation:Comma
其中    其中    among   Unit:Noun
大部分  大部分  on the large part       Unit:Other
为      為      to be   Unit:Verb
艺人    藝人    performing artist       Unit:Noun

The example uses simplified characters, but I believe it should work with traditional as well.
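
If the goal is a study list and not a token-by-token gloss, output like the above could be post-processed with a short script. The Python sketch below assumes each row has four tab-separated fields (simplified, traditional, gloss, tag), which is a guess based on how the pasted output looks; the function name and sample rows are illustrative.

```python
def parse_vocab(output):
    """Collapse rows of simplified, traditional, gloss, tag into unique entries."""
    vocab = {}
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 4:
            continue  # row does not fit the assumed four-column layout
        simp, trad, gloss, tag = fields
        if tag.startswith("Unit:Punctuation"):
            continue  # commas, periods and so on are not vocabulary
        key = trad or simp  # fall back when the traditional column is empty
        entry = vocab.setdefault(
            key, {"simplified": simp, "gloss": gloss, "tag": tag, "count": 0}
        )
        entry["count"] += 1
    return vocab

sample_output = "\n".join([
    "香港\t香港\tHong Kong\tUnit:Noun",
    "照片\t照片\tphotograph\tUnit:Noun",
    ",\t,\t,\tUnit:Punctuation:Comma",
    "高调\t\thigh-pitched\tUnit:Adjective",
    "照片\t照片\tphotograph\tUnit:Noun",
    "流传\t流傳\tto transmit\tUnit:Verb",
])
vocab = parse_vocab(sample_output)
for word, info in vocab.items():
    print(word, info["count"], info["gloss"])
```

Keying on the traditional column gives a traditional-character list even when the input was simplified, and the count makes it easy to sort or filter by frequency afterwards.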


Adso can take care of this quite easily. If you have a large corpus of texts you can also use it to generate statistical data and be more selective about the sorts of words you print out. Download site:

http://adsotrans.com/downloads/

Once the software is compiled/installed, the vocab list Pang Pang pasted in above can be generated on the command line with this:

./adso -f [input file] --vocab

An updated version is going online tomorrow that includes a command-line binary for Windows. Otherwise you're likely stuck with either using the precompiled Debian package or compiling from source yourself on Linux.
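
For a large corpus, the command above could be wrapped in a small script that runs it once per file. This Python sketch uses only the `-f` and `--vocab` options quoted in this thread; the binary path, the assumption that the vocab list is written to standard output as UTF-8, and the `*.txt` file pattern are all guesses that may need adjusting.

```python
import subprocess
from pathlib import Path

def adso_command(input_file, adso_path="./adso"):
    """Build the argument list for the invocation quoted in this thread."""
    return [adso_path, "-f", str(input_file), "--vocab"]

def run_adso(input_file, adso_path="./adso"):
    """Run adso on one file and return whatever it prints to standard output."""
    result = subprocess.run(
        adso_command(input_file, adso_path),
        capture_output=True,
        encoding="utf-8",  # assumption: output is UTF-8
        check=True,        # raise if adso exits with an error
    )
    return result.stdout

def vocab_for_corpus(corpus_dir, adso_path="./adso"):
    """Yield (file name, adso output) for every .txt file in a directory."""
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        yield path.name, run_adso(path, adso_path)
```

Something like `for name, out in vocab_for_corpus("corpus"): ...` would then hand each file's output to whatever counting or filtering step comes next.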

