buzhongren Posted April 19, 2010 at 01:48 PM c_redman: 不但……而且 Nice homework. For the English examples, check the CQP box and drop the ; at the end of the examples. It automatically tags a count limit, and the ; causes a premature end of syntax. xiele (thanks), Jim
geminni88 Posted June 19, 2010 at 10:29 PM Jun Da, in his Chinese Text Computing project, gives a list of bigrams (two-character combinations) in frequency order. However, these were compiled automatically and there are some errors. The link is http://lingua.mtsu.edu/chinese-computing/ If you are trying to improve your Chinese vocabulary, a better way might be to get the vocabulary lists used in the tests of Chinese proficiency in the PRC (HSK) and in Taiwan (TOP). They are graded into levels such as Beginning, Intermediate, and Advanced. The HSK list can be found at http://en.wiktionary.org/wiki/Category:Chinese_language and the TOP list can be downloaded under the reference links on http://www.sc-top.org.tw/english/download.php Hope this helps your study.
yaokong Posted May 5, 2012 at 04:24 AM Is there any way to generate such word (not character) frequency lists for individual texts? I have quite a lot of (Buddhist) documents with rare words and names not appearing in any dictionary. I would like to learn the most common expressions before trying to read those texts. It's fine if the list contains mistakes (due to low mutual information); I think I would be able to recognize those. I found some scripts here and there which are supposed to do this, but could not run them successfully. The Stanford Chinese Segmenter, for example, gives the error messages "Could not reserve enough space for object heap" and "Could not create the Java virtual machine". SegTag by 史晓东 crashes on launch. Such a tool would be a great help for all those who learn non-standard, less frequent parts of Chinese! Well, actually for anybody trying to read a document with unknown words. It would be very useful to find out what the most common words are and sort out the unknown ones for study.
Daan Posted May 5, 2012 at 10:19 AM I've done some work with the Stanford Chinese Word Segmenter, but I've always been able to fix such errors by allocating a bit more memory. If you're running the segmenter on a Linux system, try this: /path/to/segmenter/segment.sh ctb unsegmented.txt UTF-8 0 > segmented.txt 2> /dev/null Your input file should be encoded in UTF-8. If this throws any errors, you can edit segment.sh to increase the amount of memory allocated to Java: something like 2GB should be more than enough. If that doesn't cut it, you may want to split your input file into a few smaller files. The procedure should be the same on Windows, only you'd be looking at segment.bat, I think. But I haven't tried using this tool on Windows. By the way, you should know that this segmenter will only add word boundaries to your text: it won't calculate these frequency statistics for you. But that's easily solved with a few lines of code, which I'd be happy to write for you if you don't know how to. I should still have that code lying around somewhere anyway; I used it to compile this frequency list of the words used in Sina Weibo messages. Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built.
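The frequency-counting step Daan mentions really is only a few lines of code. Here is a minimal sketch in Python 3 (not Daan's actual script; the file names segmented.txt and frequencies.txt are assumptions):

# Count word frequencies in a whitespace-segmented, UTF-8 encoded file.
from collections import Counter

counts = Counter()
with open("segmented.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())  # the segmenter separates words with spaces

with open("frequencies.txt", "w", encoding="utf-8") as out:
    for word, n in counts.most_common():  # most frequent words first
        out.write(f"{word}\t{n}\n")

Running this after the segmenter produces a tab-separated list sorted by frequency, which can be opened in a spreadsheet or any text editor.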
BertR Posted May 5, 2012 at 11:41 AM Hi Daan, That's a really interesting project. I was wondering whether it is possible to download the complete frequency list at once?
yaokong Posted May 5, 2012 at 12:22 PM Daan, thanks for the answer. Unfortunately I'm running the script on a Windows netbook with about 1 GB of free RAM. I first tried a whole book (a few hundred kB), then just a short chapter (6 kB), but still couldn't run the segmenter. I have no access to more powerful computers (with more memory), so I am a bit lost :-( Is there no other program which needs less RAM? I could wait days until it's done calculating :-) It could also be a Mac app, but my Mac also has only 1-1.2 GB of free memory.
icebear Posted May 5, 2012 at 12:30 PM I use this tool quite a bit (for articles): http://www.zhtoolkit.com/apps/wordlist/create-list.cgi Elsewhere on that website you'll find a link to a downloadable executable which can handle longer texts (books).
yaokong Posted May 5, 2012 at 01:51 PM icebear, what does it do with words that aren't in the dictionaries (see my earlier post)? Anyway, I'll give it a try, thanks for the tip!
icebear Posted May 5, 2012 at 05:17 PM I'd guess the online version either drops them or lists them as having zero frequency (see the end of the list). It will still list their characters... What you can do is download the executable file from that website and then add a dictionary specific to the type of material you are reading - it's a very easy process (putting a tab-delimited UTF-8 dictionary text file into a folder).
Daan Posted May 5, 2012 at 07:14 PM BertR, sure - just go to the "open access" page and download the words file. yaokong, could you attach that file to a post here? I'll see whether I can get it to work with just 1GB of RAM. It may help if I tweak the settings a bit.
yaokong Posted May 6, 2012 at 08:46 AM icebear, that's exactly the point: no dictionary that I know of contains many of the rare expressions and transliterated terms and names (from Sanskrit, Pali, Tibetan etc.) in these documents. That is the main reason I'd like to extract them in the first place. I could then research them when I have the time, add them as dictionary entries to Pleco, and later read the texts (even away from all research possibilities), knowing what these terms mean and recognizing the (sometimes bizarre) transliterations of names. Daan, thanks! Please try the following (rather large) book (you can cut out a smaller piece of it). Let me know how it went, and whether you have any ideas how I could do it by myself in the future (on my Windows netbook or rather old MacBook, both with only 2GB of memory). http://www.sendspace.com/file/4d2rpl or uploading.com
Daan Posted May 6, 2012 at 10:03 AM Hmm. Here's what I did: I converted that file from GB encoding to UTF-8 using Notepad++. I called the converted file "buddhist.utf8.txt" and ran "segment.bat ctb buddhist.utf8.txt UTF-8 0 1> segmented.txt" from the Windows command prompt in a directory that contained both the decompressed word segmenter and the file "buddhist.utf8.txt". I didn't change the default settings, that is, allocating 2GB of RAM to the segmenter. This completed in 38 seconds on my Intel Core i5-2430M (2.4GHz) system, running Windows 7. The segmented text was written to a file called "segmented.txt" in that same directory.

I tried lowering the memory limit to 800MB, which should work for you. That significantly increased the running time, from 38 seconds to slightly over 5 minutes, but I hope that shouldn't be a problem. Here's how to lower the memory limit: open "segment.bat" in Notepad++ and go to line 51, where it says "java -mx2g ...". Change this into "java -mx800m ..." (and don't touch anything else). Now if you run the same command, the segmenter will never try to use more than 800MB of memory. You can experiment with other settings, though if you go any lower than 400-500MB I doubt it will still run. But you'll just get an "out of memory" error, after which you can try again with slightly more memory. If you want to use the Peking University training data rather than the Chinese Treebank model I used, just replace "ctb" with "pku" in the command. To change the memory allocated to a PKU session, go to line 41 in segment.bat and make the same change.

After the segmentation's done, you'll have a file called "segmented.txt" which will be UTF-8 encoded and will contain spaces between words, so it should then be easy to calculate word frequency statistics. There seem to be some freeware tools around for this, which I haven't tried - I just ran a little script. I'm attaching a ZIP file containing the segmented text of your book, as well as a CSV file with the word frequencies that my script calculated. It's probably easier for you to try some of these freeware tools than to install the environment you'd need to run that script on Windows, but if you can't find any good tools, let me know and I'll put together a tutorial on how to get my script working.

By the way, you can also supply additional lists of words to the word segmenter. So if you've looked through this particular frequency list and determined that some words were correctly segmented while others weren't, you can create a new text file containing the correct segmentations for these words. For example, you could write 阿弥陀佛 and 释迦牟尼 and save that to a file called "extrawords.txt". Now if you want to tell the segmenter that, in addition to the words it already knows and the heuristics for unknown words, it should also be aware that these words exist, all you need to do is edit "segment.bat" again and look for the same lines where you edited the memory allocation. A bit further down on the same lines, you should see "-serDictionary data/dict-chris6.ser.gz". Change this into "-serDictionary data/dict-chris6.ser.gz,extrawords.txt" and the segmenter should pick this up.

So if you should ever come across a list of Buddhist terminology, such as this one or this one, you can just put these terms in an extra dictionary, which should lead to a considerable increase in accuracy for these words. Hope this helps a bit! Buddhist.zip
Daan Posted May 6, 2012 at 10:18 AM Here's a text file with all the words from the second dictionary I linked to. There's a link to an XML file containing the dictionary on that web site, but that link was dead, unfortunately, so I wrote a quick script to get the words from the HTML file instead. I haven't checked the entire file manually, so there may be some errors, but I think this should do nicely. It's got 16,312 Buddhist terms, so if you use this as an extra dictionary, you should be in good shape. The only problem is that this is in traditional characters and your texts are in simplified characters, so you'll want to convert this file into simplified characters before using it. But I guess there are plenty of tools available online to help you do that. The dictionary is available under a Creative Commons license, so redistributing this list here should be fine. Someone even turned it into a custom Pleco dictionary, including Pinyin... you may want to write to him to see if you can get that file; it should be useful. Happy studies! buddhistdict.txt
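If a scripted conversion is more convenient than an online tool, one option is the OpenCC library. A minimal sketch in Python, assuming a Python OpenCC package (such as opencc-python-reimplemented) is installed; the output file name is made up:

# Convert a traditional-character word list to simplified characters with OpenCC.
# Assumes the opencc package is installed (e.g. pip install opencc-python-reimplemented).
from opencc import OpenCC

cc = OpenCC("t2s")  # traditional-to-simplified conversion profile
with open("buddhistdict.txt", encoding="utf-8") as f, \
     open("buddhistdict.simplified.txt", "w", encoding="utf-8") as out:
    for line in f:
        out.write(cc.convert(line))

Since a handful of traditional characters map to more than one simplified form, it is worth spot-checking the converted list before feeding it to the segmenter.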
Silent Posted May 6, 2012 at 11:22 AM "that's exactly the point: no dictionary that I know of contains many rare expressions and transliterated terms and names (from Sanskrit, Pali, Tibetan etc.) in these documents." I use the tool mentioned before at http://www.zhtoolkit...rd%20Extractor/ and it gives the characters and shows "unknown" instead of a translation when it doesn't know the character. You still get the frequency. I first ran it on a segment of your text and a quick scan didn't reveal any unknowns. Then I ran it on the entire text, which took a while, and it came up with only 13 unknowns, most of them very low frequency. The tool does however apply a filter, as Western script and special characters are not shown. So there is a possibility that rare characters are not recognised as Chinese and filtered out. To me, at first sight, however, it looks like it does the trick.
Daan Posted May 6, 2012 at 12:11 PM Silent, that's an interesting tool. I hadn't seen it before, thanks for sharing the link. Unfortunately, the documentation states that: "The method for segmenting the words is simply to find the longest match within the dictionaries loaded into the program. This works well in general, but fails for some character combinations by splitting in the wrong place. It also fails to make words out of terms that aren't in the dictionary, and instead treats these as a series of single characters." I'm afraid that won't work very well if you're segmenting text with lots of terms that aren't in CC-CEDICT, which is the dictionary this tool uses by default. If the documentation is correct, unknown words will not be recognised by this tool - it'll just split them into single characters, which are indeed likely to be in CC-CEDICT and will thus be considered known. The Stanford word segmenter actively tries to recognise unknown words, and it's also a bit smarter in that it doesn't simply look for the longest match: it considers the entire context when deciding how to segment a bit of text. I haven't tried the tool myself, though, so perhaps the docs are outdated.
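To illustrate the difference, greedy longest-match segmentation is simple enough to sketch in a few lines of Python. This is a toy illustration of the general approach the documentation describes, not the tool's actual code:

# Toy example of greedy longest-match ("maximum matching") segmentation.
# Words missing from the dictionary inevitably fall apart into single characters.
def longest_match_segment(text, dictionary, max_len=8):
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible dictionary match starting at position i.
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in dictionary:
                words.append(text[i:i + length])
                i += length
                break
        else:
            # No multi-character match found: emit a single character.
            words.append(text[i])
            i += 1
    return words

print(longest_match_segment("阿弥陀佛说法", {"说法"}))  # ['阿', '弥', '陀', '佛', '说法']

Because each decision is purely local, this approach cannot discover a word that is missing from its dictionary, which is exactly the limitation Daan points out.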
yaokong Posted May 13, 2012 at 03:01 PM Daan, thanks a lot, with your help I could now run the Stanford segmenter. I will have to experiment with it though; for some strange reason it does not recognize quite a few words, even ones which appear many, many times in the document. Also thanks for the link to the Soothill dictionary; I had that already, and in fact converted the DDB wordlist into another Pleco-friendly dictionary, available on that same forum. I will try the segmenter again, using all these wordlists, plus my self-created ones. Silent, as Daan already explained, we were not talking about unknown characters but about words (bigrams, n-grams, character combinations, that is), such as 明点, which often appears in the document I attached, is not recognized as a word by most programs, and appears in only a very few dictionaries. But I have made contact with the developer of that tool, and I will also play around with it (using additional wordlists, of course). So, thanks everybody for the help! I will experiment with these tools!
Daan Posted May 14, 2012 at 12:18 PM You're welcome. Let us know how you get on - this is bound to be interesting for people trying to read texts containing lots of specialised vocabulary, not necessarily Buddhist words but also, say, tech terms.
cababunga Posted May 30, 2012 at 04:09 PM Not sure if that's exactly what you need, but here I prepared a list of N-grams from your text. The attached file contains all N-grams 1 to 8 characters long which appear in the text more than four times. The right column is the number of occurrences. The encoding is UTF-8. 狂密与真密-n-grams.txt
yaokong Posted June 1, 2012 at 02:06 AM cababunga, well, thank you very much, I think that is the best list I have got so far, and I have tried quite a few approaches. Please be so kind as to let us know how you generated it. What I would finally like to achieve, and I really think this would be beneficial for many (or even most) learners, is an easy method to produce such document-based word frequency lists and have them automatically compared to one's own character/word learning progress, e.g. based on flashcard testing statistics. This way, when encountering a new document/book/movie (subtitle), one could study the most common (yet unlearned) words before trying to read the given text. I have practically no knowledge of programming, so putting this idea into practice would have to rely on others who do :-) By the way, the link in your comment's footnote is actually two links; I don't know if that was your intention.
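The comparison step yaokong describes is also straightforward to sketch. Assuming a frequency list in tab-separated word-then-count form (like the files posted above) and a plain one-word-per-line export of already-learned words, a rough Python sketch (the file names are hypothetical):

# Filter a word-frequency list against a list of already-learned words,
# leaving only frequent but not-yet-learned words to study.
# File names and formats are assumptions: word<TAB>count per line in the
# frequency list, one known word per line in the flashcard export.
def load_known(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

known = load_known("known_words.txt")  # hypothetical flashcard export

with open("frequencies.txt", encoding="utf-8") as f, \
     open("to_study.txt", "w", encoding="utf-8") as out:
    for line in f:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and word not in known:
            out.write(f"{word}\t{count}\n")

The resulting to_study.txt keeps the original frequency order, so the most common unlearned words appear first.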
cababunga Posted June 2, 2012 at 09:46 AM It's really all DIY stuff, and I don't see how I can make it easy to use without spending a lot of time, which I don't have these days. You can still try it if you want. This file is a Python 2 script that does the n-gram extraction. Given large input it will produce a series of large files named freq-01.txt, freq-02.txt and so on. The files are unordered and are supposed to be sorted alphabetically, merged, and then multiple entries for the same n-gram reduced to one. Then you have to get rid of infrequent entries and sort the result by n-gram frequency. The tool was originally used for processing a large corpus. For small inputs like yours, you will just have one small file, which makes your life much easier. Just sort it by frequency and truncate it at some point. On a Unix-like system you can do that by running: sort -rnk2 freq-01.txt | head -20000 > n-gram.txt
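cababunga's actual script is the one attached to his post. For readers who want something self-contained, the basic idea (count every character n-gram of length 1 to 8 and keep those occurring more than four times) can be roughly reimplemented like this in Python 3; this is not the attached script, and the file names are made up:

# Rough sketch of the n-gram counting idea described above.
import re
from collections import Counter

with open("input.txt", encoding="utf-8") as f:
    text = f.read()

# Split on anything that is not a CJK character, so n-grams don't span punctuation.
chunks = re.split(r"[^\u4e00-\u9fff]+", text)

counts = Counter()
for chunk in chunks:
    for n in range(1, 9):                      # n-gram lengths 1 to 8
        for i in range(len(chunk) - n + 1):
            counts[chunk[i:i + n]] += 1

with open("n-grams.txt", "w", encoding="utf-8") as out:
    for gram, c in counts.most_common():       # most frequent first
        if c > 4:                              # keep n-grams occurring more than four times
            out.write(f"{gram}\t{c}\n")

Unlike cababunga's corpus-scale tool, this keeps all counts in memory, which is fine for a single book but not for a large corpus.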