yaokong Posted June 17, 2012 at 03:46 AM

Small update: catabunga's Python script works really well and is amazingly fast even on my Windows netbook. The only way I knew to sort the resulting file was with Excel, which was the slowest part of it all.

Now I wonder how to start using this sorted list for flashcard study. I'd love a way to tell Pleco (or any other flashcard app) to focus on the most frequent words from the top of the list, ignoring the items I have already learned (which can be seen from the flashcard app's internal statistics anyway). Maybe I'll just cut it into smaller parts, first using the most frequent 100 and so on. I'll give it a try and see how it works.
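A rough sketch of that "cut it into smaller parts" idea, in case it helps anyone trying the same thing. This is only an illustration, not catabunga's script: it assumes the sorted list is a UTF-8 file with one "word<TAB>count" entry per line, that the words you already know can be exported from the flashcard app as a plain text file with one word per line, and all the file names are made up.

# Pick the next batch of most frequent words that aren't already known,
# and write them to a file for import into a flashcard app.
BATCH_SIZE = 100

with open("known_words.txt", encoding="utf-8") as f:
    known = {line.strip() for line in f if line.strip()}

batch = []
with open("frequency_sorted.txt", encoding="utf-8") as f:
    for line in f:
        word = line.split("\t")[0].strip()
        if word and word not in known:
            batch.append(word)
        if len(batch) >= BATCH_SIZE:
            break

with open("next_batch.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(batch) + "\n")

Re-running it after updating known_words.txt would give the next slice of the list.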
Johnny20270 Posted August 22, 2013 at 04:09 PM

Old topic, but I saw this frequency list on the web. It's a word and character frequency list "based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words)". Seems like a smart way to compile a list, though it's worth noting the slant of the corpus.
Rowley Posted January 2, 2016 at 05:57 PM

Thanks for these resources! I am now using subtlex-ch and the two frequency lists available at http://corpus.leeds.ac.uk/list.html to decide which words to study next.

Quoting an earlier post: "Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built"

Thanks, I've downloaded it. It looks like there are some 400,000 missing lines? I've left a message on your contact form as well.

I'm trying to make the Stanford Segmenter work. I know almost nothing about these kinds of things, so it's not going smoothly. I've tried copying the files to my X:\Cygwin64__LinuxOnWindows\bin folder and running this in Cygwin, but I get these error messages. Editing segment.sh to change the memory requirements for Java, as suggested by Daan above, didn't seem to work.

$ segment.sh ctb test.simp.utf UTF-8
Usage: /usr/bin/segment.sh [ctb|pku] filename encoding kBest
  ctb : use Chinese Treebank segmentation
  pku : Beijing University segmentation
  kBest : print kBest best segmentations; 0 means kBest mode is off.
Example: /usr/bin/segment.sh ctb test.simp.utf8 UTF-8 0
Example: /usr/bin/segment.sh pku test.simp.utf8 UTF-8 0

Another attempt, with a "0" added at the end:

$ segment.sh ctb test.simp.utf UTF-8 0
(CTB): File: test.simp.utf Encoding: UTF-8
-------------------------------
Error occurred during initialization of VM
Could not reserve enough space for 2097152KB object heap
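For the counting and sorting step itself, Excel isn't needed once the segmenter runs. Assuming the segmenter's output is plain text with words separated by whitespace (the file names below are just placeholders), a few lines of Python will produce a frequency-sorted list directly.

# Turn segmented text (words separated by whitespace) into a
# frequency-sorted word list, highest count first.
from collections import Counter

counts = Counter()
with open("segmented.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("frequency_sorted.txt", "w", encoding="utf-8") as f:
    for word, count in counts.most_common():
        f.write(f"{word}\t{count}\n")

Counter.most_common() already returns the words in descending frequency order, so no separate sorting step is needed.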