yaokong Posted June 17, 2012 at 03:46 AM

Small update: catabunga's Python script works really well and is amazingly fast even on my Windows netbook. The only way I knew to sort the resulting file was with Excel, which was the slowest part of it all.

Now I wonder how to start using this sorted list for flashcard study. I'd love a way to tell Pleco (or any other flashcard app) to focus on the most frequent words from the top of the list, ignoring the items I have already learned (which can be seen from the flashcard app's internal statistics anyway). Maybe I'll just cut it into smaller parts, first using the most frequent 100 and so on. I'll give it a try and see how it works.
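A rough sketch of that "cut it into smaller parts" idea, in case it helps anyone trying the same thing. This is only an illustration, not catabunga's script: it assumes the sorted list is a UTF-8 file with one "word<TAB>count" entry per line, that the words you already know can be exported from the flashcard app as a plain text file with one word per line, and all the file names are made up.

# Pick the next batch of most frequent words that aren't already known,
# and write them to a file for import into a flashcard app.
BATCH_SIZE = 100

with open("known_words.txt", encoding="utf-8") as f:
    known = {line.strip() for line in f if line.strip()}

batch = []
with open("frequency_sorted.txt", encoding="utf-8") as f:
    for line in f:
        word = line.split("\t")[0].strip()
        if word and word not in known:
            batch.append(word)
        if len(batch) >= BATCH_SIZE:
            break

with open("next_batch.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(batch) + "\n")

Re-running it after updating known_words.txt would give the next slice of the list.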
Johnny20270 Posted August 22, 2013 at 04:09 PM

Old topic, but I saw this frequency list on the web. It's a word and character frequency list "based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words)". Seems like a smart way to compile a list, though it's worth noting the slant of the corpus.
Rowley Posted January 2, 2016 at 05:57 PM

Thanks for these resources! I am now using subtlex-ch and the two frequency lists available at http://corpus.leeds.ac.uk/list.html to decide which words to study next.

Quoting an earlier post: "Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built"

Thanks, I've downloaded it. It looks like there are some 400,000 missing lines? I've left a message on your contact form as well.

I'm trying to make the Stanford Segmenter work. I know almost nothing about these kinds of things, so it's not going smoothly. I've tried copying the files to my X:\Cygwin64__LinuxOnWindows\bin folder and running this in Cygwin, but I get these error messages. Editing segment.sh to change the memory requirements for Java, as suggested by Daan above, didn't seem to work.

$ segment.sh ctb test.simp.utf UTF-8
Usage: /usr/bin/segment.sh [ctb|pku] filename encoding kBest
  ctb : use Chinese Treebank segmentation
  pku : Beijing University segmentation
  kBest : print kBest best segmentations; 0 means kBest mode is off.
Example: /usr/bin/segment.sh ctb test.simp.utf8 UTF-8 0
Example: /usr/bin/segment.sh pku test.simp.utf8 UTF-8 0

Another attempt, with a "0" added at the end:

$ segment.sh ctb test.simp.utf UTF-8 0
(CTB): File: test.simp.utf Encoding: UTF-8
-------------------------------
Error occurred during initialization of VM
Could not reserve enough space for 2097152KB object heap
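For the counting and sorting step itself, Excel isn't needed once the segmenter runs. Assuming the segmenter's output is plain text with words separated by whitespace (the file names below are just placeholders), a few lines of Python will produce a frequency-sorted list directly.

# Turn segmented text (words separated by whitespace) into a
# frequency-sorted word list, highest count first.
from collections import Counter

counts = Counter()
with open("segmented.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("frequency_sorted.txt", "w", encoding="utf-8") as f:
    for word, count in counts.most_common():
        f.write(f"{word}\t{count}\n")

Counter.most_common() already returns the words in descending frequency order, so no separate sorting step is needed.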