m.ellison Posted November 28, 2005 at 09:37 AM Does anyone know whether the A, B, C and D wordlists for the HSK test are legally available in softcopy (e.g. a .txt file)?
roddy Posted November 28, 2005 at 11:47 AM Assuming the ones scattered around the internet aren't 'legal', no. You get CD-ROMs attached to some HSK books, but I doubt they have the list in an easily accessible format. And assuming 'legally' means from the copyright holder or a licensee, I think it's very unlikely they would make the list available in such an easily copyable format. Roddy
笨笨德 Posted November 28, 2005 at 01:34 PM There is no reason why the ones on the internet would be illegal if someone has actually typed them out themselves. After all, they are just words. Perhaps if the file itself had been copied, then it would be a breach of copyright.
hakkaboy Posted November 28, 2005 at 02:32 PM Roddy's list is legal - he typed it himself. No one can copyright a list of words as such, only the presentation etc.
roddy Posted November 28, 2005 at 02:54 PM I've actually wondered about that though - a list, sure. But these lists are only valuable because of the information they contain - the gradings into four groups, and maybe also the part-of-speech info and English glosses - although what there is of that in my database came from ADSO. So it's not a simple list - it's a graded list. Where does that stand? Roddy
trevelyan Posted November 28, 2005 at 04:12 PM Law varies country by country. The European Union, Canada and many other countries offer significant protection for sweat-of-the-brow compilations, which means that you can get in serious trouble for copying compilations of materials even if individual entries are not copyrightable in and of themselves. American law is more permissive in specifying that facts are not copyrightable in compiled form, but the litmus test involved a phone book. American law does provide protection for creative compilations. English-language dictionary definitions are copyrightable, and there is clearly a higher element of creativity involved when one starts straddling languages and explaining foreign words and concepts. Assuming you live in the United States, you're probably safe using lists of unambiguous proper nouns, but would be on difficult ground for everything else.
We're trying to be conservative with Adso for exactly these reasons. The Linguistic Data Consortium normally charges an arm and a leg for their stuff and makes it available only to private subscribers at significant cost, but even they released a list containing some CEDICT material free of charge under the CEDICT licence. This suggests that CEDICT will hold up in court. And if it does, any other bilingual wordlist is almost certainly protected as well. The LDC also has other, more elaborate lists it makes available for thousands of dollars. If it were legal to dump those lists and redistribute them, you'd be able to find them on the Internet. The fact that you can't is probably answer enough to your question.
So be careful. You're probably best off doing this sort of thing from scratch or joining one of the projects already doing this.
trevelyan Posted November 28, 2005 at 04:28 PM Then again, the HSK stuff is pretty basic, so the definitions probably won't be an issue. Why not write to Beida and ask for the list of all vocab?
hakkaboy Posted November 28, 2005 at 05:53 PM Well, what is China's copyright law? Does anyone have detailed knowledge of that?
atitarev Posted November 28, 2005 at 09:29 PM I've seen lots of lists for the Japanese JLPT (Japanese Language Proficiency Test) available for download in many places. You can get graded lists of Chinese characters from the Wenlin software, ordered by frequency, with readings, pronunciations and examples in two forms (traditional and simplified). E.g., just select the first 1,000 and study. IMO, those HSK lists should be available for free download. The first 20 most frequent characters from Wenlin:
1 的 [de] (grammatical particle) [dì] 目的 mùdì goal [dí] 的确 [dī] cab
2 一 [yī] one; 一定 certain; 一样 same; 一些 some
3 是 [shì] to be
4 不 [bù] not [bú]
5 了 [le] (particle) [liǎo] 了解 comprehend [liào] (=瞭) [liāo] [liáo]
6 人 [rén] person; 人类 rénlèi humankind; 有人吗? anybody here?
7 在 [zài] at; 现在 xiànzài now; 存在 cúnzài exist
8 我 [wǒ] I, me; 我们 wǒmen we
9 有 [yǒu] have; there is; 没有 haven't; 有的 some [yòu] (=又)
10 中 [zhōng] middle; in; 中国 Zhōngguó China [zhòng] hit (a target)
11 这(F這) [zhè] [zhèi] this
12 大 [dà] big; 大家 dàjiā everybody [dài] 大夫 dàifu doctor
13 国(F國) [guó] (国家) country; 中国 China; 美国 USA
14 上 [shàng] over; top; (go) up; last, previous [shǎng] 上声 [shang]
15 个(F個) [gè] [ge] (measure word); 个人 personal [gě] 自个
16 来(F來) [lái] come; 起来 get up; 原来 it turns out [lai]
17 他 [tā] he, him; she, her; it; (其他 qítā) other
18 为(F為) [wèi] for, on account of [wéi] be, become
19 到 [dào] to, towards, until
20 地 [dì] earth [de] -ly (adverbial particle)
...
Celso Pin Posted November 28, 2005 at 10:17 PM Frequency analysis can be found at http://lingua.mtsu.edu/chinese-computing/statistics/index.html
m.ellison Posted November 29, 2005 at 09:30 AM Actually, I did find http://hmarty.free.fr/hanzi/, which has the lists.
johnmck Posted November 29, 2005 at 10:00 AM Wow! Very nice.
roddy Posted November 29, 2005 at 12:29 PM Can I ask what made you decide that those particular ones are legal? Roddy
stephanhodges Posted November 29, 2005 at 01:48 PM Their links page says they got the HSK lists from this forum. They call it the "HSK database". I'm not saying whether that's legal or not.
roddy Posted November 29, 2005 at 02:08 PM The thieves! Wait till I get my lawyer!!!!!! I don't think they did get that from here, and if they did, they've added a ton of further work on top. Not that it matters anyway; it's not like I am doing anything with them. I'm not sure about how they separate characters and words for the HSK lists though - they seem to do it just by putting all one-character words under 'characters' and everything else under 'words', which is a bit misleading. 马 and 用 are words in their own right, but you won't find them in the word list, while 初 appears on the word list in 最初, but not in the character list. I was trying to work around this at one point by having separate indexes for characters and words, but characteristically I forgot about the project and started something else (I forget what now). I think I've got Excel files for both 字 and 词 for the first level somewhere. Roddy
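For anyone wanting to try the separate-indexes idea from the post above, here is a minimal Python sketch. It is not Roddy's code or the hmarty site's method; the function name `build_indexes` and the five sample entries are made up for illustration and are not taken from the actual HSK lists. The idea is simply that one-character words stay in the word index, and the character index is built from every character that occurs in any word.

```python
from collections import defaultdict

# Hypothetical sample entries for illustration; not the real HSK lists.
SAMPLE_WORDS = ["马", "用", "最初", "中国", "马上"]


def build_indexes(words):
    """Return (word_index, char_index) for a list of words.

    word_index holds every entry, including one-character words.
    char_index maps each character to the sorted words that contain it.
    """
    word_index = set(words)
    char_to_words = defaultdict(set)
    for word in words:
        for char in word:
            char_to_words[char].add(word)
    char_index = {c: sorted(ws) for c, ws in char_to_words.items()}
    return word_index, char_index


word_index, char_index = build_indexes(SAMPLE_WORDS)
print("马" in word_index)  # True: a one-character word stays in the word index
print(char_index["初"])    # ['最初']: the character is reachable via its word
```

With this layout 马 and 用 show up in both indexes, and 初 shows up in the character index even though it only occurs inside 最初. Whether a real list should also grade characters separately from the words they appear in is a design question this sketch does not address.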