Chinese-Forums

HSK wordlist softcopy?



Posted

Does anyone know whether the A, B, C and D wordlists for the HSK test are legally available in softcopy (e.g. as a .txt file)?

Posted

Assuming the ones scattered around the internet aren't 'legal', no. You get CD-ROMs attached to some HSK books, but I'd doubt they have the list in an easily accessible format. And assuming 'legally' means from the copyright holder, or licensee, I think it's very unlikely they would make the list available in such an easily copyable format.

Roddy

Posted

There is no reason why the ones on the internet would be illegal if someone has actually typed them out themselves. After all, they are just words. Perhaps if the file itself had been copied, then it would be a breach of copyright.

Posted

Roddy's list is legal - he typed it himself. No one can copyright a list of words as such, only the presentation etc.

Posted

I've actually wondered about that though - a list, sure. But these lists are only valuable because of the information they contain - the gradings into four groups and maybe also the part-of-speech info / English - although what there is of that in my database came from ADSO. However, it's not a simple list - it's a graded list. So where does that stand?

Roddy

Posted

Law varies country by country. The European Union, Canada and many other countries offer significant protection for sweat-of-the-brow compilations, which means that you can get in serious trouble for copying compilations of materials even if individual entries are not copyrightable in and of themselves.

American law is more permissive in specifying that facts are not copyrightable in compiled form, but the litmus test involved a phone book. American law does provide protection for creative compilations. English-language dictionary definitions are copyrightable, and there is clearly a higher element of creativity involved when one starts straddling languages and explaining foreign words and concepts. Assuming you live in the United States, you're probably safe using lists of unambiguous proper nouns, but would be on difficult ground for everything else.

We're trying to be conservative with Adso for exactly these reasons. The Linguistic Data Consortium normally charges an arm and a leg for their stuff and makes it available only to private subscribers at significant cost, but even they released a list containing some CEDICT material free of charge under the CEDICT licence. This suggests that CEDICT will hold up in court. And if it does, any other bilingual wordlist is almost certainly protected as well. The LDC also has other, more elaborate lists it makes available for thousands of dollars. If it were legal to dump those lists and redistribute them, you'd be able to find them on the Internet. The fact that you can't is probably answer enough to your question.

So be careful. You're probably best off doing this sort of thing from scratch or joining one of the projects already doing this.

Posted

Then again, the HSK stuff is pretty basic so the definitions probably won't be an issue. Why not write Beida and ask for the list of all vocab?

Posted

I saw lots of lists for the Japanese JLPT (Japanese Language Proficiency Test) available for download in many places.

You can get graded lists of Chinese characters from Wenlin software, ordered by frequency, with readings, pronunciations and examples in 2 forms (trad. and simplified). E.g., just select the first 1,000 and study.

IMO, those HSK lists should be available for free download.

The first 20 most frequent characters from Wenlin:

1 的 [de] (grammatical particle) [dì] 目的 mùdì goal [dí] 的确 [dī] cab

2 一 [yī] one; 一定 certain; 一样 same; 一些 some

3 是 [shì] to be

4 不 [bù] not [bú]

5 了 [le] (particle) [liǎo] 了解 comprehend [liào] (=瞭) [liāo] [liáo]

6 人 [rén] person; 人类 rénlèi humankind; 有人吗? anybody here?

7 在 [zài] at; 现在 xiànzài now; 存在 cúnzài exist

8 我 [wǒ] I, me; 我们 wǒmen we

9 有 [yǒu] have; there is; 没有 haven't; 有的 some [yòu] (=又)

10 中 [zhōng] middle; in; 中国 Zhōngguó China [zhòng] hit (a target)

11 这(F這) [zhè] [zhèi] this

12 大 [dà] big; 大家 dàjiā everybody [dài] 大夫 dàifu doctor

13 国(F國) [guó] (国家) country; 中国 China; 美国 USA

14 上 [shàng] over; top; (go) up; last, previous [shǎng] 上声 [shang]

15 个(F個) [gè] [ge] (measure word); 个人 personal [gě] 自个

16 来(F來) [lái] come; 起来 get up; 原来 it turns out [lai]

17 他 [tā] he, him; she, her; it; (其他 qítā) other

18 为(F為) [wèi] for, on account of [wéi] be, become

19 到 [dào] to, towards, until

20 地 [dì] earth [de] -ly (adverbial particle)

...

Posted

Their link page says they got the HSK lists from this forum. They call it the "HSK database".

I'm not saying that's legal or not.

Posted

The thieves! Wait till I get my lawyer!!!!!!

I don't think they did get that from here, and if they did they've added a ton of further work on top. Not that it matters anyway, it's not like I am doing anything with them.

I'm not sure about how they separate characters / words for the HSK lists though - they seem to do it just by putting all one-character words under 'characters', and everything else under 'words', which is a bit misleading. 马 and 用 are words in their own right, but you won't find them in the word list, while 初 appears on the word list in 最初, but not in the character list.

I was trying to work around this at one point by having separate indexes for characters and words, but characteristically I forgot about the project and started something else (I forget what now). I think I've got Excel files for both 字 and 词 for the first level somewhere.
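To make the separate-index idea concrete, here is a minimal Python sketch. It is only an illustration, not how any of the lists mentioned in this thread were actually built: the function name `build_indexes`, the `(word, level)` input format, and the levels attached to the sample words are all assumptions made up for the example. Every entry goes into the word index regardless of length, and every character of every entry goes into the character index at the lowest level where it appears.

```python
def build_indexes(entries):
    """Build separate word and character indexes from (word, level) pairs.

    Levels are assumed to be the letters "A" to "D", so plain string
    comparison orders them from easiest to hardest.
    """
    word_index = {}
    char_index = {}
    for word, level in entries:
        # Every entry is a word, even if it is only one character long.
        word_index[word] = level
        for ch in word:
            # Record each character at the lowest level where it is seen.
            if ch not in char_index or level < char_index[ch]:
                char_index[ch] = level
    return word_index, char_index


# Hypothetical sample data: these level assignments are invented.
sample = [("马", "A"), ("用", "A"), ("最初", "B")]
words, chars = build_indexes(sample)
```

With this layout 马 and 用 show up in both indexes, and 初 is in the character index because it occurs in 最初, even though it is not a word entry itself.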

Roddy
