ReubenBond Posted June 16, 2016 at 02:08 AM

I spent a fair bit of time scouring the Web looking for openly licensed / freely available resources in order to make HanBaoBao, my language learning/assistance app. Here I want to share some of those resources with others. Note that they are not all free or usably licensed. If you have other resources, please contribute. I'm far from a professional - my Chinese isn't even good; I'm very new to the language.

Dictionary Data:
- CC-CEDICT (https://cc-cedict.org/wiki/) - the main dictionary used by most free dictionary apps; it's very good.
- Adso (https://github.com/wtanaka/adso) - another free dictionary (check the license). I believe it's primarily intended for machine translation rather than human consumption. Particularly good for part-of-speech (PoS) tagging information.
- Nan Tien Institute (NTI) Buddhist Dictionary (https://github.com/alexamies/buddhist-dictionary) - based on CC-CEDICT, but adds many PoS tags, definitions, topics, and categories. An example row: 19897 美式咖啡 \N Měishì kāfēi cafe Americano set phrase 饮料 Beverage 饮食 Food and Drink \N \N \N \N \N 19897
- Unihan (http://www.unicode.org/charts/unihan.html and http://www.unicode.org/reports/tr38/) - the Unihan database produced by the Unicode Consortium provides brief English definitions (note that it contains only character data, no words made of more than one character) but is more commonly used as a character reference, with information such as stroke counts, simplified <-> traditional mappings, pinyin, and dictionary cross-references.
- StarDict Dictionaries (http://download.huzheng.org/zh_CN/) - even though the site claims that these dictionaries are GPL, I doubt it. Be wary of these.
- Lingoes Dictionaries (http://www.lingoes.net/en/dictionary/index.html) - I cannot vouch for the license of any of these dictionaries.
- Wiktionary (https://dumps.wikimedia.org/zhwiki/latest/ for dumps) - Wiktionary is only semi-structured data and therefore would require some processing to make it usable as a translation dictionary.
- Linguistic Data Consortium (LDC) Chinese-English Translation Lexicon (https://catalog.ldc.upenn.edu/LDC2002L27) - I don't believe this dictionary is freely usable, but it's worth noting its existence.
- A List of Chinese Names (http://technology.chtsai.org/namelist/) - this list of over 700K unique Chinese names was compiled from the Joint College Entrance Examination (JCEE) in Taiwan. I'm not certain how representative the names are of the greater Chinese population, but it may be useful information regardless.
- CC-Canto (cantonese.org)
- Linguistic Data Consortium (LDC) Chinese-to-English & English-to-Chinese Wordlists (https://github.com/ReubenBond/LdcChineseWordlists, originally from https://web.archive.org/web/20130107081319/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm)

Sentence Examples:
- Tatoeba (https://tatoeba.org/eng/downloads) - I haven't actually put the Tatoeba sentences to good use yet. To be honest, quite a few would need filtering and touching up. Some sentences are just strange, some are quite vulgar, some seem to be extracts from books, but most are earnest and good.
- Jukuu (http://www.jukuu.com/help/hezuo.htm) - has a large data set, but it's only accessible as a Web service as far as I'm aware. They seem to be open to collaborative partnerships, however.

Audio:
- Shtooka Project (http://shtooka.net/index.php) - online collection of pronunciations for many thousands of words in multiple languages, including over 9,000 Chinese words.
- Forvo (http://forvo.com/languages/zh/) - Forvo is paid.
- Speak Good Chinese (http://www.speakgoodchinese.org/wordlists.html) - fairly small data set of pronunciations for individual syllables.

HSK & Other Word Lists:
- Popup Chinese (http://www.popupchinese.com/hsk/test)
- hskhsk.com (http://www.hskhsk.com/word-lists.html)
- Wiktionary: https://en.wiktionary.org/wiki/Appendix:HSK_list_of_Mandarin_words
- TOCFL Word List (http://www.sc-top.org.tw/english/download.php)

Word/Character Frequency & Corpus:
Word and character frequency data is useful for performing text segmentation (中文分词), as in HanBaoBao. Text segmentation will never be 100% accurate, particularly when performed on a mobile device, so you will most likely want to include some option to show users the alternatives. In HanBaoBao, users can tap a word multiple times to split it or join it with its neighbors (but only if there's a dictionary definition for that word). The way this works internally is by 'banning' the span of characters which you tap. Once all possibilities are banned, I remove all the bans and the cycle repeats. I use a weighted directed acyclic graph of the valid segmentation paths through the sentence and determine the most probable sentence based on that graph, removing the 'banned' spans (a rough sketch of this approach is included just after this post). In order to speed things up (it's a slow process), I pre-split the input on punctuation and process each split separately. This could be optimized more, but it's within acceptable performance bounds for now.

Frequency data also helps with sorting definitions so that the most relevant definitions come first. The well-established dictionary apps almost certainly do a better job in the relevance department, and I haven't put much work into that yet. It's worth noting that much of this data cannot be trusted to be accurate: text segmentation software is often used to segment each corpus, so there's potential for a positive feedback loop.

- OpenSubtitles 2016 data set (http://opus.lingfil.uu.se/OpenSubtitles2016.php) - a huge corpus of auto-segmented subtitles (~8 GB uncompressed XML)
- Leiden Weibo Corpus (http://lwc.daanvanesch.nl/openaccess.php)
- Jun Da (http://lingua.mtsu.edu/chinese-computing/)
- Jieba Analysis (https://github.com/huaban/jieba-analysis) - I'm not sure where their data comes from.
- Lancaster Corpus of Mandarin Chinese (http://www.lancaster.ac.uk/fass/projects/corpus/LCMC/)
- Chinese WordNet (http://lope.linguistics.ntu.edu.tw/cwn2/)
- SIGHAN Second International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2005/) - contains hand-segmented/verified texts (thanks to imron)
- SUBTLEX-CH (http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/)

Character Composition/Data:
- Make Me a Hanzi (https://skishore.github.io/makemeahanzi/) - very cool stroke animation tool and data.
- Wikipedia (https://commons.wikimedia.org/wiki/Commons:Chinese_characters_decomposition)
- CJKLib (https://github.com/cburgmer/cjklib)
- Unihan (see above) - contains some character composition information, such as stroke counts.
- CJKDecomp (http://cjkdecomp.codeplex.com/)

Miscellaneous:
- Remembering Simplified Hanzi (https://github.com/rouseabout/heisig)

Let me know if I've missed any useful data. I'd love to find similar resources for other languages, or English-to-Chinese (E-C) instead - so if you have any, please let me know.
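For readers who want to experiment, here is a minimal Python sketch of the weighted-DAG segmentation described above: candidate dictionary words form a graph over the sentence, the most probable path wins, and 'banned' spans are simply excluded from the candidates. The frequency table and example sentence are placeholders; this illustrates the general technique, not HanBaoBao's actual implementation.

    # Minimal sketch: most-probable-path segmentation over a DAG of candidate words,
    # with support for 'banned' character spans (placeholder data, not HanBaoBao's code).
    import math

    # Placeholder frequency dictionary: word -> raw count. A real app would load
    # counts from a corpus such as Jun Da's lists or SUBTLEX-CH.
    FREQ = {"我": 5000, "喜欢": 1200, "喜": 300, "欢": 200, "学": 900,
            "中文": 800, "学中文": 50, "中": 700, "文": 600}
    TOTAL = sum(FREQ.values())

    def segment(sentence, banned=frozenset(), max_word_len=4):
        """Return the most probable segmentation, skipping 'banned' (start, end) spans."""
        n = len(sentence)
        # best[i] = (log-probability, segmentation) of the best path covering sentence[:i]
        best = [(-math.inf, [])] * (n + 1)
        best[0] = (0.0, [])
        for i in range(n):
            if best[i][0] == -math.inf:
                continue
            for j in range(i + 1, min(i + max_word_len, n) + 1):
                if (i, j) in banned:
                    continue
                word = sentence[i:j]
                # Unknown single characters get a small floor count so paths never dead-end.
                count = FREQ.get(word, 1 if j - i == 1 else 0)
                if count == 0:
                    continue
                score = best[i][0] + math.log(count / TOTAL)
                if score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
        return best[n][1]

    print(segment("我喜欢学中文"))                    # -> ['我', '喜欢', '学', '中文']
    print(segment("我喜欢学中文", banned={(1, 3)}))   # ban '喜欢' -> ['我', '喜', '欢', '学', '中文']

Pre-splitting on punctuation, as the post describes, just means calling segment() on each punctuation-delimited chunk, which keeps n small and the loop fast.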
imron Posted June 16, 2016 at 02:24 AM

That's a good list of resources. For corpora, I'd also add those from the SIGHAN word segmentation bakeoff, which contains corpora from the following organizations:
- CKIP, Academia Sinica, Taiwan
- City University of Hong Kong, Hong Kong SAR
- Beijing University, China
- Microsoft Research, China
ReubenBond Posted June 16, 2016 at 02:59 AM

Thank you, imron, that data looks great! Am I mistaken in believing that the corpus data there is hand-segmented, and therefore a fairly reliable gold standard?
imron Posted June 16, 2016 at 04:00 AM

You are not mistaken. It has been hand-segmented (or at least hand-verified), and the data has been used in multiple competitions for Chinese segmentation, so one would hope that 'many eyes' would have caught most, if not all, remaining mistakes.

Regarding 'gold standard': there are multiple arguments for what constitutes a 'word' in Chinese, and different corpora have different standards/definitions. I don't think there's ever going to be one authoritative 'gold standard' for the entire language. However, I think it's fair to say that the segmented data you can download for each corpus is the 'gold standard' for the definition of 'word' as used by that corpus (the page above has links to a document for each corpus that defines the standard used for determining a 'word').
pross Posted June 16, 2016 at 12:13 PM

SUBTLEX-CH: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/
Remembering Simplified Hanzi: https://github.com/rouseabout/heisig
mikelove Posted June 16, 2016 at 03:24 PM

I should add our CC-Canto project here - cantonese.org, a CC-BY-SA-licensed Cantonese-to-English dictionary along with Cantonese readings for CC-CEDICT.

The old version of LDC (which we offer in Pleco) was made freely available (reluctantly) by LDC on account of its being derived from CEDICT. Not sure if it's still on their public website, but you can get it from archive.org. Stock Adso does have a lot of Pinyin issues; we've re-generated Pinyin in our version of it from better sources. Both dictionaries were IIRC problematic licensing-wise to build into an app - non-commercial restrictions make us nervous - but we offer them as separate add-on downloads. StarDict dictionaries are definitely suspect licensing-wise.

Re segmentation, as imron suggests, the lack of standardization among definitions of what constitutes a word is a big problem - bigger still for us because those standards also vary among dictionaries; some dictionaries even include entire phrases, and if one of those is an accurate match we probably want to use it rather than the individual words, as it's more likely to offer an accurate meaning. Our "intelligent segmentation" feature does something a bit similar to the graph you describe, but along with frequency and a little bit of grammatical analysis it also takes into account how many dictionaries believe a particular string of characters is in fact a word (with some extra weighting based on which dictionaries we trust more for determining that) - we're working on adding some other factors too.
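For illustration, a small Python sketch of the kind of dictionary-vote weighting described above: each candidate word's score combines corpus frequency with how many dictionaries list it, weighted by how much each dictionary is trusted. The dictionary names, weights, and data here are made up for the example, and the combination formula is arbitrary; this is not Pleco's actual scoring, just the general shape of the idea, which could slot into the scoring step of a segmenter like the sketch earlier in the thread.

    import math

    # Hypothetical per-dictionary trust weights (illustrative values only).
    DICT_WEIGHTS = {"cc-cedict": 1.0, "adso": 0.5, "nti": 0.7}

    # word -> set of dictionaries that contain it (placeholder data).
    DICT_MEMBERSHIP = {"学中文": {"adso"}, "中文": {"cc-cedict", "adso", "nti"}}

    FREQ = {"学中文": 50, "中文": 800}
    TOTAL = sum(FREQ.values())

    def candidate_score(word):
        """Combine log frequency with a weighted 'vote' from the dictionaries."""
        freq_term = math.log(FREQ.get(word, 1) / TOTAL)
        votes = sum(DICT_WEIGHTS[d] for d in DICT_MEMBERSHIP.get(word, ()))
        # The 0.5 multiplier is an arbitrary knob for this sketch.
        return freq_term + 0.5 * votes

    for w in ("学中文", "中文"):
        print(w, round(candidate_score(w), 3))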
wibr Posted June 16, 2016 at 03:56 PM

http://cjkdecomp.codeplex.com/ for decompositions. If anyone wants to do something useful with decompositions, I would also share the ones I use in my app; they are based on cjkdecomp but with plenty of corrections etc.

Edit: TOCFL word list - http://www.sc-top.org.tw/english/download.php

Nice list btw!
ReubenBond Posted June 16, 2016 at 11:29 PM

Thank you, mikelove, wibr, and pross for your input! I've updated the top post. I've uploaded the LDC lists to GitHub here: https://github.com/ReubenBond/LdcChineseWordlists to make them easier to find.
大块头 Posted June 17, 2016 at 03:39 AM

Wow! Bookmarked. Thank you for putting this together.
kangrepublic Posted June 18, 2016 at 01:23 PM

Yeah, thanks for putting all this together. Nice to have it all in one place.
大块头 Posted July 12, 2016 at 10:17 PM

Two links with useful resources: PC speak Chinese and PC speak Chinese #2.
Weyland Posted July 29, 2020 at 10:47 PM

Thank you for this. It's very helpful. On another note: Has anyone here dabbled in NLP (Natural Language Processing)?
mungouk Posted July 29, 2020 at 11:03 PM

Great thread, thanks for bumping it @Weyland
philwhite Posted July 30, 2020 at 02:01 PM

14 hours ago, Weyland said: On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

I use a free online speech-to-text service to create Anki flashcards from audio sources. I used the online service to generate JSON transcripts from the audio in batches of 100MB at a time. It is fairly accurate for complete sentences. I then use a GNU bash (sed/grep/gawk/ffmpeg) script to generate the .tsv file and .mp3 segments, all in one script for 100MB of audio and its JSON transcript (no need to manually run subs2srs).

I've also used Apache Lucene and C# to build a WinForms search application on a domain-specific English-language text database which I gathered.
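As a rough illustration of that pipeline (the original is a bash script, so this is a Python equivalent, not philwhite's actual code): it assumes the speech-to-text service returns a JSON list of segments with 'text', 'start', and 'end' fields, which will vary by service, and it shells out to ffmpeg to cut the clips. The file names are placeholders.

    # Rough sketch: turn a speech-to-text JSON transcript plus its audio file into
    # mp3 clips and an Anki-importable TSV. The transcript layout is assumed.
    import json
    import pathlib
    import subprocess

    AUDIO = "lesson.mp3"          # placeholder file names
    TRANSCRIPT = "lesson.json"
    OUT_DIR = pathlib.Path("clips")
    OUT_DIR.mkdir(exist_ok=True)

    with open(TRANSCRIPT, encoding="utf-8") as f:
        segments = json.load(f)   # assumed: [{"text": ..., "start": ..., "end": ...}, ...]

    with open("cards.tsv", "w", encoding="utf-8") as tsv:
        for i, seg in enumerate(segments):
            clip = OUT_DIR / f"clip_{i:04d}.mp3"
            # Cut the segment out of the source audio without re-encoding
            # (stream copy cuts on keyframes, so boundaries may be slightly loose).
            subprocess.run(
                ["ffmpeg", "-y", "-i", AUDIO,
                 "-ss", str(seg["start"]), "-to", str(seg["end"]),
                 "-c", "copy", str(clip)],
                check=True)
            # One Anki note per line: sentence text, then the audio field.
            tsv.write(f"{seg['text']}\t[sound:{clip.name}]\n")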
Weyland Posted July 30, 2020 at 02:24 PM

@philwhite have you ever looked into spaCy (BERT) or ERNIE?
philwhite Posted July 31, 2020 at 03:49 PM

On 7/30/2020 at 3:24 PM, Weyland said: have you ever looked into spaCy (BERT) or ERNIE?

Thanks for that. spaCy looks interesting for POS tagging and NER. Not sure it would help with my speech-to-text projects except, perhaps, for flagging incorrect transcriptions. Ideally, the neural speech-to-text services should be using knowledge from BERT to improve their neural nets and the accuracy of their transcriptions. After all, BERT is trained to fill in a gap in text with the most likely word.

Did you have a specific project or purpose in mind for NLP? I'd read a little about BERT and ERNIE. BERT has been a real game-changer, it seems, though I don't know how much training has been done on Chinese text.
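For anyone who wants to try spaCy's Chinese support for the POS tagging and NER mentioned above, a minimal example; it assumes spaCy is installed and the Chinese model has been downloaded with "python -m spacy download zh_core_web_sm".

    # Minimal spaCy example: POS tags and named entities for a Chinese sentence.
    import spacy

    nlp = spacy.load("zh_core_web_sm")
    doc = nlp("我明天要去北京学习中文。")

    for token in doc:
        print(token.text, token.pos_)    # word and coarse part-of-speech tag

    for ent in doc.ents:
        print(ent.text, ent.label_)      # named entities, e.g. 北京 as a place name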
Weyland Posted July 31, 2020 at 09:38 PM

5 hours ago, philwhite said: Did you have a specific project or purpose in mind for NLP?

Grade the top 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels). Here is another, more abundant, list of resources.
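As a rough sketch of that idea in Python: segment a text with jieba, look up each word's HSK level, and report coverage for a given learner level. The level table here is a tiny placeholder (a real grader would load the full HSK 3.0 word lists), and treating off-list words as harder than the top band is an assumption, not part of any standard.

    # Sketch: estimate a text's difficulty from the HSK levels of its words.
    import jieba

    HSK_LEVEL = {"我": 1, "喜欢": 1, "学习": 1, "中文": 1, "经济": 3, "政策": 5}  # placeholder
    UNKNOWN_LEVEL = 7   # assumption: off-list words count as above the top band

    def grade(text, learner_level):
        # Keep only tokens containing Chinese characters (drops punctuation).
        words = [w for w in jieba.lcut(text)
                 if any('\u4e00' <= c <= '\u9fff' for c in w)]
        levels = [HSK_LEVEL.get(w, UNKNOWN_LEVEL) for w in words]
        if not levels:
            return {"words": 0, "coverage": 1.0, "avg_level": 0.0}
        known = sum(1 for lv in levels if lv <= learner_level)
        return {"words": len(levels),
                "coverage": round(known / len(levels), 2),   # share the learner should know
                "avg_level": round(sum(levels) / len(levels), 2)}

    print(grade("我喜欢学习中文，但是经济政策很难。", learner_level=2))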
imron Posted August 1, 2020 at 02:12 AM

10 hours ago, philwhite said: though I don't know how much training has been done on Chinese text.

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).
philwhite Posted August 1, 2020 at 10:55 AM

8 hours ago, imron said: A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

Yes, one of the biggest breakthroughs in deep learning in recent years was Kaiming He's paper on deep residual nets, written while he was at Microsoft Research in China. Baidu Research in China and the US has published a lot. Unfortunately, my Chinese is not good enough to read the API documentation for Baidu's web service. In the UK, Prof Guo at Imperial is well known.

I was thinking specifically of BERT; I should have written "I don't know how much training of BERT has been done by Google on Chinese text."
philwhite Posted August 1, 2020 at 11:10 AM

13 hours ago, Weyland said: Grade the top 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels). Here is another, more abundant, list of resources.

Many thanks Weyland for the link to the list of resources. Unfortunately, my reading skills are nowhere near good enough for most of them.

Please forgive me, I'm slow on the uptake here. Are you talking about:
1. Grading common words (and grammar constructs) for difficulty from scratch, or
2. Grading texts for difficulty, given a grading of vocab and grammar constructs for difficulty?