ReubenBond Posted June 16, 2016 at 02:08 AM

I spent a fair bit of time scouring the Web looking for openly licensed / freely available resources in order to make HanBaoBao, my language learning/assistance app. Here I want to share some of those resources with others. Note that they are not all free or usably licensed. If you have other resources, please contribute. I'm far from a professional - my Chinese isn't even good; I'm very new to the language.

Dictionary Data:
- CC-CEDICT (https://cc-cedict.org/wiki/) - the main dictionary used by most free dictionary apps; it's very good.
- Adso (https://github.com/wtanaka/adso) - another free dictionary (check the license). I believe it's primarily intended for machine translation rather than human consumption. Particularly good for part-of-speech (PoS) tagging information.
- Nan Tien Institute (NTI) Buddhist Dictionary (https://github.com/alexamies/buddhist-dictionary) - based on CC-CEDICT, but adds many PoS tags, definitions, topics, and categories. An example row: 19897 美式咖啡 \N Měishì kāfēi cafe Americano set phrase 饮料 Beverage 饮食 Food and Drink \N \N \N \N \N 19897
- Unihan (http://www.unicode.org/charts/unihan.html and http://www.unicode.org/reports/tr38/) - the Unihan database produced by the Unicode Consortium provides brief English definitions (note that it contains only character data, no words made of more than one character) but is more commonly used as a character reference, with information such as stroke counts, simplified <-> traditional mappings, pinyin, and dictionary cross-references.
- StarDict Dictionaries (http://download.huzheng.org/zh_CN/) - even though the site claims that these dictionaries are GPL, I doubt it. Be wary of these.
- Lingoes Dictionaries (http://www.lingoes.net/en/dictionary/index.html) - I cannot vouch for the license of any of these dictionaries.
- Wiktionary (https://dumps.wikimedia.org/zhwiki/latest/ for dumps) - Wiktionary is only semi-structured data and therefore would require some processing to make it usable as a translation dictionary.
- Linguistic Data Consortium (LDC) Chinese-English Translation Lexicon (https://catalog.ldc.upenn.edu/LDC2002L27) - I don't believe this dictionary is freely usable, but it's worth noting its existence.
- A List of Chinese Names (http://technology.chtsai.org/namelist/) - this list of over 700K unique Chinese names was compiled from the Joint College Entrance Examination (JCEE) in Taiwan. I'm not certain how representative the names are of the greater Chinese population, but it may be useful information regardless.
- CC-Canto (cantonese.org)
- Linguistic Data Consortium (LDC) Chinese-to-English & English-to-Chinese Wordlists (https://github.com/ReubenBond/LdcChineseWordlists, originally from https://web.archive.org/web/20130107081319/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm)

Sentence Examples:
- Tatoeba (https://tatoeba.org/eng/downloads) - I haven't actually put the Tatoeba sentences to good use yet. To be honest, quite a few would need filtering and touching up. Some sentences are just strange, some are quite vulgar, some seem to be extracts from books, but most are earnest and good.
- Jukuu (http://www.jukuu.com/help/hezuo.htm) - has a large data set, but it's only accessible as a Web service as far as I'm aware. They seem to be open to collaborative partnerships, however.

Audio:
- Shtooka Project (http://shtooka.net/index.php) - online collection of pronunciations for many thousands of words in multiple languages, including over 9,000 Chinese words.
- Forvo (http://forvo.com/languages/zh/) - Forvo is paid.
- Speak Good Chinese (http://www.speakgoodchinese.org/wordlists.html) - fairly small data set of pronunciations for individual syllables.

HSK & Other Word Lists:
- Popup Chinese (http://www.popupchinese.com/hsk/test)
- hskhsk.com (http://www.hskhsk.com/word-lists.html)
- Wiktionary: https://en.wiktionary.org/wiki/Appendix:HSK_list_of_Mandarin_words
- TOCFL Word List (http://www.sc-top.org.tw/english/download.php)

Word/Character Frequency & Corpus:
Word and character frequency data is useful for performing text segmentation (中文分词), as in HanBaoBao. Text segmentation will never be 100% accurate, particularly when performed on a mobile device, so you will most likely want to include some option to show users the alternatives. In HanBaoBao, users can tap a word multiple times to split it or join it with its neighbors (but only if there's a dictionary definition for that word). The way this works internally is by 'banning' the span of characters which you tap. Once all possibilities are banned, I remove all the bans and the cycle repeats. I use a weighted directed acyclic graph of the valid segmentation paths through the sentence and determine the most probable sentence based on that graph, removing the 'banned' spans (a rough sketch of this approach is included just after this post). In order to speed things up (it's a slow process), I pre-split the input on punctuation and process each split separately. This could be optimized more, but it's within acceptable performance bounds for now.

Frequency data also helps with sorting definitions so that the most relevant definitions come first. The well-established dictionary apps almost certainly do a better job in the relevance department, and I haven't put much work into that yet. It's worth noting that much of this data cannot be trusted to be accurate: text segmentation software is often used to segment each corpus, so there's potential for a positive feedback loop.

- OpenSubtitles 2016 data set (http://opus.lingfil.uu.se/OpenSubtitles2016.php) - a huge corpus of auto-segmented subtitles (~8 GB uncompressed XML)
- Leiden Weibo Corpus (http://lwc.daanvanesch.nl/openaccess.php)
- Jun Da (http://lingua.mtsu.edu/chinese-computing/)
- Jieba Analysis (https://github.com/huaban/jieba-analysis) - I'm not sure where their data comes from.
- Lancaster Corpus of Mandarin Chinese (http://www.lancaster.ac.uk/fass/projects/corpus/LCMC/)
- Chinese WordNet (http://lope.linguistics.ntu.edu.tw/cwn2/)
- SIGHAN Second International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2005/) - contains hand-segmented/verified texts (thanks to imron)
- SUBTLEX-CH (http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/)

Character Composition/Data:
- Make Me a Hanzi (https://skishore.github.io/makemeahanzi/) - very cool stroke animation tool and data.
- Wikipedia (https://commons.wikimedia.org/wiki/Commons:Chinese_characters_decomposition)
- CJKLib (https://github.com/cburgmer/cjklib)
- Unihan (see above) - contains some character composition information, such as stroke counts.
- CJKDecomp (http://cjkdecomp.codeplex.com/)

Miscellaneous:
- Remembering Simplified Hanzi (https://github.com/rouseabout/heisig)

Let me know if I've missed any useful data. I'd love to find similar resources for other languages, or English-to-Chinese (E-C) instead - so if you have any, please let me know.
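For readers who want to experiment, here is a minimal Python sketch of the weighted-DAG segmentation described above: candidate dictionary words form a graph over the sentence, the most probable path wins, and 'banned' spans are simply excluded from the candidates. The frequency table and example sentence are placeholders; this illustrates the general technique, not HanBaoBao's actual implementation.

    # Minimal sketch: most-probable-path segmentation over a DAG of candidate words,
    # with support for 'banned' character spans (placeholder data, not HanBaoBao's code).
    import math

    # Placeholder frequency dictionary: word -> raw count. A real app would load
    # counts from a corpus such as Jun Da's lists or SUBTLEX-CH.
    FREQ = {"我": 5000, "喜欢": 1200, "喜": 300, "欢": 200, "学": 900,
            "中文": 800, "学中文": 50, "中": 700, "文": 600}
    TOTAL = sum(FREQ.values())

    def segment(sentence, banned=frozenset(), max_word_len=4):
        """Return the most probable segmentation, skipping 'banned' (start, end) spans."""
        n = len(sentence)
        # best[i] = (log-probability, segmentation) of the best path covering sentence[:i]
        best = [(-math.inf, [])] * (n + 1)
        best[0] = (0.0, [])
        for i in range(n):
            if best[i][0] == -math.inf:
                continue
            for j in range(i + 1, min(i + max_word_len, n) + 1):
                if (i, j) in banned:
                    continue
                word = sentence[i:j]
                # Unknown single characters get a small floor count so paths never dead-end.
                count = FREQ.get(word, 1 if j - i == 1 else 0)
                if count == 0:
                    continue
                score = best[i][0] + math.log(count / TOTAL)
                if score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
        return best[n][1]

    print(segment("我喜欢学中文"))                    # -> ['我', '喜欢', '学', '中文']
    print(segment("我喜欢学中文", banned={(1, 3)}))   # ban '喜欢' -> ['我', '喜', '欢', '学', '中文']

Pre-splitting on punctuation, as the post describes, just means calling segment() on each punctuation-delimited chunk, which keeps n small and the loop fast.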
imron Posted June 16, 2016 at 02:24 AM

That's a good list of resources. For corpora, I'd also add those from the SIGHAN word segmentation bakeoff, which contains corpora from the following organizations:
- CKIP, Academia Sinica, Taiwan
- City University of Hong Kong, Hong Kong SAR
- Beijing University, China
- Microsoft Research, China
ReubenBond Posted June 16, 2016 at 02:59 AM

Thank you, imron, that data looks great! Am I mistaken in believing that the corpus data there is hand-segmented, and therefore a fairly reliable gold standard?
imron Posted June 16, 2016 at 04:00 AM

You are not mistaken. It has been hand-segmented (or at least hand-verified), and the data has been used in multiple competitions for Chinese segmentation, so one would hope that 'many eyes' would have caught most, if not all, remaining mistakes.

Regarding 'gold standard': there are multiple arguments for what constitutes a 'word' in Chinese, and different corpora have different standards/definitions. I don't think there's ever going to be one authoritative 'gold standard' for the entire language. However, I think it's fair to say that the segmented data you can download for each corpus is the 'gold standard' for the definition of 'word' as used by that corpus (the page above has links to a document for each corpus that defines the standard used for determining a 'word').
pross Posted June 16, 2016 at 12:13 PM

SUBTLEX-CH: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/
Remembering Simplified Hanzi: https://github.com/rouseabout/heisig
mikelove Posted June 16, 2016 at 03:24 PM

I should add our CC-Canto project here - cantonese.org, a CC-BY-SA-licensed Cantonese-to-English dictionary along with Cantonese readings for CC-CEDICT.

The old version of LDC (which we offer in Pleco) was made freely available (reluctantly) by LDC on account of its being derived from CEDICT. Not sure if it's still on their public website, but you can get it from archive.org. Stock Adso does have a lot of Pinyin issues; we've re-generated Pinyin in our version of it from better sources. Both dictionaries were IIRC problematic licensing-wise to build into an app - non-commercial restrictions make us nervous - but we offer them as separate add-on downloads. StarDict dictionaries are definitely suspect licensing-wise.

Re segmentation, as imron suggests, the lack of standardization among definitions of what constitutes a word is a big problem - bigger still for us because those standards also vary among dictionaries; some dictionaries even include entire phrases, and if one of those is an accurate match we probably want to use it rather than the individual words, as it's more likely to offer an accurate meaning. Our "intelligent segmentation" feature does something a bit similar to the graph you describe, but along with frequency and a little bit of grammatical analysis it also takes into account how many dictionaries believe a particular string of characters is in fact a word (with some extra weighting based on which dictionaries we trust more for determining that) - we're working on adding some other factors too.
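For illustration, a small Python sketch of the kind of dictionary-vote weighting described above: each candidate word's score combines corpus frequency with how many dictionaries list it, weighted by how much each dictionary is trusted. The dictionary names, weights, and data here are made up for the example, and the combination formula is arbitrary; this is not Pleco's actual scoring, just the general shape of the idea, which could slot into the scoring step of a segmenter like the sketch earlier in the thread.

    import math

    # Hypothetical per-dictionary trust weights (illustrative values only).
    DICT_WEIGHTS = {"cc-cedict": 1.0, "adso": 0.5, "nti": 0.7}

    # word -> set of dictionaries that contain it (placeholder data).
    DICT_MEMBERSHIP = {"学中文": {"adso"}, "中文": {"cc-cedict", "adso", "nti"}}

    FREQ = {"学中文": 50, "中文": 800}
    TOTAL = sum(FREQ.values())

    def candidate_score(word):
        """Combine log frequency with a weighted 'vote' from the dictionaries."""
        freq_term = math.log(FREQ.get(word, 1) / TOTAL)
        votes = sum(DICT_WEIGHTS[d] for d in DICT_MEMBERSHIP.get(word, ()))
        # The 0.5 multiplier is an arbitrary knob for this sketch.
        return freq_term + 0.5 * votes

    for w in ("学中文", "中文"):
        print(w, round(candidate_score(w), 3))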
wibr Posted June 16, 2016 at 03:56 PM

http://cjkdecomp.codeplex.com/ for decompositions. If anyone wants to do something useful with decompositions, I would also share the ones I use in my app; they are based on cjkdecomp but with plenty of corrections etc.

Edit: TOCFL word list - http://www.sc-top.org.tw/english/download.php

Nice list btw!
ReubenBond Posted June 16, 2016 at 11:29 PM

Thank you, mikelove, wibr, and pross for your input! I've updated the top post. I've uploaded the LDC lists to GitHub here: https://github.com/ReubenBond/LdcChineseWordlists to make them easier to find.
大块头 Posted June 17, 2016 at 03:39 AM

Wow! Bookmarked. Thank you for putting this together.
kangrepublic Posted June 18, 2016 at 01:23 PM

Yeah, thanks for putting all this together. Nice to have it all in one place.
大块头 Posted July 12, 2016 at 10:17 PM

Two links with useful resources: PC speak Chinese and PC speak Chinese #2.
Weyland Posted July 29, 2020 at 10:47 PM

Thank you for this. It's very helpful. On another note: Has anyone here dabbled in NLP (Natural Language Processing)?
mungouk Posted July 29, 2020 at 11:03 PM

Great thread, thanks for bumping it @Weyland
philwhite Posted July 30, 2020 at 02:01 PM

14 hours ago, Weyland said: On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

I use a free online speech-to-text service to create Anki flashcards from audio sources. I used the online service to generate JSON transcripts from the audio in batches of 100MB at a time. It is fairly accurate for complete sentences. I then use a GNU bash (sed/grep/gawk/ffmpeg) script to generate the .tsv file and .mp3 segments, all in one script for 100MB of audio and its JSON transcript (no need to manually run subs2srs).

I've also used Apache Lucene and C# to build a WinForms search application on a domain-specific English-language text database which I gathered.
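As a rough illustration of that pipeline (the original is a bash script, so this is a Python equivalent, not philwhite's actual code): it assumes the speech-to-text service returns a JSON list of segments with 'text', 'start', and 'end' fields, which will vary by service, and it shells out to ffmpeg to cut the clips. The file names are placeholders.

    # Rough sketch: turn a speech-to-text JSON transcript plus its audio file into
    # mp3 clips and an Anki-importable TSV. The transcript layout is assumed.
    import json
    import pathlib
    import subprocess

    AUDIO = "lesson.mp3"          # placeholder file names
    TRANSCRIPT = "lesson.json"
    OUT_DIR = pathlib.Path("clips")
    OUT_DIR.mkdir(exist_ok=True)

    with open(TRANSCRIPT, encoding="utf-8") as f:
        segments = json.load(f)   # assumed: [{"text": ..., "start": ..., "end": ...}, ...]

    with open("cards.tsv", "w", encoding="utf-8") as tsv:
        for i, seg in enumerate(segments):
            clip = OUT_DIR / f"clip_{i:04d}.mp3"
            # Cut the segment out of the source audio without re-encoding
            # (stream copy cuts on keyframes, so boundaries may be slightly loose).
            subprocess.run(
                ["ffmpeg", "-y", "-i", AUDIO,
                 "-ss", str(seg["start"]), "-to", str(seg["end"]),
                 "-c", "copy", str(clip)],
                check=True)
            # One Anki note per line: sentence text, then the audio field.
            tsv.write(f"{seg['text']}\t[sound:{clip.name}]\n")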
Weyland Posted July 30, 2020 at 02:24 PM

@philwhite have you ever looked into spaCy (BERT) or ERNIE?
philwhite Posted July 31, 2020 at 03:49 PM

On 7/30/2020 at 3:24 PM, Weyland said: have you ever looked into spaCy (BERT) or ERNIE?

Thanks for that. spaCy looks interesting for POS tagging and NER. Not sure it would help with my speech-to-text projects except, perhaps, for flagging incorrect transcriptions. Ideally, the neural speech-to-text services should be using knowledge from BERT to improve their neural nets and the accuracy of their transcriptions. After all, BERT is trained to fill in a gap in text with the most likely word.

Did you have a specific project or purpose in mind for NLP? I'd read a little about BERT and ERNIE. BERT has been a real game-changer, it seems, though I don't know how much training has been done on Chinese text.
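For anyone who wants to try spaCy's Chinese support for the POS tagging and NER mentioned above, a minimal example; it assumes spaCy is installed and the Chinese model has been downloaded with "python -m spacy download zh_core_web_sm".

    # Minimal spaCy example: POS tags and named entities for a Chinese sentence.
    import spacy

    nlp = spacy.load("zh_core_web_sm")
    doc = nlp("我明天要去北京学习中文。")

    for token in doc:
        print(token.text, token.pos_)    # word and coarse part-of-speech tag

    for ent in doc.ents:
        print(ent.text, ent.label_)      # named entities, e.g. 北京 as a place name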
Weyland Posted July 31, 2020 at 09:38 PM

5 hours ago, philwhite said: Did you have a specific project or purpose in mind for NLP?

Grade the top 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels). Here is another, more abundant, list of resources.
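As a rough sketch of that idea in Python: segment a text with jieba, look up each word's HSK level, and report coverage for a given learner level. The level table here is a tiny placeholder (a real grader would load the full HSK 3.0 word lists), and treating off-list words as harder than the top band is an assumption, not part of any standard.

    # Sketch: estimate a text's difficulty from the HSK levels of its words.
    import jieba

    HSK_LEVEL = {"我": 1, "喜欢": 1, "学习": 1, "中文": 1, "经济": 3, "政策": 5}  # placeholder
    UNKNOWN_LEVEL = 7   # assumption: off-list words count as above the top band

    def grade(text, learner_level):
        # Keep only tokens containing Chinese characters (drops punctuation).
        words = [w for w in jieba.lcut(text)
                 if any('\u4e00' <= c <= '\u9fff' for c in w)]
        levels = [HSK_LEVEL.get(w, UNKNOWN_LEVEL) for w in words]
        if not levels:
            return {"words": 0, "coverage": 1.0, "avg_level": 0.0}
        known = sum(1 for lv in levels if lv <= learner_level)
        return {"words": len(levels),
                "coverage": round(known / len(levels), 2),   # share the learner should know
                "avg_level": round(sum(levels) / len(levels), 2)}

    print(grade("我喜欢学习中文，但是经济政策很难。", learner_level=2))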
imron Posted August 1, 2020 at 02:12 AM

10 hours ago, philwhite said: though I don't know how much training has been done on Chinese text.

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).
philwhite Posted August 1, 2020 at 10:55 AM

8 hours ago, imron said: A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

Yes, one of the biggest breakthroughs in deep learning in recent years was Kaiming He's paper on deep residual nets, written while he was at Microsoft Research in China. Baidu Research in China and the US has published a lot. Unfortunately, my Chinese is not good enough to read the API documentation for Baidu's web service. In the UK, Prof Guo at Imperial is well known.

I was thinking specifically of BERT; I should have written "I don't know how much training of BERT has been done by Google on Chinese text."
philwhite Posted August 1, 2020 at 11:10 AM

13 hours ago, Weyland said: Grade the top 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels). Here is another, more abundant, list of resources.

Many thanks Weyland for the link to the list of resources. Unfortunately, my reading skills are nowhere near good enough for most of them.

Please forgive me, I'm slow on the uptake here. Are you talking about:
1. Grading common words (and grammar constructs) for difficulty from scratch, or
2. Grading texts for difficulty, given a grading of vocab and grammar constructs for difficulty?