Chinese-Forums

Resources for developers



Posted

That's a good list of resources. For corpora, I'd also add those from the SIGHAN word segmentation bakeoff, which include corpora from the following organizations:

  • CKIP, Academia Sinica, Taiwan
  • City University of Hong Kong, Hong Kong SAR
  • Peking University, China
  • Microsoft Research, China
Posted

Thank you, Imron, that data looks great! Am I mistaken in believing that the corpus data there is hand-segmented, and therefore is a fairly reliable gold-standard?

Posted

You are not mistaken. It has been hand-segmented (or at least hand-verified), and the data has been used in multiple Chinese segmentation competitions, so one would hope that 'many eyes' would have caught most, if not all, of the remaining mistakes.

 

Regarding 'gold standard', there are multiple arguments for what constitutes a 'word' in Chinese and different corpora have different standards/definitions.

 

I don't think there's ever going to be one authoritative 'gold standard' for the entire language. However, I think it's fair to say that the segmented data you can download for each corpus is the 'gold standard' for the definition of 'word' as used by that corpus (the page above links to a document for each corpus that defines the standard used for determining a 'word').
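
As a rough illustration of how segmenter output is commonly scored against a hand-segmented reference like these: turn each segmentation into character spans and count the spans that match exactly. This is a minimal sketch, not the bakeoff's official scoring script; the function names and the example sentence are made up for illustration.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def score(gold_words, predicted_words):
    """Return (precision, recall, F1) for one sentence segmented two ways."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)  # words whose start AND end both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: a reference segmentation and a segmenter's output.
gold = ["我", "喜欢", "北京大学"]
pred = ["我", "喜欢", "北京", "大学"]
p, r, f = score(gold, pred)
```

Note that the same output could score differently against another corpus, since each corpus applies its own definition of 'word'.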

Posted

I should add our CC-Canto project here - cantonese.org, a CC-BY-SA-licensed Cantonese-to-English dictionary, along with Cantonese readings for CC-CEDICT.

 

Old version of LDC (which we offer in Pleco) was made freely available (reluctantly) by LDC on account of its being derived from CEDICT. Not sure if it's still on their public website but you can get it from archive.org. Stock Adso does have a lot of Pinyin issues, we've re-generated Pinyin in our version of it from better sources. Both dictionaries were IIRC problematic licensing-wise to build into an app - non-commercial restrictions make us nervous - but we offer them as separate add-on downloads. StarDict dictionaries definitely suspect licensing-wise.

 

Re segmentation, as imron suggests, the lack of standardization among definitions of what constitutes a word is a big problem - bigger still for us because those standards also vary among dictionaries; some dictionaries even include entire phrases, and if one of those is an accurate match we probably want to use it rather than the individual words as it's more likely to offer an accurate meaning. Our "intelligent segmentation" feature does something a bit similar to the graph you describe, but along with frequency and a little bit of grammatical analysis it also takes into account how many dictionaries believe a particular string of characters is in fact a word (with some extra weighting based on which dictionaries we trust more for determining that) - we're working on adding some other factors too.
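
A toy sketch of the general idea described above (not Pleco's actual implementation, which isn't shown in this thread): score each candidate word by its log-frequency plus a weighted vote from the dictionaries that list it, then pick the best-scoring path through the sentence with dynamic programming. The dictionaries, trust weights, and frequencies below are all invented for illustration.

```python
import math

# Hypothetical data: dictionary contents, per-dictionary trust, rough frequencies.
DICTIONARIES = {
    "dict_a": {"研究", "研究生", "生命", "命", "起源", "的"},
    "dict_b": {"研究", "生命", "起源", "的"},
}
TRUST = {"dict_a": 1.0, "dict_b": 2.0}
FREQUENCY = {"研究": 500, "研究生": 100, "生命": 400, "命": 50, "起源": 80, "的": 10000}


def word_score(word):
    """Log-frequency plus the summed trust of every dictionary listing the word."""
    votes = sum(TRUST[name] for name, entries in DICTIONARIES.items() if word in entries)
    if votes == 0:
        # Unknown string: allow single characters as a penalised fallback.
        return -5.0 if len(word) == 1 else None
    return math.log(FREQUENCY.get(word, 1)) + votes


def segment(text, max_len=4):
    """Best-scoring segmentation via dynamic programming over end positions."""
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            s = word_score(text[start:end])
            if s is None or best[start][0] == -math.inf:
                continue
            total = best[start][0] + s
            if total > best[end][0]:
                best[end] = (total, best[start][1] + [text[start:end]])
    return best[len(text)][1]


result = segment("研究生命的起源")
```

With these made-up numbers, 研究 / 生命 beats 研究生 / 命 because both dictionaries vouch for 研究 and 生命. Summing raw scores like this tends to favour more, shorter words, so a real system would likely want a properly normalised model.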

Posted

Yeah, thanks for putting all this together. Nice to have it all in one place.

  • 4 weeks later...
  • 4 years later...
Posted

Thank you for this. It's very helpful.

On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

Posted
14 hours ago, Weyland said:

On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

 

I use a free online speech-to-text service to create Anki flashcards from audio sources.

 

I used the online service to generate JSON transcripts from the audio sources in batches of 100MB at a time. It is fairly accurate for complete sentences. I then use a GNU bash (sed/grep/gawk/ffmpeg) script to generate the .tsv file and .mp3 segments, all in one script for 100MB of audio and its JSON transcript (no need to manually run subs2srs).
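
A rough sketch of that transcript-to-flashcard step, written in Python rather than bash. The JSON layout (a list of segments with `start`, `end`, and `text`) is an assumption, since the service and its output format aren't named in the post; the file names are hypothetical too. The sketch only builds the ffmpeg command lines and the TSV rows; running the commands is left to the caller.

```python
import csv
import io
import json

# Hypothetical transcript layout: a list of timed segments (times in seconds).
transcript_json = """
[
  {"start": 0.0, "end": 2.5, "text": "你好，欢迎收听。"},
  {"start": 2.5, "end": 6.0, "text": "今天我们聊聊学习方法。"}
]
"""


def build_cards(segments, audio_file, prefix="clip"):
    """Return (tsv_rows, ffmpeg_commands) for a list of timed segments."""
    rows, commands = [], []
    for i, seg in enumerate(segments, start=1):
        clip = f"{prefix}_{i:04d}.mp3"
        # -ss/-to select the time range; the clip is written to its own mp3.
        commands.append([
            "ffmpeg", "-i", audio_file,
            "-ss", f"{seg['start']:.3f}", "-to", f"{seg['end']:.3f}",
            clip,
        ])
        # Anki plays audio referenced as [sound:filename] inside a field.
        rows.append([seg["text"], f"[sound:{clip}]"])
    return rows, commands


segments = json.loads(transcript_json)
rows, commands = build_cards(segments, "episode.mp3")

# Write the tab-separated rows to a string; use open(...) instead to write a file.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()
```

Each entry in `commands` can be passed to `subprocess.run(cmd, check=True)` to cut the clip, and the TSV text can be imported into Anki as a two-field note once the clips are in the media folder.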

 

I've also used Apache Lucene and C# to build a WinForms search application on a domain-specific English-language text database which I gathered.

Posted
On 7/30/2020 at 3:24 PM, Weyland said:

have you ever looked into spaCy(BERT) or ERNIE?

 

Thanks for that. spaCy looks interesting for POS tagging and NER. Not sure it would help with my Speech-to-Text projects except, perhaps, for flagging incorrect transcriptions. Ideally, the neural Speech-to-Text services should be using knowledge from BERT to improve their neural nets and the accuracy of their transcriptions. After all, BERT is trained to fill in a gap in text with the most likely word.

 

Did you have a specific project or purpose in mind for NLP?

 

I'd read a little about BERT and ERNIE. BERT has been a real game-changer, it seems, though I don't know how much training has been done on Chinese text.

Posted
10 hours ago, philwhite said:

though I don't know how much training has been done on Chinese text.

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

Posted
8 hours ago, imron said:

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

 

Yes, one of the biggest breakthroughs in deep learning in recent years was Kaiming He's paper on deep residual nets, from his time at Microsoft China. Baidu Research in China and the US has published a lot. Unfortunately, my Chinese is not good enough to read the API of Baidu's web service. In the UK, Prof Guo at Imperial is well known.

 

I was thinking specifically of BERT. I should have written "I don't know how much training of BERT has been done by Google on Chinese text".

Posted
13 hours ago, Weyland said:

Grade the top 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels).

Here is another, more abundant, list of resources.

 

Many thanks Weyland for the link to the list of resources. Unfortunately, my reading skills are nowhere near good enough for most of them.

 

Please forgive me, I'm slow on the uptake here. Are you talking about:

  1. Grading common words (and grammar constructs) for difficulty from scratch, or
  2. Grading a text for difficulty, given a grading of vocab and grammar constructs for difficulty?
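
To make the second option concrete, here is a minimal sketch of grading a text from an existing word grading. The word-to-level table is invented (it is not real HSK 3.0 data), and the text is assumed to be segmented already. The rule used, the lowest level that covers a target share of the tokens, is just one possible choice.

```python
import math

# Hypothetical word grading: word -> level (1 = easiest). Not real HSK 3.0 data.
WORD_LEVEL = {"我": 1, "喜欢": 1, "学习": 1, "中文": 2, "语法": 4, "结构": 5}
UNKNOWN_LEVEL = 7  # assumption: words missing from the list count as hardest


def text_level(words, coverage=0.9):
    """Lowest level at which at least `coverage` of the tokens are known."""
    levels = sorted(WORD_LEVEL.get(w, UNKNOWN_LEVEL) for w in words)
    if not levels:
        return None
    # Number of tokens (easiest first) needed to reach the coverage target.
    needed = max(1, math.ceil(coverage * len(levels)))
    return levels[needed - 1]


words = ["我", "喜欢", "学习", "中文", "语法"]
level = text_level(words)
```

Grammar constructs could be handled the same way by adding their levels to the list before sorting, provided something upstream can detect them in the text.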
