Make Me a Hanzi: free, open-source Chinese character data

February 17, 2016 at 05:07 AM

Hi everyone!

There are a lot of good proprietary apps and sites out there for learning Chinese: Pleco, Skritter, Wenlin, and YellowBridge, to name a few. However, I've always felt that language learning should be free, like Duolingo is for so many European languages. With that in mind, I started the Make Me a Hanzi project a few months ago to produce the core data and libraries needed to write Chinese language-learning software.

My eventual goal for Make Me a Hanzi is to produce high-quality free and open-source versions of all of the following:

- Dictionary with etymology and stroke order information

- Handwriting recognition software (demo up!)

- "Learn to write characters" apps

The data that I'm producing with this project could also be useful for linguistic research. For example, for characters with a left-to-right structure, do you know whether the left or right component is more likely to be the phonetic component? The etymology data I've produced can easily answer that! Out of 4831 such characters in my data, 4257 (88%) have the phonetic component on the right.
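A count like this could be sketched roughly as follows. This is only an illustration, not the project's actual script: it assumes each dictionary entry is a record with a `decomposition` string that starts with the left-to-right ideographic description character `⿰` followed by the two components, plus an `etymology` record naming the `phonetic` component. Those field names and the four sample entries are assumptions made for the example.

```python
# Hypothetical sample records; the field names are assumed for illustration
# and may not match the real data files exactly.
SAMPLE_ENTRIES = [
    {"character": "妈", "decomposition": "⿰女马",
     "etymology": {"type": "pictophonetic", "semantic": "女", "phonetic": "马"}},
    {"character": "河", "decomposition": "⿰氵可",
     "etymology": {"type": "pictophonetic", "semantic": "氵", "phonetic": "可"}},
    {"character": "期", "decomposition": "⿰其月",
     "etymology": {"type": "pictophonetic", "semantic": "月", "phonetic": "其"}},
    {"character": "花", "decomposition": "⿱艹化",
     "etymology": {"type": "pictophonetic", "semantic": "艹", "phonetic": "化"}},
]

LEFT_TO_RIGHT = "⿰"  # ideographic description character U+2FF0


def phonetic_side_counts(entries):
    """Count how often the phonetic component is the left or the right part."""
    counts = {"left": 0, "right": 0}
    for entry in entries:
        decomposition = entry.get("decomposition", "")
        phonetic = entry.get("etymology", {}).get("phonetic")
        # Only handle simple two-component left-to-right characters.
        if len(decomposition) != 3 or decomposition[0] != LEFT_TO_RIGHT:
            continue
        left, right = decomposition[1], decomposition[2]
        if phonetic == right:
            counts["right"] += 1
        elif phonetic == left:
            counts["left"] += 1
    return counts


counts = phonetic_side_counts(SAMPLE_ENTRIES)
print(counts)
```

With real data, the entries would be loaded from the dictionary file instead of the inline list, and nested decompositions would need extra handling that this sketch skips.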

I'm very interested in having a large impact, so any feedback you have or suggestions for next steps would be great! Currently, I plan to build a simple "learn to write 100 common Chinese character components" app for Android and iOS with this data.

Thanks,

Shaunak

February 17, 2016 at 06:20 AM

Looks fantastic!

February 17, 2016 at 06:39 AM

I agree, good start! I believe the html5 skritter client is actually open source? https://github.com/skritter/skritter-html5

February 17, 2016 at 10:48 AM

Looks fantastic. I don't understand the programming side of things, but I do understand free and open source, and I would be very keen on a character learning app that was free.

One of the things about Chinese is the need to spend a lot of time practising writing and learning characters, and anything that can contribute to this has got to be good.

February 17, 2016 at 12:07 PM

@Shaunak Amazing! Keep up the good work.

I found that your recognition demo site can run offline! Can you explain a bit more about how the algorithm works? There are around 9000 characters, so how can you recognize user input without querying the server?

February 17, 2016 at 02:07 PM

I found that I have to be very accurate with my proportions for the character to be recognized. This isn't the case in Pleco. Is that the point - it only recognizes it if you write it very well?

February 17, 2016 at 02:56 PM

Great project!

Is this a kind of PhD thesis?

February 17, 2016 at 05:15 PM

Thanks for the kind words, everyone!

@wibr Wow, I didn't know about that Skritter repository. I spent a while looking through it. To make use of it directly, I'd have to reverse-engineer their network protocol as all their data comes from the server. Maybe I can just pull out the character learning library, though.

@kawakusong Sure, I can explain! The character recognition uses the "medians" field from the graphics data file. I wrote some code to compress that data down to 1Mb. When the demo site is first loaded, it loads and decompresses the medians data, then uses it for matching. After the data is loaded, matching is entirely client-side. The matching algorithm is extremely simple at the moment - it expects inputs to have correct stroke order and compares it against different characters' medians by angle and position.
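To make the idea of matching "by angle and position" concrete, here is a minimal sketch. It is not the demo's actual code: the scoring formula, the weights, the reduction of each stroke to its endpoints, and the three sample medians (in a unit square with y pointing down) are all assumptions made for illustration.

```python
import math


def stroke_features(stroke):
    """Reduce a stroke (list of (x, y) points) to its direction and midpoint."""
    (x0, y0), (x1, y1) = stroke[0], stroke[-1]
    angle = math.atan2(y1 - y0, x1 - x0)
    midpoint = ((x0 + x1) / 2, (y0 + y1) / 2)
    return angle, midpoint


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    diff = abs(a - b) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)


def score(drawn, reference, angle_weight=1.0, position_weight=1.0):
    """Lower is better; infinity if the stroke counts differ."""
    if len(drawn) != len(reference):
        return math.inf
    total = 0.0
    # Strokes are compared pairwise in order, so stroke order matters.
    for drawn_stroke, ref_stroke in zip(drawn, reference):
        angle_a, mid_a = stroke_features(drawn_stroke)
        angle_b, mid_b = stroke_features(ref_stroke)
        total += angle_weight * angle_difference(angle_a, angle_b)
        total += position_weight * math.dist(mid_a, mid_b)
    return total


def recognize(drawn, medians_by_character):
    """Return the best-scoring character, or None if nothing is comparable."""
    best, best_score = None, math.inf
    for character, reference in medians_by_character.items():
        candidate_score = score(drawn, reference)
        if candidate_score < best_score:
            best, best_score = character, candidate_score
    return best


# Hypothetical medians, one list of points per stroke.
MEDIANS = {
    "十": [[(0.1, 0.5), (0.9, 0.5)], [(0.5, 0.1), (0.5, 0.9)]],
    "二": [[(0.25, 0.3), (0.75, 0.3)], [(0.1, 0.7), (0.9, 0.7)]],
    "人": [[(0.5, 0.1), (0.1, 0.9)], [(0.5, 0.3), (0.9, 0.9)]],
}

drawn = [[(0.15, 0.45), (0.85, 0.5)], [(0.5, 0.15), (0.55, 0.9)]]
result = recognize(drawn, MEDIANS)
print(result)
```

Because everything here is plain arithmetic over data already in memory, a matcher of this shape needs no server round trip once the medians are loaded.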

@XiaoKui For this demo site, I want the handwriting recognition algorithm to be more liberal. It's pretty basic at the moment and I would definitely like to improve it. One major problem with the recognition that I already know of is that for components like 女, 艹, 骨, it only recognizes the variant used by the Arphic font in each character. If you have examples where it failed to recognize your writing, it would be great if you could share some screenshots!

@boctulus No, it's purely a hobby project!

It sounds like getting Skritter's client or some similar app working is a good next step. I'll work towards that.

February 18, 2016 at 01:13 PM

@skishore: you found that 88% of those characters have the phonetic component on the right (very interesting!)

I'd like to know something about the complexity (in terms of number of strokes) of the phonetic component: is there any relationship? Does it have fewer strokes than the semantic component, on average?

February 18, 2016 at 02:39 PM

Hate to be a wet blanket here, but a legal concern: isn't the Arphic Public License for the fonts non-commercial? (this is why we haven't tried to do something similar at Pleco to supply stroke order in 楷体 based on those fonts - don't want to get sued) Wouldn't seem to impact your project but might affect others making derivatives of it.

February 18, 2016 at 03:45 PM

According to Wikipedia, there is one from 2010 which doesn't allow commercial use; the one before is more in the spirit of the GPL, the way I read it.

Edit: https://en.wikipedia.org/wiki/Arphic_Technology The fonts in question still use the old license from 1999.

February 18, 2016 at 05:12 PM

That's right, the fonts are distributed with Ubuntu and are licensed under the older, free-software APL. Perhaps I should build another round of the stroke order data with the new fonts too, though - depending on the details, the non-commercial aspect may not be a problem for me.

@boctulus Interesting question! Now, I can guess the result ahead of time (think of all the common radicals like 亻, 讠, 辶, 艹, 女, etc), but it's good to confirm:

Out of 6602 examples in the dataset with a pictophonetic etymology, in 76% the phonetic component had more strokes, and in 16% it had fewer.
The average number of strokes in the phonetic component was 7.8, while in the semantic component it was 4.5.

This analysis is not that great. A better one would take character frequency into account and also compute simplified and traditional statistics separately. The numbers for traditional characters would likely be a lot closer. I'm leaving that to you all for now, though! The quick script I wrote to get these numbers is here.
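For anyone who wants to try a variant of this, a comparison of that shape might look like the sketch below. It is not the script mentioned above: the `(character, semantic, phonetic)` tuples, the stroke-count table, and the function name are hypothetical stand-ins for values that would come from the dataset.

```python
from statistics import mean

# Hypothetical stroke counts for a few components (simplified forms).
STROKE_COUNTS = {
    "女": 3, "马": 3, "氵": 3, "可": 5,
    "讠": 2, "青": 8, "月": 4, "其": 8,
}

# Hypothetical (character, semantic component, phonetic component) examples.
PAIRS = [
    ("妈", "女", "马"),
    ("河", "氵", "可"),
    ("请", "讠", "青"),
    ("期", "月", "其"),
]


def compare_stroke_counts(pairs, stroke_counts):
    """Summarize how phonetic stroke counts compare with semantic ones."""
    more = fewer = equal = 0
    phonetic_counts, semantic_counts = [], []
    for _character, semantic, phonetic in pairs:
        s, p = stroke_counts[semantic], stroke_counts[phonetic]
        semantic_counts.append(s)
        phonetic_counts.append(p)
        if p > s:
            more += 1
        elif p < s:
            fewer += 1
        else:
            equal += 1
    return {
        "phonetic_more": more,
        "phonetic_fewer": fewer,
        "equal": equal,
        "mean_phonetic": mean(phonetic_counts),
        "mean_semantic": mean(semantic_counts),
    }


summary = compare_stroke_counts(PAIRS, STROKE_COUNTS)
print(summary)
```

Weighting each example by character frequency, or running simplified and traditional characters separately, would only change how `pairs` is built.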

May 19, 2019 at 11:38 AM

How do I add character stroke data to the project? I want to use this to learn Cantonese characters!


skishore

imron

wibr

Shelley

kawakusong

Xiao Kui

boctulus

skishore

boctulus

mikelove

wibr

skishore

Dedaliadon
