
Make Me a Hanzi: free, open-source Chinese character data



Posted

Looks fantastic, I don't understand the programming side of things but I do understand free and open source :)  And I would be very keen on a character learning app that was free :)

 

One of the things about Chinese is the need to spend a lot of time practising writing and learning characters, and anything that can contribute to this has got to be good.

Posted

@Shaunak  Amazing! Keep up the good work.

 

I found that your recognition demo site can run offline! Can you explain a bit more about how the algorithm works? There are around 9000 characters, so how can you recognize user input without querying the server?

Posted

I found that I have to be very accurate with my proportions for the character to be recognized. This isn't the case in Pleco. Is that the point - it only recognizes it if you write it very well?

Posted

Thanks for the kind words, everyone!

 

@wibr Wow, I didn't know about that Skritter repository. I spent a while looking through it. To make use of it directly, I'd have to reverse-engineer their network protocol as all their data comes from the server. Maybe I can just pull out the character learning library, though.

 

@kawakusong Sure, I can explain! The character recognition uses the "medians" field from the graphics data file. I wrote some code to compress that data down to about 1 MB. When the demo site is first loaded, it downloads and decompresses the medians data, then uses it for matching. After the data is loaded, matching is entirely client-side. The matching algorithm is extremely simple at the moment - it expects input to have the correct stroke order and compares the strokes against different characters' medians by angle and position.
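To make the "angle and position" idea concrete, here is a rough Python sketch. It is not the demo site's actual code: the function names, the scoring weights, the use of only each stroke's first and last point, and the sample coordinates are all assumptions made for illustration. It only assumes that each character's medians are a list of strokes, each stroke a list of (x, y) points, written in the correct stroke order.

```python
import math

def normalize(strokes):
    """Shift and uniformly scale strokes so the larger bounding-box side is 1."""
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    scale = max(max(xs) - min_x, max(ys) - min_y, 1e-9)
    return [[((x - min_x) / scale, (y - min_y) / scale) for x, y in stroke]
            for stroke in strokes]

def stroke_features(stroke):
    """Return (direction angle, midpoint) using the stroke's endpoints only."""
    (x0, y0), (x1, y1) = stroke[0], stroke[-1]
    angle = math.atan2(y1 - y0, x1 - x0)
    midpoint = ((x0 + x1) / 2, (y0 + y1) / 2)
    return angle, midpoint

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def score(user_strokes, candidate_medians):
    """Lower is better; infinite if the stroke counts differ (assumed rule)."""
    if len(user_strokes) != len(candidate_medians):
        return math.inf
    total = 0.0
    pairs = zip(normalize(user_strokes), normalize(candidate_medians))
    for user, cand in pairs:
        user_angle, user_mid = stroke_features(user)
        cand_angle, cand_mid = stroke_features(cand)
        # Equal weighting of angle and position is an arbitrary choice.
        total += angle_diff(user_angle, cand_angle) / math.pi
        total += math.dist(user_mid, cand_mid)
    return total

def recognize(user_strokes, medians_table, top_n=3):
    """Rank candidate characters by score, skipping stroke-count mismatches."""
    scored = [(score(user_strokes, m), ch) for ch, m in medians_table.items()]
    scored = [(s, ch) for s, ch in scored if s != math.inf]
    scored.sort()
    return [ch for _, ch in scored[:top_n]]

# Hypothetical, hand-made medians (not taken from the real data file).
SAMPLE_MEDIANS = {
    "十": [[(0, 50), (100, 50)], [(50, 0), (50, 100)]],
    "二": [[(20, 30), (80, 30)], [(0, 80), (100, 80)]],
    "一": [[(0, 50), (100, 50)]],
}
```

Calling `recognize(user_strokes, SAMPLE_MEDIANS)` with two drawn strokes returns the best-matching two-stroke characters first. Because the score compares strokes pairwise in order, a sketch like this fails on any stroke-order mistake, and comparing positions after normalization makes it sensitive to proportions.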

 

@XiaoKui For this demo site, I want the handwriting recognition algorithm to be more liberal. It's pretty basic at the moment and I would definitely like to improve it. One major problem with the recognition that I already know of is that for components like 女, 艹, 骨, it only recognizes the variant used by the Arphic font in each character. If you have examples where it failed to recognize your writing, it would be great if you could share some screenshots!

 

@boctulus No, it's purely a hobby project!

 

It sounds like getting Skritter's client or some similar app working is a good next step. I'll work towards that.

Posted

@skishore:  you found "88% characters have the phonetic component on the right"  (very interesting!)

 

I'd like to know something about the complexity (in terms of stroke count) of the phonetic component: is there any relationship? Does it have fewer strokes than the semantic component, on average?

Posted

Hate to be a wet blanket here, but a legal concern: isn't the Arphic Public License for the fonts non-commercial? (this is why we haven't tried to do something similar at Pleco to supply stroke order in 楷体 based on those fonts - don't want to get sued) Wouldn't seem to impact your project but might affect others making derivatives of it.

Posted

That's right, the fonts are distributed with Ubuntu and are licensed under the older, free-software APL. Perhaps I should build another round of the stroke order data with the new fonts too, though - depending on the details, the non-commercial aspect may not be a problem for me.

 

@boctulus Interesting question! Now, I can guess the result ahead of time (think of all the common radicals like 亻, 讠, 辶, 艹, 女, etc), but it's good to confirm:

  • Out of 6602 examples in the dataset with a pictophonetic etymology, in 76% the phonetic component had more strokes, and in 16% it had fewer.
  • The average number of strokes in the phonetic component was 7.8, while in the semantic component it was 4.5.

This analysis is not that great. A better one would take character frequency into account and also compute simplified and traditional statistics separately. The numbers for traditional characters would likely be a lot closer. I'm leaving that to you all for now, though! The quick script I wrote to get these numbers is here.
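For anyone who wants to try a comparison like this, here is a rough Python sketch of the kind of counting involved. It is not the script mentioned above. It assumes each dictionary entry is a dict with an "etymology" holding "type", "phonetic" and "semantic" fields, and that a separate character-to-stroke-count lookup is available; the function name and the three sample entries are made up for illustration.

```python
from statistics import mean

def compare_components(entries, stroke_counts):
    """Compare stroke counts of phonetic vs. semantic components.

    entries: iterable of dicts with an assumed "etymology" sub-dict.
    stroke_counts: assumed mapping from component to its stroke count.
    """
    phonetic_counts, semantic_counts = [], []
    for entry in entries:
        etymology = entry.get("etymology") or {}
        if etymology.get("type") != "pictophonetic":
            continue
        p = stroke_counts.get(etymology.get("phonetic"))
        s = stroke_counts.get(etymology.get("semantic"))
        if p is None or s is None:
            continue  # skip entries with a missing or unknown component
        phonetic_counts.append(p)
        semantic_counts.append(s)

    total = len(phonetic_counts)
    if total == 0:
        return {"total": 0}
    pairs = list(zip(phonetic_counts, semantic_counts))
    return {
        "total": total,
        "phonetic_more": sum(p > s for p, s in pairs) / total,
        "phonetic_fewer": sum(p < s for p, s in pairs) / total,
        "avg_phonetic": mean(phonetic_counts),
        "avg_semantic": mean(semantic_counts),
    }

# Tiny illustrative sample, not real dataset output.
SAMPLE_ENTRIES = [
    {"character": "河", "etymology": {"type": "pictophonetic",
                                      "phonetic": "可", "semantic": "氵"}},
    {"character": "妈", "etymology": {"type": "pictophonetic",
                                      "phonetic": "马", "semantic": "女"}},
    {"character": "闷", "etymology": {"type": "pictophonetic",
                                      "phonetic": "门", "semantic": "心"}},
    {"character": "人", "etymology": {"type": "pictographic"}},
]
SAMPLE_STROKES = {"可": 5, "氵": 3, "马": 3, "女": 3, "门": 3, "心": 4}
```

Calling `compare_components(SAMPLE_ENTRIES, SAMPLE_STROKES)` gives the share of characters where the phonetic component has more or fewer strokes, plus the two averages. Weighting by character frequency, or splitting simplified and traditional characters, would mean adding a weight or a filter inside the loop.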

  • 3 years later...
Posted

How do I add character stroke data to the project? I want to use this to learn Cantonese characters!
