Chinese-Forums

Jihanzi: vocabulary extraction and content recommendation



Posted

Hi everyone,
I would like to introduce you to my little project that some people may find useful.
It is a free website (www.jihanzi.com) with 2 functions:

 

1. Extracting vocabulary (simplified/traditional) from epub (DRM free), pdf or text files

 

You can also match it against your known words, if you have them at hand, to find all unknown words in a text. For each word, the output also gives its frequency of occurrence and where it first occurs (chapter for epub, page for pdf, relative position for plaintext). For convenience I added a filter for a minimum number of occurrences.
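A minimal sketch of this kind of extraction, not the site's actual code. It assumes the text has already been segmented into words (Chinese word segmentation is a separate step that is not shown), and the names `extract_vocabulary`, `sections` and the chapter labels are hypothetical.

```python
from collections import Counter


def extract_vocabulary(sections, known_words=frozenset(), min_count=1):
    """Return (word, count, first_location) rows for unknown words.

    sections: list of (label, tokens) pairs in reading order, where label
    is e.g. a chapter name or page number (hypothetical input format).
    """
    counts = Counter()
    first_seen = {}
    for label, tokens in sections:
        for token in tokens:
            counts[token] += 1
            # Remember only the first place a word shows up.
            first_seen.setdefault(token, label)

    rows = [
        (word, n, first_seen[word])
        for word, n in counts.items()
        if n >= min_count and word not in known_words
    ]
    # Most frequent words first, ties broken by the word itself.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Tiny made-up example input.
sections = [
    ("chapter 1", ["我", "喜欢", "看", "书"]),
    ("chapter 2", ["我", "喜欢", "中文", "书", "书"]),
]
unknown = extract_vocabulary(sections, known_words={"我"}, min_count=2)
```

Here `unknown` keeps only words that occur at least twice and are not in the known-words set.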

 

2. Recommendation of books based on your known words

 

If you upload your known words, you can again filter for a minimum number of occurrences and download a list of the books ordered from fewest to most unknown words.
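The ordering could be sketched roughly as follows. This is only an illustration under the assumption that each book is available as a list of already segmented words; `rank_books` and the book titles are hypothetical and not taken from the site.

```python
from collections import Counter


def rank_books(books, known_words, min_count=1):
    """Return (title, unknown_word_count) pairs, fewest unknown words first.

    books: dict mapping a title to its list of tokens (hypothetical format).
    """
    ranking = []
    for title, tokens in books.items():
        counts = Counter(tokens)
        unknown = {
            word
            for word, n in counts.items()
            if n >= min_count and word not in known_words
        }
        ranking.append((title, len(unknown)))
    # Sort by number of distinct unknown words, then by title for stability.
    ranking.sort(key=lambda item: (item[1], item[0]))
    return ranking


# Made-up sample data.
books = {
    "Book A": ["我", "看", "书", "书"],
    "Book B": ["我", "学习", "中文", "语法"],
}
ranking = rank_books(books, known_words={"我", "书"})
```

Counting distinct unknown words is one possible choice; weighting by how often each unknown word occurs would be another.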

For now this only matches against 155 books, all of which are recommendations from this popular thread: https://www.chinese-forums.com/forums/topic/2034-what-are-you-reading/ . I actually found around 350 titles that were mentioned there and exist on douban, but I could only find text files for 155 of them.
I am planning to add more books and also movies.


If you have any suggestions, please let me know.

Cheers

 

Posted

This sounds like a useful tool! :)

 

(Edited) At first I did not get any readable characters, but this seems to be due to a font issue in my Excel. After uploading the .csv to Google Docs, everything looks fine. Thanks! :)

 

Posted

Feedback

 

I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file opens fine in several epub readers, though.

 

When I copy a page of text from the file and just paste it into a Notepad file (Unicode format), I still get the same message.

Chrome, Microsoft Edge (VPN on, off): same response.

 

However, saving it as UTF-8 seems fine, but when I download the CSV file it comes out as gobbledygook. I changed the extension to .txt and it opens fine in Notepad/Word etc.

 

 

Posted

There seems to be an issue with opening .csv files in Excel if they have Hanzi in them: it doesn't recognise the Unicode. (At least that's the case in Excel for Mac 16.32.)

 

If you're using a Mac, the Numbers app manages it no problem. 

Posted

The CSV file is UTF-8 encoded, I guess Excel opens it with some other encoding.

I have no experience with Excel, but it should be possible to specify a non-default encoding when importing a file.
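Another option on the generating side is to prefix the CSV with a UTF-8 byte order mark, which Excel generally uses as a hint that the file is UTF-8. A minimal sketch, where `add_utf8_bom` is a hypothetical helper and not something the site is known to do:

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prefix UTF-8 CSV bytes with a BOM unless one is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


# Made-up CSV content for illustration.
csv_bytes = "word,count\n喜欢,2\n".encode("utf-8")
with_bom = add_utf8_bom(csv_bytes)
```

When writing the file directly in Python, opening it with `encoding="utf-8-sig"` has the same effect.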

 

9 hours ago, DavyJonesLocker said:

I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file opens fine in several epub readers, though.

 

Is the epub DRM protected? I suspect this may be the reason, since I can only extract text from DRM-free epubs. I should add this to the description.

 

I am using https://calibre-ebook.com/ to manage my ebooks and convert between different formats. You can install a plugin (https://apprenticealf.wordpress.com/2012/09/10/calibre-plugins-the-simplest-option-for-removing-most-ebook-drm/) to get rid of DRM and thus really own your ebooks.

 

9 hours ago, DavyJonesLocker said:

When I copy a page of text from the file and just paste it into a Notepad file (Unicode format), I still get the same message.

Chrome, Microsoft Edge (VPN on, off): same response.

 

You mean when you upload as plaintext? This is strange.

Posted
1 hour ago, jannesan said:

Is the epub DRM protected? I suspect this may be the reason, since I can only extract text from DRM-free epubs. I should add this to the description.

 

 

No idea, I just downloaded it off Baidu and opened it in a free epub reader, and it seems fine. I can send it to you if you like; it's a small file.

 

1 hour ago, jannesan said:

You mean when you upload as plaintext? This is strange.

 

 

Yup, I just called it file.txt, then saved the file again as UTF-8, and it works fine.

 

I'd say it's worth adding more information on error codes and general usage to your website, e.g. that you can't use Kindle books etc. (I presume).
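One possible explanation for the plaintext behaviour, offered as a guess: Notepad's "Unicode" save option writes UTF-16 with a byte order mark, which a server expecting UTF-8 would fail to read. A sketch of a more tolerant decoder, where `decode_text_upload` is a hypothetical name and nothing here is confirmed to match the site's code:

```python
import codecs


def decode_text_upload(data: bytes) -> str:
    """Decode uploaded text, using a leading BOM to pick the encoding."""
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the BOM while decoding.
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order.
        return data.decode("utf-16")
    # Assumption: anything without a BOM is plain UTF-8.
    return data.decode("utf-8")


sample = "你好"
notepad_unicode = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
decoded = decode_text_upload(notepad_unicode)
```

Files in other legacy encodings (for example GBK) carry no BOM and would still need separate handling.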

Posted
54 minutes ago, DavyJonesLocker said:

I can send it to you if you like; it's a small file.

 

Yes, please!

I am using a very common library to check for filetypes, but maybe it doesn't work as well as I thought. I can change it to accept all files with an .epub extension and then try to read them on the server.
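A server-side check along those lines could look roughly like this. It relies on the fact that an epub is a ZIP archive containing a `mimetype` entry with the value `application/epub+zip`; the function name `looks_like_epub` is hypothetical, and this is a sketch rather than the check the site actually uses.

```python
import io
import zipfile


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with an epub mimetype entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            mimetype = archive.read("mimetype")
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP archive at all, or no "mimetype" entry inside it.
        return False
    return mimetype.strip() == b"application/epub+zip"


# Build a minimal in-memory archive for illustration (not a complete epub).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")

is_epub = looks_like_epub(buffer.getvalue())
is_text_epub = looks_like_epub("只是文本".encode("utf-8"))
```

A check like this says nothing about DRM; a protected file can pass it and still fail during text extraction.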

 

56 minutes ago, DavyJonesLocker said:

I'd say it's worth adding more information on error codes and general usage to your website, e.g. that you can't use Kindle books etc. (I presume).

 

Yes, I will do that. It is easy to convert to epub with Calibre, but yeah, that's an extra step to take as a user.

 

Thanks for the feedback:)

Posted

Cool project! Can you add HSK level word lists to select as known?

- HSK1 (if I know all words from this level)

- HSK2

 

Etc...

Posted
10 hours ago, dougwar said:

Cool project! Can you add HSK level word lists to select as known?

 

Yes, good idea:)

I'll let you know when I have added this.

  • 4 weeks later...
Posted

Hey @jannesan, I tried another file on your website and it comes up with "bad requests".

It's just a plain Unicode text file; it opens fine in MS Notepad.

 

Actually, I'll PM you the file.

