jannesan Posted December 16, 2019 at 10:23 PM

Hi everyone,

I would like to introduce my little project, which some people may find useful. It is a free website (www.jihanzi.com) with 2 functions:

1. Extracting vocabulary (simplified/traditional) from epub (DRM-free), pdf or text files. You can also match it against your known words if you have them at hand, so you can find all unknown words in a text. It also reports how often each word occurs and where it first occurs (chapter for epub, page for pdf, relative position for plaintext). For convenience I added a filter for a minimum number of occurrences.

2. Recommending books based on your known words. If you upload your known words, you can again filter for a minimum number of occurrences and download a list of the books ordered from fewest to most unknown words.

For now this only matches against 155 books, all of which are recommendations from this popular thread: https://www.chinese-forums.com/forums/topic/2034-what-are-you-reading/ . I actually found around 350 titles that were mentioned and exist on Douban, but I could only find text files for 155 of them.

I am planning to add more books and also movies. If you have any suggestions, please let me know.

Cheers
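The two functions described in this post can be sketched roughly as follows. This is only an illustration, not the site's actual code: it assumes the text has already been segmented into words (segmenting Chinese text is a separate step not shown here), and the names `vocab_report` and `rank_books` are hypothetical.

```python
from collections import Counter


def vocab_report(sections, known_words=frozenset(), min_count=1):
    """Summarise unknown vocabulary.

    sections: list of (label, tokens) pairs in reading order, where label is
    e.g. a chapter or page name and tokens is an already segmented word list
    (assumption: segmentation happens elsewhere).
    Returns (word, count, first_seen_label) rows, most frequent first.
    """
    counts = Counter()
    first_seen = {}
    for label, tokens in sections:
        for tok in tokens:
            counts[tok] += 1
            # Remember only the first section in which the word appears.
            first_seen.setdefault(tok, label)
    rows = [
        (word, n, first_seen[word])
        for word, n in counts.items()
        if word not in known_words and n >= min_count
    ]
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows


def rank_books(books, known_words, min_count=1):
    """Order books from fewest to most unknown words.

    books: dict mapping a title to a Counter of word frequencies.
    Words occurring fewer than min_count times in a book are ignored.
    """
    ranked = []
    for title, counts in books.items():
        unknown = sum(
            1 for w, n in counts.items()
            if n >= min_count and w not in known_words
        )
        ranked.append((title, unknown))
    ranked.sort(key=lambda r: (r[1], r[0]))
    return ranked
```

Raising `min_count` hides words that appear only once or twice, which is the effect of the minimum-occurrence filter mentioned above.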
Jan Finster Posted December 17, 2019 at 01:21 AM

This sounds like a useful tool! (Edited) At first I did not get any readable characters, but this seems to be due to a font issue in my Excel. After uploading the .csv to Google Docs, everything looks fine. Thanks!
DavyJonesLocker Posted December 17, 2019 at 02:04 AM

Feedback: I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file opens fine in several epub readers, though.

When I copy a page of text from the file and paste it into a Notepad file (Unicode format), I still get the same message. Chrome, Microsoft Edge (VPN on, off): same response.

However, saved as UTF-8 it seems fine, but when I download the CSV file it comes out as gobbledygook. I changed the extension to .txt and it opens fine in Notepad/Word etc.
mungouk Posted December 17, 2019 at 06:42 AM

There seems to be an issue with opening .csv files in Excel if they have Hanzi in them: it doesn't recognise the Unicode. (At least that's the case on Excel for Mac 16.32.) If you're using a Mac, the Numbers app manages it no problem.
jannesan Posted December 17, 2019 at 11:30 AM

The CSV file is UTF-8 encoded; I guess Excel opens it with some other encoding. I have no experience with Excel, but it should be possible to specify a non-default encoding when importing a file.

9 hours ago, DavyJonesLocker said: "I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file opens fine in several epub readers, though."

Is the epub DRM protected? I suspect this may be the reason; I can only extract text from DRM-free epubs. I should add this to the description. I am using https://calibre-ebook.com/ to manage my ebooks and convert between different formats. You can install a plugin (https://apprenticealf.wordpress.com/2012/09/10/calibre-plugins-the-simplest-option-for-removing-most-ebook-drm/) to get rid of DRM and thus really own your ebooks.

9 hours ago, DavyJonesLocker said: "When I copy a page of text from the file and paste it into a Notepad file (Unicode format), I still get the same message. Chrome, Microsoft Edge (VPN on, off): same response."

You mean when you upload as plaintext? This is strange.
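On the CSV encoding point in this post: one common workaround is to write the file as UTF-8 with a byte order mark (BOM), which Excel typically uses as a hint that the file is UTF-8. The sketch below assumes the export is done in Python, which the thread does not confirm; the function name and column headers are made up for illustration.

```python
import csv


def write_excel_friendly_csv(path, rows, header=("word", "count", "first_seen")):
    """Write rows to a CSV file encoded as UTF-8 with a BOM.

    "utf-8-sig" makes Python prepend the bytes EF BB BF. The header names
    are hypothetical placeholders, not the site's real column names.
    """
    # newline="" is what the csv module expects so it can control line endings.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Programs that already read UTF-8 correctly generally ignore the BOM, although some may show it as an invisible character at the start of the first cell.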
DavyJonesLocker Posted December 17, 2019 at 01:12 PM

1 hour ago, jannesan said: "Is the epub DRM protected? I suspect this may be the reason; I can only extract text from DRM-free epubs. I should add this to the description."

No idea, I just downloaded it off Baidu and opened it in a free epub reader, and it seems fine. I can send it to you if you like; it's a small file.

1 hour ago, jannesan said: "You mean when you upload as plaintext? This is strange."

Yup, I just called it file.txt. Then I saved the file again as UTF-8 and it works fine.

I'd say it's worth adding more information on error codes and general usage to your website, e.g. can't use Kindle books etc. (I presume).
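A plausible explanation for the behaviour reported here is that Windows Notepad's "Unicode" option saves UTF-16 rather than UTF-8. Whether that is what the server tripped over is a guess. Under that assumption, a server could look at the byte order mark before decoding; `decode_upload` is a hypothetical helper name, not something from the site.

```python
import codecs


def decode_upload(data: bytes) -> str:
    """Decode uploaded text, accepting UTF-8 and BOM-marked UTF-16.

    Assumption: files without any BOM are UTF-8. If they are not, the
    final decode raises UnicodeDecodeError, which the caller can turn
    into a clear "please save the file as UTF-8" message.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Strip the UTF-8 BOM so it does not end up in the text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    return data.decode("utf-8")
```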
jannesan Posted December 17, 2019 at 02:11 PM

54 minutes ago, DavyJonesLocker said: "I can send it to you if you like; it's a small file."

Yes, please! I am using a very common library to check for filetypes, but maybe it doesn't work as well as I thought. I can change it to accept all files with an .epub extension and then try to read them on the server.

56 minutes ago, DavyJonesLocker said: "I'd say it's worth adding more information on error codes and general usage to your website, e.g. can't use Kindle books etc. (I presume)."

Yes, I will do that. It is easy to convert to epub with Calibre, but yeah, that's an extra step to take as a user. Thanks for the feedback :)
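The idea of accepting the upload and then trying to read it on the server could look something like the sketch below. It relies on the fact that an EPUB is a ZIP container with a `mimetype` entry containing `application/epub+zip`. The function name is hypothetical, and this is not the library the site currently uses, which the thread does not name.

```python
import io
import zipfile


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with an EPUB mimetype entry.

    This only checks the container. It does not detect DRM and does not
    validate the book's contents.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # zf.read raises KeyError if there is no "mimetype" entry.
            return zf.read("mimetype").strip() == b"application/epub+zip"
    except (zipfile.BadZipFile, KeyError):
        return False
```

Checking the content this way avoids depending on the browser-reported file type, which can differ between systems.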
dougwar Posted December 18, 2019 at 12:28 AM

Cool project, can you add HSK levels to select as known words?
- HSK 1 (if I know all words from this level)
- HSK 2
Etc...
jannesan Posted December 18, 2019 at 11:25 AM

10 hours ago, dougwar said: "Cool project, can you add HSK levels to select as known words?"

Yes, good idea :) I'll let you know when I have added this.
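The HSK suggestion amounts to treating every word up to a chosen level as known. A minimal sketch, assuming the word lists are available as a mapping from level number to a set of words; the function name is hypothetical and the tiny lists in the usage example are placeholders, not the official HSK lists.

```python
def known_words_up_to(level, words_by_level):
    """Return the union of all word sets whose level is <= the given level.

    words_by_level: dict mapping an integer level to a set of words
    (assumption: the real lists would be loaded from files).
    """
    known = set()
    for lvl, words in words_by_level.items():
        if lvl <= level:
            known.update(words)
    return known


# Placeholder data for illustration only.
sample_levels = {1: {"我", "你"}, 2: {"但是"}, 3: {"环境"}}
known = known_words_up_to(2, sample_levels)
```

The resulting set could then be used in place of an uploaded known-words file.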
DavyJonesLocker Posted January 12, 2020 at 04:14 AM

Hey @jannesan, I tried another file on your website and it comes up with "bad requests". It's just a plain Unicode text file, and it opens fine in MS Notepad.

Actually, I'll PM you the file.
jannesan Posted January 12, 2020 at 02:09 PM

@DavyJonesLocker Thanks for letting me know, I'll look into it and PM you when it's resolved :)