querido Posted November 19, 2009 at 04:03 PM

Could someone please point me to something no doubt already written? It is a simple function that must be of common interest. Grep + regex + scripting: I can't work it out right now. File A is known words of one or more characters, let's say one per line. File B is text parsed into words, one word per line. Find the words in B that are not in A and print them to file C = new words. If no one answers I'll do it myself when I can and post it here. Thank you. Lessons are getting much longer, so I must automate this.
wrbt Posted November 19, 2009 at 05:43 PM

It's pretty trivial if you've got python installed.

Assuming a.txt is:

Bengal Tiger
Siberian Tiger
Lion
Elephant
Cat
Dog
Mouse
Leopard

Assuming b.txt is:

Brown Bear
Cat
Alligator
Spotted Hyena
Coyote
Dog

Open python at the command line, create two sets from the files split along LFs, then use the difference method of the set object:

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> a=set(open("c:/a.txt").read().split("\n"))
>>> b=set(open("c:/b.txt").read().split("\n"))
>>> for d in b.difference(a): print d
...
Alligator
Coyote
Spotted Hyena
Brown Bear
>>>

Should be pretty easy to turn that into a script you could run against whatever parameters.
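For reference, that session turned into a standalone script might look something like the sketch below. The script name and argument handling are illustrative, not from the post, and it is written to run under both the Python 2 shown above and Python 3; sorted() is only there to give stable output and can be dropped.

#!/usr/bin/env python
# new_words.py (hypothetical name) -- usage:
#   python new_words.py known.txt lesson.txt out.txt
import sys

def main():
    known_path, lesson_path, out_path = sys.argv[1:4]
    # Build a set of words from each file, one word per line.
    known = set(open(known_path).read().split("\n"))
    lesson = set(open(lesson_path).read().split("\n"))
    # Words in the lesson that are not already known.
    new = lesson.difference(known)
    open(out_path, "w").write("\n".join(sorted(new)))

if __name__ == "__main__":
    main()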
querido Posted November 19, 2009 at 06:38 PM

You are very generous. Thank you.
jbradfor Posted November 19, 2009 at 07:55 PM

When you get it working, please post. I've been wanting to write something like this as well, for finding new vocab words in a text to read. [in this case, A would be all the words I know, B would be the new text, already divided into words by a different program, and C is the list of words I should learn before I read the text.]
querido Posted November 19, 2009 at 08:30 PM

To jbradfor: this first try is actually usable, BUT:

1. It doesn't avoid duplicates; a new word is listed in new_words multiple times.
2. It puts a ">" character at the beginning of every line in new_words.

The above can be fixed easily with a text editor, but the next one could be annoying:

3. It requires that both input files be one word per line.

command: diff known_words new_lesson | grep '>' > new_words

I will work on something better, but I'm very busy studying right now. Learning this stuff from scratch is more than I can do at the moment, but I can modify examples. Here is a good source of ideas. Little tools like this must already exist, but I haven't found them yet.
jbradfor Posted November 19, 2009 at 09:57 PM

And it requires that both lists are sorted using the same algorithm. I didn't want to lose the original order for list 'B'. [You can remove the initial '> ' with a simple sed line, something like sed -e 's/> //' I think will work]
querido Posted November 19, 2009 at 10:25 PM (edited)

And it requires that both lists are sorted using the same algorithm.

Whoops... that's no good. Sorry. If you want to see it, here is something very close. I just converted my Wenlin-parsed new_lesson to one-word-per-line with this:

sed 's/ | /\n/g' new_lesson > temp

then this:

sed 's/, /\n/g' temp > new_one-word-per-line

I should have figured this out months ago. I still might use what wrbt gave me.

Edited November 19, 2009 at 10:40 PM by querido
wrbt Posted November 19, 2009 at 11:16 PM

This will do it in one line, dumping the difference of a.txt and b.txt to a third file called c.txt:

open("c:/c.txt","w").write("\n".join(set(open("c:/b.txt").read().split("\n")).difference(set(open("c:/a.txt").read().split("\n")))))

Technically I think that leaves dangling resources, since we never close the file objects returned by open, but it should be okay for a quick and dirty dump in a python command line window.
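If the dangling handles bother anyone, the same set difference with the files closed explicitly might look like this sketch. It assumes a Python new enough for with blocks (2.5+, and 2.7+ for the two-manager form), which the 2.4 shown earlier isn't:

# Same set difference, but every file object is closed when its block exits.
with open("c:/a.txt") as fa, open("c:/b.txt") as fb:
    a = set(fa.read().split("\n"))
    b = set(fb.read().split("\n"))
with open("c:/c.txt", "w") as fc:
    fc.write("\n".join(b.difference(a)))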
querido Posted November 20, 2009 at 12:08 AM

The commands given by wrbt work perfectly. Beautiful. Thank you.
querido Posted November 20, 2009 at 10:38 PM (edited)

Total success: it's automated now.

Disclaimer: You must study the documentation for these commands before trying them; your operating system might differ. (My system is Debian Linux.)

1. Since I haven't kept a list of *words*, I had to extract them from mnemosyne:

a. Exported from mnemosyne all cards having hanzi on the front (for me at this time, all of them), as tab-separated, into file "exported_cards".

b. For each line, stripped off everything after the first tab (leaving just "my_wordlist") with this:

sed 's/\t.*//' exported_cards > my_wordlist

I won't have to do #1 again.

2. Took a file with segmented hanzi looking like this snippet, in file "new_lesson":

01 ***************************************************
#1
小 | 白兔 | 和 | 小 | 灰 | 兔
老 | 山羊 | 在 | 地 | 里 | 收 | 白菜,
小 | 白兔 | 和 | 小 | 灰 | 兔 | 来 | 帮忙.

and rearranged it into a one-word-per-line wordlist, with symbols and duplicates removed, with this:

#!/bin/sh
sed 's/[.,!?:"#()1-9 |]/\n/g' new_lesson | sort -u > new_one_per

3. Extracted the *new* words from the above with this, credit wrbt (if you've mixed file types, use dos2unix to ensure your eol's are the same in all files; I think there's a way to write the following to be eol-tolerant):

#!/usr/bin/python
open("new_words","w").write("\n".join(set(open("new_one_per").read().split("\n")).difference(set(open("my_wordlist").read().split("\n")))))

4. The file "new_words" looks like this snippet, ready for making flashcards:

一家
到老
取下
吃不完
吃完
吵闹
帮忙
干活儿
我自己
才有
拔草
换下来
施肥

If I knew how to call the python script from the bash script it would be even neater.

[edit] If that python script is named "lexicon_tool.py", to call it from the bash script just add as the last line of the bash script:

/full/path/to/lexicon_tool.py

Now one command (the name of the bash script) takes "my_wordlist" (all of my known words) + "new_lesson" (Wenlin segmented text) and produces "new_words". [end edit]

In a lesson about 500 chars long, I did find some words I had overlooked as new. The volume I'm working on now is 5000 chars long, but now a lot of the recordkeeping/flashcarding overhead will be spared, with most of that time spent *in* Wenlin and ABC, deciding which definitions apply.

I'm not sure how this relates to the "Yet Another Word List Generator" thread, but 1. I'm committed to Wenlin/ABC (including its segmentation decisions) and 2. I don't want to play with Java right now. However, you could combine this with that and get the above with definitions too...

Edited November 20, 2009 at 11:30 PM by querido
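On the eol-tolerant question in step 3: Python's splitlines() strips both Unix "\n" and Windows "\r\n" endings, so a variant along the following lines should make the dos2unix pass unnecessary. A sketch only, keeping the file names from the post above:

#!/usr/bin/python
# Variant of step 3 that tolerates mixed line endings:
# splitlines() handles both "\n" and "\r\n".
lesson = set(open("new_one_per").read().splitlines())
known = set(open("my_wordlist").read().splitlines())
open("new_words", "w").write("\n".join(lesson.difference(known)))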
jbradfor Posted November 21, 2009 at 12:54 AM

I'm not sure how this relates to the "Yet Another Word List Generator" thread,

Different purpose. The point of YAWLG is to generate a word list from text, not from an existing list of words. That is, the primary "special sauce" in YAWLG is the java tool to delimit the words in a sequence of characters.
m_k_e Posted November 21, 2009 at 09:10 AM (edited)

You probably already solved your problem, but just in case someone else is after an alternative solution:

sea@cal:/tmp/y$ cat FileA
阿
邮件地址
下
所以
爸爸
妈妈
sea@cal:/tmp/y$ cat FileB
阿
不
从
的
俄
发
俄
个
很快
大的
朋友
邮件地址
下
需要
sea@cal:/tmp/y$ grep -vf FileA FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要

Or, in Ruby:

sea@cal:/tmp/y$ ruby -e 'puts (ARGF.to_a - File.readlines("FileA")).join' FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要

Of course, you can also do it all in one. Say, in bash:

#!/bin/sh
sed 's/[.,!?:"#()1-9 |]/\n/g' new_lesson | sort -u > new_one_per && python -c 'open("new_words","w").write("\n".join(set(open("new_one_per").read().split("\n")).difference(set(open("my_wordlist").read().split("\n")))))' || echo 'Error!'

Or in Ruby (I'm sorry, it's such a joy to code in Ruby):

#!/usr/bin/env ruby
out=[];ARGF.each_line {|line| out << line.gsub(/[\W\d]/,"\n").squeeze("\n")};puts (out-File.readlines("FileA")).join

Edited November 21, 2009 at 10:08 AM by m_k_e
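One caution on the grep version: -f treats each line of FileA as a pattern that can match anywhere within a line of FileB, so if 大 were in FileA it would also knock out 大的; grep -vxFf FileA FileB (fixed strings, whole-line matches) avoids that. And for anyone who would rather have querido's whole pipeline in a single file, below is one possible all-Python rendering of his steps 2 and 3. A sketch only, assuming Python 3 and UTF-8 files; the separator class just mirrors his sed expression and may need extending.

#!/usr/bin/env python3
# All-Python version of the split / dedupe / subtract pipeline.
import re

# The same characters querido's sed command turns into newlines,
# plus general whitespace; extend as your lessons require.
SEPARATORS = re.compile(r'[.,!?:"#()1-9 |\s]+')

def find_new_words(lesson_path, known_path, out_path):
    lesson_text = open(lesson_path, encoding="utf-8").read()
    # Splitting on separators and building a set removes duplicates,
    # as sort -u did in the shell version.
    words = {w for w in SEPARATORS.split(lesson_text) if w}
    known = set(open(known_path, encoding="utf-8").read().splitlines())
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(words - known)))

find_new_words("new_lesson", "my_wordlist", "new_words")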