
Need a tool: list strings in B not in A


querido


Could someone please point me to something no doubt already written? It is a simple function that must be of common interest. Grep + regex + scripting = I can't do it right now.

File A is known words of one or more characters, let's say one per line.

File B is text parsed into words, one word per line.

Find words in B not in A, print to file C = new words.

If no one answers, I'll do it myself when I can and post it here.

Thank you. Lessons are getting much longer; I must automate this.


It's pretty trivial if you've got python installed.

Assuming a.txt is:

Bengal Tiger

Siberian Tiger

Lion

Elephant

Cat

Dog

Mouse

Leopard

Assuming b.txt is:

Brown Bear

Cat

Alligator

Spotted Hyena

Coyote

Dog

Open Python at the command line, create two sets from the files split on LFs, then use the difference method of the set object:

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> a=set(open("c:/a.txt").read().split("\n"))

>>> b=set(open("c:/b.txt").read().split("\n"))

>>> for d in b.difference(a): print d

...

Alligator

Coyote

Spotted Hyena

Brown Bear

>>>

Should be pretty easy to turn that into a script you could run against whatever parameters.
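A sketch of what such a script might look like, taking the two filenames as parameters; the script name and structure here are my own invention, not a standard tool:

#!/usr/bin/env python
# usage: python newwords.py known.txt lesson.txt > new_words.txt
# Prints each line of lesson.txt that isn't in known.txt, keeping
# lesson order and skipping blank lines and duplicates.
import sys

known = set(open(sys.argv[1]).read().splitlines())
seen = set()
for word in open(sys.argv[2]).read().splitlines():
    if word and word not in known and word not in seen:
        seen.add(word)
        print(word)

Keeping the words in lesson order (rather than set order) also sidesteps the sorting issue raised further down.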


When you get it working, please post. I've been wanting to write something like this as well, for finding new vocab words in a text to read. [In this case, A would be all the words I know, B would be the new text, already divided into words by a different program, and C is the list of words I should learn before I read the text.]


To jbradfor: this first try is actually usable, BUT:

1. It doesn't avoid duplicates; the new word is listed in new_words multiple times

2. It puts a ">" char at the beginning of every line in new_words

The above can be fixed easily with a text editor, but the next one could be annoying:

3. It requires that both input files be one word per line

command:

diff known_words new_lesson | grep '>' > new_words
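A variant of the same idea that takes care of points 1 and 2, though not point 3 (this is my tweak, not tested against your files; sed -n with the p flag prints only the lines where the substitution matched, replacing the grep step):

diff known_words new_lesson | sed -n 's/^> //p' | sort -u > new_words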

I will work on something better, but I'm very busy studying right now.

I can't learn this stuff from scratch right now, but I can modify examples. Here is a good source of ideas.

Little tools like this must already exist, but I haven't found them yet.


And it requires that both lists are sorted using the same algorithm. I didn't want to lose the original order for list 'B'.

[You can remove the initial '> ' with a simple sed line, something like

sed -e 's/> //'

I think will work]
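Incidentally, comm(1) is built for exactly this job, though it too requires sorted input; with bash process substitution you can sort on the fly (my suggestion, untested on your data):

comm -13 <(sort known_words) <(sort new_lesson) > new_words

The -1 and -3 flags suppress the lines unique to the first file and the lines common to both, leaving only the lines unique to the second file, i.e. the new words. (The <( ) syntax is a bash-ism.)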


And it requires that both lists are sorted using the same algorithm.

Whoops... that's no good. Sorry.

If you want to see it, here is something very close.

I just converted my Wenlin-parsed new_lesson to one-word-per-line with this:

sed 's/ | /\n/g' new_lesson > temp

then this:

sed 's/, /\n/g' temp > new_one-word-per-line

I should have figured this out months ago. I still might use what wrbt gave me.


This will do it in one line, dumping the words of b.txt that aren't in a.txt into a third file called c.txt:

open("c:/c.txt","w").write("n".join(set(open("c:/b.txt").read().split("n")).difference(set(open("c:/a.txt").read().split("n")))))

Technically I think that's leaving a dangling resource, since we never catch and close the file objects returned by open, but it should be okay for a quick-and-dirty dump in a Python command-line window.
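If this graduates from the command line into a script you keep around, a with block closes the files for you. A sketch of the same set difference, assuming Python 2.7 or later for the multi-item with statement:

# same one-liner, but the file objects are closed on exit from the block
with open("c:/a.txt") as fa, open("c:/b.txt") as fb, open("c:/c.txt", "w") as fc:
    fc.write("\n".join(set(fb.read().split("\n")).difference(set(fa.read().split("\n")))))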


Total success: it's automated now. :D

Disclaimer: You must study the documentation for these commands before trying them; your operating system might differ. (My system is Debian Linux.)

1. Since I haven't kept a list of *words*, I had to extract them from Mnemosyne:

a. Exported from Mnemosyne all cards having hanzi on the front (for me, at this time, all of them), as tab-separated, into the file "exported_cards".

b. For each line, stripped off everything after the first tab (leaving just "my_wordlist") with this:

sed 's/\t.*//' exported_cards > my_wordlist
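For what it's worth, cut does the same job, since its default field delimiter is the tab:

cut -f1 exported_cards > my_wordlist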

I won't have to do #1 again.

2. Took a file with segmented hanzi looking like this snippet, in file "new_lesson":

01 ***************************************************
#1 小 | 白兔 | 和 | 小 | 灰 | 兔 
老 | 山羊 | 在 | 地 | 里 | 收 | 白菜, 小 | 白兔 | 和 | 小 | 灰 | 兔 | 来 | 帮忙.

and rearranged it into a one-word-per-line wordlist, with symbols and duplicates removed, with this:

#!/bin/sh
sed 's/[.,!?:"#()1-9 |]/n/g' new_lesson | sort -u > new_one_per

3. Extracted the *new* words from the above with this, credit wrbt

(If you've mixed file types, use dos2unix to ensure your EOLs are the same in all files. There's probably a way to write the following to be EOL-tolerant; see the sketch after the script.):

#!/usr/bin/python
open("new_words","w").write("n".join(set(open("new_one_per").read().split("n")).difference(set(open("my_wordlist").read().split("n")))))

4. The file "new_words" looks like this snippet, ready for making flashcards:

一家
到老
取下
吃不完
吃完
吵闹
帮忙
干活儿
我自己
才有
拔草
换下来
施肥

If I knew how to call the python script from the bash script it would be even neater.

[edit]

If that python script is named "lexicon_tool.py", to call it from the bash script just add as the last line of the bash script:

/full/path/to/lexicon_tool.py
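and make sure the script is executable:

chmod +x /full/path/to/lexicon_tool.py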

Now one command (the name of the bash script) takes "my_wordlist" (all of my known words) + "new_lesson" (Wenlin segmented text) and produces "new_words". :D

[end edit]

In a lesson about 500 chars long, I did find some words I had overlooked as new. The volume I'm working on now is 5000 chars long, but now a lot of the recordkeeping/flashcarding overhead will be spared, with most of that time now spent *in* Wenlin and ABC, deciding which definitions apply.

I'm not sure how this relates to the "Yet Another Word List Generator" thread, but 1. I'm committed to Wenlin/ABC (including its segmentation decisions) and 2. I don't want to play with Java right now. However, you could combine this with that and get the above with definitions too...



I'm not sure how this relates to the "Yet Another Word List Generator" thread,

Different purpose. The point of YAWLG is to generate a word list from text, not from an existing list of words. That is, the primary "special sauce" in YAWLG is the Java tool that delimits the words in a sequence of characters.


You probably already solved your problem, but just in case someone else is after an alternative solution:

sea@cal:/tmp/y$ cat FileA
阿
邮件地址
下
所以
爸爸
妈妈
sea@cal:/tmp/y$ cat FileB
阿
不
从
的
俄
发
俄
个
很快
大的
朋友
邮件地址
下
需要
sea@cal:/tmp/y$ grep -vf FileA FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要
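One caveat with this (my note, not part of the session above): plain grep -f treats each line of FileA as a regular expression and matches substrings, so a one-character known word would also knock out longer words containing it. Adding -F (fixed strings) and -x (whole-line match) makes it exact:

grep -vxFf FileA FileB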

Or, in Ruby

sea@cal:/tmp/y$ ruby -e 'puts (ARGF.to_a - File.readlines("FileA")).join' FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要

Of course, you can also do it all in one. Say, in bash:

#!/bin/sh
sed 's/[.,!?:"#()0-9 |]/\n/g' new_lesson | sort -u > new_one_per &&
python -c 'open("new_words","w").write("".join(set(open("new_one_per").readlines()).difference(set(open("my_wordlist").readlines()))))' ||
echo 'Error!'

Or in Ruby (I'm sorry, it's such a joy to code in Ruby)

#!/usr/bin/env ruby
# split on punctuation, digits, and whitespace rather than \W (which also matches hanzi)
words = ARGF.read.split(/[[:punct:][:digit:][:space:]]+/).reject(&:empty?)
puts (words - File.readlines("FileA").map(&:chomp)).uniq

