
Need a tool: list strings in B not in A


Recommended Posts

Posted

Could someone please point me to something that has no doubt already been written? It's a simple function that must be of common interest. Grep + regex + scripting: I can't manage it right now.

File A is a list of known words of one or more characters, let's say one per line.

File B is text parsed into words, one word per line.

Find words in B not in A, print to file C = new words.

If no one answers, I'll do it myself when I can and post it here.

Thank you. Lessons are getting much longer; I need to automate this.

Posted

It's pretty trivial if you've got Python installed.

Assuming a.txt is:

Bengal Tiger

Siberian Tiger

Lion

Elephant

Cat

Dog

Mouse

Leopard

Assuming b.txt is:

Brown Bear

Cat

Alligator

Spotted Hyena

Coyote

Dog

Open Python at the command line, create two sets from the files split on LFs, then use the difference method of the set object:

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> a=set(open("c:/a.txt").read().split("\n"))

>>> b=set(open("c:/b.txt").read().split("\n"))

>>> for d in b.difference(a): print d

...

Alligator

Coyote

Spotted Hyena

Brown Bear

>>>

Should be pretty easy to turn that into a script you could run against whatever parameters.
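
For instance, a minimal sketch of such a script (the name newwords.py and the argument order are my assumptions):

#!/usr/bin/python
# Usage: python newwords.py known.txt lesson.txt output.txt
# Writes to output.txt the lines of lesson.txt that are not in known.txt.
import sys

known = set(open(sys.argv[1]).read().split("\n"))
lesson = set(open(sys.argv[2]).read().split("\n"))

out = open(sys.argv[3], "w")
out.write("\n".join(lesson.difference(known)))
out.close()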

Posted

When you get it working, please post. I've been wanting to write something like this as well, for finding new vocab words in a text to read. [in this case, A would be all the words I know, B would be the new text, already divided into words by a different program, and C is the list of words I should learn before I read the text.]

Posted

To jbradfor: this first try is actually usable, BUT:

1. It doesn't avoid duplicates; the new word is listed in new_words multiple times

2. It puts a ">" char at the beginning of every line in new_words

The above can be fixed easily with a text editor, but the next one could be annoying:

3. It requires that both input files be one word per line

command:

diff known_words new_lesson | grep '>' > new_words

I will work on something better, but I'm very busy studying right now.

I can't learn this stuff from scratch right now, but I can modify examples. Here is a good source of ideas.

Little tools like this must already exist, but I haven't found them yet.

Posted

And it requires that both lists are sorted using the same algorithm. I didn't want to lose the original order for list 'B'.

[You can remove the initial '> ' with a simple sed line; something like

sed -e 's/> //'

will work, I think.]
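
For what it's worth, a short Python loop sidesteps the duplicates, the '> ' prefix, and the sorting problem all at once, since it walks B in its original order and checks membership with a set (a sketch; the filenames follow querido's):

#!/usr/bin/python
# Print words of new_lesson that are not in known_words,
# preserving new_lesson's original order and skipping duplicates.
known = set(open("known_words").read().split("\n"))
seen = set()
for word in open("new_lesson").read().split("\n"):
    if word and word not in known and word not in seen:
        seen.add(word)
        print word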

Posted (edited)
And it requires that both lists are sorted using the same algorithm.

Whoops... that's no good. Sorry.

If you want to see it, here is something very close.

I just converted my Wenlin-parsed new_lesson to one-word-per-line with this:

sed 's/ | /\n/g' new_lesson > temp

then this:

sed 's/, /\n/g' temp > new_one-word-per-line

I should have figured this out months ago. I still might use what wrbt gave me.

Edited by querido
Posted

This will do it in one line, dumping the results of a.txt and b.txt to a third file called c.txt:

open("c:/c.txt","w").write("n".join(set(open("c:/b.txt").read().split("n")).difference(set(open("c:/a.txt").read().split("n")))))

Technically I think that's leaving dangling resources, since we never close the file objects returned by open, but it should be okay for a quick and dirty dump in a Python command-line window.
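
If that bothers you, on Python 2.6 or later (or 2.5 with a __future__ import) the with statement closes the files automatically; a sketch of the same one-liner, unrolled:

# Same set difference, but every file object is closed on exit from its block.
with open("c:/a.txt") as fa:
    a = set(fa.read().split("\n"))
with open("c:/b.txt") as fb:
    b = set(fb.read().split("\n"))
with open("c:/c.txt", "w") as fc:
    fc.write("\n".join(b.difference(a)))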

Posted (edited)

Total success: it's automated now. :D

Disclaimer: You must study the documentation for these commands before trying them; your operating system might differ. (My system is Debian Linux.)

1. Since I haven't kept a list of *words*, I had to extract them from Mnemosyne:

a. Exported from Mnemosyne all cards having hanzi on the front (for me, at this time, that's all of them), as tab-separated values, into the file "exported_cards".

b. For each line, stripped off everything from the first tab onward (leaving just the word) with this:

sed 's/\t.*//' exported_cards > my_wordlist

I won't have to do #1 again.

2. Took a file with segmented hanzi looking like this snippet, in file "new_lesson":

01 ***************************************************
#1 小 | 白兔 | 和 | 小 | 灰 | 兔 
老 | 山羊 | 在 | 地 | 里 | 收 | 白菜, 小 | 白兔 | 和 | 小 | 灰 | 兔 | 来 | 帮忙.

and rearranged it into a one-word-per-line wordlist, with symbols and duplicates removed, with this:

#!/bin/sh
sed 's/[.,!?:"#()0-9 |*]/\n/g' new_lesson | sort -u > new_one_per

3. Extracted the *new* words from the above with this (credit wrbt):

(If you've mixed file types, use dos2unix first to ensure the EOLs are the same in all files. I think there's a way to write the following to be EOL-tolerant; see the splitlines() variant after the script.)

#!/usr/bin/python
open("new_words","w").write("n".join(set(open("new_one_per").read().split("n")).difference(set(open("my_wordlist").read().split("n")))))

4. The file "new_words" looks like this snippet, ready for making flashcards:

一家
到老
取下
吃不完
吃完
吵闹
帮忙
干活儿
我自己
才有
拔草
换下来
施肥

If I knew how to call the Python script from the bash script, it would be even neater.

[edit]

If that python script is named "lexicon_tool.py", to call it from the bash script just add, as the last line of the bash script:

/full/path/to/lexicon_tool.py

(This assumes the script has been made executable, e.g. with chmod +x; otherwise call it as python /full/path/to/lexicon_tool.py.)

Now one command (the name of the bash script) takes "my_wordlist" (all of my known words) + "new_lesson" (Wenlin segmented text) and produces "new_words". :D

[end edit]
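
For reference, the sed step can also move into Python, so the whole pipeline becomes a single script with no bash wrapper at all (a sketch; the character class mirrors the sed line in step 2, and the filenames are the same):

#!/usr/bin/python
# One script: split on punctuation, digits and the " | " separators,
# drop duplicates, subtract the known words, and write the result.
import re

text = open("new_lesson").read()
words = set(re.split(r'[.,!?:"#()0-9 |*\s]+', text))
known = set(open("my_wordlist").read().splitlines())

new_words = sorted(w for w in words.difference(known) if w)
open("new_words", "w").write("\n".join(new_words))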

In a lesson about 500 characters long, I did find some words I had overlooked as new. The volume I'm working on now is 5000 characters long, but now a lot of the recordkeeping/flashcarding overhead will be spared, with most of that time spent *in* Wenlin and ABC, deciding which definitions apply.

I'm not sure how this relates to the "Yet Another Word List Generator" thread, but 1. I'm committed to Wenlin/ABC (including its segmentation decisions) and 2. I don't want to play with Java right now. However, you could combine this with that and get the above with definitions too...


Edited by querido
Posted
I'm not sure how this relates to the "Yet Another Word List Generator" thread,

Different purpose. The point of YAWLG is to generate a word list from text, not from an existing list of words. That is, the primary "special sauce" in YAWLG is the Java tool that delimits the words in a sequence of characters.

Posted (edited)

You probably already solved your problem, but just in case someone else is after an alternative solution:

sea@cal:/tmp/y$ cat FileA
阿
邮件地址
下
所以
爸爸
妈妈
sea@cal:/tmp/y$ cat FileB
阿
不
从
的
俄
发
俄
个
很快
大的
朋友
邮件地址
下
需要
sea@cal:/tmp/y$ grep -vf FileA FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要
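
(One caveat: plain grep -vf treats each line of FileA as an unanchored pattern, so a known word like 下 would also suppress a longer new word like 下雨 in FileB. Adding -x for whole-line matches and -F for fixed strings, i.e. grep -vxFf FileA FileB, avoids that.)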

Or, in Ruby:

sea@cal:/tmp/y$ ruby -e 'puts (ARGF.to_a - File.readlines("FileA")).join' FileB
不
从
的
俄
发
俄
个
很快
大的
朋友
需要

Of course, you can also do it all in one. Say, in bash:

#!/bin/sh
sed 's/[.,!?:"#()0-9 |*]/\n/g' new_lesson | sort -u > new_one_per &&
python -c 'open("new_words","w").write("\n".join(set(open("new_one_per").read().splitlines()).difference(set(open("my_wordlist").read().splitlines()))))' ||
echo 'Error!'

Or in Ruby (I'm sorry, it's such a joy to code in Ruby):

#!/usr/bin/env ruby
# Ruby's \w matches only ASCII word characters, so [[:word:]] is used instead
# to keep the hanzi; anything else (and any digit) acts as a separator.
known = File.readlines("FileA").map(&:chomp)
puts(ARGF.read.split(/[^[:word:]]|\d/).reject(&:empty?).uniq - known)

Edited by m_k_e
