Chinese-Forums

Request: Handling large volumes of vocabulary


Olle Linge


Hello,

ZDT has been my trustworthy companion ever since I started learning Chinese, and in general, I have nothing to complain about. However, I have thought of some features which would make the program much better. These problems didn't occur until recently, because one needs a lot of characters before they become significant. I have over 5000 entries now, and here is one suggestion that would make them much, much easier to handle:

- Allow subdirectories

I prefer to keep my words separated into the various chapters of the relevant books I found them in, but this is very annoying when you have hundreds of categories (it takes a while to scroll).

Furthermore, subdirectories would also allow me to choose more quickly which words to test for the flashcards. Right now, I have to select a lot of folders, which is a bit annoying. Having the option of sorting my characters more efficiently would be great.

Here is another suggestion:

- Enable some search for duplicate entries

Right now I have to export the database, open it in an external program, search for the character, go back to ZDT and remove the overlapping entries. Not very convenient. Sometimes, it's also hard to spot duplicates.

These are just suggestions for making an already outstanding program even better. I will continue using ZDT, but especially the subdirectories would make me very, very happy.

Regards,

Snigel


I agree, especially on the duplicate search. Maybe something like "compare this category with this one", with the option of deleting duplicates in the second category.

I would use this to compare new vocabulary lists with my HSK list, so I can practice only the ones I haven't studied before.
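To make the "compare this category with this one" idea concrete, here is a rough Python sketch. It works on plain lists of words, not on ZDT's own database, and the function name and sample words are made up for illustration.

```python
def new_words_only(new_list, known_list):
    """Return the words in new_list that are not in known_list.

    Order is preserved, and a word that is repeated inside new_list
    is only kept once.
    """
    known = set(known_list)
    seen = set()
    result = []
    for word in new_list:
        if word not in known and word not in seen:
            result.append(word)
            seen.add(word)
    return result


# Made-up sample data: an "HSK" list and a new chapter list.
hsk = ["你好", "谢谢", "学习"]
chapter = ["学习", "图书馆", "谢谢", "考试"]

print(new_words_only(chapter, hsk))
```

Only the words from the second list that are missing from the first one survive, which is what you would want to practice.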

Another suggestion: merging and separating categories, especially merging. At first I also added lists by chapter, but as I get more familiar with the words I'd like to merge the chapters into a single list.


For duplicates:

I would prefer a manual check here, but I assume you would too. All that is needed is a function that can find duplicates, preferably looking at characters and not pinyin. Of course, fancier stuff would be good too, but it's not essential.
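Something like the following Python sketch is what I have in mind: it only reports duplicates, keyed on the characters and ignoring pinyin, and leaves the deleting to you. The (category, characters, pinyin) tuples are an assumed layout for illustration, not ZDT's actual format.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Group entries by their characters and return the groups with more than one hit.

    entries: iterable of (category, characters, pinyin) tuples (assumed layout).
    """
    by_chars = defaultdict(list)
    for category, chars, pinyin in entries:
        by_chars[chars].append((category, pinyin))
    return {chars: places for chars, places in by_chars.items() if len(places) > 1}


# Made-up sample: the same word entered twice with differently written pinyin.
entries = [
    ("Book 1, ch. 1", "银行", "yin2 hang2"),
    ("Book 1, ch. 2", "学习", "xue2 xi2"),
    ("Book 2, ch. 5", "银行", "yín háng"),
]

for chars, places in find_duplicates(entries).items():
    print(chars, places)
```

Because the key is the characters only, the two entries above are still caught even though their pinyin was typed differently.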

For merging lists:

This is fairly easy to do:

1) Create a backup file

2) Edit the backup file with a text editor

3) Remove the category headings

4) Restore data from your modified file

5) Good to go


  • 3 weeks later...
I prefer to keep my words separated into the various chapters of the relevant books I found them in, but this is very annoying when you have hundreds of categories (it takes a while to scroll).

Furthermore, subdirectories would also allow me to choose more quickly which words to test for the flashcards. Right now, I have to select a lot of folders, which is a bit annoying. Having the option of sorting my characters more efficiently would be great.

It's not a pretty solution, but keep in mind that at least on the Linux version, the categories are kept in a file called "user.script". When you start zdt, it reads in whichever user.script file is in the current directory. So you can create different user.script files in different directories, each with a different set of categories. And you can use the backup / restore feature to move categories between different files.


- Enable some search for duplicate entries

Right now I have to export the database, open it in an external program, search for the character, go back to ZDT and remove the overlapping entries. Not very convenient. Sometimes, it's also hard to spot duplicates.

Here's a simple gawk script that does that in a very dumb way. Basically, it goes through all your categories and removes all entries that have been seen before. [It matches on ALL of simplified, traditional, and pinyin.] Before you use it, RENAME YOUR USER.SCRIPT FILE FIRST. I tested it for about 2 minutes and it didn't crash anything. E.g.:

mv user.script user.script.good

gawk -f parse.awk user.script.good > user.script

[This assumes you are on Linux and have saved the script as parse.awk.]

Currently, it keeps the first occurrence. Note that the categories are saved in the order you created them. They are not saved in the order you see them inside zdt. To see the actual order, look for the "INSERT INTO CATEGORY VALUES" lines in your user.script file.
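If you would rather not scroll through the file by hand, a small Python sketch like this one lists those lines in file order. The sample lines are hypothetical (the real column layout in user.script may differ), and the file encoding in read_category_lines is an assumption.

```python
MARKER = "INSERT INTO CATEGORY VALUES"


def category_lines(lines):
    """Return the category INSERT lines, in the order they appear."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def read_category_lines(path="user.script"):
    """Same as category_lines, but reads from a file (encoding is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return category_lines(handle)


# Hypothetical sample lines, only to show the filtering.
sample = [
    "INSERT INTO CATEGORY VALUES(0,'Book 1, chapter 2')\n",
    "INSERT INTO USER_ENTRY VALUES(0,0,0,'银行','銀行','yin2 hang2',1,'bank')\n",
    "INSERT INTO CATEGORY VALUES(1,'Book 1, chapter 1')\n",
]

for line in category_lines(sample):
    print(line)
```

The order printed is the creation order, which is the order the script uses when deciding which occurrence is "first".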

The ideas for improvement are endless. But if you actually use this and want a new feature, let me know.

BEGIN {
  foundCnt = 0;
}


/INSERT INTO USER_ENTRY VALUES/ {
  # Parsing the fields is a bit complex.  It's based on a comma that is NOT
  #  within single quotation marks.  So just scan one comma at a time.  Note
  #  that only the last field has commas

  # Find the first "("; this starts the data
  n = index($0, "(");
  data = substr($0,n+1);
  #print data;

  # now find each field
  # (awk strings are 1-indexed, so substr must start at 1; with a start of 0
  #  gawk returns one character less than asked for)
  for ( i=1 ; i<=7 ; ++i ) {
     # See if the next character is a '
     firstQ = index(data, "'");
     if ( firstQ == 1 ) {
        # There is a ', so the field is from first ' to second '.  So find the next '
        secondQ = index(substr(data, firstQ+1), "'");
        #print substr(data, 1, secondQ+1);
        field[i] = substr(data, 1, secondQ+1);
        # assume the next character is a comma
        data = substr(data, secondQ+3);
     }
     else {
        # There is no ', so the field is until the next ,
        firstC = index(data, ",");
        #print substr(data, 1, firstC-1)
        field[i] = substr(data, 1, firstC-1);
        data = substr(data, firstC+1);
     }
  }

  # Check if we've seen this word before.
  if ( found [field[4], field[5], field[6] ] == "" ) {
     # If not, print it out, and mark it
     # as seen.  While that seems simple, it's not.  We can not print it out
     # as-is, as we need to change the index as we are removing entries.  In
     # addition, need to remember which indices we remove, so we can remove
     # them as well in the stats later
     found [field[4], field[5], field[6] ] = field[1];
     sub(field[1], foundCnt, $0);
     foundConvert[field[1]] = foundCnt;
     print $0;
     ++foundCnt
  }
  else {
     # just indicate to remove it
     foundConvert[field[1]] = -1;
  }
  next;
}


/INSERT INTO STAT VALUES/ {
  # print these out *almost* as-is.  The only change is to the index.  For
  # now, assume the index is between the first "(" and the first ","; this
  # seems a bit brittle, but it seems to be correct for now....
  findP = index($0, "(");
  findC = index($0, ",");
  idx = substr($0, findP+1, findC-findP-1);
  if ( foundConvert[idx] == -1 ) {
     # If the index was skipped, do nothing
  }
  else {
     # use the new index, and print it out
     sub(idx, foundConvert[idx], $0);
     print $0;
  }

  next;
}


# Not a word, so just print it out.
{
  print $0;
}
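For anyone who finds the field-scanning loop hard to follow, here is the same idea as a rough Python sketch. The sample line and its column order are hypothetical (I only assume, as the script does, that field 1 is the index and fields 4 to 6 are simplified, traditional and pinyin). Like the awk version, it does not handle a quote character inside a quoted field.

```python
def split_fields(line, count=7):
    """Split the first `count` fields after the first "(" one at a time.

    A field that starts with ' runs to the next ' (quotes are kept);
    any other field runs to the next comma.
    """
    data = line[line.index("(") + 1:]
    fields = []
    for _ in range(count):
        if data.startswith("'"):
            end = data.index("'", 1)        # position of the closing quote
            fields.append(data[:end + 1])   # keep both quotes
            data = data[end + 2:]           # skip the closing quote and the comma
        else:
            comma = data.index(",")
            fields.append(data[:comma])
            data = data[comma + 1:]
    return fields


# Hypothetical line; the real column layout of user.script may differ.
line = "INSERT INTO USER_ENTRY VALUES(12,3,0,'银行','銀行','yin2 hang2',1,'bank, a bank')"

fields = split_fields(line)
key = (fields[3], fields[4], fields[5])  # what the script calls field[4..6]
print(fields[0], key)
```

The key tuple is what the script uses to decide whether an entry has been seen before, and the first field is the index that gets renumbered.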

