Text databases and searching using Hanzi

September 8, 2009 at 01:43 AM

I have many documents (plain text, PDF, Word, XML, chat records, emails) containing English and Chinese that I want to index and search.

I do not want to create keywords myself. I want the indexing program to do that and later retrieve documents or sections of documents that are relevant.

Storing and indexing large amounts of text in a database is an issue, but tools like askSam, Xtree or OneNote do it well, provided the data is in English.

Most of them do not handle Hanzi very well. Some corrupt the Hanzi; others store it well but can then only search for a few characters, not a string of characters or a slab of text. Some cannot search for Hanzi at all.

In contrast, Google Desktop is magnificent at locating a string of characters and can list all the relevant documents. But a list is all it provides; it is not designed to extract or combine the material it finds. Those other programs can do that very well.

Does anyone know of a database or indexing tool with this sort of index / search / extract capability that handles Hanzi?

I am wondering if Google's new Android online / desktop software will do this? Now I am out of my depth.

September 8, 2009 at 03:22 AM

Not sure what you mean by "extract or combine" in the context of search, but I have found Windows Desktop Search to be very good, probably better than either Google or Copernic. It has quite a number of boolean search options. Maybe you can give it a try.

http://www.microsoft.com/windows/products/winfamily/desktopsearch/default.mspx

September 8, 2009 at 04:29 AM

Thanks, Gato. I will try the MSN search.

Not sure what you mean by "extract or combine" in the context of search,

In a number of programs, such as askSam, you can do a search together with the instructions to extract and combine text matching the criteria.

For example if I wanted to look for the text

薄霧濃云愁永晝

and its translation (various translations) such as

"Light mists and heavy clouds, melancholy the long dreary day"

ideally the program would find the documents (or paragraphs) and output them in a new document (together with id data such as file name).

You might then have a new document with any number of different translations of that Chinese text. You could easily set up the program to combine the text, add introduction and conclusion text, and generate emails to a list of addresses. Useful for sending out study topics to students or a discussion group.

Currently many programs do this easily for searches in English. Finding something that does it for both Chinese and English is the hard part.
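To make the "extract and combine" idea concrete, here is a minimal Python sketch. It is not how askSam or any other product works; it assumes the documents have already been converted to UTF-8 plain-text files in one folder, that paragraphs are separated by blank lines, and that a translation sits in the same paragraph as the Chinese text. The function names `extract_matches` and `combine` are made up for this illustration.

```python
from pathlib import Path


def extract_matches(folder, query, encoding="utf-8"):
    """Return (file name, paragraph) pairs for every paragraph containing query."""
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding=encoding)
        # Assumption: blank-line-separated blocks are paragraphs.
        for para in text.split("\n\n"):
            if query in para:
                hits.append((path.name, para.strip()))
    return hits


def combine(hits, intro="", outro=""):
    """Join the hits into one document, labelling each with its source file."""
    parts = [intro] if intro else []
    parts += [f"[{name}]\n{para}" for name, para in hits]
    if outro:
        parts.append(outro)
    return "\n\n".join(parts)
```

Calling `combine(extract_matches("notes", "薄霧濃云愁永晝"), intro="This week's topic")` would give one text with every matching paragraph and the file it came from; the folder name `notes` is only an example. Because Python strings are Unicode, the substring test works the same for Hanzi as for English.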

September 9, 2009 at 02:56 AM

Philip,

Are you looking for a prepackaged software application that will index all of these different document types for you, or are you a software developer looking for enabling software to build a custom search engine?

If the latter, we've done significant work with search as part of the Adso Chinese-English machine translation project. There are scripts in the /search subdirectory of the distribution you may find useful that demonstrate using the engine as a preprocessor to Lucene. Adso parses text into indexable chunks and outputs it to a file. Lucene then makes it all searchable. Requires programming ability but is quite powerful and totally customizable.
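For anyone wondering what a preprocessing step of this kind looks like, here is a rough Python sketch. It is not Adso's code or algorithm; it swaps in a simple greedy longest-match segmenter over a tiny made-up lexicon, and the names `segment` and `to_indexable_line` are hypothetical. The idea is only that Chinese text gets split into chunks separated by spaces, so that an indexer which splits on whitespace can treat each chunk as a term.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_indexable_line(text, lexicon):
    """Join the tokens with spaces; the result can be written to a file for indexing."""
    return " ".join(segment(text, lexicon))


# Illustrative lexicon only; a real one would hold many thousands of entries.
lexicon = {"薄霧", "濃云", "永晝"}
line = to_indexable_line("薄霧濃云愁永晝", lexicon)
```

With this lexicon the line comes out as four chunks: 薄霧, 濃云, 愁 and 永晝. A real segmenter does far more than this, but the output format is the point here.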

--dave

September 9, 2009 at 03:48 AM

trevelyan

Thanks for your reply.

I am looking for prepackaged software. However, I know a bit about computing and I have a friend with experience in writing databases. I am sure I could call on his help.

I would like to see your scripts and see what I could do with them.

As an aside, it amazes me that so little has been done in this area with such a large potential market. It seems the big American companies have just not bothered. Any comments on the reasons?

Philip

September 10, 2009 at 02:40 AM

Sounds like we're not what you want, although you're more than welcome to download the latest distribution and check it out if you ever end up implementing something yourself.

On the business side, I remember a couple of years ago when Google was selling standalone search appliances (servers). Baidu had a division of their own that did something similar and they ended up just shutting it down because it wasn't making money.

Even for the major search companies, good NLP isn't necessary for search. As long as you tokenize your text the same way at search time as you do when indexing, your engine will still work; it just won't be as precise. You can see this in action whenever you search on Baidu or Google for a less common proper name in Chinese and get results where partial matches float to the top above full matches. And those uses don't require any knowledge of things like translation or part-of-speech.

I think there'll be opportunities for more interesting NLP work moving forward, but I think the business model for more sophisticated services needs to be worked out before anyone will spend a lot of money on anything like what we're doing.
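A small Python sketch of that point, under stated assumptions: it uses overlapping character bigrams as the tokenizer, which is one common NLP-free choice but not necessarily what Baidu or Google do, and it ranks by a plain count of shared bigrams. All function names are made up for the illustration. The same `bigrams` function is applied to documents and to the query, which is what keeps the index usable without any knowledge of Chinese words.

```python
from collections import Counter, defaultdict


def bigrams(text):
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in bigrams(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query bigrams they contain."""
    scores = Counter()
    for token in set(bigrams(query)):  # same tokenizer as at index time
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return scores.most_common()


docs = {"full": "薄霧濃云愁永晝", "partial": "薄霧散去"}
results = search(build_index(docs), "薄霧濃云")
# results == [("full", 3), ("partial", 1)]
```

The document that only shares 薄霧 with the query still comes back, just with a lower score. In an engine with a more elaborate ranking formula, that kind of partial match can end up above a full match, which is the imprecision described above.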

PhilipLean

gato

PhilipLean

trevelyan

PhilipLean

trevelyan