When I started to write this up, it got pretty long, so I am going to do it in installments as subsequent blog entries.
First off, the mere fact that there are so many of them makes any one of them uninteresting - is this particular one going to be the one that is so much better than the others? Will this one be the breakthrough? How can I tell?
Let's look a bit deeper at what is going on here. Why are these people interested in natural language processing? Their interest usually stems from an interest in search: they want to figure out a way to let a particular population of users search effectively through a highly specialized corpus of information. Some sample corpora they are interested in include engineering documents, finance reports, clinical medical records, and genomic research reports. There is an assumption here that, since there are so many of these documents running around, the only way to index them must be automatic. And since the only thing they all have in common is that they are written in English, automatic natural language processing is the only way to address the problem.
This is the path that Google (and Alta Vista before them) started down when they were just a fledgling company. They quickly learned that the natural language content of the documents was a difficult and unreliable single source for indexing information. Google has made a very good business by exploiting other information about documents (most famously, reference analysis).
So when I read about another natural language analysis approach, I can't help but wonder whether it will succeed where Google failed, and if so, why. I don't mean this as a defeatist attitude ("Gee, those smart Googlers couldn't figure it out, so nobody else can, either!"). Rather, I just want to point out that Google found it useful to take advantage of other information to enhance the language analysis they were doing. Isn't it foolish of any approach to avoid using extra information to accomplish this task?
But what information might be available, if you don't have references to analyze, like Google does? I'll explore the answers to that in the next entry.