August 31, 2003

Urgh, Hurrah, Lucene worked out

So after my two previous posts on the joy of indexing with Cocoon, I hit the same problem with indexing yesterday morning - the fated "Too many open files" error message.

I spent most of the morning crawling through thousands of lines of logs and code to find out what the problem was. It turned out there were two.

The first was that my regular expression to specify what should be excluded from indexing was wrong: I had <exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,*\.ico$,.*/search/.*</exclude> when obviously I should have had <exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*\.ico$,.*/search/.*</exclude>. (The .ico pattern was missing its leading dot.) How could I be so silly, you're wondering, right? ;-)

This was compounded by two things: firstly, the regular expression wasn't valid, but Cocoon didn't tell me so. Secondly, I didn't really want to specify that in the indexing section. It should have been the crawling section. (Think "the thing that rummages through the filing cabinet pulling out relevant files" rather than "the thing that reads through each file marking relevant sentences".) As soon as I put the regular expression in the crawling section, I got a regexp error warning, so I was able to track down the mistake and fix it.
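For what it's worth, a check along these lines would have caught the bad pattern up front. This is only a sketch: it uses java.util.regex, which may not be the regex engine Cocoon itself uses, and the class and method names are made up for illustration. It splits the comma-separated list and reports every entry that fails to compile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of Cocoon.
class ExcludePatternCheck {

    // Returns every entry of a comma-separated pattern list that fails to compile.
    // Assumes no pattern contains a comma of its own (e.g. a {2,3} quantifier).
    static List<String> findInvalid(String commaSeparated) {
        List<String> invalid = new ArrayList<>();
        for (String candidate : commaSeparated.split(",")) {
            try {
                Pattern.compile(candidate);
            } catch (PatternSyntaxException e) {
                // A leading '*' has nothing to repeat, so compilation fails here.
                invalid.add(candidate);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String broken = ".*\\.png$,.*\\.js$,.*\\.css$,.*\\.gif$,.*\\.jpg$,*\\.ico$,.*/search/.*";
        String fixed  = ".*\\.png$,.*\\.js$,.*\\.css$,.*\\.gif$,.*\\.jpg$,.*\\.ico$,.*/search/.*";

        System.out.println("broken list, invalid entries: " + findInvalid(broken));
        System.out.println("fixed list, invalid entries:  " + findInvalid(fixed));
    }
}
```

Run against the two lists from this post, the broken one should report only the *\.ico$ entry and the fixed one should report nothing.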

Once I fixed that, all was well. The documentation was somewhat opaque on the matter, but it's difficult to see how the wording could be improved. I think I just need to add a few more examples. I should also see if I can hack in better warnings about regexps failing.

Posted by savs at August 31, 2003 10:23 AM