August 31, 2003

Urgh, Hurrah, Lucene worked out

So after my two previous posts on the joy of indexing with Cocoon, I hit the same problem with indexing yesterday morning - the fated "Too many open files" error message.

I spent most of the morning crawling through thousands of lines of logs and code, to find out what the problem was. It looks like there were two problems.

The first was that my regular expression to specify what should be indexed was wrong: I had <exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,*\.ico$,.*/search/.*</exclude> when obviously I should have had <exclude>.*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*\.ico$,.*/search/.*</exclude>. How could I be so silly, you're wondering, right? ;-)

This was compounded by two things: firstly, the regular expression wasn't valid, but Cocoon didn't tell me so. Secondly, I didn't really want to specify that in the indexing section. It should have been the crawling section. (Think "the thing that rummages through the filing cabinet pulling out relevant files" rather than "the thing that reads through each file marking relevant sentences".) As soon as I put the regular expression in the crawling section, I got a regexp error warning, so I was able to track down the mistake and fix it.
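A mistake like this is easy to catch before the configuration ever reaches Cocoon. The sketch below splits the exclude list on commas and tries to compile each pattern. It uses Python's re module purely as a stand-in (it is not the regexp engine Cocoon uses, and the function name is made up for illustration), but a pattern that starts with a bare * is rejected here too.

```python
import re


def find_invalid_patterns(exclude_spec):
    """Return (pattern, error message) pairs for patterns that fail to compile."""
    problems = []
    for pattern in exclude_spec.split(","):
        try:
            re.compile(pattern)
        except re.error as exc:
            # A leading "*" has nothing to repeat, so compilation fails.
            problems.append((pattern, str(exc)))
    return problems


# The two exclude lists quoted above: the broken one and the fixed one.
BROKEN = r".*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,*\.ico$,.*/search/.*"
FIXED = r".*\.png$,.*\.js$,.*\.css$,.*\.gif$,.*\.jpg$,.*\.ico$,.*/search/.*"

for pattern, message in find_invalid_patterns(BROKEN):
    print(f"invalid pattern {pattern!r}: {message}")
```

Running the same check on the fixed list reports nothing.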

Once I fixed that, all was well. The documentation was somewhat obtuse on the matter, but it's difficult to see how it could be improved. I think I just need to add a few more examples. I should also see if I can hack in better warnings about regexps failing.

Posted by savs at 10:23 AM

MS legal trouble

This just in via Slashdot and the AFFS IRC channel: Microsoft is in trouble in the courts again.

Short story: Microsoft stole some technology, a company is suing them. Microsoft didn't hand over all relevant emails in the case. The other company complained, pointing out they were all backed-up somewhere. Then this: "So the judge ordered Microsoft to produce the missing messages. The employee PCs, the servers, and the off-site backup tapes have to be searched and soon. The Microsoft lawyers complained that would be like finding a needle in a haystack. The judge reminded them that it was they who had put that needle in the hay."

I think the fear and awe of Microsoft has worn off, and judges are prepared to get tough. Finally. Now if someone would just sue them for the disruption caused by vulnerabilities in Outlook, I'd be a really happy guy.

Posted by savs at 10:08 AM

August 30, 2003

On XSLT and Cocoon

Russell Beattie writes: "I'm not totally against XSLT, I just think that it's not a programming language which is what many apps - like Cocoon - make it into. It's best for transforms only, in my opinion, anything beyond that makes it impossible to develop and maintain."

An interesting viewpoint, and while I don't pretend to know what he's building, his technical description further on makes me think he missed out on quite a bit of how Cocoon works, and how you can use XSLT within it.

For example: "Mostly it's simple stuff for example, if you're logged in, you get an additional link to edit or delete that page. But it's still *logic* that would be a bitch in XSLT to work out correctly and maintain." That's where the problem is. Don't put logic in the XSLT if you can help it. Have an action, xsp, or whatever you like somewhere further up the line to create the login information as XML. Then, in the XSLT, all you need to do is <xsl:template match="your-login-details-tag">.

I do agree with Russ on one thing though: "Once we get the tools standardized, living in XML within our apps is going to be the only way to do things" -- amen!

I'm off to read Jon Udell's articles on "The document is the database" and "XSLT Recipes for Interacting with XML Data" now...

Posted by savs at 10:36 AM

August 29, 2003

Hurrah!

Index has been created. Total Count Of Documents 2228

The fact that 2.0.5 took at least twice as long to index as 2.0.4 I will look into later. First, I think I need some sleep.

Posted by savs at 2:30 AM

Urgh

I'm trying to convince Cocoon's built-in search engine, Lucene, to index a site of some 3000 documents. Not a particularly challenging task, but unfortunately it keeps blowing up:

org.apache.cocoon.ProcessingException: Exception in numDocs()!: 
java.io.FileNotFoundException:
/usr/local/jakarta-tomcat/work/Standalone/some.site/_/cocoon-files/index/_a.tis (Too many open files)

This had been a continual problem for me, up until recently when I refactored mercilessly, chopping several hundred lines of code and streamlining the site. Two days ago, it was indexing just fine.

I know people are using this stuff on sites with 250,000+ documents, so 3000 should not be a problem. Hours of reading docs, mailing lists, Google and source code have not enlightened me.

(Well, actually, that's a lie - I now know about store-fields, merge-factor, exclude and other fine things. Unfortunately, none of this has helped solve the problem.)
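For context, "Too many open files" is the operating system refusing to hand out another file descriptor once a process has reached its limit. The sketch below is one rough way to see how close a process is to that limit. It assumes a Linux-style /proc filesystem, the function names are invented for illustration, and the limits it prints belong to the process running the script, which may differ from the ones Tomcat was started with.

```python
import os
import resource


def current_limits():
    """Return the (soft, hard) open-file limits of the process running this script."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_files(pid="self"):
    """Count the open file descriptors of a process, or return None if unknown.

    Assumes a Linux /proc layout; pass a numeric pid to inspect another process
    (that needs permission to read its /proc entry).
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        # No /proc here, no such process, or not allowed to look.
        return None


soft, hard = current_limits()
print(f"open file limit: soft={soft}, hard={hard}")
print(f"descriptors open right now: {count_open_files()}")
```

Watching the count for the Tomcat process while the index is being built would show whether descriptors are steadily leaking or whether the limit is simply too low for the job.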

Update: it works fine in my old copy of Cocoon 2.0.4. Which is great. Except I need two features of Cocoon 2.0.5-dev (the bugfix tree for 2.0.*) - the PaginatorTransformer and Sylvain's and Jeremy's backport of the new Lucene features which let me specify which fields get output in search results (i.e., rather than listing just web addresses, it can give meaningful results like page titles).

Time to find out what broke in 2.0.5 ... (time for another cup of tea)

Posted by savs at 1:54 AM

August 28, 2003

Patents postponed?

Looks like the current patent proposal has been withdrawn - for the time being, at least. The fact it has been withdrawn doesn't mean that it won't be back again - it just means they are being asked to think about it. It seems they acknowledge it has problems!

Update: I like this. Non-IT people getting in on the fun. "A group of economists has blasted a proposed European Union (EU) law on software patents, characterizing it as damaging to technological innovation and Europe's software industry." -- itworld.com.

Posted by savs at 6:52 PM

August 27, 2003

Software Patents

Software patents are bad. Don't let them come to Europe.

When a patent is approved, its claims are not made public for 18 months. A company could have a new piece of software in widespread use before it's even possible to check if it infringes existing patents.

For market competition, new packages must be allowed to build on existing practices. Companies with large patent portfolios would have a legal tool to enforce monopoly status.

Software development companies would have to regularly perform patent lookups while developing software, and development resources would be diverted to legal issues.

In America, large software developing companies find it impossible to develop new software without infringing a few patents. Worse, a parasitic class of intellectual property firm is appearing. These small, usually new firms buy the patents of cash-strapped companies and, with nothing to lose, start suing anyone that they think they can get money out of.

Posted by savs at 9:26 AM

Steven is having problems with

Steven is having problems with a dead hard disk. I must get into the office today and run the backup on my laptop. (When I work 20 hour days I tend to do it from home, so I don't have to walk far in order to crash out.) I've been having problems using the external USB hard disk I bought for backups - USB is a bit unstable for me in 2.4.21, causing my machine to hang if I try and do anything like an e2fsck on the external disk. Time to sort that out once and for all.

The great thing about CVS is it means copies of my work are distributed on two or more computers at any one time. The same doesn't apply to my email though, at least until I sort IMAP out. Or my photo collection. Or my music collection. Or .... etc etc etc!

Posted by savs at 9:15 AM

August 22, 2003

Defeating SoBig.F

I was struggling to keep up with the flood of SoBig.F emails hitting my inbox, and getting quite concerned for my chances of any mobile computing. Downloading an inbox with several hundred 100k attachments via your mobile phone is not an enticing proposition - especially when you pay per kilobyte used.

The mailserver I run is exim 3.*, so I spent some time trawling the exim mailing list archives and filter documentation, to see if I could block them before I had to download them. It is possible, and actually pretty straightforward - all you need is a .forward file in your home directory on the server that looks like this:

# Exim filter
 
if error_message then finish endif
 
if $header_subject: is "Re: Your Application"
        or $header_subject: is "Re: My Details"
        or $header_subject: is "Re: Details"
        or $header_subject: is "Your Details"
        or $header_subject: is "Re: That movie"
        or $header_subject: is "Re: Wicked screensaver"
        or $header_subject: is "Re: Thank you!"
        or $header_subject: is "Thank you!"
        or $header_subject: is "Re: Approved"
        or $header_subject: is "Re: Re: My details"
then
        save /home/savs/sobig
endif

It seems to be doing the trick - it's been running for about 20 minutes and has so far zapped 18 virus emails. The downside is that this won't reduce the server's incoming traffic, but we still have a fair way to go before we hit the 40 GB allowance limit.

(There are more sophisticated methods for exim 4 users, but as I'm running the stable distribution of Debian, this isn't an option for me.)
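Before trusting a subject list like the one in the filter, it can be worth trying it against a few saved messages. The Python sketch below mimics the same check outside Exim. It is only an approximation: the function name is hypothetical, and it assumes subjects are compared in full and without regard to case, so check the Exim filter documentation for how "is" really behaves.

```python
from email import message_from_string

# Subject lines taken from the .forward filter above, lower-cased for comparison.
SOBIG_SUBJECTS = {
    subject.lower()
    for subject in (
        "Re: Your Application",
        "Re: My Details",
        "Re: Details",
        "Your Details",
        "Re: That movie",
        "Re: Wicked screensaver",
        "Re: Thank you!",
        "Thank you!",
        "Re: Approved",
        "Re: Re: My details",
    )
}


def is_sobig_subject(raw_message):
    """Return True if the message's Subject header is on the SoBig.F list."""
    message = message_from_string(raw_message)
    # A message with no Subject header is treated as having an empty subject.
    subject = (message["Subject"] or "").strip()
    return subject.lower() in SOBIG_SUBJECTS


sample = "From: someone@example.com\nSubject: Re: Wicked screensaver\n\nSee attached.\n"
print(is_sobig_subject(sample))
```

Exact matching keeps false positives down, at the cost of missing any variant subject the virus invents later.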

Update: You can now monitor the battle almost live.

Posted by savs at 12:03 PM

August 20, 2003

Dublin

I spent last weekend in Dublin with Nic, as she was singing in Dublin cathedrals (including St Patrick's).

While we were there, we visited the Castle, and discovered a sand sculpture exhibition was on. The sculptures were really cool, like nothing I'd seen before, and certainly made my sandcastles on the beach look pretty poor. Seeing Matthew's post about sand castles reminded me to upload these pictures....

Posted by savs at 11:23 PM

Blogging the end of email

The first I heard about the resurgence of W32/Sobig email viruses (Re: Thank you!, Re: Wicked screensaver, Re: Details, failure notice, Re: movie) was last night on the phone to Nic. Since then I've received several hundred of the emails myself, and reports are flooding in. I'm getting bored of passive-smoking Microsoft's carcinogenic crap. I say we should storm Redmond and dispose of King Bill.

Posted by savs at 10:47 PM

A step closer to laptop nirvana

After several hours of fighting with my laptop, I've finally got closer to the flawless mobile computing that others have.

My old laptop handled suspend/resume in the BIOS. (Suspend/resume is the ability to turn your laptop off, come back to it later, and start where you left off - with all the things you had running before you turned it off.) Because it was done in the BIOS, it didn't matter what operating system I ran (Windows or Linux), it just worked. Unfortunately, my new laptop leaves it up to the operating system - and Linux didn't handle it very well!

However, with kernel 2.4.21, software suspend for Linux, and this bugfix for PCMCIA card services, I can now suspend, resume, AND use my wireless network card without any problems at all. I've also added the pre-emptive kernel patch (for more responsiveness), and the laptop patch (reduces writes to hard disk, saving power), and re-enabled framebuffer consoles (text console at 1400x1050 ... yum).

Now I just have to get bluetooth working again ...

Posted by savs at 9:58 PM | Comments (2)

August 11, 2003

Avoid Vodafone Charges

Want to avoid Vodafone's outrageous charges for free information? I'll show you how.

In my previous post, I mentioned Vodafone want to charge 25p per lookup on film times via the ents24 web site. You don't need to pay it.

In your WAP browser, simply add a new bookmark to http://wap.ents24.co.uk/. You can even still use the Vodafone Live! access point. You won't be charged (other than for your GPRS data connection to the internet), and the site is satisfyingly branded with O2 information, too!

Posted by savs at 11:13 AM

Wake up, Vodafone!

Vodafone just don't get it. That's not to say they are any better or worse than the other incumbent mobile "service" providers. But still, they clearly have no idea about how to run a successful telco. Two cases in point:

  • GPRS tariffs. I use my mobile phone on the road an awful lot, predominantly to download email to my laptop. I don't want to pay prohibitive amounts for the data usage on my phone, so when I heard that Vodafone had announced an "as much as you can use" tariff for around 50 quid, I was delighted.

    I tried to switch to that tariff last Friday. The support person on the end of the phone had no idea what tariff it was. The closest he could come up with was "Mobile Connect Complete" - but in order to use it, I must have a Vodafone PCMCIA card. Why on earth would I want to carry around another peripheral, when I already have a perfectly decent mobile phone / bluetooth laptop setup? I asked him to check this was really necessary, but needless to say, he didn't call me back.

    Oh, and if you really fancy a laugh - try wading through the mess that is the Vodafone site, to find out what tariffs are available for GPRS and normal telephone calls. The information is split across several pages, and there's no "all on one page" table of charges. Customer-friendly, eh?

  • When I was on O2, I used to use the excellent ents24 site to look up films all the time. It was the perfect killer app for the mobile. Imagine you're in the pub, fancy seeing a film, not sure what time it's on or the nearest cinema. Just type in the approximate postcode or town name, and up pops a list of cinemas, films, and times. Brilliant! Although every time I use the web on my phone it eats into my data allowance, I didn't mind paying the excessive per-kilobyte charges for mobile phone web browsing, as the service was so useful.

    On Vodafone, this has all gone horribly wrong. You can still access ents24, and it's even available as a menu entry in Vodafone Live! (the Vodafone web portal). But when you select a cinema and want to see film listings, you are asked to accept a charge of 25 pence per request.

    So: not ONLY do you have to pay Vodafone for each page you download, but they also want to add ANOTHER charge for you to access the information you want to use. Information that is FREELY available elsewhere.

I'm disgusted, I really am. Time to consider moving to 3 after all.

Posted by savs at 11:04 AM

August 5, 2003

Woody woodpecker

Guess what I saw on my way to work this morning? Three of them, in fact. I left the house earlier than usual as I had to send an email before swimming. As I was walking through the first field, I spotted what I thought at first were doves or small pigeons. As I got closer I saw the flashes of red and green. They didn't let me get near enough for the phone's camera to be effective. I'll take the digital camera with me tomorrow!

Posted by savs at 10:15 AM