ralphm's blog

Friday, 29 April 2005

New Mimír Aggregator


Two weeks ago, I upgraded Perl and all dependent packages on the machine that runs Mimír, which broke Mimír's news aggregator. This was the excuse I needed to finally rewrite the aggregator, something that I had been putting off for a while. A bit of background:

Mimír is fed by publish-subscribe notifications via Jabber. The idea is that every news source has a pubsub node to which its new items are published. All subscribers to that node are notified instantly. One of the subscribers is Mimír's news bot, which uses the notifications as input for each news channel in the system. It redistributes the new items among the channel's subscribers if they are online, or marks the item as unread for a subscriber who is not. Unread items can be read at a later time, via the subscriber's personal news page.
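The delivery rule described above can be sketched in a few lines of Python. This is only a toy model under assumptions: the class name NewsBot, its attributes and the example addresses are made up for illustration, and no Jabber traffic is involved (the real bot is written in Perl).

```python
from collections import defaultdict


class NewsBot:
    """Toy model: deliver to online subscribers, mark as unread for the rest."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # channel -> subscriber addresses
        self.online = set()                  # subscribers currently online
        self.unread = defaultdict(list)      # subscriber -> unread items
        self.delivered = []                  # (subscriber, item) pairs sent out

    def subscribe(self, channel, subscriber):
        self.subscribers[channel].add(subscriber)

    def on_notification(self, channel, item):
        """Called once for every pubsub notification on a channel's node."""
        for subscriber in sorted(self.subscribers[channel]):
            if subscriber in self.online:
                # Online: pass the item on right away.
                self.delivered.append((subscriber, item))
            else:
                # Offline: keep it for the personal news page.
                self.unread[subscriber].append(item)
```

With two subscribers on one channel, of whom only one is online, a single notification ends up in delivered for the first and in unread for the second.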

Unfortunately, not many news sources publish their news via Jabber publish-subscribe. In comes the Mimír aggregator, a component that polls legacy news sources for news. It keeps a list of news feeds and periodically fetches them in search of new items. Once it discovers a new item, it publishes the item to a pubsub node on behalf of the news source.
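The poll-and-publish cycle might look roughly like this. It is a minimal sketch, not the actual aggregator: the Aggregator class and the fetch and publish callables are hypothetical stand-ins for the HTTP fetching and the pubsub publish request, and items are assumed to carry a stable identifier.

```python
class Aggregator:
    """Toy polling loop: remember which items were seen, publish the new ones."""

    def __init__(self, fetch, publish):
        self.fetch = fetch      # hypothetical: url -> list of (item_id, item)
        self.publish = publish  # hypothetical: (node, item) -> None
        self.feeds = {}         # feed url -> pubsub node
        self.seen = {}          # feed url -> set of item ids already published

    def add_feed(self, url, node):
        self.feeds[url] = node
        self.seen[url] = set()

    def poll_once(self):
        """Fetch every feed once and publish items not seen before."""
        for url, node in self.feeds.items():
            for item_id, item in self.fetch(url):
                if item_id not in self.seen[url]:
                    self.seen[url].add(item_id)
                    self.publish(node, item)
```

Calling poll_once() periodically, from a timer for example, publishes each item exactly once, no matter how often the feed is fetched.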

Until now, Mimír's aggregator was based on Janchor. I had modified it to send pubsub publish requests instead of the normal chat messages. Janchor is written in Perl, and since the upgrade I couldn't easily get it going again. I didn't really want to, either, because of my long-standing wish to replace it.

I wanted to have a well-behaved aggregator that uses the Last-Modified and ETag headers in HTTP, handles broken feeds, accepts compressed encodings and strips out nasty HTML. The result is an aggregator written in Python, based on Twisted and the really great Universal Feed Parser by Mark Pilgrim.
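To make the "well-behaved" part concrete, here is a rough sketch of conditional fetching with the Last-Modified and ETag validators and gzip support. It uses only the standard library and leaves out the network side entirely. The function names, the cache dictionary and the assumption that response header names arrive lowercased are all made up for this example.

```python
import gzip


def conditional_headers(cache):
    """Build request headers from the validators stored for a feed."""
    headers = {"Accept-Encoding": "gzip"}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers


def handle_response(status, headers, body, cache):
    """Return the decoded feed body, or None when the feed is unchanged."""
    if status == 304:
        # Not Modified: nothing to parse, keep the stored validators.
        return None
    # Remember the validators for the next poll (header names assumed lowercase).
    cache["etag"] = headers.get("etag")
    cache["last_modified"] = headers.get("last-modified")
    if headers.get("content-encoding") == "gzip":
        body = gzip.decompress(body)
    return body
```

On the first poll the cache is empty and the request is unconditional. On later polls the server can answer 304 with an empty body, which saves bandwidth on both sides.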

Instead of using the built-in, synchronous feed fetching support in Feed Parser, I wrote a fetcher that uses Twisted Web to fetch the pages asynchronously and to handle the HTTP result codes. I ended up writing a custom class that mimics the interface of Feed Parser's built-in fetcher, because I needed to inject the received headers into the feed parsing code. For example, Feed Parser can rewrite relative links in the HTML found in the feed's items. Among other hints, it uses the received HTTP headers for this. Objects of this custom class are created from the retrieved feed and headers.
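A class like that could look something like the following. This is a guess at the general shape, not the actual code: the name FetchedFeed, the attributes headers, url and status (modelled on urllib-style response objects) and the base_uri helper are assumptions, and header names are assumed to be lowercased. The helper shows why the headers matter: a Content-Location header, when present, is a better base for relative links than the URL that was requested.

```python
import io
from urllib.parse import urljoin


class FetchedFeed(io.BytesIO):
    """File-like wrapper around an already retrieved feed and its headers."""

    def __init__(self, body, headers, url, status=200):
        super().__init__(body)
        self.headers = dict(headers)  # received HTTP headers, lowercase names
        self.url = url                # URL the feed was fetched from
        self.status = status          # HTTP result code

    def info(self):
        # urllib-style accessor for the headers.
        return self.headers


def base_uri(resource):
    """Pick the base for relative links: Content-Location, else the URL."""
    return resource.headers.get("content-location", resource.url)


feed = FetchedFeed(
    b"<feed/>",
    {"content-location": "http://example.org/blog/atom.xml"},
    "http://example.org/feed",
)
link = urljoin(base_uri(feed), "images/a.png")
```

Because the object behaves like an open file with a few extra attributes, code that expects the result of a synchronous fetch can consume it without knowing the bytes arrived asynchronously.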

In conclusion, the new aggregator works like a charm and I am really happy with the result. As a side effect of using Feed Parser, Atom feeds are now supported, too.

On a side note: although it needs a rewrite too, the Mimír news bot (written in Perl) got an update as well. I discovered an entity encoding bug in Jabber::NodeFactory today that caused embedded SGML/XML/HTML tags to be unescaped one time too many.
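The effect of unescaping once too often is easy to reproduce. The snippet below is a generic Python illustration with a made-up example string, not the Perl code from Jabber::NodeFactory.

```python
from xml.sax.saxutils import escape, unescape

# Item content in which the tag is meant to be shown literally.
text = "Use &lt;b&gt; for bold"

# On the wire the content is escaped once more to sit inside XML.
wire = escape(text)     # "Use &amp;lt;b&amp;gt; for bold"

# Unescaping once gives back the original content.
once = unescape(wire)   # "Use &lt;b&gt; for bold"

# Unescaping a second time turns the literal text into a real tag.
twice = unescape(once)  # "Use <b> for bold"
```

After the second pass the escaped tag has become real markup, which is exactly what should not happen to item content.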