ralphm's blog

Friday, 29 April 2005

New Mimír Aggregator

Twistifying...

Two weeks ago, I upgraded Perl, and all dependent packages, on the machine that runs Mimír, which broke Mimír's news aggregator. This was the excuse to finally rewrite the aggregator, something that I've been putting off for a while now. A bit of background:

Mimír is fed by publish-subscribe notifications via Jabber. The idea is that any news source has a pubsub node to which new news items are published. All subscribers of that node get an instant notification of this. One of the subscribers is Mimír's news bot, which uses the notifications as input for each news channel in the system, and either redistributes the new items among the channel subscribers (if they are online) or marks the item as unread for the subscriber. Unread items can be read at a later time, via the subscriber's personal news page.
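As a rough sketch of that redistribution step: the bot either delivers an item right away or files it as unread. This is Python with made-up names (`distribute`, `online`, `unread`); the real news bot is written in Perl and its internals are not shown here.

```python
def distribute(item, subscribers, online, unread):
    """Deliver item to online subscribers; queue it as unread for the rest.

    All names are hypothetical: `online` is a set of subscriber JIDs that
    are currently available, `unread` maps a JID to its unread items.
    """
    delivered = []
    for jid in subscribers:
        if jid in online:
            delivered.append(jid)  # a real bot would send a message here
        else:
            unread.setdefault(jid, []).append(item)
    return delivered


unread = {}
sent_to = distribute(
    {"title": "Hello"},
    subscribers=["romeo@example.org", "juliet@example.org"],
    online={"juliet@example.org"},
    unread=unread,
)
print(sent_to)
print(unread)
```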

Unfortunately, not many news sources publish their news via Jabber publish-subscribe. In comes the Mimír aggregator, a component that polls the news from legacy news sources. It keeps a list of news feeds, and periodically fetches these in search of new items. Once a new item has been discovered, it publishes the item on behalf of the news source to a pubsub node.
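The "search for new items" part can be sketched as follows, assuming every item carries some stable identifier. The `id` key and the `find_new_items` helper are illustrative assumptions, not the aggregator's actual code.

```python
def find_new_items(seen_ids, fetched_items):
    """Return the items not seen on earlier polls and remember their ids."""
    new_items = []
    for item in fetched_items:
        item_id = item["id"]  # assumed: each item has a stable identifier
        if item_id not in seen_ids:
            seen_ids.add(item_id)
            new_items.append(item)
    return new_items


seen = set()
first_poll = [{"id": "a", "title": "First"}, {"id": "b", "title": "Second"}]
second_poll = first_poll + [{"id": "c", "title": "Third"}]

# Only items that were not seen before would be published to the pubsub node.
print([item["id"] for item in find_new_items(seen, first_poll)])
print([item["id"] for item in find_new_items(seen, second_poll)])
```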

Up till now, Mimír's aggregator was based on Janchor. I modified it to send pubsub publish requests, instead of the normal chat messages. Janchor is written in Perl, and since the upgrade I couldn't easily get it going again. Didn't really want to, either, because of my long-time wish to replace it.

I wanted to have a well-behaved aggregator that uses the Last-Modified and ETag headers in HTTP, handles broken feeds, accepts compressed encodings and strips out nasty HTML. The result is an aggregator written in Python, based on Twisted and the really great Universal Feed Parser by Mark Pilgrim.
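To illustrate what "well-behaved" means for the HTTP part, here is a minimal sketch of a conditional GET, assuming the validators from the previous fetch are kept per feed. The function names and the `cache` dictionary are hypothetical, and the response headers are assumed to arrive as a plain dict with canonical header names.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def should_parse(status, response_headers, cache):
    """Remember new validators on 200; report whether there is a body to parse."""
    if status == 200:
        cache["etag"] = response_headers.get("ETag")
        cache["last_modified"] = response_headers.get("Last-Modified")
        return True
    # 304 means nothing changed; other codes would need their own handling.
    return False


cache = {}
print(conditional_headers())  # first fetch: no validators yet
should_parse(200, {"ETag": '"abc"', "Last-Modified": "Fri, 29 Apr 2005 10:00:00 GMT"}, cache)
print(conditional_headers(cache["etag"], cache["last_modified"]))
print(should_parse(304, {}, cache))
```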

Instead of the built-in, synchronous, feed fetching support in Feed Parser, I wrote a fetcher using Twisted Web for fetching the pages asynchronously and handling the HTTP result codes. I ended up writing a custom class that mimics the interface of Feed Parser's built-in fetcher because I needed to inject the received headers into the feed parsing code. For example, Feed Parser can rewrite relative links in the HTML found in the feed's items. Among other pointers, it uses the received HTTP headers for this. Objects of this custom class are created from the retrieved feed and headers.
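The shape of such a class could look roughly like this: a file-like object that carries the body together with the URL and the headers it was received with. The class name and the attribute names (`url`, `headers`, `status`) are assumptions modelled on urllib-style response objects; what Feed Parser actually looks for depends on its version, so treat this as a sketch only.

```python
import io


class FetchedFeed(io.BytesIO):
    """File-like wrapper around an already retrieved feed (hypothetical)."""

    def __init__(self, body, url, headers, status=200):
        super().__init__(body)
        self.url = url          # where the feed was fetched from
        self.headers = headers  # HTTP response headers as received
        self.status = status    # HTTP status code of the response


feed = FetchedFeed(
    b"<rss version='2.0'/>",
    url="http://example.org/feed.xml",
    headers={"Content-Location": "http://example.org/feeds/feed.xml"},
)
print(feed.url, feed.status)
print(feed.read())
```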

In conclusion, the new aggregator works like a charm and I am really happy with the result. As a side-effect of using Feed Parser, Atom feeds are now supported, too.

On a side note: although it needs a rewrite too, the Mimír news bot (written in Perl) got an update as well. I discovered an entity encoding bug in Jabber::NodeFactory today that caused embedded SGML/XML/HTML tags to be unescaped one time too many.

Thursday, 21 April 2005

Pubsub subscription tracking

Where do I get stuff from?

Recently, there has been more and more discussion on the standards-jig and jdev mailing lists about Publish-Subscribe, in part because of updates to the User Avatar JEP. One discussion is about keeping track of your pubsub subscriptions. For clarity, the context is the use of pubsub in regular Jabber IM deployments. Pubsub is also useful in other contexts, but the IM context carries some inherent assumptions. I make three observations here, and expand on this further below.

No central pubsub service: What we will most probably see is that entities will have subscriptions on a multitude of pubsub services, scattered throughout the Jabber universe.
No subscription notification: When you send an initial subscription request to a pubsub node, you get a reply stating whether you were indeed subscribed, or that the subscription is pending and waiting for approval of the node owner. After that, you hear nothing more: when a pending subscription is approved, or when your subscription is cancelled, you have no way of knowing what happened.
Subscriptions are bound to a JID: This seems obvious, but what I want to point out here is that if you use a bare JID (without a resource part) to subscribe, notifications will go to the resource (client) with the highest priority. If you subscribe using a full JID (with a resource part), notifications will only go to the specified resource.
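The third observation can be made concrete with a small sketch of the delivery rule as described above. The `route_notification` function and the map of online resources are hypothetical; real servers have more rules (negative priorities, ties, offline storage) that are ignored here.

```python
def route_notification(subscriber_jid, online_resources):
    """Return the full JIDs that would receive a notification.

    `online_resources` maps full JIDs to their presence priority.
    """
    if "/" in subscriber_jid:
        # Full JID: only that exact resource, and only if it is online.
        return [subscriber_jid] if subscriber_jid in online_resources else []
    # Bare JID: the connected resource with the highest priority.
    candidates = {
        jid: priority
        for jid, priority in online_resources.items()
        if jid.split("/", 1)[0] == subscriber_jid
    }
    if not candidates:
        return []
    return [max(candidates, key=candidates.get)]


resources = {"juliet@example.org/home": 5, "juliet@example.org/work": 10}
print(route_notification("juliet@example.org", resources))
print(route_notification("juliet@example.org/home", resources))
```

With two concurrent clients, the bare JID subscription only ever reaches one of them.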

A pubsub service usually allows you to query your existing subscriptions, using the affiliations element. This is similar to fetching your roster using jabber:iq:roster. However, my first observation above causes this to be troublesome. Since my subscriptions can be on any pubsub service, I'd have to know which services I have subscriptions at.
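To show why this is troublesome, here is a sketch that builds one affiliations query per service. The stanza layout follows my reading of the pubsub specification at the time of writing, and the service addresses, request ids and the `affiliations_query` helper are made up. The point is that the list of services has to come from somewhere.

```python
import xml.etree.ElementTree as ET

PUBSUB_NS = "http://jabber.org/protocol/pubsub"


def affiliations_query(service_jid, request_id):
    """Serialize an iq-get asking one pubsub service for my affiliations."""
    iq = ET.Element("iq", {"type": "get", "to": service_jid, "id": request_id})
    # Setting xmlns as a plain attribute keeps the <iq/> itself unqualified.
    pubsub = ET.SubElement(iq, "pubsub", {"xmlns": PUBSUB_NS})
    ET.SubElement(pubsub, "affiliations")
    return ET.tostring(iq, encoding="unicode")


# Assumed to be known in advance; nothing in the protocol provides this list.
known_services = ["pubsub.example.org", "pubsub.example.net"]
queries = [
    affiliations_query(jid, "aff%d" % number)
    for number, jid in enumerate(known_services)
]
for query in queries:
    print(query)
```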

Unlike the normal roster, pubsub does not have a way to relay changes in your subscriptions, as explained above. This means that if my subscription changed since I last queried for my subscriptions, I have no way of knowing. Sure, if I suddenly get notifications from a node to which my subscription was previously pending, I could refetch the list of subscriptions. But that amounts to polling, which is cumbersome. It would be much nicer to be notified of such changes.

To counter these two problems, one could register with the pubsub services, like with transports, and have the services in your roster. The client could then simply look in the roster for the services to query using the affiliations element. Like with rosters, as soon as the affiliations were first fetched after getting presence from the entity, the service could send out notifications to the entity that represent changes in affiliations (not only subscriptions) with that service. This could be done by sending notifications from the root node, having an affiliations element in the item body, or by allowing such an element to be sent as a direct child of the event element, similar to the notification of deleted nodes. After sending unavailable presence, the notifications would stop.

My third observation above says that it makes no sense to subscribe using your bare JID when running concurrent clients, because notifications will only go to one of the connected resources. For things like avatars, that is most likely not desirable. Adding subscriptions for each resource (depending on what the client supports) is one alternative. One other solution is to invent a new (boolean) node configuration option, that states that, if set, the pubsub service needs to check which of the resources support the namespace of the payload of the node's items, and send out notifications to all of those. This requires the service to subscribe to the presence of the resources in question. Registering with the pubsub service, again, solves this. The service could then use Entity capabilities and Service Discovery for checking the namespace support.
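A sketch of what the service would do with such a configuration option: pick all connected resources of the subscriber that advertise support for the payload namespace. The feature sets would come from Entity Capabilities or Service Discovery. The namespace string, the function name and the data layout are placeholders for illustration.

```python
def notification_targets(bare_jid, payload_namespace, resource_features):
    """Return all online resources of bare_jid that support the payload.

    `resource_features` maps full JIDs to the set of namespaces each
    resource advertises (hypothetically learned via disco or caps).
    """
    prefix = bare_jid + "/"
    return sorted(
        full_jid
        for full_jid, features in resource_features.items()
        if full_jid.startswith(prefix) and payload_namespace in features
    )


AVATAR_NS = "urn:example:avatar"  # placeholder, not a real protocol namespace
features = {
    "juliet@example.org/home": {AVATAR_NS, "jabber:x:data"},
    "juliet@example.org/work": {AVATAR_NS},
    "juliet@example.org/phone": set(),
    "romeo@example.org/orchard": {AVATAR_NS},
}
print(notification_targets("juliet@example.org", AVATAR_NS, features))
```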

Wednesday, 6 April 2005

Every JID a Publish Subscribe Service

More pushing...

I was a little confused while reading stpeter's thoughts on integrating pubsub into Jabber servers. What he wrote is that every Jabber user's JID would represent a pubsub node. But that's not what he meant. He meant each user having its own, albeit virtual, pubsub service as part of his Jabber server, holding different nodes for moods, avatars, etc. Now that's a nice idea.

Waving Particles

Reading the discussion by stpeter and hildjj, I can't help wondering if we aren't re-inventing XML all over again. So, I wanted to throw the following into the discussion:

<x xmlns="jabber:x:data" xmlns:xdata="jabber:x:data" type="form">
  <s:light xmlns:s="http://jabber.org/protocol/shakespeare"
           xdata:type="list-multi">
    <option label="Juliet">Sun</option>
    <option label="Maid">Moon</option>
    <option label="Eyes">Stars</option>
  </s:light>
  <author xmlns="http://www.loc.gov" xdata:type="text-single">
    Shakespeare
  </author>
</x>

As can be seen, I annotate the child elements in the form with the xdata:type attribute. This is an extension to Data Forms. The following example is equivalent to the previous one, but uses a different annotation with respect to namespaces:

<xdata:x xmlns:xdata="jabber:x:data" type="form">
  <light xmlns="http://jabber.org/protocol/shakespeare"
         xdata:type="list-multi">
    <xdata:option label="Juliet">Sun</xdata:option>
    <xdata:option label="Maid">Moon</xdata:option>
    <xdata:option label="Eyes">Stars</xdata:option>
  </light>
  <author xmlns="http://www.loc.gov" xdata:type="text-single">Shakespeare</author>
</xdata:x>

Note:

XML attributes, without a prefix, don't belong to any namespace, but depend on the containing element for providing their context. So, although the default namespace in the first version is jabber:x:data, this isn't inherited for attributes.
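Both points can be checked with any namespace-aware parser. The sketch below uses Python's xml.etree.ElementTree on the two examples above; the `shape` helper is my own and only compares element names, attributes and nesting, ignoring text and whitespace.

```python
import xml.etree.ElementTree as ET

FIRST = """<x xmlns="jabber:x:data" xmlns:xdata="jabber:x:data" type="form">
  <s:light xmlns:s="http://jabber.org/protocol/shakespeare"
           xdata:type="list-multi">
    <option label="Juliet">Sun</option>
    <option label="Maid">Moon</option>
    <option label="Eyes">Stars</option>
  </s:light>
  <author xmlns="http://www.loc.gov" xdata:type="text-single">
    Shakespeare
  </author>
</x>"""

SECOND = """<xdata:x xmlns:xdata="jabber:x:data" type="form">
  <light xmlns="http://jabber.org/protocol/shakespeare"
         xdata:type="list-multi">
    <xdata:option label="Juliet">Sun</xdata:option>
    <xdata:option label="Maid">Moon</xdata:option>
    <xdata:option label="Eyes">Stars</xdata:option>
  </light>
  <author xmlns="http://www.loc.gov" xdata:type="text-single">Shakespeare</author>
</xdata:x>"""


def shape(elem):
    """Qualified name, attributes and child shapes, ignoring text content."""
    return (elem.tag, sorted(elem.attrib.items()), [shape(c) for c in elem])


root = ET.fromstring(FIRST)
light = root[0]
print(root.tag)      # {jabber:x:data}x
print(root.attrib)   # {'type': 'form'}: unprefixed, so in no namespace
print(light.attrib)  # {'{jabber:x:data}type': 'list-multi'}
print(shape(root) == shape(ET.fromstring(SECOND)))  # True
```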

Monday, 4 April 2005

Mimír User Manual

Getting the most out of it...

I regularly get requests from people using Mimír, my Jabber Powered news reader service with web interface, for features that make no sense to me. That is, until I discover that they are not aware of all features that are already there.

To remedy this, I have created the Mimír User Manual, a complete guide to all features, with a detailed description of the preference settings, and several screenshots. This should help people get the most out of the service.

The manual is publicly available, so it can also be used to get an idea of what Mimír has to offer! For those who don't use it yet, if you get the idea that I copied Bloglines, Mimír predates it by almost a year. My first commit to the CVS repository was on 4 November 2002!

Sunday, 3 April 2005

Uncle again: Jeroen Schut

So very tiny...

Today, my sister-in-law Rian gave birth to Jeroen Schut, the brother of Maarten and son of Ben. My wife, Irma, is his godmother. Jeroen is a healthy boy of about 3 kg. I'm so very proud!