The argument AGAINST virtualization

Matt Heaton: ‘While virtualization techniques have improved dramatically in the last 10 years (think 3D support, para-virtualization for direct access to the hardware layer, etc.) there is a fundamental problem with the whole concept of virtualization that no one ever talks about.’

I don’t know why this isn’t getting more attention. You can cram a lot of shared hosts onto a single box; you can get nowhere near the same number of guest operating systems using virtualization. As a data point, I pay $6.95 per month for shared hosting and run 4 domains and countless applications. Keeping this blog running continuously on Amazon’s Elastic Compute Cloud would cost about $70 per month and a lot more labor.

As an aside, it is also worth noting that certain language runtimes are far better suited to shared hosting than others. Runtimes that support fork, shared libraries, and process isolation, and that start fast: good. Those that start slowly, rely on threads, and don’t share memory: not so good. Hmmmm.

Unladen Swallow, a Hummer Hybrid?

One of the projects that I keep an eye on is LLVM. In a nutshell it is Apple’s compiler strategy, but it is also “an aggressive optimizing compiler system that can be used to build traditional static compilers, link-time optimizers, run-time optimizers and JIT systems”.
The people behind it want to eventually replace gcc, and fwiw I think they will.
It seems that every language on the planet currently has someone experimenting with porting it to LLVM. This includes Java and .NET, Ruby, Python, and a very interesting use of LLVM to AOT-compile Flash so it can run on the iPhone.
The LLVM folks recently had their 2009 conference and I just got through watching the presentation on Unladen Swallow. Unladen Swallow is an attempt to make Python go five times faster by porting it to LLVM. They also hope to make it the default Python compiler/JIT. They have made pretty good improvements so far; however, they are running into all the problems inherent in making any dynamic language go faster, namely type inference. I’ve blogged about this before for Java, and the Unladen Swallow presentation does a good job of explaining all the pitfalls around making a dynamic language go fast (hint: try to avoid all the indirection). So far Unladen Swallow seems to be working out how to safely infer types for a host of things, and it will be interesting to see how far they can get with this.
The sad state of affairs, however, is that none of the popular dynamic scripting languages really go very fast; they all consume a lot of CPU. I can’t help thinking that we are all effectively driving around in Hummers by using these languages. What Unladen Swallow seems to be trying to do is make a Hummer Hybrid: sure, it consumes a little less gas than a regular Hummer, but it’s still a Hummer and is fundamentally not designed to ‘sip’ gas. We need the programming language equivalent of the Prius or a Jetta TDI, i.e. something that a lot of us would be willing to develop web apps in (i.e. not C++) as well as something that is CPU friendly.
The real question, then, is whether there is a language or a set of languages that can ‘sip’ CPU while still feeling dynamic and programmer friendly. The closest I have found so far is Boo. It’s a Python-inspired .NET language that can run on Mono. As the manifesto states, the language has been explicitly designed for type inference throughout (as long as you don’t use duck typing). This should mean that most of the time it flies. At the moment this performance difference may not be a big deal, but as we move more and more services to the cloud, and as we consolidate to a few super cloud hosting organizations, it is going to be a significant advantage to be able to operate a cloud with 10 or maybe 100 times fewer servers than your competitors who are all hosting ‘Hummer’-like languages. Given the financial pressures and incentives it seems inevitable that we will all eventually be ditching our Hummers and discovering languages that are designed to ‘sip’ CPU. These ‘hybrid’ languages will probably honor the dynamic driving preferences that most web programmers have gotten used to over the last few years while making heavy use of type inference to provide static-language-like efficiency. Anyone want to port Django to Boo? :)

Why are we still sending each other HTML?

Google Wave is a great leap forward in user experience. It attempts to answer the question ‘what would email look like if it were invented today?’. It, however, started from a clean slate, which I am not sure we needed to do. I think we need to look at the kinds of collaboration now taking place on the internet and at streamlining their user experience. Given the spread of social sites and the volume of data exchanged within them, it makes sense to look at those for some inspiration on how to reinvent / tweak email.

A lot of the social sites allow for collaborations and provide a notification when a collaboration within the site has taken place, e.g. a status update, a comment on a status, or a shared photo. Most of them send you an email with a link, something like this.

You click on the link, then you login and finally you get to respond, at which point more emails are sent out to other participants, who all open emails, click links and login and around and around we go. The important point here though is that the social site maintains a single copy of the collaboration and makes it available via a url. Contrast this to collaborating in email.

Email today shunts around copies of HTML documents for collaborations. It does this because of historical constraints that made it the only practical design. The big two constraints were intermittent connectivity to the internet and the lack of single sign-on for the web. The second point is not immediately obvious as something email addresses; it does so by sending, and then storing, a copy of the collaboration in a store that the recipient is authorized to view. The advantage becomes glaringly obvious when you try to share photos privately with your mother for the first time via Facebook: the set of steps required for her to sign up, choose a password, etc. quickly becomes too tedious and cumbersome for that to be a universal approach.

Both of these design constraints are, however, disappearing. Netbooks from cell phone companies and recent announcements from airlines make ‘always connected’ a reality, and the momentum behind OpenID and OAuth, which both have a role to play in web single sign-on, seems unstoppable. OpenID has also recently been paired nicely with OAuth and has seen some nice usability improvements.

In addition to these design constraints disappearing, another shift makes a rethink timely.

Email as a Service – Most of us now get email from a web site, e.g. Gmail, Hotmail, Yahoo. Because this is a service in the cloud, any data I have in the service is available for integration; for example, it can provide a standard API for getting at contact data, which is a big deal. It can also provide many other services for third parties to integrate with (more on this later).

My previous post claims that we should start to think about sending something other than HTML to each other over SMTP. This post walks through what sharing photos with my mother could look like if Facebook sent an Atom entry instead of HTML with a link, and if Gmail understood how to interpret the Atom entry. The main goal is to provide an approach that my mother can actually use.

Here’s a possible step-by-step:

1. I share a private Facebook photo album with my mother by just entering her email address.

2. Facebook sends an email to my mother that also contains an Atom entry as below (note that this can be in addition to the normal text and HTML representations, all bundled up in a multipart message).

 <entry xmlns="http://www.w3.org/2005/Atom">
   <title>Rob has shared the album Sugarloaf 2009 with you</title>
   <link href="http://facebook.com/photo.php?id=8237298723&amp;mode=preview&amp;email=mom@gmail.com&amp;oid_provider={oid_provider}"/>
   <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
   <updated>2009-06-13T18:30:02Z</updated>
   <summary>Rob wants to share the album Sugarloaf 2009 with you.</summary>
 </entry>

Note also that the link utilizes URI templates. This is needed so that the email service can pass the user’s OpenID provider endpoint along to Facebook.
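As a rough illustration (not part of any spec; the helper name and the Google endpoint value are just examples), the expansion the email service would perform might look like this in Python:

from urllib.parse import quote

def expand_uri_template(template, **values):
    """Tiny level-1 URI template expansion: replace {name} with a percent-encoded value."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", quote(value, safe=""))
    return template

href = ("http://facebook.com/photo.php?id=8237298723&mode=preview"
        "&email=mom@gmail.com&oid_provider={oid_provider}")

# The email service fills in the recipient's OpenID provider endpoint before opening the link.
print(expand_uri_template(href, oid_provider="https://www.google.com/accounts/o8/ud"))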

3. My mother (who uses Gmail for her mail) logs in and sees that she has an email from Facebook stating that I have shared an album with her. She opens the ’email’. Gmail spots that the email contains an Atom entry, extracts the link href, substitutes in the URL for Google’s OpenID endpoint, and opens the web page within the mail client (probably in an iframe). My mom just thinks she opened an email.
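To make step 3 a little more concrete, here is a purely illustrative sketch, using Python’s standard library, of how a mail client could pull the Atom entry out of the multipart message and grab its link:

import email
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def extract_share_link(raw_message):
    """Return the href of the first link in the message's Atom part, or None if there isn't one."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() == "application/atom+xml":
            entry = ET.fromstring(part.get_payload(decode=True))
            link = entry.find(ATOM_NS + "link")
            if link is not None:
                # The client would then expand {oid_provider} as sketched above
                # and open the resulting URL in an iframe.
                return link.get("href")
    return None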

4. Facebook now needs to work out who the user at the browser is. It already knows that the user at the browser is associated with the email address mom@gmail.com, as it embedded this in the Atom entry’s link URL sent to her in the email (this is similar to the way most web sites validate ownership of an email address, i.e. those ‘We just sent a confirmation email to you, please click the link…’ kind of things). Facebook now just needs to get my mom’s OpenID and associate it with her email, and my mom can then automagically become a Facebook user with an account that has access to the shared photos. So:

5. Facebook redirects to mom’s OpenID provider (which is Google), which was provided to Facebook in the URL. It does this from inside the iframe, so as far as my mom is concerned this is all taking place inside her email service. The redirect probably uses HTML fragments so that the context is not lost; it would be implemented very much like the OpenID UI extension, except that both the RP and the OP use fragments. This redirect causes Gmail to pop up a box that asks my mom’s permission to share her name and email with Facebook.

6. My mom grants permission, causing a redirect back to Facebook (in the iframe) that passes back her OpenID, name and email. Facebook now sets up a new account for my mom, associating her Google OpenID with her email mom@gmail.com. Again, as far as my mom is concerned, all of this is taking place within the email service. She has never left her inbox.

7. Mom now sees the message open with my photos rendered live from Facebook in the message pane. If there have been any updates to the photos or comments she sees all of them. Mom can also post comments on the photos in the iframe, rendered right inside Gmail, before moving on to her next message.

If I share subsequent photos with her via Facebook then the same dance occurs when my mom opens the message, but this time Google no longer needs to ask for my mom’s permission to send along her OpenID, so as far as she is concerned the photos, delivered straight from Facebook, open up right inside her email client.

There are probably many services that would benefit from this kind of email integration. Facebook, Flickr, Evite and Google Groups spring to mind, as does my current defect tracking system, which I am sick of logging into :)

There are lots more variations on this theme. The Atom entry can contain multiple links: one could be for preview, one could be for full screen, and the links could even represent REST services that allow programmatic access to the data from the email service. In addition, the updated timestamp along with the id field in the Atom entry can be used by the mail server to ensure that you only see the latest update and don’t have to wade through 12 emails of updates for the same thing (a sketch of this follows below). As a final thought, a gadget XML could be sent in the Atom entry’s content section. There is a whole realm of possibilities, but it all begs the question: why are we still sending each other HTML?
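A quick sketch of that de-duplication idea, assuming the mail service keeps the parsed entries as simple dicts (my assumption, not anything Atom mandates):

def latest_updates(entries):
    """Collapse notifications to the newest entry per Atom id.

    RFC 3339 'updated' timestamps in UTC compare correctly as strings,
    so plain string comparison is enough for this sketch."""
    latest = {}
    for entry in entries:
        seen = latest.get(entry["id"])
        if seen is None or entry["updated"] > seen["updated"]:
            latest[entry["id"]] = entry
    return list(latest.values())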

Google Wave – I don’t get it and how to build it with existing standards

Don’t get me wrong, I think that it is a great user experience and a great leap in invention (putting aside concerns I have about attention management), but I don’t understand why it was built the way it was built. Sure, it allowed them to do that neat search trick, but frankly, who really needs that?

At some basic level Wave is about getting notifications that a shared collaborative space has changed. The shared space is held somewhere in the cloud and has a set of capabilities (live real-time editing et al.), and those are cool, but quite frankly, who cares? The shared collaborative space should be just a web page, any old web page, and so it can do whatever a web page can do. It can do real-time group chat or code editing or real-time document editing, but why should it be constrained to being some XMPP Wave thingy? You get a message in your inbox, you click on the message and go to a shared space in the cloud, and in that space stuff happens. Maybe the space uses the Wave protocol but maybe it doesn’t, and it shouldn’t matter. It’s just a URL.

So what would be needed if, instead of building wave the way it was built, we simply built an inbox that alerted the user when web resources of interest to them had changed and that launched web pages (in the context of the inbox) when messages were opened?  I think it is a short list of standards and great progress is being made on all of them.

Single Sign On – A lot of the notifications in the inbox would probably be about data shared privately with a group of individuals, so authentication will be needed before viewing the data. It would be a terrible user experience if every message that you received required a different login to a different site to view the pages. However, more and more sites are now standardizing on OpenID. If OpenID is widely adopted, and it is quickly becoming the only game in town, then this barrier comes down and it becomes possible to launch URLs that represent private collaborative spaces without the need to continually log in. As an aside, the inbox may need to pass along the user’s OpenID identity provider to the site, but that is very doable.

Portable Contacts – The other thing that is needed is for the user’s contacts to follow them around the web. The user should have their collaborators available to them on any collaborative space (i.e. URL) they visit so that they can easily bring them into the collaboration. The OpenSocial REST API already provides ‘portable contacts’, so collaborative spaces can provide type-ahead etc. against the user’s contact list when sharing with other collaborators.

Notification – The final piece of the puzzle is how changes to the collaborative space make their way to a user’s inbox. Given the inbox metaphor, SMTP seems like the natural choice. However, the traditional approach to sending an email update when a web-based collaborative space changes involves the user opening the message, clicking on the URL link in the message and then logging in; not pretty. Google Wave excelled here: opening the message opens the collaborative space immediately, no login, nothing. What if the SMTP message shipped the data in a format that describes the update and contains a link to the collaborative space? An inbox client that understood this data format would display the metadata in the message list and launch the site immediately when the user opened or even previewed the message. So we are looking for an existing standard that provides metadata about a URL; Atom seems like the most natural choice. Notifications can therefore be sent to the user by sending Atom entries over SMTP. As an aside, the multipart message sent to the inbox could also contain text and HTML representations that contain a URL link, for compatibility with inboxes that cannot process Atom entries.
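For illustration only, here is roughly what assembling such a message could look like with Python’s standard email library (the addresses, subject and choice of multipart/alternative are all just assumptions for the sketch):

from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_notification(plain_text, html_body, atom_entry_xml):
    """Bundle plain text, HTML and an Atom entry into one multipart notification."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = "Rob has shared the album Sugarloaf 2009 with you"
    msg["From"] = "notifications@example.com"
    msg["To"] = "mom@gmail.com"

    # Least-rich representation first, per multipart/alternative convention;
    # inboxes that don't understand Atom fall back to the text or HTML parts.
    msg.attach(MIMEText(plain_text, "plain"))
    msg.attach(MIMEText(html_body, "html"))
    msg.attach(MIMEApplication(atom_entry_xml.encode("utf-8"), "atom+xml"))
    return msg

The resulting message can then be handed to any SMTP server, e.g. with smtplib’s send_message.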

Does this work, would this provide the main features of Wave or have I missed something fundamentally different about Wave? I really like the user experience of Wave but I don’t understand why it needs XMPP et al.  What am I not seeing?

How to get a little more concurrency / performance out of a database

The database is almost always the bottleneck in any large-scale web application. As such, it is becoming increasingly important to maximize concurrency and to eke out every last bit of performance from the database.

There is a set of techniques for improving concurrency in SQL code. The one I have personally seen most widely applicable is best illustrated with an inventory example. Suppose that you are buying 10 copies of Jim’s book from Amazon. The logic starts out with a predicate, i.e. does the store have 10 copies of Jim’s book in stock? If it does, then reduce the quantity by 10. Pseudo-code would be as follows.


Select quantity_on_hand from Inventory where ISBN = 'IUWHSUY' FOR UPDATE;
If quantity_on_hand >= 10
    Update Inventory set quantity_on_hand = quantity_on_hand - 10 where ISBN = 'IUWHSUY';

While this is reasonably efficient, it can be improved upon by combining the predicate, i.e. ‘quantity_on_hand >= 10’, with the transform, i.e. ‘set quantity_on_hand = quantity_on_hand - 10’, into a single SQL statement.

The resulting single sql statement is as follows


Update Inventory set quantity_on_hand = quantity_on_hand - 10 where ISBN = 'IUWHSUY' and quantity_on_hand >= 10;

The application then checks the return value of the update call to see how many rows were affected. If it was 1 then all is well (there was enough stock and the inventory level has been reduced); if it is zero then there aren’t enough items in stock.
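In application code the check might look something like this (a sketch using the Python DB-API with a driver that uses the %s parameter style; table and column names match the example above):

def reserve_stock(conn, isbn, quantity):
    """Atomically decrement inventory; return True only if enough stock was on hand."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE Inventory "
        "SET quantity_on_hand = quantity_on_hand - %s "
        "WHERE ISBN = %s AND quantity_on_hand >= %s",
        (quantity, isbn, quantity),
    )
    conn.commit()
    return cur.rowcount == 1   # 1: stock reserved; 0: not enough items in stock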

This pattern comes up over and over again, and the single SQL statement is ALWAYS preferable to sending off two separate SQL requests, i.e. one that checks whether some condition is satisfied and then another to perform the update.

Update: Removed the reference to ‘field calls’, which, it was pointed out to me, actually refers to pushing this support further into the DB manager. The above pattern still holds, it just isn’t called a field call.

The Mini-Batch

In my current application we frequently need to migrate data from one release to the next. Most of the time these migrations are straightforward: an alter table here, a new table there, etc. However, every now and again we need to run a migration script that takes the existing data and updates it, e.g.

Update users set status='active' where is_active = 1

We have some very large tables and so this could update a lot of rows. The main problem with running the above SQL on a table containing 100,000 rows is that it is an all-or-nothing approach, i.e. the SQL either updates every row or no rows. This creates a very long-running transaction, and should it fail updating the 99,999th row the rollback will take even longer. Doing this in production can take your site out of action for a long time. Transactions of this kind frequently do fail when they run out of log space, as they swamp the transaction log with updates.

The solution is to split the single large transaction into many small ones (something Jim Gray refers to as a mini-batch). To achieve this we need to keep track of the last row that we have processed for the current batch. We record this in a database table e.g.

Create table batchcontext (last_id_done INTEGER);

Initially we set last_id_done to 0. The update logic then needs to be as follows; note that ‘:step_size’ should be something like 100.

:last_id_done = 0
While (:last_id_done < :max_user_id)
    Begin Work;
        Select last_id_done from batchcontext into :last_id_done;
        Update users set status='active' where is_active = 1 and id between :last_id_done + 1 and :last_id_done + :step_size;
        Update batchcontext set last_id_done = :last_id_done + :step_size;
    Commit Work;
End While;

The important point is that the single unit of work contains both the update to the users table and the update to the batchcontext table. That way, if the application fails at any point it can be restarted and will pick up where it left off.
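Here is a sketch of the same loop driven from application code (Python DB-API style, %s parameter placeholders assumed; error handling omitted):

STEP_SIZE = 100

def migrate_in_mini_batches(conn, max_user_id):
    """Run the migration as many small transactions that can be safely restarted."""
    while True:
        cur = conn.cursor()
        cur.execute("SELECT last_id_done FROM batchcontext")
        last_id_done = cur.fetchone()[0]
        if last_id_done >= max_user_id:
            break
        cur.execute(
            "UPDATE users SET status = 'active' "
            "WHERE is_active = 1 AND id BETWEEN %s AND %s",
            (last_id_done + 1, last_id_done + STEP_SIZE),
        )
        cur.execute(
            "UPDATE batchcontext SET last_id_done = %s",
            (last_id_done + STEP_SIZE,),
        )
        # Both updates commit together, so a crash never loses or repeats a batch.
        conn.commit()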

Http Caching and Memcached – made for each other

First, the problem. You have a feed or a web page that changes infrequently, and you know when it becomes invalid. The classic examples are a blog feed or a FriendFeed feed. These feeds are great cache candidates: cache them and then invalidate the cache when a new post is added. The important factors here are to minimize database usage, to cache as close to the client as possible, and to need very little logic to determine whether the cache is stale.

HTTP caching was designed for this. It allows the page to be cached in any number of proxy servers anywhere in the world. All the app server is left to do is indicate whether or not the cached page is stale. This takes a significant load off the app server and the database, as the page doesn’t need to be rebuilt.

To use HTTP caching, the app server sends down a Last-Modified header with the original retrieval of the page. Subsequent requests (that come via an HTTP cache) send an If-Modified-Since header that contains the value of the Last-Modified header from the first page retrieval. If nothing has changed then the server can issue a 304 return code and the page is served from the cache. If something has changed then the full page is returned with a new Last-Modified header and a 200. This is explained in more detail elsewhere on the web.

It’s possible to implement this approach very efficiently using memcached. In the case of a cache hit, only memcached is used by the application and no load is placed on the database. To achieve this, memcached simply stores the page’s last-modified time keyed by the page. The page key often corresponds to something very natural in the application, e.g. the blog’s unique id or the friend’s unique id.

The logic is as follows.

If the page request is a conditional get i.e. there is an if-modified-since header
     If memcached contains a timestamp for this page key
          If the timestamp matches the one in if-modified-since
               return 304
          Else
               Build the page and return 200, use the timestamp from memcached for the last-modified header
          End-If
     Else
          Calculate the current time and put it into memcached using the pages key
          Build the page and return 200, use the timestamp just calculated for the last-modified header
     End-If         
Else
     If Memcached contains a timestamp for this page key
          Build the page and return 200, use the timestamp for the last-modified header
     Else
          Calculate the current time and put it into memcached using the pages key
          Build the page and return 200, use the timestamp just calculated for the last-modified header
     End-If
End-If

In addition, when the application decides that the page’s cache is invalid, e.g. a new blog post was added to a blog, it simply deletes the corresponding key from memcached.

The nice thing about this pattern is that it doesn’t mandate keeping a bunch of timestamps in the database up to date when things change, and it can serve up a lot of pages without needing to reference the database at all.
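To make this concrete, here is a minimal sketch using Flask and the python-memcached client; the route, key names and blog_id parameter are just placeholders for whatever is natural in your application:

import time
from email.utils import formatdate

import memcache                      # python-memcached
from flask import Flask, request, make_response

app = Flask(__name__)
mc = memcache.Client(["127.0.0.1:11211"])

def build_page(blog_id):
    return "<html>full page for blog %s</html>" % blog_id   # placeholder for the real page build

@app.route("/blogs/<blog_id>")
def blog_page(blog_id):
    key = "last-modified:%s" % blog_id
    last_modified = mc.get(key)
    if last_modified is None:
        last_modified = formatdate(time.time(), usegmt=True)
        mc.set(key, last_modified)
    elif request.headers.get("If-Modified-Since") == last_modified:
        return "", 304                # cache is still fresh; no database work at all

    resp = make_response(build_page(blog_id))
    resp.headers["Last-Modified"] = last_modified
    return resp

def invalidate(blog_id):
    """Call when a new post is added so the next request generates a fresh timestamp."""
    mc.delete("last-modified:%s" % blog_id)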

Friendfeed, Twitter, Alert Thingy and Delicious

As part of my current project, one of the things I am delving into a little deeper is social networking. To that end I now have a Twitter account, a Delicious account and a FriendFeed account that aggregates the other two plus my Google Reader shares and this blog. I am keeping track of all of this with AlertThingy.

Will see how much use I make of them.

Update: Changed some accounts; apparently I should be using my real name as much as possible.

Certified Http

REST APIs are being developed for more and more business functions. For certain business functions, once-and-only-once delivery of a message over a REST call is required. Enter Certified HTTP, an effort led by Second Life with participation from IBM. It is simple to understand, has a reference implementation in Python, and they have this to say about their main competition.

HTTPR and WS-Reliable:

These tend to be thoroughly engineered protocol specifications which regrettably repeat the mistakes of the nearly defunct XMLRPC and the soon to join it SOAP — namely, treating services as a function call. This is a reasonable approach, and is probably the most obvious to the engineers working on the problem. The most obvious path, which is followed in both of the examples, is to package a traditional message queue body into an HTTP body sent via POST. Treating web services as function calls severely limits the expressive nature of HTTP and should be avoided.

Installing PHP Shindig

Shindig is the open source implementation of both the OpenSocial spec and the Gadgets spec.

Using cPanel, I first created a subdomain of robubu.com called shindig.robubu.com. This seems to be necessary as a lot of the code assumes it is running in the root web directory. It also provides a security layer, as the widget gets run in the context of shindig.robubu.com and therefore can’t access cookies, the DOM, etc. delivered from robubu.com.

Then on my local machine I exported the svn head of shindig and uploaded it to robubu.

mkdir ~/src/shindig
cd ~/src/shindig
svn export http://svn.apache.org/repos/asf/incubator/shindig/trunk/
cd trunk
scp -r . admin@robubu.com:public_html/shindig

and that was it. It’s important to note that the subdomain’s root web directory is mapped to public_html/shindig/php.

Then to use it with an embedded gadget, I included the following code in the html head.


<link rel="stylesheet" href="http://shindig.robubu.com/gadgets/files/container/gadgets.css">
<script type="text/javascript" src="http://shindig.robubu.com/gadgets/js/rpc.js?c=1&amp;debug=1"></script>
<script type="text/javascript" src="http://shindig.robubu.com/gadgets/files/container/cookies.js"></script>
<script type="text/javascript" src="http://shindig.robubu.com/gadgets/files/container/util.js"></script>
<script type="text/javascript" src="http://shindig.robubu.com/gadgets/files/container/gadgets.js"></script>
<script type="text/javascript" src="http://shindig.robubu.com/gadgets/files/container/cookiebaseduserprefstore.js"></script>
<script type="text/javascript">
var specUrl0 = 'http://www.labpixies.com/campaigns/todo/todo.xml';


function renderGadgets() {
  var gadget0 = gadgets.container.createGadget({specUrl: specUrl0});
  gadget0.setServerBase("http://shindig.robubu.com/gadgets/");
  gadgets.container.addGadget(gadget0);
  gadgets.container.layoutManager.setGadgetChromeIds(
      ['gadget-chrome-x']);
  gadgets.container.renderGadget(gadget0);
};
</script>

I added an onLoad="renderGadgets()" attribute to the HTML body tag and then added this DIV tag <div id="gadget-chrome-x" class="gadgets-gadget-chrome"></div> where I wanted the gadget to appear.

The “todo” gadget, rendered through the local Shindig gadget container, shows up below if you are reading this on my blog. During testing very few of the widgets available through Google managed to work, but the sample ones do. I have no idea why this is; suggestions welcome.