A penny for thoughts?

About the correct valuation

Adding a jit to the web stack

I’m still not entirely sure how to restore transactions to the write side if the following idea is implemented, but it could probably be done with a temporary transaction staging area where each transaction is either applied or rejected and then written in order. Doing writes in order would greatly reduce write capacity though. There must be some sort of distributed MVCC described in the literature. But I digress. This post is about the read side.

I’m kind of surprised that no one has come up with (or at least no one I’ve heard of) a middleware for your database that essentially acts like a jit compiler. This would allow you to keep relations while distributing data at will.

Let’s say you have an object of type A that is linked to many objects of type B, but when you load an A you rarely access its Bs. If your ORM loads all the Bs every time you load an instance of A, that’s a lot of wasted work, and it typically requires a join done in the database. Joins kill scalability because joins across machines and across datacenters are essentially impossible. (Theoretically they would be possible, but so slow as to be impractical.) So instead, you lazy-load the objects: you load the A and then, when a B is needed, you load that instance of B (or possibly all related Bs). The problem is that after you deploy, everything goes fine until you end up with some power users who have hundreds or thousands (or hundreds of thousands) of Bs. Then you either take down your database when you run out of open connections, or users get a bad experience because certain pages take forever to load: if A has 575 Bs, you’re making 576 round trips over the network to the database (one for A plus one per B). This is called the N+1 problem.
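To make the round-trip count concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, the `CountingConnection` wrapper and the assumption that the app already knows the ids of the Bs are all mine, purely for illustration; a real ORM would look different.

```python
import sqlite3


def make_db(num_bs=575):
    """Build an in-memory database with one A row and num_bs related B rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, payload TEXT)"
    )
    conn.execute("INSERT INTO a (id, name) VALUES (1, 'power user')")
    conn.executemany(
        "INSERT INTO b (id, a_id, payload) VALUES (?, 1, ?)",
        [(i, f"b-{i}") for i in range(1, num_bs + 1)],
    )
    return conn


class CountingConnection:
    """Hypothetical wrapper that counts every query (round trip) it sends."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def fetch(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def load_lazily(db, a_id, b_ids):
    """Naive lazy loading: one query for A, then one query per B."""
    a = db.fetch("SELECT id, name FROM a WHERE id = ?", (a_id,))[0]
    bs = [
        db.fetch("SELECT id, payload FROM b WHERE id = ?", (b_id,))[0]
        for b_id in b_ids
    ]
    return a, bs


def load_eagerly(db, a_id):
    """Eager loading: a single join returns A together with all of its Bs."""
    return db.fetch(
        "SELECT a.id, a.name, b.id, b.payload "
        "FROM a JOIN b ON b.a_id = a.id WHERE a.id = ?",
        (a_id,),
    )
```

Calling `load_lazily` with the 575 B ids sends 1 + 575 queries through the wrapper, while `load_eagerly` sends one. On a single local database the difference barely matters; over a network each of those queries is a round trip.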

So both solutions kind of suck in different scenarios. Most good ORMs included in frameworks essentially punt the problem over to the developer. When you define a has-many, it typically lazy-loads by default unless you add an optional “eager” statement to load the related objects with a join. Or at least that’s what I’ve seen in the frameworks I’ve used.

Instead of doing that, why don’t we put a sort of jit “compiler” between the app and the database? (Preferably the jit would integrate with the app, but more on this later.) This jit would be part of the ORM you use and could read your definitions. Those definitions could include the traditional advice about eager loading or not, but the jit would be free to ignore it, the advice likely only being helpful on cold startup. Given the ORM integration, the jit would know that A has-many Bs, C has-one D, E is-a Z, etc. With that info it could build up knowledge and run heuristics on what data to load, when, and how. Even better, as the app runs and the jit learns more about your data access patterns, it could return only the data that is likely to be relevant.
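As a rough illustration of the kind of heuristic I have in mind, here is a tiny sketch. The class name `LoadAdvisor`, the threshold and the minimum sample count are all made up for the example; a real jit would want far richer signals than a single access ratio.

```python
from collections import defaultdict


class LoadAdvisor:
    """Hypothetical heuristic: eager-load a relation only if it is usually read."""

    def __init__(self, threshold=0.5, min_samples=20):
        self.threshold = threshold      # assumed cut-off, not a tuned value
        self.min_samples = min_samples  # below this we are still "cold"
        self.loads = defaultdict(int)     # times the parent object was loaded
        self.accesses = defaultdict(int)  # times the relation was then read

    def record_load(self, relation):
        self.loads[relation] += 1

    def record_access(self, relation):
        self.accesses[relation] += 1

    def should_eager_load(self, relation, declared_eager=False):
        loads = self.loads[relation]
        if loads < self.min_samples:
            # Cold start: fall back to the developer's declared advice.
            return declared_eager
        # Warm: ignore the declaration and trust the observed access ratio.
        return self.accesses[relation] / loads >= self.threshold
```

With this sketch, a relation declared “eager” that turns out to be read on only one load in ten would be switched to lazy loading once enough samples have been seen.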

Another huge advantage is that it would bring a scalable join. Since writes go through the ORM, it would know (or at least have a pretty good guess at) what data is where. So when you load up an A and it figures it should return all the Bs, it can run its “join” across the relevant machines. It involves an extra parallel network round trip, but I think that’s probably worth it, especially if the jit has access to a large cache that it can manage intelligently. This join of course isn’t a real join, but once you have the id of A you can run a series of parallel “select * from table where foreign_key = A.id” queries, merge the results and return them together as if a join had happened.
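The parallel “select and merge” idea can be sketched roughly like this. Each in-memory sqlite3 database stands in for a separate machine, and the function names and table layout are assumptions made for the example, not anything an existing ORM provides.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_shard(rows):
    """Create one in-memory database standing in for a separate machine."""
    # check_same_thread=False lets a worker thread use this connection;
    # each connection is only ever used by one thread at a time here.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, payload TEXT)"
    )
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def query_shard(conn, a_id):
    """The per-machine half of the fake join: a plain foreign-key lookup."""
    return conn.execute(
        "SELECT id, a_id, payload FROM b WHERE a_id = ?", (a_id,)
    ).fetchall()


def scatter_gather(shards, a_id):
    """Query every shard in parallel and merge the rows as if joined."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda conn: query_shard(conn, a_id), shards)
        merged = [row for part in partials for row in part]
    return sorted(merged)  # stable order regardless of which shard answered
```

This sketch asks every shard. A jit that tracks where writes went could skip the machines it knows hold no Bs for that A, which is where the “pretty good guess” about data placement would pay off.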

The problem with making an intelligent jit, though, is the amount of knowledge you have available for your heuristics. The more information the better, so tight integration into the app would really be best. The jit could look at the request that came in, the data that came over the wire from the database/cache, and the data that was actually read to produce the final output. By looking at all that, it could realize that for certain types of requests the application is asking for more data than it needs, and tune itself to return only the relevant data, putting a mock/proxy in place of whatever should theoretically have been returned but was unlikely to be used. If it turns out to be a special case and the mock/proxy is accessed, this is basically a cache miss: the tooling has to go back to the database/cache for the information, and it would likely update its knowledge so that the heuristics become more accurate.
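A mock/proxy of that sort could look something like the sketch below. The `LazyProxy` class, its explicit `get()` method and the `on_miss` callback are hypothetical; a real ORM would more likely intercept attribute access transparently.

```python
class LazyProxy:
    """Stand-in for data the jit guessed would not be needed."""

    def __init__(self, loader, on_miss):
        self._loader = loader    # callable that fetches the real data
        self._on_miss = on_miss  # callable that reports the wrong guess
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            # The guess was wrong: report the miss so the heuristics can
            # learn from it, then go back to the database/cache once.
            self._on_miss()
            self._value = self._loader()
            self._loaded = True
        return self._value
```

If the proxy is never touched, the loader never runs and nothing is fetched. If it is touched, the miss is reported once and the fetched data is reused on later calls.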

If this were all available you could build a massively scalable cloud database.

Of course, as this jit gets iterated on again and again, eventually it can start doing more advanced stuff like managing indexes. For example, if it finds queries that are running slow but would be quick with an index, it could bring up a new machine replicating from an original, add the index there, let replication catch back up, lock briefly while it swaps the new machine in for the original, and then shut the original off entirely. It just added an index that it decided you needed, without costing you any downtime and at very little additional expense. Running two machines instead of one for each db server for a little while would cost you a few dollars, and in exchange your entire database fleet gets the index and your app’s performance improves. Pretty good bargain.

And there’s a company that happens to be in the absolutely perfect position to implement this: a company that has tight integration between their language, the tools developers use, and even control over the environment their deployments typically run on. That company is Microsoft. It would be awesome if you could start up a new .NET MVC project, code it all up in C# connected to a SQL Server running on localhost in development, and then deploy to Azure with a massively scalable backend given to you “for free” (no extra work), as long as you stick with mostly LINQ for data requests or something like that. They may be working on this already, I don’t know, but there definitely isn’t anyone in a better position to pull this off right now. The scenario I described above, especially if it’s billed pay-only-for-what-you-use cloud style (which, given Azure, is probable), would make developing on the .NET platform a very interesting proposition for startups and a no-brainer for large enterprises.

Maybe I’m missing something, and maybe it’s already been done, but if it hasn’t, I’d be surprised if Microsoft didn’t already have it in the works.

Written by Smokinn

April 9th, 2011 at 2:52 am

Posted in Uncategorized