I got to talking with one of my developer friends today who is starting to hit a bit of a scaling problem.
In general, he’s got a web service that he designed to handle updates to a data set via a simple REST-based interface. The original design was to have the service called once per ‘record’ that needed updating…however it was quickly decided that clients could/should submit XML requests with lots of ‘records’, and so the service was tweaked to handle this.
The problem now is that at least one client wants to perform potentially millions of ‘record’ updates daily. As the system is right now, it will likely time out and/or lock up before it completes the processing.
To make things worse, the service is currently designed to do this processing sequentially…and return immediate results. Though this isn’t that big of an issue for him to fix (I think he’s already done it on the back end actually), it does mean changing the user expectations and actions on the front end once you redesign how the system works for these large data loads.
Anyway, getting back to the original problem he is now having…how does he scale a system to handle such a large data set?
Well, the first thing I think you need to ask is what exactly needs to be done with this data each time it’s ‘processed’?
In this case, it turns out that most of the updates are going to be quantity value changes (i.e. one of the data points is a quantity field, like how many copies of a given book are in stock).
As it stands right now, if a million ‘records’ are submitted to the service, it’s got to do at least a million updates (actually it does a lot more because of the business logic required, but for now we are going to pretend it’s just one call per record).
It’s my belief that you should always be looking for ways to cut down on the number of database calls you need to make (regardless of what you are trying to do)…so automatically my mind flips this problem into ‘how do I cut down on making so many DB calls?’
In my opinion, it’s likely that you are going to have far fewer unique ‘quantities’ than you have records (and in the worst case scenario it’s one to one)…so a simple first step in my mind is to flip your processing from 1 update per ‘record’ to 1 update per ‘quantity’…
So parse the XML, build up a list of ids (or whatever you use that ties quantity to a record) for each given quantity…and then execute one update query per quantity, with each query covering every id in that quantity’s list…
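To make that flip a bit more concrete, here is a rough sketch in Python of what I have in mind. To be clear, this is not my friend’s code: the XML shape (`<record id="…" quantity="…"/>`), the `books` table, and the function names are all made up for illustration, and I’m using SQLite only because it ships with Python. The ids are sent in chunks because most databases cap how many parameters a single statement can take.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical payload; the real service's XML schema is not shown in this post.
SAMPLE_XML = """
<records>
  <record id="1" quantity="5"/>
  <record id="2" quantity="0"/>
  <record id="3" quantity="5"/>
  <record id="4" quantity="0"/>
  <record id="5" quantity="12"/>
</records>
"""

# Assumed chunk size; older SQLite builds allow at most 999 bound parameters.
CHUNK = 500


def group_ids_by_quantity(xml_text):
    """Return {quantity: [ids]} so each distinct quantity needs one UPDATE."""
    latest = {}
    for rec in ET.fromstring(xml_text).iter("record"):
        # If an id shows up more than once, the last value in the payload wins.
        latest[int(rec.get("id"))] = int(rec.get("quantity"))
    groups = defaultdict(list)
    for rec_id, quantity in latest.items():
        groups[quantity].append(rec_id)
    return groups


def apply_updates(conn, groups):
    """Run one UPDATE per quantity (per chunk of ids); return statements run."""
    statements = 0
    for quantity, ids in groups.items():
        for start in range(0, len(ids), CHUNK):
            chunk = ids[start:start + CHUNK]
            placeholders = ",".join("?" * len(chunk))
            conn.execute(
                f"UPDATE books SET quantity = ? WHERE id IN ({placeholders})",
                [quantity, *chunk],
            )
            statements += 1
    conn.commit()
    return statements


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, quantity INTEGER)")
    conn.executemany("INSERT INTO books VALUES (?, 0)", [(i,) for i in range(1, 6)])
    count = apply_updates(conn, group_ids_by_quantity(SAMPLE_XML))
    print(f"{count} statements for 5 records")  # 3 statements for 5 records
```

With five records and three distinct quantities, that’s three statements instead of five. The savings only get interesting when lots of records share a quantity (think thousands of titles all going to 0), and in the worst case you’re right back to one statement per record.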
It’s not a complete solution to the problem at hand (you’ve still got to work in threading, probably rethink just how and what gets processed, and apply some other scaling related ideas), but it’s a path I would definitely start exploring for a problem like this…
Of course I’m pretty sure my friend is NOT going to take my suggestion and trek down this path (I think he feels he’s got too much business logic and other things going on to be able to implement this type of flip)…
I still thought it was worth throwing out there for the rest of you to think about…when you are dealing with scaling (and related problems) at least part of your work should be flipped into “how do I cut down the number of DB calls I’m making?”…and forcing yourself to answer that question almost always leads you down some interesting and useful paths (and to a stronger overall architecture).
Kevin has a day job as CTO of Veritonic and is spending nights & weekends hacking on Share Game Tape. You can also check out some of his open source code on GitHub or connect with him on Twitter @falicon or via email at kevin at falicon.com.
If you have comments, thoughts, or want to respond to something you see here I would encourage you to respond via a post on your own blog (and then let me know about the link via one of the routes mentioned above).