Wednesday, July 30, 2008

VirtualBox – Simple and Sweet Virtualization

I have been experimenting with virtualization for quite some time now. My interest in hypervisors and virtualization stemmed from the need to enhance the productivity of my development and testing staff, by enabling them to switch between multiple operating systems on the very same box, without having to reboot the entire system.

Although virtualization has been around for many years, my search for better and more cost-effective hypervisors seems to have recently been rewarded with a plethora of them available at zero cost.

VMware is finally giving away Update 2 for VMware Infrastructure 3.5, along with a lightweight edition of its market-leading hypervisor, ESX 3i, for free.

Read more about this news here.

Another competitive product in this segment is VirtualBox, a powerful x86 virtualization product for enterprise as well as home use. Not only is VirtualBox an extremely feature-rich, high-performance product for enterprise customers, it is also the only professional solution that is freely available as open source software under the terms of the GNU General Public License (GPL).

My personal experience with VirtualBox so far has been good. VirtualBox is simple to configure and use. The best part of VirtualBox is that the images are very portable. Additionally, the VirtualBox snapshot technology provides the same basic functionality as VMware's; that is, snapshots can be taken while the virtual machine (VM) is running or offline.

Here is a screenshot of what we achieved with VirtualBox; it depicts three operating system instances (viz. Windows Vista, Windows 95 and Ubuntu) running in parallel on the very same box, wherein each of those instances could be distinctly accessed over the network.


The term "cost-effective" doesn't necessarily always mean "free". Yesterday, whilst chatting with Prashant, a friend and technology evangelist, I got introduced to Amazon Elastic Compute Cloud (Amazon EC2), an amazing virtual computing solution available at a nominal cost. Further research on this topic led me to an open source Google Code project, namely Scalr.

I’ve not yet had an opportunity to play with Scalr, but at first look, I must say that I am impressed. I am looking forward to getting my hands on it soon, and I will definitely share the experience here. So keep an eye on this blog for more on virtualization in the future.

That's all for now... Hope you enjoyed reading this article. I’d love to read your feedback and hear about your virtualization experiences, so please do take a moment and drop in a note.


Monday, July 28, 2008

Shard – A Database Design

Scaling the database is one of the most common and important issues that every business confronts in order to accommodate growing business and the resulting exponential demand for data storage and availability. There are two principal approaches to database scaling, viz. vertical and horizontal.

Regardless of which scaling strategy one decides to follow, we usually land up buying ever bigger, faster, and more expensive machines: we either move the database onto them to scale vertically, or cluster them together to scale horizontally.

While this arrangement is great if one has ample financial support, it doesn't work so well for the bank accounts of some of our heroic system builders who need to scale well past what they can afford.

In this write-up, I intend to explain a fairly new database architecture, termed sharding, which websites like Friendster and Flickr have been using for quite some time now. The concept defines an affordable approach to horizontal scaling with very little compromise.

For instance, Flickr handles more than 1 billion transactions per day, responds in less than a few seconds, and can scale linearly at a low cost.

What is sharding...?

While working for an auction website, somebody got the idea to solve the site’s scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.
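The user-to-shard mapping described above can be sketched as a simple routing function. This is only an illustration: the bucket size, shard host names, and routing scheme below are invented for demonstration, not the actual scheme used by Flickr or the auction site.

```python
# Illustrative shard routing: map a user ID to one of several
# independent database servers (shards). The 500K bucket size and
# the host names are made-up values for demonstration.

USERS_PER_SHARD = 500_000                   # e.g. 500K users per shard
SHARD_HOSTS = ["db0", "db1", "db2", "db3"]  # hypothetical shard servers

def shard_for_user(user_id: int) -> str:
    """Return the host that owns this user's data."""
    bucket = user_id // USERS_PER_SHARD       # range-based bucketing
    return SHARD_HOSTS[bucket % len(SHARD_HOSTS)]

# All of a user's data lives on exactly one shard, so a request for
# user 42 and a request for user 1,200,000 can be served (and written)
# in parallel by different machines.
print(shard_for_user(42))
print(shard_for_user(1_200_000))
```

Because every user's rows live on a single server, writes for different user groups never contend for the same master, which is where the parallel write bandwidth comes from.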

The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes, you can write in parallel, which increases your write throughput. Writing is the major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

How is Sharding different from traditional architectures...?

Sharding differs from traditional database architecture in several important ways; the following are the key factors -

Data is denormalized. Traditionally we normalize data: it is splayed out into anomaly-free tables and then joined back together again when it needs to be used. In sharding, the data is denormalized: you store together data that is used together.

This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.
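The "get a blob, store a blob" pattern might look like the sketch below, using an in-memory dict to stand in for a shard's user-profile table; the field names are invented for illustration.

```python
import json

# A shard's user-profile table, modeled as user_id -> JSON blob.
# The whole denormalized profile is one value: one read fetches it,
# one write persists it, and no joins are needed.
profile_table = {}

def save_profile(user_id: int, profile: dict) -> None:
    # Serialize the entire profile as a single blob so it can be
    # stored with one write.
    profile_table[user_id] = json.dumps(profile)

def load_profile(user_id: int) -> dict:
    # Fetch and deserialize the whole profile in one read.
    return json.loads(profile_table[user_id])

save_profile(7, {"name": "alice", "location": "NYC", "avatar": "a.png"})
print(load_profile(7)["name"])
```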

Data is across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

Data is small. The larger the set of data a server handles, the harder it is to cache intelligently, because you have such a wide diversity of data being accessed. You need huge gobs of RAM, and even that may not be enough to cache the data when you need it. By isolating data into smaller shards, the data you are accessing is more likely to stay in cache.

Smaller sets of data are also easier to backup, restore, and manage.

Data is more highly available. Since the shards are independent, a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity, it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and makes the data more parallelized, so more work can be done on the data. You can also set up a master-slave or dual-master relationship within the shard to avoid a single point of failure within it. If one server goes down, the other can take over.
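The within-shard failover just described can be sketched roughly as follows; the Connection class here is a hypothetical stand-in for a real database client, not any particular driver's API.

```python
# Sketch of within-shard failover: a read tries the master copy first
# and falls back to the other copy in the shard if the master is down.

class Connection:
    """Hypothetical stand-in for a database connection."""
    def __init__(self, host: str, up: bool = True):
        self.host, self.up = host, up

    def query(self, sql: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.host} is down")
        return f"result from {self.host}"

def read_with_failover(master: Connection, replica: Connection, sql: str) -> str:
    # Try each copy in the shard in order; only fail if both are down.
    for conn in (master, replica):
        try:
            return conn.query(sql)
        except ConnectionError:
            continue
    raise RuntimeError("entire shard unavailable")

master = Connection("shard3-master", up=False)   # simulate a failure
replica = Connection("shard3-replica")
print(read_with_failover(master, replica, "SELECT ..."))
```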

It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

Obviously the master becomes the write bottleneck and a single point of failure; and as the load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

Sharding cleanly and elegantly solves the problems with replication.

The most recommended approach to implementing database shards is the Hibernate Shards framework. The framework offers critical data clustering and support for horizontal partitioning along with standard Hibernate services, enabling businesses to keep data in more than one relational database without any added complexity while building their applications.

Other than Hibernate, shards can also be implemented with any of the following toolkits –

Well, that's all for starters, folks. Hope this was a useful read and has given your brains enough to work on for quite some time now...

Tuesday, July 8, 2008

Symptoms of aging software

Most product development starts with a good design in mind. The initial architecture is clear and elegant. But then something begins to happen: programs, like people, get old. However, unlike (in most cases) people, there is no guarantee that software will mature as it grows old. Even worse, if we aren't taking care of standard Software Quality Assurance (SQA), our system is not only ageing, but in fact it is rotting. Software undesirably grows old for two major reasons –

  1. The product owners have failed to modify the software to meet changing needs.
  2. Changes are made that yield poor results.

It is SQA that determines the way we can check and improve our software quality. SQA is a planned and systematic approach to evaluation of the quality and adherence to software product standards, processes, and procedures. It includes the process of assuring that standards and procedures are established and followed throughout the software acquisition life cycle.

So, who is responsible for SQA...? Well, everyone has an influence on quality, independent of his or her status and position in the project. What developers can do is focus their eyes on the excellence of defect detection and removal, as well as on design improvements wherever feasible. Now, in case of a horrendous design wherein every component seems to be dependent on every other, it could be a nightmare to even think of changing any of them; in such a case it's altogether a different war game. But in typical day-to-day scenarios, before we look closer at defect detection and analysis, we must first look for the standard symptoms of "ageing software".

You should keep an eye out for these characteristics, especially when your software is getting older.

  1. Architecture and design can't keep up - As software ages, it grows bigger and bigger, but the architecture and design get watered down, and one can clearly see some kind of design erosion.

  2. Unused or dead code - Another phenomenon that occurs in growing software is the incorporation of features that are not explicitly requested, but are mainly incorporated by enthusiastic software engineering staff members (agile techniques in particular try to avoid these "eloping features"). This surely leads to unused (dead) code. This phenomenon can also be a result of inadequately performed reuse.

  3. Poor modularization - Programs are not divided into meaningful subsystems.

  4. Confusing workflow - The program's dynamic work flow cannot be derived by a static code inspection.

  5. Hidden redundancy - Duplicated code is in fact the worst enemy of quality software.

  6. Inaccurate scope - The scope, or visibility, of data and methods is inaccurate and more extended than necessary (e.g., "public" instead of "protected").

  7. Semantic issues - Class names, variables and methods are semantically worthless or irrelevant to the context of what they do (for example, the famous "int i;": who could tell from its mere name what it does? Similarly "void doProcessing()" or "class DataHandler").

  8. Poorly documented code and design - Documentation might even be ignored because it is considered inaccurate. It's hard to maintain and modify code that follows the maxim "If it was hard to write, it should be hard to read." Unfortunately, there are no incentives for programmers (at least not a practice that I’ve come across) to document their code or write clear and understandable programs. In fact, it's usually the opposite; developers are urged to quickly turn out code, mainly because of unrealistic schedules (thanks to the project managers who fail to anticipate and set up the right expectations for their team's deliverables); this can even happen in agile projects.

  9. Exponentially increasing code changes - There is more and more code to change to incorporate new features or to fix detected errors. It is difficult to find the modules and components that must be changed and to preserve the original design when doing these changes.

  10. Reliability decreases - As the software is maintained through so-called workarounds or hacks, more often than not the complexity of keeping track of changes results in new bugs. Even with minor changes, known and unknown dependencies among parts of the software are likely to interact and cause problems.

  11. Lack of scalability - There is a lack of scalability, mainly revealed as reduced performance, as poor design (or often its direct implementation) causes performance bottlenecks.

  12. Lack of reusability - This may concern the reuse of modules from the product itself or even from other products or platforms developed by the organization.

Now, being familiar with the symptoms of rotting software, and having seen what can turn good software into bad, we will look at possible corrective actions, which will be covered here sometime soon...