Archive for the 'Engineering' Category

Splitting Applications or Services for Scale

Thursday, May 8th, 2008

Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and potentially communicating with a database. Many if not all of the functions are likely to exist within a monolithic application code base making use of the same physical and virtual resources of the system upon which the functions operate: memory, cpu, disk, network interfaces, etc. Potentially the engineers have the forethought to make the system highly available by positioning a second application server in the mix to be used in the event that the first application server fails.

This monolithic design will likely work fine for many sites that receive low levels of traffic. However, if the product is very successful and receives wide and fast adoption, user-perceived response times are likely to degrade significantly, to the point that the product is almost entirely unusable. At some point, the system will likely even fail under the load as the inbound request rate becomes significantly greater than the processing power of the system and the resulting departure rate of responses to requests.

A great engineering team will think about how to scale their platform well in advance of such a catastrophic failure. There are many ways to approach how to think about such scalability of a platform and we present several through a representation of a three dimensional cube addressing three approaches to scale that we call the AKF Scale Cube.

The AKF Scale Cube consists of X, Y and Z axes, each addressing a different approach to scaling a service. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic service or product identified above: a product wherein all functions exist within a single code base on a single server making use of that server’s finite resources of memory, cpu speed, network ports, mass storage, etc.

 

The X Axis of the cube represents a means of spreading load across multiple instances of the same application and data set. This is the first approach most companies use to scale their services and it is effective in scaling from a request per second perspective. Oftentimes it is sufficient to handle the scale needs of a moderate sized business. The engineering cost of such an approach is low compared to many of the other options as no significant re-architecting of the code base is required unless the engineering team needs to eliminate affinity to a specific server because the application maintains state. The approach is simple: clone the system and service and allow it to exist on N servers with each server handling 1/Nth the total requests. Ideally the method of distribution is a load-balancer configured in a highly available manner with a passive peer that becomes active should the active peer fail as a result of hardware or software problems. We do not recommend leveraging round-robin DNS as a method of load balancing. If the application does maintain state there are various ways of solving this including a centralized state service, redesigning for statelessness, or as a last resort using the load balancer to provide persistent connections. While the X-axis approach is sufficient for many companies and distributes the processing of requests across several hosts it does not address other potential bottlenecks like memory constraints where memory is used to cache information or results.
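The "clone the system onto N servers" idea can be illustrated with a minimal sketch of the rotation a load balancer might apply. The server names and the `make_round_robin` helper are hypothetical, and the sketch models rotation inside a load balancer, not round-robin DNS; it ignores health checks and failover entirely.

```python
from collections import Counter
from itertools import cycle


def make_round_robin(servers):
    """Return a function that hands out servers in strict rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


# Hypothetical pool of N = 3 identical clones of the application.
servers = ["app1", "app2", "app3"]
pick_server = make_round_robin(servers)

# Route 9 requests and count how many land on each clone.
assignments = Counter(pick_server() for _ in range(9))
print(assignments)
```

With 9 requests and 3 clones, each clone receives 3 requests, which is the 1/Nth share described above.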

The Y Axis of the cube represents a split by function, service or resource. A service might represent a set of use-cases and is most often easiest to envision by thinking of it as a verb or action like “login”, while a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in reducing or distributing the amount of memory dedicated to any given application across several systems. A recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. As a quick first step, a monolithic application can be placed on multiple servers, with certain of those servers dedicated to specific “services” or URIs. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation, it may not offer the added benefit of reducing the amount of system memory required by service/pool/resource/application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application.
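The "quick first step" of dedicating servers to specific URIs might look like the following sketch. The pool names, URI prefixes and the `pool_for` function are illustrative assumptions, not part of any particular load balancer's configuration.

```python
# Hypothetical mapping of URI prefixes (services or resources) to the
# server pools dedicated to them.
POOLS = {
    "/login": ["login1", "login2"],
    "/account": ["account1", "account2"],
}

# Everything not yet split out stays on the general-purpose pool.
DEFAULT_POOL = ["general1", "general2"]


def pool_for(path):
    """Return the server pool dedicated to the service that owns this path."""
    for prefix, pool in POOLS.items():
        # Match the prefix exactly or as a leading path segment, so that
        # "/loginx" is not mistaken for the "/login" service.
        if path == prefix or path.startswith(prefix + "/"):
            return pool
    return DEFAULT_POOL


print(pool_for("/account/42"))
print(pool_for("/search"))
```

Each pool can then be scaled along the X axis independently, which is where the "headroom" argument above comes into play.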

The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). As with the Y axis split, this split aids not only fault isolation, but significantly reduces the amount of memory necessary (caching, etc) for most transactions and also reduces the amount of stable storage to which the device/service needs to attach. In this case, you might try a modulus by content id (article), or listing id, or a hash from the received IP address, etc. The Z axis split is often the most costly of all splits and we only recommend it for clients that have hyper-growth or very high rates of transaction. It should only be used after a company has implemented a very granular split along the Y axis. That said, it also can offer the greatest degree of scalability as the number of “swimlanes within swimlanes” that it creates is virtually limitless. For instance, if a company implements a Z axis split as a modulus of some transaction id and the implementation is a configurable number “N”, then N can be 10, 100, 1000, etc., and each order of magnitude increase in N creates nearly an order of magnitude of greater scale for the company.
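A modulus by id and a hash of the received IP address, as mentioned above, can be sketched as follows. The function names and the choice of SHA-256 are assumptions made for illustration; any stable hash would serve the same purpose.

```python
import hashlib


def shard_by_modulus(record_id, n_shards):
    """Pick a shard for a numeric id (content id, listing id, ...)."""
    return record_id % n_shards


def shard_by_hash(key, n_shards):
    """Pick a shard for a non-numeric key such as an IP address.

    A cryptographic digest is used only because it is stable across
    processes and machines; it is not a security measure here.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


# The same id lands on different shards as the configurable N grows.
print(shard_by_modulus(1234, 10))   # 1234 % 10
print(shard_by_modulus(1234, 100))  # 1234 % 100
print(shard_by_hash("203.0.113.7", 10))
```

Note that with a plain modulus, changing N remaps most existing ids to new shards, so a real implementation would also need a plan for moving data when N changes.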

 

Fast or Right?

Thursday, April 24th, 2008

All of us have heard the mantra “Speed, Quality, or Cost…pick any two that you like” but when it comes to a specific software release you are usually only left with the speed and quality levers to pull.

The reason you only have these two is that budgets are generally set annually and even if your budget were instantly doubled you could not hire developers or QA staff fast enough to make a difference on the target release in question. So that leaves us often asking the question, do we want the release out quickly or correctly? In some organizations this question is asked every release. What is the right answer? We believe the single correct answer is “it depends”. We know, that sounds like a copout, but really “it depends” on things like what type of software you are developing (e.g. shrink-wrapped or SaaS), how tolerant the customers are, how many open production bugs you currently have, etc. The following is a quick look at some of these dependencies with the intent of informing better decisions regarding their trade-offs in the future.

The type of software you are developing is a critical element in the decision between speed and quality. If you have one shot to get it right because the software is being pressed onto DVDs then we argue that the decision is obvious: you have to slip your delivery date if the quality is not there. If you are developing SaaS type software the question is a bit more complicated because some services are fine to be a little buggy, such as a beta version of a report. Others, like transferring money in a banking application, are not. The user of the software is another critical element in the decision. If the application is for internal sales reps, bugs can be managed much more easily than if the software is for external sales reps.

The current or expected quality of the application is another factor to consider when making the decision to trade speed or quality. If you have very few bugs currently in production it may be fine to release a less-than-perfect version and clean it up over the next week, providing that your customers are tolerant of this type of deployment. It may be the case that the new functionality will rapidly create barriers to entry for your competition or switching costs for your users which in turn may allow for a higher defect density in the initial release. We have often had internal customers tell us this exact comment when we were building features to improve their efficiency or productivity.

When considering the tradeoff of speed and quality, consider your users’ willingness to tolerate defects in a release, the importance of quality in the release given the industry (our example of banking being an industry with little tolerance for defects) and the business need for speed in the specific functionality contained within the release.

SOA vs ROA

Tuesday, April 8th, 2008

Two architectural approaches commonly applied today are the Service Oriented Architecture and the Resource Oriented Architecture. When considering these as a basis for your platform, the decision is not whether one is better than the other but rather which is more appropriate for the intended functionality, portion of the platform or entire platform. We will start by defining and explaining each of these architectures and then give some advice on how to make the right decision in light of your specific needs.

 


 

Service Oriented Architecture describes designing a system by modeling business processes as services or “actions”. Each service is a distinct unit of functionality and does not interact with the other services. In other words, calls between services are discouraged. As with object oriented programming, wherein a major goal is reuse of code, the goal of SOA is to allow large chunks of functionality to be rearranged to form new applications. The size of the chunk of functionality is important because the more functionality that is included in each service, the fewer interface points are required to implement a particular functionality, and the fewer interface points there are, the lower the probability of transactional failure and the better the performance. That said, very large chunks of functionality may not be granular enough to be easily reused. SOA is geared towards applications that are activity-based, such as a banking application where users are most interested in performing transactions such as deposits, withdrawals, opening accounts, etc.

Resource Oriented Architecture is a term coined by Alex Bunardzic in his August 8, 2006 blog entry. It is now generally considered a specific set of guidelines for an implementation of REST. REST, or Representational State Transfer, comes from Roy Fielding’s Doctoral Thesis “Architectural Styles and the Design of Network-based Software Architectures” and describes a series of constraints that exemplify how the web’s design emerged utilizing the Hyper Text Transfer Protocol. It is difficult to discuss REST without blurring the lines between the technology of implementation and the ROA architectural principles. ROA is ideal for applications that are resource-based, such as an RSS reader where feeds are the “resource” that the user is interested in getting information about and then updating the status of.

A closely related technology decision is whether the implementation will be REST, SOAP, RPC or various other technologies. As discussed above, REST is often associated with ROA while SOAP is often associated with SOA. However, they are not mutually exclusive, in that it is possible to implement an SOA architecture through a RESTful implementation. Another confusing point is the difference between SOAP and SOA. SOAP, which originally stood for Simple Object Access Protocol, is a protocol for implementation whereas SOA is an architecture. The SOA and ROA concepts are the structure of the application and what information flows – they are descriptions of how something should work. REST, RPC, WS*, and other technologies are how the application actually works, or how they are actually implemented. We will talk more about these technology implementations in a later article.

The language analogy is that SOA focuses on exposing many verbs (actions) and ROA is focused on exposing many nouns (resources). Of course most applications, like sentences, are not all actions or all resources but rather they are a combination of both and have actions and resources. The predominance of the application should dictate the architecture. If you find yourself in a position of having a platform requiring a mix of actions and resources it may be useful to consider blending (albeit in separate environments) the two architectures. For instance, you might have a site that has reporting functionality that is primarily resource based and commerce functionality that is primarily action based. In this case your site could be comprised of two separate and distinct architectures that form a holistic client service including both commerce and reporting.

Ardent adherence to one architecture or another is not prudent. As we mentioned in our posting on Technology Agnostic Design (TAD), the professional technologist chooses the right tool for the job, as should be the case for architectures, technologies, design patterns, programming languages, etc.

If you have had a good or bad experience with either architecture, let us hear about it.

A Case for Technology Agnostic Design (TAD)

Tuesday, April 1st, 2008

Have you ever heard someone describe a design or platform architecture by using the names of the third party systems used to implement their platform? The discussion starts with a question: “How are you architected?” and ends with a statement like “We use ACME web servers running on PLATINUM computers connected to our own internal application running on ACE application servers. We use BESTCO databases to maintain state and store information and they run on FASTCO SMP servers connected to a BESTSAN storage area network. Our network is all FASTGEAR and SAFE supplies our firewalls”.

The statement is innocent enough, but it does not address how a site is architected; it is rather a statement of how that architecture is implemented using technology.

 

 

Mature technology organizations understand that there is a very big difference between architecture and technology. The architecture of a platform describes how something works in generic terms with specific requirements, and the technology describes how it is implemented and what it is comprised of at any point in time.

The aim of technology agnostic design, or TAD, is to separate design and architecture from technology and implementation for the purposes of reducing cost, decreasing risk and increasing both scalability and availability.

TAD and Cost
It is unlikely that you will see an architect of a house label trusses, beams and supports with the vendor’s name. More likely the aforementioned objects will be labeled with sizes or specifications for load bearing. One reason this is so is that the architect recognizes that most pieces with which they will design a house are “commodities” or things that can be easily swapped out and which might be purchased primarily based on price.

Most technology solutions also ultimately suffer from the effects of commoditization. This is to say that as a good idea starts to become successful it is bound to attract competitors. The competitors within the solution space initially compete on differences in functionality and service but over time the differences in functionality and service start to decrease as the most useful feature sets are adopted by every provider in the competitive landscape. In an attempt to forestall the effects of commoditization, providers of third party systems and software try to produce proprietary solutions or tools that interact specifically and exclusively with their systems in order to create switching costs for the users of their systems, hardware or tools.

While there are always exceptions, as a rule we believe you are better off maintaining the flexibility of using nearly any provider’s solutions where there exists competition for that provider’s product offerings. The ability to replace one vendor’s product with another fairly easily gives you significant leverage in negotiating prices long term. We don’t mean to imply that you should never leverage proprietary functionality; rather you should understand that the use of such an embedded toolset has hidden and long term costs.

TAD and Risk
Designing for a particular vendor’s solution also creates a great deal of risk for your product and for your company. What if the provider of the solution goes out of business? What if the provider finds themselves being sued for some portion of their solution that does not exist in other similar solutions? What if the viability and maintenance of the product relies upon the genius of a handful of people within the company of the provider? What if the solution suddenly starts to suffer from quality problems that aren’t easily fixed?

Technology agnostic design reduces your risk by increasing your ability to quickly move to other providers without the problems identified above.

TAD for Availability and Scalability
As discussed above, as competition increases between providers of software and hardware and before commoditization completely sets in, providers of software and hardware attempt to differentiate themselves on functionality, performance, and service. During this period of attempted differentiation there might be significant performance, quality, and functionality differences between the providers. Ensuring that you can easily switch between the providers gives you the ultimate flexibility in leveraging these differences to the benefit of your platform.

The TAD Approach
Implementing technology agnostic design is fairly simple and straightforward. At its core, it means designing and architecting platforms using concepts rather than solutions. Pieces of the architecture are labeled with their generic system type (database, router, firewall, payment gateway, etc) and potentially further described with characteristics or specific requirements (gigabit throughput, 5 TB storage, ETL cloud, etc). The first, very simple test is to see whether a third-party vendor’s name is tied to the system described on paper in any given architecture or design. Data flows, systems, transfers and software that are specifically labeled as coming from a specific provider should be questioned and where possible the system should be designed to allow any provider of a service to exist in that area.
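The first test described above, checking whether a vendor's name is tied to a labeled piece of the design, could be automated along these lines. This is only a rough sketch: the `vendor_tied_labels` helper is hypothetical, and the labels and vendor names are the fictional ones used earlier in this post.

```python
import re


def vendor_tied_labels(labels, vendor_names):
    """Return the architecture labels that mention a vendor by name."""
    vendors = {name.lower() for name in vendor_names}
    flagged = []
    for label in labels:
        # Compare whole words only, so that a vendor called "ACE" does not
        # match a generic label such as "network interface".
        words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", label)}
        if words & vendors:
            flagged.append(label)
    return flagged


# Fictional labels from a design diagram and fictional vendor names.
labels = [
    "ACME web server",
    "database (5 TB storage)",
    "BESTCO database",
    "firewall",
    "network interface",
]
vendors = ["ACME", "BESTCO", "ACE"]

flagged = vendor_tied_labels(labels, vendors)
print(flagged)
```

Any label that is flagged is a candidate for relabeling with its generic system type and requirements.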

Branching

Wednesday, February 20th, 2008

One of the issues all development teams will face at least once and sometimes several times as the team grows is how to allow multiple versions of the code to exist in a variety of different states.  The solution to this problem (as implemented within many source code management systems) is commonly referred to as “branching”.  The purpose of branching as described above is to allow individuals or teams of developers to work on the same code without constantly interfering with each other.  For example let’s say engineer A is working on the application’s admin module and decides to check in his work to the source code repository in order to not lose it in the event of his hard drive crashing.  Unknown to him, he has a major bug in it that prevents the admin module from working properly.  Engineer B wants to fix a production issue so he checks out the source code but cannot get the admin module working properly to debug the production issue because engineer A checked in a bug.  This scenario and many others can be avoided by proper branching. 

The term branching refers to isolating changes onto a separate line of development that does not appear on the main trunk (aka “production branch” or “release candidate branch”) of code. You can move changes from one branch to the main trunk or from main trunk to a branch by merging. Some shops avoid branching altogether because of the fear of merging. While merging can be problematic, there are ways to minimize this issue that we will address in future articles.

As with most things in software development there are as many branching strategies as there are opinions and most of them have pros and cons that make them worth discussing amongst the team. There is no single right answer for every organization, so your approach should be to choose the strategy that works best given the skills of the team and the toolset/systems employed. Here are two sample branching strategies that we have seen work in the past and you might want to consider.

The first strategy, for those of us who do not need the extra complexity and want a simple way to keep our code safe,  follows the KISS principle (Keep It Simple Stupid).  In this scenario you can use the main trunk for new development and pull a branch right after the release for maintenance.  For example if you just released version 2.5 then call this new branch “2.5_maint” and use the main trunk for development of version 2.6.  The “2.5_maint” branch is used only for production fixes and gets merged into the main before QA.  This is a very simple and easy to understand strategy that suffices to keep a pristine copy of the code base ready for any production issues but does not overburden the engineers by making them manage too many branches and thus environments.  A permutation of this strategy is to leave the main trunk as the code that is in production and pull a branch whenever you start a new release.  Continuing with our example, if you currently have version 2.5 in production and are starting development on 2.6 someone would create a new branch from the main code base (trunk) and name it “2.6”.  All developers working on 2.6 check out this branch and use it for committing their new features.  This branch is utilized through QA until the code is ready for production and then the branch is merged into the main trunk.  If there have been production fixes in the main trunk these need to be either double committed to the main and branch or merged up from the main to the branch. 

The second strategy, which is much more complex but tends to work well with large development teams who are releasing code at different times, typically because of an iterative SDLC, is to give each developer or pair of developers their own branch. The developer chooses their next feature and pulls a branch for themselves. When they are done coding, unit testing and sometimes feature testing right on their personal branch, they merge it into a release branch that is the gathering place for all features ready to be sent through QA regression for a near term release. The advantage of this is that features can be pulled out or added at the last minute before regression and therefore the release will have less chance of being delayed because of one feature. A modification of this approach is to have a group of developers associated with a specific release have their own branch, which in turn enables parallel development.

These are just two of the myriad strategies that exist but these or permutations of these are ones that we have seen work well in the past. Tell us your favorite branching strategy!