Archive for the 'Engineering' Category

Code Review

Monday, November 17th, 2008

Most software engineers have heard of the studies showing that the cost of a defect increases by an order of magnitude with each successive phase. If you haven’t, it is now considered a maxim that if it costs $1 to fix a bug found in development, it costs $10 to fix if found in QA and $100 to fix once it is in production. This is one reason we consider processes such as unit tests and code reviews critical to technology organizations. Finding bugs earlier saves money by reducing the effort required of the tech team, and no matter how large or small the business, getting the most out of the tech team is vital.

The process that we’re focusing on in this post is code review. The engineers reading this are probably groaning out loud and about to stop reading, but we implore you to read on. Implementing a code review process in the correct manner can not only save the business money but also be a valuable tool for engineering cross training, mentoring, and professional development. There are many different methods of implementing code reviews, from group meetings to pair programming. We have seen many of these methods, and while any method is better than no code review, the one with which we’ve seen the most success, as measured by engineering contentment, long term continuation, and defect identification, is the one-on-one peer review.

One-on-one peer code review is conducted between two engineers who can interact in person, on the phone, or via email. The reviewer typically gets assigned a feature to review prior to the code being promoted to the testing branch. This serves as one of the final steps in development before formal testing begins. One reason this individual review is more effective is the general nature of coding, which typically involves periods of quiet concentration alone. There is no reason to expect that reviewing code is any different than writing it in the first place. Secondly, engineers are much more receptive to feedback in private, and reviewers are much more likely to ask tough questions via email with the developer than in a group setting with management present.

A good resource for more details about the benefits of peer reviews is the book “Best Kept Secrets of Peer Code Review” by Jason Cohen from Smart Bear, Inc. Admittedly Smart Bear is selling software for peer code reviews, but even with this bent the book is a good source of information about studies and the history of code reviews. One study cited in the book demonstrates why code reviews conducted in meetings are not as effective as code reviews done one-on-one: engineers spent 25% of the time reading the code in prep for the meeting and 75% of the time in the code review meetings. Interestingly, 80% of the defects were found during reading and only 20% were discovered during the meetings.
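To make those percentages concrete, here is a small sketch of the arithmetic. It uses only the four figures quoted above; the idea of dividing "share of defects" by "share of time" to get a relative defect-finding rate is our own illustration, not a metric taken from the book.

```python
# Figures as quoted above: the share of total effort spent on each activity
# and the share of all defects found during it.
reading = {"time_share": 0.25, "defect_share": 0.80}
meeting = {"time_share": 0.75, "defect_share": 0.20}


def defects_per_unit_time(activity):
    """Relative defect-finding rate: share of defects per share of time."""
    return activity["defect_share"] / activity["time_share"]


reading_rate = defects_per_unit_time(reading)  # 0.80 / 0.25
meeting_rate = defects_per_unit_time(meeting)  # 0.20 / 0.75
ratio = reading_rate / meeting_rate

print(f"reading: {reading_rate:.2f}, meeting: {meeting_rate:.2f}, ratio: {ratio:.1f}x")
```

Under these numbers, an hour spent reading the code alone turned up roughly twelve times as many defects as an hour spent in the meeting.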

Whatever code review process you decide to implement, anything is better than not doing it as long as the team is bought into it as a performance and efficiency enhancer.  Consider providing some reading material on code reviews and let one or more of the engineers propose a code review process.  Engineers who own the solution are much more likely to be excited about it and follow through on it.

How to Scale a Read Subsystem

Monday, October 13th, 2008

Many SaaS systems have a large part of the system dedicated to reading or searching for information. This read or search may be implemented in an ecommerce site as a product/inventory search based on keywords, or in a content site it may be implemented as an unstructured string search or regular expression search against all indexed content.

In high transaction sites, this activity can be extremely taxing and even cost prohibitive on the primary database, so it should be considered a great target for disaggregation along the Y axis of our scale cube, as we describe in our database scaling post.

The AKF scale cube can be applied to read/search subsystems to create multiple dimensions of splits that will allow for near infinite scale.  Below is our cube diagram depicting the three dimensions for scaling a read or search subsystem. 

As with our past cubes, the X axis is the balancing of transaction load against multiple clones.  This allows the system to scale in terms of transactions but not necessarily the size of data.

The Y axis is a split along function or service.  While the read/search subsystem split is a Y axis split of the original database, we can recursively split this by creating read subsystems specific to a product catalog, user-specific information, order history, archived content, current content, recommended content, and on and on.

The Z axis is our modulus, lookup function, or indiscriminate function split.

Let’s look at the X and Z axes to describe a physical system that can be easily scaled for reads. This physical architecture would be composed of aggregators, load balancers, and an NxM matrix of storage systems, where each of the N data tiers holds 1/Nth of the data and each of the M systems within a tier gets 1/Mth of that tier’s transactions. The storage subsystems don’t have to be relational databases; they could be in-memory data stores such as memcached or Berkeley DB instances.

Read or search requests are load balanced to one of X aggregators (X scales with the number of requests being made), each of which subsequently makes N requests, one to each of the N unique data sets, through a load balancer to one of the M systems within each data tier. The N unique responses are aggregated, sorted, and returned to the customer.

The benefit of such a system is that you can scale N easily with the number of items listed and M with the number of transactions requested.  If N is sized appropriately, the items can all be returned from memory thereby further increasing transaction speed.
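The request flow described above can be sketched in a few lines. This is a minimal, single-process illustration under our own assumptions: plain dictionaries stand in for the storage systems, `random.choice` stands in for the load balancer, and every name here (`shard_for`, `build_matrix`, `aggregate_search`, the sample catalog) is hypothetical rather than part of any real system.

```python
import random

N_SHARDS = 3    # N: each shard holds roughly 1/Nth of the data (Z axis)
M_REPLICAS = 2  # M: clones of each shard share its read load (X axis)


def shard_for(item_id):
    """Z-axis split: a simple modulus picks the shard that owns an item."""
    return item_id % N_SHARDS


def build_matrix(items):
    """Build the N x M matrix: matrix[n][m] is replica m of shard n."""
    shards = [dict() for _ in range(N_SHARDS)]
    for item_id, title in items.items():
        shards[shard_for(item_id)][item_id] = title
    # Every replica of a shard holds the same data; copies stand in for clones.
    return [[dict(shard) for _ in range(M_REPLICAS)] for shard in shards]


def search_replica(replica, keyword):
    """Search one in-memory store (a stand-in for memcached, a database, etc.)."""
    return [(item_id, title) for item_id, title in replica.items()
            if keyword.lower() in title.lower()]


def aggregate_search(matrix, keyword):
    """Aggregator: one request per shard, routed to one of its M replicas."""
    results = []
    for replicas in matrix:
        replica = random.choice(replicas)  # stand-in for the load balancer
        results.extend(search_replica(replica, keyword))
    return sorted(results)  # merge and sort the N partial responses


# Hypothetical sample data for illustration only.
catalog = {1: "Red bicycle", 2: "Blue bicycle", 3: "Green kettle",
           4: "Bicycle pump", 5: "Red kettle", 6: "Bicycle bell"}
matrix = build_matrix(catalog)
print(aggregate_search(matrix, "bicycle"))
```

Growing the catalog means raising `N_SHARDS` so each shard still fits in memory, while growing the request volume means raising `M_REPLICAS`; the aggregator logic does not change in either case.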

If you want higher speed and even greater fault tolerance, you can further split these read subsystems along the Y axis as described previously in this document. 

 

To Get Better You Must Practice

Tuesday, July 15th, 2008

Almost everyone understands that physical activities such as golf, running, or weight lifting require lots of repetitive practice in order to improve, but most people don’t recognize that mental activities and business processes require the same practice. We all studied in school to learn languages, algorithms, etc., but most of us either swore off studying upon graduation or forgot that study and practice are prerequisites to proficiency and excellence. From the engineer to the manager to the CTO, everyone has skills and processes that need to be practiced and critiqued in order to improve.

As professionals no longer under the guidance of professors, we need to take responsibility for our own continuing education. If you think that coding new features will provide you with enough stimuli for expanding your skill set, you should reexamine that idea by looking back over the past twelve months to see what you have learned that makes you a better engineer today. It is most likely that if you have relied solely on assigned features, you have fallen into the trap of using what you know to code them rather than stretching to learn new designs, patterns, or algorithms. The wondrous thing about programming is that there are many ways to solve the same problem, some faster than others, some more elegant than others. We recommend a couple of practices to help engineers continue to learn and perfect their skills, and these can readily be extended to other groups such as QA. The theme behind these recommendations is to leverage the shared knowledge of the entire organization so that everyone learns from each other. This is one reason the whole is greater than the sum of the parts.

Start your engineering all-hands meeting with someone presenting a creative solution to a problem. Have the engineering managers or architects decide whose solutions are interesting enough to share with the group, or leave it to the individual engineers to decide. Another idea for ensuring that you and other engineers continue to learn is to practice code reviews. Engineers sometimes get persnickety about someone reading over their code, but this is a great way for the reviewer to learn new techniques as well as the engineer. A final suggestion is to establish a Joint Application Design process where members of the operations team join the engineers and architects in designing the feature. This inclusion of different perspectives will help broaden all participants’ understanding of technology.

Practicing processes is similar to practicing skills. If you never exercise the process, or you do so in a halfhearted manner, you will never be good at it, and when the time comes that you need that process to work perfectly, it assuredly will not. Some of the processes that get skipped too often are failovers and disaster recoveries. If you don’t practice failing over, then when a failure occurs that requires a failover, the process will not work as expected. It will take far longer than you thought, produce unexpected outcomes, or possibly fail to work at all. Obviously you must do these exercises without impacting the production site, but it is possible to exercise the failover without bringing down production.

Remember the saying popularly attributed to Sun Tzu and The Art of War: “The more you sweat in peace, the less you bleed in war.”

 

Image provided by Mike Kline

Scalability Architect

Thursday, June 26th, 2008

You have probably never heard of a Scalability Architect.   In our vernacular it is someone who specializes in designing system architectures for high availability and scalability.   We think you might want to consider adding one to your roster if you are serious about scaling and keeping your site up. 

At AKF we are all about scalability and availability, for both platforms and businesses. For a SaaS company, it is your lifeblood and must be a core competency to survive and grow. Downtime not only means lost revenue (Amazon’s two-hour downtime last week was estimated at costing them $30K per minute), but it also means losing customers to your competition. While most companies like Amazon calculate the cost of downtime, the real cost can add up for months or years when adding in the loss of customers who never return. We are advocating that companies seriously consider augmenting their architect team with a person or team of people who spend the majority of their time thinking about and working on projects related to long term scalability and availability.

Why is this role different from a traditional architect’s role?  We feel that there is sufficient specificity in technical knowledge, perspective, and focus that a general systems architect will often overlook scalability for more urgent short term matters.  We see this often in our engagements where companies have great architects but they are focused on designing the next feature or introducing a new technology.  They do not have time, ability or experience to focus on longer term scalability issues within the platform.  Often we promote seasoned engineers who have proven their ability to design properly and evaluate technology effectively to the role of architect.  This is perfectly acceptable and is considered the standard career progression.  However, to be a Scalability Architect, the individual needs to have made a study of scalability and availability issues for a number of years.

A key skill or experience to look for in the Scalability Architect is a thorough knowledge of how to split both the application and the database in multiple dimensions (see our application and database splitting posts). Additionally, the ideal candidate will have been through several of these splits before and will have learned some of the pitfalls. Knowing that you need to consider which objects need to be cached with each other can save a lot of redesign and headache.

Are you considering adding a Scalability Architect to your team?  If so let us hear about what made you decide and what skills you think are important.

Risk Assessment

Thursday, June 12th, 2008

One of the key components to high availability is the proper management of risk. Obviously stopping all changes to a site will improve availability in the short run, but in the long run no new features get deployed, so customers stop coming, and no required maintenance gets done, so you end up with more downtime to fix what could have been handled with preventative maintenance. So that is not a good long term strategy for improving availability. Luckily, we have found and proven to ourselves that it is possible to dramatically improve a site’s availability by simply understanding the riskiness of changes. By taking the time to classify changes on a scale of risk and using some basic rules, the management of risk becomes a very useful tool in the quest for uptime.

The tool that we recommend comes originally from the military and the space program, but we were taught it as part of Six Sigma. It is called a Failure Mode and Effects Analysis, or FMEA (pronounced fee’-ma). The first step is to identify the ways in which the code or application can fail, or in our terminology the “failure modes”. We typically ask each engineer or product manager to come up with three to five failure modes for each feature. Once these are gathered, the team should rate each failure mode using these questions: how severe is the impact of the failure (severity), how detectable is it if it fails (detectability, yes we made up that word), and what is the probability or “likelihood” that the failure will occur? We recommend using a scale of 1, 3, and 9 because it provides an exponential weighting that helps identify the riskier items. These scores are then multiplied together to get a total risk score, and the failure modes are ranked from highest risk to lowest. Here is an example risk matrix.

FMEA
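The scoring and ranking described above can be sketched as follows. This is a minimal illustration, not a standard FMEA tool: the `FailureMode` class, the sample features, and their scores are all hypothetical, and we assume that a higher detectability score means the failure is harder to detect, so that all three factors push the risk score in the same direction.

```python
from dataclasses import dataclass

ALLOWED_SCORES = (1, 3, 9)  # the 1, 3, 9 scale recommended above


@dataclass
class FailureMode:
    feature: str
    description: str
    severity: int       # 9 = most severe impact
    detectability: int  # 9 = hardest to detect (assumed orientation)
    likelihood: int     # 9 = most likely to occur

    def __post_init__(self):
        # Reject anything that is not on the 1, 3, 9 scale.
        for name in ("severity", "detectability", "likelihood"):
            if getattr(self, name) not in ALLOWED_SCORES:
                raise ValueError(f"{name} must be one of {ALLOWED_SCORES}")

    @property
    def risk_score(self):
        """Total risk: the three scores multiplied together."""
        return self.severity * self.detectability * self.likelihood


def rank_failure_modes(modes):
    """Order failure modes from highest risk score to lowest."""
    return sorted(modes, key=lambda mode: mode.risk_score, reverse=True)


# Hypothetical failure modes for illustration only.
modes = [
    FailureMode("Sign-up", "Confirmation email is not sent", 3, 1, 3),
    FailureMode("Checkout", "Payment call times out", 9, 3, 3),
    FailureMode("Search", "Index serves stale results", 3, 9, 1),
]

ranked = rank_failure_modes(modes)
for mode in ranked:
    print(f"{mode.risk_score:>4}  {mode.feature}: {mode.description}")
```

With the 1, 3, 9 scale, a single failure mode can score anywhere from 1 to 729, which is what makes the riskiest items stand out in the ranking.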

So, once you have developed this matrix, what can you do with it? For starters, the highest risk items should always have mitigation plans associated with them. These mitigation plans are actions that help lower one or more of the three risk factors (severity, detectability, or likelihood). The second thing that you can do with this matrix is determine a maximum level of risk that you will allow to be placed upon the site in any given time period (1 day, 1 week, or 1 release are all good intervals). As an example, you might determine, either as a guess at first or later through analysis of past performance against the risk score of each release, that 275 is the maximum amount of risk that you feel comfortable with for any release or change to the site within 1 week. Therefore you can only have features in this week’s release whose risk scores total less than 275. Lastly, this should be used in conjunction with other risk mitigation strategies, such as not mixing infrastructure changes with code releases.
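One way to apply such a ceiling is sketched below. The 275 limit is the example figure from the paragraph above; everything else is our own assumption for illustration: the feature names and scores are hypothetical, each feature's score is taken to be the sum of its failure mode scores, and the simple "take features in priority order" rule is just one possible policy.

```python
MAX_RISK_PER_RELEASE = 275  # example ceiling from the text above


def plan_release(features, max_risk=MAX_RISK_PER_RELEASE):
    """Walk features in priority order, keeping total risk below max_risk.

    `features` is a list of (name, risk_score) pairs, highest priority first.
    Returns (selected names, deferred names, total risk of the selection).
    """
    selected, deferred, total = [], [], 0
    for name, risk in features:
        if total + risk < max_risk:  # the release must total less than the ceiling
            selected.append(name)
            total += risk
        else:
            deferred.append(name)    # hold for a later release
    return selected, deferred, total


# Hypothetical candidate features, highest priority first.
candidates = [
    ("new checkout flow", 117),
    ("search ranking tweak", 81),
    ("profile page redesign", 90),
    ("copy change on landing page", 9),
]

selected, deferred, total = plan_release(candidates)
print("release:", selected, "total risk:", total)
print("deferred:", deferred)
```

Here the profile page redesign would push the release to 288, so it is deferred, while the low-risk copy change still fits under the ceiling.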
