<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0.2" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>AKF Consulting Blog</title>
	<link>http://akf-consulting.com/techblog</link>
	<description>Technical and Leadership Thoughts</description>
	<pubDate>Tue, 15 Jul 2008 19:57:46 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.2</generator>
	<language>en</language>
			<item>
		<title>To Get Better You Must Practice</title>
		<link>http://akf-consulting.com/techblog/2008/07/15/to-get-better-you-must-practice/</link>
		<comments>http://akf-consulting.com/techblog/2008/07/15/to-get-better-you-must-practice/#comments</comments>
		<pubDate>Tue, 15 Jul 2008 19:57:46 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>Engineering</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/07/15/to-get-better-you-must-practice/</guid>
		<description><![CDATA[Almost everyone explicitly understands that physical activities such as golf or running or weight lifting require lots of repetitious practice in order to get better but most people don’t recognize that mental activities and business processes require the same practice.  We all studied in school to learn languages, algorithms, etc. but most of us either [...]]]></description>
			<content:encoded><![CDATA[<p>Almost everyone explicitly understands that physical activities such as golf or running or weight lifting require lots of repetitious practice in order to get better but most people don’t recognize that mental activities and business processes require the same practice.  We all studied in school to learn languages, algorithms, etc. but most of us either swore off studying upon graduation or forgot that study and practice are prerequisites to proficiency and excellence.  From the engineer to the manager to the CTO, everyone has skills and processes that need to be practiced and critiqued in order to improve.</p>
<div style="text-align: center"><img style="width: 118px; height: 113px" height="113" src="http://www.akf-consulting.com/images/samurai.jpg" width="118" /></div>
<p>As professionals no longer under the guidance of professors, we need to take responsibility for our own continuing education.  If you think that coding new features will provide you with enough stimuli for expanding your skill set, you should reexamine that idea by looking back over the past twelve months to see what you have learned that makes you a better engineer today.  It is most likely that if you have relied solely on assigned features you have fallen into the trap of using what you know to code them rather than stretching to learn new designs, patterns, or algorithms.  The wondrous thing about programming is that there are many ways to solve the same problem, some faster than others, some more elegant than others.  We recommend a couple of practices to help engineers continue to learn and perfect their skills, and these can readily be extended to other groups such as QA.  The theme behind these recommendations is to leverage the shared knowledge of the entire organization so that people learn from each other.  This is one reason the whole is greater than the sum of the parts.</p>
<p>Start your engineering all hands meeting with someone presenting a creative solution to a problem.  Have the engineering managers or architects decide whose solutions are interesting enough to share with the group, or leave it to the individual engineers to decide.  Another way to ensure that you and other engineers continue to learn is to practice code reviews.  Engineers sometimes get persnickety about someone reading over their code, but this is a great way for both the reviewer and the author to learn new techniques.  A final suggestion is to establish a Joint Application Design process where members of the operations team join the engineers and architects in the design process of the feature.  This inclusion of different perspectives will help broaden all participants’ understanding of technology. </p>
<p>Practicing processes is similar to practicing skills.  If you never exercise the process, or you do so in a halfhearted manner, you will never be good at it, and when the time comes that you need that process to work perfectly it assuredly will not.  Some of the processes that get skipped too often are failovers and disaster recoveries.  If you don’t practice failing over, then when a failure occurs that requires a failover the process will not work as planned.  It will take far longer than you thought, produce unexpected outcomes, or possibly fail to work at all.  Obviously you must practice without impacting the production site, but it is possible to exercise the failover without bringing down production. </p>
<p>Remember the old military maxim, often attributed to Sun Tzu:  “The more you sweat in peace, the less you bleed in war.” </p>
<p><font size="1"><em>Image provided by Mike Kline</em></font>
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/07/15/to-get-better-you-must-practice/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Incenting Success in Technology Organizations</title>
		<link>http://akf-consulting.com/techblog/2008/07/08/incenting-success-in-technology-organizations/</link>
		<comments>http://akf-consulting.com/techblog/2008/07/08/incenting-success-in-technology-organizations/#comments</comments>
		<pubDate>Tue, 08 Jul 2008 14:56:27 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>CTO/CIO</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/07/08/incenting-success-in-technology-organizations/</guid>
		<description><![CDATA[As we’ve discussed before in articles like Be A Leader!, the primary job of a CTO is to help the executive team maximize shareholder value.  Notice our choice of verb in the last sentence, “maximize”.  It is a much stronger word than what an average performing company would select – that word typically being “create”.  [...]]]></description>
			<content:encoded><![CDATA[<p>As we’ve discussed before in articles like <a href="http://akf-consulting.com/techblog/2008/03/03/be-a-leader/">Be A Leader!</a>, the primary job of a CTO is to help the executive team maximize shareholder value.  Notice our choice of verb in the last sentence, “maximize”.  It is a much stronger word than what an average performing company would select – that word typically being “create”.  Maximizing shareholder value is the goal of a high performing team – a team which desires to say that “no other team in our position could provide the type of shareholder return that we do”.</p>
<p>The CTO, however, cannot maximize shareholder value, and potentially cannot even prove that he or she is creating shareholder value, without a set of aggressive goals along with the metrics and measurements that help define success or failure en route to achieving those goals. </p>
<p><img src="http://www.akf-consulting.com/images/graph.jpg" /></p>
<p>We prefer to group our goals thematically, making it easier to determine how the goals impact the maximization of shareholder value.  Our themes include the reduction of cost, availability, the efficiency of engineering spend, the effectiveness of our product selection process, quality, and time to market.</p>
<p><strong>Cost<br />
</strong>No list of aggressive goals is complete without finding a set of goals to minimize the cost of operating a SaaS site.  In our experience, the best cost metrics are those normalized by transaction (cost per transaction) or normalized by cost of transaction type (cost per checkout, cost per signup, etc).  The associated goal is to reduce the cost by some relative value over time or to reduce the cost to an absolute value thereby increasing profit and shareholder returns.</p>
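<p>As a minimal sketch of the arithmetic, the Python snippet below normalizes cost by transaction and checks it against a relative reduction goal.  The function names, the dollar amounts, and the 10% target are our own illustrative assumptions, not figures from a real site.</p>
<pre>
```python
def cost_per_transaction(total_cost: float, transactions: int) -> float:
    """Operating cost normalized by the number of transactions served."""
    if transactions == 0:
        raise ValueError("no transactions in the period")
    return total_cost / transactions


def relative_reduction(previous: float, current: float) -> float:
    """Fractional drop in unit cost between two periods (0.10 means 10%)."""
    return (previous - current) / previous


# Hypothetical quarterly figures, for illustration only.
last_quarter = cost_per_transaction(120_000.0, 2_000_000)
this_quarter = cost_per_transaction(126_000.0, 2_400_000)

# Assumed goal: cut cost per transaction by at least 10% quarter over quarter.
goal_met = relative_reduction(last_quarter, this_quarter) >= 0.10
```
</pre>
<p>Note that total spend went up in this made-up example while the cost per transaction went down, which is exactly why we prefer the normalized metric.</p>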
<p><strong>Availability</strong><br />
No SaaS site can realistically operate in this day and age without considering the impact of availability on revenue.  Our desire here is to identify the lost opportunity (in most cases lost revenue) associated with outages rather than just the amount of downtime a site has.  While measuring absolute downtime is valuable and should be tracked if possible, the measurement of revenue loss as a percentage calculation is more easily associated with shareholder value maximization (less revenue loss the better) and further takes into consideration that most sites don’t produce as much revenue in the middle of the night as they do during the middle of the day.</p>
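<p>To illustrate why we weight outages by revenue rather than by minutes alone, here is a rough Python sketch.  The hourly revenue curve and the <code>revenue_lost_pct</code> helper are hypothetical, and the sketch assumes you can estimate expected revenue for each hour of the day.</p>
<pre>
```python
def revenue_lost_pct(outages, expected_revenue_per_hour):
    """Share of expected daily revenue lost to outages, as a percentage.

    outages: list of (hour_of_day, hours_down) tuples.
    expected_revenue_per_hour: 24 expected revenue figures, one per hour.
    """
    lost = sum(expected_revenue_per_hour[hour] * hours_down
               for hour, hours_down in outages)
    return 100.0 * lost / sum(expected_revenue_per_hour)


# Hypothetical curve: $1,000 per hour from 08:00 to 20:00, $100 per hour overnight.
curve = [1000.0 if hour in range(8, 20) else 100.0 for hour in range(24)]

night = revenue_lost_pct([(3, 0.5)], curve)    # 30 minutes down at 03:00
midday = revenue_lost_pct([(12, 0.5)], curve)  # 30 minutes down at 12:00
```
</pre>
<p>Both outages last thirty minutes, yet under these assumed numbers the midday outage costs ten times as much of the day’s revenue as the overnight one.</p>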
<p><strong>Engineering Efficiency and Productivity</strong><br />
You can’t be maximizing shareholder value if you aren’t measuring and improving your engineering team.  These measurements are arguably difficult, but we try to break them into two component parts: </p>
<p>1) Efficiency - How many engineering days are you getting out of the theoretical maximum?  This is a measurement of how many engineering days you lose due to environment issues, training problems, tool issues, etc.  Most organizations that don’t measure this are surprised that their engineers spend well over 33% of their time on things other than designing systems and writing code.</p>
<p>2) Productivity - How much do you produce per engineering day?  This one is tougher, and there are lots of metrics from which you can select: KLOC, stories, function points, etc.  All of them have issues, but that’s no excuse not to select the best one for you and measure how well you are doing.</p>
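<p>The two components can be sketched in a few lines of Python.  The team size, the days lost, and the choice of stories as the unit of output are assumptions made up for this example.</p>
<pre>
```python
def efficiency(productive_days: float, available_days: float) -> float:
    """Fraction of the theoretical maximum engineering days spent building product."""
    return productive_days / available_days


def productivity(units_delivered: float, productive_days: float) -> float:
    """Output per productive engineering day, in whatever unit you picked."""
    return units_delivered / productive_days


# Hypothetical two-week period: 10 engineers x 10 working days, 35 days lost
# to environment, tooling, and training issues, 26 stories delivered.
available = 10 * 10
lost = 35
productive = available - lost

team_efficiency = efficiency(productive, available)
stories_per_day = productivity(26, productive)
```
</pre>
<p>Tracking the two numbers separately shows whether a slow period came from lost days (an efficiency problem) or from low output on the days that were actually worked (a productivity problem).</p>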
<p><strong>Product Efficacy</strong><br />
Simply put, this is a measure of how your product choices are performing.  You undoubtedly have more ideas than you can implement in any given year.  Are you choosing the right things?  Are you hitting your key metrics such as increasing revenue, decreasing drop outs, or increasing signups?</p>
<p><strong>Time to Market</strong><br />
Assuming that you are building the right things, are you getting them out to the market in time to create barriers to entry and/or switching costs?  Are you faster or slower than your competitors?</p>
<p><strong>Quality</strong><br />
How defect dense is your product?  Are you fixing the problems in engineering and product management that lead to bugs in production?  Are you making the right time, cost and quality tradeoffs?  How many defects do you introduce per new release, line of code or story?</p>
<p>You may have several other key metrics that you use and which you find valuable and we’d love to hear about them.  What you cannot do, at least without significantly damaging shareholder value, is ignore the need for improvement.  You simply cannot improve your team’s performance without a core set of metrics against which you measure absolute and relative performance.  And if you are not measuring your performance you simply cannot increase and ideally maximize shareholder value.
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/07/08/incenting-success-in-technology-organizations/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Joint Application Design &#038; Architecture Review Board</title>
		<link>http://akf-consulting.com/techblog/2008/07/03/joint-application-design-architecture-review-board/</link>
		<comments>http://akf-consulting.com/techblog/2008/07/03/joint-application-design-architecture-review-board/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 19:21:08 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>CTO/CIO</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/07/03/joint-application-design-architecture-review-board/</guid>
		<description><![CDATA[We have mentioned a couple key processes in other posts that we want to explain in a little more detail.  These two fundamental processes to producing scalable and highly available architectures are the Joint Application Design (JAD) and Architecture Review Board (ARB).  These two processes help create strong bonds of communication between organizations thereby enabling [...]]]></description>
			<content:encoded><![CDATA[<p>We have mentioned a couple key processes in other posts that we want to explain in a little more detail.  These two fundamental processes to producing scalable and highly available architectures are the Joint Application Design (JAD) and Architecture Review Board (ARB).  These two processes help create strong bonds of communication between organizations thereby enabling shared ownership of products by all of the organizational disciplines within the extended technology team.  These processes can fit into any PDLC be it waterfall, iterative (including Agile), or any variant of those.  If you don’t have similar processes in place, we highly recommend you consider adding them. </p>
<p>The JAD is usually accomplished through a series of small meetings where the architecture and design of any feature of significant size is discussed.  The participants of the JAD are the engineers assigned to a feature along with the operations/infrastructure engineers who have been assigned to assist with the feature in question.  Ideally, the meetings are held early in the development process to ensure that the design of the feature receives input from both software and operations engineers and that it does not violate the architecture principles of scalability and availability.  In an Agile development process these people can be normal members of the project team augmented by DBAs or systems administrators.  The JAD members will present to the ARB if the feature meets the criteria for board review.</p>
<p>The ARB is intended to catch potential scale and availability problems before they are launched to the site.  The ARB team should consist of the highest quality software and hardware engineers and members of the leadership team.  The membership of the ARB should ideally be static (i.e. change very little over time).  The ARB should convene once every development cycle (monthly is usually sufficient) to review all features that either require more than a specified number of development days (e.g. 5) or introduce a significant new technology (caching, language, service, etc).  The ARB members should have a set of clearly defined architectural principles against which to test the new product by asking questions such as “How does this allow us to scale horizontally, maintain higher availability, etc”.  The development engineers and operations engineers who are responsible for the design of the feature present to the board, and the board decides whether the feature was designed in such a manner that it will meet the scalability and availability requirements. </p>
<p>Hopefully these descriptions of the processes will give you a general understanding of what is required and help you see why they are critically important to the development of scalable architectures.  There are obviously a lot of details about each of the processes that we have not covered in this post, but this should get you started. 
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/07/03/joint-application-design-architecture-review-board/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Scalability Architect</title>
		<link>http://akf-consulting.com/techblog/2008/06/26/scalability-architect/</link>
		<comments>http://akf-consulting.com/techblog/2008/06/26/scalability-architect/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 20:00:53 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>Engineering</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/06/26/scalability-architect/</guid>
		<description><![CDATA[You have probably never heard of a Scalability Architect.   In our vernacular it is someone who specializes in designing system architectures for high availability and scalability.   We think you might want to consider adding one to your roster if you are serious about scaling and keeping your site up. 
At AKF we are all about scalability [...]]]></description>
			<content:encoded><![CDATA[<p>You have probably never heard of a Scalability Architect.   In our vernacular it is someone who specializes in designing system architectures for high availability and scalability.   We think you might want to consider adding one to your roster if you are serious about scaling and keeping your site up. </p>
<p>At AKF we are all about scalability and availability – for both platforms and businesses.  For a SaaS company, it is your lifeblood and must be a core competency to survive and grow.  Downtime not only means lost revenue (Amazon’s 2-hour downtime last week was estimated to have cost them $30K per minute), but it also means losing customers to your competition.  While most companies like Amazon calculate the cost of downtime, the real cost can add up for months or years once you add in the loss of customers who never return.  We are advocating that companies seriously consider augmenting their architect team with a person or team of people who spend the majority of their time thinking about and working on projects related to long term scalability and availability. </p>
<p>Why is this role different from a traditional architect’s role?  We feel that there is sufficient specificity in technical knowledge, perspective, and focus that a general systems architect will often overlook scalability for more urgent short term matters.  We see this often in our engagements where companies have great architects but they are focused on designing the next feature or introducing a new technology.  They do not have time, ability or experience to focus on longer term scalability issues within the platform.  Often we promote seasoned engineers who have proven their ability to design properly and evaluate technology effectively to the role of architect.  This is perfectly acceptable and is considered the standard career progression.  However, to be a Scalability Architect, the individual needs to have made a study of scalability and availability issues for a number of years.</p>
<p>A key skill or experience to look for in the Scalability Architect is a thorough knowledge of how to split both the application and the database in multiple dimensions (see our <a href="http://akf-consulting.com/techblog/2008/05/08/splitting-applications-or-services-for-scale/">application</a> and <a href="http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/">database</a> splitting posts).  Additionally, the ideal candidate will have been through several of these splits before and will have learned some of the pitfalls.  Knowing that you need to consider which objects need to be cached with each other can save a lot of redesign and headache.</p>
<p>Are you considering adding a Scalability Architect to your team?  If so let us hear about what made you decide and what skills you think are important.
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/06/26/scalability-architect/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Business Acumen and the CIO/CTO</title>
		<link>http://akf-consulting.com/techblog/2008/06/20/business-acumen-and-the-ciocto/</link>
		<comments>http://akf-consulting.com/techblog/2008/06/20/business-acumen-and-the-ciocto/#comments</comments>
		<pubDate>Fri, 20 Jun 2008 14:45:51 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>CTO/CIO</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/06/20/business-acumen-and-the-ciocto/</guid>
		<description><![CDATA[In an earlier article we discussed how technical the CEO needed to be in a technology company.  No discussion on this topic would be complete without addressing how business savvy the CIO or CTO needs to be in nearly any company.
In keeping with our “bottom line up front” tradition, the executive in charge of technology [...]]]></description>
			<content:encoded><![CDATA[<p>In an earlier article we discussed <a href="http://akf-consulting.com/techblog/2008/02/25/how-technical-does-the-ceo-need-to-be/">how technical the CEO needed to be in a technology company</a>.  No discussion on this topic would be complete without addressing how business savvy the CIO or CTO needs to be in nearly any company.</p>
<p>In keeping with our “bottom line up front” tradition, the executive in charge of technology decisions needs to be a leader first, a business executive second and a technology decision maker last.  That is not to say that this executive should not also have some understanding of technology, rather it is our position that their primary role is to help make the right business decisions as they relate to technology.</p>
<p>Unfortunately, most technologists do not learn about business, finance or marketing within their undergraduate or graduate courses and most non-technologists do not have an opportunity to learn about the inner workings of technology within their fields of studies.  As a result, the teams have very little in common when it comes to training and they often find it hard to communicate and find common ground.  This is very different from the relationships that exist between other disciplines within a company like marketing and finance wherein most of the employees within those organizations have had some exposure to the fundamentals of the other organizations.  We refer to this gap between the technology organization and other organizations as the “experiential chasm” and it is the role of the chief technology executive within a company to partner with the CEO to build a bridge across this chasm. </p>
<p>Just as we have argued that the CEO needs to make an attempt to better understand technology,  technology process and the “physics” of product development (like technical project management, Brooks’ law, etc) so must the CTO/CIO better understand the fundamentals of the business in which they operate.  Just as importantly, the CTO/CIO should also understand the fundamentals of each organization’s responsibilities.</p>
<p>For example, while the chief technology expert does not need to be the expert on capital markets, he or she should be able to debate the relative merits and issues associated with the assumption of debt vs. the issuing of equity.  He or she should also be able to completely understand each of the statements used in running a company (e.g. Income Statement, SOCF, and balance sheet).  From a marketing perspective it is important that the person understand such basics as the 5Cs and the 6Ms to name just a few.  From a strategy perspective, it is useful to understand such basics as Porter’s forces.  These topics just scratch the surface and in no way are meant to be an all encompassing list.</p>
<p>Not having a background in such topics means that you cannot effectively function as part of the senior executive team or executive committee.  And not contributing as part of the executive team means that you are not performing your responsibilities in helping to maximize shareholder wealth.  And, of course, if you cannot help maximize shareholder wealth you simply should not be in your job.</p>
<p>We are not arguing that you need to go get an MBA to be effective or to provide value in the boardroom, though getting an MBA or going to an executive MBA program is certainly a great way to jumpstart the process.  We are arguing that it is absolutely your job to get better every day in the things that you do not know and are essential to an appropriate level of performance.  Here are some ideas:</p>
<p><strong>Develop a professional reading list</strong><br />
Seek out ideas of great books on each of the functional areas within your company and read and learn.  We will post our recommended reading list soon.</p>
<p><strong>Take community college business classes</strong><br />
You do not have to take masters level classes to learn basic business concepts.  Your local community college probably has some first and second year undergraduate classes that will fit your needs and your schedule.</p>
<p><strong>Take online classes in each of the disciplines</strong><br />
This is the information age after all, and we can all leverage the internet to learn.  We recommend taking structured course work as it is one of the easiest ways to learn.</p>
<p><strong>Discuss business concepts and seek help from peers</strong><br />
Be honest with yourself and with your peers.  You might think it shows a weakness, but it actually builds trust and strengthens relationships.  Your peers will walk away thinking “Here is a person who really wants to know how this works”.  Think about it – wouldn’t you have great respect for a peer who wanted to know more about technology?</p>
<p><strong>Start an Executive MBA Program<br />
</strong>This is probably the best and easiest way to get a good foundation in all of the areas, but it is also the most costly.  There is a chance that your company will pay for it and there are several great schools with very flexible programs including weekend and evening coursework or accelerated programs that limit your time away from work.
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/06/20/business-acumen-and-the-ciocto/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Risk Assessment</title>
		<link>http://akf-consulting.com/techblog/2008/06/12/risk-assessment/</link>
		<comments>http://akf-consulting.com/techblog/2008/06/12/risk-assessment/#comments</comments>
		<pubDate>Thu, 12 Jun 2008 14:06:29 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>Engineering</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/06/12/risk-assessment/</guid>
		<description><![CDATA[One of the key components to high availability is the proper management of risk.  Obviously stopping all changes to a site will improve the availability in the short run but in the long run, no new features get deployed so customers stop coming and no required maintenance gets done so you end up with more [...]]]></description>
			<content:encoded><![CDATA[<p>One of the key components to high availability is the proper management of risk.  Obviously stopping all changes to a site will improve the availability in the short run, but in the long run no new features get deployed, so customers stop coming, and no required maintenance gets done, so you end up with more downtime to fix what could have been handled with preventative maintenance.  So that is not a good long term strategy for improving availability.  Luckily, we have found and proven to ourselves that it is possible to dramatically improve a site’s availability by simply understanding the riskiness of changes.  By taking the time to classify changes on a scale of risk and using some basic rules, the management of risk becomes a very useful tool in the quest for uptime.</p>
<p>The tool that we recommend comes originally from the military and the space program, but we were taught it as part of Six Sigma.  It is called a Failure Mode and Effects Analysis or FMEA (pronounced fee’-ma).  The first step is to identify the ways in which the code or application can fail, or in our terminology the “failure modes”.  We typically ask each engineer or product manager to come up with three to five failure modes for each feature.  Once these are gathered, the team should rate each failure mode using three questions: how severe is the impact of the failure (severity), how detectable is it if it fails (detectability, yes we made up that word), and what is the probability or “likelihood” that the failure will occur?  We recommend using a scale of 1, 3, and 9, because it provides an exponential weighting to help identify the riskier items.  These scores are then multiplied together to get a total risk score and the failure modes are ranked from highest risk to lowest.  Here is an example risk matrix.</p>
<p><a href="http://www.akf-consulting.com/images/fmea.jpg"><img title="FMEA" alt="FMEA" src="http://www.akf-consulting.com/images/fmea.jpg" /></a></p>
<p>So, once you have developed this matrix, what can you do with it?  For starters, the highest risk items should always have mitigation plans associated with them.  These mitigation plans are actions that help lower one or all of the three risks (severity, detectability, or likelihood).  The second thing that you can do with this matrix is determine a maximum level of risk that you will allow to be placed upon the site in any given time period (1 day,  1 week, or 1 release are all good intervals or timeframes).  As an example, you might determine either as a guess at first or later through analysis of past performance associated with the risk score of each release, that 275 is the maximum risk amount that you feel comfortable with for any release or change to the site within 1 week.  Therefore you can only have features in this week’s release that total less than 275 on the risk scale.  Lastly, this should be used in conjunction with other risk mitigation strategies, such as not mixing infrastructure changes with code releases.</p>
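<p>The scoring and the release ceiling can be sketched in Python as follows.  This is only an illustration: the failure modes are invented, we assume that a detectability of 9 means “hardest to detect”, and we assume that a release’s risk is the sum of its failure mode scores, compared against the 275 ceiling used in the example above.</p>
<pre>
```python
from dataclasses import dataclass

SCALE = (1, 3, 9)  # the 1/3/9 rating scale


@dataclass
class FailureMode:
    feature: str
    description: str
    severity: int
    detectability: int  # assumption: 9 means hardest to detect
    likelihood: int

    def risk(self) -> int:
        """Total risk score: the three ratings multiplied together."""
        for rating in (self.severity, self.detectability, self.likelihood):
            if rating not in SCALE:
                raise ValueError(f"rating {rating} is not on the 1/3/9 scale")
        return self.severity * self.detectability * self.likelihood


def rank(modes):
    """Failure modes ordered from highest risk to lowest."""
    return sorted(modes, key=lambda mode: mode.risk(), reverse=True)


def fits_budget(modes, max_risk=275):
    """True when the combined risk of a release stays under the ceiling."""
    return max_risk > sum(mode.risk() for mode in modes)


# Hypothetical failure modes, invented for illustration.
release = [
    FailureMode("signup", "welcome email not sent", 1, 3, 3),
    FailureMode("checkout", "payment call times out", 9, 3, 3),
    FailureMode("checkout", "tax rounded incorrectly", 3, 9, 1),
]
ranked = rank(release)
total = sum(mode.risk() for mode in release)
ok_to_ship = fits_budget(release)
```
</pre>
<p>In this made-up release the payment timeout ranks first and is the obvious candidate for a mitigation plan, and the combined score leaves room under the ceiling for additional features.</p>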
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/06/12/risk-assessment/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Top 20 Mistakes in Technology</title>
		<link>http://akf-consulting.com/techblog/2008/06/09/top-20-mistakes-in-technology/</link>
		<comments>http://akf-consulting.com/techblog/2008/06/09/top-20-mistakes-in-technology/#comments</comments>
		<pubDate>Mon, 09 Jun 2008 17:23:34 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>CTO/CIO</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/06/09/top-20-mistakes-in-technology/</guid>
		<description><![CDATA[We often get asked to encapsulate our experience into a top 10 list for CTOs and CEOs. As is the case in golf, in technology it is as much about ensuring that your bad hits (aka blunders, mistakes, and failures) are recoverable as it is ensuring that you nail your great hits or successes. We [...]]]></description>
			<content:encoded><![CDATA[<p>We often get asked to encapsulate our experience into a top 10 list for CTOs and CEOs. As is the case in golf, in technology it is as much about ensuring that your bad hits (aka blunders, mistakes, and failures) are recoverable as it is ensuring that you nail your great hits or successes. We are all going to have failures in our careers but avoiding the really big pitfalls will help ensure that we keep our companies and our products on the right growth path.</p>
<p>So, without further ado, and in keeping with our high standards of “raising the bar”, here are the top 20 things (rather than 10 and in no particular order) we believe are most important to avoid when developing platforms:</p>
<p><strong>1) Failing to design for rollback</strong></p>
<p>We said these were in no particular order, but right out of the gate we are going to provide an exception to the rule. If you are developing a SaaS platform and you can only make one change to your current process, make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible, but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad code roll, in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward”, for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job <strong>DO THIS BEFORE ANYTHING ELSE</strong>; if you have been in your job for a while and have not done this <strong>DO THIS TOMORROW</strong>.</p>
<p><strong>2) Confusing product release with product success</strong></p>
<p>Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15%, or increasing the average sale price of all checkouts by 12%, or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. The point here is that you are paid to increase shareholder wealth, so have success parties when you achieve objectives specifically tied to that wealth creation. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy.</p>
<p><strong>3) Insular product development/engineering</strong></p>
<p>How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Use best practices like teaming or a process that we later will discuss called Joint Applications Development. We are not arguing that designs should be done by committee, but rather that collaborative designs with a clear owner and decision maker are better than designing without input or checks and balances.</p>
<p><strong>4) Over engineering the solution</strong></p>
<p>Your job is to maximize shareholder value as cost effectively as possible. To that end, one of your mottos should be “simple solutions to complex problems”. The simpler the solution, the lower the cost and the more likely it is that it will be easily and cost effectively maintained. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.</p>
<p><strong>5) Allowing history to repeat itself</strong></p>
<p>Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize shareholder value and grounds for dismissal. In the operations world, a failure to correlate past site incidents and find thematically related root causes should be a cause for termination. The best and easiest way to improve our future performance is to track our past failures, group them by cause and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.</p>
<p><strong>6) Scaling through third parties</strong></p>
<p>Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. See our articles on <a href="http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/">database scalability</a> and <a href="http://akf-consulting.com/techblog/2008/05/08/splitting-applications-or-services-for-scale/">platform scalability</a>. This is not to say that after you design your system to scale horizontally you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases provide log shipping to keep read or standby databases in sync with the primary. Per our discussion in <a href="http://akf-consulting.com/techblog/2008/04/01/a-case-for-technology-agnostic-design-tad/">technology agnostic design</a>, define how your platform scales through your efforts, not through the systems that a third party vendor or open source software company provides. If you say “we use ACME database clusters to scale our database”, we would argue you have the wrong solution. If, on the other hand, you say “we split our databases into read and write systems and further split them by customer id”, you are attacking the problem appropriately.</p>
<p><strong>7) Relying on QA to find your mistakes</strong></p>
<p>You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering. Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering. </p>
<p><strong>8) Revolutionary or “big bang” fixes</strong></p>
<p>In our experiences, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the expected ROI to complete and disastrous failures. 9 out of 10 times they are simply not warranted and should be avoided. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.</p>
<p><strong>9) </strong><strong>The Multiplicative Effect of Failure</strong></p>
<p>Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of the availabilities of all of the service calls is your theoretical availability. 5 calls is (.99999)^5, or about 99.995% availability. Eliminate synchronous calls wherever possible and create <a href="http://akf-consulting.com/techblog/2008/05/30/fault-isolative-architectures-or-%e2%80%9cswimlaning%e2%80%9d/">fault-isolative architectures</a> to help you identify problems quickly.</p>
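<p>A minimal sketch of this arithmetic in Python, assuming every call in the chain must succeed and that failures are independent of one another. The function names are ours and purely illustrative.</p>

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(per_service, calls):
    """Theoretical availability of a chain of synchronous calls.

    Assumes each call has the same availability and fails independently.
    """
    return per_service ** calls


def downtime_minutes_per_year(availability):
    """Expected minutes of unavailability in a 365-day year."""
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.99999
chain_of_five = serial_availability(single, 5)

print(f"one service:   {single:.5%}")
print(f"five in a row: {chain_of_five:.5%}")
print(f"downtime grows from {downtime_minutes_per_year(single):.1f} "
      f"to {downtime_minutes_per_year(chain_of_five):.1f} minutes per year")
```

<p>Under these assumptions, chaining five “five nines” services costs you roughly five times the expected downtime of a single service, which is why each synchronous call you remove matters.</p>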
<p><strong>10) </strong><strong>Failing to create and incent a culture of excellence</strong></p>
<p><a href="http://akf-consulting.com/techblog/2007/12/07/seed-feed-and-weed-to-succeed/">Bring in the right people</a> and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. Read our article on <a href="http://akf-consulting.com/techblog/2008/03/03/be-a-leader/">being a leader</a>.</p>
<p><strong>11) </strong><strong>Under-engineering for scale</strong></p>
<p>The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now. That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.</p>
<p><strong>12) </strong><strong>“Not Built Here” Culture</strong></p>
<p>We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point on relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.</p>
<p><strong>13) </strong><strong>A new PDLC will fix my problems</strong></p>
<p>Too often CTOs see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and look for something to blame. The PDLC is often the biggest target of this blame. Too often people believe that changing the process without addressing root causes will fix the problem. Going from Waterfall to Scrum or from Scrum to RUP is not the complete answer. All organizations are different in terms of level of skills, maturity level (as in the Capability Maturity Model), structure, and culture, so each organization needs to perform its own evaluation, but here are some problems that we see over and over again in organizations blaming their PDLC.</p>
<p>A lack of involvement and ownership from the business tops the list of problems. In the Scrum model there needs to be consistent involvement from the business or product owner. If this is not the case, it is impossible to follow the Scrum principles. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Change the PDLC if there are valid reasons such as increasing engineering productivity or a better cultural fit but do not change it before addressing the core issues. Most often, the biggest problem with your PDLC is the lack of project management to meet dates and the lack of an appropriate “product discovery” phase to meet customer needs and demands. Changing your PDLC won’t address either of these issues; properly managing your teams to meet dates and appropriately understanding customer needs will help fix these problems.</p>
<p><strong>14) </strong><strong>We cannot hire great people quickly</strong></p>
<p>Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past is the interview day, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per month is a great way to get most of your interviewing done in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted and will make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.</p>
<p><strong>15) It is a SPOF (Single Point of Failure) but we can recover it onto another host quickly</strong></p>
<p>A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software do: they work for a long time and then eventually they fail! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend always deploying in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.</p>
<p><strong>16) No Business Continuity plan</strong></p>
<p>No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge like Hurricane Katrina, where it takes weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.</p>
<p><strong>17) No Disaster Recovery Plan</strong></p>
<p>Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS based company, the site and the services it provides are the company’s sole source of revenue. Moreover, with a SaaS company, you hold all the data for your customers that allows them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your colocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple colocation facilities, but if that is not yet technically feasible or not in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both colocation services and hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.</p>
<p><strong>18) No Product Management team or person</strong></p>
<p>In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.</p>
<p><strong>19) It is okay to bring the site down to roll code</strong></p>
<p>Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will roll out a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down. It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.</p>
<p><strong>20) Firewalls, Firewalls, Everywhere!</strong></p>
<p>We often see technology teams that have put all public facing services behind firewalls, while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced against the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to decide what the right balance of risk and benefit is for your site.</p>
<p> </p>
<p>And for those that made it all the way through this long, long post here is one of the designs that we are considering for our new logo.  Let us know what you think.  </p>
<p> <img style="width: 152px; height: 142px" height="142" src="http://www.akf-consulting.com/images/shorts.jpg" width="152" /></p>
<p> 
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/06/09/top-20-mistakes-in-technology/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>The Bug is in the Code!</title>
		<link>http://akf-consulting.com/techblog/2008/06/04/the-bug-is-in-the-code/</link>
		<comments>http://akf-consulting.com/techblog/2008/06/04/the-bug-is-in-the-code/#comments</comments>
		<pubDate>Wed, 04 Jun 2008 14:08:25 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>CTO/CIO</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/06/04/the-bug-is-in-the-code/</guid>
		<description><![CDATA[We are engineers by training and vocation, so we understand what it is like to be a software developer. Too often during the course of any site or product problem we hear developers saying “It can’t be the code”. In our experience it is most often the case that the code is the problem. That [...]]]></description>
			<content:encoded><![CDATA[<p>We are engineers by training and vocation, so we understand what it is like to be a software developer. Too often during the course of any site or product problem we hear developers saying “It can’t be the code”. In our experience it is most often the case that the code is the problem. That is not to say that we have not seen our share of operating system, database, webserver and application server bugs, but statistically you are going to be right way more often by suspecting the code first. Here is why that is so.</p>
<p><img src="http://www.akf-consulting.com/images/hair.jpg" /></p>
<p>As we mentioned, operating systems, databases and any other piece of third party or open source software including firmware have bugs. But these pieces of software are changed far less frequently than your SaaS application code and the amount of testing performed before a release is more often than not an order of magnitude or more greater than what you are performing. And that is okay, as you are working in two completely different worlds where the cost of a defect and the opportunity cost of a delay resulting from testing are much different. A bug in your code that slows your application from 2sec response to 5sec is terrible but you should be able to quickly recover from it assuming that you have designed for rollback and have processes to quickly “fix forward” any release. A bug in a database that causes a loss of data integrity is disastrous because hundreds of thousands of organizations rely on that database to keep their data safe. So, given the likely differences in code quality, defect density and change frequency, you would be better off always suspecting your code first but there is another reason as well.</p>
<p>A simple but golden rule is whatever changed last caused the problem. This is one reason we harp so much on a rigorous change management process. Since you likely update the code between ten and twenty times more often than you update a piece of infrastructure it is reasonable to suspect your frequently changing code is the culprit. Even with this overwhelming evidence, the argument that engineers will typically use is that the one place in the code that is responsible for the broken feature has been checked and is fine. The number of times we have seen a fourth, fifth or sixth attempt to find a defect in the code yield a bug would astound you, further proving our point that “the defect is in the code”. Reading the code without a critical eye, and without the conviction that the bug is there waiting to be found by you, will guarantee that you will not find the defect. In addition, most code bases have a pretty high cyclomatic complexity. This is a fancy term for the number of independent paths through the code, usually measured per class or method. If something has 50 – 100 logical paths most of us cannot keep them straight in our head and thus should be using unit tests to verify them, but that is for a different post.</p>
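<p>To make the term concrete, here is a rough Python sketch that approximates cyclomatic complexity as one plus the number of decision points in a piece of source code. The set of node types we count is a simplifying assumption of ours; real analysis tools handle more constructs, and the sample function is made up for illustration.</p>

```python
import ast

# Node types treated as decision points in this simplified count.
# This list is an assumption for illustration, not a complete definition.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate complexity: 1 + number of decision points in the source."""
    tree = ast.parse(source)
    count = 1
    for node in ast.walk(tree):
        if isinstance(node, DECISION_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra short-circuit branches.
            count += len(node.values) - 1
    return count


# A made-up function, parsed but never executed.
SAMPLE = '''
def ship(order):
    if order.paid and order.in_stock:
        for item in order.items:
            if item.fragile:
                pack_carefully(item)
    else:
        refund(order)
'''

print(cyclomatic_complexity(SAMPLE))
```

<p>Even this tiny function has several paths to reason about; a method with dozens of them is where a developer's “I checked it and it is fine” becomes least reliable.</p>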
<p>The bottom line is have every engineering discipline look in earnest for the possible cause. The bug is in your code more often than not. As our childhood friend Dr. Seuss would say, it is 98 and 3/4% guaranteed.</p>
<p><font size="1">*Image courtesy of krelic from flickr creative commons</font>
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/06/04/the-bug-is-in-the-code/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Fault Isolative Architectures or “Swimlaning”</title>
		<link>http://akf-consulting.com/techblog/2008/05/30/fault-isolative-architectures-or-%e2%80%9cswimlaning%e2%80%9d/</link>
		<comments>http://akf-consulting.com/techblog/2008/05/30/fault-isolative-architectures-or-%e2%80%9cswimlaning%e2%80%9d/#comments</comments>
		<pubDate>Fri, 30 May 2008 23:03:34 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>Engineering</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/05/30/fault-isolative-architectures-or-%e2%80%9cswimlaning%e2%80%9d/</guid>
		<description><![CDATA[Two of our previous articles, Splitting Databases for Scale and Splitting Applications or Services for Scale have made references to a concept that we call “Swimlaning Architectures”.
The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or [...]]]></description>
			<content:encoded><![CDATA[<p>Two of our previous articles, <a href="http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/">Splitting Databases for Scale</a> and <a href="http://akf-consulting.com/techblog/2008/05/08/splitting-applications-or-services-for-scale/">Splitting Applications or Services for Scale</a>, have made references to a concept that we call “Swimlaning Architectures”.</p>
<p>The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or approach in technology architecture.</p>
<p><img title="swimlane" height="154" alt="swimlane" src="http://www.akf-consulting.com/images/swimlane.jpg" width="165" /></p>
<p>In our definition, a “Swimlane” is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. The benefit of such a failure domain is two-fold:</p>
<p>1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.</p>
<p>2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon the approach, only a portion of users or a portion of functionality of the product is affected.</p>
<p>A “swimlaned” architecture is one in which each failure domain is completely isolated. In order to achieve this, ideally there are no calls between swimlanes or failure domains. Synchronous calls are absolutely forbidden in this type of architecture as any synchronous call between failure domains, even with appropriate timeout and detection mechanisms, is very likely to cause a series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a call to any other service in another domain, to any service outside of the domain, or if the domain receives calls from other domains or services.</p>
<p>It is acceptable, but not advisable, to have asynchronous calls between domains. If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not cause port overloads on any services. Here is an interesting <a href="http://www.mysqlperformanceblog.com/2008/05/20/apache-php-mysql-and-runaway-scripts/">blog post</a> about runaway scripts and their impact on Apache, PHP, and MySQL.</p>
<p>As we have previously indicated, a swimlane should have all of its services located within the failure domain. For instance, if database accesses are necessary the database with all appropriate information for that swimlane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swimlane. Furthermore, that database should not be used for other requests of service from other swimlanes. Our rule is one production database on one host.</p>
<p>As we have indicated with our Scale Cube in the past, there are many ways in which to think about swimlaned architectures. You can think about them in terms of a separation of services e.g. “login” and “shopping cart” (two separate swimlanes) each having the web and app servers as well as all data stores located within the swimlane and answering only to systems within that swimlane. Corresponding to the Scale Cube we have previously introduced this would be a “Y” axis swimlane.</p>
<p>Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swimlane along customer, order number or product id lines.</p>
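<p>As an illustration of such an indiscriminate function, here is a minimal Python sketch that assigns a customer to a swimlane with a modulus of the customer id. The lane names and the choice of four lanes are hypothetical.</p>

```python
# Hypothetical swimlane names; each lane would own its own web servers,
# app servers and database.
SWIMLANES = ["lane-0", "lane-1", "lane-2", "lane-3"]


def swimlane_for(customer_id):
    """Pick a Z-axis swimlane using a simple modulus of the customer id."""
    return SWIMLANES[customer_id % len(SWIMLANES)]


for customer_id in (4, 10, 1001):
    print(customer_id, swimlane_for(customer_id))
```

<p>One design consequence worth noting: with a plain modulus, changing the number of lanes changes where most ids map, so adding a lane implies moving data between lanes.</p>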
<p>Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.
</p>
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/05/30/fault-isolative-architectures-or-%e2%80%9cswimlaning%e2%80%9d/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Splitting Databases for Scale</title>
		<link>http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/</link>
		<comments>http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/#comments</comments>
		<pubDate>Thu, 22 May 2008 19:45:44 +0000</pubDate>
		<dc:creator>Administrator</dc:creator>
		
	<category>Engineering</category>
		<guid isPermaLink="false">http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/</guid>
		<description><![CDATA[The most common point of congestion and therefore barrier to scale that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic [...]]]></description>
			<content:encoded><![CDATA[<p>The most common point of congestion and therefore barrier to scale that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic database and having multiple “cloned” application servers retrieve and store data within that database. For young companies this is a very good approach because, if done properly, it will also eliminate the need for persistence or affinity to a given application server and as a result will increase customer perceived availability.</p>
<p>The problem, however, with this single monolithic data structure is three-fold:</p>
<p>1) Even with clustering technology (the existence of a second physical system or database that can take the load of the first in the event of failure), failures of the primary database will result in short service outages for 100% of the user community.</p>
<p>2) This approach ultimately relies solely on technical improvements in CPU speed, memory access speed, memory access size, mass storage access speeds and size, etc. to ensure the company’s needs for scale.</p>
<p>3) Relying upon (2) above in the extreme cases is not the most cost effective solution as the newest and fastest technologies come at a premium to older generations of technology and do not necessarily have the same processing power per dollar as older and/or smaller (fewer CPUs etc.) systems.</p>
<p>As we have argued in the aforementioned post, a great engineering team will think about how to scale their platform well in advance of the need to rely solely upon partner technology advances. By making small modifications to our previously presented “Scale Cube”, the same concepts applied to the problem of splitting services for scale can be useful in addressing how to split a database for scale.</p>
<p>As with the AKF Services Scale Cube, the AKF Database Scale Cube consists of X, Y and Z axes – each addressing a different approach to scale transactions applied to a database. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic database – a case where all data is located in a single location and all accesses go to this single database.</p>
<p><a href="http://www.akf-consulting.com/images/db_cube.png"><img title="database _cube" style="width: 357px; height: 246px" height="246" alt="database _cube" src="http://www.akf-consulting.com/images/db_cube.png" width="357" /></a></p>
<p>The X Axis of the cube represents a means of spreading load across multiple instances of a replicated representation of the data. This is the first approach most companies use in scaling databases and is often both the easiest to implement and the least costly in both engineering time and hardware. Many third party and open source databases have native properties or functions that allow the near real time replication of data to multiple “read databases”. The engineering cost of such an approach is low, as typically database calls only need to be identified as a “read” or “write” and sent to the appropriate write database or bank of read databases. Reads should be split evenly across the “bank” of read databases if possible, and many companies employ simple third party load balancers to perform this distribution.</p>
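<p>The read/write split described above can be sketched in a few lines. This is a minimal illustration rather than a production router: the class name, the host names, and the rule that only statements beginning with SELECT count as reads are all assumptions made for the example.</p>

```python
import itertools


class ReadWriteRouter:
    """Send writes to one write database and spread reads over a bank of replicas."""

    def __init__(self, writer, readers):
        if not readers:
            raise ValueError("at least one read database is required")
        self.writer = writer
        # Round-robin keeps reads evenly spread across the bank.
        self._readers = itertools.cycle(readers)

    def route(self, sql):
        # Assumption: a statement is a "read" only if its first word is SELECT.
        words = sql.split()
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT":
            return next(self._readers)
        return self.writer


# Hypothetical host names, used only for illustration.
router = ReadWriteRouter("db-write", ["db-read-1", "db-read-2"])
targets = [router.route(q) for q in (
    "SELECT name FROM users WHERE id = 7",
    "UPDATE users SET name = 'x' WHERE id = 7",
    "SELECT 1",
)]
print(targets)
```

<p>In practice the read distribution is often handled by a load balancer in front of the read databases, in which case the application only needs the read-versus-write decision.</p>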
<p>Included in our X-axis split are third party and open source caching solutions that allow reads to be split across “cache” hosts before actually reading from a database upon a cache miss. Caching is another simple way to reduce the load on the database but in our experience is not sufficient for hyper growth SaaS sites.</p>
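<p>As a rough sketch of the caching idea, the example below reads from a cache first and only falls through to the database on a cache miss. The class, the in-memory dictionary standing in for a cache host, and the <code>load_from_db</code> callable are hypothetical names chosen for illustration.</p>

```python
class CacheAsideReader:
    """Serve reads from a cache and touch the database only on a miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical callable: key to value
        self._cache = {}                   # stands in for a separate cache host
        self.db_reads = 0                  # counts how often the database is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        # Call after a write so the next read goes back to the database.
        self._cache.pop(key, None)


reader = CacheAsideReader(lambda key: key.upper())
reader.get("profile:7")
reader.get("profile:7")
print(reader.db_reads)
```

<p>The second read is served from the cache, so the database is read only once until the key is invalidated.</p>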
<p>If implemented properly, this X-axis split can also increase availability: if replication is near real time, a read server can be promoted to become the single “write server” in the event of a “write server” failure. The combination of caching and read/write splits (our X axis) is sufficient for many companies, but for companies with extreme hyper growth and massive data retention needs it is often not enough.</p>
<p>The Y Axis of our database cube represents a split by function, service or resource, just as it did with the service cube. A service might represent a set of use-cases and is often easiest to envision as a verb or action like “login”, while a resource oriented split is easiest to envision as a noun like “account information”. These splits not only spread transactions across multiple systems, as did the X axis, but can also speed up database calls by allowing more information specific to the request to be held in memory rather than requiring a disk access. Just as with our approach in scaling services, our recommended way to decide the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team, as very often they require that the application be split up as well. It is possible to take a monolithic application and perform physical splits by, say, URL/URI to different service or resource oriented pools. While this approach will help spread transaction processing across multiple systems, similar to our X axis implementation, it may not offer the added benefit of reducing the amount of system memory required by each service, pool, resource or application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application and data set, especially when both the database and applications are split to represent a “failure domain” or fault isolative infrastructure.</p>
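<p>A split by URL/URI to service or resource oriented pools might look something like the sketch below. The path prefixes and pool names are hypothetical, and the lookup uses only the first path segment, which is one simple assumption about how URLs map to services.</p>

```python
from urllib.parse import urlsplit

# Hypothetical mapping from the first URL path segment to a dedicated pool.
POOLS_BY_PREFIX = {
    "/login": "login-pool",      # a verb-like service split
    "/account": "account-pool",  # a noun-like resource split
    "/search": "search-pool",
}
DEFAULT_POOL = "general-pool"


def pool_for_url(url):
    """Pick the pool (application servers plus database) for a request URL."""
    path = urlsplit(url).path  # drops any query string or fragment
    first_segment = "/" + path.strip("/").split("/", 1)[0]
    return POOLS_BY_PREFIX.get(first_segment, DEFAULT_POOL)


for url in ("/login?next=/home", "/account/42/edit", "/checkout"):
    print(url, pool_for_url(url))
```

<p>Each pool name here would correspond to its own set of application servers and its own database, which is what makes the split a failure domain rather than just a routing rule.</p>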
<p>The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (a hash, for instance). The most common way to view this is to consider splitting your resources by customer, if your entity relationships allow that to happen. In the world of media, you might consider splitting by article_id or media_id, and in the world of commerce a split by product_id might be appropriate. In the case where you split customers from your products and perform splits within customers and products, you would be implementing both a Y axis split (splitting by resource or call: customers and products) and a Z axis split (a modulus of customers and products within their functional splits).</p>
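<p>The modulus and hash approaches can be sketched as follows. The function names, the shard count and the database naming scheme are assumptions for illustration; the last function combines a Y axis split (by resource) with a Z axis split (a modulus within that resource).</p>

```python
import hashlib


def shard_by_modulus(entity_id, shard_count):
    """Z axis split for numeric ids such as customer_id or product_id."""
    return entity_id % shard_count


def shard_by_hash(key, shard_count):
    """Z axis split for non-numeric keys, using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for(resource, entity_id, shard_count):
    """Y axis (resource) plus Z axis (modulus) gives one database name."""
    shard = shard_by_modulus(entity_id, shard_count)
    return "{}_{:02d}".format(resource, shard)


print(database_for("customers", 10, 4))
print(database_for("products", 7, 4))
```

<p>A plain modulus is easy to reason about, but changing the shard count moves most ids to a different shard, so a lookup table is sometimes preferred when shards are expected to be added over time.</p>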
<p>Z axis splits tend to be the most costly for an engineering team to perform, because many functions that might otherwise be performed within the database (joins, for instance) now need to be performed within the application. That said, if done appropriately, they represent the greatest potential for scale for most companies.
</p>
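<p>To make the cost of a Z axis split concrete, the sketch below performs in the application what a single database would have done with a join. The in-memory lists and dictionaries are hypothetical stand-ins for separately sharded customer and order data, with customers split by a modulus of customer_id.</p>

```python
# Hypothetical stand-ins for customer data split across two shards.
CUSTOMER_SHARDS = [
    {2: {"name": "Bo"}, 4: {"name": "Di"}},  # customer_id % 2 == 0
    {1: {"name": "Al"}, 3: {"name": "Cy"}},  # customer_id % 2 == 1
]
ORDERS = [
    {"order_id": 100, "customer_id": 1, "total": 25},
    {"order_id": 101, "customer_id": 2, "total": 40},
    {"order_id": 102, "customer_id": 1, "total": 10},
]


def fetch_customers(customer_ids):
    """Group ids by shard so each shard is visited once, then merge the rows."""
    ids_by_shard = {}
    for cid in customer_ids:
        ids_by_shard.setdefault(cid % len(CUSTOMER_SHARDS), set()).add(cid)
    found = {}
    for shard_index, ids in ids_by_shard.items():
        shard = CUSTOMER_SHARDS[shard_index]
        for cid in ids:
            if cid in shard:
                found[cid] = shard[cid]
    return found


def orders_with_customer_names(orders):
    """Application-side join of orders to customers on customer_id."""
    customers = fetch_customers({o["customer_id"] for o in orders})
    joined = []
    for o in orders:
        customer = customers.get(o["customer_id"])
        if customer is not None:  # inner-join behaviour: skip unmatched orders
            joined.append({"order_id": o["order_id"],
                           "name": customer["name"],
                           "total": o["total"]})
    return joined


for row in orders_with_customer_names(ORDERS):
    print(row)
```

<p>Batching the lookups per shard keeps the number of round trips proportional to the number of shards touched rather than the number of rows joined.</p>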
]]></content:encoded>
			<wfw:commentRSS>http://akf-consulting.com/techblog/2008/05/22/splitting-databases-for-scale/feed/</wfw:commentRSS>
		</item>
	</channel>
</rss>
