- What Is Software Architecture?
- How to Design a System
- Five Questions
- Seven Principles: The Overarching Concepts
- Designing for an Online Bookstore
- Designing for the Cloud
- Summary
Seven Principles: The Overarching Concepts
Several overarching concepts (tactics) will help us achieve good software architecture. However, they may not always help, so we must evaluate them against the end goal of the system and use what is helpful.
Principle 1: Drive Everything from the User’s Journey
The user journey defines what they can and will do with the system. It is not what is written down in the requirement specification. It is everything that can happen. The user journey, however, is never fully defined. It evolves as the user evolves and includes almost unlimited possibilities. For example, if we consider a bookstore, the user journey is what people do when they come in, and that is never fully defined. Do users want to search by the number of pages in the book? Do they want a specific author, or are they looking for a specific topic? Perhaps the user journey in this instance is to look only at journals.
We must strive to understand the user journey in as much detail as possible, covering the most important scenarios. Doing so provides a basis for building great UX and stops us from building unnecessary features.
UX makes or breaks a system. To provide a vivid example, Robinson Meyer’s article “The Secret Startup That Saved the Worst Website in America” explains how a bad user experience at Healthcare.gov almost broke the Affordable Care Act (ACA).2 Many users gave up when registering, even though the alternative was not being able to go to a hospital when needed—the UX stopped even desperate users! UX alone does not make our systems successful, but without a good UX, our users won’t have a chance.
The greatest source of errors in our architectures is unused or rarely used features, which waste the time and money spent on them. The first step in reducing such features is to understand the user journey and to evaluate every feature in terms of its utility to the user and the cost of forgoing it. We should build things that add value, not things that are easy to build regardless of their value.
Most systems have multiple groups of users who are interested in different parts of the user journey. We can never support all the users in all aspects of the user journey. We must choose one or the other. We have to make those choices deliberately and continuously. We return to this topic in the second principle. Furthermore, when we make a decision about the architecture, we need to consider these additional questions:
How does this affect the user journey?
How much value does it add?
Is there something else we can do that adds more value?
Principle 2: Use an Iterative Thin Slice Strategy
Premature optimization is the root of all evil. —Donald Knuth
There are two ways to build systems. The first approach is to build all the parts and then integrate them. In my experience, most problems surface in the integration step, often adding months, if not years, to the project. The second approach creates a thin slice of the system that goes end to end and is useful at each step, using the simplest architectural choices. Then we identify bottlenecks and improve them, add new features, and replace parts later, adopting complex architectural choices only as needed.
When we are writing a basic application, this means getting the main path working as soon as possible without worrying about performance in the first round, and then profiling the system and improving it to handle bottlenecks. With modern runtimes that use just-in-time (JIT) compilation and perform many optimizations, it is tough to guess which parts need special handling. It is better to write simple code and optimize it only if and when needed.
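As a concrete sketch of this workflow, the snippet below (in Python, with an illustrative `build_report` function standing in for a real main path) gets the code working first and then uses the standard `cProfile` module so that bottlenecks are found from measured data rather than guesses:

```python
import cProfile
import io
import pstats


def build_report(rows):
    # Hypothetical main path: a naive O(n^2) duplicate check is the
    # deliberate bottleneck here; we write it simply and measure first.
    seen = []
    duplicates = 0
    for row in rows:
        if row in seen:
            duplicates += 1
        else:
            seen.append(row)
    return duplicates


def profile_main_path():
    """Profile the working code; optimize only what the data shows."""
    rows = [i % 500 for i in range(5000)]
    profiler = cProfile.Profile()
    profiler.enable()
    result = build_report(rows)
    profiler.disable()

    # Render the top entries sorted by cumulative time into a string.
    stats_text = io.StringIO()
    pstats.Stats(profiler, stream=stats_text).sort_stats("cumulative").print_stats(5)
    return result, stats_text.getvalue()


if __name__ == "__main__":
    duplicates, report = profile_main_path()
    print(report)
```

Only if the profile shows `build_report` dominating would we replace, say, the list membership test with a set; simple code stays until a measurement justifies the change.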
Using this approach with a distributed app is a bit harder; however, the same idea works. Start with the most straightforward architecture and iteratively improve it. This also means integrating and merging new code as soon as possible. In other words, do small commits.
The Wright brothers are a great example of the power of this approach. Working with limited funds to build an airplane, they competed against well-funded professionals. Their competitors focused on creating the best design, building the plane, and then flying it. They thought (perhaps arrogantly) that they could think through all contingencies and build a plane that would fly on the first run. However, every time a plane failed to fly, the crash wrecked the prototype, setting them back months.
In contrast, the Wright brothers used an iterative thin slice strategy. They focused on first building a glider that worked, one that could land successfully, and then preserving the prototype. This strategy enabled them to do many more test flights. They perfected the glider and figured out how to control it. Then they added propellers and engines, gradually converting the glider into an airplane. This approach allowed them to learn, to tinker, and to experiment without months of setbacks at each failure.
An iterative thin slice strategy creates a powerful feedback cycle. It enabled the Wright brothers to improve gradually while competing against far greater brainpower and millions of dollars.
Unless you have a specific reason, always start with simple architectural choices. Measure the system, find the bottlenecks, and improve the system later; choose complex architectures only if needed. (Parts II and III describe some default choices and more complex selections for many situations.)
When undertaking the thin slice strategy, I have seen that simple architectures are enough to support systems over the years. A great example comes from threading models, where request per thread (with a pool) is inefficient and nonblocking architectures can do much better. However, the resulting code from nonblocking models is harder to read, and it is not easy to find people experienced in writing it. For many use cases, a simple request-per-thread model is sufficient throughout the system’s life cycle. Let’s keep our systems as simple as possible, starting simple and adding complexity only gradually.
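A minimal sketch of the request-per-thread (with a pool) model, using Python’s standard `ThreadPoolExecutor` and a made-up `handle_request` function:

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    # Stand-in for blocking work: a database query, a downstream
    # HTTP call, file I/O, and so on.
    return f"response-{request_id}"


def serve(requests, workers=4):
    """Request per thread with a bounded pool: each request runs on its
    own worker thread, and the code reads top to bottom."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, requests))
```

A nonblocking alternative (e.g., an event loop) can serve more concurrent requests per machine, but the straightforward version above is easier to read and debug, which is often the better trade until measurements say otherwise.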
Another advantage of the thin slice strategy is that it forces everyone to integrate code early, fixing any misunderstandings about the design before they come to a head and become overwhelming. This strategy works because it rapidly creates a working system, unlocks feedback, and uncovers integration problems early. This approach gives us the time and the opportunity to improve and to fix any problem we might encounter.
Principle 3: On Each Iteration, Add the Most Value for the Least Effort to Support More Users
As discussed, when designing the software architecture, we want to use an iterative approach that starts with limited features and then gets user feedback to improve the system. On each iteration, we want to add the most value for the least amount of effort. This means avoiding features that have little value and delaying less valuable features to later iterations. It is important to note that most systems have many different user groups, and certain features add unique value for different users.
The user journey provides a powerful lens for making feature-related decisions. In most products, many users do only a few critical things. Find those and optimize for them. Doing this is the secret behind Apple’s legendary UX. The podcast “Inside the Apple Factory: Software Design in the Age of Steve Jobs” describes Apple’s approach in detail.3 At Apple, about one-third of most teams are UX experts, so their UX quality is not an accident; they invest in it. Also, at Apple, any feature starts with the product lead (or product manager) and UX experts who then do mockups and iterations for stakeholders until the design is perfect. The code comes later.
Investing in such a process early on removes a lot of future changes and also provides a strong basis for accepting or rejecting future feature requests. Consequently, features won’t be what is easy to implement but what is required by the end user.
The first step for this principle is defining value. Value can mean supporting the largest number of users, the users who bring the most revenue, or the users who can give the product the most exposure. We may even use different value criteria at different stages of the product. Examine the user journey to identify features that would add the most value by focusing on the user groups that bring in the most value. In line with this principle, the following are concepts I try to follow:
Principle 3.1: It is impossible to thoroughly think through how users will use your product, so embrace a minimum viable product (MVP). The idea is to identify a few use cases, do only features that support those cases, get feedback, and shape the product based on the feedback and experience from the MVP.
Principle 3.2: Do as few features as possible. When in doubt (e.g., when the team disagrees), leave it out. Many features are never used, so you might develop an extension point instead.
Principle 3.3: Wait for someone to ask for the feature. If the feature is not a deal-breaker, wait until three people ask for it before focusing on implementation.
Principle 3.4: Have the courage to stand your ground if the features the customer requests adversely affect the product. Focus on the bigger picture and try to find another way to handle the problem.
Remember the quote often attributed to Henry Ford: “If I had asked people what they wanted, they would have said faster horses.” Also remember that you are the expert. You are supposed to lead. It is the leader’s job to do what is right, not what is popular. Users will thank you later (fourth principle).
Principle 3.5: Look out for Google envy. Do not overengineer. We all like shiny designs. It is easy to bring features and solutions into your architecture that you will never need. For features such as quality of service (QoS) improvements, scale, and performance limitations, wait until those requirements are imminent. Also, approach the product with the mindset that you will rewrite it. Implement only what you need now.4
Principle 3.6: When possible, use middleware tools or cloud services. For example, consider authentication and authorization. If you decide to implement these, it will create a lot of feature requirements in the future. For instance, you will need a user registration flow, password recovery, and attack detection. Using an identity and access management (IAM) tool supports all those features, and IAM will continue to evolve its product as requirements change. The same idea applies to message brokers, workflow systems, payment systems, and so forth.
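To illustrate the boundary this creates, here is a hedged Python sketch (all names are invented for illustration): the application validates tokens only through a `TokenValidator` interface, and everything behind that interface, such as registration, password recovery, and attack detection, remains the IAM product’s responsibility. A fake IAM stands in for the real service:

```python
from typing import Optional, Protocol


class TokenValidator(Protocol):
    """Boundary to an external IAM: returns the user for a valid
    token, or None. The application never stores passwords."""

    def validate(self, token: str) -> Optional[str]: ...


class FakeIAM:
    # Test double standing in for a real IAM's token-validation call;
    # in production this would be an HTTP client to the IAM product.
    def __init__(self, sessions):
        self._sessions = sessions

    def validate(self, token: str) -> Optional[str]:
        return self._sessions.get(token)


def handle_api_request(iam: TokenValidator, token: str) -> str:
    # The application's only auth concern: is this token valid?
    user = iam.validate(token)
    if user is None:
        return "401 Unauthorized"
    return f"200 OK for {user}"
```

Because all auth logic sits behind one small interface, swapping the fake for a real IAM client later does not touch the request-handling code.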
Principle 3.7: Interfaces and other abstractions are techniques for creating options and delaying decisions. Use them carefully. Like financial options, software options also have costs. Learn to be mindful of them. Know that this presents a trade-off, thus a judgment call and, hence, a leader’s responsibility. For example, a common mistake, or anti-pattern, is too many abstraction layers, which creates a terrible performance impact when we ignore the cost of abstractions.
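As a small illustration of an interface as an option (a Python sketch with invented names), the caller below depends only on a `Storage` interface, which delays the choice of backend; the cost is the extra layer itself, which is why such options should be created deliberately rather than by habit:

```python
from typing import Protocol


class Storage(Protocol):
    """An interface is an option: it delays choosing the real backend."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> str: ...


class InMemoryStorage:
    # Simplest choice for the first thin slice; a database-backed
    # implementation can replace it later without touching callers.
    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


def save_order(storage: Storage, order_id: str, payload: str) -> None:
    # Callers depend only on the interface, never on the backend.
    storage.put(order_id, payload)
```

One such seam at a genuine point of uncertainty is cheap; ten stacked layers of this pattern are exactly the anti-pattern the principle warns about.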
This UX approach should go beyond UIs. We must use the same approach with APIs and internal and external messages because these formats are hard to change later. Create those APIs and message formats, iterate, and get feedback. Remember, we must design deeply but implement as little as possible.
There is one exception to implementing features as late as possible. This minimal approach does not work with features you’ll need as a competitive advantage or for security. You must invest in them independently of the design process. The sixth principle addresses these unknowns.
Principle 4: Make Decisions and Absorb the Risks
The most senior technical person in the project (whom I call the chief architect) must make decisions and absorb the risks. Any project faces many uncertainties; for example, what load should the first version of the system handle, and what latency limits should it meet? The reality is that, often, nobody knows those numbers. We often ask customers, and they do not know either. However, someone has to put down the numbers so that the team can go ahead and hit the target date. Without a target, the team can lose much time in indecision.
Richard Rumelt’s book Good Strategy, Bad Strategy (Profile Books, 2011) provides a great example of this principle. When beginning to design a moon rover, nobody knew the moon’s surface. The team designing the first such vehicle was stuck. Phyllis Buwalda, director of NASA’s Future Mission Space Studies team, wrote a specification for the moon’s surface based on the toughest desert on Earth. She understood that unless she took the risk of specifying the target, much time would go to waste. By writing the specification, she absorbed the uncertainty on her shoulders, thus enabling the team to make real progress.
Similarly, the chief architect must collect the required data, perform the necessary experiments, and yet, at the end, understand the unresolvable uncertainties (such as how much load the system will get) and make decisions that set concrete targets. Leaders must remove ambiguity and create targets that are solvable.
Principle 5: Design Deeply Things That Are Hard to Change but Implement Them Slowly
In my opinion, this fifth principle is the crux of designing software systems. We should design deeply but implement slowly. Let’s explore what this means.
I usually advocate simple designs and adding complexity only when needed. However, some parts of the design are hard to change, such as
APIs exposed directly to customers
APIs of highly shared services
Database schemas (if we deploy a product that uses a database in the customer premises)
Shared data, objects, and message formats
Technology frameworks
When designing, we need to expend significant energy in designing parts like APIs and database schemas. These designs must go through a lot of reviews and iterations before putting them out to the customer. For example, with APIs, even if we version those that are exposed to our customers, old releases hang around for a long time. They are hard to change, even beyond a rewrite. APIs of shared services are also difficult to change because that would require coordinated releases.
To understand what is hard to change, we must design the system deeply. At the design level, we need to dive thoroughly into creating a design that can potentially solve the entire problem and even create PoCs as needed. Having a potential design opens our eyes to possible surprises and enables us to learn from evidence as it comes up. Designing early and deeply lets us start discussions and build consensus from the start, a process that often consumes a lot of time.
When designing deeply, know that it’s impossible to go deep into every aspect of the software due to limited time and resources. For any part of the system that we can change and evolve without affecting the rest of the system and for those that do not contain significant unknowns, we can defer the details to a later date. Doing this properly requires judgment. Unless we do this, however, we will drown in the details.
For example, writing a service is a well-understood problem, but unless we see the need for that service to handle complexity (e.g., large throughput, large messages), we can defer the implementation details until after defining the APIs. In general, if an API or interface hides the implementation details and that is understood, we can delay the implementation design. Thus, designing deeply should focus on APIs, interfaces, and their interactions. We must, however, realize that the current design will be based on our incomplete understanding of the problem and will evolve over time.
The deep design does not imply an urgency to implement it. Doing things slowly lets us implement them with more understanding and helps avoid future changes. Build things only when your user journey analysis indicates that they are necessary and add significant value. Designing deeply, implementing slowly, and using the judgment required to do this efficiently and decisively are hallmarks of a great architect.
Principle 6: Eliminate the Unknowns and Learn from the Evidence by Working on Hard Problems Early and in Parallel
Detect unknowns early and systematically eliminate them rather than trusting your luck. Often this effort requires experiments to resolve them, which is one of the chief architect’s key responsibilities. Resolving unknowns requires trial and error, which usually takes time. Proactively exploring unknowns gives us enough time to inspect those problems and find the right solutions. This foresight differentiates a great architect from a good one.
Kelly Johnson, the aircraft designer, offers a great example. Designing aircraft for the Defense Advanced Research Projects Agency (DARPA), his team built the first aircraft to fly three times faster than sound (Mach 3). Wind tunnels at that time could not simulate wing designs at this speed. Johnson found a simple solution: he collected data by borrowing 400 missiles, mounting different wing designs on them, and conducting experiments.
Experiments are a crucial tool in any designer’s arsenal. Because it is much easier to do experiments with software than with an aircraft, we have little excuse for not doing them. One of my advisors used to say never to argue about or analyze something that you can check with fifteen minutes of code.
This principle also ties in with the deep design that allows us to proactively identify unknowns beyond what is apparent at first glance. If we believe a certain part of the design is unknown and risky, we need to dig into that part early to give us time to resolve the unknown.
There is a second, related point. With software, it is easy to rerun something, so we neglect to build monitoring into the system and are bad at collecting enough data to understand what is really happening. Ironically, precisely because the data would be easy to collect, we never collect it. Yet complex problems and situations do not happen often and are hard to re-create. Unless we collect data, it is hard to learn from these situations, robbing us of the opportunity to fix bugs and to deeply understand the system.
In contrast, designers in many other disciplines such as vehicle design, aeronautics, and medicine have only a few experiments about a particular topic at their disposal. Hence, they collect a lot of data and usually know much more about their systems than software professionals do.
We should add monitoring to our systems early and take the time to instrument them. For example, we can measure operating system metrics, queue sizes, selected traces, timing breakdowns, and throughput at different places in our system. Also, because it is not practical to comb through the data daily, we should automate the analysis process as much as possible. Careful monitoring enables us to learn a lot from every situation.
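A minimal sketch of such instrumentation in Python (names invented for illustration): a decorator records per-call latency, and a small summary function automates the first step of the analysis:

```python
import time
from collections import defaultdict
from functools import wraps

# In-process store of raw latency samples, keyed by function name.
_timings = defaultdict(list)


def monitored(func):
    """Record per-call latency so bottlenecks are found from data,
    not guesses; the overhead per call is tiny."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            _timings[func.__name__].append(time.perf_counter() - start)

    return wrapper


def summarize():
    # Automated analysis step: average latency per instrumented function.
    return {name: sum(values) / len(values) for name, values in _timings.items()}


@monitored
def lookup_book(title: str) -> str:
    time.sleep(0.001)  # stand-in for real work
    return title.upper()
```

In a real system, the samples would flow to a metrics backend instead of an in-process dictionary, but the pattern of instrumenting early and summarizing automatically is the same.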
Monitoring has a minor performance penalty. Yet, in the long run, we will save money by building better systems. This kind of monitoring is essential for the feedback loop if we operate within tight performance constraints.
Principle 7: Understand the Trade-offs Between Cohesion and Flexibility in the Software Architecture
As budding architects, we learned about the principles of flexibility and cohesion in the architecture. Venkat Subramaniam’s talks are a great source for understanding these principles.5 However, most of these principles have costs too. Hence, software architecture must be evaluated in its context, which we explored in the five questions, but sometimes we have to break the principles to create the best architecture.
Flexibility refers to the ability of the system to change. As mentioned, flexibility also has costs. For example, as we discussed earlier in this chapter, the flexibility to run on multiple clouds can, on average, be more expensive than building for one cloud and redesigning if and when it’s needed.
Cohesion broadly means that architectural concepts are applied throughout the system. A common thing to check is whether the system reuses its components or services everywhere. An ideal system should be composed of services or components that handle one aspect (e.g., only logging, security, messaging, registry, mediation, or analytics), and all parts of the system must reuse those aspects when needed without reimplementing them. If you need configuration parsing, use configuration parsing components. If you need logs, use the logging component. This extends the DRY principle (Don’t Repeat Yourself) from code to architecture.
In modern architectures, this reuse can happen at the library level (same process) or at the service level. Unfortunately, trying to enforce this principle too rigidly can lead to problems. For example, asking every service to call a configuration service or query builder service can be too much (but not always). Sometimes, bringing in a component can also be too complicated because it brings in other dependent components in turn. Simple features can cascade into significant changes. I saw an example of this, where adding a mediation dependency to an identity server added hundreds of new dependencies.
The most unfortunate misuse of cohesion happens as follows: We detect some aspect of one service that can be reused by another service and ask the first team to refactor and create a new service or component. The second team then incorporates this service into their system. This kind of refactoring, which forces close communication between multiple teams, should be done only when it is absolutely necessary.
Usually, it is not worth doing this to reduce duplication slightly. I have done this and paid the price. With hindsight, I am now willing to live with some level of duplication and inconsistencies when fixing those results in significant complexity. The cure, sometimes, can be worse than the disease.
It is useful to think about architecture as a way to build systems that are cheaper in the long run and tactics as tools in your toolbox. We use tools only when they make sense. In the next section, we look at a sample system to explore how to use these questions and principles.