- 4.1 Do DevOps Practices Require Architectural Change?
- 4.2 Overall Architecture Structure
- 4.3 Quality Discussion of Microservice Architecture
- 4.4 Amazon's Rules for Teams
- 4.5 Microservice Adoption for Existing Systems
- 4.6 Summary
- 4.7 For Further Reading
4.3 Quality Discussion of Microservice Architecture
We have described an architectural style, the microservice architecture, that reduces the necessity for inter-team coordination by making global architectural choices. The style provides some support for the qualities of dependability (stateless services) and modifiability (small services), but there are additional practices that a team should use to improve both dependability and modifiability of their services.
Dependability
Three sources of dependability problems are the limited amount of inter-team coordination, the correctness of the environment, and the possibility that an instance of a service can fail.
Small Amount of Inter-team Coordination
The limited amount of inter-team coordination may cause misunderstandings between the team developing a client and the team developing a service about the semantics of an interface. In particular, a service may receive unexpected input or return unexpected output. There are several options for guarding against such misunderstandings. First, a team should practice defensive programming and not assume that the input to, or the results of, a service invocation are correct. Checking values for reasonableness will help detect errors early. Providing a rich collection of exceptions will enable faster determination of the cause of an error. Second, integration and end-to-end testing with all or most microservices should be done judiciously. Running these tests frequently can be expensive because they involve a potentially large number of microservices and realistic external resources. A testing practice called Consumer Driven Contract (CDC) can be used to alleviate the problem: the test cases for testing a microservice are decided, and even co-owned, by all the consumers of that microservice. Any changes to the CDC test cases need to be agreed on by both the consumers and the developers of the microservice. Running the CDC test cases, as a form of integration testing, is less expensive than running end-to-end test cases. If CDC is practiced properly, confidence in the microservice can be high without running many end-to-end test cases.
CDC serves as a method of coordination and has implications for how the user stories of a microservice are formulated and how they evolve over time. Consumers and microservice developers collectively formulate and own the user stories. The CDC definition becomes a function of the allocation of functionality to the microservice; it is managed by the service owner as part of the coordination that defines the next iteration and, consequently, does not delay the progress of the current iteration.
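As an illustration, the following is a minimal sketch of a CDC-style contract test, written in Python using pytest conventions and the requests library. The inventory service, its endpoint, and its payload fields are assumptions made for the sake of the example; the point is that consumers pin down both the shape of a response and the reasonableness of its values, echoing the defensive-programming advice above.

```python
# A minimal sketch of a consumer-driven contract test, assuming a
# hypothetical "inventory" microservice with a GET /items/<id> endpoint.
# The consumers of the service would co-own tests like this one.
import requests

SERVICE_URL = "http://localhost:8080"  # test-environment address (assumption)


def test_get_item_contract():
    """Contract agreed on by the consumers of the inventory service."""
    response = requests.get(f"{SERVICE_URL}/items/42", timeout=2)

    # The contract pins the status code and the shape of the payload.
    assert response.status_code == 200
    body = response.json()
    assert set(body) >= {"id", "name", "quantity"}

    # Defensive value checks catch semantic drift, not just shape changes.
    assert isinstance(body["quantity"], int) and body["quantity"] >= 0
```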
Correctness of Environment
A service will operate in multiple different environments during the passage from unit test to post-production. Each environment is provisioned and maintained through code and a collection of configuration parameters. Errors in code and configuration parameters are quite common, and inconsistent configuration parameters are also possible. Because of a degree of uncertainty in cloud-based infrastructure, even executing correct code with correct configuration may lead to an incorrect environment. Thus, the initialization portion of a service should test its current environment to determine whether it is as expected. It should also test the configuration parameters to detect, as far as possible, unexpected inconsistencies across environments. If the behavior of the service depends on its environment (e.g., certain actions are performed during unit test but not during production), then the initialization should determine the environment and provide the settings for turning the behavior on or off. An important trend in DevOps is to manage all the code and parameters that set up an environment just as you manage your application code, with proper version control and testing. This is an example of infrastructure-as-code, as defined in Chapter 1 and discussed in more detail in Chapter 5. Testing infrastructure code is a particularly challenging issue; we discuss it in Chapters 7 and 9.
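The following is a minimal sketch of such an initialization-time check, assuming a service configured through environment variables; the variable names and the set of environments are illustrative.

```python
# A minimal sketch of an initialization-time environment check, assuming
# the service reads its configuration from environment variables. All
# variable names here are illustrative.
import os
import sys

REQUIRED_VARS = ["DB_HOST", "DB_PORT", "SERVICE_ENV"]
VALID_ENVS = {"unittest", "integration", "acceptance", "production"}


def check_environment() -> dict:
    """Fail fast if the environment is not as expected."""
    missing = [v for v in REQUIRED_VARS if v not in os.environ]
    if missing:
        sys.exit(f"Missing configuration parameters: {missing}")

    config = {v: os.environ[v] for v in REQUIRED_VARS}

    # Detect inconsistent parameters, e.g., a non-numeric port.
    if not config["DB_PORT"].isdigit():
        sys.exit(f"DB_PORT must be numeric, got {config['DB_PORT']!r}")
    if config["SERVICE_ENV"] not in VALID_ENVS:
        sys.exit(f"Unknown environment: {config['SERVICE_ENV']!r}")

    # Toggle environment-dependent behavior explicitly rather than
    # letting it be implied by stray settings.
    config["use_test_doubles"] = config["SERVICE_ENV"] != "production"
    return config


if __name__ == "__main__":
    print(check_environment())
```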
Failure of an Instance
Failure is always a possibility for instances. An instance is deployed onto a physical machine, either directly or through virtualization, and in large datacenters the failure of a physical machine is common. The standard method by which a client detects the failure of a service instance is the timeout of a request. Once a timeout has occurred, the client can issue the request again and, depending on the routing mechanism used, assume that it is routed to a different instance of the service. After multiple timeouts, the service is assumed to have failed, and an alternative means of achieving the desired goal can be attempted.
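A minimal sketch of this timeout-and-retry pattern follows, assuming a hypothetical pricing service reached through a load balancer (so a retry may land on a different instance) and a cached price as the alternative action.

```python
# A minimal sketch of timeout, retry, and fallback. The pricing service,
# its URL, and the cached-price fallback are assumptions for illustration.
import requests


def get_price(item_id: str, retries: int = 2, timeout: float = 0.5) -> float:
    """Fail fast with a short timeout; retry, then fall back."""
    for _ in range(retries):
        try:
            response = requests.get(
                f"http://pricing-service/prices/{item_id}",  # illustrative URL
                timeout=timeout,
            )
            response.raise_for_status()
            return response.json()["price"]
        except requests.exceptions.RequestException:
            continue  # timeout or transport error: retry, hoping for a different instance

    # After repeated timeouts, assume the service has failed and take the
    # alternative action; here, a stale cached price (assumption).
    return get_cached_price(item_id)


def get_cached_price(item_id: str) -> float:
    """Hypothetical fallback; a real service might consult a local cache."""
    return 0.0
```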
Figure 4.3 shows a time line for a client attempting to access a failed service. The client makes a request to the service, and it times out. The client repeats the request, and it times out again. At this point, recognizing the failure has taken twice the timeout interval. Having a short timeout interval (failing fast) enables the client requesting the service to respond more rapidly to its own clients. A short timeout interval may, however, introduce false positives: the service instance may just be slow for some reason. The result may be that both initial requests actually deliver the service, just not in a timely fashion, and that the alternative action is performed as well. Services should therefore be designed so that multiple invocations of the same service will not introduce an error. Idempotent is the term for a service that can be repeatedly invoked with the same input and always produces the same output; namely, no error is generated.
Figure 4.3 Time line in recognizing failure of a dependent service [Notation: UML Sequence Diagram]
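One common way to obtain idempotency, sketched below under the assumption that clients attach a unique request ID to each logical operation, is to record the result of the first invocation and return it unchanged on any retry.

```python
# A minimal sketch of an idempotent operation. The request-ID scheme and
# the funds-transfer example are assumptions for illustration.
_processed: dict[str, dict] = {}  # in-memory store; a real service would persist this


def transfer_funds(request_id: str, account: str, amount: int) -> dict:
    """Repeating the same request returns the same result, with no error."""
    if request_id in _processed:
        return _processed[request_id]  # duplicate caused by a client retry

    result = {"account": account, "debited": amount, "status": "ok"}
    _processed[request_id] = result
    return result


# Two invocations with the same input produce one debit and the same output.
assert transfer_funds("req-1", "acct-9", 100) == transfer_funds("req-1", "acct-9", 100)
```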
Another point highlighted in Figure 4.3 is that the client has an alternative action in case the service fails. Figure 4.3 does not show what happens if there is no alternative action. In that case, the service reports failure to its own client together with context information; namely, that there was no response from the particular underlying service. We explore the topic of reporting errors in more depth in Chapter 7.
Modifiability
Making a service modifiable comes down to making likely changes easy and reducing the ripple effects of those changes. In both cases, a method for making the service more modifiable is to encapsulate either the portions affected by a likely change or the interactions that might cause ripple effects of a change.
Identifying Likely Changes
Some likely changes that come from the development process, rather than from the service being provided, are:
- The environments within which a service executes. A module goes through unit tests in one environment, integration tests in another, acceptance tests in a third, and is in production in a fourth.
- The state of other services with which your service interacts. If other services are in the process of development, then the interfaces and semantics of those services are likely to change relatively quickly. Since you may not know the state of the external service, a safe practice is to treat, as much as possible, all communication with external services as likely to change.
- The version of third-party software and libraries used by your service. Third-party software and libraries can change arbitrarily, sometimes in ways that are disruptive for your service. In one case we heard of, an external system removed an essential interface while a deployment was ongoing. Using the same VM image in different environments will protect against those changes that are contained within the VM, but not against changes in external systems.
Reducing Ripple Effects
Once likely changes have been discovered, you should prevent these types of changes from rippling through your service. This is typically done by introducing modules whose sole purpose is to localize and isolate changes to the environment, to other services, or to third-party software or libraries. The remainder of your service interacts with these changeable entities through the newly introduced modules with stable interfaces.
Any interaction with other services, for example, is mediated by the special module. Changes to the other services are reflected in the mediating module and buffered from rippling to the remainder of your service. Semantic changes to other services may, in fact, ripple, but the mediating module can absorb some of the impact, thereby reducing this ripple effect.
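A minimal sketch of such a mediating module follows, assuming a hypothetical external geocoding service; the endpoint and field names are illustrative.

```python
# A minimal sketch of a mediating module that localizes interaction with
# another service behind a stable interface. The geocoding service, its
# endpoint, and its payload shape are assumptions for illustration.
import requests


class GeocoderGateway:
    """The only place in this service that knows the external API's details."""

    def __init__(self, base_url: str = "http://geocoder-service"):
        self._base_url = base_url

    def coordinates(self, address: str) -> tuple[float, float]:
        # If the external service renames fields or moves endpoints, only
        # this method changes; callers keep the stable interface below.
        response = requests.get(
            f"{self._base_url}/v1/lookup", params={"q": address}, timeout=1.0
        )
        response.raise_for_status()
        body = response.json()
        return (body["lat"], body["lon"])
```

The rest of the service calls `GeocoderGateway.coordinates()` and never touches the external API directly, so interface changes in the other service are absorbed in one module rather than rippling outward.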