Building Microservices

Designing Fine-grained Systems

Prabath Siriwardena
Published in FACILELOGIN

May 27, 2016 · 38 min read


Microservices is one of the most trending buzzwords at the time of this writing, along with the Internet of Things (IoT), containerization, and a few others. Everyone talks about microservices, and everyone wants microservices implemented. The term ‘microservice’ was first discussed at a software architects’ workshop in Venice in May 2011, where it was used to describe a common architectural style the participants had been witnessing for some time. A year later, in May 2012, the same group agreed that ‘microservice’ was the best-suited name for that architectural style. Around the same time, in March 2012, James Lewis presented some of the ideas from the initial Venice discussion at the 33rd Degree conference in Kraków, Poland.

The abstract from James Lewis’ talk ‘Microservices — Java, the Unix Way’, which happened to be the very first public talk on microservices, in March 2012, reads:

“Write programs that do one thing and do it well. Write programs to work together” was accepted 40 years ago yet we have spent the last decade building monolithic applications, communicating via bloated middleware and with our fingers crossed that Moore’s Law keeps helping us out. There is a better way.

Microservices. In this talk we will discover a consistent and reinforcing set of tools and practices rooted in the Unix Philosophy of small and simple. Tiny applications, communicating via the web’s uniform interface with single responsibilities and installed as well behaved operating system services. So, are you sick of wading through tens of thousands of lines of code to make a simple one-line change? Of all that XML? Come along and check out what the cool kids are up to (and the cooler grey beards).”

The book Building Microservices by Sam Newman is one of the very first on the subject. It’s a must-read for anyone who talks about, designs, or builds microservices — I strongly recommend buying it! This article reviews the book while highlighting the key takeaways from each chapter. In many places I extract content directly from the book, but present it in a way that builds a summary.

Chapter — 1: Microservices

This is the introductory chapter of the book and the author tries to bring all the readers to a common ground.

Image credits: http://savingthefamilymoney.com/wp-content/uploads/2013/04/Melissa-Doug-Blocks.png

One common question anyone learning about microservices for the first time would raise is, how would a microservices based architecture differ from a service-oriented architecture (SOA)?

SOA is a design approach where multiple services collaborate to provide some end set of capabilities. A service here is an isolated process, and inter-service communication happens over the network. Because the services are exposed via well-defined interfaces, the implementation behind an interface can be changed without impacting the other, dependent services.

SOA also promotes technology heterogeneity. Since services are not tightly coupled to each other — and the inter-service communication happens over the wire following standard message formats (for example SOAP), one service does not have any hard dependency on the technology behind the other service’s implementation.

In the author’s opinion, despite many efforts, there is a lack of good consensus on how to do SOA well. Much of the complexity attributed to SOA is inherited from the WS-* standards stack, which some developers still mistake for SOA itself. The communication protocols, vendor middleware, and a lack of guidance about service granularity all make it hard to distinguish good SOA from bad.

The microservices approach emerged from real-world use, informed by all the experience of SOA gone bad, as an architectural style for doing SOA well.

What is the right size of a microservice? Jon Eaves at RealEstate.com.au in Australia characterizes a microservice as something that could be rewritten in two weeks, a rule of thumb that makes sense for his particular context. There are multiple opinions about the ‘right’ size of a microservice. The author argues it should be small enough — and no smaller. A strong factor in answering ‘how small?’ is how well the service aligns to the team structure: if the codebase is too big to be managed by a small team, looking to break it down is very sensible.

The book highlights seven key benefits of microservices over a monolithic application.

  • Technology Heterogeneity: In a monolithic application all the components are developed with the same technology stack. This does not allow us to use the optimal technology for a given component based on its intended functionality. With the microservices approach, each service can pick the technology stack that best fits it.
  • Resilience: In a monolithic application, if it fails, all of its components fail too. We can run such applications on multiple machines to reduce the chance of failure, but with microservices we can build systems that handle the total failure of individual services and degrade functionality accordingly.
  • Scaling: A monolithic application carries all its components in a single unit — the monolithic application itself. If you want to scale the application, you need to scale all of it; you cannot scale individual components separately based on their usage. With microservices, we can scale just those services that need scaling, allowing us to run other parts of the system on smaller, less powerful hardware.
  • Ease of Deployment: With monolithic applications, even in a system with the best automated tests, a one-line change to a single component requires running the whole test suite and releasing the complete product — hence a fresh deployment. (If you think patching would solve this, it isn’t quite true: patches are released only to fix issues, and the introduction of new features still requires releasing the complete monolithic application.) With microservices we can make a change to a single service and deploy it independently of the rest of the system.
  • Organizational Alignment: The idea of a “two-pizza team”, coined by Amazon founder Jeff Bezos, highlights the importance of having smaller teams (a team should be small enough that the entire team can be fed with two pizzas). The nature of microservices allows us to better align our architecture to our organization. Unlike in monolithic applications, the domain and scope of a given microservice are small — hence it can be owned by a small team.
  • Composability: The reuse of functionality is a key promise of service-oriented architecture. Microservices can be consumed in different ways for different purposes.
  • Optimizing for Replaceability: Most legacy applications are hard to replace, and for that very reason they sit in many enterprises as barriers to innovation — no one knows what would happen with the introduction of even a small change. Microservices remove this bottleneck: a microservice can be completely replaced by a new implementation without too much fuss.

Chapter — 2: The Evolutionary Architect

The 2nd chapter of the book presents a fairly opinionated view of what the role of an architect is in building a microservices architecture. The author questions whether ‘architect’ is even the right term for a software architect. It’s a term borrowed from the building/construction industry, and the role of a software architect is quite different from a building architect’s, in terms of responsibilities and accountability. He further elaborates, with an example, that it is the town planner’s role that maps to the software architect’s — rather than the building architect’s. The town planner’s role is to look at a multitude of sources of information and then attempt to optimize the layout of a city to best suit the needs of the citizens today, taking into account future use.

Image credits: http://i.telegraph.co.uk/multimedia/archive/02210/evo2_2210436b.jpg

Instead of designing each and every building of the town, the town planner zones the city. These zones could be industrial zones, residential zones, and so on. It is then up to other parties to develop buildings in these zones appropriately. The author argues that the role of a software architect is likewise to identify these ‘zones’ — which would be the service boundaries.

A software architect needs to worry much less about what happens inside a zone than about what happens between the zones. In other words, a software architect needs to pay more attention to how services communicate with each other than to how the individual services are built. Within a zone, the team that owns the zone can pick the technology stack or data store that fits its use cases. On paper this looks fine, though in practice it can lead to other problems. Having many different technology stacks complicates hiring and also makes it harder to move people between teams. To overcome such issues, it is necessary to establish some standards, principles, and practices across all the teams. For example, Netflix has mostly standardized on Cassandra as the data store, although it may not be the best fit for all of its use cases.

The author highlights six core responsibilities of an evolutionary architect.

  • Vision: The architect should clearly communicate the technical vision of the project/system to the rest of the members of the team, so that the system will meet its design goals and the requirements.
  • Empathy: The architect should understand the impact of the decisions he/she makes on the business.
  • Collaboration: The architect should engage with peers and colleagues as much as possible to help define, refine, and execute the vision.
  • Adaptability: The architect should be flexible to adapt into new situations as the business requirements evolve.
  • Autonomy: The architect should find the right balance between standardizing and enabling autonomy for the teams.
  • Governance: The architect should ensure that the system being developed fits the technical vision.

Chapter — 3: How to Model Services

The 3rd chapter of the book presents a set of best practices in designing and developing ‘good’ services — not necessarily microservices.

Image credits: http://www.revitmodelingindia.com/wp-content/uploads/2015/07/3d-bim-modeling-services1.jpg

Loose coupling and high cohesion are two important characteristics of any good design. When services are loosely coupled, a change to one service should not require a change to another. The whole point of a microservice is being able to make a change to one service and deploy it, without needing to change any other part of the system. With high cohesion, all related behavior sits together, and unrelated behavior sits elsewhere. The benefit of a highly cohesive system is that if we want to change behavior, we only need to change it in one place, and we can release that change quickly. If we have to change that behavior in lots of different places, we have to release lots of different services.

Bounded context is an important design pattern to follow in designing a ‘good’ service. The pattern was first introduced by Eric Evans in his book Domain-Driven Design. The idea is that any given domain consists of multiple bounded contexts, and each bounded context encapsulates related functionality into domain models and defines integration points to other bounded contexts. In other words, each bounded context has an explicit interface, where it decides what models to share with other contexts. By explicitly defining what models should be shared, and not sharing the internal representation, we can avoid the potential pitfalls that result in tight coupling. These modular boundaries are excellent candidates for microservices. In general, microservices should cleanly align to bounded contexts. If our service boundaries are aligned to the bounded contexts in our domain, and our microservices represent those bounded contexts, we are off to an excellent start in ensuring that our microservices are loosely coupled and strongly cohesive.
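As a rough illustration of this idea (my own sketch, not code from the book), here is a minimal Java example of a hypothetical ‘finance’ bounded context: the internal LedgerEntry model never leaves the context, and only an explicitly shared model is exposed to other contexts. All class and method names are made up for the illustration.

```java
// Internal domain model — never leaves the finance bounded context.
class LedgerEntry {
    final String itemCode;
    final long amountInCents;
    final java.time.Instant recordedAt;

    LedgerEntry(String itemCode, long amountInCents, java.time.Instant recordedAt) {
        this.itemCode = itemCode;
        this.amountInCents = amountInCents;
        this.recordedAt = recordedAt;
    }
}

// Shared model — the only representation other contexts (e.g. the warehouse) ever see.
class PayableInvoice {
    final String itemCode;
    final long amountInCents;

    PayableInvoice(String itemCode, long amountInCents) {
        this.itemCode = itemCode;
        this.amountInCents = amountInCents;
    }
}

// The explicit interface of the finance context; a microservice built around this
// bounded context would expose the same operations over the network.
interface FinanceContext {
    PayableInvoice invoiceFor(String itemCode);
}
```

Because other contexts only ever depend on PayableInvoice and FinanceContext, the internal LedgerEntry representation can change freely without breaking anyone else.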

Chapter — 4: Integration

The 4th chapter of the book explains different interaction patterns between microservices.

Image credits: http://emergetech.com/wp-content/uploads/2015/05/NetSuite-integration-page.jpg

The following are some takeaways from this chapter:

  • Avoid database integration at all costs: Having a shared database between multiple services binds each of those services to a particular database schema. Any change to the schema can effectively break all the dependent services, requiring fixing and redeploying.
  • Understand the trade-offs between REST and RPC, but strongly consider REST as a good starting point for request/response integration: The communication between two services can be either request/response based or event based. With request/response, a client initiates a request and waits for the response. This mode clearly aligns well with synchronous communication, but can work for asynchronous communication too: a client can kick off an operation and register a callback, asking the server to notify it when the operation has completed. With event-based collaboration, we invert things. Instead of a client initiating requests asking for things to be done, it instead says “this thing happened” and expects other parties to know what to do.
  • Prefer choreography over orchestration: Orchestration requires a central governing body, which has to know exactly which operations should be carried out and in which order. It can become a hub in the middle of a web, and a central point where logic starts to live. Under choreography there is no such central body governing everything. Once a given task is completed, the service notifies the system with a task-completed event, and any number of other services can subscribe to the events they are interested in and act upon receiving them.
  • Avoid breaking changes and the need to version by understanding Postel’s law and using tolerant readers: Postel’s law states: be conservative in what you do, be liberal in what you accept from others. In other words, a service that sends commands or data to other services should conform completely to the specifications, but a service that receives input from other services should accept non-conformant input as long as the meaning is clear. This is also known as the Robustness Principle. The book explains different ways of handling service versioning while limiting the impact on consumers and keeping the services manageable. Tolerant reader is a design pattern built on top of the Robustness Principle: it recommends being as tolerant as possible when reading data from a service. If you’re consuming an XML document, take only the elements you need and ignore anything you don’t (see the sketch after this list).
  • Think of user interfaces as compositional layers: Building a user interface may require talking to multiple services, which can get fairly chatty. Having an API gateway can help here, as you can expose calls that aggregate multiple underlying calls. This approach has its own drawbacks, as the author notes: it can end up as one giant layer for all the services, with everything thrown in together, and we lose isolation of our various user interfaces, limiting our ability to release them independently. To overcome this the author proposes the backends-for-frontends pattern, which allows the team focusing on any given UI to also handle its own server-side components.
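To make the tolerant reader idea concrete, here is a small sketch of my own (not from the book) that reads a JSON response with Jackson and picks out only the one field it needs, ignoring everything else in the payload. The payload shape and field names are hypothetical.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// A tolerant reader: take only the fields we need and ignore the rest,
// so new or renamed fields elsewhere in the payload don't break us.
public class CustomerEmailReader {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String readEmail(String responseBody) throws Exception {
        JsonNode root = MAPPER.readTree(responseBody);
        // path() never throws on missing nodes, so unknown fields are simply ignored
        // and a missing field degrades to an empty string instead of an exception.
        return root.path("customer").path("email").asText("");
    }

    public static void main(String[] args) throws Exception {
        // The payload carries far more than we care about; we stay tolerant of it.
        String body = "{\"customer\":{\"id\":42,\"email\":\"jane@example.com\",\"newField\":true}}";
        System.out.println(readEmail(body));   // jane@example.com
    }
}
```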

Chapter — 5: Splitting the Monolith

At almost every microservices meetup I have attended, one common question I hear is: how do we migrate from a monolith to microservices? Chapter 5 of the book answers this question in detail. It is one of the best chapters in the book.

The problem with a monolith is that it grows over time. It acquires new functionality and lines of code at an alarming rate. Before long it becomes a big, scary, giant presence in the organization that people are afraid to touch or change. For breaking down a monolithic application into microservices, the author recommends an incremental approach.

The book highlights the pace of change, team structure, security, and technology as possible drivers for the monolith-to-microservices transition.

The first step in breaking a monolithic application into microservices is to identify bounded contexts within the monolith. This can be an iterative process running from days to a few weeks, based on the complexity of the application. A bounded context encapsulates all its internal operations from the rest of the world and defines an object model for communicating with the outside world. We can also call these bounded contexts seams. A seam is a concept introduced by Michael Feathers in his book Working Effectively with Legacy Code. Identifying the seams in your monolithic application is the first step in splitting it.

The way monolithic applications are written, all their components talk to the same database. Even once we break the code into multiple seams, they will still talk to the same database, which is not the desired pattern in the microservices world. The author presents a few approaches in this chapter to tackle this problem.

One approach is to refactor the database in such a way that each seam has its data under its own control. The challenge we face here is with foreign-key relationships: we lose them altogether. Rather than preserving a foreign-key relationship at the data level, the author suggests maintaining it at the service layer. This means we need to implement our own consistency checks across services, or else trigger actions to clean up related data. This can be implemented with eventing. For example, if some action is performed on an item_code, an event is published to the corresponding topic, and all the subscribers to that topic (other services) can act based on the item_code.
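As a rough sketch of that eventing idea (my own illustration, not code from the book), one service publishes an “item deleted” event and another subscribes to it to clean up its own rows that reference the item_code. The service names are hypothetical, and a tiny in-memory publisher stands in for a real message broker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal in-memory stand-in for a message broker, to illustrate the pattern only.
class ItemEventsTopic {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> handler) { subscribers.add(handler); }

    void publishItemDeleted(String itemCode) {
        for (Consumer<String> handler : subscribers) handler.accept(itemCode);
    }
}

public class CrossServiceConsistencyDemo {
    public static void main(String[] args) {
        ItemEventsTopic topic = new ItemEventsTopic();

        // The finance service subscribes and cleans up its own data for that item_code;
        // no foreign key spans the two databases any more.
        topic.subscribe(itemCode ->
                System.out.println("finance: removing ledger rows for item " + itemCode));

        // The catalog service deletes an item and announces the fact.
        topic.publishItemDeleted("ITEM-1234");
    }
}
```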

Another challenge we face with data is how to treat static data. This data too resides in the database — and if we refactor the database, where do we place the static data? One option is to duplicate it in the database behind each microservice, or to write it down in a configuration file and refer to it from each service; the author prefers the latter. Another approach is to introduce a microservice over the static data, so all the other microservices talk to it to retrieve the data and then build a local cache.

Once the monolithic application is divided into seams and the database is refactored, the author still recommends running everything as the same monolithic application before splitting it into microservices. With separate schemas, we will potentially increase the number of database calls needed to perform a single action, and we could end up breaking transactional integrity. By splitting the application (by seams) and refactoring the database, but still keeping the code together, we give ourselves the ability to revert our changes or continue to tweak things without impacting any consumers of our service.

The author suggests three approaches to facilitate transactions in the microservices world:

  • Eventual consistency: Rather than using transactional boundaries to ensure that the system is in a consistent state when the transaction completes, we accept that the system will get itself into a consistent state at some point in the future. We could record failed operations in a queue or a log file and try them again later. For some sorts of applications this makes sense, but we have to assume that a retry will eventually succeed (see the sketch after this list).
  • Compensating transactions: With compensating transactions, if one operation fails the whole set is rejected. For the operations that have already been committed, a compensating transaction is executed to revert their effect. This is a common pattern in the SOA world too.
  • Distributed transactions: Distributed transactions try to span multiple transactions within them, using some overall governing process called a transaction manager to orchestrate the various transactions being done by underlying systems.
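Here is a minimal sketch of the eventual-consistency option from the first bullet above (my own illustration, not from the book): failed operations are parked in a retry queue and replayed later, on the assumption that the failure is transient.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of eventual consistency via a retry queue: if an operation fails,
// it is queued and replayed later rather than rolled back.
public class RetryQueueDemo {

    interface Operation { boolean attempt(); }

    private final Queue<Operation> pending = new ArrayDeque<>();

    void execute(Operation op) {
        if (!op.attempt()) {
            pending.add(op);           // park it; the system is temporarily inconsistent
        }
    }

    void retryPending() {              // run periodically, e.g. from a scheduler
        int n = pending.size();
        for (int i = 0; i < n; i++) {
            Operation op = pending.poll();
            if (!op.attempt()) pending.add(op);  // still failing, keep it for the next round
        }
    }
}
```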

In the author’s view, all three of these alternatives add complexity. His recommendation is to do everything possible to avoid splitting state that needs to be kept consistent.

Reporting is another aspect we need to deal with when splitting a monolithic application into microservices. Reporting would need to group together data from across multiple parts of our organization in order to generate a useful output. The author suggests two patterns to deal with this:

  • Data retrieval via service calls: There are many variants of this model, but they all rely on pulling the required data from the source system via API calls.
  • Data pumps: Rather than having the reporting system pull data, the data pumps pattern suggests a push approach: each service publishes all the data required for reporting to a topic, which the reporting service subscribes to.

Netflix needs to report across all its data, but given the scale involved this is a non-trivial challenge. Its approach is to use Hadoop with SSTable backups as the source of its jobs: Netflix backs up the Cassandra nodes running behind each of its services as SSTables in the Amazon S3 object store.

Chapter — 6: Deployment

Another gem! Chapter 6 of the book explains different deployment patterns for microservices and their respective pros and cons.

Image credits: https://i3-vso.sec.s-msft.com/dynimg/IC831068.png

The chapter starts by introducing Continuous Integration (CI) and Continuous Delivery (CD). When thinking about microservices and continuous integration, we need to think about how CI builds map to individual microservices. The author suggests having a single CI build per microservice, rather than one monolithic build for all the microservices, which allows a change to be made and validated quickly prior to deployment into production. Per-service CI builds can be carried out against the same source repository, pointing at different subdirectories, but this approach is not preferable. The author suggests that each microservice live in its own source repository with its own CI build process.

Spinnaker is an open source multi-cloud Continuous Delivery platform developed by Netflix for releasing software changes with high velocity and confidence.

Continuous Delivery (CD) is the approach where we get constant feedback on the production readiness of each and every check-in, and furthermore treat each and every check-in as a release candidate. To fully embrace this concept, we need to model all the processes involved in getting the software from check-in to production, and know where any given version of the software stands in terms of being cleared for release. In CD this is done by extending the idea of the multistage build pipeline to model every stage the software has to go through, both manual and automated. A typical CD pipeline includes stages for build automation and continuous integration, test automation, and deployment automation. By modeling the entire path to production for the software, we greatly improve visibility of the quality of our software, and can also greatly reduce the time taken between releases. In a microservices world, where we want to be able to release our services independently of each other, the author recommends having one pipeline per service.

In our CD pipelines it is an artifact that we want to create and move through the path to production. From the point of view of a microservice, depending on your technology, the artifact produced at the end of the pipeline may not be enough by itself. A Java JAR file can be made executable and can run an embedded HTTP process, while for things like Ruby and Python applications we would expect to use a process manager running inside Apache or Nginx. This raises the need for automation to take care of installing and configuring all the other software required to deploy and launch our artifacts, our microservices. Puppet, Chef, and Ansible are a few automated configuration management tools widely used in the industry.

The author discusses multiple types of artifacts that can be produced at the end of the build process. Operating system artifacts include RPMs for RedHat- or CentOS-based systems, deb packages for Ubuntu, and MSIs for Windows. The advantage of using OS-specific artifacts is that, from a deployment point of view, we don’t care what the underlying technology is; we just use the tools native to the OS to install the package. One downside the author highlights with this approach is that the overhead of managing artifacts for different operating systems can be pretty steep.

The other type of artifact highlighted in this chapter is custom images. Virtual machine images that bake in some of the common dependencies required at runtime can reduce spin-up time. Spin-up time is a challenge with automated configuration management systems like Puppet, Chef, and Ansible: they take time to run their scripts on a machine. Custom images can be built with the common tools required to run the software; when we want to deploy the software, we just spin up an instance of the corresponding custom image. Netflix has adopted the model of baking its own services into AWS AMIs. A couple of drawbacks with the custom-image approach are the time it takes to build an image and the size of the image. These drawbacks point toward container technologies like Docker.

Packer is a tool for creating machine and container images for multiple platforms from a single source configuration.

Just as with operating-system-specific packages, VM images become a nice way of abstracting out the differences in the technology stacks used to create the services. With this abstraction, the author presents another deployment concept: immutable servers. Once the server images are built from the configuration loaded from the corresponding repository, no changes should be made to the running instance. This is the basic concept behind immutable servers. If someone logs into a running instance and makes changes there, so that the configuration of the running instance differs from what it was built from, it leads to a problem well known as configuration drift. One should be able to build identical immutable servers repeatedly from the same configuration loaded from the corresponding repository.

Most enterprises deploy their software in multiple environments: QA, test, staging, and production. Even though it is the same artifact that runs in each environment, some small portion of the configuration changes from one environment to another. For example, the server instance running in QA connects to the QA database, while the production instance connects to the production database. To tackle these kinds of scenarios, the author suggests using the same artifact produced by Continuous Delivery, with different property files carrying the properties that change from one environment to another.
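A minimal sketch of that idea (my own; the file names and property key are assumptions, not from the book): the same build artifact is started in every environment, and only a per-environment properties file such as config-qa.properties or config-prod.properties supplies the values that differ, like the database URL.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// One artifact for every environment; only the properties file differs.
public class AppConfig {

    public static void main(String[] args) throws IOException {
        // e.g. -Denv=qa or -Denv=prod; defaults to qa for local runs
        String env = System.getProperty("env", "qa");

        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("config-" + env + ".properties")) {
            props.load(in);
        }

        // Hypothetical property key; the QA file points at the QA database,
        // the production file at the production one.
        String dbUrl = props.getProperty("database.url");
        System.out.println("Starting service against " + dbUrl);
    }
}
```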

When deploying microservices, what would be the best service-to-host mapping? ‘Host’ here does not necessarily mean a physical machine; it is basically an instance of an operating system. Against the multiple-services-per-host approach, the author recommends a single-service-per-host approach. With a single service per host we avoid the side effects of multiple services living on a single host, making monitoring and remediation much simpler. It also avoids a big-bang single point of failure: an outage of one host should impact only a single service. The author also suggests looking at technologies like LXC or Docker to make the management of the moving parts cheaper and easier.

Chapter — 7: Testing

Chapter 7 of the book talks about different testing methodologies and their applicability in the microservices world.

Image credits: https://www.accusoft.com/wp-content/uploads/2016/04/SURGE_Blog_Post_banner.jpg

The chapter starts by explaining Brian Marick’s testing quadrant. At the bottom of the quadrant we have the technology-facing tests, tests that aid developers in creating the system in the first place. Performance tests and small-scoped unit tests fall into this category, and they are typically automated. The top half of the quadrant covers business-facing tests, tests that help nontechnical stakeholders understand how the system works. These include large-scoped, end-to-end tests: acceptance testing and exploratory (usability) testing.

In his book Succeeding with Agile, Mike Cohn outlines a model called the Test Pyramid to help explain what types of automated tests we need. It also helps to identify the scope the tests should cover and the proportions of the different types of tests we should aim for. Cohn’s model splits automated tests into unit, service, and UI tests, from the bottom to the top of the pyramid, implying that we need many more unit tests than UI tests. The author prefers the name end-to-end tests for UI tests.

Unit tests typically test a single function or method call. Service tests are designed to bypass the user interface and test services directly. End-to-end tests are run against the entire system. Having many small-scoped unit tests is more important than having many large-scoped end-to-end or service tests. This keeps the feedback cycle short: anything that fails can be detected early in the development process. A common anti-pattern, often referred to as a test snow cone or inverted pyramid, describes a scenario with few or no small-scoped tests and all the coverage in large-scoped tests.

The author discusses two approaches for isolating a service under test from its downstream collaborators: stubbing and mocking. When we write a service test that needs to collaborate with multiple downstream services, a stub service responds with canned responses to known requests from the service under test. A mock goes further and makes sure the call was actually made to the downstream service; if the expected call is not made, the test fails.
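Here is a hand-rolled sketch of that difference (my own illustration; real projects would more likely use a library such as Mockito or WireMock): both replace a hypothetical downstream points service, but only the mock verifies that the interaction actually happened.

```java
// The downstream collaborator of the service under test.
interface PointsService {
    int pointsFor(String customerId);
}

// Stub: returns a canned answer, makes no assertions about how it is used.
class PointsServiceStub implements PointsService {
    public int pointsFor(String customerId) { return 100; }
}

// Mock: also records the interaction so the test can verify the call was made.
class PointsServiceMock implements PointsService {
    boolean wasCalled = false;

    public int pointsFor(String customerId) {
        wasCalled = true;
        return 100;
    }

    void verifyCalled() {
        if (!wasCalled) throw new AssertionError("expected pointsFor() to be called");
    }
}

public class LoyaltyServiceTestSketch {
    public static void main(String[] args) {
        PointsServiceMock mock = new PointsServiceMock();
        // Imagine the service under test using the mock here...
        mock.pointsFor("customer-42");
        mock.verifyCalled();   // fails the test if the downstream call never happened
        System.out.println("interaction verified");
    }
}
```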

In the microservices world, a typical environment contains many downstream services, and in some cases the same downstream service can appear in the deployment under multiple versions. This raises the question of which version of the downstream service we should use. Another problem: if we have a set of service tests that deploy lots of services and run tests against them, what about the end-to-end tests that the other services run? If they are testing the same thing, we may find ourselves covering lots of the same ground and duplicating much of the effort to deploy all those services in the first place. The approach the author suggests to overcome both problems is to have multiple pipelines fan in to a single end-to-end test stage (you definitely should read this chapter in the book; it explains everything very clearly). Each service has its own pipeline, with build, unit-test, and service-test phases, and finally all these pipelines fan in to a single, common end-to-end test phase. Any time any of our services change, we run the tests local to that service, and if those tests pass, we trigger the end-to-end tests.

End-to-end tests bring a lot of benefits; at the same time there are many disadvantages to end-to-end testing. As test scope increases (moving up the pyramid), so does the number of moving parts. These moving parts can introduce test failures that do not show that the functionality under test is broken, but that some other problem has occurred. If you have tests that sometimes fail but pass when simply re-run, you have flaky tests. The author suggests removing flaky tests from the suite as soon as they are detected, until they are fixed properly. A test suite with flaky tests can become a victim of what Diane Vaughan calls the normalization of deviance: the idea that over time we become so accustomed to things being wrong that we start to accept them as normal and not a problem.

Another disadvantage the author highlights with end-to-end tests is test ownership. Some organizations have a dedicated team to write these tests, which makes the team developing the software increasingly distant from the tests for its code. The long feedback cycles associated with end-to-end tests are another problem: with a long test suite, any break takes a while to fix, which reduces the amount of time the end-to-end tests can be expected to be passing. While a broken integration test stage is being fixed, more changes from upstream teams can pile in, and these new changes can make fixing the issue much harder. One way to resolve this is to not let people check in while the end-to-end tests are failing, but with a long test suite this is often impractical. A key reason we can release our software frequently is the idea that we release small changes as soon as they are ready.

In a microservices deployment there can be multiple services as well as multiple versions of the same service. When we version together the changes made to multiple services (a metaversion), we effectively embrace the idea that changing and deploying multiple services at once is acceptable. The author flags this as an anti-pattern: in doing so, we cede one of the main advantages of microservices, the ability to deploy one service by itself, independently of other services.

Towards the end of the chapter, the author introduces consumer-driven tests as a way to overcome the issues discussed above with end-to-end tests. One of the key issues we try to address with integration (or end-to-end) tests is ensuring that when we deploy a new service to production, the new changes won’t break consumers. One way to do this without running tests against a real consumer is to use consumer-driven contracts (CDCs). A CDC captures the expectations of the consumers. Because these CDCs are expectations of how the service should behave, they can be run against the service by itself, with any of its downstream dependencies stubbed out. Pact is a consumer-driven contract testing tool the author highlights in this chapter, along with Pacto.

Testing after production is another important aspect the author highlights in this chapter. He suggests having a smoke test suite, which should be run against the production deployment before directing any production load at it: a collection of tests designed to be run against newly deployed software to confirm that the deployment works. Blue/green deployment is another approach. With this model we have two copies of the software deployed at a time, but only one version is receiving real requests. A change is introduced to the node that is not accepting production traffic, tested there, and then traffic is routed to that node from the currently active one. Canary releasing is yet another approach. With this model, we verify the newly deployed software by directing a small amount of production traffic at it to see if it performs as expected, and increase the load over time. This differs from blue/green deployment in that all the nodes accept production load, in different proportions, whereas with blue/green deployment only one node accepts production load at a given time. Netflix uses the canary releasing approach extensively.

Chapter — 8: Monitoring

A typical microservices deployment consists of many services. Monitoring these individual services and their interactions adds a lot of complexity. Chapter 8 of the book builds a great discussion on monitoring microservices.

Image credits: https://www.thesocialsavior.com/wp-content/uploads/2015/02/small-business-analytics.jpg

With a monolithic application we have a very obvious place to start our investigation when an issue occurs: the monolithic application itself. In the microservices world, the capabilities we offer users are served from multiple small services, some of which communicate with yet more services to accomplish their tasks. We now have multiple servers to monitor, multiple log files to sift through, and multiple places where network latency could cause problems.

This chapter presents a set of guidelines for monitoring, at each service level and for the system.

For each service:

  • Track inbound response time at a bare minimum. Once that’s done, follow with error rates and then start working on application level metrics.
  • Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates. Libraries like Hystrix can help here.
  • Standardize on how and where metrics are collected.
  • Log into a standard location, in a standard format if possible. Aggregation is a pain if every service uses a different layout.
  • Monitor the underlying operating system so we can track down rogue processes and do capacity planning.

For the system:

  • Aggregate host-level metrics like CPU together with application level metrics.
  • Ensure the metric storage tool allows for aggregation at a system or service level, and drill down to individual hosts.
  • Ensure the metric storage tool allows to maintain data long enough to understand trends in the system.
  • Have a single, queryable tool for aggregating and storing logs.
  • Strongly consider standardizing on the use of correlation IDs (see the sketch after this list).
  • Understand what requires a call to action, and structure alerting and dashboards accordingly.
  • Investigate the possibility of unifying how we aggregate all of our various metrics by seeing if a tool like Suro or Riemann makes sense.
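As a sketch of the correlation ID idea (my own, using the javax.servlet API and SLF4J’s MDC, which are not prescribed by the book): a filter at the edge assigns an ID to each incoming request if one is not already present and puts it in the logging context, and the same header is then copied onto every downstream call so one user action can be traced across services. The header name is a common convention, not mandated anywhere.

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Attaches a correlation ID to every request so log aggregation can stitch
// together one user action as it flows across many microservices.
public class CorrelationIdFilter implements Filter {

    private static final String HEADER = "X-Correlation-Id";   // common convention

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;

        // Reuse the ID from the caller if present, otherwise mint a new one at the edge.
        String correlationId = http.getHeader(HEADER);
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }

        MDC.put("correlationId", correlationId);   // every log line now carries the ID
        try {
            chain.doFilter(request, response);
            // Any outbound HTTP call made while handling this request should copy
            // the same header so the trace continues downstream.
        } finally {
            MDC.remove("correlationId");
        }
    }

    @Override public void init(javax.servlet.FilterConfig config) { }
    @Override public void destroy() { }
}
```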

The author initially breaks the discussion down into three high-level categories, based on the deployment:

  • Single service — single server: We need monitoring here to know when something goes wrong, so we can fix it.
  • Single service — multiple servers: We have multiple copies of a service running on separate hosts, with requests to the different service instances distributed via a load balancer. Monitoring is required here as in the previous case, but it needs to be done in a way that lets us isolate the problem at the service-instance level.
  • Multiple services — multiple servers: Monitoring such deployments would require collection and central aggregation of as much as we can get our hands on, from logs to application metrics.

Monitoring requires excellent tooling support. In this chapter the author recommends the following set of tools, and also suggests looking at his publication Lightweight Systems for Realtime Monitoring.

  • Nagios: Nagios provides monitoring of all components including applications, services, operating systems, network protocols, systems metrics, and network infrastructure.
  • logrotate: logrotate is a Linux tool designed to ease administration of systems that generate large numbers of log files. It allows automatic rotation, compression, removal, and mailing of log files. Each log file may be handled daily, weekly, monthly, or when it grows too large.
  • SSH multiplexers: These help in monitoring multiple hosts by allowing the same commands to be run on several hosts at once.
  • logstash: logstash can parse multiple log file formats and send them to downstream systems for further investigation.
  • Kibana: Kibana is an Elasticsearch-backed system for viewing logs. We can use its query syntax to search through logs, allowing us to do things like restrict time and date ranges or use regular expressions to find matching strings. Kibana can also generate graphs from the logs.
  • Graphite: Graphite exposes a very simple API and allows metrics to be sent in real time. It then allows those metrics to be queried to produce charts and other displays showing what is happening.
  • Zipkin: Zipkin is a distributed tracing system that helps to gather timing data for all the disparate services at Twitter. It manages both the collection and lookup of this data through a Collector and a Query service. Zipkin is closely modeled after the Google Dapper paper.
  • Riemann: Riemann aggregates events from servers and applications with a powerful stream processing language. It can send an email for every exception in an app, track the latency distribution of web apps, see the top processes on any host by memory and CPU, combine statistics from every Riak node in a cluster and forward them to Graphite, and track user activity from second to second.
  • Suro: Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is used to handle both metrics associated with user behavior and more operational data like application logs. This data can be dispatched into a variety of systems, like Storm for real-time analytics, Hadoop for offline batch processing, or Kibana for log analysis.

Chapter — 9: Security

This is the area that interests me most. In addition to what is presented in chapter 9 of the book, I would also recommend going through the two blog posts I wrote on microservices security: How Netflix Secures Microservices and Securing Microservices with OAuth 2.0, JWT and XACML.

Image credits: https://udemy-images.udemy.com/course/750x422/375826_0581_4.jpg

With the granularity of the services and the frequent interactions between them, securing microservices is challenging. There are multiple angles to securing microservices: one is the system security point of view and the other is the application security point of view. Firewalls, intrusion detection (and prevention) systems, hardened operating systems, network segregation, auditing at the system level, and patch management all play a key role in system-level security.

Application-level security can be divided into two parts: end-user interactions and service-to-service interactions. End users can be authenticated to a system (say a web app or mobile app) via SAML, OpenID Connect, or other industry-accepted standards. Once the end user is authenticated, these systems need to talk to backend microservices, either directly (as themselves) or on behalf of the end user; OAuth 2.0 is a standard way of securing such communication. Service-to-service communication can be secured either with mutual TLS or with a JWT-based approach.
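For the JWT-based approach, here is a minimal sketch of my own (not from the book or my blog posts) of one service calling another with a bearer token, using the JDK 11 HttpClient. The URL and the way the token is obtained are placeholders; in a real deployment the token would be issued by an OAuth 2.0 authorization server and validated by the receiving service.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Service-to-service call carrying a JWT as an OAuth 2.0 bearer token.
public class OrderServiceClient {

    public static void main(String[] args) throws Exception {
        // Placeholder: in practice the token is obtained from the authorization server
        // (e.g. via the client credentials grant) and cached until it expires.
        String jwt = System.getenv("SERVICE_JWT");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://orders.internal/api/orders/42"))
                .header("Authorization", "Bearer " + jwt)
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        // The receiving microservice validates the signature, issuer, audience, and expiry
        // of the JWT before trusting any claims in it.
    }
}
```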

The author also suggests making sure the application goes through static and dynamic code analysis to detect the most common code-level vulnerabilities (for example, the OWASP Top 10) before the system is deployed into production. All these tests can be automated and made part of the build process.

Chapter — 10: Conway’s Law and System Design

Chapter 10 of the book focuses on organizational issues that need attention while moving towards a fine-grained, microservices-based architecture.

Image credits: https://dn-linuxcn.qbox.me/data/attachment/album/201504/18/220023ozllzbql2lqlyzle.jpg

Conway’s law highlights the perils of trying to enforce a system design that does not match the organization. Melvin Conway’s paper How Do Committees Invent?, published in Datamation magazine in April 1968, observed that “any organization that designs a system will inevitably produce a design whose structure is a copy of the organization’s communication structure.” This statement is often quoted as Conway’s law.

The paper Exploring the Duality Between Product and Organizational Architectures looks at different software systems, loosely categorized as being created either by loosely coupled or tightly coupled organizations. A tightly coupled organization could be one where all the employees are typically colocated, with strongly aligned vision and goals; loosely coupled organizations are well represented by distributed open source communities. The study found that the more loosely coupled organizations produced more modular, less coupled systems, whereas the more tightly coupled organizations produced less modular software.

The author highlights Amazon and Netflix as two organizations that believe the organization and its architecture should be aligned. Amazon believes in the benefits of teams owning the whole lifecycle of the systems they manage: it wanted teams to own and operate the systems they looked after, managing the entire lifecycle. Amazon also believes in small teams. This drive for small teams owning the whole lifecycle of their services is a major reason Amazon developed AWS; it needed to create the tooling to allow its teams to be self-sufficient. Netflix followed Amazon’s example.

The author explains service ownership as the team owning a service being responsible for making changes to that service. For many teams, ownership extends to all aspects of the service, from sourcing requirements to building, deploying, and maintaining the application. This model is especially prevalent with microservices, where it is easier for a small team to own a small service. This increased level of ownership leads to increased autonomy and speed of delivery.

Shared service ownership is another model many teams follow; the author highlights it as suboptimal. Services that are hard to split, feature teams (UI team, database team), and delivery bottlenecks are some of the drivers towards the shared-ownership model. If it is hard to break out of this model, the author proposes an approach he calls internal open source, which brings open source fundamentals inside the organization: when the people who originally worked on a service are no longer on a team together, perhaps scattered across the organization, they can submit pull requests with new features or bug fixes for the original service to the team that currently owns it.

At the end of the chapter, the author also highlights the reverse of Conway’s law: system designs can drive organizations to change their structure and communication patterns.

Chapter — 11: Microservices at Scale

Another gem! Chapter 11 talks about the challenges in a large-scale microservices deployment and the patterns for overcoming them. This is the longest chapter of the book — and also the best!

Image credits: http://www.jorgeleon.mx/wp-content/uploads/2014/04/Fibre.jpg

In a distributed computing environment many things can go wrong. Baking in the assumption that everything can and will fail leads us to think differently about how we solve problems. We need to worry about cross-functional requirements and consider aspects like durability of data, availability of services, throughput, and acceptable latency of services.

In 1994, Peter Deutsch, a Sun Fellow at the time, drafted seven assumptions that architects and designers of distributed systems are likely to make, which prove wrong in the long run, resulting in all sorts of trouble and pain for the solutions and the architects who made the assumptions. In 1997 James Gosling added an eighth such fallacy. The assumptions are now collectively known as “the eight fallacies of distributed computing”:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to degrade functionality safely. For example, consider a web site that depends on a set of microservices: if a single microservice is down, the web site should still be able to operate, just without the functionality provided by the failed microservice. In a monolithic application, by contrast, we don’t have many decisions to make; system health is binary.

Failures are unavoidable. Ariel Tseitlin coined the concept of the anti-fragile organization in regard to how Netflix operates. Netflix takes an aggressive approach, writing programs that cause failure and running them in production on a daily basis (the Netflix Simian Army). Google too goes beyond simple tests to mimic server failure, and as part of its annual Disaster Recovery Test (DiRT) exercises it has simulated large-scale disasters such as earthquakes.

The author describes four techniques to handle failures in a system:

  • Timeouts: Keep timeouts on all out-of-process calls, and pick a default timeout for everything. Log when timeouts occur, look at what happens, and change them accordingly.
  • Circuit Breaker: The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips. The chapter explains in detail the various benefits a system gains from the circuit breaker pattern (see the sketch after this list).
  • Bulkheads: The bulkhead pattern is used to isolate dependencies and limit concurrent access. For example, separate thread pools are used per dependency so that concurrent requests are constrained. Latency on the underlying executions will saturate the available threads only in that pool. Using semaphores instead of thread-pools is also an option, which allows for load shedding, but not time-outs.
  • Isolation: The more one service depends on another being up, the more the health of one impacts the ability of the other to do its job. When services are isolated from each other, much less coordination is needed between service owners. The less coordination needed between teams, the more autonomy those teams have, as they are able to operate and evolve their services more freely.
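Here is a minimal circuit breaker sketch of my own (production systems would typically use something like Hystrix, mentioned earlier, or a similar library): after a threshold of consecutive failures the breaker opens and calls fail fast until a cooldown period has elapsed, after which one trial call is allowed through.

```java
import java.util.concurrent.Callable;

// Toy circuit breaker: opens after N consecutive failures, fails fast while open,
// and allows a trial call again once the cooldown has passed (half-open behaviour).
public class CircuitBreaker {

    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public <T> T call(Callable<T> protectedCall) throws Exception {
        if (isOpen() && System.currentTimeMillis() - openedAt < openMillis) {
            throw new IllegalStateException("circuit open, failing fast");
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures = 0;                     // success closes the breaker again
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();   // trip the breaker
            }
            throw e;
        }
    }

    private boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }
}
```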

Idempotency is another important attribute our services should have. In idempotent operations, the outcome does not change after the first application, even if the operation is applied multiple times. If operations are idempotent, we can repeat a call multiple times without adverse impact. For example, HTTP GET is an idempotent operation: it should never change the state of the resource. HTTP PUT is also idempotent: if you apply the same update to a resource again and again, the end result is the same. Idempotent operations work well with event-based collaboration, and can be especially useful if you have multiple instances of the same type of service subscribing to events. Even if we store which events have been processed, with some forms of asynchronous message delivery there may be small windows where two instances can see the same message. By processing the events in an idempotent manner, we ensure this won’t cause any issues.
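A small sketch of idempotent event handling (my own illustration, with made-up event and service names): the consumer remembers which event IDs it has already processed, so receiving the same message twice has no additional effect.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent event consumer: processing the same event twice changes nothing.
public class PaymentEventConsumer {

    // In a real service this set would live in a durable store, not in memory.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    public void onPaymentReceived(String eventId, String orderId) {
        if (!processedEventIds.add(eventId)) {
            return;   // duplicate delivery, safely ignore it
        }
        System.out.println("marking order " + orderId + " as paid");
    }

    public static void main(String[] args) {
        PaymentEventConsumer consumer = new PaymentEventConsumer();
        consumer.onPaymentReceived("evt-1", "order-42");
        consumer.onPaymentReceived("evt-1", "order-42");   // redelivered, no effect
    }
}
```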

We scale our systems for two reasons: to deal with failure and for performance. This chapter highlights six techniques that can be used to scale a microservices deployment.

  • Go Bigger: This is also called vertical scaling. Getting a bigger box with a faster CPU and better I/O can often improve latency and throughput. This has limits: no matter how much resource you add, if the application is not written in a way that utilizes those resources optimally, it will hit a ceiling.
  • Splitting Workloads: Having a single microservice per host is the preferable deployment model. Even then, if a microservice takes too much load, that is the point at which to think about dividing it further into more services, based on functionality.
  • Spreading Your Risk: You should avoid having multiple services running on the same host, where an outage would impact multiple services. Even in a virtualized environment it is common to have a virtual machine’s root partition mapped to a single SAN (storage area network); if that SAN goes down, it can take down all connected VMs. Another common form of separation to reduce failure is to ensure that not all your services are running in a single rack in the data center, or that your services are distributed across more than one data center.
  • Load Balancing: Load balancers let us add more instances of a microservice in a way that is transparent to any service consumers. This gives an increased ability to handle load and also reduces the impact of a single host failing. However, if we have multiple microservice instances on different host machines but only a single host running the database instance, our database is still a single point of failure.
  • Worker-Based Systems: Load balancing is not the only way to have multiple instances of a service share the load and reduce fragility. In a worker-based system, a set of workers consumes a shared backlog of work — for example, a pool of worker threads waiting on a queue: as soon as the queue is populated, a worker picks up an item, processes it, and returns to the pool (see the sketch after this list).
  • Starting Again: The architecture that gets you started may not be the architecture that keeps you going when your system has to handle very different volumes of load. As Jeff Dean said in his presentation “Challenges in Building Large-Scale Information Retrieval Systems”, we should design for 10x growth — but plan to rewrite before 100x. At some point we need to do something pretty radical to support the next level of growth.
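A minimal sketch of the worker-based model from the list above (my own illustration): a shared queue holds the backlog and a small pool of workers drains it, so adding capacity is just a matter of adding workers.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Worker-based scaling: several workers share one backlog of work.
public class WorkerPoolDemo {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(3);   // scale by adding workers

        for (int i = 0; i < 3; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        String job = backlog.take();      // blocks until work arrives
                        System.out.println(Thread.currentThread().getName() + " processed " + job);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // shut down cleanly
                }
            });
        }

        for (int i = 1; i <= 10; i++) backlog.put("job-" + i);

        Thread.sleep(500);          // let the workers drain the queue in this demo
        workers.shutdownNow();
    }
}
```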

Scaling stateless microservices is fairly straightforward. But what if we are storing data in a database? The author explains five techniques for scaling databases.

  • Availability of Service vs. Durability of Data: Separate the concept of availability of the service from durability of the data. For example, we can store a copy of all the data written to a database in a resilient file system; if the database goes down, the data is not lost, though the availability of the service suffers. Another model is that all the data written to the primary database gets copied to a standby replica database. If the primary goes down, the data is safe in the secondary, but we need a mechanism either to restore the primary or to promote the secondary.
  • Scaling for Reads: Many microservices are read-mostly. Caching data can play a large part here. Another model is to use read replicas: all writes go to one database node, and the data gets copied to the read replicas; all read operations initiated from the service layer happen against those replicas. With this technique, reads may sometimes see stale data until the replication has completed, but eventually reads will see consistent data. Such a setup is called eventually consistent.
  • Scaling for Writes: Sharding is one approach. With sharding we have multiple database nodes. We take the data to be written, apply a hashing function to the key of the data and, based on the result, determine which shard to send the data to (see the sketch after this list).
  • Shared Database Infrastructure: One running database can host multiple independent schemas, one for each microservice. This can be useful for reducing the number of machines needed to run the system, but it introduces a single point of failure.
  • CQRS: The Command-Query Responsibility Segregation (CQRS) pattern refers to an alternate model for storing and querying information. With a normal database we use one system both for modifying data and for querying it. With CQRS, part of the system deals with commands, which capture requests to modify state, while another part deals with queries.
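To make the sharding bullet above concrete, here is a small sketch of my own of picking a shard by hashing the key; the modulo scheme is the simplest possible form.

```java
// Choose a database shard from the key of the data being written.
public class ShardRouter {

    private final int numberOfShards;

    public ShardRouter(int numberOfShards) {
        this.numberOfShards = numberOfShards;
    }

    public int shardFor(String key) {
        // floorMod avoids negative indexes for negative hash codes.
        return Math.floorMod(key.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        System.out.println("customer-42 -> shard " + router.shardFor("customer-42"));
        System.out.println("customer-99 -> shard " + router.shardFor("customer-99"));
    }
}
```

Note that with this simple scheme, adding a shard later changes nearly every key’s placement, so real systems often use consistent hashing or a lookup service to limit how much data has to move.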

Caching is commonly used for performance optimization. This chapter explains several caching techniques with their pros and cons: client-side, proxy, and server-side caching, and caching in HTTP. Caching can be done for both reads and writes.

Towards the end of the chapter, the author explains the benefits of auto-scaling and the applicability of the CAP theorem to the microservices world. The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency (all nodes see the same data at the same time), availability (every request receives a response about whether it succeeded or failed), and partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures).

Service discovery is another important aspect of the microservices world. The author explains two techniques for service discovery: DNS and dynamic service registries. ZooKeeper, Consul, and Eureka are a few tools recommended by the author that can function as service registries.

Chapter — 12: Bringing It All Together

This is the final chapter of the book — and the end of an amazing journey through the microservices world.

Image credits: http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/07/iphone-5-diary-pen-book.jpg

This chapter summarizes seven principles of microservices.

  • Model around business concepts.
  • Adopt the culture of automation.
  • Hide internal implementation details.
  • Decentralize all the things.
  • Independently deployable.
  • Isolate failures.
  • Highly observable.

At the end of the book, the author provides valuable advice for anyone ready to embark on the microservices journey. The less we understand the domain or the problem space we are in, the harder it will be to find proper bounded contexts for a microservices architecture. Getting service boundaries wrong can result in lots of changes to service-to-service collaboration — an expensive operation. Greenfield development is quite challenging in the microservices world: it is much easier to chunk up something we have than something we do not. The author advises considering starting with a monolith first and breaking things out once it is stable.

References

  1. Building Microservices, http://www.amazon.com/Building-Microservices-Sam-Newman/dp/1491950358
  2. Service Instance per VM, http://microservices.io/patterns/deployment/service-per-vm.html
  3. Bounded Context, http://martinfowler.com/bliki/BoundedContext.html
  4. Tolerant Reader, http://martinfowler.com/bliki/TolerantReader.html
  5. Server-side Service Discovery, http://microservices.io/patterns/server-side-discovery.html
  6. Client-side Service Discovery, http://microservices.io/patterns/client-side-discovery.html
  7. Netflix Eureka, https://github.com/Netflix/eureka/wiki/Eureka-at-a-glance
  8. Circuit Breaker, http://martinfowler.com/bliki/CircuitBreaker.html
  9. Michael Nygard on Building Resilient Systems, https://www.infoq.com/interviews/Building-Resilient-Systems-Michael-Nygard
  10. Netflix Hystrix — Latency and Fault Tolerance for Complex Distributed Systems, https://www.infoq.com/news/2012/12/netflix-hystrix-fault-tolerance
