Stability patterns

Timeouts

A simple mechanism to stop waiting for an answer after a certain period of time.

It’s essential that any resource pool that blocks threads have a timeout, so that calling threads eventually unblock whether resources become available or not.
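As a minimal sketch of that rule, assuming a pool built on Python's standard `queue.Queue` (the names `BoundedPool` and `PoolTimeout` are made up for illustration), the acquire call can give up after a bounded wait instead of blocking forever:

```python
import queue


class PoolTimeout(Exception):
    """Raised when no resource became available within the timeout."""


class BoundedPool:
    """A tiny blocking resource pool; all names here are illustrative."""

    def __init__(self, resources):
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)

    def acquire(self, timeout_s=2.0):
        # Block for at most timeout_s seconds, then give up instead of hanging.
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolTimeout(f"no resource free after {timeout_s}s") from None

    def release(self, resource):
        self._free.put(resource)
```

The caller then decides what to do with `PoolTimeout`: fail the request, or queue the work for a later retry.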

Use a generic gateway to provide the template for connection handling, error handling, query execution, and result processing.

Queueing the work for a delayed retry is a good thing; it makes the system more robust.

Things to remember

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses
Apply Timeouts to recover from unexpected failures
Consider delayed retries
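One way to picture a delayed retry is a queue that holds failed work until a retry time has passed. The sketch below is an assumption about how such a queue might look, not a prescribed design; `RetryQueue` and the fixed `delay_s` are hypothetical, and the current time is passed in so the behaviour is easy to test.

```python
import heapq
import itertools


class RetryQueue:
    """Holds failed work items until their retry time arrives (illustrative)."""

    def __init__(self, delay_s=30.0):
        self._delay_s = delay_s
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so items are never compared

    def defer(self, item, now):
        """Schedule item to be retried delay_s seconds after now."""
        heapq.heappush(self._heap, (now + self._delay_s, next(self._seq), item))

    def due(self, now):
        """Pop and return every item whose retry time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

A worker would call `due(time.monotonic())` periodically and re-attempt whatever comes back, deferring again on another failure.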

Circuit Breaker

The circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system.

When the circuit is “open”, calls to the circuit breaker fail immediately, without any attempt to execute the real operation.

Circuit breakers are a way to automatically degrade functionality when the system is under stress.

Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring.

A circuit breaker should be built at the scope of a single process. Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses.
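The behaviour described above can be sketched as a small in-process class. This is an illustration under assumptions, not a reference implementation: the failure threshold, the reset delay, and the “half-open” trial state are common choices that these notes do not spell out, and the class is not thread-safe.

```python
import logging
import time

log = logging.getLogger("circuit_breaker")  # hypothetical logger name


class CircuitOpenError(Exception):
    """Raised when a call is rejected without trying the real operation."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"  # exposed so monitoring can query it

    def _set_state(self, new_state):
        if new_state != self.state:
            log.warning("circuit breaker %s -> %s", self.state, new_state)
            self.state = new_state

    def call(self, operation, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit is open; failing immediately")
            self._set_state("half-open")  # assumption: allow one trial call
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half-open" or self._failures >= self.max_failures:
                self._opened_at = self._clock()
                self._set_state("open")
            raise
        self._failures = 0
        self._set_state("closed")
        return result
```

Every state change goes through `_set_state`, so it is always logged, and `state` can be read by a monitoring endpoint.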

Bulkheads

By partitioning your systems, you can keep a failure in one part of the system from destroying everything.

In the cloud, you should run instances in different divisions of the service, such as separate availability zones or regions.

The goal is to identify the natural boundaries that let you partition the system in a way that is both technically feasible and financially beneficial.

At smaller scales, **process binding** is an example of partitioning via bulkheads.

Bulkheads are effective at maintaining service, or partial service, even in the face of failures. They are especially useful in service-oriented architectures.

Things to remember

Save part of the ship
Pick a useful granularity
Consider bulkheads particularly with shared services models
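Inside a single process, one way to approximate a bulkhead is to cap how many concurrent calls each dependency may hold, so a slow dependency cannot consume every worker. The sketch below assumes a semaphore per dependency; `Bulkhead` and `BulkheadFull` are hypothetical names.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a partition has no free capacity."""


class Bulkhead:
    """Caps concurrent calls to one dependency (illustrative sketch)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than queue behind
        # a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"bulkhead '{self.name}' is full")
        try:
            return operation(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each shared service its own `Bulkhead` instance means exhaustion in one partition leaves the others untouched, which is the "save part of the ship" idea.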

Steady state

“fiddling” - to handle something idly, ignorantly, or destructively.

The logical extreme of the “no fiddling” approach is immutable infrastructure.

The steady state pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.

Several types of sludge accumulate:

Data purging - cleaning useless data from the database once the application has been in production for a certain amount of time.

Log files - they can fill up the file system and jeopardize system functionality or stability. Log file rotation requires just a few minutes of configuration. Log files on production systems have a terrible signal-to-noise ratio. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.

**In-memory caching -** caches need a limit and an invalidation mechanism, such as a time-based cache flush. Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts.
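A time-based flush combined with a size limit might look like the sketch below. It is only an illustration: `TTLCache`, its defaults, and the oldest-first eviction policy are assumptions, not something the notes prescribe.

```python
import time
from collections import OrderedDict


class TTLCache:
    """A bounded cache whose entries expire after ttl_s seconds (illustrative)."""

    def __init__(self, max_entries=1024, ttl_s=300.0, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: recycle the memory
            return default
        return value

    def put(self, key, value):
        self._data.pop(key, None)  # re-inserting moves the key to the end
        self._data[key] = (self._clock() + self._ttl_s, value)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict the oldest insertion

    def __len__(self):
        return len(self._data)
```

The size limit is what keeps memory bounded; the expiry only guarantees that stale entries are not served.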

Things to remember

Avoid fiddling - eliminate the need for recurring human intervention
Purge data with application logic
Limit caching
Roll the logs - Configure log file rotation based on size.
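For size-based rotation, Python's standard library ships `logging.handlers.RotatingFileHandler`. The sketch below shows one possible setup; the logger name, the 10 MB limit, and the five backups are arbitrary assumptions to adjust per system.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger that rolls its file over once it reaches max_bytes."""
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With these settings, disk usage for this log is capped at roughly `max_bytes * (backups + 1)`, because the oldest backup is deleted on each rollover.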

Fail Fast

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violation or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker.

The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, it can help avert impending cascading failures.
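A sketch of the idea, under assumed names: check parameters and resource availability before doing any expensive work, and raise distinct exception types so callers can tell the two kinds of failure apart. `place_order`, its arguments, and both exception classes are hypothetical.

```python
class ApplicationError(Exception):
    """Caller's fault: parameter violation or invalid state."""


class SystemUnavailableError(Exception):
    """System failure: a required resource is not available right now."""


def place_order(order, db_pool_free, breaker_state):
    """Validate everything cheaply up front, before any real work starts."""
    if order.get("quantity", 0) <= 0:
        raise ApplicationError("quantity must be positive")
    if breaker_state == "open":
        raise SystemUnavailableError("payment service circuit is open")
    if db_pool_free == 0:
        raise SystemUnavailableError("no database connections available")
    # Only now would the expensive part run; stubbed out in this sketch.
    return {"status": "accepted", "quantity": order["quantity"]}
```

An upstream circuit breaker could then count only `SystemUnavailableError`, so a burst of bad user input does not trip it.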

Let it crash

The best thing to do to create system-level stability is to abandon component-level stability.

The cleanest state your program can ever have is right after startup. The “let it crash” approach says that error recovery is difficult and unreliable, so our goal should be to get back as soon as possible to that clean state.

For this to be possible, the following are needed

Limited granularity - crash components in isolation, don’t affect the overall system
Fast replacement - restart time depends on the “stack” the instance has to be restarted on.
Supervision - Actor systems use a hierarchical tree of supervisors to manage restarts. Whenever an actor terminates, the runtime notifies its supervisor, which then decides whether to restart the child actor, restart all children, or crash itself. The supervisor is not the service consumer.
Reintegration - the system must resume calling the newly restored provider.
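The supervision idea can be sketched without an actor runtime. In the illustration below, `Supervisor`, the worker factory, and the restart budget are all assumptions: a crashed worker is replaced with a freshly built one, and the supervisor itself crashes once the budget is spent.

```python
class Supervisor:
    """Restarts a crashed worker from a clean state (illustrative sketch)."""

    def __init__(self, make_worker, max_restarts=3):
        self._make_worker = make_worker  # factory returning a fresh callable
        self._max_restarts = max_restarts
        self.restarts = 0

    def run(self):
        while True:
            worker = self._make_worker()  # clean state on every start
            try:
                return worker()
            except Exception:
                self.restarts += 1
                if self.restarts > self._max_restarts:
                    raise  # escalate: let the supervisor itself crash
```

Building a new worker from the factory, rather than repairing the old one, is what returns the component to its post-startup state.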

Handshaking

Handshaking refers to signalling between devices that regulates communication between them. Handshaking is ubiquitous in low-level communications protocols but is almost nonexistent at the application level. Handshaking is all about letting the server protect itself by throttling its own workload.

With HTTP, this can be done as a partnership between the load balancer and the web server. When the latter responds with a 503, or with a page containing an error message, the load balancer stops routing traffic to that instance. Handshaking is an effective way to stop cracks from jumping layers, as in the case of cascading failures.

Things to remember

Create cooperative demand control
Consider health checks
Build handshaking into your own low-level protocols
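The health-check partnership can be modelled in a few lines. The sketch below assumes the server decides readiness from its number of in-flight requests, which is only one possible signal; `HealthCheck`, `pick_instances`, and the threshold are hypothetical.

```python
from http import HTTPStatus


class HealthCheck:
    """Server side: report whether this instance can take more work."""

    def __init__(self, max_in_flight=100):
        self._max_in_flight = max_in_flight
        self.in_flight = 0

    def status(self):
        """Status code a load balancer would see on the health endpoint."""
        if self.in_flight >= self._max_in_flight:
            return HTTPStatus.SERVICE_UNAVAILABLE  # 503: stop sending traffic
        return HTTPStatus.OK


def pick_instances(instances):
    """Load-balancer side: keep only instances whose health check passes."""
    return [
        name
        for name, check in instances.items()
        if check.status() == HTTPStatus.OK
    ]
```

The server throttles its own workload by answering 503, and the load balancer cooperates by leaving that instance out of rotation until it recovers.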

Test harness

Distributed systems have failure modes that are difficult to provoke in development or QA environments.

In an “integration testing” environment, our system is fully integrated with all the other systems it interacts with. This approach constrains the entire company to testing only one new piece of software at a time.

Integration test environments can verify only what the system does when its dependencies are working correctly.

A test harness is used to emulate the remote system on the other end of each integration point. Its job is to make the system under test cynical.

A test harness differs from mock objects in that a mock object can only be trained to produce behaviour that conforms to the defined interface. A test harness runs as a separate server, so it’s not obliged to conform to any interface.
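As a small illustration of a harness misbehaving in a way a mock cannot, the sketch below runs a TCP server that accepts connections and then never sends a byte. `SilentServer` is a hypothetical name, and this is just one of many failure modes a harness could emulate.

```python
import socket
import threading


class SilentServer:
    """Accepts TCP connections and then never answers (illustrative)."""

    def __init__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a port
        self._sock.listen()
        self._sock.settimeout(0.2)  # lets the accept loop notice shutdown
        self.port = self._sock.getsockname()[1]
        self._conns = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)
        self._thread.start()

    def _accept_loop(self):
        while not self._stop.is_set():
            try:
                conn, _ = self._sock.accept()
            except socket.timeout:
                continue  # nothing connected yet; check the stop flag again
            except OSError:
                return  # listening socket is no longer usable
            self._conns.append(conn)  # hold the connection open, say nothing

    def close(self):
        self._stop.set()
        self._thread.join(timeout=1.0)
        self._sock.close()
        for conn in self._conns:
            conn.close()
```

Pointing the system under test at `("127.0.0.1", server.port)` shows whether its client code has a read timeout or simply hangs.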