Architecture antipatterns

The dominant architectural style today is the horizontally scaled farm of commodity hardware. Horizontal scaling means adding more servers that run the same application code.

This provides fault tolerance through redundancy.

Although horizontal clusters are generally not subject to a single point of failure, they can exhibit a load-related failure mode. A chain reaction occurs when the application has a defect, usually a load-related crash or a resource leak such as a memory leak in application code. When one server goes down, its share of the load moves to the survivors, which makes each of them more likely to fail in the same way. Chain reactions can also be caused by blocked threads.

Things to remember

  • One server down jeopardizes the rest
  • Hunt for resource leaks.
  • Hunt for obscure timing bugs - (race conditions)
  • Use autoscaling. In the cloud, creating health checks for every auto-scaling group is a must.
  • Defend with Bulkheads

Cascading failures

A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.

This often results from resource pools that get drained because of a failure in a lower layer. Integration points without timeouts are a surefire way to create cascading failures.

Cascading failures are the number-one crack accelerator. The most effective patterns to combat cascading failures are Circuit Breaker and Timeouts.

To remember

  • Stop cracks from jumping the gap
  • Scrutinize resource pools
  • Defend with Timeouts and Circuit Breaker.
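As a rough illustration of how Circuit Breaker and Timeouts fit together, here is a minimal sketch. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for this example, not taken from any particular library. The wrapped call is expected to carry its own timeout (for instance a socket or HTTP client timeout), so that a hung dependency surfaces as an exception the breaker can count.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    # Hypothetical defaults: open after 3 consecutive failures, retry after 30 s.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of tying up a thread on a sick dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than waiting on the failing layer, which is what keeps the crack from jumping the gap.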

Users

Human users have a gift for doing exactly the worst possible thing at the worst possible time.

Traffic

“Capacity” is the maximum throughput your system can sustain under a given workload while maintaining acceptable performance.

Heap memory is a hard limit, particularly in managed-code languages. User sessions kept on the heap count against that limit.

Keep as little in the in-memory session as possible. Weak references can help: a weak reference holds another object, called the payload, but only until the garbage collector needs to reclaim memory. Usually the only guarantee is that weakly reachable objects will be reclaimed before an out-of-memory error occurs.
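The idea of a weakly held payload can be sketched in Python with the standard `weakref` module. Note the difference in timing: the text above describes reclamation under memory pressure, whereas Python drops a weakly referenced object as soon as no strong reference remains. `SessionData` and `load_session` are hypothetical names for this example.

```python
import gc
import weakref


class SessionData:
    """Hypothetical bulky per-user payload."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.cart = []


# Values are held only weakly: the cache alone does not keep them alive.
session_cache = weakref.WeakValueDictionary()


def load_session(user_id):
    data = session_cache.get(user_id)
    if data is None:
        # Payload was reclaimed (or never loaded): rebuild it,
        # for example from a database or another process.
        data = SessionData(user_id)
        session_cache[user_id] = data
    return data
```

The caller must always be prepared for the payload to be gone and rebuild it, which is the price of letting the garbage collector take the memory back.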

Another way to deal with per-user memory is to farm it out to a different process.

These approaches exercise a trade-off between total addressable size and latency.

Sockets: a port number is 16 bits wide, so a single IP address offers at most 65,536 ports.

If only 64,511 ports are available, how can a server handle millions of connections?

  • Virtual IP addresses: the OS binds additional IP addresses to the same network interface, and each address has its own range of port numbers.

A bogon is a wandering packet that got routed inefficiently and arrives late, possibly out of sequence, and after the connection is closed.

Expensive users: test aggressively. If a retail store expects a 2% conversion rate, test for a 4%, 6%, or 10% conversion rate.

Unwanted users

Sessions are the Achilles’ heel of web applications. To see why, pick a deep link from the site and start requesting it without sending cookies: every request creates a new session. Web servers never tell application servers that the user stopped waiting for an answer.

Keeping out legitimate robots is fairly easy with a robots.txt file.

Two approaches work against robots that ignore it:

  • technical: once you identify a scraper, block it from the network
  • legal

Distributed denial-of-service (DDoS) attacks: the attacker causes computers widely distributed across the net to start generating load on your site. The load comes from a botnet.

A specialized circuit-breaker can help to limit the damage done by any particular host.
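One possible shape for such a per-host breaker is a sliding-window counter keyed by client address. This is only a sketch: the class name, the limits, and the injectable `clock` are assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class PerHostBreaker:
    # Hypothetical limits: at most 100 requests per host in any 10-second window.
    def __init__(self, max_requests=100, window=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # host -> timestamps of recent requests

    def allow(self, host):
        now = self.clock()
        recent = self.hits[host]
        # Forget requests that fell out of the sliding window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # this host is over budget: reject cheaply
        recent.append(now)
        return True
```

The check would run before any session is created or any expensive work starts. A production version would also need to evict idle hosts, since this sketch keeps one entry per address it has ever seen.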

  • Users consume memory
  • Users do weird, random things
  • Malicious users are out there
  • Users will gang up on you

Blocked threads

There’s a catch about interpreted languages. The interpreter can be running, and the application can still be totally deadlocked, doing nothing useful.

The most common failure mode is navel gazing - a happily running interpreter with every single thread sitting around waiting for Godot.

In object theory, the Liskov substitution principle states that any property that is true about objects of type T should also be true for objects of any subtype of T.

A method without side effects in the base class should also be without side effects in the derived class.

Things to remember

  • Recall that the Blocked Threads antipattern is the proximate cause of most failures.
  • Scrutinize resource pools
  • Use proven primitives
  • Defend with Timeouts
  • Beware of the code you cannot see
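Several of the points above (resource pools, proven primitives, timeouts) can be combined in one small sketch: a pool built on the standard library's thread-safe `queue.Queue`, where checkout gives up after a bounded wait instead of blocking forever. `BoundedPool` and `PoolExhausted` are hypothetical names chosen for this example.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no resource became free within the timeout."""


class BoundedPool:
    def __init__(self, resources):
        # queue.Queue is a proven, thread-safe primitive from the standard library.
        self._free = queue.Queue()
        for r in resources:
            self._free.put(r)

    def checkout(self, timeout=1.0):
        try:
            # Wait at most `timeout` seconds rather than blocking indefinitely.
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted(f"no resource free after {timeout}s") from None

    def checkin(self, resource):
        self._free.put(resource)
```

A request thread that hits `PoolExhausted` can return an error to its caller right away, so a drained pool produces fast failures instead of a pile of blocked threads.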

Self-Denial Attacks

This type of attack is described as any situation in which the system - or the extended system that includes humans - conspires against itself.

Avoid this type of attack by building a “shared-nothing” architecture, in which each server can run on its own without knowing what any other server is doing.

Autoscaling can help when the traffic surge does arrive.

Things to remember

  • Keep the lines of communication open
  • Protect shared resources
  • Expect rapid redistribution of any cool or valuable offer.

Scaling effects

Be sure to distinguish between point-to-point inside a service versus point-to-point between services. If the application will only ever have 2 servers, then point-to-point is fine.
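To see why point-to-point stops being fine as the server count grows, it helps to count the links. With n servers that each talk to every other server, there are n(n-1)/2 distinct links, and each server maintains n-1 of them. The function name below is just for illustration.

```python
def point_to_point_links(n):
    """Number of distinct links when each of n servers talks to every other."""
    return n * (n - 1) // 2


for n in (2, 10, 100):
    print(n, "servers ->", point_to_point_links(n), "links")
# 2 servers -> 1 links
# 10 servers -> 45 links
# 100 servers -> 4950 links
```

Two servers need a single link, which is why the problem rarely shows up in small environments.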

Point-to-point communication can be replaced by:

  • UDP broadcasts
  • TCP or UDP multicast
  • Publish/subscribe messaging
  • Message queues

Shared resource

A shared resource is some facility that all members of a horizontally scalable layer need to use. It could be a cluster manager or a lock manager. When it gets overloaded, it becomes a bottleneck.

The trouble with shared-nothing architecture is that it might scale better at the cost of failover.

Things to remember

  • Examine production versus QA environments to spot Scaling Effects
  • Watch out for point-to-point communication
  • Watch out for shared resources

Callers and providers should both be resilient. For the caller, Circuit Breaker helps by relieving the pressure on downstream services when responses get slow or connections get refused. For the provider, Handshaking and Backpressure should be used to tell callers to throttle back their requests.
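On the provider side, one simple way to create backpressure is a bounded queue of pending work that refuses new requests when full. The sketch below assumes a hypothetical `Provider` class; how the refusal is reported to the caller (for example as an HTTP 503) is left as an assumption.

```python
import queue


class Provider:
    # Hypothetical limit: at most 100 requests may wait at once.
    def __init__(self, max_pending=100):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Return True if accepted, False to tell the caller to back off."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # Refuse immediately instead of letting work pile up unbounded.
            return False

    def take(self):
        """Called by a worker to pick up the next pending request."""
        return self._pending.get_nowait()
```

A caller that receives `False` knows right away that the provider is saturated and can slow down, retry later, or trip its own circuit breaker.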

Drive out Through Testing

Unbalanced capacities are rarely observed in QA, because QA environments are usually scaled down to just two servers.

Things to remember:

  • Examine server and thread counts
  • Observe near Scaling Effects and users
  • Virtualize QA and scale it up
  • Stress both sides of the interface

Dogpile

When a bunch of servers impose a transient load on a shared service all at once, it’s called a dogpile.

A dogpile can occur in several situations:

  • When several servers boot up after a code upgrade or restart
  • When a cron job triggers at midnight
  • When a configuration management system pushes out a change
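A common way to spread out load in situations like these is to add a random delay so that servers do not all act at the same instant. The helper names and default values below are assumptions for this sketch.

```python
import random


def jittered_delay(max_slew_seconds, rng=random):
    """Random wait before a scheduled job, so servers do not fire together."""
    return rng.uniform(0, max_slew_seconds)


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Exponential backoff with full jitter for retry number `attempt` (0-based)."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

For example, a midnight job could call `time.sleep(jittered_delay(300))` before starting, which spreads a fleet's start times over five minutes instead of one second.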

Force multiplier

Like a lever, automation allows administrators to make large movements with less effort.

A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems.

“Control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality.

A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical.

Things to apply in control plane software

  • Apply hysteresis. Start machines quickly, but shut them down slowly.
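Hysteresis in a scaling decision might look like the following sketch: scale up on the first high reading, but scale down only after several consecutive low readings. The `Scaler` class and its thresholds are hypothetical and not tied to any real autoscaler.

```python
class Scaler:
    # Hypothetical thresholds: above 75% add a machine, below 30% consider removing one.
    def __init__(self, high=0.75, low=0.30, calm_checks_needed=5):
        self.high = high
        self.low = low
        self.calm_checks_needed = calm_checks_needed
        self.calm_checks = 0

    def decide(self, instances, utilization):
        """Return the desired instance count for one observation."""
        if utilization > self.high:
            self.calm_checks = 0
            return instances + 1          # scale up immediately
        if utilization < self.low:
            self.calm_checks += 1
            if self.calm_checks >= self.calm_checks_needed and instances > 1:
                self.calm_checks = 0
                return instances - 1      # scale down only after sustained calm
            return instances
        self.calm_checks = 0              # between thresholds: hold steady
        return instances
```

The gap between the two thresholds, plus the required streak of calm readings, keeps a single noisy measurement from shutting down capacity that is still needed.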

Slow Responses

Generating a slow response is worse than refusing a connection or returning an error, particularly in the context of middle-layer services.

Slow responses usually come from excessive demand.

Things to remember

  • Slow responses trigger Cascading Failures
  • For websites, slow responses cause more traffic (reload button)
  • Consider Fail Fast
  • Hunt for memory leaks or resource contention
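One way to act on Fail Fast is for a service to track its own recent response times and refuse new work while the average exceeds what callers will tolerate. This is a minimal sketch; `ResponsivenessGate`, the 0.5-second budget, and the sample size are all assumptions.

```python
from collections import deque


class ResponsivenessGate:
    def __init__(self, allowed_seconds=0.5, sample_size=50):
        self.allowed_seconds = allowed_seconds
        # Only the most recent `sample_size` timings are kept.
        self.samples = deque(maxlen=sample_size)

    def record(self, elapsed_seconds):
        self.samples.append(elapsed_seconds)

    def should_reject(self):
        """True when the recent average response time is over budget."""
        if not self.samples:
            return False
        average = sum(self.samples) / len(self.samples)
        return average > self.allowed_seconds
```

A handler would check `should_reject()` first and return an error immediately when it is true. Since rejected requests produce no new timings, a real version would still need to let a few requests through, or age out old samples, so the gate can reopen.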

Unbounded result set

In the abstract, an unbounded result set occurs when the caller allows the system to dictate terms. It’s a failure in handshaking. Social media assumed at first that the number of connections per user would be distributed like a bell-curve, but it’s actually distributed like a power law.

Things to remember

  • Use realistic data volumes
  • Paginate at the front end
  • Don’t rely on the data producers
  • Put limits into other application-level protocols
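The caller-side limit can be sketched as a helper that never takes more than a fixed number of rows from any source, however many the producer is willing to send. `fetch_page` and its defaults are hypothetical.

```python
from itertools import islice


def fetch_page(rows, offset=0, limit=50, max_limit=200):
    """Return at most `limit` rows plus a flag saying whether more remain."""
    limit = max(1, min(limit, max_limit))   # the caller's own ask is capped too
    # Read one extra row only to learn whether another page exists.
    window = list(islice(rows, offset, offset + limit + 1))
    return window[:limit], len(window) > limit
```

Here `rows` can be any iterable, such as a database cursor. Where the data source supports it, pushing the limit into the query itself avoids transferring rows that will only be thrown away.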