Architecture antipatterns

The dominant architectural style today is the horizontally scaled farm of commodity hardware. Horizontal scaling means adding more servers that run the same application code, which provides fault tolerance through redundancy.

Although horizontal clusters generally have no single point of failure, they can exhibit a load-related failure mode, for example a memory leak in application code. A chain reaction occurs when an application has a defect, typically a load-related crash or a resource leak: when one server fails, its load is redistributed to the survivors, which makes each of them more likely to fail in turn. Blocked threads can cause the same effect.

Things to remember

One server down jeopardizes the rest
Hunt for resource leaks.
Hunt for obscure timing bugs - (race conditions)
Use autoscaling - in the cloud, creating health checks for every auto-scaling group is a must
Defend with Bulkheads

Cascading failures

Occurs when a crack in one layer triggers a crack in a calling layer.

This often results from resource pools that get drained because of a failure in a lower layer. Integration points without timeouts are a surefire way to create cascading failures.

Cascading failures are the number-one crack accelerator. The most effective patterns to combat cascading failures are Circuit Breaker and Timeouts.

To remember

Stop cracks from jumping the gap
Scrutinize resource pools
Defend with Timeouts and Circuit Breaker.
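
As a rough illustration of the Circuit Breaker idea, here is a minimal sketch in Python. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for this example, not anything prescribed by the notes above. The wrapped call is expected to carry its own timeout (for instance a socket or HTTP client timeout), so a hung provider shows up as an exception the breaker can count.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.clock = clock                 # injectable so tests can control time
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not tie up a thread on a provider known to be bad.
                raise CircuitOpenError("circuit open, failing fast")
            # Reset period elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each integration point in its own breaker, so that a failing lower layer produces quick errors instead of draining the caller's thread or connection pool.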

Users

Human users have a gift for doing exactly the worst possible thing at the worst possible time.

Traffic

“Capacity” is the maximum throughput your system can sustain under a given workload while maintaining acceptable performance.

Heap memory is a hard limit, particularly in managed-code languages, and every user session held on the heap counts against it.

Keep as little in the in-memory session as possible. Weak references can help: a weak reference holds another object, called the payload, but only until the garbage collector needs to reclaim memory. Usually the only guarantee is that weakly reachable objects will be reclaimed before an out-of-memory error occurs.
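
The following sketch shows the general shape of a weakly referenced session cache using Python's standard `weakref` module. The `Session` class and `load_session` function are hypothetical names invented for the example. One caveat: the "held until memory is needed" behaviour described above matches Java's soft references more closely; in CPython a weakly referenced object is reclaimed as soon as its last strong reference goes away, so the code must always be ready to rebuild the payload.

```python
import weakref


class Session:
    """Hypothetical per-user session payload."""

    def __init__(self, user_id, cart):
        self.user_id = user_id
        self.cart = cart


# Entries vanish from this cache once nothing else holds the Session strongly.
session_cache = weakref.WeakValueDictionary()


def load_session(user_id):
    """Return the cached session, or rebuild it if it has been reclaimed."""
    session = session_cache.get(user_id)
    if session is None:
        # Stand-in for reloading the session from a database or remote store.
        session = Session(user_id, cart=[])
        session_cache[user_id] = session
    return session
```

The important design point is the `None` check: code that uses a weak reference has to treat a missing payload as normal and recover from it.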

Another way to deal with per-user memory is to farm it out to a different process.

These approaches exercise a trade-off between total addressable size and latency.

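Sockets - a port number is 16 bits long, so there are only 65,536 possible ports.

If only 64,511 of those ports are available for connections, how can one server handle millions of connections?

The answer is virtual IP addresses: the OS binds additional IP addresses to the same network interface, and each address gets its own range of port numbers.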

A bogon is a wandering packet that got routed inefficiently and arrives late, possibly out of sequence, and after the connection is closed.

Expensive users - test aggressively. If a retailer expects a 2% conversion rate, test for a 4%, 6%, or 10% conversion rate.

Unwanted users

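Sessions are the Achilles’ heel of web applications. To see why, pick a deep link from the site and start requesting it without sending cookies: every request creates a new session. Web servers never tell application servers that the user has stopped waiting for an answer, so the work continues even after the client drops the connection.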

Keeping out legitimate robots is fairly easy through the use of a robots.txt file.

Two approaches work against the rest:

technical - once you identify a scraper, block it from the network
legal

Distributed denial-of-service (DDoS) attacks: an attacker causes computers widely distributed across the net to start generating load on your site. The load typically comes from a botnet.

A specialized circuit-breaker can help to limit the damage done by any particular host.
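
One way to picture such a per-host breaker is a sliding-window request counter keyed by client address. The sketch below is only an illustration: `PerHostLimiter`, its default limits, and the injectable `clock` are assumptions for the example, and a production version would also need to evict idle hosts so the table itself does not grow without bound.

```python
import time
from collections import defaultdict, deque


class PerHostLimiter:
    """Reject a host that exceeds max_requests within a sliding time window."""

    def __init__(self, max_requests=100, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # host -> timestamps of recent requests

    def allow(self, host):
        now = self.clock()
        recent = self.hits[host]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # trip for this host only; other hosts are unaffected
        recent.append(now)
        return True
```

Because the state is per host, one abusive client is cut off without degrading service for everyone else, although a widely distributed botnet can still stay under any per-host limit.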

Users consume memory
Users do weird, random things
Malicious users are out there
Users will gang up on you

Blocked threads

There’s a catch about interpreted languages. The interpreter can be running, and the application can still be totally deadlocked, doing nothing useful.

The most common failure mode is navel gazing - a happily running interpreter with every single thread sitting around waiting for Godot.

In object theory, the Liskov substitution principle states that any property that is true about objects of type T should also be true for objects of any subtype of T.

A method without side effects in the base class should also be without side effects in the derived class.

Things to remember

Recall that the Blocked Threads antipattern is the proximate cause of most failures.
Scrutinize resource pools
Use proven primitives
Defend with Timeouts
Beware of the code you cannot see
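
To make "Defend with Timeouts" concrete, the sketch below uses two standard-library primitives, `queue.Queue` and `threading.Lock`, both of which accept a `timeout` argument. The pool contents and the function names `checkout` and `update_shared_state` are made up for the example; the point is only that every blocking call gets a deadline.

```python
import queue
import threading

# Hypothetical connection pool holding two placeholder connections.
pool = queue.Queue(maxsize=2)
for conn_id in range(2):
    pool.put(f"conn-{conn_id}")


def checkout(timeout=0.05):
    """Borrow a connection, giving up instead of blocking forever."""
    try:
        return pool.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("no connection available within timeout") from None


lock = threading.Lock()


def update_shared_state(timeout=0.05):
    """Acquire the lock with a deadline; report failure rather than hang."""
    if not lock.acquire(timeout=timeout):
        return False
    try:
        return True  # the real work on the shared state would go here
    finally:
        lock.release()
```

With a deadline on every acquisition, an exhausted pool or a stuck lock holder turns into an error the caller can handle, rather than a request thread that never returns.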

Self-Denial Attacks

This type of attack is described as any situation in which the system - or the extended system that includes humans - conspires against itself.

Avoid this type of attack by building a “shared-nothing” architecture, in which each server can run on its own without knowing what any other server is doing.

Autoscaling can help when the traffic surge does arrive.

Things to remember

Keep the lines of communication open
Protect shared resources
Expect rapid redistribution of any cool or valuable offer.

Scaling effects

Be sure to distinguish between point-to-point inside a service versus point-to-point between services. If the application will only ever have 2 servers, then point-to-point is fine.

Possible replacements for point-to-point communication

UDP broadcasts
TCP or UDP multicast
Publish/subscribe messaging
Message queues
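
A quick calculation shows why point-to-point communication is fine for 2 servers but not for a large cluster. The sketch assumes a full mesh, where every server connects to every other server, and compares it with a hub arrangement such as a message broker; the function names are made up for the example.

```python
def point_to_point_links(n):
    """Connections needed when each of n servers talks to every other one."""
    return n * (n - 1) // 2


def hub_links(n):
    """Connections needed when each of n servers talks only to a broker."""
    return n


# Compare the two topologies at a few cluster sizes.
for n in (2, 10, 100):
    print(n, point_to_point_links(n), hub_links(n))
```

The mesh grows with the square of the server count while the hub grows linearly, which is why a setup that works in a two-server QA environment can fall over in production.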

Shared resource

A shared resource is some facility that all members of a horizontally scalable layer need to use. It could be a cluster manager or a lock manager. When it gets overloaded, it becomes a bottleneck.

The trouble with a shared-nothing architecture is that it might scale better at the cost of failover.

Things to remember

Examine production versus QA environments to spot Scaling Effects
Watch out for point-to-point communication
Watch out for shared resources

Callers and providers should both be resilient. For the caller, Circuit Breaker helps by relieving pressure on downstream services when responses get slow or connections get refused. For the provider, Handshaking and Backpressure should be used to tell callers to throttle back their requests.

Drive out Through Testing

Unbalanced capacities are rarely observed in QA, where every tier is typically scaled down to just 2 servers.

Things to remember:

Examine server and thread counts
Observe near Scaling Effects and users
Virtualize QA and scale it up
Stress both sides of the interface

Dogpile

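Startup and periodic work put a transient load on a system that differs from its steady-state load. When a bunch of servers impose this transient load all at once, it’s called a dogpile.

It occurs in several situations:

When booting up several servers at once, such as after a code upgrade or restart
When a cron job triggers at midnight
When the configuration management system pushes out a change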
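
A common mitigation, not spelled out in the notes above and so offered here as an assumption, is to add a random offset so that servers do not all start the same work at the same instant. The function name `jittered_delay` and its parameters are hypothetical.

```python
import random


def jittered_delay(base_seconds, spread_seconds, rng=random):
    """Return the base delay plus a random offset in [0, spread_seconds)."""
    return base_seconds + rng.random() * spread_seconds


# Example: instead of every host running its job exactly at midnight,
# each host waits a different amount of time within a five-minute window.
delay = jittered_delay(0, 300)
```

Each host would sleep for its own `delay` before running the scheduled job, spreading the load over the window instead of stacking it on one moment.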

Force multiplier

Like a lever, automation allows administrators to make large movements with less effort.

A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems.

“Control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality.

A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical.

Things to apply in control plane software

Apply hysteresis. Start machines quickly, but shut them down slowly.
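
Hysteresis in a scaling decision can be sketched as asymmetric rules: one hot sample is enough to scale up, but scaling down requires a sustained quiet period. The thresholds, the window length, and the function name below are illustrative assumptions, and the function expects a non-empty list of CPU utilization samples between 0 and 1.

```python
def desired_instances(current, cpu_samples, scale_up_at=0.75,
                      scale_down_at=0.30, down_window=5):
    """Scale up on one hot sample; scale down only after a sustained lull."""
    latest = cpu_samples[-1]
    if latest > scale_up_at:
        return current + 1  # react quickly to load
    recent = cpu_samples[-down_window:]
    # Only shrink when a full window of samples is below the lower threshold.
    if len(recent) == down_window and all(s < scale_down_at for s in recent):
        return max(1, current - 1)  # never go below one instance
    return current
```

The gap between the two thresholds, together with the longer window for shrinking, keeps the control plane from oscillating when load hovers near a single cutoff.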

Slow Responses

Generating a slow response is worse than refusing a connection or returning an error, particularly in the context of middle-layer services.

Slow responses usually come from excessive demand.

Things to remember

Slow responses trigger Cascading Failures
For websites, slow responses cause more traffic (reload button)
Consider Fail Fast
Hunt for memory leaks or resource contention

Unbounded result set

In the abstract, an unbounded result set occurs when the caller allows the system to dictate terms. It’s a failure in handshaking. Social media assumed at first that the number of connections per user would be distributed like a bell-curve, but it’s actually distributed like a power law.

Things to remember

Use realistic data volumes
Paginate at the front end
Don’t rely on the data producers
Put limits into other application-level protocols
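
The idea of the caller setting the bound can be sketched with Python's built-in `sqlite3` module. The `connections` table, its columns, and the `fetch_page` helper are hypothetical; the point is that the query carries an explicit `LIMIT`, so the caller decides how many rows come back no matter how many exist.

```python
import sqlite3


def fetch_page(conn, user_id, page_size=50, after_id=0):
    """Return at most page_size rows; the caller, not the data, sets the bound."""
    rows = conn.execute(
        "SELECT id, friend_id FROM connections "
        "WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?",
        (user_id, after_id, page_size),
    ).fetchall()
    # The id of the last row is the cursor for the next page, if any.
    next_after = rows[-1][0] if rows else None
    return rows, next_after


# Demo data: one user with 1000 connections in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE connections "
    "(id INTEGER PRIMARY KEY, user_id INTEGER, friend_id INTEGER)"
)
conn.executemany(
    "INSERT INTO connections (user_id, friend_id) VALUES (?, ?)",
    [(1, n) for n in range(1000)],
)
```

Calling `fetch_page(conn, 1)` and then passing the returned cursor as `after_id` walks the data in fixed-size pages, so a user at the far end of the power-law distribution costs more round trips but never more memory per request.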
