Consistency Considered Harmful 2016-06-30

Why Etcd is Often The Wrong Answer


In certain technology circles, whenever you mention MySQL, it gets torn to pieces because a long time ago most MySQL deployments used MyISAM tables, which did not have transactions.

And there were good reasons to criticise this in many cases. But a lot of the criticism was ideological rather than founded on a sound basis:

Transactions, surely, are always better?

These days, with distributed systems being more important than ever, we find this in a new form. If you are going to distribute data, surely you need a consistent store like Etcd? Because what good could inconsistent data be?


(Click here to read about my consulting services - Devops, Linux, Systemd, Cloud/AWS/GCE, Software development and more)


In the real world, we always have an inconsistent view

The reality is that while we can strive for consistency, and sometimes should ensure it, it is very often meaningless. (Despite the somewhat inflammatory title, do not think I am suggesting consistency is never necessary or desirable.)

Humans are used to inconsistencies. We realise that even before we've received our bank statement (or checked it online) it may already be out of date, for example. We deal with inconsistency all of the time.

Inconsistency is sometimes a problem, but very often it only becomes a problem when we create the wrong expectation. E.g. when we confidently state there are 142 search results but it turns out there are only 135, instead of saying there are "ca. 140" or "more than 100" or "6 pages".

The problem with consistency

The problem is that for any distributed system, demanding 100% consistency creates a whole slew of other problems:

If you need consistency, this is just something you need to design for. But surprisingly often you have a better alternative, and sacrificing consistency allows you to get better performance and/or better availability.

Design for lack of consistency

One of the most obvious examples of a system that is inconsistent by design is your banking. It is inconsistent because almost everything is - or can be in specific circumstances - asynchronous:

There is a lesson here: In a large number of instances, you can design systems that can work reliably despite such temporary inconsistency by applying operations to a ledger instead and treating sums (such as your account balance) as just an aggregate view.

Got a database system without transactions? As long as the operations you carry out are idempotent and atomic, you can usually design such a system just fine so that the result eventually becomes consistent in terms of what matters (account balances) if you stop feeding transactions through the system.

The upside is that if you are careful, such a system can continue to process updates in multiple locations even if your inter-node communication has totally collapsed. It just needs to eventually be able to exchange a list of updates.
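To make the ledger idea concrete, here is a minimal sketch in Python. Everything in it (the `Entry` and `Ledger` names, the `op_id` field, the two example nodes) is made up for illustration; it assumes each operation gets a globally unique id from the node where it originates, and it is not how any particular bank or database does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One immutable ledger operation (hypothetical layout)."""
    op_id: str    # assumed globally unique, e.g. a UUID picked by the originating node
    account: str
    amount: int   # signed amount in minor units (cents)


class Ledger:
    def __init__(self):
        self.entries = {}  # op_id -> Entry

    def apply(self, entry):
        # Idempotent: applying the same operation twice changes nothing.
        self.entries.setdefault(entry.op_id, entry)

    def merge(self, other):
        # Exchanging "a list of updates" is just applying the other side's entries.
        for entry in other.entries.values():
            self.apply(entry)

    def balance(self, account):
        # The balance is only an aggregate view over the ledger.
        return sum(e.amount for e in self.entries.values() if e.account == account)


# Two nodes accept updates independently while they cannot talk to each other.
london, tokyo = Ledger(), Ledger()
london.apply(Entry("op-1", "alice", 100))
tokyo.apply(Entry("op-2", "alice", -30))
tokyo.apply(Entry("op-2", "alice", -30))  # duplicate delivery, ignored

# Once communication is back, they swap updates in either order.
london.merge(tokyo)
tokyo.merge(london)
```

While the nodes are partitioned they disagree about the balance, and that is fine. Once the updates have been exchanged both sides compute the same sum, no matter what order the entries arrived in or how often they were delivered.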

In other circumstances, you can design systems to self-heal, and for inconsistency to be of low priority.

Consider a system that maintains a view of your cluster state. You want to know where to send traffic, but as long as your load balancer knows how to deal with failures, your view of the cluster state need not be perfect:

As long as agents, whether on each host or centrally, are able to obtain information reasonably often about what is running on a given host, any services that don't need to coordinate global state changes can remain unaffected.
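A rough sketch of what "knows how to deal with failures" can mean, in Python. The `send` function, the `connect` callback and the host names are all hypothetical; the only assumption is that a dead backend shows up as a `ConnectionError`.

```python
def send(request, view, connect):
    """Try backends from a possibly stale view until one accepts the request."""
    for host in view:
        try:
            return connect(host, request)
        except ConnectionError:
            # The view was out of date for this host; move on to the next one.
            continue
    raise ConnectionError("no backend in the current view is reachable")


# Fake cluster for the example: the view still lists "a", which has died.
alive = {"b", "c"}


def fake_connect(host, request):
    if host not in alive:
        raise ConnectionError(host)
    return f"{host} handled {request}"


stale_view = ["a", "b", "c"]
result = send("GET /", stale_view, fake_connect)
```

The view is wrong about "a", yet the request still ends up on "b". The stale entry costs one failed connection attempt, not an outage.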

Services that do need to coordinate global changes are somewhat worse off. E.g. for a database server, you may need a way to determine canonically where the master server is. But in such cases, allowing the old master to keep running is almost equally bad: you need to know that it is no longer processing updates. As long as you forcibly ensure the previous master is down, it again does not matter whether everyone has a consistent view, because they will not be able to connect to the now-killed master.
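The ordering is the important part, so here is a small Python sketch of it. `fail_over`, `fence` and `promote` are hypothetical names; how you actually kill the old master (power it off, revoke its storage, and so on) depends on your setup and is left to the `fence` callback, which is assumed to return True only when the old master is confirmed down.

```python
def fail_over(old_master, new_master, fence, promote):
    """Promote new_master only once old_master is confirmed down."""
    if not fence(old_master):
        # Better to have no master than two masters accepting updates.
        raise RuntimeError(f"could not fence {old_master}; refusing to promote")
    promote(new_master)
    return new_master


# Fake callbacks that just record the order in which things happen.
events = []


def fake_fence(host):
    events.append(f"fence {host}")
    return True


def fake_promote(host):
    events.append(f"promote {host}")


fail_over("db1", "db2", fake_fence, fake_promote)
```

Clients that still believe "db1" is the master simply fail to connect and have to look again, which is the self-healing behaviour described above.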

So What Do You Have Against Etcd?

I don't hate Etcd. It has various issues I will not go into here, but the problem is not Etcd. The problem is abusing Etcd and software like it for the wrong thing.

If you need consistency, Etcd is fine in most cases.

But if you can avoid a need for consistency, you can save yourself a lot of trouble by using a system that provides fewer consistency guarantees and instead focuses on availability.




