Avoid Kafka if unsure (think twice series)

Some co-workers started using Apache Kafka with a bunch of our Customers.

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log[*].

To achieve this goal, Apache Kafka needs a complex server setup, even more complex if you want the certification from the producing company (Confluent). Now, if you are planning to use Kafka like a simple Java Message Service (JMS) implementation, think twice before going down this route.

PostgreSQL 12 offers a fair (and open source) partitioning implementation, whereas if money is not a problem, Oracle 12c can happily scale to billions of records before running into trouble (and ExaData can scale even more).

PostgreSQL and Oracle offer optimizations for partitioned data, called “Partition Pruning” in PostgreSQL terminology:

With partition pruning enabled, the planner will examine the definition of each partition and prove that the partition need not be scanned because it could not contain any rows meeting the query’s WHERE clause. When the planner can prove this, it excludes (prunes) the partition from the query plan.

This feature is relatively new (it appeared in PostgreSQL 11) but it is essential to a successful partitioning strategy. Before this feature, partitioning was a black art. Now it is simpler to manage.
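As a minimal sketch of how this looks in practice (table and partition names are illustrative, not from any real project):

```sql
-- A hypothetical "measurements" table, range-partitioned by month.
CREATE TABLE measurements (
    logdate   date NOT NULL,
    device_id int  NOT NULL,
    reading   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2019_01 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

CREATE TABLE measurements_2019_02 PARTITION OF measurements
    FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

-- With partition pruning on (the default since PostgreSQL 11),
-- the plan below should scan only measurements_2019_02:
SET enable_partition_pruning = on;
EXPLAIN SELECT * FROM measurements WHERE logdate = '2019-02-15';
```

The planner drops the January partition from the plan entirely, because no row in it can match the WHERE clause.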

Kafka's boring effect on my Best Developer

By contrast, Kafka needs you to think in advance about how to insert your data into it, because Kafka is essentially a
NoSQL database. And you must think about data partitioning in advance too, as far as I can understand.
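For example, the partition count of a topic has to be chosen up front at creation time (broker address and topic name below are illustrative):

```shell
# Hypothetical topic creation: you must commit to a partition count now.
kafka-topics --bootstrap-server localhost:9092 \
  --create --topic customer-orders \
  --partitions 6 --replication-factor 3
```

Changing the partition count later is painful: adding partitions changes the key-to-partition mapping, so records with the same key may land in a different partition than their older siblings.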


Also, relational theory works great when you need to reshape your data; it can be easily explained to other engineers on the Customer side (see also SQL: One of the most valuable skills).

Apache Kafka offers a KSQL extension to mix streams in an SQL92 fashion. Anyway, we discovered you can join data only within the same partition (!). Also, in production you need to provide a fixed set of queries in a single file, which makes vendor interoperation very complex.

Assigning a KSQL server to every workflow seems to be the only recommended option, increasing operational costs.
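A rough KSQL sketch of the co-partitioning constraint (stream names and fields are invented; exact syntax varies between KSQL versions):

```sql
-- Both streams must be co-partitioned on the join key:
-- same key, same number of partitions, or the join fails.
CREATE STREAM orders (order_id INT, customer_id INT)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON', KEY='order_id');

CREATE STREAM shipments (order_id INT, status VARCHAR)
  WITH (KAFKA_TOPIC='shipments', VALUE_FORMAT='JSON', KEY='order_id');

-- Stream-stream joins also require a time window (WITHIN):
CREATE STREAM order_status AS
  SELECT o.order_id, o.customer_id, s.status
  FROM orders o
  JOIN shipments s WITHIN 1 HOUR ON o.order_id = s.order_id;
```

In a relational database the equivalent join works regardless of how the two tables are physically laid out; here the physical partitioning leaks into the query semantics.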

Last but not least, PostgreSQL can be fired up in a single Docker container with a ridiculously small amount of memory (!), so why bother with an over-complex Kafka setup?
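Something like the following is enough for a small development instance (password and memory limit are illustrative):

```shell
# Fire up PostgreSQL 12 in one container, capped at a modest memory budget.
docker run -d --name pg \
  -e POSTGRES_PASSWORD=secret \
  -m 256m \
  -p 5432:5432 \
  postgres:12
```

Compare that with the minimum footprint of a Kafka cluster, which also needs ZooKeeper and, for KSQL, one more server per workflow.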

If you do not plan to handle trillions of events, think twice about Kafka.