Sophisticated systems don’t ensure great performance, but a simple test can tell if your data pipeline is getting too complex
Everyone has data pipelines composed of many different systems. Some may look sophisticated on the surface, but in reality they carry a lot of complexity, much of it perhaps unnecessary. The plumbing work to connect different components, the constant performance monitoring, and the large team with specialized expertise needed to run, debug and manage them all add time-to-market delays and operational overhead for product teams. And that’s not all. The more systems you use, the more places you duplicate your data, which increases the chances of data going out of sync or becoming stale. Further, since components may be developed independently by different companies, upgrades or bug fixes might break your pipeline and data layer.
If you aren’t careful, your pipeline might end up looking very similar to the one depicted in this amusing (and depressingly accurate) short video by KRAZAM. Has a product manager ever asked you why a seemingly simple feature isn’t possible to ship?
Complexity arises because even though each system might appear simple on the surface, each one brings the following variables into your pipeline, and together they can add a great deal of complexity:
1. How does the system transport the data? (HTTP, TCP, REST, GraphQL, FTP, JDBC)
2. What format does the system support? (Binary, CSV, JSON, Avro)
3. How is the data stored? (tables, streams, graphs, documents)
4. Does the system provide the necessary SDKs and APIs?
5. Does the system provide ACID or BASE consistency?
6. Migration: does the system provide an easy way to migrate all the data into or away from the system?
7. What guarantees does the system have around durability?
8. What guarantees does the system have for availability? (99.9%, 99.999%)
9. How does it scale?
10. How secure is the system?
11. How fast is the system in processing the data?
12. Is it hosted, on-premise only or a hybrid?
13. Does it work on my cloud, region, etc.?
14. Does it need an additional system(s)? (e.g. Zookeeper for Kafka)
Variables such as the data format, schema and protocol add up to what’s called the “transformational overhead.” Other variables like performance, durability and scalability add up to what’s called the “pipeline overhead.” Together, these two categories make up what’s known as the “impedance mismatch.” If we can measure that, we can calculate the complexity and use that to simplify our system.
Now, you might argue that your system, although it might appear complex, is actually the simplest system for your needs.
How do you measure if your data layer is truly simple or complex? And secondly, how can you estimate if your system will remain simple as you add more features? That is, if you add more features in your roadmap, do you also need to add more systems?
That’s where the “impedance mismatch test” comes in.
What Is “Impedance Mismatch”?
The term originated in electrical engineering, where it describes a mismatch in electrical impedance that causes energy to be lost as it is transferred from point A to point B.
Simply put, it means what you have doesn’t match what you need. To use what you have, you must first transform it into what you need. The mismatch therefore comes with an overhead: the cost of fixing it.
In software engineering, we have data in some form or some quantity, and need to transform it before it can be used. The transformation might happen multiple times and might even use multiple systems in between.
In the database world, the impedance mismatch happens for two reasons:
1. Transformational overhead: The way the system processes or stores the data differs from what the data actually looks like, or how you think about it. For example, in your server you have the flexibility to hold the data in numerous data structures, such as collections, streams, lists, sets and arrays, which lets you model your data naturally. However, to store the data you then need to map it into tables in a relational database management system (RDBMS) or into documents in a JSON document store, and do the reverse when reading it back. Note that the specific mismatch between object-oriented language models and relational table models is known as “object-relational impedance mismatch.”
2. Pipeline overhead: The amount and type of data you process in the server differs from what your database can handle. For example, if you are processing millions of events coming from mobile devices, your typical RDBMS or document store might not be able to store them, or might not provide APIs to easily aggregate or run calculations on those events. So you need a specialized stream-processing system, such as Kafka or Redis Streams, to process them, and perhaps also a data warehouse to store them.
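To make the first kind of overhead concrete, here is a minimal sketch of the object-to-table mapping described above. It is an illustration only: the Order class, the table names and the helper functions are hypothetical, and Python’s built-in sqlite3 module stands in for whatever RDBMS you actually use.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    """Hypothetical in-memory model: an order naturally holds a list of items."""
    order_id: int
    customer: str
    items: list = field(default_factory=list)


def save_order(conn, order):
    # Transformation on write: flatten one object into rows across two tables.
    conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (order.order_id, order.customer),
    )
    conn.executemany(
        "INSERT INTO order_items (order_id, position, item) VALUES (?, ?, ?)",
        [(order.order_id, i, item) for i, item in enumerate(order.items)],
    )


def load_order(conn, order_id):
    # Transformation on read: rebuild the object from rows in both tables.
    row = conn.execute(
        "SELECT id, customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    items = [
        r[0]
        for r in conn.execute(
            "SELECT item FROM order_items WHERE order_id = ? ORDER BY position",
            (order_id,),
        )
    ]
    return Order(order_id=row[0], customer=row[1], items=items)


# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_items (order_id INTEGER, position INTEGER, item TEXT)"
)

original = Order(1, "alice", ["keyboard", "mouse"])
save_order(conn, original)
restored = load_order(conn, 1)
print(restored == original)
```

The two helper functions exist only to bridge the gap between how the application thinks about an order and how the database stores it. That bridging code is the transformational overhead.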
The Impedance Mismatch Test
The goal of the Impedance Mismatch test is to measure the complexity of the overall software architecture and whether the complexity grows or shrinks as you add more features in the future.
You can calculate the “transformational overhead” and the “pipeline overhead” and combine them into an “Impedance Mismatch Score” (IMS). This will tell you if your system is already complex relative to other systems, and also if that complexity grows over time as you add more features.
Here is the formula to calculate IMS:
IMS = (Σ Transformational overhead + Σ Pipeline overhead) / Number of features
The formula adds both types of overhead across all features and then divides the total by the number of features. This way, you get the average overhead per feature (i.e., the complexity score).
Here is how you use it:
1. For each data layer or data pipeline, simply list out:
a. Features you currently have.
b. Features that are in the roadmap. This is important, because you want to make sure that your data layer can continue to support upcoming features without any additional overheads.
2. Then map the transformational overhead and the pipeline overhead for each feature.
3. And finally, divide the sum of all the overheads by the number of features.
4. Repeat steps 2 and 3 for pipelines with different systems to compare and contrast them.
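The steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a definitive implementation: the article does not prescribe a unit for overhead, so the sketch assumes each overhead is a simple count (for example, the number of data transformations or the number of extra systems a feature needs). The Feature class, both pipelines and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    # Assumed unit: number of data transformations the feature requires.
    transformational_overhead: int
    # Assumed unit: number of extra systems the feature requires.
    pipeline_overhead: int


def impedance_mismatch_score(features):
    """Sum both overheads over all features, then divide by the feature count."""
    if not features:
        raise ValueError("at least one feature is required")
    total = sum(
        f.transformational_overhead + f.pipeline_overhead for f in features
    )
    return total / len(features)


# Step 1: list current and roadmap features for each candidate pipeline.
# Step 2: assign the (hypothetical) overheads for each feature.
pipeline_a = [
    Feature("user profiles", 1, 1),
    Feature("activity feed", 2, 2),
    Feature("real-time leaderboard (roadmap)", 2, 3),
]
pipeline_b = [
    Feature("user profiles", 1, 1),
    Feature("activity feed", 1, 1),
    Feature("real-time leaderboard (roadmap)", 0, 1),
]

# Steps 3 and 4: compute the score for each pipeline and compare.
score_a = impedance_mismatch_score(pipeline_a)
score_b = impedance_mismatch_score(pipeline_b)
print(f"Pipeline A IMS: {score_a:.2f}")
print(f"Pipeline B IMS: {score_b:.2f}")
```

With these made-up numbers, pipeline A scores 11 / 3 ≈ 3.67 and pipeline B scores 5 / 3 ≈ 1.67, so pipeline B carries less overhead per feature. Including roadmap features in the list is what shows whether a pipeline’s score grows as the product grows.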
Embrace the Elegance of Simplicity
It is very easy to get carried away and build a complex data layer without thinking about the consequences. The IMS was created to help you be conscious of your decisions.
Try using the IMS to compare and contrast multiple systems for your data pipeline and see which ones are really the best for the experience you need your app to deliver. You’ll also be able to check whether your system can hold up to feature expansions and remain as simple as possible.
About the Author
Raja Rao is Head of Growth Marketing at Redis. Redis makes apps faster by creating a data foundation for a real-time world. It is the driving force behind Open-Source Redis, the world’s most loved in-memory database, and commercial provider of Redis Enterprise, a real-time data platform. Redis Enterprise powers real-time services for over 8,000 organizations globally. It builds upon the unmatched simplicity and speed of Open-Source Redis along with an enterprise-grade data platform that offers robustness of modern data models, management, automation, performance and resiliency to deploy and run modern applications at any scale from anywhere on the planet.
Featured image: ©Kras99