With business stakeholders looking for swift, precise, and informative summaries, real-time dashboards have become the need of the hour for at least a few sub-functions. More often than not, these solutions are built away from the transactional systems and need a robust foundation in the application environment to ensure precision. What might seem like a simple requirement usually hides more complex nuances.
This blog discusses the challenges you should look out for while tackling such a requirement. Do we have solutions to them in the Big Data ecosystem?
The first step is to know that every solution begins with understanding the requirements completely.
Expectation from an Actionable Real-Time Application
The ask – a real-time reporting system to aid round-the-clock monitoring of operations and enable quick business actions. Let’s look at the verticals of this requirement:
The above picture depicts the high-level requirements and skips a lot of details. Let’s look at how these simple requirements can take the shape of a complex dashboarding demand.
This picture now gives you the real scenario. Let’s look at it closely:
- Every IUD (insert/update/delete) operation is to be captured – every change in the state of the transactional source system.
- If you have been working with the Big Data ecosystem, you’ll probably know that normalised data translates into processing lag and high resource consumption, since it forces joins. The need to seamlessly link data sitting across 18 tables is critical.
- All of this, plus the KPI calculations, is to be done at near-real-time latency.
To implement such a requirement, you can use the strategic points and questions below as a reference.
The efficiency of any implementation depends on many factors, and choosing the right technology stack is the most important among them. Breaking it down, you would need:
- a tool to capture every change in the state of the transactional system (the source)
- a tool to receive these captured changes resiliently and make them available for consumption, with the possibility of parallelism
- an engine to consume these events and perform the transformations at real-time latency
- a storage solution that aids in achieving a golden record for the natural key
- an engine to process the golden records, extract all relevant business information, calculate KPIs, and finally generate a pre-aggregated dataset
- appropriate storage for the reporting layer
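The component list above can be sketched end to end in plain Python. This is a hypothetical skeleton, not the actual implementation: in the real solution each function maps to Golden Gate (capture), Kafka (transport), Spark (transform), HBase (golden record), and Elasticsearch (reporting); all names here are made up for illustration.

```python
# Hypothetical sketch of the pipeline stages described above.
# Each stage is a plain function; the real stack replaces each one.

def capture_change(table, key, row):
    """Stage 1: capture a change event from the transactional source."""
    return {"table": table, "key": key, "row": row}

def publish(event, topic_log):
    """Stage 2: append the event to a resilient, ordered log (Kafka's role)."""
    topic_log.append(event)

def transform(event):
    """Stage 3: apply a real-time transformation to the event (Spark's role)."""
    event["row"] = {k: str(v).strip() for k, v in event["row"].items()}
    return event

def upsert_golden_record(store, event):
    """Stage 4: merge per-table data under the natural key (HBase's role)."""
    store.setdefault(event["key"], {}).update(
        {f"{event['table']}:{c}": v for c, v in event["row"].items()})

def to_report_row(key, record):
    """Stage 5: compute a pre-aggregated row for the reporting layer."""
    return {"id": key, "fields": len(record)}

# Wire the stages together on one sample event.
log, store = [], {}
publish(transform(capture_change("orders", "K1", {"amount": " 42 "})), log)
for e in log:
    upsert_golden_record(store, e)
report = [to_report_row(k, v) for k, v in store.items()]
```

The value of laying it out this way is that each stage can be sized, scaled, and recovered independently, which is exactly what the implementation notes below deal with.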
Once you have the design for the pipeline drawn up and know the list of components to be developed, you might think a bunch of developers wouldn’t have it tough implementing the design.
Here’s the catch: there are more than a few nitty-gritty details to be taken care of in the implementation phase. You could say the architect’s role isn’t over yet. Let’s look at a few of them:
- Designing the storage objects for each of the chosen DBs/storage systems.
- Allocating resources to the processing components and designing them to allow parallel processing in the most optimal fashion (you don’t want an under-performing, over-parallelised pipeline, or a pipeline that performs well only at the cost of other applications).
- Designing the process-flow mechanism, ensuring that the pipeline adheres to every necessary dependency, is capable of recovering from failure, and, most of all, works towards near-real-time latency.
A Solution in the Big Data Ecosystem
Technology stack identification
Let’s look at the tool stack chosen, and I’ll let you in on why each tool was picked.
Golden Gate –
To capture all actions on the transaction system.
With Oracle as the source DB and a Big Data environment as the target, Golden Gate fills the position of data-movement tool quite well, with capabilities of:
- Movement in real-time
- Movement of only committed transactions, thus enabling consistency and improving performance.
- High performance with minimal overhead on the underlying databases and infrastructure (too many stagnant connections are no longer an issue)
Placing Golden Gate for Oracle at the source and Golden Gate for Big Data at the target solves the requirement for continuous change capture into the target system.
Kafka –
Golden Gate publishes the redo logs to Kafka topics. Kafka demonstrates the properties of:
- Guaranteed sequential integrity – events stored in sequence with offset as a marker.
- Resilience – default 7-day retention of events and replication of events across the cluster
- Durability – disk storage
- Distributed – configurable number of partitions allow parallelism in event consumption
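The sequential-integrity guarantee above holds per partition: Kafka routes each keyed event to a partition (by default, a hash of the key) and orders events by offset only within that partition. A toy pure-Python model of that behaviour (not the real Kafka client; a byte-sum stands in for Kafka's actual hash):

```python
# Toy model of Kafka's per-partition ordering: events with the same key
# land in the same partition, where offsets preserve arrival order.

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def partition_for(key):
    # Kafka's default partitioner hashes the key; a byte sum stands in here.
    return sum(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    offset = len(partitions[p])          # offsets are per-partition markers
    partitions[p].append((offset, key, value))
    return p, offset

# All events for natural key "K1" go to one partition, in order.
events = [("K1", "insert"), ("K2", "insert"), ("K1", "update"), ("K1", "delete")]
placements = [produce(k, v) for k, v in events]

p_k1 = partition_for("K1")
k1_values = [v for _, k, v in partitions[p_k1] if k == "K1"]
```

This is why keying the events by the natural key matters in this pipeline: it keeps every change for one entity in order, while still letting different entities be consumed in parallel.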
Spark Streaming and Core/SQL –
With its distributed architecture, an option for consuming live data, and support for both functional programming and SQL-based transformations, Spark is the best fit for the processing requirements of the pipeline.
HBase –
HBase was, at one point, the closest one could get to a transactional system in the Big Data ecosystem, for several reasons. Along with taking advantage of the underlying HDFS, it allows a key-based update strategy and column-oriented storage, and it is designed to cater to high read-write throughput.
With the data in this case having a natural key across the transactional tables, keeping that key as the row key of the table and updating each table’s data in isolation into different columns allows us to seamlessly get a golden record without joins between all the concerned tables. This is a real saving when you are working with distributed processing, where joins can shatter performance.
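As a sketch, the update-in-isolation idea looks like this in pure Python. It is a stand-in for HBase Put/Get operations against a single table keyed by the natural key (a real client such as `happybase` or the HBase shell would do this for real; table and column names are invented):

```python
# Stand-in for an HBase table: one row per natural key, where each source
# table writes its own column qualifiers ("<table>:<column>") in isolation.
# The golden record is then a single keyed read, with no joins.

hbase_table = {}  # row_key -> {column_qualifier: value}

def put(row_key, source_table, columns):
    """Merge one source table's latest columns into the golden record."""
    row = hbase_table.setdefault(row_key, {})
    for col, val in columns.items():
        row[f"{source_table}:{col}"] = val   # last write wins, per column

def get_golden_record(row_key):
    """One key-based read returns data from every source table."""
    return hbase_table.get(row_key, {})

# Three independent updates to the same natural key, from two source tables.
put("ORD-1001", "orders",   {"status": "OPEN", "amount": "250"})
put("ORD-1001", "shipping", {"carrier": "DHL"})
put("ORD-1001", "orders",   {"status": "CLOSED"})   # in-place update

golden = get_golden_record("ORD-1001")
```

Because each source table touches only its own columns, the streaming jobs never need to coordinate with each other, and a dashboard query is always one key lookup.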
Elastic Search –
ElasticSearch is an apt choice for a dashboard-facing reporting layer. It not only gives you its well-known low latency (even when the underlying dataset is huge) but also supports many dashboarding features that traditional storage could not, such as pagination and horizontal scrolling on the screen.
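Pagination, for instance, maps directly onto the `from`/`size` parameters of Elasticsearch's search API. A dashboard's successive page requests can be built like this (index sort field and page size are made-up values for illustration; very deep pagination would use `search_after` instead):

```python
# Build Elasticsearch search bodies for dashboard pagination using the
# standard `from`/`size` mechanism of the search API.

PAGE_SIZE = 25

def page_query(page_number, sort_field="event_time"):
    """Search body for one dashboard page (page numbers start at 1)."""
    return {
        "from": (page_number - 1) * PAGE_SIZE,  # records to skip
        "size": PAGE_SIZE,                      # records per page
        "sort": [{sort_field: {"order": "desc"}}],
        "query": {"match_all": {}},
    }

first_page = page_query(1)
third_page = page_query(3)
```

The reporting layer only ever serves pre-aggregated records, so even these paged queries stay cheap regardless of how large the raw event history grows.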
Kafka Topics –
We’ve seen that Kafka serves the purpose of a traditional messaging service with an additional guarantee of parallelism. This is a configurable aspect, and you need to keep a few points in mind while designing your storage objects. The two major questions are:
- What amount of parallelism will aid the most optimised performance for publishing records in the topic?
- What is the level of parallelism your cluster is capable of withstanding while consuming these records?
One thing to know about Kafka topics is that, though the number of partitions is configurable, once set it can only be increased. After a topic has been created with a certain number of partitions and is holding logs, you can no longer decrease its partition count without losing data. Also, there is no rule of thumb to identify the appropriate number. So the trick here is simple: trial and error, starting from a moderate number of partitions and tuning until it shows the best performance.
On the second point, a trick is to keep the number of partitions and the number of executors of the streaming job in clean multiples of each other. This was a suggestion we got from another team, but it was quite simple and obvious once we heard it. The concept is as simple as avoiding any long-running instances or the job going into speculative execution: an equal distribution of data amongst the processing units naturally makes for smoother, exception-free, and optimised event consumption.
Though there is no formula for the apt design, these points will play a vital role in aiding the object design.
As you’ve understood by now, one key strategy in this solution, dodging the overhead of joins for a golden record, is addressed by landing all the incoming data in a single table.
Too many columns in an HBase table result in poor performance in the jobs processing its data. With so many columns, you would think the problem is simply huge data volume. That’s true, but the real cost here is the maintenance of these separate columns and their metadata.
I learned that columns clubbed together with a delimiter and stored as one column improve performance at least 10-fold. Meaning, when you get 100+ columns for every table (12 transactional tables here), you can instead club the columns per table and end up with the same data stored as far fewer columns. This is the simplest and best hack for HBase’s columnar storage.
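The clubbing trick reads like this in a pure-Python sketch, a stand-in for what the Spark job would do before its HBase Put and after its Get. The delimiter choice and the column names are assumptions for illustration:

```python
# Club a source table's many columns into one delimited HBase value, and
# split them back on read. Fewer cells means less per-cell metadata for
# HBase to maintain. The delimiter must never occur in the data itself.

DELIM = "\x01"  # a control character unlikely to appear in business data

def club_columns(column_order, row):
    """Pack the columns into one delimited string, in a fixed order."""
    return DELIM.join(str(row.get(col, "")) for col in column_order)

def unclub_columns(column_order, clubbed):
    """Restore the original column -> value mapping from the clubbed cell."""
    return dict(zip(column_order, clubbed.split(DELIM)))

ORDER_COLS = ["order_id", "status", "amount", "currency"]
row = {"order_id": "ORD-1001", "status": "OPEN", "amount": "250", "currency": "EUR"}

cell = club_columns(ORDER_COLS, row)      # one HBase cell instead of four
restored = unclub_columns(ORDER_COLS, cell)
```

The trade-off is that a clubbed cell can only be rewritten as a whole, so this works best when each source table's columns always arrive together, as they do with change capture.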