Tuesday, October 4, 2016

Building Large-Scale Data Streaming Architectures using Data and Application Microservices – Part 1

This post was first published on LinkedIn

By now, we are all aware of the application-microservices concept, its importance, and its place in software architecture. This post is not about explaining application microservices; instead, it takes a use case and shows how application and data microservices can be used to solve large-scale data ingestion, processing, and loading problems.

But before describing the use case, I would like to mention a few things about Data Microservices.

Data Microservices – What?
Data microservices are event-driven. They react to streaming data (events) as and when the data arrives. Unlike typical applications, they do not expose RESTful APIs to be consumed by other applications. Most use cases for data microservices therefore revolve around data pipelines, i.e., effectively moving data from one system to another while processing and enriching it at each step.

Data Microservices – Why?
The following use cases could drive Data Microservices in your architecture –
  1. Events processing – Divide a monolithic data-processing unit into multiple data microservices, thus creating a data pipeline. Each microservice can be tested independently, scaled on demand, and exposed to other teams for consumption in their own processing pipelines.
  2. Move computation from database to application layer – Stored procedures have played a key role in performing computation where the data resides. However, with data volumes growing rapidly, stored-procedure-based computation will find it hard to keep up with our computational requirements. The computation needs to move to the application layer. This has several benefits, the primary ones being that applications can be tested thoroughly (think: when did you last write thorough test cases for stored procedures?), can be scaled on demand, and can be shared with other teams (think: how many times is the same logic re-written by application developers in stored procedures?).
So now let’s look at the following problem, which was released in the DEBS Grand Challenge 2015.

The problem is defined aptly in the diagram below. We receive data from taxi drivers each time they drop off a passenger in the New York region. Each line of data contains 17 fields delimited by “,” and carries information such as the taxi number, passenger pickup and drop-off times, pickup/drop-off latitude and longitude, and the fare received. The New York region is divided into square cells of 1 km x 1 km.
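To make the record format concrete, here is a minimal sketch of a per-line validity check. The field positions (drop-off longitude and latitude in fields 9 and 10) follow the DEBS 2015 layout, and the New York bounding box is an illustrative approximation, not an exact boundary:

```java
// Minimal validity check for one taxi record: exactly 17 comma-separated
// fields, and drop-off coordinates inside an approximate NYC bounding box.
// Field indices and the bounding box are assumptions for illustration.
public class TaxiRecordValidator {

    static final double MIN_LAT = 40.5, MAX_LAT = 41.0;
    static final double MIN_LON = -74.3, MAX_LON = -73.7;

    public static boolean isValid(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 17) {
            return false;                 // rule 1: exactly 17 fields
        }
        try {
            double lon = Double.parseDouble(fields[8]);   // drop-off longitude
            double lat = Double.parseDouble(fields[9]);   // drop-off latitude
            // rule 2: coordinates must fall inside the New York region
            return lat >= MIN_LAT && lat <= MAX_LAT
                && lon >= MIN_LON && lon <= MAX_LON;
        } catch (NumberFormatException e) {
            return false;                 // non-numeric coordinates
        }
    }
}
```

A record failing either rule would be attributed to the reporting driver as inconsistent data (question 1 below).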




Now we need to design an architecture that can answer the following questions in real time. For every 10 seconds, collect all the data received from the taxi drivers and find out
  1. which taxi drivers are reporting inconsistent data – for example, a line entry with fewer than 17 fields, or a latitude/longitude that does not belong to the New York region
  2. which are the top 10 areas (squares) where taxis are plying the most
  3. where the top 50 free taxis are available, based on their last reported drop-off location
  4. how many total events are processed by the system
  5. how much time is taken to process each 10 seconds of collected data
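Question 2 boils down to mapping each drop-off onto the 1 km grid and counting per cell. A sketch of that computation follows; the grid origin and the degrees-per-kilometre factors are rough approximations for New York’s latitude, chosen here purely for illustration:

```java
import java.util.*;

// Sketch of the "top 10 busiest cells" computation for one 10-second
// batch, on a grid of 1 km x 1 km cells. Origin and km-per-degree
// factors are illustrative approximations, not exact values.
public class BusiestCells {

    static final double ORIGIN_LAT = 41.0, ORIGIN_LON = -74.3;
    static final double KM_PER_DEG_LAT = 111.0;   // roughly constant
    static final double KM_PER_DEG_LON = 84.0;    // at ~41 degrees north

    // Map a coordinate to a "row.col" id on the 1 km grid.
    public static String cellId(double lat, double lon) {
        int row = (int) ((ORIGIN_LAT - lat) * KM_PER_DEG_LAT);
        int col = (int) ((lon - ORIGIN_LON) * KM_PER_DEG_LON);
        return row + "." + col;
    }

    // Count drop-offs per cell and return the n most frequent cell ids.
    public static List<String> topCells(List<double[]> dropoffs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : dropoffs) {                 // p = {lat, lon}
            counts.merge(cellId(p[0], p[1]), 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }
}
```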


Solution Architecture
At Pivotal, we have recently introduced Spring Cloud Data Flow (SCDF), which can be used to create data microservices deployed on Pivotal Cloud Foundry (PCF). We provide out-of-the-box application starters, as well as a way to create your own custom data microservices.

The benefits of running SCDF on PCF are as follows –
  • PCF can manage the entire lifecycle of deployed data microservices: it can deploy, scale, and self-heal them
  • PCF can automatically deploy applications and bind them appropriately to the underlying messaging layer. Developers don’t have to specify the exchanges and topics through which data microservices connect to each other
  • When a particular microservice is scaled, PCF and SCDF make sure that each message received from the previous application in the data pipeline is handled by exactly one instance of the scaled microservice. This concept is called a Consumer Group in SCDF. It is handled automatically, and developers/operators don’t have to do anything special
  • Developers can also define partitioning logic (i.e., route a particular piece of processed data to a specific instance of the next in-line application).
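As a concrete sketch of the last two points, scaling and partitioning are expressed as deployment properties in the SCDF shell. The stream and app names below are placeholders for the demo pipeline, and the partition key expression (first field of the record, i.e. the taxi id) is just one plausible choice:

```
# Scale the transformer to 3 instances, and partition its input so that
# all records for the same taxi reach the same instance.
dataflow:> stream deploy taxi-stream --properties "deployer.transformer.count=3,app.ftp.producer.partitionKeyExpression=payload.split(',')[0]"
```

With this in place, the consumer-group and partition bookkeeping is handled by SCDF and the binder; the application code itself does not change.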



The following diagram describes a “simple” design for solving this problem. Remember, this is just a demo, and I may have excluded many other concerns here. We assume that we have a file with all the data on an FTP server, and that the data can be streamed from there.



SCDF provides a way to define a data pipeline, plug in custom applications, and “tap” a stream (create a duplicate stream) to build a new pipeline from it.
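In the SCDF shell, defining the pipeline and tapping it might look like the sketch below. The stream name and the `:stream.app >` tap syntax follow SCDF conventions; the app names match the four applications described next, and the `log` sink is just a convenient tap destination for inspection:

```
# Define and deploy the main pipeline
dataflow:> stream create taxi-stream --definition "ftp | transformer | splitter | redis" --deploy

# Tap the transformer's output into a second, independent stream
dataflow:> stream create taxi-tap --definition ":taxi-stream.transformer > log" --deploy
```

The tap receives a copy of every message, so it can feed another team’s pipeline without disturbing the original one.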

The following applications are used –
  1. FTP "Source" Application – This out-of-the-box microservice retrieves files from the FTP server and posts each line (as a message) to the next in-line microservice.
  2. Custom Transformer "Processor" Application – This is a custom processor built using RxJava. RxJava has the concept of a moving window: a window is created that captures all incoming messages for 10 seconds. Once the 10 seconds are over, RxJava processes all the collected data and performs the required computation. Finally, the derived output is sent to the next in-line microservice.
  3. Splitter "Processor" Application – This out-of-the-box microservice receives input from the custom processor and splits messages based on keys. It then moves each message to the next in-line microservice.
  4. Redis "Sink" Application – This microservice takes the incoming message, connects to the Redis database, and pushes the data under an assigned key.
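The 10-second windowing at the heart of the transformer (step 2) deserves a closer look. The demo itself uses RxJava’s window operator; as a dependency-free sketch of the same tumbling-window idea, events can simply be bucketed by their arrival timestamp:

```java
import java.util.*;

// Plain-Java sketch of a 10-second tumbling window (the demo itself
// relies on RxJava's window operator). Events are grouped into buckets
// of windowMillis by arrival time; each bucket is processed as one batch.
public class TumblingWindow {

    static class Event {
        final long timestampMillis;
        final String payload;
        Event(long t, String p) { timestampMillis = t; payload = p; }
    }

    // Returns a map from window index to the events in that window.
    public static Map<Long, List<Event>> group(List<Event> events, long windowMillis) {
        Map<Long, List<Event>> windows = new TreeMap<>();
        for (Event e : events) {
            long bucket = e.timestampMillis / windowMillis;  // window index
            windows.computeIfAbsent(bucket, k -> new ArrayList<>()).add(e);
        }
        return windows;
    }
}
```

Each completed bucket corresponds to one 10-second batch on which the validity checks and top-N computations above are run before the results are forwarded to the splitter.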

You can run this entire application here; follow the “Part 1 – Building Data Microservices” section.

My next post will focus on how to create application microservices using Spring Boot and Spring Cloud Services that fetch data from Redis, display the retrieved data on a UI, and build a fault-tolerant architecture.
