This post was first published on LinkedIn.
The problem is defined aptly in the diagram
below. We receive data from taxi drivers once they drop off their passengers in
the New York region. Each line of data contains 17 fields delimited by “,” and
carries information such as taxi number, passenger pickup and dropoff times,
pickup/dropoff latitude and longitude, and fare received. The New York
region is divided into square areas of 1 km x 1 km.
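To make the record format concrete, here is a minimal validation sketch in Java. The field positions, bounding box, and grid arithmetic below are assumptions for illustration only, not the exact DEBS layout:

```java
// Sketch: validate one comma-delimited taxi record and map a coordinate
// to a 1 km x 1 km grid cell. Field positions and the bounding box are
// illustrative assumptions, not the official DEBS 2015 schema.
public class TaxiRecord {
    static final int FIELD_COUNT = 17;
    // Rough bounding box for the New York region (assumed values).
    static final double MIN_LAT = 40.4, MAX_LAT = 41.0;
    static final double MIN_LON = -74.3, MAX_LON = -73.6;

    public static boolean isValid(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != FIELD_COUNT) return false; // fewer than 17 fields -> inconsistent
        try {
            // Assume pickup longitude/latitude sit in fields 6 and 7.
            double lon = Double.parseDouble(fields[6]);
            double lat = Double.parseDouble(fields[7]);
            return lat >= MIN_LAT && lat <= MAX_LAT && lon >= MIN_LON && lon <= MAX_LON;
        } catch (NumberFormatException e) {
            return false; // non-numeric coordinates -> inconsistent
        }
    }

    // Map a coordinate to a grid cell: 1 degree of latitude is roughly
    // 111 km; longitude is scaled by cos(latitude). A flat-earth
    // approximation, good enough for 1 km squares over one city.
    public static int[] cellOf(double lat, double lon) {
        int row = (int) Math.floor((MAX_LAT - lat) * 111.0);
        int col = (int) Math.floor((lon - MIN_LON) * 111.0 * Math.cos(Math.toRadians(lat)));
        return new int[]{row, col};
    }
}
```

The same validity check is what the pipeline later uses to flag drivers reporting inconsistent data.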
By now, we must all be aware of the application microservices
concept and its importance in software architecture. This post is not
about explaining application microservices; instead, it takes a use case and shows how
application and data microservices can be used to solve large-scale data
ingestion, processing, and loading problems.
But before describing the use case, I would like
to mention a few things about data microservices.
Data Microservices – What?
Data microservices are event driven. They react to
streaming data (events) as and when it arrives. They are not like
typical applications that expose RESTful APIs to be consumed by other applications.
Most use cases for data microservices therefore revolve around data pipelines, i.e.,
effectively moving data from one system to another while processing and enriching
it at each step.
Data Microservices – Why?
The following use cases could drive data microservices in your
architecture –
- Event processing – divide a monolithic data processing unit into multiple data microservices, thus creating a data pipeline. Each microservice can be tested independently, scaled on demand, and exposed to other teams for use in their own processing pipelines.
- Moving computation from the database to the application layer – stored procedures have played a key role in performing computation where the data resides. However, with data volumes growing heavily, stored-procedure-based computation will struggle to keep up with our computational requirements, so the computation needs to move to the application layer. This has several benefits: applications can be tested thoroughly (think about when you last wrote thorough test cases for stored procedures), scaled on demand, and shared with other teams (think about how many times the same logic gets re-written by application developers in stored procedures).
So now let’s look at the following problem, which was released in
the DEBS Grand Challenge 2015.
Now we need to design an architecture that can answer the
following questions in real time. For the last 10 seconds, collect all the data
received from the taxi drivers and find out –
- which taxi drivers are reporting inconsistent data – for example, a line entry that consists of fewer than 17 fields, or latitude/longitude values that do not belong to the New York region
- which are the top 10 areas (squares) where taxis are plying the most
- where the top 50 free taxis are available, based on their last reported dropoff locations
- how many total events are processed by the system
- how much time it takes to process 10 seconds of collected data
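As a sketch of one of these queries – counting drop-offs per grid cell over a 10-second batch and taking the top 10 – here is the core computation in Java, assuming cell IDs have already been derived upstream (the class and method names are mine, for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class TopAreas {
    // Given the grid-cell IDs of all drop-offs seen in one 10-second
    // window, return the busiest cells, most frequent first
    // (ties broken arbitrarily).
    public static List<String> topCells(List<String> dropoffCells, int n) {
        Map<String, Long> counts = dropoffCells.stream()
            .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The top-50-free-taxis query follows the same shape, keyed on the taxi's last reported dropoff cell instead of a count.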
Solution Architecture
At Pivotal, we have recently introduced Spring Cloud Data
Flow (SCDF), which can be used to create data microservices deployed on Pivotal
Cloud Foundry (PCF). We provide out-of-the-box application
starters, as well as a way to create your own custom data microservices.
The benefits of running SCDF on PCF are as follows –
- PCF can manage the entire lifecycle of deployed data microservices: it can deploy, scale, and self-heal them.
- PCF can automatically deploy applications and bind them appropriately to the underlying messaging layer. Developers don’t have to specify the exchanges and topics through which data microservices connect to each other.
- When a particular microservice is scaled, PCF and SCDF make sure that each message from the previous application in the data pipeline is handled by exactly one instance of the scaled microservice. This concept is called a Consumer Group in SCDF. It is handled automatically; developers and operators don’t have to do anything special.
- Developers can also define partitioning logic (i.e., that a particular piece of processed data should go to a specific instance of the next in-line application).
The following diagram describes a “simple” design for solving
this problem. Remember, this is just a demo and I may have left out many other
concerns. We assume that a file with all the data is available on an FTP server
and that the data can be streamed from it.
SCDF provides a way to define a data pipeline, to build custom
applications, and to “tap” a stream (create a duplicate stream) and derive a new
pipeline from it.
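In SCDF’s stream DSL, the pipeline described below (plus a tap off the transformer) might be declared roughly like this from the Data Flow shell – the stream and application names here are hypothetical labels for this demo, not the exact ones from the sample:

```shell
dataflow:> stream create --name taxi --definition "ftp | transformer | splitter | redis"
dataflow:> stream create --name taxi-tap --definition ":taxi.transformer > log"
```

The `:taxi.transformer >` prefix is the tap syntax: it duplicates the transformer’s output into a second, independent pipeline without disturbing the original one.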
The following applications are used –
1. FTP "Source" Application - This
out-of-the-box microservice retrieves files from an FTP server and then
posts each line (as a message) to the next in-line microservice.
2. Custom Transformer "Processor" Application - This microservice is a custom processor built with RxJava. RxJava has the concept of
a moving window: a window is created that captures all incoming messages for
10 seconds. Once the 10 seconds are over, RxJava processes all the collected data
and performs the required computation. Finally, the derived output is sent to the
next in-line microservice.
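RxJava provides this slicing directly via its `window` operator (e.g. `Observable.window(10, TimeUnit.SECONDS)`). As a dependency-free illustration of the same tumbling-window idea, here is a plain-Java sketch – the class is mine and the event type is simplified to a string:

```java
import java.util.*;

// Sketch of a 10-second tumbling window: buffer events until the window
// boundary is crossed, then hand the batch off for computation.
// Illustrative only; the actual processor uses RxJava's window operator.
public class TumblingWindow {
    private final long windowMillis;
    private final List<String> buffer = new ArrayList<>();
    private long windowStart = -1;

    public TumblingWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Feed one event with its timestamp; returns the completed window's
    // contents when the boundary is crossed, otherwise null.
    public List<String> onEvent(long timestampMillis, String event) {
        if (windowStart < 0) windowStart = timestampMillis;
        List<String> completed = null;
        if (timestampMillis - windowStart >= windowMillis) {
            completed = new ArrayList<>(buffer); // hand off for computation
            buffer.clear();
            windowStart = timestampMillis;       // start the next window
        }
        buffer.add(event);
        return completed;
    }
}
```

Note this sketch closes a window lazily, when the first event past the boundary arrives; RxJava's operator closes it on a timer even if no new event shows up.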
3. Splitter "Processor" Application - This
out-of-the-box microservice receives input from the custom processor and splits
messages based on keys. It then passes each message to the next in-line
microservice.
4. Redis "Sink" Application - This
microservice takes the incoming message, connects to the Redis database, and
pushes the data under an assigned key.
You can run this entire application yourself here.
Follow the “Part 1 – Building Data Microservices” section.
My next post will focus on how to create application
microservices using Spring Boot and Spring Cloud Services to fetch data from
Redis, display it on a UI, and build a fault-tolerant
architecture.

