Here at Palo, data is our most vital asset. Data goes hand in hand with the processing pipelines that are in place to serve our mission: offering quality data services related to news and data analytics.
This is the first article in our Technology series. Its aim is to share the knowledge we have gained over years of dealing with data-related problems. More articles will follow, so stay tuned: as we progress we will delve deeper into technical issues, moving from general problem outlines to concrete technical solutions.
A buzz term has appeared lately to outshine “Big Data”. Buzz terms are frequent in technology, and most of them have a short life span. That does not seem to be the case for “Fast Data”, because it answers a real need: retrieving, processing, storing and serving huge amounts of data is not enough. It needs to be done fast, reliably and, in many cases, continuously.
Let’s break a data processing pipeline down into its most important parts.
First and foremost comes the ingestion phase. The data must enter the system somehow, and this is the responsibility of the ingestion components. The sources from which data is retrieved can be numerous, ranging from social media APIs and crawled web pages to integrations with third-party data stores and services. The job of this component is to bring the data in and possibly apply a pre-processing step before making it available to the next phase.
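To make the idea concrete, here is a minimal sketch in plain Python of such an ingestion component; the `Record` type, the `preprocess` step and the `demo-feed` source name are all hypothetical, stand-ins for whatever a real pipeline would use:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Record:
    source: str  # where the item came from (API, crawl, ...)
    text: str    # pre-processed payload

def preprocess(raw: str) -> str:
    # Example pre-processing step: normalise whitespace and case.
    return " ".join(raw.split()).lower()

def ingest(raw_items: Iterable[str], source: str) -> Iterator[Record]:
    # Pull items from a source, drop empty ones, pre-process the rest,
    # and emit clean records for the next phase of the pipeline.
    for raw in raw_items:
        if not raw.strip():
            continue
        yield Record(source=source, text=preprocess(raw))

records = list(ingest(["  Breaking NEWS  ", "", "More  data"], source="demo-feed"))
```

In a real system the `raw_items` iterable would be replaced by an API client or crawler, but the shape stays the same: bring data in, clean it up, hand it on.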
The breakthrough towards the “Fast Data” era, though, comes mainly from advances in the next phase of the pipeline: the processing phase. Although this article is not meant to discuss specific technologies, Apache Spark cannot go unmentioned, because it is one of the key elements pushing towards a paradigm shift. Before Spark, Hadoop was dominant in the processing phase. One thing Hadoop does not do well is speed, due to its batch-oriented nature: batch jobs usually take a long time to complete, which can delay the response of the whole system. Spark introduced a micro-batch processing approach and helped alleviate this bottleneck by providing the means to create a stream-like processing experience.
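The micro-batch idea itself is simple enough to sketch without Spark. The following plain-Python toy (not Spark code, and with hypothetical names) shows the core of it: a continuous stream is chopped into small batches, and a batch-style job runs on each one as soon as it fills, so results flow out with low latency instead of waiting for the whole data set:

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def mini_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    # Chop an (in principle unbounded) stream into small batches; each
    # batch is yielded as soon as it fills, not when the stream ends.
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process_stream(stream: Iterable[T], batch_size: int,
                   job: Callable[[List[T]], R]) -> Iterator[R]:
    # Run a batch-style job over each mini batch, producing a
    # stream-like flow of intermediate results.
    for batch in mini_batches(stream, batch_size):
        yield job(batch)

# e.g. a per-batch count over a simulated stream of ten events
counts = list(process_stream(range(10), batch_size=4, job=len))
```

Spark’s streaming engine does far more (distribution, fault tolerance, windowing), but this is the shift in mindset: the batch job stays a batch job, only the batches become small and frequent.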
Last but not least come the actual data storage and, of course, the accompanying components, services and orchestration processes of the whole pipeline. While storage has its own caveats, orchestration and services are vital and deserve their own article. The problems to solve are many; here are some keywords, in random order, to whet our appetite: temporal decoupling, scaling out, resiliency, messaging, the actor model, reactive applications, backpressure… and the list goes on…
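Two of those keywords, temporal decoupling and backpressure, can at least be hinted at with a tiny sketch. In this hypothetical producer/consumer pair (plain Python, standard library only), a bounded queue decouples the two sides in time, and when the consumer falls behind, `put()` blocks, pushing back on the producer instead of letting memory grow without bound:

```python
import queue
import threading

# A bounded queue temporally decouples producer and consumer: when the
# consumer falls behind, buffer.put() blocks, applying backpressure to
# the producer rather than buffering unboundedly.
buffer: "queue.Queue[int]" = queue.Queue(maxsize=3)
results = []

def producer(n: int) -> None:
    for i in range(n):
        buffer.put(i)   # blocks while the buffer is full
    buffer.put(-1)      # hypothetical end-of-stream sentinel

def consumer() -> None:
    while True:
        item = buffer.get()
        if item == -1:
            break
        results.append(item * 2)  # stand-in for real processing

t1 = threading.Thread(target=producer, args=(5,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Real systems get this behaviour from message brokers and reactive-streams libraries rather than an in-process queue, but the principle carried by those keywords is the same.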