Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, and SQL workloads that require fast, iterative access to datasets. Developers everywhere build applications with Spark running on Apache Hadoop YARN, deriving insights and developing their data science workloads against a single, distributed dataset in Hadoop. This architecture provides the foundation that allows Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response.


As momentum behind Apache Spark grows, customers across various sectors are deriving real value from it. Here are a few examples of how Spark is used:

  1. Insurance: Optimize the claims reimbursement process by using Spark's machine learning capabilities to process and analyze claims.
  2. Healthcare: Build a patient care system using Spark Core, Spark Streaming, and Spark SQL.
  3. Retail: Use Spark to analyze point-of-sale data and coupon usage.
  4. Internet: Use Spark's ML capabilities to identify fake profiles and improve the product matches shown to customers.
  5. Banking: Use a machine learning model to predict which financial products suit the profiles of a retail bank's users.
  6. Government: Analyze spending across time, geography, and category.
  7. Scientific Research: Analyze earthquakes by time, depth, and geography to predict future events.
  8. Investment Banking: Analyze intra-day stock prices to predict future price movements.
  9. Geospatial Analysis: Study Uber trips by time and geography to predict future demand and pricing.
  10. Twitter Sentiment Analysis: Analyze large volumes of tweets to determine positive, negative, or neutral sentiment toward particular products and organizations.
  11. Airlines: Build a model to predict airline travel delays.
  12. Devices: Predict the likelihood of a device exceeding threshold temperatures.

What’s New in Spark 2.0?

This release marks a major milestone for the project, delivering targeted feature enhancements based on community feedback. The enhancements fall into four major areas of improvement.


  • SQL

SQL is the most popular interface for Apache Spark-based applications. Spark 2.0 offers support for all 99 TPC-DS queries, which rely largely on the SQL:2003 specification. Existing data workloads can be ported to a Spark backend with fewer rewrites of the application stack.

  • Machine Learning

A major emphasis of the new release is machine learning. The new spark.ml package, which is based on DataFrames, will replace the existing Spark MLlib. Machine learning models and pipelines can now be persisted across all languages supported by Spark. Generalized Linear Models, K-Means, Survival Regression, and Naive Bayes are now supported in R.

  • Datasets:

For the Scala and Java programming languages, DataFrames and Datasets have been unified under the new Dataset class, which also serves as an abstraction for Structured Streaming. HiveContext and SQLContext have been superseded by the unified SparkSession. For backward compatibility, the old APIs have been deprecated.

Join the Institute of DBA training course to build your career in this field as a Certified DBA Professional.

Stay connected to CRB Tech for more technical optimization and other updates and information.
