Introduction to Spark and Machine Learning

Who Should Attend?


Duration: 5 Days

The ability to analyze huge data sets is one of the most valuable technology skills today. This course is designed to bring you up to speed on one of the best technologies for this task: Apache Spark. Spark can perform up to 100x faster than Hadoop MapReduce, which has caused an explosion in demand for this skill set. The course covers Spark DataFrames with the Spark 2.0 syntax, the Machine Learning library (MLlib) with the DataFrame API, and Spark with the Python programming language. The course will benefit participants and organizations in the following ways:

  • Acquire fundamental knowledge of Spark
  • Understand the Spark architecture and how it distributes computations to cluster nodes
  • Write and run Spark programs with the Spark API
  • Prepare employees to embark on Spark programming and perform machine learning
  • Provide employees of the data analysis department with:
    • A systematic approach to using Apache Spark and Python for big data analysis
    • The ability to use Spark’s Machine Learning library (MLlib) to create powerful machine learning models
    • Cost reduction, since Apache Spark is open source
  • Understand the need for Spark in data processing
  • Be familiar with the basic installation, setup, and layout of Spark
  • Use the Spark shell for interactive and ad-hoc operations
  • Understand RDDs (Resilient Distributed Datasets), data partitioning, pipelining, and computations
  • Understand and use RDD operations such as map, filter, reduce, groupByKey, and join
  • Understand Spark’s data caching and its usage
  • Write and run standalone Spark programs with the Spark API
  • Be familiar with MLlib
Course Outline

  1. Introduction to Spark
    • Overview and Motivations of Spark Systems
    • Spark Ecosystem
    • Spark vs Hadoop
    • Acquiring and Installing Spark and the Spark Shell
  2. Resilient Distributed Datasets (RDD)
    • RDDs and Spark Architecture
    • RDD Partitioning and Transformations
    • Working with RDDs – Creating and Transforming (map, filter, etc.)
    • Key-Value Pairs – Definition, Creation, and Operations
    • Caching – Concepts, Storage Type, Guidelines
  3. Spark API
    • Overview, Basic Driver Code, SparkConf
    • Creating and Using a SparkContext
    • RDD API
  4. Application Development
    • Building and Running Applications
    • Application Lifecycle
  5. Spark Cluster
    • Cluster Managers
    • Logging and Debugging
  6. Spark SQL
    • Introduction and Usage
    • DataFrames and SQLContext
    • Working with JSON
  7. Querying
    • The DataFrame DSL and SQL
  8. Spark Structured Streaming
    • Architecture, Stateless, Stateful, and Windowed Transformations
    • Spark Structured Streaming API
  9. Spark Performance
    • Performance Characteristics and Tuning
    • Narrow vs Wide Dependencies
    • Minimizing Data Processing and Shuffling
    • Using Caching
    • Using Broadcast Variables and Accumulators
  10. Spark Machine Learning
    • Machine Learning with Spark MLlib
    • Supervised Learning vs Unsupervised Learning
    • Feature Vectors
    • Performance and Model Evaluation

Hands-On Activity

12 practical labs, with at least one lab for each topic

Register Now

Drop us your details if you are interested in joining this course.