Spark Fundamentals II

Building on your foundational knowledge of Spark, take this opportunity to move your skills to the next level. With a focus on Spark Resilient Distributed Dataset (RDD) operations, this course exposes you to concepts that are critical to your success in this field.
About this Course
- Get in-depth knowledge of Spark’s architecture, including how data is distributed and how tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s memory caching.
- Learn how to use the more advanced operations available in the API.
- The lab exercises for this course are performed exclusively in the cloud, using a notebook interface.
Course Syllabus
- Module 1 – Introduction to Notebooks
  - Understand how to use Zeppelin in your Spark projects
  - Identify the various notebooks you can use with Spark
- Module 2 – Spark RDD Architecture
  - Understand how Spark generates RDDs
  - Manage partitions to improve RDD performance
- Module 3 – Optimizing Transformations and Actions
  - Use advanced Spark RDD operations
  - Identify which operations cause shuffling
- Module 4 – Caching and Serialization
  - Understand how and when to cache RDDs
  - Understand storage levels and their uses
- Module 5 – Development and Testing
  - Understand how to use sbt to build Spark projects
  - Understand how to use Eclipse and IntelliJ for Spark development
GENERAL INFORMATION
- This course is self-paced.
- It can be taken at any time.
- It can be audited as many times as you wish.
RECOMMENDED SKILLS PRIOR TO TAKING THIS COURSE
- Basic understanding of Apache Hadoop and Big Data.
- Completion of the Spark Fundamentals I course on BDU.
- Basic understanding of the Scala, Python, R, or Java programming languages.
- Basic knowledge of the Linux operating system.
REQUIREMENTS
- None
