Cloudera Apache Spark Application Performance Tuning
- Lectures
$"
$"
Apache Spark Application Performance Tuning Course Overview
This three-day hands-on training gives developers the core concepts and techniques they need to improve the performance of their Apache Spark applications. Participants in the Apache Spark Application Performance Tuning Course
will learn how to identify common causes of poor performance in Spark applications, techniques for avoiding or correcting those problems, and best practices for monitoring Spark applications.
Target Audience:
This training is intended for software developers, engineers, and data scientists who are familiar with the process of developing Spark applications and are interested in enhancing the performance of their own code.
Learning Objectives for the Apache Spark Application Performance Tuning Course:
- Understand Apache Spark's architecture and job execution, and how runtime efficiency is improved by techniques such as lazy evaluation and pipelining
- Evaluate the performance characteristics of core data structures such as RDDs and DataFrames
- Select the file formats that allow your application to run most efficiently
- Identify and fix performance problems caused by data skew
- Use partitioning, bucketing, and join optimizations to improve Spark SQL performance
- Understand the performance impact that Python-based user-defined functions, RDDs, and DataFrames can have on your application
- Take advantage of caching to improve application performance
- Understand how the Catalyst and Tungsten optimizers work
- Understand how Workload XM can help troubleshoot and proactively monitor Spark application performance
- Learn about the new features in Spark 3.0 and, in particular, how the Adaptive Query Execution engine improves performance
- Please note: because of this program's limited availability, it may take up to three weeks to arrange the necessary logistics
You are going to learn:
Module 1: Spark Architecture (example below)
- RDDs
- DataFrames and Datasets
- Lazy Evaluation
- Pipelining
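As a minimal illustration of lazy evaluation and pipelining, the sketch below builds a small query that is only executed when an action is called; the input path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations only describe a plan; nothing is read or computed yet.
events = spark.read.parquet("data/events.parquet")                 # hypothetical path
errors = events.where("level = 'ERROR'").select("ts", "message")   # hypothetical columns

# The action below triggers execution. Spark pipelines the filter and the
# projection into a single pass over each partition of the input.
print(errors.count())
```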
Module 2: Data Sources and Formats (example below)
- An Overview of the Available Formats
- Impact on Performance
- The Small Files Problem
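The small files problem can often be reduced by compacting data into fewer, larger files on write. A rough sketch, assuming an existing SparkSession `spark` and hypothetical paths:

```python
# Thousands of tiny input files mean one task per file plus heavy listing
# overhead. Compacting into a smaller number of larger Parquet files helps.
raw = spark.read.json("landing/many_small_files/")        # hypothetical path

# coalesce() reduces the number of output files without a full shuffle;
# repartition() would shuffle but spreads data more evenly across files.
raw.coalesce(16).write.mode("overwrite").parquet("warehouse/compacted/")
```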
Module 3: Inferring Schemas (example below)
- The Cost of Inference
- Mitigation Techniques
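A common mitigation is to supply the schema explicitly so Spark never has to scan or sample the files to infer column types. A sketch, assuming an existing `spark` session; the column names and path are hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# Declaring the schema up front skips the extra inference pass over the data.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

events = (spark.read
          .schema(schema)              # no inference job is run
          .json("landing/events/"))    # hypothetical path
```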
Module 4: Dealing with Skewed Data (example below)
- Recognizing Skew
- Mitigation Techniques
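One widely used mitigation is key salting, sketched below for a join between a large skewed table and a small dimension table. The DataFrames `facts` and `dim` and the column names are illustrative assumptions, not part of the course materials:

```python
from pyspark.sql import functions as F

N = 16  # number of salt values; tune to the degree of skew

# Spread each hot key across N sub-keys so no single task handles all of it.
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))

# Replicate the small side once per salt value so every sub-key finds a match.
salted_dim = dim.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

joined = salted_facts.join(salted_dim, on=["key", "salt"])
```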
Module 5: Catalyst and Tungsten Overview (example below)
- Catalyst Overview
- Tungsten Overview
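The easiest way to see Catalyst and Tungsten at work is to print a query's plans. A minimal sketch, assuming an existing `spark` session and a hypothetical table path:

```python
# explain() shows the logical plans produced by Catalyst and the physical
# plan (whole-stage code generation on Tungsten) that will actually run.
sales = spark.read.parquet("warehouse/sales/")               # hypothetical path
sales.groupBy("region").count().explain(mode="formatted")    # the mode argument requires Spark 3.0+
```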
Module 6: Mitigating Spark Shuffles (example below)
- Denormalization
- Broadcast Joins
- Map-Side Operations
- Sort Merge Joins
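A typical shuffle-avoidance technique is to broadcast the small side of a join. A sketch assuming `facts` is a large DataFrame, `dim` is a small one, and `key` is the join column (all illustrative names):

```python
from pyspark.sql.functions import broadcast

# Broadcasting ships the small table to every executor, so the large table
# is joined locally and never shuffled across the network.
result = facts.join(broadcast(dim), on="key")
result.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin
```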
Module 7: Partitioned and Bucketed Tables (example below)
- Partitioned Tables
- Bucketed Tables
- Impact on Performance
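A sketch of writing a table that is both partitioned and bucketed; the DataFrame, table, and column names are illustrative:

```python
# Partitioning lets Spark prune whole directories at read time; bucketing
# pre-hashes rows by key so later joins and aggregations can avoid a shuffle.
(sales.write
      .partitionBy("sale_date")        # one directory per date value
      .bucketBy(32, "customer_id")     # 32 buckets within each partition
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("sales_bucketed"))  # bucketing requires saveAsTable, not a plain path
```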
Module 8: Improving Join Performance (example below)
- Skewed Joins
- Bucketed Joins
- Incremental Joins
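How Spark chooses a join strategy can also be influenced through configuration. The sketch below raises the auto-broadcast threshold so a medium-sized dimension table is broadcast instead of sort-merge joined; the threshold value and the DataFrames `orders` and `small_dim` are illustrative:

```python
# Tables smaller than this threshold (default 10 MB) are broadcast automatically.
# Raising it can convert a sort-merge join into a broadcast join, at the cost
# of more memory on the driver and executors.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

orders.join(small_dim, "key").explain()  # verify which join strategy was chosen
```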
Module 9: PySpark Overhead and UDFs (example below)
- PySpark Overhead
- Scalar UDFs
- Vectorized UDFs with Apache Arrow
- Scala UDFs
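The difference between a row-at-a-time scalar UDF and an Arrow-backed vectorized UDF can be sketched as follows; `df` and the `celsius` column are hypothetical, and the vectorized form requires pyarrow and Spark 3.0+:

```python
import pandas as pd
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

# Scalar UDF: each value is serialized to Python and back individually,
# which is the main source of PySpark UDF overhead.
@udf(DoubleType())
def to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

# Vectorized (pandas) UDF: Apache Arrow transfers whole column batches,
# amortizing the JVM-to-Python conversion cost over many rows.
@pandas_udf(DoubleType())
def to_fahrenheit_vec(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

df.select(to_fahrenheit_vec("celsius")).show()
```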
Module 10: Caching Data for Reuse (example below)
- Options for Caching
- Impact on Performance
- Caching Pitfalls
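A minimal caching sketch, assuming an existing `spark` session and a hypothetical dataset that several later queries reuse:

```python
from pyspark import StorageLevel

lookups = spark.read.parquet("warehouse/lookups/")     # hypothetical path
lookups.persist(StorageLevel.MEMORY_AND_DISK)          # keep in memory, spill to disk if it does not fit

lookups.count()       # an action materializes the cache; persist() alone is lazy
# ... reuse `lookups` across several queries ...
lookups.unpersist()   # release executor memory once the data is no longer needed
```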
Module 11: Workload XM (WXM) Introduction
- WXM Overview
- WXM for Spark Developers
Module 12: What's New in Spark 3.0 (example below)
- Adaptive Number of Shuffle Partitions
- Skew Joins
- Converting Sort Merge Joins to Broadcast Joins
- Dynamic Partition Pruning
- Dynamically Coalescing Shuffle Partitions
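These Spark 3.0 features are enabled through standard configuration keys. A brief sketch of turning on Adaptive Query Execution together with its partition-coalescing and skew-join optimizations, assuming an existing `spark` session:

```python
# Adaptive Query Execution re-optimizes the plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins
```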