
APACHE SPARK
Developer Training for Apache Spark and Hadoop


Introduction to Apache Hadoop
and the Hadoop Ecosystem
• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
Distributed Processing on
an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets
and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using
Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames
RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing
Transformation Functions
• Transformation Execution
• Converting Between RDDs
and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Querying Tables and Views
with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running
Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties

Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
Common Patterns in Apache Spark
Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
Apache Spark Streaming:
Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
Apache Spark Streaming:
Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka
Data Sources
• Example: Using a Kafka Direct
Data Source
Developer Training for MapReduce
Four-Day Course

Introduction
The Motivation for Hadoop
• Problems with Traditional
Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and
Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and
the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers,
and Reducers in Java
• Speeding Up Hadoop Development
by Using Eclipse
• Differences Between the Old
and New MapReduce APIs
Writing a MapReduce Program
Using Streaming
• Writing Mappers and Reducers
with the Streaming API



Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Frameworks
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers
and Reducers
• Decreasing the Amount of Intermediate
Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of
Mappers, Reducers, and Partitioners
Practical Development Tips
and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally
by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers
Work Together
• Determining the Optimal Number
of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and
WritableComparable Implementations
• Saving Binary Data Using SequenceFile
and Avro Data Files
• Issues to Consider When Using
File Compression
• Implementing Custom InputFormats
and OutputFormats
Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency — Inverse
Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into
the Enterprise Workflow
• Integrating Hadoop into
an Existing Enterprise
• Loading Data from an RDBMS
into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems
with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows
Conclusion
Cloudera Administrator Training
Four-Day Course
Introduction
The Case for Apache Hadoop
- Why Hadoop?
- Fundamental Concepts
- Core Hadoop Components
Hadoop Cluster Installation
- Rationale for a Cluster Management Solution
- Cloudera Manager Features
- Cloudera Manager Installation
- Hadoop (CDH) Installation
The Hadoop Distributed File System (HDFS)
- HDFS Features
- Writing and Reading Files
- NameNode Memory Considerations
- Overview of HDFS Security
- Web UIs for HDFS
- Using the Hadoop File Shell
MapReduce and Spark on YARN
- The Role of Computational Frameworks
- YARN: The Cluster Resource Manager
- MapReduce Concepts
- Apache Spark Concepts
- Running Computational Frameworks on YARN
- Exploring YARN Applications Through the Web UIs and the Shell
- YARN Application Logs
Hadoop Configuration and Daemon Logs
- Cloudera Manager Constructs for Managing Configurations
- Locating Configurations and Applying Configuration Changes
- Managing Role Instances and Adding Services
- Configuring the HDFS Service
- Configuring Hadoop Daemon Logs
- Configuring the YARN Service
Getting Data Into HDFS
- Ingesting Data From External Sources With Flume
- Ingesting Data From Relational Databases With Sqoop
- REST Interfaces
- Best Practices for Importing Data
Planning Your Hadoop Cluster
- General Planning Considerations
- Choosing the Right Hardware
- Virtualization Options*
- Network Considerations
- Configuring Nodes
Installing and Configuring Hive, Impala, and Pig
- Hive
- Impala
- Pig
Hadoop Clients Including Hue
- What Are Hadoop Clients?
- Installing and Configuring Hadoop Clients
- Installing and Configuring Hue
- Hue Authentication and Authorization
Advanced Cluster Configuration
- Advanced Configuration Parameters
- Configuring Hadoop Ports
- Configuring HDFS for Rack Awareness
- Configuring HDFS High Availability
Hadoop Security
- Why Hadoop Security Is Important
- Hadoop’s Security System Concepts
- What Kerberos Is and How It Works
- Securing a Hadoop Cluster With Kerberos
- Other Security Concepts
Managing Resources
- Configuring cgroups with Static Service Pools
- The Fair Scheduler
- Configuring Dynamic Resource Pools
- YARN Memory and CPU Settings
- Impala Query Scheduling
Cluster Maintenance
- Checking HDFS Status
- Copying Data Between Clusters
- Adding and Removing Cluster Nodes
- Rebalancing the Cluster
- Directory Snapshots
- Cluster Upgrading
Cluster Monitoring and Troubleshooting
- Cloudera Manager Monitoring Features
- Monitoring Hadoop Clusters
- Troubleshooting Hadoop Clusters
- Common Misconfigurations
Conclusion



