
APACHE SPARK

Developer Training for Apache Spark and Hadoop
Apache Spark

Introduction to Apache Hadoop 
and the Hadoop Ecosystem


• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
Distributed Processing on  
an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets 
and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using 
Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames

RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing 
Transformation Functions
• Transformation Execution
• Converting Between RDDs 
and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Querying Tables and Views  
with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running  
Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties

Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
Common Patterns in Apache Spark  
Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
Apache Spark Streaming:  
Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
Apache Spark Streaming:  
Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka 
Data Sources
• Example: Using a Kafka Direct 
Data Source

Developer Training for MapReduce
Four Day Course
MapReduce

Introduction


The Motivation for Hadoop
• Problems with Traditional 
Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and 
Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and 
the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers,  
and Reducers in Java
• Speeding Up Hadoop Development  
by Using Eclipse
• Differences Between the Old  
and New MapReduce APIs
Writing a MapReduce Program 
Using Streaming
• Writing Mappers and Reducers 
with the Streaming API

Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Frameworks
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers 
and Reducers
• Decreasing the Amount of Intermediate
Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of 
Mappers, Reducers, and Partitioners
Practical Development Tips  
and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally 
by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers  
Work Together
• Determining the Optimal Number  
of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and
WritableComparable Implementations
• Saving Binary Data Using SequenceFile 
and Avro Data Files
• Issues to Consider When Using 
File Compression
• Implementing Custom InputFormats 
and OutputFormats

Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency — Inverse 
Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into 
the Enterprise Workflow
• Integrating Hadoop into 
an Existing Enterprise
• Loading Data from an RDBMS 
into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems  
with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows
Conclusion

Cloudera Administrator Training
Four Day Course
Administrator

Introduction

The Case for Apache Hadoop 

  • Why Hadoop?

  • Fundamental Concepts

  • Core Hadoop Components

Hadoop Cluster Installation

  • Rationale for a Cluster Management Solution

  • Cloudera Manager Features

  • Cloudera Manager Installation

  • Hadoop (CDH) Installation

The Hadoop Distributed File System (HDFS)

  • HDFS Features

  • Writing and Reading Files

  • NameNode Memory Considerations

  • Overview of HDFS Security

  • Web UIs for HDFS

  • Using the Hadoop File Shell

MapReduce and Spark on YARN

  • The Role of Computational Frameworks

  • YARN: The Cluster Resource Manager

  • MapReduce Concepts

  • Apache Spark Concepts

  • Running Computational Frameworks on YARN

  • Exploring YARN Applications Through the Web UIs and the Shell

  • YARN Application Logs

Hadoop Configuration and Daemon Logs

  • Cloudera Manager Constructs for Managing Configurations

  • Locating Configurations and Applying Configuration Changes

  • Managing Role Instances and Adding Services

  • Configuring the HDFS Service

  • Configuring Hadoop Daemon Logs

  • Configuring the YARN Service

Getting Data Into HDFS

  • Ingesting Data From External Sources With Flume

  • Ingesting Data From Relational Databases With Sqoop

  • REST Interfaces

  • Best Practices for Importing Data

Planning Your Hadoop Cluster

  • General Planning Considerations

  • Choosing the Right Hardware

  • Virtualization Options

  • Network Considerations

  • Configuring Nodes

Installing and Configuring Hive, Impala, and Pig

  • Hive

  • Impala

  • Pig

Hadoop Clients Including Hue

  • What Are Hadoop Clients?

  • Installing and Configuring Hadoop Clients

  • Installing and Configuring Hue

  • Hue Authentication and Authorization

Advanced Cluster Configuration

  • Advanced Configuration Parameters

  • Configuring Hadoop Ports

  • Configuring HDFS for Rack Awareness

  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important

  • Hadoop’s Security System Concepts

  • What Kerberos Is and How It Works

  • Securing a Hadoop Cluster With Kerberos

  • Other Security Concepts

Managing Resources

  • Configuring cgroups with Static Service Pools

  • The Fair Scheduler

  • Configuring Dynamic Resource Pools

  • YARN Memory and CPU Settings

  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status

  • Copying Data Between Clusters

  • Adding and Removing Cluster Nodes

  • Rebalancing the Cluster

  • Directory Snapshots

  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • Cloudera Manager Monitoring Features

  • Monitoring Hadoop Clusters

  • Troubleshooting Hadoop Clusters

  • Common Misconfigurations

Conclusion
