Cloudera Data Analyst Training
Four Day Course
Data Analyst




Apache Hadoop Fundamentals

  • The Motivation for Hadoop

  • Hadoop Overview

  • Data Storage: HDFS

  • Distributed Data Processing: YARN, MapReduce, and Spark

  • Data Processing and Analysis: Pig, Hive, and Impala

  • Database Integration: Sqoop

  • Other Hadoop Data Tools

  • Exercise Scenarios

Introduction to Apache Pig

  • What is Pig?

  • Pig’s Features

  • Pig Use Cases

  • Interacting with Pig

Basic Data Analysis with Apache Pig

  • Pig Latin Syntax

  • Loading Data

  • Simple Data Types

  • Field Definitions

  • Data Output

  • Viewing the Schema

  • Filtering and Sorting Data

  • Commonly Used Functions

Processing Complex Data with Apache Pig

  • Storage Formats

  • Complex/Nested Data Types

  • Grouping

  • Built-In Functions for Complex Data

  • Iterating Grouped Data

Multi-Dataset Operations with Apache Pig

  • Techniques for Combining Datasets

  • Joining Datasets in Pig

  • Set Operations

  • Splitting Datasets

Apache Pig Troubleshooting and Optimization

  • Troubleshooting Pig

  • Logging

  • Using Hadoop’s Web UI

  • Data Sampling and Debugging

  • Performance Overview

  • Understanding the Execution Plan

  • Tips for Improving the Performance of Pig Jobs

Introduction to Apache Hive and Impala

  • What is Hive?

  • What is Impala?

  • Why Use Hive and Impala?

  • Schema and Data Storage

  • Comparing Hive and Impala to Traditional Databases

  • Use Cases

Querying with Apache Hive and Impala

  • Databases and Tables

  • Basic Hive and Impala Query Language Syntax

  • Data Types

  • Using Hue to Execute Queries

  • Using Beeline (Hive’s Shell)

  • Using the Impala Shell

Apache Hive and Impala Data Management

  • Data Storage

  • Creating Databases and Tables

  • Loading Data

  • Altering Databases and Tables

  • Simplifying Queries with Views

  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables

  • Loading Data into Partitioned Tables

  • When to Use Partitioning

  • Choosing a File Format

  • Using Avro and Parquet File Formats

Relational Data Analysis with Apache Hive and Impala

  • Joining Datasets

  • Common Built-In Functions

  • Aggregation and Windowing

Complex Data with Apache Hive and Impala

  • Complex Data with Hive

  • Complex Data with Impala

Analyzing Text with Apache Hive and Impala

  • Using Regular Expressions with Hive and Impala

  • Processing Text Data with SerDes in Hive

  • Sentiment Analysis and n-grams in Hive

Apache Hive Optimization

  • Understanding Query Performance

  • Bucketing

  • Indexing Data

  • Hive on Spark

Apache Impala Optimization

  • How Impala Executes Queries

  • Improving Impala Performance

Extending Apache Hive and Impala

  • Custom SerDes and File Formats in Hive

  • Data Transformation with Custom Scripts in Hive

  • User-Defined Functions

  • Parameterized Queries

Choosing the Best Tool for the Job

  • Comparing Pig, Hive, Impala, and Relational Databases

  • Which to Choose?


Cloudera Data Science Workbench (CDSW)

Three Day Training

Overview of CDSW 

  • Introduction to CDSW

  • How to Access CDSW

  • Navigating around CDSW

  • User Settings

  • Hadoop Authentication

Projects in CDSW 

  • Creating a New Project

  • Navigating around a Project

  • Project Settings

The CDSW Workbench Interface 

  • Using the Workbench

  • Using the Sidebar

  • Using the Code Editor

  • Engines and Sessions

Running Python and R Code in CDSW 

  • Running Code

  • Using the Session Prompt

  • Using the Terminal

  • Installing Packages

  • Using Markdown in Comments

Using Apache Spark 2 in CDSW

  • Scenario and Dataset

  • Copying Files to HDFS

  • Interfaces to Apache Spark 2

  • Connecting to Spark

  • Reading Data

  • Inspecting Data


Exploratory Data Science in CDSW

  • Transforming Data

  • Using SQL Queries

  • Visualizing Data from Spark

  • Machine Learning with MLlib

  • Session History

Teams and Collaboration in CDSW

  • Collaboration in CDSW

  • Teams in CDSW

  • Using Git for Collaboration

  • Conclusion

Developer Training for Apache Spark and Hadoop
Apache Spark

Introduction to Apache Hadoop 
and the Hadoop Ecosystem

• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
Distributed Processing on  
an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets 
and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using 
Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames

RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing 
Transformation Functions
• Transformation Execution
• Converting Between RDDs 
and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Querying Tables and Views  
with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running  
Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties

Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
Common Patterns in Apache Spark Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• Developing Streaming Applications
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source

Developer Training for MapReduce
Four Day Course


The Motivation for Hadoop
• Problems with Traditional 
Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and 
Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and 
the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers,  
and Reducers in Java
• Speeding Up Hadoop Development  
by Using Eclipse
• Differences Between the Old  
and New MapReduce APIs
Writing a MapReduce Program 
Using Streaming
• Writing Mappers and Reducers 
with the Streaming API

Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Frameworks
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers 
and Reducers
• Decreasing the Amount of Intermediate
Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of 
Mappers, Reducers, and Partitioners
Practical Development Tips  
and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally 
by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers  
Work Together
• Determining the Optimal Number  
of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and
WritableComparable Implementations
• Saving Binary Data Using SequenceFile 
and Avro Data Files
• Issues to Consider When Using 
File Compression
• Implementing Custom InputFormats 
and OutputFormats


Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency — Inverse 
Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into 
the Enterprise Workflow
• Integrating Hadoop into 
an Existing Enterprise
• Loading Data from an RDBMS 
into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems  
with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows

Cloudera Administrator Training
Four Day Course


The Case for Apache Hadoop 

  • Why Hadoop?

  • Fundamental Concepts

  • Core Hadoop Components

Hadoop Cluster Installation

  • Rationale for a Cluster Management Solution

  • Cloudera Manager Features

  • Cloudera Manager Installation

  • Hadoop (CDH) Installation

The Hadoop Distributed File System (HDFS)

  • HDFS Features

  • Writing and Reading Files

  • NameNode Memory Considerations

  • Overview of HDFS Security

  • Web UIs for HDFS

  • Using the Hadoop File Shell

MapReduce and Spark on YARN

  • The Role of Computational Frameworks

  • YARN: The Cluster Resource Manager

  • MapReduce Concepts

  • Apache Spark Concepts

  • Running Computational Frameworks on YARN

  • Exploring YARN Applications Through the Web UIs and the Shell

  • YARN Application Logs

Hadoop Configuration and Daemon Logs

  • Cloudera Manager Constructs for Managing Configurations

  • Locating Configurations and Applying Configuration Changes

  • Managing Role Instances and Adding Services

  • Configuring the HDFS Service

  • Configuring Hadoop Daemon Logs

  • Configuring the YARN Service

Getting Data Into HDFS

  • Ingesting Data From External Sources With Flume

  • Ingesting Data From Relational Databases With Sqoop

  • REST Interfaces

  • Best Practices for Importing Data

Planning Your Hadoop Cluster

  • General Planning Considerations

  • Choosing the Right Hardware

  • Virtualization Options

  • Network Considerations

  • Configuring Nodes

Installing and Configuring Hive, Impala, and Pig

  • Hive

  • Impala

  • Pig

Hadoop Clients Including Hue

  • What Are Hadoop Clients?

  • Installing and Configuring Hadoop Clients

  • Installing and Configuring Hue

  • Hue Authentication and Authorization

Advanced Cluster Configuration

  • Advanced Configuration Parameters

  • Configuring Hadoop Ports

  • Configuring HDFS for Rack Awareness

  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important

  • Hadoop’s Security System Concepts

  • What Kerberos Is and How It Works

  • Securing a Hadoop Cluster With Kerberos

  • Other Security Concepts

Managing Resources

  • Configuring cgroups with Static Service Pools

  • The Fair Scheduler

  • Configuring Dynamic Resource Pools

  • YARN Memory and CPU Settings

  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status

  • Copying Data Between Clusters

  • Adding and Removing Cluster Nodes

  • Rebalancing the Cluster

  • Directory Snapshots

  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • Cloudera Manager Monitoring Features

  • Monitoring Hadoop Clusters

  • Troubleshooting Hadoop Clusters

  • Common Misconfigurations

