Cloudera

Cloudera Data Analyst Training

Four Day Course

Data Analyst

Introduction

Apache Hadoop Fundamentals

The Motivation for Hadoop
Hadoop Overview
Data Storage: HDFS
Distributed Data Processing: YARN, MapReduce, and Spark
Data Processing and Analysis: Pig, Hive, and Impala
Database Integration: Sqoop
Other Hadoop Data Tools
Exercise Scenarios

Introduction to Apache Pig

What is Pig?
Pig’s Features
Pig Use Cases
Interacting with Pig

Basic Data Analysis with Apache Pig

Pig Latin Syntax
Loading Data
Simple Data Types
Field Definitions
Data Output
Viewing the Schema
Filtering and Sorting Data
Commonly Used Functions

Processing Complex Data with Apache Pig

Storage Formats
Complex/Nested Data Types
Grouping
Built-In Functions for Complex Data
Iterating Grouped Data

Multi-Dataset Operations with Apache Pig

Techniques for Combining Datasets
Joining Datasets in Pig
Set Operations
Splitting Datasets

Apache Pig Troubleshooting and Optimization

Troubleshooting Pig
Logging
Using Hadoop’s Web UI
Data Sampling and Debugging
Performance Overview
Understanding the Execution Plan
Tips for Improving the Performance of Pig Jobs

Introduction to Apache Hive and Impala

What is Hive?
What is Impala?
Why Use Hive and Impala?
Schema and Data Storage
Comparing Hive and Impala to Traditional Databases
Use Cases

Querying with Apache Hive and Impala

Databases and Tables
Basic Hive and Impala Query Language Syntax
Data Types
Using Hue to Execute Queries
Using Beeline (Hive’s Shell)
Using the Impala Shell

Apache Hive and Impala Data Management

Data Storage
Creating Databases and Tables
Loading Data
Altering Databases and Tables
Simplifying Queries with Views
Storing Query Results

Data Storage and Performance

Partitioning Tables
Loading Data into Partitioned Tables
When to Use Partitioning
Choosing a File Format
Using Avro and Parquet File Formats

Relational Data Analysis with Apache Hive and Impala

Joining Datasets
Common Built-In Functions
Aggregation and Windowing

Complex Data with Apache Hive and Impala

Complex Data with Hive
Complex Data with Impala

Analyzing Text with Apache Hive and Impala

Using Regular Expressions with Hive and Impala
Processing Text Data with SerDes in Hive
Sentiment Analysis and n-grams in Hive

Apache Hive Optimization

Understanding Query Performance
Bucketing
Indexing Data
Hive on Spark

Apache Impala Optimization

How Impala Executes Queries
Improving Impala Performance

Extending Apache Hive and Impala

Custom SerDes and File Formats in Hive
Data Transformation with Custom Scripts in Hive
User-Defined Functions
Parameterized Queries

Choosing the Best Tool for the Job

Comparing Pig, Hive, Impala, and Relational Databases
Which to Choose?

Conclusion

Cloudera Data Science Workbench (CDSW)

Three Day Training

Overview of CDSW

Introduction to CDSW
How to Access CDSW
Navigating around CDSW
User Settings
Hadoop Authentication

Projects in CDSW

Creating a New Project
Navigating around a Project
Project Settings

The CDSW Workbench Interface

Using the Workbench
Using the Sidebar
Using the Code Editor
Engines and Sessions

Running Python and R Code in CDSW

Running Code
Using the Session Prompt
Using the Terminal
Installing Packages
Using Markdown in Comments

Using Apache Spark 2 in CDSW

Scenario and Dataset
Copying Files to HDFS
Interfaces to Apache Spark 2
Connecting to Spark
Reading Data
Inspecting Data

CDSW

Exploratory Data Science in CDSW

Transforming Data
Using SQL Queries
Visualizing Data from Spark
Machine Learning with MLlib
Session History

Teams and Collaboration in CDSW

Collaboration in CDSW
Teams in CDSW
Using Git for Collaboration
Conclusion

Developer Training for Apache Spark and Hadoop

Apache Spark

Introduction to Apache Hadoop and the Hadoop Ecosystem

• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
Distributed Processing on an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames

RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing Transformation Functions
• Transformation Execution
• Converting Between RDDs and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Querying Tables and Views with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties

Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
Common Patterns in Apache Spark Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source

Developer Training for MapReduce

Four Day Course

MapReduce

Introduction

The Motivation for Hadoop
• Problems with Traditional Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers, and Reducers in Java
• Speeding Up Hadoop Development by Using Eclipse
• Differences Between the Old and New MapReduce APIs
Writing a MapReduce Program Using Streaming
• Writing Mappers and Reducers with the Streaming API

Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Frameworks
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers and Reducers
• Decreasing the Amount of Intermediate Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of Mappers, Reducers, and Partitioners
Practical Development Tips and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers Work Together
• Determining the Optimal Number of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and WritableComparable Implementations
• Saving Binary Data Using SequenceFile and Avro Data Files
• Issues to Consider When Using File Compression
• Implementing Custom InputFormats and OutputFormats

Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency-Inverse Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into the Enterprise Workflow
• Integrating Hadoop into an Existing Enterprise
• Loading Data from an RDBMS into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows
Conclusion

Cloudera Administrator Training

Four Day Course

Admin

Introduction

The Case for Apache Hadoop

Why Hadoop?
Fundamental Concepts
Core Hadoop Components

Hadoop Cluster Installation

Rationale for a Cluster Management Solution
Cloudera Manager Features
Cloudera Manager Installation
Hadoop (CDH) Installation

The Hadoop Distributed File System (HDFS)

HDFS Features
Writing and Reading Files
NameNode Memory Considerations
Overview of HDFS Security
Web UIs for HDFS
Using the Hadoop File Shell

MapReduce and Spark on YARN

The Role of Computational Frameworks
YARN: The Cluster Resource Manager
MapReduce Concepts
Apache Spark Concepts
Running Computational Frameworks on YARN
Exploring YARN Applications Through the Web UIs and the Shell
YARN Application Logs

Hadoop Configuration and Daemon Logs

Cloudera Manager Constructs for Managing Configurations
Locating Configurations and Applying Configuration Changes
Managing Role Instances and Adding Services
Configuring the HDFS Service
Configuring Hadoop Daemon Logs
Configuring the YARN Service

Getting Data Into HDFS

Ingesting Data From External Sources With Flume
Ingesting Data From Relational Databases With Sqoop
REST Interfaces
Best Practices for Importing Data

Planning Your Hadoop Cluster

General Planning Considerations
Choosing the Right Hardware
Virtualization Options
Network Considerations
Configuring Nodes

Installing and Configuring Hive, Impala, and Pig

Hive
Impala
Pig

Hadoop Clients Including Hue

What Are Hadoop Clients?
Installing and Configuring Hadoop Clients
Installing and Configuring Hue
Hue Authentication and Authorization

Advanced Cluster Configuration

Advanced Configuration Parameters
Configuring Hadoop Ports
Configuring HDFS for Rack Awareness
Configuring HDFS High Availability

Hadoop Security

Why Hadoop Security Is Important
Hadoop’s Security System Concepts
What Kerberos Is and How It Works
Securing a Hadoop Cluster With Kerberos
Other Security Concepts

Managing Resources

Configuring cgroups with Static Service Pools
The Fair Scheduler
Configuring Dynamic Resource Pools
YARN Memory and CPU Settings
Impala Query Scheduling

Cluster Maintenance

Checking HDFS Status
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing the Cluster
Directory Snapshots
Cluster Upgrading

Cluster Monitoring and Troubleshooting

Cloudera Manager Monitoring Features
Monitoring Hadoop Clusters
Troubleshooting Hadoop Clusters
Common Misconfigurations

Conclusion
