Cloudera Data Analyst Training
Four Day Course
Introduction

Apache Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- Data Storage: HDFS
- Distributed Data Processing: YARN, MapReduce, and Spark
- Data Processing and Analysis: Pig, Hive, and Impala
- Database Integration: Sqoop
- Other Hadoop Data Tools
- Exercise Scenarios

Introduction to Apache Pig
- What is Pig?
- Pig’s Features
- Pig Use Cases
- Interacting with Pig

Basic Data Analysis with Apache Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly Used Functions

Processing Complex Data with Apache Pig
- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-In Functions for Complex Data
- Iterating Grouped Data

Multi-Dataset Operations with Apache Pig
- Techniques for Combining Datasets
- Joining Datasets in Pig
- Set Operations
- Splitting Datasets

Apache Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Logging
- Using Hadoop’s Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Pig Jobs

Introduction to Apache Hive and Impala
- What is Hive?
- What is Impala?
- Why Use Hive and Impala?
- Schema and Data Storage
- Comparing Hive and Impala to Traditional Databases
- Use Cases

Querying with Apache Hive and Impala
- Databases and Tables
- Basic Hive and Impala Query Language Syntax
- Data Types
- Using Hue to Execute Queries
- Using Beeline (Hive’s Shell)
- Using the Impala Shell
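
Beyond the interactive shells, queries can also be issued programmatically. A minimal sketch using the impyla Python client; the hostname, port, and table are illustrative assumptions, not values from the course:

    # Minimal sketch: running an Impala query from Python via impyla.
    # The hostname, port, and table name are placeholders.
    from impala.dbapi import connect

    conn = connect(host='impalad.example.com', port=21050)
    cur = conn.cursor()
    cur.execute('SELECT order_id, total FROM orders LIMIT 10')
    for row in cur.fetchall():
        print(row)
    cur.close()
    conn.close()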
​
Apache Hive and Impala Data Management
- Data Storage
- Creating Databases and Tables
- Loading Data
- Altering Databases and Tables
- Simplifying Queries with Views
- Storing Query Results

Data Storage and Performance
- Partitioning Tables
- Loading Data into Partitioned Tables
- When to Use Partitioning
- Choosing a File Format
- Using Avro and Parquet File Formats
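
To make partitioning and file formats concrete, here is a minimal PySpark sketch that writes a table as Parquet partitioned by one column; the paths and column names are illustrative assumptions:

    # Minimal sketch: writing Parquet data partitioned by a column in PySpark.
    # Input path, output path, and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('partitioning-demo').getOrCreate()
    sales = spark.read.csv('/data/sales.csv', header=True, inferSchema=True)

    # One subdirectory per distinct value of 'region', e.g. .../region=EMEA/
    sales.write.partitionBy('region').parquet('/data/sales_by_region')
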
Relational Data Analysis with Apache Hive and Impala
- Joining Datasets
- Common Built-In Functions
- Aggregation and Windowing

Complex Data with Apache Hive and Impala
- Complex Data with Hive
- Complex Data with Impala

Analyzing Text with Apache Hive and Impala
- Using Regular Expressions with Hive and Impala
- Processing Text Data with SerDes in Hive
- Sentiment Analysis and n-grams in Hive

Apache Hive Optimization
- Understanding Query Performance
- Bucketing
- Indexing Data
- Hive on Spark

Apache Impala Optimization
- How Impala Executes Queries
- Improving Impala Performance

Extending Apache Hive and Impala
- Custom SerDes and File Formats in Hive
- Data Transformation with Custom Scripts in Hive
- User-Defined Functions
- Parameterized Queries
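
To illustrate the custom-scripts topic: Hive can stream rows through an external program with its TRANSFORM clause. A minimal sketch, assuming a hypothetical tab-delimited customers table and script name:

    #!/usr/bin/env python
    # to_upper.py -- a hypothetical Hive TRANSFORM script.
    # Hive streams each row to stdin as tab-delimited text; we emit
    # transformed tab-delimited rows on stdout.
    import sys

    for line in sys.stdin:
        name, city = line.rstrip('\n').split('\t')
        print('%s\t%s' % (name.upper(), city.upper()))

    # Used from Hive roughly like this (after ADD FILE to_upper.py):
    #   SELECT TRANSFORM(name, city) USING 'python to_upper.py'
    #   AS (name_upper, city_upper) FROM customers;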
​
Choosing the Best Tool for the Job
- Comparing Pig, Hive, Impala, and Relational Databases
- Which to Choose?

Conclusion
Cloudera Data Science Workbench (CDSW)
Three Day Training
Overview of CDSW
- Introduction to CDSW
- How to Access CDSW
- Navigating around CDSW
- User Settings
- Hadoop Authentication

Projects in CDSW
- Creating a New Project
- Navigating around a Project
- Project Settings

The CDSW Workbench Interface
- Using the Workbench
- Using the Sidebar
- Using the Code Editor
- Engines and Sessions

Running Python and R Code in CDSW
- Running Code
- Using the Session Prompt
- Using the Terminal
- Installing Packages
- Using Markdown in Comments

Using Apache Spark 2 in CDSW
- Scenario and Dataset
- Copying Files to HDFS
- Interfaces to Apache Spark 2
- Connecting to Spark
- Reading Data
- Inspecting Data
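
A minimal sketch of what connecting to Spark and inspecting data looks like in a CDSW Python session; the application name and HDFS path are placeholders:

    # Minimal sketch: starting a Spark session and inspecting data.
    # The HDFS path is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('cdsw-example').getOrCreate()

    flights = spark.read.csv('/user/alice/flights.csv', header=True, inferSchema=True)
    flights.printSchema()     # inspect the inferred schema
    flights.show(5)           # peek at the first few rows

    spark.stop()
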
Exploratory Data Science in CDSW
- Transforming Data
- Using SQL Queries
- Visualizing Data from Spark
- Machine Learning with MLlib
- Session History

Teams and Collaboration in CDSW
- Collaboration in CDSW
- Teams in CDSW
- Using Git for Collaboration

Conclusion

Developer Training for Apache Spark and Hadoop
Introduction to Apache Hadoop and the Hadoop Ecosystem
• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
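
For flavor, the file operations in this module correspond to hdfs dfs shell commands; a minimal sketch driving them from Python, with all paths as placeholders:

    # Minimal sketch: common HDFS file-shell operations driven from Python.
    # All paths are placeholders.
    import subprocess

    subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', '/user/alice/data'], check=True)
    subprocess.run(['hdfs', 'dfs', '-put', 'local_file.csv', '/user/alice/data/'], check=True)
    listing = subprocess.run(['hdfs', 'dfs', '-ls', '/user/alice/data'],
                             capture_output=True, text=True, check=True)
    print(listing.stdout)
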
Distributed Processing on an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames
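
A minimal PySpark sketch of the three query patterns named above, on toy in-memory data (names and values are illustrative):

    # Minimal sketch: column expressions, aggregation, and a join in PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, avg

    spark = SparkSession.builder.appName('df-queries').getOrCreate()

    people = spark.createDataFrame(
        [(1, 'alice', 34), (2, 'bob', 29)], ['id', 'name', 'age'])
    orders = spark.createDataFrame(
        [(1, 120.0), (1, 80.0), (2, 45.0)], ['person_id', 'total'])

    adults = people.where(col('age') >= 18)                     # column expression
    avg_order = orders.groupBy('person_id').agg(avg('total'))   # aggregation
    joined = adults.join(orders, adults.id == orders.person_id) # join
    joined.show()
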
RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing Transformation Functions
• Transformation Execution
• Converting Between RDDs and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
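
The canonical pair-RDD example is word count; a minimal PySpark sketch, with a placeholder input path:

    # Minimal sketch: word count with a key-value pair RDD in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('wordcount-rdd').getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile('/data/shakespeare.txt')        # placeholder path
                .flatMap(lambda line: line.split())       # one element per word
                .map(lambda word: (word, 1))              # key-value pairs
                .reduceByKey(lambda a, b: a + b))         # sum counts per key
    print(counts.take(10))
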
Querying Tables and Views with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
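
A minimal sketch of querying a temporary view through Spark SQL and listing it via the Catalog API (the data and view name are illustrative):

    # Minimal sketch: registering a view and querying it with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('spark-sql').getOrCreate()

    df = spark.createDataFrame([('alice', 34), ('bob', 29)], ['name', 'age'])
    df.createOrReplaceTempView('people')

    spark.sql('SELECT name FROM people WHERE age > 30').show()
    print(spark.catalog.listTables())   # the Catalog API lists tables and views
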
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties
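
A minimal sketch of a standalone PySpark application and one plausible spark-submit invocation; the file name, master, and property values are assumptions:

    # my_app.py -- minimal sketch of a standalone Spark application.
    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        spark = SparkSession.builder.appName('my-app').getOrCreate()
        print(spark.range(1000).count())
        spark.stop()

    # Submitted roughly like:
    #   spark-submit --master yarn --deploy-mode cluster \
    #       --conf spark.executor.memory=2g my_app.py
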
Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
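
A minimal sketch of persisting a DataFrame at an explicit storage level (the data is illustrative):

    # Minimal sketch: caching a DataFrame with an explicit storage level.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName('persistence').getOrCreate()

    df = spark.range(1000000)
    df.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk
    df.count()    # first action materializes the cache
    df.count()    # later actions reuse the persisted partitions
    df.unpersist()
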
Common Patterns in Apache Spark Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
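
A minimal sketch of k-means with the spark.ml API on toy points (all values are illustrative):

    # Minimal sketch: k-means clustering with Spark MLlib (spark.ml API).
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName('kmeans-demo').getOrCreate()

    points = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
        ['features'])

    model = KMeans(k=2, seed=1).fit(points)
    print(model.clusterCenters())
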
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
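
A minimal sketch of a DStream word count over one-second batches; the socket host and port are placeholders:

    # Minimal sketch: a DStream word count over one-second batches.
    # The socket host and port are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName='dstream-wordcount')
    ssc = StreamingContext(sc, 1)   # one-second batches

    lines = ssc.socketTextStream('localhost', 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
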
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source
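
A minimal sketch of a receiver-less Kafka direct stream as exposed in the Spark 2 Python API (this assumes the spark-streaming-kafka-0-8 integration; the broker and topic are placeholders):

    # Minimal sketch: a Kafka direct DStream (Spark 2 Python API).
    # Broker list and topic name are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName='kafka-direct')
    ssc = StreamingContext(sc, 5)   # five-second batches

    stream = KafkaUtils.createDirectStream(
        ssc, topics=['weblogs'],
        kafkaParams={'metadata.broker.list': 'broker1.example.com:9092'})

    stream.map(lambda kv: kv[1]).pprint()   # each element is a (key, value) pair

    ssc.start()
    ssc.awaitTermination()
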
Developer Training for MapReduce
Four Day Course
Introduction
The Motivation for Hadoop
• Problems with Traditional Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers, and Reducers in Java
• Speeding Up Hadoop Development by Using Eclipse
• Differences Between the Old and New MapReduce APIs
Writing a MapReduce Program Using Streaming
• Writing Mappers and Reducers with the Streaming API
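
With Streaming, mappers and reducers are ordinary executables that read stdin and write stdout. A minimal Python word-count sketch; the file names and streaming jar path are placeholders:

    #!/usr/bin/env python
    # mapper.py -- hypothetical Streaming mapper: emit one (word, 1)
    # tab-separated pair per word on stdout.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print('%s\t%d' % (word, 1))

    # ------------------------------------------------------------------
    #!/usr/bin/env python
    # reducer.py -- hypothetical Streaming reducer: sum counts per word.
    # Hadoop delivers the mapper output sorted by key.
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

    # Submitted roughly like (the jar path varies by distribution):
    #   hadoop jar hadoop-streaming.jar \
    #       -input wordcount/in -output wordcount/out \
    #       -mapper mapper.py -reducer reducer.py \
    #       -file mapper.py -file reducer.py
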
Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Framework
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers and Reducers
• Decreasing the Amount of Intermediate Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of Mappers, Reducers, and Partitioners
Practical Development Tips and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers Work Together
• Determining the Optimal Number of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and WritableComparable Implementations
• Saving Binary Data Using SequenceFile and Avro Data Files
• Issues to Consider When Using File Compression
• Implementing Custom InputFormats and OutputFormats
Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency-Inverse Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into the Enterprise Workflow
• Integrating Hadoop into an Existing Enterprise
• Loading Data from an RDBMS into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows
Conclusion
Cloudera Administrator Training
Four Day Course
Introduction
The Case for Apache Hadoop
- Why Hadoop?
- Fundamental Concepts
- Core Hadoop Components

Hadoop Cluster Installation
- Rationale for a Cluster Management Solution
- Cloudera Manager Features
- Cloudera Manager Installation
- Hadoop (CDH) Installation

The Hadoop Distributed File System (HDFS)
- HDFS Features
- Writing and Reading Files
- NameNode Memory Considerations
- Overview of HDFS Security
- Web UIs for HDFS
- Using the Hadoop File Shell

MapReduce and Spark on YARN
- The Role of Computational Frameworks
- YARN: The Cluster Resource Manager
- MapReduce Concepts
- Apache Spark Concepts
- Running Computational Frameworks on YARN
- Exploring YARN Applications Through the Web UIs and the Shell
- YARN Application Logs

Hadoop Configuration and Daemon Logs
- Cloudera Manager Constructs for Managing Configurations
- Locating Configurations and Applying Configuration Changes
- Managing Role Instances and Adding Services
- Configuring the HDFS Service
- Configuring Hadoop Daemon Logs
- Configuring the YARN Service

Getting Data Into HDFS
- Ingesting Data From External Sources With Flume
- Ingesting Data From Relational Databases With Sqoop
- REST Interfaces
- Best Practices for Importing Data

Planning Your Hadoop Cluster
- General Planning Considerations
- Choosing the Right Hardware
- Virtualization Options
- Network Considerations
- Configuring Nodes

Installing and Configuring Hive, Impala, and Pig
- Hive
- Impala
- Pig

Hadoop Clients Including Hue
- What Are Hadoop Clients?
- Installing and Configuring Hadoop Clients
- Installing and Configuring Hue
- Hue Authentication and Authorization

Advanced Cluster Configuration
- Advanced Configuration Parameters
- Configuring Hadoop Ports
- Configuring HDFS for Rack Awareness
- Configuring HDFS High Availability

Hadoop Security
- Why Hadoop Security Is Important
- Hadoop’s Security System Concepts
- What Kerberos Is and How It Works
- Securing a Hadoop Cluster With Kerberos
- Other Security Concepts

Managing Resources
- Configuring cgroups with Static Service Pools
- The Fair Scheduler
- Configuring Dynamic Resource Pools
- YARN Memory and CPU Settings
- Impala Query Scheduling

Cluster Maintenance
- Checking HDFS Status
- Copying Data Between Clusters
- Adding and Removing Cluster Nodes
- Rebalancing the Cluster
- Directory Snapshots
- Cluster Upgrading

Cluster Monitoring and Troubleshooting
- Cloudera Manager Monitoring Features
- Monitoring Hadoop Clusters
- Troubleshooting Hadoop Clusters
- Common Misconfigurations

Conclusion