Cloudera Data Analyst Training
Four Day Course
Data Analyst

CLOUDERA


Introduction


Apache Hadoop Fundamentals

  • The Motivation for Hadoop

  • Hadoop Overview

  • Data Storage: HDFS

  • Distributed Data Processing: YARN, MapReduce, and Spark

  • Data Processing and Analysis: Pig, Hive, and Impala

  • Database Integration: Sqoop

  • Other Hadoop Data Tools

  • Exercise Scenarios


Introduction to Apache Pig

  • What is Pig?

  • Pig’s Features

  • Pig Use Cases

  • Interacting with Pig


Basic Data Analysis with Apache Pig

  • Pig Latin Syntax

  • Loading Data

  • Simple Data Types

  • Field Definitions

  • Data Output

  • Viewing the Schema

  • Filtering and Sorting Data

  • Commonly Used Functions


Processing Complex Data with Apache Pig

  • Storage Formats

  • Complex/Nested Data Types

  • Grouping

  • Built-In Functions for Complex Data

  • Iterating Grouped Data


Multi-Dataset Operations with Apache Pig

  • Techniques for Combining Datasets

  • Joining Datasets in Pig

  • Set Operations

  • Splitting Datasets

Apache Pig Troubleshooting and Optimization

  • Troubleshooting Pig

  • Logging

  • Using Hadoop’s Web UI

  • Data Sampling and Debugging

  • Performance Overview

  • Understanding the Execution Plan

  • Tips for Improving the Performance of Pig Jobs


Introduction to Apache Hive and Impala

  • What is Hive?

  • What is Impala?

  • Why Use Hive and Impala?

  • Schema and Data Storage

  • Comparing Hive and Impala to Traditional Databases

  • Use Cases


Querying with Apache Hive and Impala

  • Databases and Tables

  • Basic Hive and Impala Query Language Syntax

  • Data Types

  • Using Hue to Execute Queries

  • Using Beeline (Hive’s Shell)

  • Using the Impala Shell


Apache Hive and Impala Data Management

  • Data Storage

  • Creating Databases and Tables

  • Loading Data

  • Altering Databases and Tables

  • Simplifying Queries with Views

  • Storing Query Results


Data Storage and Performance

  • Partitioning Tables

  • Loading Data into Partitioned Tables

  • When to Use Partitioning

  • Choosing a File Format

  • Using Avro and Parquet File Formats

Relational Data Analysis with Apache Hive and Impala

  • Joining Datasets

  • Common Built-In Functions

  • Aggregation and Windowing


Complex Data with Apache Hive and Impala

  • Complex Data with Hive

  • Complex Data with Impala


Analyzing Text with Apache Hive and Impala

  • Using Regular Expressions with Hive and Impala

  • Processing Text Data with SerDes in Hive

  • Sentiment Analysis and n-grams in Hive


Apache Hive Optimization

  • Understanding Query Performance

  • Bucketing

  • Indexing Data

  • Hive on Spark


Apache Impala Optimization

  • How Impala Executes Queries

  • Improving Impala Performance


Extending Apache Hive and Impala

  • Custom SerDes and File Formats in Hive

  • Data Transformation with Custom Scripts in Hive

  • User-Defined Functions

  • Parameterized Queries


Choosing the Best Tool for the Job

  • Comparing Pig, Hive, Impala, and Relational Databases

  • Which to Choose?


Conclusion

Cloudera Data Science Workbench (CDSW)

Three Day Training

Overview of CDSW 


  • Introduction to CDSW

  • How to Access CDSW

  • Navigating around CDSW

  • User Settings

  • Hadoop Authentication


Projects in CDSW 

  • Creating a New Project

  • Navigating around a Project

  • Project Settings


The CDSW Workbench Interface

  • Using the Workbench

  • Using the Sidebar

  • Using the Code Editor

  • Engines and Sessions

Running Python and R Code in CDSW 

  • Running Code

  • Using the Session Prompt

  • Using the Terminal

  • Installing Packages

  • Using Markdown in Comments


Using Apache Spark 2 in CDSW

  • Scenario and Dataset

  • Copying Files to HDFS

  • Interfaces to Apache Spark 2

  • Connecting to Spark

  • Reading Data

  • Inspecting Data


Exploratory Data Science in CDSW

  • Transforming Data

  • Using SQL Queries

  • Visualizing Data from Spark

  • Machine Learning with MLlib

  • Session History


Teams and Collaboration in CDSW

  • Collaboration in CDSW

  • Teams in CDSW

  • Using Git for Collaboration

  • Conclusion

Developer Training for Apache Spark and Hadoop
Apache Spark

Introduction to Apache Hadoop and the Hadoop Ecosystem


• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS
Distributed Processing on an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets and DataFrames
• DataFrame Operations
Working with DataFrames and Schemas
• Creating DataFrames from Data Sources
• Saving DataFrames to Data Sources
• DataFrame Schemas
• Eager and Lazy Execution
Analyzing Data with DataFrame Queries
• Querying DataFrames Using Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames

RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
Transforming Data with RDDs
• Writing and Passing Transformation Functions
• Transformation Execution
• Converting Between RDDs and DataFrames
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
Querying Tables and Views with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Writing, Configuring, and Running Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties

Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
Common Patterns in Apache Spark Data Processing
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark
• Machine Learning
• Example: k-means
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source

Developer Training for MapReduce
Four Day Course
MapReduce

Introduction


The Motivation for Hadoop
• Problems with Traditional Large-Scale Systems
• Introducing Hadoop
• Hadoopable Problems
Hadoop: Basic Concepts and HDFS
• The Hadoop Project and Hadoop Components
• The Hadoop Distributed File System
Introduction to MapReduce
• MapReduce Overview
• Example: WordCount
• Mappers
• Reducers
Hadoop Clusters and the Hadoop Ecosystem
• Hadoop Cluster Overview
• Hadoop Jobs and Tasks
• Other Hadoop Ecosystem Components
Writing a MapReduce Program in Java
• Basic MapReduce API Concepts
• Writing MapReduce Drivers, Mappers, and Reducers in Java
• Speeding Up Hadoop Development by Using Eclipse
• Differences Between the Old and New MapReduce APIs
Writing a MapReduce Program Using Streaming
• Writing Mappers and Reducers with the Streaming API

Unit Testing MapReduce Programs
• Unit Testing
• The JUnit and MRUnit Testing Frameworks
• Writing Unit Tests with MRUnit
• Running Unit Tests
Delving Deeper into the Hadoop API
• Using the ToolRunner Class
• Setting Up and Tearing Down Mappers and Reducers
• Decreasing the Amount of Intermediate Data with Combiners
• Accessing HDFS Programmatically
• Using the Distributed Cache
• Using the Hadoop API’s Library of Mappers, Reducers, and Partitioners
Practical Development Tips and Techniques
• Strategies for Debugging MapReduce Code
• Testing MapReduce Code Locally by Using LocalJobRunner
• Writing and Viewing Log Files
• Retrieving Job Information with Counters
• Reusing Objects
• Creating Map-Only MapReduce Jobs
Partitioners and Reducers
• How Partitioners and Reducers Work Together
• Determining the Optimal Number of Reducers for a Job
• Writing Custom Partitioners
Data Input and Output
• Creating Custom Writable and WritableComparable Implementations
• Saving Binary Data Using SequenceFile and Avro Data Files
• Issues to Consider When Using File Compression
• Implementing Custom InputFormats and OutputFormats


Common MapReduce Algorithms
• Sorting and Searching Large Data Sets
• Indexing Data
• Computing Term Frequency-Inverse Document Frequency
• Calculating Word Co-Occurrence
• Performing Secondary Sort
Joining Data Sets in MapReduce Jobs
• Writing a Map-Side Join
• Writing a Reduce-Side Join
Integrating Hadoop into the Enterprise Workflow
• Integrating Hadoop into an Existing Enterprise
• Loading Data from an RDBMS into HDFS by Using Sqoop
• Managing Real-Time Data Using Flume
• Accessing HDFS from Legacy Systems with FuseDFS and HttpFS
An Introduction to Hive, Impala, and Pig
• The Motivation for Hive, Impala, and Pig
• Hive Overview
• Impala Overview
• Pig Overview
• Choosing Between Hive, Impala, and Pig
An Introduction to Oozie
• Introduction to Oozie
• Creating Oozie Workflows
Conclusion

Cloudera Administrator Training
Four Day Course
Admin

Introduction

The Case for Apache Hadoop 

  • Why Hadoop?

  • Fundamental Concepts

  • Core Hadoop Components

Hadoop Cluster Installation

  • Rationale for a Cluster Management Solution

  • Cloudera Manager Features

  • Cloudera Manager Installation

  • Hadoop (CDH) Installation

The Hadoop Distributed File System (HDFS)

  • HDFS Features

  • Writing and Reading Files

  • NameNode Memory Considerations

  • Overview of HDFS Security

  • Web UIs for HDFS

  • Using the Hadoop File Shell

MapReduce and Spark on YARN

  • The Role of Computational Frameworks

  • YARN: The Cluster Resource Manager

  • MapReduce Concepts

  • Apache Spark Concepts

  • Running Computational Frameworks on YARN

  • Exploring YARN Applications Through the Web UIs and the Shell

  • YARN Application Logs

Hadoop Configuration and Daemon Logs

  • Cloudera Manager Constructs for Managing Configurations

  • Locating Configurations and Applying Configuration Changes

  • Managing Role Instances and Adding Services

  • Configuring the HDFS Service

  • Configuring Hadoop Daemon Logs

  • Configuring the YARN Service

Getting Data Into HDFS

  • Ingesting Data From External Sources With Flume

  • Ingesting Data From Relational Databases With Sqoop

  • REST Interfaces

  • Best Practices for Importing Data

Planning Your Hadoop Cluster

  • General Planning Considerations

  • Choosing the Right Hardware

  • Virtualization Options

  • Network Considerations

  • Configuring Nodes

Installing and Configuring Hive, Impala, and Pig

  • Hive

  • Impala

  • Pig

Hadoop Clients Including Hue

  • What Are Hadoop Clients?

  • Installing and Configuring Hadoop Clients

  • Installing and Configuring Hue

  • Hue Authentication and Authorization

Advanced Cluster Configuration

  • Advanced Configuration Parameters

  • Configuring Hadoop Ports

  • Configuring HDFS for Rack Awareness

  • Configuring HDFS High Availability

Hadoop Security

  • Why Hadoop Security Is Important

  • Hadoop’s Security System Concepts

  • What Kerberos Is and How It Works

  • Securing a Hadoop Cluster With Kerberos

  • Other Security Concepts

Managing Resources

  • Configuring cgroups with Static Service Pools

  • The Fair Scheduler

  • Configuring Dynamic Resource Pools

  • YARN Memory and CPU Settings

  • Impala Query Scheduling

Cluster Maintenance

  • Checking HDFS Status

  • Copying Data Between Clusters

  • Adding and Removing Cluster Nodes

  • Rebalancing the Cluster

  • Directory Snapshots

  • Cluster Upgrading

Cluster Monitoring and Troubleshooting

  • Cloudera Manager Monitoring Features

  • Monitoring Hadoop Clusters

  • Troubleshooting Hadoop Clusters

  • Common Misconfigurations

Conclusion


Contact Us

Need more details or want to know more about how we can help you with your data?

We are here to assist. Contact us by phone, email or via our social media channels below.


  • Instagram
  • Facebook
  • LinkedIn
  • Twitter

© 2023 by The Enigma Group of Companies. All Rights Reserved.
