INTRODUCTION
What is Hadoop?
Hadoop is an open-source Apache framework, written in Java, that allows distributed processing
of large datasets across clusters of computers using simple programming models.
A Hadoop application works in an environment that provides distributed
storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines,
each offering local computation and storage.
OBJECTIVES
- Understand the various components of the Hadoop ecosystem, such as Hadoop 2.7, Impala, YARN, MapReduce, Pig, Hive, HBase, Sqoop, Flume, and Apache Spark.
- Learn automated source code management using Git and continuous integration using Jenkins.
- Understand MapReduce and its characteristics, and absorb advanced MapReduce concepts.
- Gain a working knowledge of Pig and its components.
TRAINING
- Complete Hadoop Training - Learn Hadoop from beginner to advanced level.
- Customized Hadoop Training - Customize the syllabus to fit your requirements.
- Hadoop Project-Based Training - Choose any project and get training based on that project.
- Hadoop Application Training - Get our experts' assistance on your existing project.
SYLLABUS
Hadoop Syllabus
- Why MapReduce
- How MapReduce works
- Hadoop data types
- Difference between Hadoop 1 & Hadoop 2
- Main class
- Mapper & Reducer Classes (see the sketch below)
- The Job class
- JobContext interface
- Partitioner & Reporter Interfaces
- The Map & Reduce phases to process data
- Identity mapper & reducer
- Data flow in MapReduce
- Input Splits
- Relation Between Input Splits and HDFS Blocks
- Flow of Job Submission in MapReduce
- Combiners & Partitioners
- Job submission & Monitoring
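
The Mapper, Reducer, and Job classes listed above come together in the classic word count job. Below is a minimal sketch in Scala (class names and paths are illustrative, not from the course material); the same structure applies in Java:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emits (word, 1) for every token in a line of input.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w.toLowerCase)
      ctx.write(word, one)
    }
}

// Reducer: sums the counts for each word; also usable map-side as a combiner.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get)
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer])   // combiner runs on map output
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)   // submit & monitor
  }
}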
- Introduction to Yarn
- Traditional MapReduce v/s Yarn
- Yarn Architecture
- Resource Manager
- Node Manager
- Application Master
- Application submission in YARN
- Node Manager containers
- Resource Manager components
- Yarn applications (see the sketch below)
- Scheduling in Yarn
- Fair Scheduler
- Capacity Scheduler
- Fault tolerance
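
To make the YARN pieces concrete, the sketch below uses the YarnClient API to ask the ResourceManager for the applications it knows about. It is a minimal sketch, assuming a yarn-site.xml for a reachable cluster is on the classpath:

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.jdk.CollectionConverters._

object ListYarnApps {
  def main(args: Array[String]): Unit = {
    val client = YarnClient.createYarnClient()
    client.init(new YarnConfiguration())   // reads yarn-site.xml from the classpath
    client.start()
    // The ResourceManager reports every application: id, state, and name.
    for (report <- client.getApplications.asScala)
      println(s"${report.getApplicationId} ${report.getYarnApplicationState} ${report.getName}")
    client.stop()
  }
}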
Hadoop Ecosystems
- What is Apache Pig
- Why Apache Pig
- Pig features
- Where should Pig be used
- Where not to use Pig
- The Pig Architecture
- Pig components
- Pig v/s MapReduce
- Pig v/s SQL
- Pig v/s Hive
- Pig Installation
- Pig Execution Modes & Mechanisms
- Grunt Shell Commands
- Pig Latin - Data Model
- Pig data types
- Pig Latin operators (see the sketch below)
- Case Sensitivity
- Grouping & Co-Grouping in Pig Latin
- Sorting & Filtering
- Joins in Pig latin
- Built-in Function
- Writing UDFs
- Macros in Pig
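
The loading, filtering, and grouping operators above can be typed into the Grunt shell or embedded through the PigServer API. A minimal sketch in local execution mode (the file name and schema are made up for illustration):

import org.apache.pig.{ExecType, PigServer}

object PigSketch {
  def main(args: Array[String]): Unit = {
    val pig = new PigServer(ExecType.LOCAL)   // local mode; use MAPREDUCE on a cluster
    // Pig Latin statements, registered one by one as in the Grunt shell.
    pig.registerQuery("users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);")
    pig.registerQuery("adults = FILTER users BY age >= 18;")
    pig.registerQuery("by_age = GROUP adults BY age;")
    pig.registerQuery("counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;")
    val it = pig.openIterator("counts")       // runs the plan and streams back tuples
    while (it.hasNext) println(it.next())
    pig.shutdown()
  }
}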
- What is Hive
- Features of Hive
- The Hive Architecture
- Components of Hive
- Installation & configuration
- Primitive types
- Complex types
- Built in functions
- Hive UDFs
- Views & Indexes
- Hive Data Models
- Hive vs Pig
- Co-groups
- Importing data
- Hive DDL statements
- Hive Query Language (see the sketch below)
- Data types & Operators
- Type conversions
- Joins
- Sorting & controlling data flow
- Local vs MapReduce mode
- Partitions
- Buckets
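
HiveQL statements such as the DDL, partitioning, and query topics above can be issued through Hive's JDBC driver against HiveServer2. A minimal sketch, assuming HiveServer2 on localhost:10000 and an illustrative table:

import java.sql.DriverManager

object HiveSketch {
  def main(args: Array[String]): Unit = {
    // HiveServer2 JDBC endpoint; host, port, and credentials are assumptions.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    // DDL: a partitioned table, matching the Partitions topic above.
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS logs (msg STRING, level STRING)
        |PARTITIONED BY (dt STRING)""".stripMargin)
    val rs = stmt.executeQuery("SELECT level, COUNT(*) FROM logs GROUP BY level")
    while (rs.next()) println(s"${rs.getString(1)}: ${rs.getLong(2)}")
    conn.close()
  }
}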
- Introducing Sqoop
- Sqoop installation
- Working of Sqoop
- Understanding connectors
- Importing data from MySQL to Hadoop HDFS (see the sketch below)
- Selective imports
- Importing data to Hive
- Importing to HBase
- Exporting data to MySQL from Hadoop
- Controlling import process
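
Sqoop itself is a command-line tool, so the MySQL-to-HDFS import above is just one command. The sketch below shells out to the sqoop CLI from Scala; the database host, table, credentials file, and target directory are all made up for illustration:

import scala.sys.process._

object SqoopImport {
  def main(args: Array[String]): Unit = {
    // Equivalent to running the sqoop CLI directly at a shell prompt.
    val cmd = Seq(
      "sqoop", "import",
      "--connect", "jdbc:mysql://dbhost/shop",
      "--username", "etl", "--password-file", "/user/etl/.pw",
      "--table", "orders",
      "--target-dir", "/data/orders",
      "--num-mappers", "4")   // 4 parallel map tasks control the import
    val exit = cmd.!          // runs the command, inheriting stdout/stderr
    println(s"sqoop exited with $exit")
  }
}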
- What is Flume
- Applications of Flume
- Advantages of Flume
- Flume architecture
- Data flow in Flume
- Flume features
- Flume Event
- Flume Agent (see the sample configuration below)
- Sources
- Channels
- Sinks
- Log Data in Flume
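
A Flume agent is wired together in a properties file that names its sources, channels, and sinks. The fragment below, held in a Scala string so it can be written out programmatically, tails a log file into HDFS; the agent name and paths are illustrative:

object FlumeAgentConf {
  // agent1: exec source -> memory channel -> HDFS sink
  val conf: String =
    """agent1.sources  = tail1
      |agent1.channels = mem1
      |agent1.sinks    = hdfs1
      |agent1.sources.tail1.type     = exec
      |agent1.sources.tail1.command  = tail -F /var/log/app.log
      |agent1.sources.tail1.channels = mem1
      |agent1.channels.mem1.type     = memory
      |agent1.channels.mem1.capacity = 10000
      |agent1.sinks.hdfs1.type      = hdfs
      |agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/app/
      |agent1.sinks.hdfs1.channel   = mem1
      |""".stripMargin

  def main(args: Array[String]): Unit =
    java.nio.file.Files.write(java.nio.file.Paths.get("agent1.conf"),
      conf.getBytes("UTF-8"))   // then: flume-ng agent --name agent1 --conf-file agent1.conf
}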
- What is HBase
- History Of HBase
- The NoSQL Scenario
- HBase & HDFS
- Physical Storage
- HBase v/s RDBMS
- Features of HBase
- HBase Data model
- Master server
- Region servers & Regions
- HBase Shell
- Create table and column family
- The HBase Client API
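
The HBase client API boils down to a handful of classes. A minimal sketch that writes a row and reads it back; the table, column family, and values are illustrative, and the table is assumed to already exist with a column family named "info":

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users"))
    // Write one cell: row key "row1", column info:name.
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"))
    table.put(put)
    // Read it back.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))
    table.close()
    conn.close()
  }
}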
- Introduction to Apache Spark
- Features of Spark
- Spark built on Hadoop
- Components of Spark
- Resilient Distributed Datasets
- Data Sharing using Spark RDD
- Iterative Operations on Spark RDD
- Interactive Operations on Spark RDD
- Spark shell
- RDD transformations
- Actions
- Programming with RDD (see the sketch below)
- Start Shell
- Create RDD
- Execute Transformations
- Caching Transformations
- Applying Action
- Checking output
- GraphX overview
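
The RDD workflow above - start a shell or context, create an RDD, apply transformations, cache, then trigger execution with an action - looks like this in Scala (the input file path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    val lines  = sc.textFile("data.txt")            // create an RDD
    val counts = lines.flatMap(_.split("\\s+"))     // transformation
                      .map(w => (w, 1))             // transformation
                      .reduceByKey(_ + _)           // transformation
                      .cache()                      // keep it in memory for reuse
    counts.take(10).foreach(println)                // action triggers execution
    println(s"distinct words: ${counts.count()}")   // cached data is reused here
    sc.stop()
  }
}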
- Introduction to Scala (see the sketch below)
- Spark & Scala interdependence
- Objects & Classes
- Class definition in Scala
- Basic Data Types
- Operators in Scala
- Control structures
- Fields in Scala
- Functions in Scala
- Collections in Scala
- Mutable collection
- Immutable collection
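
The Scala topics above fit in one small, runnable sketch: a class definition with fields, an operator defined as a method, a control structure, and both immutable and mutable collections:

object ScalaBasics {
  class Point(val x: Int, val y: Int) {             // class with two immutable fields
    def +(other: Point) = new Point(x + other.x, y + other.y)   // operator as a method
    override def toString = s"($x, $y)"
  }

  def main(args: Array[String]): Unit = {
    val p = new Point(1, 2) + new Point(3, 4)
    println(p)                                      // (4, 6)

    val immutable = List(1, 2, 3)                   // immutable collection
    val mutable   = scala.collection.mutable.ArrayBuffer(1, 2, 3)
    mutable += 4                                    // in-place update is allowed here

    for (n <- immutable if n % 2 == 1) println(n)   // control structure with a guard
  }
}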
- Zookeeper Introduction
- Distributed Application
- Benefits of Distributed Applications
- Why use Zookeeper
- Zookeeper Architecture
- Hierarchical namespace
- Znodes (see the sketch below)
- Stat structure of a Znode
- Electing a leader
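
Znodes and the hierarchical namespace can be explored with the plain ZooKeeper client. A minimal sketch, assuming a server on localhost:2181; the znode path and data are illustrative:

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkSketch {
  def main(args: Array[String]): Unit = {
    val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
      def process(event: WatchedEvent): Unit = ()   // ignore session events in this sketch
    })
    // Create a persistent znode unless it already exists.
    if (zk.exists("/demo", false) == null)
      zk.create("/demo", "hello".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    // Read the data back; passing a Stat here would also return the znode's stat structure.
    println(new String(zk.getData("/demo", false, null)))
    zk.close()
  }
}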
- Introduction to Apache Oozie
- Oozie Workflow
- Oozie Coordinators
- Property File
- Oozie Bundle system
- CLI and extensions
- Overview of Hue
- Introduction to MongoDB
- MongoDB v/s RDBMS
- Why & Where to use MongoDB
- Databases & Collections
- Inserting & querying documents
- Schema Design
- CRUD Operations
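
The CRUD operations above map directly onto the MongoDB driver. A minimal sketch using the Java sync driver from Scala; the database, collection, and field names are illustrative:

import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters
import org.bson.Document

object MongoSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val users  = client.getDatabase("shop").getCollection("users")
    users.insertOne(new Document("name", "Asha").append("age", 31))   // Create
    val doc = users.find(Filters.eq("name", "Asha")).first()          // Read
    println(doc)
    users.updateOne(Filters.eq("name", "Asha"),
                    new Document("$set", new Document("age", 32)))    // Update
    users.deleteOne(Filters.eq("name", "Asha"))                       // Delete
    client.close()
  }
}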
- Architecture of Hadoop Cluster
- Workflow of Hadoop Cluster
- HDFS Writes
- Preparing for HDFS Writes
- Pipelined HDFS Write
- NameNode Functionality
- Replicating Missing Replicas
- HDFS Reads
- Factors for Planning Hadoop Cluster
- Single-Node and Multi-Node Cluster Configuration
- HDFS Block replication and rack awareness
- Topology and Components of Hadoop Cluster
- Checking HDFS Status (see the sketch below)
- Breaking the cluster
- Copying Data Between Clusters
- Adding and Removing Cluster Nodes
- Rebalancing the cluster
- Name Node Metadata Backup
- Cluster Upgrading
- Hadoop Configuration Overview
- Types of Configuration Files
- Hadoop Cluster and MapReduce Configuration Parameters with Values
- Hadoop Environment Setup
- Include and Exclude Configuration Files
- General System conditions to Monitor
- Name Node and Job Tracker Web UIs
- View and Manage Hadoop's Log files
- Ganglia Monitoring Tool
- Common cluster issues and their resolutions
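
Checking HDFS status and block replication, from the administration topics above, can be done with hdfs dfsadmin -report on the command line or programmatically via the FileSystem API. A minimal sketch; the paths are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsStatus {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())   // reads core-site.xml / hdfs-site.xml
    for (st <- fs.listStatus(new Path("/data")))
      println(s"${st.getPath} replication=${st.getReplication} len=${st.getLen}")
    // Raise the replication factor of one file; the NameNode schedules the new replicas.
    fs.setReplication(new Path("/data/important.txt"), 3.toShort)
    fs.close()
  }
}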
Hadoop Analytics using R (For Data Scientists)
- Measuring the central tendency – the mode
- Measuring spread – variance and standard deviation (see the sketch below)
- Visualizing numeric variables – boxplots
- Visualizing numeric variables – histograms
- Visualizing numeric variables – qqplot
- Understanding numeric data – uniform and normal distributions
- Exploring relationships between variables
- Visualizing relationships – scatterplots
- Exploring numeric variables
- Implementing Association rule mining in R
- Integrating R with Hadoop using RHadoop and RMR package
- Writing MapReduce Jobs in R and executing them on Hadoop
- Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout
- Use SQL databases to store and organize data
- Access stored data with the MySQL query language
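
In the course these statistics topics are worked in R; purely for illustration, and to stay in the Scala used by the other sketches in this post, the mean, sample variance, and standard deviation look like this:

object SpreadSketch {
  def main(args: Array[String]): Unit = {
    val xs = Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    val mean = xs.sum / xs.size
    // Sample variance: squared deviations from the mean, divided by n - 1.
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
    val sd = math.sqrt(variance)
    println(f"mean=$mean%.2f variance=$variance%.2f sd=$sd%.2f")
  }
}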
- Introduction to Machine Learning
- Supervised and Unsupervised Learning Techniques
- Creating predictive models
- Classification Using Nearest Neighbors
- Linear Regression
- Multiple linear regression model
- Logistic Regression
- Decision Tree Classifier
- Clustering
- What is Random Forests?
- Features of Random Forest
- Out-of-Bag Error Estimate
- Naive Bayes Classifier
- Introduction of K-Means Clustering
- K-means in Euclidean space
- K-means as optimization
- Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model (see the sketch below)
- Deep Network
- Optimization for Training Deep Models
- Convolutional Networks
- Understanding Support Vector Machines
- Retrieve data using SQL statements
- Using kernels for non-linear spaces
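
TF-IDF weighting and cosine similarity, listed above, reduce to a few lines once documents are represented as term vectors over a shared vocabulary. A minimal sketch; the toy documents are made up for illustration:

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val docs = Seq("big data hadoop", "big data spark", "spark streaming")
      .map(_.split(" ").toSeq)
    val vocab = docs.flatten.distinct
    val n = docs.size.toDouble
    // idf(t) = log(N / number of documents containing t)
    val idf = vocab.map(t => t -> math.log(n / docs.count(_.contains(t)))).toMap
    // Each document becomes a TF-IDF vector: term frequency times inverse document frequency.
    def vector(doc: Seq[String]): Seq[Double] =
      vocab.map(t => doc.count(_ == t) * idf(t))
    // Cosine similarity: dot product divided by the product of the vector norms.
    def cosine(a: Seq[Double], b: Seq[Double]): Double = {
      val dot  = a.zip(b).map { case (x, y) => x * y }.sum
      val norm = (v: Seq[Double]) => math.sqrt(v.map(x => x * x).sum)
      dot / (norm(a) * norm(b))
    }
    val vs = docs.map(vector)
    println(f"sim(doc0, doc1) = ${cosine(vs(0), vs(1))}%.3f")   // share "big data"
    println(f"sim(doc0, doc2) = ${cosine(vs(0), vs(2))}%.3f")   // share no terms: 0.0
  }
}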