INTRODUCTION
What is Hadoop?
Hadoop is an open-source Apache framework, written in Java, that allows distributed processing
of large datasets across clusters of computers using simple programming models.
A Hadoop application works in an environment that provides distributed
storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines,
each offering local computation and storage.
OBJECTIVES
- Understand the various components of the Hadoop ecosystem, such as Hadoop 2.7, Impala, YARN, MapReduce, Pig, Hive, HBase, Sqoop, Flume, and Apache Spark.
- Learn automated source code management using Git and continuous integration using Jenkins.
- Understand MapReduce and its characteristics, and absorb advanced MapReduce concepts.
- Gain a working knowledge of Pig and its components.
TRAINING
- Complete Hadoop Training - Learn Hadoop from beginner to advanced level.
- Customized Hadoop Training - Customize the syllabus to fit your requirements.
- Hadoop Project-Based Training - Choose any project and get training based on that project.
- Hadoop Application Training - Get our experts' assistance on your existing project.
SYLLABUS
Hadoop Syllabus
- Why MapReduce
- How MapReduce works
- Hadoop data types
- Difference between Hadoop 1 & Hadoop 2
- Main class
- Mapper & Reducer Classes (see the sketch below)
- The Job class
- JobContext interface
- Partitioner & Reporter Interfaces
- The Map & Reduce phases to process data
- Identity mapper & reducer
- Data flow in MapReduce
- Input Splits
- Relation Between Input Splits and HDFS Blocks
- Flow of Job Submission in MapReduce
- Combiners & Partitioners
- Job submission & Monitoring
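
The Mapper, Reducer, and Job classes listed above come together in the classic word count job. Below is a minimal sketch in Scala (class names and paths are illustrative, not from the course material); the same structure applies in Java:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emits (word, 1) for every token in a line of input.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w.toLowerCase)
      ctx.write(word, one)
    }
}

// Reducer: sums the counts for each word; also usable map-side as a combiner.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get)
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer])   // combiner runs on map output
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)   // submit & monitor
  }
}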
- Introduction to Yarn
- Traditional MapReduce v/s Yarn
- Yarn Architecture
- Resource Manager
- Node Manager
- Application Master
- Application submission in YARN
- Node Manager containers
- Resource Manager components
- Yarn applications (see the sketch below)
- Scheduling in Yarn
- Fair Scheduler
- Capacity Scheduler
- Fault tolerance
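
To make the YARN pieces concrete, the sketch below uses the YarnClient API to ask the ResourceManager for the applications it knows about. It is a minimal sketch, assuming a yarn-site.xml for a reachable cluster is on the classpath:

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.jdk.CollectionConverters._

object ListYarnApps {
  def main(args: Array[String]): Unit = {
    val client = YarnClient.createYarnClient()
    client.init(new YarnConfiguration())   // reads yarn-site.xml from the classpath
    client.start()
    // The ResourceManager reports every application: id, state, and name.
    for (report <- client.getApplications.asScala)
      println(s"${report.getApplicationId} ${report.getYarnApplicationState} ${report.getName}")
    client.stop()
  }
}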
Hadoop Ecosystems
- What is Apache Pig
- Why Apache Pig
- Pig features
- Where should Pig be used
- Where not to use Pig
- The Pig Architecture
- Pig components
- Pig v/s MapReduce
- Pig v/s SQL
- Pig v/s Hive
- Pig Installation
- Pig Execution Modes & Mechanisms
- Grunt Shell Commands
- Pig Latin - Data Model
- Pig data types
- Pig Latin operators (see the sketch below)
- Case Sensitivity
- Grouping & Co-Grouping in Pig Latin
- Sorting & Filtering
- Joins in Pig latin
- Built-in Function
- Writing UDFs
- Macros in Pig
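
The loading, filtering, and grouping operators above can be typed into the Grunt shell or embedded through the PigServer API. A minimal sketch in local execution mode (the file name and schema are made up for illustration):

import org.apache.pig.{ExecType, PigServer}

object PigSketch {
  def main(args: Array[String]): Unit = {
    val pig = new PigServer(ExecType.LOCAL)   // local mode; use MAPREDUCE on a cluster
    // Pig Latin statements, registered one by one as in the Grunt shell.
    pig.registerQuery("users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);")
    pig.registerQuery("adults = FILTER users BY age >= 18;")
    pig.registerQuery("by_age = GROUP adults BY age;")
    pig.registerQuery("counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;")
    val it = pig.openIterator("counts")       // runs the plan and streams back tuples
    while (it.hasNext) println(it.next())
    pig.shutdown()
  }
}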
- What is Hive
- Features of Hive
- The Hive Architecture
- Components of Hive
- Installation & configuration
- Primitive types
- Complex types
- Built in functions
- Hive UDFs
- Views & Indexes
- Hive Data Models
- Hive vs Pig
- Co-groups
- Importing data
- Hive DDL statements
- Hive Query Language (see the sketch below)
- Data types & Operators
- Type conversions
- Joins
- Sorting & controlling data flow
- Local vs MapReduce mode
- Partitions
- Buckets
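
HiveQL statements such as the DDL, partitioning, and query topics above can be issued through Hive's JDBC driver against HiveServer2. A minimal sketch, assuming HiveServer2 on localhost:10000 and an illustrative table:

import java.sql.DriverManager

object HiveSketch {
  def main(args: Array[String]): Unit = {
    // HiveServer2 JDBC endpoint; host, port, and credentials are assumptions.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    // DDL: a partitioned table, matching the Partitions topic above.
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS logs (msg STRING, level STRING)
        |PARTITIONED BY (dt STRING)""".stripMargin)
    val rs = stmt.executeQuery("SELECT level, COUNT(*) FROM logs GROUP BY level")
    while (rs.next()) println(s"${rs.getString(1)}: ${rs.getLong(2)}")
    conn.close()
  }
}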
- Introducing Sqoop
- Sqoop installation
- Working of Sqoop
- Understanding connectors
- Importing data from MySQL to Hadoop HDFS (see the sketch below)
- Selective imports
- Importing data to Hive
- Importing to HBase
- Exporting data to MySQL from Hadoop
- Controlling import process
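
Sqoop itself is a command-line tool, so the MySQL-to-HDFS import above is just one command. The sketch below shells out to the sqoop CLI from Scala; the database host, table, credentials file, and target directory are all made up for illustration:

import scala.sys.process._

object SqoopImport {
  def main(args: Array[String]): Unit = {
    // Equivalent to running the sqoop CLI directly at a shell prompt.
    val cmd = Seq(
      "sqoop", "import",
      "--connect", "jdbc:mysql://dbhost/shop",
      "--username", "etl", "--password-file", "/user/etl/.pw",
      "--table", "orders",
      "--target-dir", "/data/orders",
      "--num-mappers", "4")   // 4 parallel map tasks control the import
    val exit = cmd.!          // runs the command, inheriting stdout/stderr
    println(s"sqoop exited with $exit")
  }
}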
- What is Flume
- Applications of Flume
- Advantages of Flume
- Flume architecture
- Data flow in Flume
- Flume features
- Flume Event
- Flume Agent (see the sample configuration below)
- Sources
- Channels
- Sinks
- Log Data in Flume
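
A Flume agent is wired together in a properties file that names its sources, channels, and sinks. The fragment below, held in a Scala string so it can be written out programmatically, tails a log file into HDFS; the agent name and paths are illustrative:

object FlumeAgentConf {
  // agent1: exec source -> memory channel -> HDFS sink
  val conf: String =
    """agent1.sources  = tail1
      |agent1.channels = mem1
      |agent1.sinks    = hdfs1
      |agent1.sources.tail1.type     = exec
      |agent1.sources.tail1.command  = tail -F /var/log/app.log
      |agent1.sources.tail1.channels = mem1
      |agent1.channels.mem1.type     = memory
      |agent1.channels.mem1.capacity = 10000
      |agent1.sinks.hdfs1.type      = hdfs
      |agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/app/
      |agent1.sinks.hdfs1.channel   = mem1
      |""".stripMargin

  def main(args: Array[String]): Unit =
    java.nio.file.Files.write(java.nio.file.Paths.get("agent1.conf"),
      conf.getBytes("UTF-8"))   // then: flume-ng agent --name agent1 --conf-file agent1.conf
}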
- What is HBase
- History Of HBase
- The NoSQL Scenario
- HBase & HDFS
- Physical Storage
- HBase v/s RDBMS
- Features of HBase
- HBase Data model
- Master server
- Region servers & Regions
- HBase Shell
- Create table and column family
- The HBase Client API
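
The HBase client API boils down to a handful of classes. A minimal sketch that writes a row and reads it back; the table, column family, and values are illustrative, and the table is assumed to already exist with a column family named "info":

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users"))
    // Write one cell: row key "row1", column info:name.
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"))
    table.put(put)
    // Read it back.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))
    table.close()
    conn.close()
  }
}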
- Introduction to Apache Spark
- Features of Spark
- Spark built on Hadoop
- Components of Spark
- Resilient Distributed Datasets
- Data Sharing using Spark RDD
- Iterative Operations on Spark RDD
- Interactive Operations on Spark RDD
- Spark shell
- RDD transformations
- Actions
- Programming with RDD (see the sketch below)
- Start Shell
- Create RDD
- Execute Transformations
- Caching Transformations
- Applying Action
- Checking output
- GraphX overview
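
The RDD workflow above - start a shell or context, create an RDD, apply transformations, cache, then trigger execution with an action - looks like this in Scala (the input file path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    val lines  = sc.textFile("data.txt")            // create an RDD
    val counts = lines.flatMap(_.split("\\s+"))     // transformation
                      .map(w => (w, 1))             // transformation
                      .reduceByKey(_ + _)           // transformation
                      .cache()                      // keep it in memory for reuse
    counts.take(10).foreach(println)                // action triggers execution
    println(s"distinct words: ${counts.count()}")   // cached data is reused here
    sc.stop()
  }
}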
- Introduction to Scala (see the sketch below)
- Spark & Scala interdependence
- Objects & Classes
- Class definition in Scala
- Basic Data Types
- Operators in Scala
- Control structures
- Fields in Scala
- Functions in Scala
- Collections in Scala
- Mutable collection
- Immutable collection
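
The Scala topics above fit in one small, runnable sketch: a class definition with fields, an operator defined as a method, a control structure, and both immutable and mutable collections:

object ScalaBasics {
  class Point(val x: Int, val y: Int) {             // class with two immutable fields
    def +(other: Point) = new Point(x + other.x, y + other.y)   // operator as a method
    override def toString = s"($x, $y)"
  }

  def main(args: Array[String]): Unit = {
    val p = new Point(1, 2) + new Point(3, 4)
    println(p)                                      // (4, 6)

    val immutable = List(1, 2, 3)                   // immutable collection
    val mutable   = scala.collection.mutable.ArrayBuffer(1, 2, 3)
    mutable += 4                                    // in-place update is allowed here

    for (n <- immutable if n % 2 == 1) println(n)   // control structure with a guard
  }
}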
- Zookeeper Introduction
- Distributed Application
- Benefits of Distributed Applications
- Why use Zookeeper
- Zookeeper Architecture
- Hierarchical namespace
- Znodes (see the sketch below)
- Stat structure of a Znode
- Electing a leader
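
Znodes and the hierarchical namespace can be explored with the plain ZooKeeper client. A minimal sketch, assuming a server on localhost:2181; the znode path and data are illustrative:

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkSketch {
  def main(args: Array[String]): Unit = {
    val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
      def process(event: WatchedEvent): Unit = ()   // ignore session events in this sketch
    })
    // Create a persistent znode unless it already exists.
    if (zk.exists("/demo", false) == null)
      zk.create("/demo", "hello".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    // Read the data back; passing a Stat here would also return the znode's stat structure.
    println(new String(zk.getData("/demo", false, null)))
    zk.close()
  }
}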
- Introduction to Apache Oozie
- Oozie Workflow
- Oozie Coordinators
- Property File
- Oozie Bundle system
- CLI and extensions
- Overview of Hue
- Introduction to MongoDB
- MongoDB v/s RDBMS
- Why & Where to use MongoDB
- Databases & Collections
- Inserting & querying documents
- Schema Design
- CRUD Operations
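
The CRUD operations above map directly onto the MongoDB driver. A minimal sketch using the Java sync driver from Scala; the database, collection, and field names are illustrative:

import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters
import org.bson.Document

object MongoSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val users  = client.getDatabase("shop").getCollection("users")
    users.insertOne(new Document("name", "Asha").append("age", 31))   // Create
    val doc = users.find(Filters.eq("name", "Asha")).first()          // Read
    println(doc)
    users.updateOne(Filters.eq("name", "Asha"),
                    new Document("$set", new Document("age", 32)))    // Update
    users.deleteOne(Filters.eq("name", "Asha"))                       // Delete
    client.close()
  }
}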
- Architecture of Hadoop Cluster
- Workflow of Hadoop Cluster
- HDFS Writes
- Preparing for HDFS Writes
- Pipelined HDFS Write
- NameNode Functionality
- Replicating Missing Replicas
- HDFS Reads
- Factors for Planning Hadoop Cluster
- Single-Node and Multi-Node Cluster Configuration
- HDFS Block replication and rack awareness
- Topology and Components of Hadoop Cluster
- Checking HDFS Status (see the sketch below)
- Breaking the cluster
- Copying Data Between Clusters
- Adding and Removing Cluster Nodes
- Rebalancing the cluster
- Name Node Metadata Backup
- Cluster Upgrading
- Hadoop Configuration Overview
- Types of Configuration Files
- Hadoop Cluster and MapReduce Configuration Parameters with Values
- Hadoop Environment Setup
- Include and Exclude Configuration Files
- General System conditions to Monitor
- Name Node and Job Tracker Web UIs
- View and Manage Hadoop's Log files
- Ganglia Monitoring Tool
- Common cluster issues and their resolutions
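
Checking HDFS status and block replication, from the administration topics above, can be done with hdfs dfsadmin -report on the command line or programmatically via the FileSystem API. A minimal sketch; the paths are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsStatus {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())   // reads core-site.xml / hdfs-site.xml
    for (st <- fs.listStatus(new Path("/data")))
      println(s"${st.getPath} replication=${st.getReplication} len=${st.getLen}")
    // Raise the replication factor of one file; the NameNode schedules the new replicas.
    fs.setReplication(new Path("/data/important.txt"), 3.toShort)
    fs.close()
  }
}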
Hadoop Analytics using R (For Data Scientists)
- Measuring the central tendency – the mode
- Measuring spread – variance and standard deviation (see the sketch below)
- Visualizing numeric variables – boxplots
- Visualizing numeric variables – histograms
- Visualizing numeric variables – qqplot
- Understanding numeric data – uniform and normal distributions
- Exploring relationships between variables
- Visualizing relationships – scatterplots
- Exploring numeric variables
- Implementing Association rule mining in R
- Integrating R with Hadoop using RHadoop and RMR package
- Writing MapReduce Jobs in R and executing them on Hadoop
- Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout
- Use SQL databases to store and organize data
- Access stored data with the MySQL query language
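
In the course these statistics topics are worked in R; purely for illustration, and to stay in the Scala used by the other sketches in this post, the mean, sample variance, and standard deviation look like this:

object SpreadSketch {
  def main(args: Array[String]): Unit = {
    val xs = Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    val mean = xs.sum / xs.size
    // Sample variance: squared deviations from the mean, divided by n - 1.
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
    val sd = math.sqrt(variance)
    println(f"mean=$mean%.2f variance=$variance%.2f sd=$sd%.2f")
  }
}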
- Introduction to Machine Learning
- Supervised and Unsupervised Learning Techniques
- Creating predictive models
- Classification Using Nearest Neighbors
- Linear Regression
- Multiple linear regression model
- Logistic Regression
- Decision Tree Classifier
- Clustering
- What is Random Forests?
- Features of Random Forest
- Out-of-Bag Error Estimate
- Naive Bayes Classifier
- Introduction of K-Means Clustering
- K-means in Euclidean space
- K-means as optimization
- Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model (see the sketch below)
- Deep Network
- Optimization for Training Deep Models
- Convolutional Networks
- Understanding Support Vector Machines
- Retrieve data using SQL statements
- Using kernels for non-linear spaces
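
TF-IDF weighting and cosine similarity, listed above, reduce to a few lines once documents are represented as term vectors over a shared vocabulary. A minimal sketch; the toy documents are made up for illustration:

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val docs = Seq("big data hadoop", "big data spark", "spark streaming")
      .map(_.split(" ").toSeq)
    val vocab = docs.flatten.distinct
    val n = docs.size.toDouble
    // idf(t) = log(N / number of documents containing t)
    val idf = vocab.map(t => t -> math.log(n / docs.count(_.contains(t)))).toMap
    // Each document becomes a TF-IDF vector: term frequency times inverse document frequency.
    def vector(doc: Seq[String]): Seq[Double] =
      vocab.map(t => doc.count(_ == t) * idf(t))
    // Cosine similarity: dot product divided by the product of the vector norms.
    def cosine(a: Seq[Double], b: Seq[Double]): Double = {
      val dot  = a.zip(b).map { case (x, y) => x * y }.sum
      val norm = (v: Seq[Double]) => math.sqrt(v.map(x => x * x).sum)
      dot / (norm(a) * norm(b))
    }
    val vs = docs.map(vector)
    println(f"sim(doc0, doc1) = ${cosine(vs(0), vs(1))}%.3f")   // share "big data"
    println(f"sim(doc0, doc2) = ${cosine(vs(0), vs(2))}%.3f")   // share no terms: 0.0
  }
}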