Zhen He

Associate Professor

Department of Computer Science and Computer Engineering
La Trobe University
Bundoora, Victoria 3086
Australia

Tel: +61 3 9479 3036
Email: z.he@latrobe.edu.au

Building: Beth Gleeson, Room: 235

Teaching

I am currently teaching a subject called Big Data Management in the Cloud. The subject focuses on Big Data analytics using the Hadoop ecosystem. We cover a large portion of that ecosystem, starting from the basics of MapReduce, Pig, Hive, and ZooKeeper and moving on to the second-generation components: YARN, Storm, Giraph, Spark, and Shark. There is a strong emphasis on gaining hands-on programming experience, using a combination of Cloudera's virtual machine and Hadoop programming on Amazon Web Services (AWS). We also cover the important AWS web services, including Elastic MapReduce (EMR), EC2, S3, Elastic Load Balancing, Auto Scaling, DynamoDB, CloudFormation, CloudWatch, and Route 53. Finally, we cover important concepts in the area of NoSQL stores.

The details for the subject are below.

Subject name: Big Data Management in the Cloud
Course code: CSE3BDC and CSE4BDC
Time: 1st semester 2014

Lecture Outline

Lecture 1 (Introduction to cloud computing)

Motivations for cloud computing
Cloud architectures
Private versus public cloud
Infrastructure as a service
Virtual machines
Platform as a service
Software as a service
Basic services provided by Amazon Web Services

Lecture 2 (Introduction to Big Data and MapReduce)

Big Data: update intensive versus data analytics
Big Data analytics motivation
Introduction to MapReduce
9 big MapReduce concepts
MapReduce versus RDBMS
Language-neutral MapReduce processing

Lecture 3 (Introduction to Hadoop)

Introduction to Hadoop
Hadoop internals
Programming Hadoop MapReduce

Lecture 4 (Advanced MapReduce Programming)

Pairs and stripes
The inner join
Multiple iterations of MapReduce
Amazon Elastic MapReduce (EMR)
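
The "pairs and stripes" topic in Lecture 4 refers to two common ways of counting word co-occurrences in MapReduce. The following is a minimal plain-Python sketch, not Hadoop code and not the subject's own material: the function names (pairs_map, stripes_map, run_pairs, run_stripes) and the sample input are hypothetical, and "co-occurring" is assumed to mean "appearing on the same line". It only illustrates that both designs produce the same counts while emitting differently shaped intermediate records.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pairs_map(words):
    """Pairs approach: emit ((w, u), 1) for every ordered pair of positions on a line."""
    for w, u in permutations(words, 2):
        yield (w, u), 1


def stripes_map(words):
    """Stripes approach: emit one (w, {u: count}) record per distinct word w on a line."""
    stripes = defaultdict(Counter)
    for w, u in permutations(words, 2):
        stripes[w][u] += 1
    for w, stripe in stripes.items():
        yield w, stripe


def run_pairs(lines):
    """Simulated reduce for pairs: sum the 1s emitted for each (w, u) key."""
    totals = Counter()
    for line in lines:
        for key, one in pairs_map(line.split()):
            totals[key] += one
    return dict(totals)


def run_stripes(lines):
    """Simulated reduce for stripes: merge the stripes of each word element-wise."""
    merged = defaultdict(Counter)
    for line in lines:
        for w, stripe in stripes_map(line.split()):
            merged[w].update(stripe)
    # Flatten to the same (w, u) -> count shape as the pairs result for comparison.
    return {(w, u): n for w, stripe in merged.items() for u, n in stripe.items()}


lines = ["a b c", "a b"]  # hypothetical input
pairs_result = run_pairs(lines)
stripes_result = run_stripes(lines)
```

In this sketch the pairs mapper emits one record per co-occurrence, while the stripes mapper emits one record per distinct word on a line; that difference in the number and size of intermediate records is the trade-off between the two designs.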

Lecture 5 (High-level Hadoop programming languages)

Pig
Hive

Lecture 6 (Hadoop ecosystem)

Data importing into HDFS
Other Hadoop tools
ZooKeeper
Hadoop next generation: Interactive querying

Dremel
Impala

Lecture 7 (Hadoop next generation)

YARN
Data streaming with Storm
Graph analytics with Pregel

Lecture 8 (Hadoop next generation)

Motivation behind Spark
How Spark works
Programming Scala

Lecture 9 (Hadoop next generation)

Spark internals
Spark programming
Shark

Lecture 10 (NoSQL Stores)

Motivation for NoSQL
Introduction to NoSQL
Consistency models
Different NoSQL stores

Bigtable / HBase
DynamoDB
Spanner

Lecture 11 (Amazon Web Services)

Different services offered by Amazon Web Services (AWS)

CloudFormation, CloudFront, Elastic Load Balancing
Auto Scaling, Relational Database Service, Route 53, etc.

Disaster recovery versus high availability

Lecture 12 (Amazon Web Services)

Architecting AWS group exercise

Laboratory Classes Outline

Basic Amazon Web Services (AWS) including EC2 and S3

Creating EC2 instances using the AWS Management Console
Storing data in S3
Using the AWS command line interface within EC2 instances

Basic Hadoop MapReduce Programming

MapReduce programming using the Cloudera Virtual Machine

Various MapReduce basic exercises
Applying local aggregation to MapReduce programs

More advanced Hadoop MapReduce Programming

MapReduce programming using the Cloudera Virtual Machine

Taking advantage of the automatic sorting between Mappers and Reducers
Using combiners
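
The local aggregation, automatic sorting, and combiner exercises above can be pictured with a small plain-Python simulation of a word count job. This is a sketch under stated assumptions, not the lab code: it does not use Hadoop, and the names mapper, combiner, reducer, run_job, and the sample splits are hypothetical. Each inner list in splits stands in for the input of one map task.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Local aggregation: sum the counts per word within one map task's output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reducer(word, counts):
    """Sum all counts that arrive for one word."""
    return word, sum(counts)


def run_job(splits, use_combiner):
    """Run the simulated job; return (results, number of records sent to reducers)."""
    shuffled = []
    for split in splits:
        out = [kv for line in split for kv in mapper(line)]
        if use_combiner:
            out = combiner(out)
        shuffled.extend(out)
    records_shuffled = len(shuffled)
    # Stand-in for the framework's sort by key between the map and reduce phases.
    shuffled.sort(key=itemgetter(0))
    results = [
        reducer(word, [count for _, count in group])
        for word, group in groupby(shuffled, key=itemgetter(0))
    ]
    return results, records_shuffled


splits = [["to be or not to be"], ["to do or not to do"]]  # hypothetical input
plain, plain_records = run_job(splits, use_combiner=False)
combined, combined_records = run_job(splits, use_combiner=True)
```

Both runs give the same word counts, already ordered by key because of the sort step, but the run with the combiner passes fewer intermediate records to the reducers.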

Hadoop MapReduce on AWS Elastic MapReduce (EMR)

First, write and compile the program on the Cloudera Virtual Machine
Next, deploy the compiled program onto EMR

Basics of Hive

Creating tables and querying data
See the relationship between MapReduce and Hive

Advanced Hive programming

More querying of data
Writing your own user-defined function (UDF)
Performing data partitioning

Programming Scala

Scala programming inside the Spark Shell

Basics of the language like data structures, lambdas, etc.
A lot of exercises on applying functions to all elements of collections

E.g. reduce, map, filter, etc.
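
The pattern behind these exercises, applying a function to every element of a collection, is not specific to Scala. As a rough illustration only, here is the same idea in plain Python, used as a stand-in language; the variable names and data are hypothetical. In Scala the corresponding collection methods are map, filter, and reduce, called directly on the collection.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]  # hypothetical data

# map: apply a function to every element.
squares = list(map(lambda x: x * x, numbers))

# filter: keep only the elements for which the predicate holds.
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce: combine all elements into one value with a binary function.
total = reduce(lambda a, b: a + b, even_squares)
```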

Programming Spark

Spark programming inside the Spark Shell

Applying functions to all elements of RDDs

E.g. reduceByKey, map, filter, etc.

Loading data from and saving data to text files, binary files, and HDFS
Advanced Spark

Performing local aggregation via mapPartitions.
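
Spark's mapPartitions applies a function once per partition, passing it an iterator over that partition's elements, which is what makes per-partition (local) aggregation possible. The sketch below imitates that shape in plain Python without Spark: map_partitions, partial_sum_count, and the sample partitions are hypothetical stand-ins, and the real API is invoked on an RDD rather than on a list of lists. The example computes a mean from one (sum, count) record per partition.

```python
from functools import reduce


def map_partitions(partitions, func):
    """Call func once per partition, passing an iterator over that partition's elements."""
    return [list(func(iter(partition))) for partition in partitions]


def partial_sum_count(elements):
    """Local aggregation: reduce a whole partition to a single (sum, count) record."""
    total, count = 0, 0
    for x in elements:
        total += x
        count += 1
    yield total, count


partitions = [[1, 2, 3], [4, 5], [6]]  # hypothetical data, already split into partitions

# One small record per partition instead of one record per element.
partials = [
    record
    for part in map_partitions(partitions, partial_sum_count)
    for record in part
]

# Global step: combine the per-partition records.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
mean = total / count
```

Only three small records reach the final combining step here, one per partition, rather than one per input element.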

Programming Spark on AWS EMR

Creating AWS Spark EMR clusters using the AWS command line interface
Loading data from S3 into Spark EMR cluster
Spark exercises on EMR cluster

Elasticity on Amazon Web Services

Elastic load balancing
CloudWatch
Auto scaling