Zhen He

Associate Professor

Department of Computer Science and Computer Engineering
La Trobe University
Bundoora, Victoria 3086
Australia

Tel: +61 3 9479 3036
Email: z.he@latrobe.edu.au

Building: Beth Gleeson, Room: 235
 






Teaching

I am currently teaching a subject called Big Data Management in the Cloud. The subject focuses on Big Data analytics using the Hadoop ecosystem. We cover a large portion of the ecosystem, starting from the basics of MapReduce, Pig, Hive, and ZooKeeper and moving on to second-generation components such as YARN, Storm, Giraph, Spark, and Shark. There is a strong emphasis on gaining hands-on programming experience using a combination of Cloudera's virtual machine and programming Hadoop on Amazon Web Services (AWS). We also cover the important AWS services, including Elastic MapReduce (EMR), EC2, S3, the Elastic Load Balancer, Auto Scaling, DynamoDB, CloudFormation, CloudWatch, and Route 53. Finally, we cover important concepts in the area of NoSQL stores.

The details for the subject are below.

Subject name: Big Data Management in the Cloud
Course code: CSE3BDC and CSE4BDC
Time: 1st semester 2014

Lecture Outline
  1. Lecture 1 (Introduction to cloud computing)
    • Motivations for cloud computing
    • Cloud architectures
    • Private versus public cloud
    • Infrastructure as a service
    • Virtual machines
    • Platform as a service
    • Software as a service
    • Basic services provided by Amazon Web Services
  2. Lecture 2 (Introduction to Big Data and MapReduce)
    • Big Data: update-intensive versus data analytics
    • Big Data analytics motivation
    • Introduction to MapReduce
    • 9 big MapReduce concepts
    • MapReduce versus RDBMS
    • Language-neutral MapReduce processing
  3. Lecture 3 (Introduction to Hadoop)
    • Introduction to Hadoop
    • Hadoop internals
    • Programming Hadoop MapReduce
  4. Lecture 4 (Advanced MapReduce Programming)
    • Pairs and stripes
    • The inner join
    • Multiple iterations of MapReduce
    • Amazon elastic MapReduce (EMR)
  5. Lecture 5 (High-level Hadoop programming languages)
    • Pig
    • Hive
  6. Lecture 6 (Hadoop ecosystem)
    • Data importing into HDFS
    • Other Hadoop tools
    • ZooKeeper
    • Hadoop next generation: Interactive querying
      • Dremel
      • Impala
  7. Lecture 7 (Hadoop next generation)
    • YARN
    • Data streaming with Storm
    • Graph analytics with Pregel
  8. Lecture 8 (Hadoop next generation)
    • Motivation behind Spark
    • How Spark works
    • Programming Scala
  9. Lecture 9 (Hadoop next generation)
    • Spark internals
    • Spark programming
    • Shark
  10. Lecture 10 (NoSQL Stores)
    • Motivation for NoSQL
    • Introduction to NoSQL
    • Consistency models
    • Different NoSQL stores
      • Bigtable / HBase
      • DynamoDB
      • Spanner
  11. Lecture 11 (Amazon Web Services)
    • Different services offered by Amazon Web Services (AWS)
      • CloudFormation, CloudFront, Elastic Load Balancer
      • Auto Scaling, Relational Database Service, Route 53, etc.
    • Disaster recovery versus high availability
  12. Lecture 12 (Amazon Web Services)
    • Architecting AWS group exercise
Laboratory Classes Outline
  1. Basic Amazon Web Services (AWS) including EC2 and S3
    • Creating EC2 instances using the AWS Management Console
    • Storing data in S3
    • AWS command line interface within EC2 instances
  2. Basic Hadoop MapReduce Programming
    • MapReduce programming using the Cloudera Virtual Machine
      • Various MapReduce basic exercises
      • Applying local aggregation to MapReduce programs
  3. More advanced Hadoop MapReduce Programming
    • MapReduce programming using the Cloudera Virtual Machine
      • Taking advantage of the automatic sorting between Mappers and Reducers
      • Using combiners
  4. Hadoop MapReduce on AWS Elastic MapReduce (EMR)
    • First, write and compile the program on the Cloudera Virtual Machine
    • Next, deploy the compiled program onto EMR
  5. Basics of Hive
    • Creating tables and querying data
    • Seeing the relationship between MapReduce and Hive
  6. Advanced Hive programming
    • More querying of data
    • Writing your own user-defined function (UDF)
    • Performing data partitioning
  7. Programming Scala
    • Scala programming inside the Spark Shell
      • Basics of the language, such as data structures and lambdas
      • Many exercises on applying functions to all elements of collections
        • E.g. reduce, map, filter, etc.
  8. Programming Spark
    • Spark programming inside the Spark Shell
      • Applying functions to all elements of RDDs
        • E.g. reduceByKey, map, filter, etc.
      • Loading and saving data from/to text files, binary files, and HDFS
      • Advanced Spark
        • Performing local aggregation via mapPartitions
  9. Programming Spark on AWS EMR
    • Creating AWS Spark EMR clusters using the AWS command line interface
    • Loading data from S3 into a Spark EMR cluster
    • Spark exercises on the EMR cluster
  10. Elasticity on Amazon Web Services
    • Elastic load balancing
    • CloudWatch
    • Auto scaling