COMP9313: Big Data Management
Course Introduction
Lecturer in Charge
- Lecturer: Yifang Sun
- office: used to be K17-208, at home now…
- email: yifangs@cse.unsw.edu.au
- use [comp9313] in subject
- Research interests
- Database
- High dimensional data
- Machine learning (Natural language processing)
- Integration of DB and AI
Course Aims
- Introduce the concepts behind Big Data
- Introduce the core technologies used in managing large-scale data sets
- MapReduce
- Spark
- …
- Introduce technologies for developing solutions to large-scale data analytics problems
- nearest neighbor search
- machine learning with big data
- …
Course Aims - cont.
- Not possible to cover every aspect of big data management
- We will focus on
- concepts
- algorithms
- principles
- We will not focus on
- programming languages and APIs
- specific platforms
- Make use of tutorials and documentation on the Internet
Lectures
- Delivered through pre-recorded videos
- location: anywhere you like
- time: anytime you like
- links to videos available on Piazza every Mon and Wed
- email LiC ASAP if you have no access to Piazza
- Slides on course website
- No Q&A sessions during lectures
- Ask on Piazza or in online consultations
- Schedule and length of lectures may vary based on the progress of the course
- Note: watching every lecture is assumed.
Resources
- Books
- Hadoop: The Definitive Guide. Tom White. 4th Edition - O’Reilly Media
- Learning PySpark. Tomasz Drabas and Denny Lee. Packt Publishing
- Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park.
- Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 3rd edition - Cambridge University Press
- Online resources:
- PySpark Tutorial
- Spark Python API Docs
- Online courses/tutorials on YouTube, Coursera, …
Prerequisites
- Official prerequisite
- Data Structures and Algorithms
- Database Systems
- Before commencing this course, you should
- have experience with and good knowledge of algorithm design
- have a solid background in database systems
- have solid programming skills in Python
- be familiar with Linux operating systems
- have basic knowledge of linear algebra, probability theory and statistics
- No previous experience necessary in
- MapReduce/Spark
- Parallel and distributed programming
Please do not enrol if you…
- Don’t have COMP9024/9311 knowledge
- Cannot produce a correct Python program on your own
- Have poor time management
- Are too busy to watch the lecture videos or do the labs
- If any of these apply, you are likely to perform badly in this subject
Assessment
- One written assignment (20%)
- Two programming projects (25% each)
- Final exam (30%)
- There’s no hurdle for any of the above components
- All are individual tasks
- All are submitted through give
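The weights above sum to 100%, so the overall mark is a weighted sum of the four components. A minimal sketch of that arithmetic follows; the names `WEIGHTS` and `overall_mark` are hypothetical, and it assumes each component is scored out of 100 (the slides do not say how raw marks are scaled).

```python
# Component weights in percent, as listed on the Assessment slide.
WEIGHTS = {
    "written_assignment": 20,
    "project_1": 25,
    "project_2": 25,
    "final_exam": 30,
}


def overall_mark(scores):
    """Weighted overall mark, assuming every score is out of 100 (an assumption)."""
    assert sum(WEIGHTS.values()) == 100  # sanity check on the weights
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "written_assignment": 80,
    "project_1": 70,
    "project_2": 90,
    "final_exam": 60,
}
print(overall_mark(example))  # 74.0
```

Because there is no hurdle, a low score in one component only lowers the total in proportion to its weight.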
Written Assignment
- Exam-style questions
- Computational, short answer
- no essay, no multiple choice
- Regarding the lecture contents
- algorithms, principles, …
- to assess your understanding, not memory
- Late penalty
- firm deadline
- zero mark for late submission
Programming projects
- Tentative topics
- One on MapReduce + nearest neighbor search
- One on PySpark + machine learning
- Both results and source code will be checked.
- Zero marks if your code cannot be run due to bugs.
- Late penalty
- 10% reduction of raw marks for the 1st day, then 30% reduction per day for each of the following 3 days
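One way to read the project late penalty is as a cumulative reduction: 10% after one day, then 40%, 70% and 100% after two, three and four days. The sketch below works through that reading. It assumes the reductions add up, that they apply to the raw mark you earned, that lateness is counted in whole days, and that anything later than four days stays at zero; none of these details is confirmed by the slides, and the function names are hypothetical.

```python
def late_penalty_percent(days_late: int) -> int:
    """Cumulative reduction in percent of the raw mark (assumed interpretation)."""
    if days_late <= 0:
        return 0
    # 10% for the first day, then 30% for each of the following 3 days.
    extra_days = min(days_late - 1, 3)
    return min(10 + 30 * extra_days, 100)


def penalised_mark(raw_mark: float, days_late: int) -> float:
    """Raw mark after applying the assumed cumulative late penalty."""
    return raw_mark * (100 - late_penalty_percent(days_late)) / 100


for days in range(6):
    print(days, late_penalty_percent(days), penalised_mark(80, days))
```

Under these assumptions, a project worth 80 raw marks drops to 72 after one day, 48 after two, 24 after three, and 0 from the fourth day on.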
Final exam
- Open book exam
- Firm deadline
- No supplementary exam will be given
- Special consideration applications must be submitted prior to the start of the exam
- More details on the way
Academic honesty and plagiarism
- Zero tolerance for plagiarism
- You will get 0 marks
- Examples of misconduct:
- Copy other students’ work
- Let other students copy your work
- Copy from GitHub
- Find a ghost writer
- …
- I will not accept the following excuses:
- “I left the lab with my screen unlocked”
- “He stole it from my computer”
- “I only gave my code to A. A didn’t use it but gave it to B”
- …
Tentative course schedule
Week | Topic | Assignment/Project
1 | Course Introduction and Introduction to Big Data |
2 | Hadoop MapReduce |
3 | Hadoop MapReduce |
4 | Nearest Neighbor Search | Project 1
5 | Spark | Assignment
6 | Flexibility Week (no lecture) |
7 | Spark | Project 2
8 | Machine Learning with PySpark |
9 | Data Stream + NoSQL |
10 | Revision and Exam Preparation |
Labs
- Labs to help you with programming and projects
- nothing to submit, no marks
- using IPython notebooks
- Contents
- 1 lab on setting up the environment
- 1 lab on PySpark and MapReduce
- 1 lab on NNS with MapReduce
- 1 lab on Machine learning with PySpark
Consultations
- Online Q&A discussions on Piazza
- you are all encouraged to participate
- Online consultation with tutor
- 1pm – 2pm every Friday
- using Zoom
- room number and password on Piazza
- Private online consultation with me
- please book an appointment with me, including a brief description of your questions and [comp9313] in the subject
General Recommendations
- Make use of LiC and tutors
- don’t hesitate to ask questions
- Make use of Piazza
- read the notices on the course website and Piazza
- participate in the discussions on Piazza
- Make use of course materials
- understand lecture slides
- read specifications carefully
- try the labs although they are not compulsory
- Do not engage in academic misconduct
Your Feedback is Always Welcome
- Please advise me where I can improve after each lecture, through Piazza or by email
- myExperience system