Big Data and Scripting
University of Konstanz
Algorithmik
Prof. Dr. Ulrik Brandes

Big Data and Scripting (SS 2014)

+++ Announcements +++

  • second exam is graded, results are available in electronic form
  • you can inspect your graded exam on October 15., 3-4 p.m. in PZ1007

General Information

Lecture (U. Nagel)
  Mon 13:30–15:00 (M628)
  Thu 10:00–11:30 (E402)
Tutorial (M. Ortmann)
  Fri 10:00–11:30 (F420) (Group A)
  Fri 13:30–15:00 (F420) (Group B)
Written exams (exam time 90 minutes)
  July 30., 2 p.m. (A704)
  October 14., 1 p.m. (C336)

Content

The term "big data" is often used to describe vast collections of semi-structured data in the range of tera- or even petabytes. Companies like Google and Amazon illustrate that mining and analyzing such collections yields the potential for completely new applications.

The lecture provides an overview of motivations for analyzing big data and introduces the techniques needed in the process. This includes introductions to scripting languages, NOSQL databases, and Map/Reduce systems, all accompanied by practical exercises.

Course material will be made available on this page over the course of the lecture, mostly in the form of the slides used in the lecture. In addition, there will be regular practical assignments to be solved in groups, which will also be discussed in the tutorials.

Slides

lecture  date  slides/topic
1        4/24  prologue/introduction
2        4/28  command line introduction
3        5/5   sed introduction (includes part of next lecture)
4        5/8   awk and streaming (mean/variance) (corrected formulas, 05/16)
5        5/12  Flajolet-Martin, sampling, filtering
6        5/15  estimating frequency moments, stream clustering (start)
7        5/19  stream clustering (part 2) and Python introduction (part 1)
8        5/22  Python introduction (part 2)
9        5/26  Python introduction (part 3) (includes additional slides)
10       6/2   memory hierarchies - B-Trees
11       6/5   memory hierarchies (part 2), parallel computations
12       6/12  storage systems (small correction 7/25)
13       6/16  storage systems (part 2)
14       6/23  storage systems (part 3) and NOSQL systems
15       6/26  NOSQL systems (mongoDB)
16       6/30  NOSQL systems (mongoDB), map reduce
17       7/7   map reduce (in mongoDB and in general) (new slide added July 10.)
18       7/10  map reduce complexity and Hadoop/HDFS (includes slides not (yet) covered in lecture)
19       7/14  map reduce Hadoop and design patterns
20       7/17  map reduce algorithms for joins and decision trees
21       7/21  systems on top and beyond map/reduce
22       7/24  summary

Assignments

The assignments are returned and discussed during the tutorials. 50 percent of the total score of the assignments and regular attendance at the tutorials are necessary to be admitted to the final exam.

You are permitted and encouraged to work in groups of two. For your submission please follow the instructions given in assignment00.

No.  Posted     Tutorial   Download  files/intermediate results  solutions
0    April 28.  May 2.     a00.pdf   data.tar.gz
1    April 28.  May 9.     a01.pdf   tables.txt
2    May 5.     May 16.    a02.pdf
3    May 12.    May 23.    a03.pdf   random.awk
4    May 19.    May 30.    a04.pdf   2013_02_08_01.zip
5    May 26.    June 6.    a05.pdf   multigauss.py, example.py
6    June 2.    June 13.   a06.pdf
7    June 10.   June 20.   a07.pdf   newsgroups.zip
8    June 16.   June 27.   a08.pdf
9    June 23.   July 11.   a09.pdf
10   July 7.    July 18.   a10.pdf

Textbooks/other material