VSCSE - Virtual School of Computational Science and Engineering

Sites

University of California San Diego, San Diego Supercomputer Center, San Diego, CA

Chicago Area (Northwestern/University of Chicago)

Louisiana State University, Baton Rouge, LA

Marshall University, Huntington, WV

Michigan State University, East Lansing, MI

Princeton University, Princeton, NJ

Purdue University, West Lafayette, IN

University of California Los Angeles, Los Angeles, CA

University of Delaware, Newark, DE

University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications, Urbana, IL

University of Oklahoma, Norman, OK

University of Tennessee Knoxville, Knoxville, TN

University of Texas at Brownsville, Brownsville, TX

University of Texas at El Paso, El Paso, TX

University of Wisconsin Milwaukee, Milwaukee, WI

Data Intensive Summer School

July 8–10, 2013

Data Intensive Video Series

The Data Intensive Summer School focuses on the skills needed to manage, process and gain insight from large amounts of data. It is targeted at researchers from the physical, biological, economic and social sciences who are beginning to drown in data. We will cover the nuts and bolts of data intensive computing, common tools and software, predictive analytics algorithms, data management and non-relational database models. Given the short duration of the summer school, the emphasis will be on providing a solid foundation that attendees can use as a starting point for advanced topics of particular relevance to their work.

Prerequisites:

  • Experience working in a Linux environment
  • Familiarity with relational database models
  • Examples and assignments will most likely use R, MATLAB and Weka. We do not require experience in these languages or tools, but you should already have an understanding of basic programming concepts (loops, conditionals, functions, arrays, variables, scoping, etc.).

Organizers:

  • Robert Sinkovits, San Diego Supercomputer Center

Course topics:

  • Nuts and bolts of data intensive computing
      • Computer hardware, storage devices and file systems
      • Cloud storage
      • Data compression
      • Networking and data movement
  • Data management
      • Digital libraries and archives
      • Data management plans
      • Access control, integrity and provenance
  • Introduction to R programming
  • Introduction to Weka
  • Predictive analytics
      • Standard algorithms: k-means clustering, decision trees, SVM
      • Over-fitting and trusting results
      • Dealing with missing data
  • ETL (extract, transform and load)
      • The ETL life cycle
      • ETL tools – from scripts to commercial solutions
  • Non-relational databases
      • Brief refresher on the relational model
      • Survey of non-relational models and technologies
  • Visualization
      • Presentation of data for maximum insight
      • R and the ggplot package

NOTE: Students are required to provide their own laptops.