ECE 365 - Data Science and Engineering

Spring 2024

Title | Rubric | Section | CRN | Type | Hours | Times | Days | Location | Instructor
Data Science and Engineering | ECE365 | ZJ1 | 75252 | LCD | 3 | - | | | Venugopal V. Veeravalli

Official Description

Project-based course focused on exploring and understanding how data are collected, represented and stored, and computed/analyzed upon to arrive at appropriate and meaningful interpretation. Foundations of machine learning are developed and then applied in the context of two specific application areas, such as social network analytics, biological data analysis, and audio and video analytics. Course Information: Prerequisite: ECE 313.

Goals

Big Data is all around us. Petabytes of data are collected by Google and Facebook. Twenty-four hours of video are uploaded to YouTube every minute. Making sense of all this data in the relevant context is a critical challenge. The goal of the course is to give students a holistic understanding of how this data is collected, represented, stored, retrieved, and analyzed to arrive at appropriate outcomes for the underlying context.

Topics

The course is divided into three parts, with the first part focusing on foundations of machine learning, and the remaining two on specific application areas. Each application topic is covered at four discrete levels.

  • We start with the context: where the data comes from, how it is acquired, and what biases and noise levels it contains. This leads to statistical and physical models of the acquired data.
  • Appropriate data representation mechanisms and distributed storage and computing architectures are discussed next. Different compression/coding methods are appropriate for different types of data. Images, videos, genomic data, medical imaging data, and smart grid data each bring their own unique characteristics, which can be harnessed for efficient representation.
  • Once data is stored and represented efficiently, we look for the right statistical and algorithmic tools to analyze it. Spectral methods (including Fourier methods and PCA), clustering algorithms, SVMs, and data mining algorithms are studied in the specific context of the data.
  • Finally, the analyzed data leads to inferences or visualizations appropriate to the physical problem we started with. This closes the loop, bringing utility to the original setting and context in which the data was acquired.

Examples of application topics include machine learning for power systems, biological data analytics, audio and video data analytics, and social network analytics.

Detailed Description and Outline

The Course Plan for the Spring 2019 offering is listed below. The application topics can change from semester to semester.

Course Plan

Part 1 (Weeks 1-5): Foundations of Machine Learning

Lecture 1: Introduction to the course; Review of Linear Algebra and Probability
Lecture 2: k-Nearest Neighbor Classifiers and Bayes Classifiers
Lecture 3: Linear Classifiers and Linear Discriminant Analysis
Lecture 4: Naïve Bayes, Kernel Tricks
Lecture 5: Logistic Regression, SVM and Model Selection
Lecture 6: K-Means Clustering and Applications
Lecture 7: Linear Regression and Applications
Lecture 8: SVD and Eigen-Decomposition
Lecture 9: Principal Component Analysis
Lecture 10: Optimization Techniques for Machine Learning, Q&A

Labs (Weeks 1-5)
Lab 1: Introduction to Python and the Canopy environment
Lab 2: Linear Classification: k-NN and LDA
Lab 3: Linear Classification: SVM
Lab 4: Clustering and Linear Regression
Lab 5: Eigen-Decompositions, SVD and PCA

Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports.


Part 2 (Weeks 6-10): Smart Grid

Lecture 1: Introduction to power systems, basics of neural networks
Lecture 2: Neural networks and load prediction
Lecture 3: Power flow equations
Lecture 4: SVM for detecting corrupt power system measurements
Lecture 5: Detecting network structure
Lecture 6: Basics of electricity markets, virtual bidding
Lecture 7: Trading strategies for virtual bidding
Lecture 8: Wrapping up virtual bidding, understanding customer data
Lecture 9: Logistic regression for customer data analysis
Lecture 10: Customer billing and cost savings from solar

Labs
Lab 1: Day-ahead load prediction in ERCOT markets
Lab 2: Detecting bad sensors in power system measurements
Lab 3: Virtual bidding in NYISO’s markets
Lab 4: Analyze customer data from Austin, Texas.


Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports


Part 3 (Weeks 11-15): Biological Data Analytics

Lecture 1: Introduction to bioinformatics. Biological data.
Lecture 2: Sequence alignment. Global vs local alignment. Dynamic programming.
Lecture 3: The Smith-Waterman and Needleman-Wunsch algorithms. BLAST.
Lecture 4: Suffix trees and the Burrows-Wheeler transform. Bowtie2.
Lecture 5: Dynamic programming for sequence folding prediction. Vienna and Mfold. Stochastic grammars for folding models.
Lecture 6: Sanger sequencing. Overview of Next Generation and Third Generation Sequencing technologies.
Lecture 7: Basics of graph theory. Genome assembly via de Bruijn graphs. EULER and IDBA-UD.
Lecture 8: Statistical read error-correction for Illumina, PacBio and Oxford Nanopore sequencers. Quake.
Lecture 9: Biological data repositories and databases.
Lecture 10: Biological data compression. Reference-based compression. CRAM. Context-tree weighting.

Labs
Lab 1: Sequence alignment and applications of BLAST.
Lab 2: Bowtie and DNA forensics.
Lab 3: Genome assembly. Influence of sequencing errors on assembler accuracy.
Lab 4: -Omics data compression.
Lab 5: Genomic sequence amplification and primer selection.

Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports.

Computer Usage

All labs are computer-based and use software such as Python and R.

Lab Projects

See Detailed Description and Outline.

Topical Prerequisites

Probability

Basic linear algebra

Texts

No textbook.

Required, Elective, or Selected Elective

Elective

Course Goals

Big Data is all around us. Petabytes of data are collected by Google and Facebook. Twenty-four hours of video are uploaded to YouTube every minute. Making sense of all this data in the relevant context is a critical challenge. The goal of the course is to give students a holistic understanding of how this data is collected, represented, stored, retrieved, and analyzed to arrive at appropriate outcomes for the underlying context.

Instructional Objectives

At the end of this course, the student will be able to apply the machine learning and data science tools gained in this course to several different types of problems involving data analytics in engineering systems and beyond. The student will also consider the broader societal impacts of the solutions, e.g., fairness in machine learning algorithms. (4)

Examples of the problems considered include:

  1. Given a set of labelled images corresponding to handwritten digits, the student will be able to design a classifier to effectively classify a new image that is outside the data set. (1) The student will learn systematic ways to choose the best classifier among a set of choices, through the process of training, validation and testing. (1) (2) (6) (7)
  2. The student will learn practical applications of data analysis in system and market operations for the power grid. Physics and other practical considerations often dictate what properties to expect from data in such problems. (1) Students will learn how to exploit these properties to choose the right tool for classification and regression tasks. (2) Tools used for this part expand on the ones learnt in the first part of the course, e.g., logistic regression and support vector machines, thus allowing the students to appreciate the theory in action. (6) Finally, the student will learn how to interpret the results based on the application context, and also understand the implications of the results in a broader societal context. (4) (7)
  3. The student will be able to apply the machine learning and data science tools learnt in the first part of the course to perform statistical hypothesis testing in molecular biology (genomics). In order to do that, the student will first learn basic concepts of molecular biology (genomics). (1) Then, the student will: (i) apply data normalization techniques to genomic sequencing data, (ii) perform statistical analyses over the preprocessed data, and (iii) formulate biological hypotheses based on statistical tests. (2) Specifically, the machine learning concepts that the student will use to solve such problems are: data standardization, linear regression, design matrices, Expectation-Maximization, t-tests, hypothesis testing and multiple testing correction, among others. (6)

Last updated

9/29/2019 by Venugopal V. Veeravalli