Statistical learning refers to a vast set of tools for understanding data.
Alternative text: I vaguely and irrationally resent how useful WebPlotDigitizer is.
These tools can broadly be thought of as
Examples:
Wage data
| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.04315 |
| 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.47602 |
| 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.98218 |
Factors related to wages for a group of males from the Atlantic region of the United States. We might be interested in the association between an employee’s age, education, and the calendar year on his wage.
Gene Expression Data
Consider the NCI60 data, which consists of 6,830 gene expression measurements for 64 cancer lines. We are interested ind determining whether there are groups among the cell lines based on their gene expression measurements.
1 A Brief History
Although the term “statistical machine learning” is fairly new, many of the concepts are not. Here are some highlights:
2 Notation and Simple Matrix Algebra
I’ll try to keep things consistent notationally throughout this course. Please call me out if I don’t!
\(n\)
\(p\)
\(x_{ij}\)
\(\boldsymbol X\)
\(\boldsymbol y\)
\(a, \boldsymbol A, A\)
\(a \in \mathbb{R}\)
Matrix multiplication