What does the Data say? | Mathematics of Life

ESAM 375-1: What does the Data Say?

Far too many courses on analyzing data these days are an assembly of methods. Because the field is developing so quickly, such courses necessarily lack an overarching framework. These are important courses and methods that students should learn about. That said, there is a much more fundamental issue, one that can be summarized by the following question: "Is this result surprising?"

This question can be asked upon seeing the outcome of a sequence of coin tosses (Am I surprised that a coin from the U.S. Mint gave me 7 heads and 3 tails in a sequence of 10 tosses?), or the plot produced by a fashionable dimensionality-reduction technique. Said another way, assessing confidence in the outcome of an analysis rests on a clearly stated quantitative expectation. Deviations from this expectation then represent surprise, and potentially a novel result. This mindset of assessment is something that students are often not taught, leaving them to run black-box statistical tests on their data and cross their fingers.
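As a minimal sketch of what "a clearly stated quantitative expectation" can look like for the coin-toss question, the Python snippet below simulates many sequences of 10 tosses and asks how often the result lands at least as far from 5 heads as the observed 7. It assumes the expectation is a fair coin (probability of heads 0.5) and measures surprise in both directions; the function names, the number of simulated experiments, and the random seed are illustrative choices, not part of the course materials.

```python
import random
from math import comb


def simulate_head_counts(n_tosses=10, n_experiments=100_000, p_heads=0.5, seed=0):
    """Simulate many experiments of n_tosses each; return the number of heads in each."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    return [
        sum(rng.random() < p_heads for _ in range(n_tosses))
        for _ in range(n_experiments)
    ]


def fraction_at_least_as_extreme(counts, observed, n_tosses=10):
    """Fraction of simulated experiments at least as far from the fair-coin
    expectation (n_tosses / 2 heads) as the observed head count."""
    expected = n_tosses / 2
    deviation = abs(observed - expected)
    return sum(abs(c - expected) >= deviation for c in counts) / len(counts)


def exact_fraction(observed, n_tosses=10):
    """The same quantity computed exactly from the binomial distribution (fair coin)."""
    expected = n_tosses / 2
    deviation = abs(observed - expected)
    favourable = sum(
        comb(n_tosses, k)
        for k in range(n_tosses + 1)
        if abs(k - expected) >= deviation
    )
    return favourable / 2 ** n_tosses


counts = simulate_head_counts()
simulated = fraction_at_least_as_extreme(counts, observed=7)
exact = exact_fraction(observed=7)

print(f"simulated fraction at least as extreme as 7 heads: {simulated:.3f}")
print(f"exact fraction for a fair coin:                   {exact:.5f}")
```

Under these assumptions the exact fraction is 352/1024, or about 0.34, so roughly one in three fair-coin experiments deviates from 5 heads by at least as much as 7 heads and 3 tails does. By this measure the result is not surprising.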

Mirroring the manner in which computation revolutionized our study of differential equations (we now use computers to simulate the flow of air over an aerofoil, rather than computing it by hand), near-universal access to powerful computation calls for a new era of statistical assessment. Students need not carry around a long list of named tests and throw the kitchen sink at their data. Instead, they can use a combination of a few simple ideas and the power of their computers to assess the confidence, or surprise, they have in their results. This is the philosophy that guides this course.

The course has three broad sections: 1) The basics, 2) When you have a model in mind, and 3) When you don't have a model at hand. Again, the overall theme running through the class is to assess confidence by constructing quantitative expectations on your computer.

The basics:
- The atom of coding and plotting (we will teach you how to code)
- The atom of probability
- The atom of Bayesian analysis

When you have a model:
- The atom of empirical assessment
- The atom of parameter estimation
- The atom of null distributions

When you don't have a model:
- The atom of correlation
- The atom of statistical models
- The atom of dimensionality reduction

Coding proficiency: We assume you have never coded and teach you how to use your computer to figure out what the data says. We will teach coding in the Python Jupyter environment.

Prereqs: There are no prerequisites for this class. It is suitable for both undergraduate and graduate students interested in understanding what their data says.

Class style: The class mirrors an experimental lab, except that here the experiments are done on your computer. The bulk of class time involves students working through worksheets on their computers. Lecture notes are made available for students to read outside the classroom. Assignments are also entirely computational, and each one is handled in a flipped-classroom style, in which students are given the opportunity to ask coding, conceptual, and statistical questions. Overall, the philosophy is that you get good at this kind of analysis by doing it, not by listening to someone lecture on how to do it.

Presentation: In parallel, a big theme of the course is plotting and presentation. The question "What does the data say?" is a vague one. Deciding what constitutes a satisfactory answer, and how to communicate it in well-crafted plots, is a skill. Teaching this skill is a keystone goal of the class.

Biology? This course is not focused on biological phenomena or biological datasets. We use data from biology only because it is readily accessible to the teaching faculty. Over time, we are adding data from other areas of study.