Week 0 Friday#

Welcome to Math 10!

This class is an introduction to using Python for data science. There are two primary parts of the course:

  • Part 1. Exploratory Data Analysis. (Weeks 1-5)

  • Part 2. Introduction to Machine Learning. (Weeks 5-10)

Two in-class midterms: Monday Week 5 (10/30) and Monday Week 10 (12/04). They’re closed book and closed computer.

There’s no final exam; instead there is a class project.

There will be NO official textbook for this course. You may find the following references helpful:#


  • If you’re on the waitlist, please submit homework and take quizzes on the same schedule as the regular class. (Assignments won’t be excused, but I do drop the lowest worksheet score.)

What is Data Science?#

Three correlated concepts:

  • Data Science

  • Artificial Intelligence

  • Machine Learning

Battle of the Data Science Venn Diagrams

The original Venn diagram from Drew Conway:

Another diagram from Steven Geringer:

Perhaps the reality should be:

David Robinson’s Auto-pilot example:

  • Machine learning: predict whether there is a stop sign in the camera image

  • Artificial intelligence: decide when to take the action of applying brakes (either by rules or from data)

  • Data science: provide insight into why the system is more likely to miss a stop sign before sunrise or after sunset

Example: Precision Medicine and Single-cell Sequencing.#

  • A structured data table, with \(n\) observations and \(p\) variables.

  • Mathematical representation: the data matrix \(X\in\mathbb{R}^{p\times n}\). In notation, we write \(X=\left(\mathbf{x}^{(1)},\mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(n)} \right)\), where the \(i\)-th column vector represents the \(i\)-th observation, \(\mathbf{x}^{(i)}=\left( \begin{matrix} x_{1}^{(i)}\\ x_{2}^{(i)} \\ \vdots \\ x_{p}^{(i)} \end{matrix} \right) \in\mathbb{R}^{p}\).

  • Roughly speaking, big data means large \(n\), and high-dimensional data means large \(p\).
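The notation above can be mirrored in NumPy. The following is a minimal sketch, not part of the original notes: the sizes `p = 3` and `n = 5` and the entries are made up for illustration, and the array is laid out with observations as columns to match \(X\in\mathbb{R}^{p\times n}\).

```python
import numpy as np

p, n = 3, 5  # hypothetical sizes: 3 variables, 5 observations

# Fill X with the integers 0..14 so that entries are easy to track.
X = np.arange(p * n).reshape(p, n)

# Column i - 1 holds the i-th observation x^(i) (Python indexing starts at 0).
x_2 = X[:, 1]

print(X.shape)  # (3, 5)
print(x_2)      # [ 1  6 11]
```

Many Python libraries, such as pandas and scikit-learn, store observations as rows instead, which corresponds to the transpose `X.T` with shape `(n, p)`.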

Why Python?#

Python is Good#

  • Stable Learning Curves

An entertaining cartoon from Tobias Hermann

  • Scalability of Computation (with the help of other packages)

benchmarking of scientific computation problems

comparison between Numpy and Matlab
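As a rough illustration of why packages such as NumPy matter for performance, the sketch below computes the same sum of squares with a plain Python loop and with a vectorized NumPy call. The array size and the function names `sum_squares_loop` and `sum_squares_numpy` are illustrative choices, not taken from the benchmarks referenced above. Timings depend on the machine, so none are claimed here.

```python
import time
import numpy as np

def sum_squares_loop(values):
    """Sum of squares using a plain Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_squares_numpy(arr):
    """Sum of squares using a vectorized NumPy dot product."""
    return float(np.dot(arr, arr))

# Hypothetical test data: the numbers 0, 1, ..., 199999 as floats.
arr = np.arange(200_000, dtype=np.float64)

start = time.perf_counter()
loop_result = sum_squares_loop(arr.tolist())
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = sum_squares_numpy(arr)
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```

Both functions return the same value up to floating-point rounding; only the running time differs.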

Warm-up with Deepnote and some Python concepts#

You can type along with me.

You execute a cell/block in Deepnote (or in a Jupyter notebook) by holding down shift and hitting return. The order in which you execute cells is important.

This is an example of a markdown cell. Markdown cells are used to write explanations for your code and to format text nicely.

  • This is an example of making a list. To execute a cell, you can use command+enter. To edit a cell, make sure it’s highlighted, and then press enter.

Math 10#

Week 0#


If you are familiar with \(\LaTeX\), you can also typeset math:

  • \(\int_0^1 x^2 dx\)

Here’s an example of how we can change text color:

Warning: Midterm 1 is on Week 5 Monday.

Here are some examples of code cells:

a = 4
b = 8

The order in which we evaluate cells matters!

NameError                                 Traceback (most recent call last)
Cell In [3], line 1
----> 1 print(x)

NameError: name 'x' is not defined

x = 10
# take square of x

print('Hello World!')
Hello World!

# named my_list to avoid shadowing Python's built-in list type
my_list = [5, 6, 7, 8]

Indexing in Python starts at 0!

[8, 7, 6, 5]
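A short sketch of zero-based indexing and slicing. The name `nums` is a hypothetical stand-in, chosen to avoid shadowing Python's built-in `list` type.

```python
nums = [5, 6, 7, 8]

print(nums[0])     # first element: 5
print(nums[-1])    # last element: 8
print(nums[1:3])   # elements at positions 1 and 2: [6, 7]
print(nums[::-1])  # reversed copy: [8, 7, 6, 5]
```

Note that a slice `nums[a:b]` includes position `a` but stops before position `b`.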
  • NumPy is one of the most important Python libraries.

  • NumPy does not come with base Python, so we need to import it every time we start a new notebook. The abbreviation np is a standard convention, and we will always use it in Math 10.

# load NumPy
import numpy as np
# generate a random variable
# we can use the NumPy function np.random.default_rng

Here we specify a seed so that we have reproducible results:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> print(rng)
Generator(PCG64)
>>> arr1 = rng.random((3, 3))
>>> arr1
array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])

If we exit and restart our Python interpreter, we'll see that we
generate the same random numbers again:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> arr2 = rng.random((3, 3))
>>> arr2
array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])

rng = np.random.default_rng()
array([0.78238483, 0.81217947, 0.52612065, 0.91363018, 0.65672153])
array([0.39916726, 0.40975624, 0.24989856])
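To make the role of the seed concrete, here is a minimal sketch using `np.random.default_rng` and the generator's `random` method, both of which appear above. The seed value `10` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce identical draws.
rng_a = np.random.default_rng(seed=10)
rng_b = np.random.default_rng(seed=10)
draws_a = rng_a.random(5)
draws_b = rng_b.random(5)
print(np.array_equal(draws_a, draws_b))  # True

# An unseeded generator is initialized from fresh OS entropy,
# so its draws differ from run to run.
rng_c = np.random.default_rng()
draws_c = rng_c.random(3)

# random() samples uniformly from the half-open interval [0, 1).
print(((draws_c >= 0) & (draws_c < 1)).all())  # True
```

Fixing a seed is useful when you want someone else to reproduce your results exactly; leaving it out is fine for quick experiments.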