Week 0 Friday#

Welcome to Math 10!

Canvas homepage

This class is an introduction to using Python for data science. There are two primary parts of the course:

Part 1. Exploratory Data Analysis. (Weeks 1-5)
Part 2. Introduction to Machine Learning. (Weeks 5-10)

Two in-class midterms: Monday Week 5 (10/30) and Monday Week 10 (12/04). They’re closed book and closed computer.

There’s no final exam; instead there is a class project.

There will be NO official textbook for this course. You may find the following references helpful:#

For Basic Python Programming: A Byte of Python
For Machine Learning Code in Python: Python Data Science Handbook
For Machine Learning Applications and Theory: An Introduction to Statistical Learning with solutions in Python; The Elements of Statistical Learning; Probabilistic Machine Learning: An Introduction
For Deep Learning: Deep Learning

Announcements#

If you’re on the waitlist, please submit homework and take quizzes on the same schedule as the regular class. (Assignments won’t be excused, but I do drop the one lowest worksheet score.)

What is Data Science?#

Three related concepts:

Data Science
Artificial Intelligence
Machine Learning

Battle of the Data Science Venn Diagrams

The original Venn diagram from Drew Conway:

Another diagram from Steven Geringer:

Perhaps the reality should be:

David Robinson’s Auto-pilot example:

Machine learning: predict whether there is a stop sign in the camera
Artificial intelligence: decide when to take the action of applying brakes (either by rules or from data)
Data science: provide insight into why the system is more likely to miss a stop sign before sunrise or after sunset

Example: Precision Medicine and Single-cell Sequencing.#

A structured data table, with \(n\) observations and \(p\) variables.
Mathematical representation: the data matrix \(X\in\mathbb{R}^{p\times n}\). We write \(X=\left(\mathbf{x}^{(1)},\mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(n)} \right)\), where the \(i\)-th column vector represents the \(i\)-th observation, \(\mathbf{x}^{(i)}=\left( \begin{matrix} x_{1}^{(i)}\\ x_{2}^{(i)} \\ \vdots \\ x_{p}^{(i)} \end{matrix} \right) \in\mathbb{R}^{p}\).
Roughly speaking, “big data” means large \(n\), and “high-dimensional data” means large \(p\).
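As a minimal sketch of this notation, the block below stores a small made-up data set in NumPy with \(p=3\) variables and \(n=4\) observations. The numbers and the names `X`, `x_2`, and `X_rows` are illustrative assumptions, not data from the course. Many Python libraries (for example pandas and scikit-learn) store one observation per row instead, which corresponds to the transpose of \(X\).

```python
import numpy as np

# Hypothetical data: p = 3 variables (rows), n = 4 observations (columns).
X = np.array([
    [1.0, 2.0, 3.0, 4.0],      # variable 1 across the 4 observations
    [0.5, 0.1, 0.9, 0.3],      # variable 2
    [10.0, 20.0, 30.0, 40.0],  # variable 3
])

p, n = X.shape

# The i-th observation x^(i) is the i-th column of X.
# Python indexing starts at 0, so x^(2) is the column with index 1.
x_2 = X[:, 1]

# Row-per-observation layout (n x p), the transpose of X.
X_rows = X.T

print(p, n, x_2, X_rows.shape)
```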

Why Python?#

Python is Popular#

How to measure popularity? It is indeed a data science problem!

TIOBE: Based on Google search results
PYPL PopularitY: Based on Google Trends
GitHut 2.0: Based on GitHub
RedMonk: Based on GitHub + Stack Overflow

Python is Good#

Stable Learning Curves

An entertaining cartoon from Tobias Hermann

Scalability of Computation (with the help of other packages)

benchmarking of scientific computation problems

comparison between NumPy and MATLAB

Useful Packages
- NumPy: Scientific Computing
- Pandas: Data Analysis and Manipulation
- Scikit-Learn: Machine Learning
- Matplotlib: Visualizing Functions/Datasets
- Seaborn: Visualizing Statistical Data

Warm-up with Deepnote and some Python concepts#

Deepnote vs Jupyter notebook

You can type along with me.

You execute a cell/block in Deepnote (or in a Jupyter notebook) by holding down shift and hitting return. The order in which you execute cells is important.

This is an example of a markdown cell. Markdown cells are used to write explanations for your code and to format text nicely.

This is an example of making a list. To execute a cell, you can use command+enter. To edit a cell, make sure it’s highlighted, and then press enter.

Math 10#

Week 0#

Friday#

If you are familiar with \(\LaTeX\), you can also typeset math:

\(\int_0^1 x^2 dx\)

Here’s an example of how we can change text color:

Warning: Midterm 1 is on Monday of Week 5.

Here are some examples of code cells:

2+2

a = 4
b = 8
a+b

The order in which we evaluate cells matters!

print(x)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [3], line 1
----> 1 print(x)

NameError: name 'x' is not defined

x = 10

print(x)
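The notebook behavior above can be reproduced in a single script. This is only a sketch, assuming a fresh session; the name `y` is a placeholder chosen because it has not been defined anywhere yet.

```python
# Using a name before it is assigned raises NameError,
# just like running print(x) before the cell that defines x.
try:
    print(y)  # y has not been defined yet (hypothetical name)
    error_seen = False
except NameError as err:
    error_seen = True
    print("NameError:", err)

# After the assignment runs, the same print call succeeds.
y = 10
print(y)
```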

# take square of x
x**2

print('Hello World!')

Hello World!

# avoid naming a variable `list`, which would shadow the built-in type
my_list = [5, 6, 7, 8]

my_list[1]

Indexing in Python starts at 0!

my_list[0]

my_list[3]

my_list[-1]

my_list[::-1]

[8, 7, 6, 5]
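A short sketch summarizing the indexing rules above. The list name `values` and the slice `values[1:3]` are additions for illustration; the result of each expression is given in its comment.

```python
values = [5, 6, 7, 8]

first = values[0]         # 5: indexing starts at 0
second = values[1]        # 6
last = values[-1]         # 8: negative indices count from the end
middle = values[1:3]      # [6, 7]: start is included, stop is excluded
backwards = values[::-1]  # [8, 7, 6, 5]: a step of -1 reverses the list

print(first, second, last, middle, backwards)
```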

NumPy is one of the most important Python libraries.
NumPy does not come with base Python, so we need to import it every time we start a new notebook. The abbreviation np is a standard convention, and we will always use it in Math 10.

# load NumPy
import numpy as np

# generate random numbers
# we can use the function default_rng in numpy.random

np.random.default_rng?

Docstring:
Construct a new Generator with the default BitGenerator (PCG64).

Parameters
----------
seed : {None, int, array_like[ints], SeedSequence, BitGenerator, Generator}, optional
    A seed to initialize the `BitGenerator`. If None, then fresh,
    unpredictable entropy will be pulled from the OS. If an ``int`` or
    ``array_like[ints]`` is passed, then it will be passed to
    `SeedSequence` to derive the initial `BitGenerator` state. One may also
    pass in a `SeedSequence` instance.
    Additionally, when passed a `BitGenerator`, it will be wrapped by
    `Generator`. If passed a `Generator`, it will be returned unaltered.

Returns
-------
Generator
    The initialized generator object.

Notes
-----
If ``seed`` is not a `BitGenerator` or a `Generator`, a new `BitGenerator`
is instantiated. This function does not manage a default global instance.

Examples
--------
``default_rng`` is the recommended constructor for the random number class
``Generator``. Here are several ways we can construct a random 
number generator using ``default_rng`` and the ``Generator`` class. 

Here we use ``default_rng`` to generate a random float:

>>> import numpy as np
>>> rng = np.random.default_rng(12345)
>>> print(rng)
Generator(PCG64)
>>> rfloat = rng.random()
>>> rfloat
0.22733602246716966
>>> type(rfloat)
<class 'float'>
 
Here we use ``default_rng`` to generate 3 random integers between 0 
(inclusive) and 10 (exclusive):
    
>>> import numpy as np
>>> rng = np.random.default_rng(12345)
>>> rints = rng.integers(low=0, high=10, size=3)
>>> rints
array([6, 2, 7])
>>> type(rints[0])
<class 'numpy.int64'>

Here we specify a seed so that we have reproducible results:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> print(rng)
Generator(PCG64)
>>> arr1 = rng.random((3, 3))
>>> arr1
array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])

If we exit and restart our Python interpreter, we'll see that we
generate the same random numbers again:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> arr2 = rng.random((3, 3))
>>> arr2
array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])
Type:      builtin_function_or_method
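To check the reproducibility claim from the docstring without restarting the interpreter, one option is to build two generators from the same seed and compare their output. This is a sketch; the names `rng_a`, `rng_b`, and `rng_c` are placeholders.

```python
import numpy as np

# Two generators built from the same seed produce the same stream.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

arr_a = rng_a.random((3, 3))
arr_b = rng_b.random((3, 3))
same = np.array_equal(arr_a, arr_b)  # True

# Without a seed, fresh entropy is pulled from the OS,
# so these values change from run to run.
rng_c = np.random.default_rng()
arr_c = rng_c.random((3, 3))

print(same)
```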

rng = np.random.default_rng()
rng.random(5)

array([0.78238483, 0.81217947, 0.52612065, 0.91363018, 0.65672153])

help(rng.random)

Help on built-in function random:

random(...) method of numpy.random._generator.Generator instance
    random(size=None, dtype=np.float64, out=None)
    
    Return random floats in the half-open interval [0.0, 1.0).
    
    Results are from the "continuous uniform" distribution over the
    stated interval.  To sample :math:`Unif[a, b), b > a` multiply
    the output of `random` by `(b-a)` and add `a`::
    
      (b - a) * random() + a
    
    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result, only `float64` and `float32` are supported.
        Byteorder must be native. The default value is np.float64.
    out : ndarray, optional
        Alternative output array in which to place the result. If size is not None,
        it must have the same shape as the provided size and must match the type of
        the output values.
    
    Returns
    -------
    out : float or ndarray of floats
        Array of random floats of shape `size` (unless ``size=None``, in which
        case a single float is returned).
    
    Examples
    --------
    >>> rng = np.random.default_rng()
    >>> rng.random()
    0.47108547995356098 # random
    >>> type(rng.random())
    <class 'float'>
    >>> rng.random((5,))
    array([ 0.30220482,  0.86820401,  0.1654503 ,  0.11659149,  0.54323428]) # random
    
    Three-by-two array of random numbers from [-5, 0):
    
    >>> 5 * rng.random((3, 2)) - 5
    array([[-3.99149989, -0.52338984], # random
           [-2.99091858, -0.79479508],
           [-1.23204345, -1.75224494]])

rng.random(3)

array([0.39916726, 0.40975624, 0.24989856])
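The help text mentions the recipe `(b - a) * random() + a` for sampling uniformly from \([a, b)\). A minimal sketch of that recipe, with the endpoints `a = 2` and `b = 5` and the sample size chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng()

a, b = 2.0, 5.0  # hypothetical endpoints of the interval [a, b)

# rng.random returns floats in [0, 1); rescale by (b - a) and shift by a.
samples = (b - a) * rng.random(1000) + a

print(samples.min(), samples.max())
```

The call `rng.uniform(a, b, size=1000)` draws from the same distribution directly.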