Week 2 Monday

Week 2 Monday#

You can find the notesbooks at course notes.

Announcements#

HW2 distributed today.
HW1 is due 11:59pm tonight on Canvas.
Quiz 1 is Tuesday during discussion section.

Use Pandas to import the dataset#

Read in the data directly from our-own.csv using the pandas function read_csv and store it with the variable name df.

One advantage of this approach: the column names show up.

import pandas as pd
df = pd.read_csv('our-own.csv')
df

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
1	5.81	15	Bob	4.4
2	4.99	2	Eve	2.0

type(df)

pandas.core.frame.DataFrame

Evaluate the dtypes attribute of df. Notice how different columns hold different data types.

(A string data type, like "Alice", in pandas is usually reported as being an “object” data type.)

The fact that pandas DataFrames can have different columns with different data types is one of the big advantages of a pandas DataFrame over a NumPy array.

df.dtypes

Cost        float64
Quantity      int64
Seller       object
Rating      float64
dtype: object

Define a variable col to be equal to the “Rating” column.

There are many types of indexing in pandas, and you should expect to need some practice before you get used to them. By Column Label: Access a specific column using its label.

col = df['Rating']
col

  4.7
  4.4
  2.0
Name: Rating, dtype: float64

What is the type of df? What is the type of col? These are the two most important data types in pandas.

type(df)

pandas.core.frame.DataFrame

type(col)

pandas.core.series.Series

How many rows and columns are there in df? Just like in NumPy, pandas DataFrames have a shape attribute.

df.shape

(3, 4)

What the average of the values in the “Rating” column? Use the mean method.

Notice the exact same numerical precision issue is showing up (even with the same 6 at the end of the decimal). The library pandas has lots of advantages, but it does not solve numerical precision issues.

df['Rating'].mean()

3.7000000000000006

Ways to index in pandas#

There are two commonly used ways to index in pandas: using labels (with loc) and using integer position (with iloc).

df

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
1	5.81	15	Bob	4.4
2	4.99	2	Eve	2.0

By Row Index: Access a specific row using the .loc accessor with its index label.

df.loc[0]

Cost         6.43
Quantity       10
Seller      Alice
Rating        4.7
Name: 0, dtype: object

How can we access the entry in the row at index 2 and the column named “Seller” using loc?

`.loc` Accessor#

Single Label: Select data at a particular row or column label.

df.loc[2,'Seller']

'Eve'

List of Labels: Select multiple rows or columns using a list of labels.

df.loc[[0, 1], ['Seller', 'Rating']]

	Seller	Rating
0	Alice	4.7
1	Bob	4.4

Slice of Labels: Select a range of rows or columns using label slices.

The slice() function returns a slice object. slice(start (default 0), end)

df.loc[:1, 'Quantity':'Rating'] 
#equivalent to
#df.loc[slice(0,1), slice('Quantity','Rating')] 

	Quantity	Seller	Rating
0	10	Alice	4.7
1	15	Bob	4.4

Boolean Array: Select rows or columns based on a boolean condition.

df.loc[df['Quantity'] < 12] #select rows

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
2	4.99	2	Eve	2.0

How can we access that same value using iloc?

`.iloc` Accessor#

Single Integer: Select data at a particular row or column integer index.

df.iloc[2,2]

'Eve'

List of Integers: Select multiple rows or columns using a list of integer indices.

df.iloc[[0, 2], [0, 2]]

	Cost	Seller
0	6.43	Alice
2	4.99	Eve

Slice of Integers: Select a range of rows or columns using integer index slices.

Remember that .loc includes both the start and the end of the slice, while .iloc includes the start but excludes the end.

df.iloc[0:1, 0:1]

	Cost
0	6.43

Boolean indexing in pandas#

Boolean indexing in pandas works very similarly to how it works in NumPy.

Define sub_df to be the sub-DataFrame of df containing all the rows for which the quantity is strictly less than 12.

df['Quantity'] < 12 #indexing using a Boolean Series

   True
  False
   True
Name: Quantity, dtype: bool

type(df['Quantity'] < 12)

pandas.core.series.Series

The Boolean Series says to keep rows 0 and 2 but to discard row 1.

sub_df = df[df['Quantity'] < 12]
sub_df

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
2	4.99	2	Eve	2.0

# alternative way
df.loc[df['Quantity'] < 12]

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
2	4.99	2	Eve	2.0

Multiple Conditions: Combine multiple conditions using & (and), | (or), and ~ (not).

df[(df['Quantity'] > 2) & (df['Quantity'] < 12)]

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7

We can use 2 with loc, since there is row with label 2. However, there is not a row at integer location 2.

sub_df.loc[2,'Rating']

2.0

sub_df.iloc[2,3]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In [16], line 1
----> 1 sub_df.iloc[2,3]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:889, in _LocationIndexer.__getitem__(self, key)
    886         with suppress(KeyError, IndexError, AttributeError):
    887             # AttributeError for IntervalTree get_value
    888             return self.obj._get_value(*key, takeable=self._takeable)
--> 889     return self._getitem_tuple(key)
    890 else:
    891     # we by definition only have the 0th axis
    892     axis = self.axis or 0

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1450, in _iLocIndexer._getitem_tuple(self, tup)
   1448 def _getitem_tuple(self, tup: Tuple):
-> 1450     self._has_valid_tuple(tup)
   1451     with suppress(IndexingError):
   1452         return self._getitem_lowerdim(tup)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:723, in _LocationIndexer._has_valid_tuple(self, key)
    721 for i, k in enumerate(key):
    722     try:
--> 723         self._validate_key(k, i)
    724     except ValueError as err:
    725         raise ValueError(
    726             "Location based indexing can only have "
    727             f"[{self._valid_types}] types"
    728         ) from err

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1358, in _iLocIndexer._validate_key(self, key, axis)
   1356     return
   1357 elif is_integer(key):
-> 1358     self._validate_integer(key, axis)
   1359 elif isinstance(key, tuple):
   1360     # a tuple should already have been caught by this point
   1361     # so don't treat a tuple as a valid indexer
   1362     raise IndexingError("Too many indexers")

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1444, in _iLocIndexer._validate_integer(self, key, axis)
   1442 len_axis = len(self.obj._get_axis(axis))
   1443 if key >= len_axis or key < -len_axis:
-> 1444     raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds

What is the average cost for those rows?

sub_df['Cost'].mean()

5.71

Slicing#

Row Slicing: Slice rows using integer indices.

Take a look at first two rows

df[:2]

	Cost	Quantity	Seller	Rating
0	6.43	10	Alice	4.7
1	5.81	15	Bob	4.4

Column Slicing: Slice columns using the .loc accessor and label slices.

df.loc[:, 'Quantity':'Rating']

	Quantity	Seller	Rating
0	10	Alice	4.7
1	15	Bob	4.4
2	2	Eve	2.0

df.iloc[:, 1:3]

	Quantity	Seller
0	10	Alice
1	15	Bob
2	2	Eve

Remember that .loc includes both the start and the end of the slice, while .iloc includes the start but excludes the end.

Exploring the taxis dataset#

Load the attached file taxis.csv using the pandas function read_csv. Store the resulting DataFrame in the variable df.

import pandas as pd
df = pd.read_csv("taxis.csv")

Get a sense for the contents of df using the following.

The shape attribute.
The head method, to view the first few rows.
The sample method, to view a random collection of rows.
The info method.
The describe method.

The shape attribute reports the number of rows (6433 in this case) and the number of columns (14).

df.shape

(6433, 14)

df.head(10) # we look at its first 10 rows

	pickup	dropoff	passengers	distance	fare	tip	total	color	payment	pickup_zone	dropoff_zone	pickup_borough	dropoff_borough
0	2019-03-23 20:21:09	2019-03-23 20:27:24	1	1.60	7.0	2.15	12.95	yellow	credit card	Lenox Hill West	UN/Turtle Bay South	Manhattan	Manhattan
1	2019-03-04 16:11:55	2019-03-04 16:19:00	1	0.79	5.0	0.00	9.30	yellow	cash	Upper West Side South	Upper West Side South	Manhattan	Manhattan
2	2019-03-27 17:53:01	2019-03-27 18:00:25	1	1.37	7.5	2.36	14.16	yellow	credit card	Alphabet City	West Village	Manhattan	Manhattan
3	2019-03-10 01:23:59	2019-03-10 01:49:51	1	7.70	27.0	6.15	36.95	yellow	credit card	Hudson Sq	Yorkville West	Manhattan	Manhattan
4	2019-03-30 13:27:42	2019-03-30 13:37:14	3	2.16	9.0	1.10	13.40	yellow	credit card	Midtown East	Yorkville West	Manhattan	Manhattan
5	2019-03-11 10:37:23	2019-03-11 10:47:31	1	0.49	7.5	2.16	12.96	yellow	credit card	Times Sq/Theatre District	Midtown East	Manhattan	Manhattan
6	2019-03-26 21:07:31	2019-03-26 21:17:29	1	3.65	13.0	2.00	18.80	yellow	credit card	Battery Park City	Two Bridges/Seward Park	Manhattan	Manhattan
7	2019-03-22 12:47:13	2019-03-22 12:58:17	0	1.40	8.5	0.00	11.80	yellow	NaN	Murray Hill	Flatiron	Manhattan	Manhattan
8	2019-03-23 11:48:50	2019-03-23 12:06:14	1	3.63	15.0	1.00	19.30	yellow	credit card	East Harlem South	Midtown Center	Manhattan	Manhattan
9	2019-03-08 16:18:37	2019-03-08 16:26:57	1	1.52	8.0	1.00	13.30	yellow	credit card	Lincoln Square East	Central Park	Manhattan	Manhattan

df[:5] #first 5 rows

	pickup	dropoff	passengers	distance	fare	tip	total	color	payment	pickup_zone	dropoff_zone	pickup_borough	dropoff_borough
0	2019-03-23 20:21:09	2019-03-23 20:27:24	1	1.60	7.0	2.15	12.95	yellow	credit card	Lenox Hill West	UN/Turtle Bay South	Manhattan	Manhattan
1	2019-03-04 16:11:55	2019-03-04 16:19:00	1	0.79	5.0	0.00	9.30	yellow	cash	Upper West Side South	Upper West Side South	Manhattan	Manhattan
2	2019-03-27 17:53:01	2019-03-27 18:00:25	1	1.37	7.5	2.36	14.16	yellow	credit card	Alphabet City	West Village	Manhattan	Manhattan
3	2019-03-10 01:23:59	2019-03-10 01:49:51	1	7.70	27.0	6.15	36.95	yellow	credit card	Hudson Sq	Yorkville West	Manhattan	Manhattan
4	2019-03-30 13:27:42	2019-03-30 13:37:14	3	2.16	9.0	1.10	13.40	yellow	credit card	Midtown East	Yorkville West	Manhattan	Manhattan

The sample method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.

df.sample(3)

	pickup	dropoff	passengers	distance	fare	tip	total	color	payment	pickup_zone	dropoff_zone	pickup_borough	dropoff_borough
2768	2019-03-14 18:46:48	2019-03-14 18:54:48	1	0.60	6.5	2.70	13.50	yellow	credit card	SoHo	Little Italy/NoLiTa	Manhattan	Manhattan
6147	2019-03-09 22:22:28	2019-03-09 22:46:25	1	4.03	18.0	4.41	26.46	green	credit card	Central Harlem	Times Sq/Theatre District	Manhattan	Manhattan
4962	2019-03-09 15:41:40	2019-03-09 15:56:57	1	2.54	12.5	3.16	18.96	yellow	credit card	Upper West Side South	Lenox Hill West	Manhattan	Manhattan

The info method contains lots of iinformation. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you’ve loaded a huge DataFrame from some external source). For each row, it tells us the data type (in the usual rough sense, for example, strings are reported as the “object” data type).

A huge issue in data science is missing data (and how to deal with it). This info method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44 values are missing from the “payment” column.)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pickup           6433 non-null   object 
 1   dropoff          6433 non-null   object 
 2   passengers       6433 non-null   int64  
 3   distance         6433 non-null   float64
 4   fare             6433 non-null   float64
 5   tip              6433 non-null   float64
 6   tolls            6433 non-null   float64
 7   total            6433 non-null   float64
 8   color            6433 non-null   object 
 9   payment          6389 non-null   object 
 10  pickup_zone      6407 non-null   object 
 11  dropoff_zone     6388 non-null   object 
 12  pickup_borough   6407 non-null   object 
 13  dropoff_borough  6388 non-null   object 
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB

The describe method is returns information about the distribution of numbers in numeric columns. For example, we can see that the median fare is 6.5 (the 50-th percentile), whereas the mean fare is 13.09.

df.describe()

	passengers	distance	fare	tip	tolls	total
count	6433.000000	6433.000000	6433.000000	6433.00000	6433.000000	6433.000000
mean	1.539251	3.024617	13.091073	1.97922	0.325273	18.517794
std	1.203768	3.827867	11.551804	2.44856	1.415267	13.815570
min	0.000000	0.000000	1.000000	0.00000	0.000000	1.300000
25%	1.000000	0.980000	6.500000	0.00000	0.000000	10.800000
50%	1.000000	1.640000	9.500000	1.70000	0.000000	14.160000
75%	2.000000	3.210000	15.000000	2.80000	0.000000	20.300000
max	6.000000	36.700000	150.000000	33.20000	24.020000	174.820000