Week 2 Monday#

You can find the notesbooks at course notes.

Announcements#

  • HW2 distributed today.

  • HW1 is due 11:59pm tonight on Canvas.

  • Quiz 1 is Tuesday during discussion section.

Use Pandas to import the dataset#

  • Read in the data directly from our-own.csv using the pandas function read_csv and store it with the variable name df.

One advantage of this approach: the column names show up.

import pandas as pd
df = pd.read_csv('our-own.csv')
df
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
1 5.81 15 Bob 4.4
2 4.99 2 Eve 2.0
type(df)
pandas.core.frame.DataFrame
  • Evaluate the dtypes attribute of df. Notice how different columns hold different data types.

(A string data type, like "Alice", in pandas is usually reported as being an “object” data type.)

The fact that pandas DataFrames can have different columns with different data types is one of the big advantages of a pandas DataFrame over a NumPy array.

df.dtypes
Cost        float64
Quantity      int64
Seller       object
Rating      float64
dtype: object
  • Define a variable col to be equal to the “Rating” column.

There are many types of indexing in pandas, and you should expect to need some practice before you get used to them. By Column Label: Access a specific column using its label.

col = df['Rating']
col
0    4.7
1    4.4
2    2.0
Name: Rating, dtype: float64
  • What is the type of df? What is the type of col? These are the two most important data types in pandas.

type(df)
pandas.core.frame.DataFrame
type(col)
pandas.core.series.Series
  • How many rows and columns are there in df? Just like in NumPy, pandas DataFrames have a shape attribute.

df.shape
(3, 4)
  • What the average of the values in the “Rating” column? Use the mean method.

Notice the exact same numerical precision issue is showing up (even with the same 6 at the end of the decimal). The library pandas has lots of advantages, but it does not solve numerical precision issues.

df['Rating'].mean()
3.7000000000000006

Ways to index in pandas#

There are two commonly used ways to index in pandas: using labels (with loc) and using integer position (with iloc).

df
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
1 5.81 15 Bob 4.4
2 4.99 2 Eve 2.0

By Row Index: Access a specific row using the .loc accessor with its index label.

df.loc[0]
Cost         6.43
Quantity       10
Seller      Alice
Rating        4.7
Name: 0, dtype: object
  • How can we access the entry in the row at index 2 and the column named “Seller” using loc?

.loc Accessor#

Single Label: Select data at a particular row or column label.

df.loc[2,'Seller']
'Eve'

List of Labels: Select multiple rows or columns using a list of labels.

df.loc[[0, 1], ['Seller', 'Rating']]
Seller Rating
0 Alice 4.7
1 Bob 4.4

Slice of Labels: Select a range of rows or columns using label slices.

The slice() function returns a slice object. slice(start (default 0), end)

df.loc[:1, 'Quantity':'Rating'] 
#equivalent to
#df.loc[slice(0,1), slice('Quantity','Rating')] 
Quantity Seller Rating
0 10 Alice 4.7
1 15 Bob 4.4

Boolean Array: Select rows or columns based on a boolean condition.

df.loc[df['Quantity'] < 12] #select rows
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
2 4.99 2 Eve 2.0
  • How can we access that same value using iloc?

.iloc Accessor#

Single Integer: Select data at a particular row or column integer index.

df.iloc[2,2]
'Eve'

List of Integers: Select multiple rows or columns using a list of integer indices.

df.iloc[[0, 2], [0, 2]]
Cost Seller
0 6.43 Alice
2 4.99 Eve

Slice of Integers: Select a range of rows or columns using integer index slices.

Remember that .loc includes both the start and the end of the slice, while .iloc includes the start but excludes the end.

df.iloc[0:1, 0:1]
Cost
0 6.43

Boolean indexing in pandas#

Boolean indexing in pandas works very similarly to how it works in NumPy.

  • Define sub_df to be the sub-DataFrame of df containing all the rows for which the quantity is strictly less than 12.

df['Quantity'] < 12 #indexing using a Boolean Series
0     True
1    False
2     True
Name: Quantity, dtype: bool
type(df['Quantity'] < 12)
pandas.core.series.Series

The Boolean Series says to keep rows 0 and 2 but to discard row 1.

sub_df = df[df['Quantity'] < 12]
sub_df
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
2 4.99 2 Eve 2.0
# alternative way
df.loc[df['Quantity'] < 12]
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
2 4.99 2 Eve 2.0

Multiple Conditions: Combine multiple conditions using & (and), | (or), and ~ (not).

df[(df['Quantity'] > 2) & (df['Quantity'] < 12)]
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7

We can use 2 with loc, since there is row with label 2. However, there is not a row at integer location 2.

sub_df.loc[2,'Rating']
2.0
sub_df.iloc[2,3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In [16], line 1
----> 1 sub_df.iloc[2,3]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:889, in _LocationIndexer.__getitem__(self, key)
    886         with suppress(KeyError, IndexError, AttributeError):
    887             # AttributeError for IntervalTree get_value
    888             return self.obj._get_value(*key, takeable=self._takeable)
--> 889     return self._getitem_tuple(key)
    890 else:
    891     # we by definition only have the 0th axis
    892     axis = self.axis or 0

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1450, in _iLocIndexer._getitem_tuple(self, tup)
   1448 def _getitem_tuple(self, tup: Tuple):
-> 1450     self._has_valid_tuple(tup)
   1451     with suppress(IndexingError):
   1452         return self._getitem_lowerdim(tup)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:723, in _LocationIndexer._has_valid_tuple(self, key)
    721 for i, k in enumerate(key):
    722     try:
--> 723         self._validate_key(k, i)
    724     except ValueError as err:
    725         raise ValueError(
    726             "Location based indexing can only have "
    727             f"[{self._valid_types}] types"
    728         ) from err

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1358, in _iLocIndexer._validate_key(self, key, axis)
   1356     return
   1357 elif is_integer(key):
-> 1358     self._validate_integer(key, axis)
   1359 elif isinstance(key, tuple):
   1360     # a tuple should already have been caught by this point
   1361     # so don't treat a tuple as a valid indexer
   1362     raise IndexingError("Too many indexers")

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1444, in _iLocIndexer._validate_integer(self, key, axis)
   1442 len_axis = len(self.obj._get_axis(axis))
   1443 if key >= len_axis or key < -len_axis:
-> 1444     raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds
  • What is the average cost for those rows?

sub_df['Cost'].mean()
5.71

Slicing#

Row Slicing: Slice rows using integer indices.

  • Take a look at first two rows

df[:2]
Cost Quantity Seller Rating
0 6.43 10 Alice 4.7
1 5.81 15 Bob 4.4

Column Slicing: Slice columns using the .loc accessor and label slices.

df.loc[:, 'Quantity':'Rating']
Quantity Seller Rating
0 10 Alice 4.7
1 15 Bob 4.4
2 2 Eve 2.0
df.iloc[:, 1:3]
Quantity Seller
0 10 Alice
1 15 Bob
2 2 Eve

Remember that .loc includes both the start and the end of the slice, while .iloc includes the start but excludes the end.

Exploring the taxis dataset#

  • Load the attached file taxis.csv using the pandas function read_csv. Store the resulting DataFrame in the variable df.

import pandas as pd
df = pd.read_csv("taxis.csv")

Get a sense for the contents of df using the following.

  • The shape attribute.

  • The head method, to view the first few rows.

  • The sample method, to view a random collection of rows.

  • The info method.

  • The describe method.

The shape attribute reports the number of rows (6433 in this case) and the number of columns (14).

df.shape
(6433, 14)
df.head(10) # we look at its first 10 rows
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
5 2019-03-11 10:37:23 2019-03-11 10:47:31 1 0.49 7.5 2.16 0.0 12.96 yellow credit card Times Sq/Theatre District Midtown East Manhattan Manhattan
6 2019-03-26 21:07:31 2019-03-26 21:17:29 1 3.65 13.0 2.00 0.0 18.80 yellow credit card Battery Park City Two Bridges/Seward Park Manhattan Manhattan
7 2019-03-22 12:47:13 2019-03-22 12:58:17 0 1.40 8.5 0.00 0.0 11.80 yellow NaN Murray Hill Flatiron Manhattan Manhattan
8 2019-03-23 11:48:50 2019-03-23 12:06:14 1 3.63 15.0 1.00 0.0 19.30 yellow credit card East Harlem South Midtown Center Manhattan Manhattan
9 2019-03-08 16:18:37 2019-03-08 16:26:57 1 1.52 8.0 1.00 0.0 13.30 yellow credit card Lincoln Square East Central Park Manhattan Manhattan
df[:5] #first 5 rows
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan

The sample method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.

df.sample(3)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
2768 2019-03-14 18:46:48 2019-03-14 18:54:48 1 0.60 6.5 2.70 0.0 13.50 yellow credit card SoHo Little Italy/NoLiTa Manhattan Manhattan
6147 2019-03-09 22:22:28 2019-03-09 22:46:25 1 4.03 18.0 4.41 0.0 26.46 green credit card Central Harlem Times Sq/Theatre District Manhattan Manhattan
4962 2019-03-09 15:41:40 2019-03-09 15:56:57 1 2.54 12.5 3.16 0.0 18.96 yellow credit card Upper West Side South Lenox Hill West Manhattan Manhattan

The info method contains lots of iinformation. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you’ve loaded a huge DataFrame from some external source). For each row, it tells us the data type (in the usual rough sense, for example, strings are reported as the “object” data type).

A huge issue in data science is missing data (and how to deal with it). This info method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44 values are missing from the “payment” column.)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pickup           6433 non-null   object 
 1   dropoff          6433 non-null   object 
 2   passengers       6433 non-null   int64  
 3   distance         6433 non-null   float64
 4   fare             6433 non-null   float64
 5   tip              6433 non-null   float64
 6   tolls            6433 non-null   float64
 7   total            6433 non-null   float64
 8   color            6433 non-null   object 
 9   payment          6389 non-null   object 
 10  pickup_zone      6407 non-null   object 
 11  dropoff_zone     6388 non-null   object 
 12  pickup_borough   6407 non-null   object 
 13  dropoff_borough  6388 non-null   object 
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB

The describe method is returns information about the distribution of numbers in numeric columns. For example, we can see that the median fare is 6.5 (the 50-th percentile), whereas the mean fare is 13.09.

df.describe()
passengers distance fare tip tolls total
count 6433.000000 6433.000000 6433.000000 6433.00000 6433.000000 6433.000000
mean 1.539251 3.024617 13.091073 1.97922 0.325273 18.517794
std 1.203768 3.827867 11.551804 2.44856 1.415267 13.815570
min 0.000000 0.000000 1.000000 0.00000 0.000000 1.300000
25% 1.000000 0.980000 6.500000 0.00000 0.000000 10.800000
50% 1.000000 1.640000 9.500000 1.70000 0.000000 14.160000
75% 2.000000 3.210000 15.000000 2.80000 0.000000 20.300000
max 6.000000 36.700000 150.000000 33.20000 24.020000 174.820000