Week 2 Monday#
You can find the notesbooks at course notes.
Announcements#
HW2 distributed today.
HW1 is due 11:59pm tonight on Canvas.
Quiz 1 is Tuesday during discussion section.
Use Pandas to import the dataset#
Read in the data directly from
our-own.csv
using the pandas functionread_csv
and store it with the variable namedf
.
One advantage of this approach: the column names show up.
import pandas as pd
df = pd.read_csv('our-own.csv')
df
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
1 | 5.81 | 15 | Bob | 4.4 |
2 | 4.99 | 2 | Eve | 2.0 |
type(df)
pandas.core.frame.DataFrame
Evaluate the
dtypes
attribute ofdf
. Notice how different columns hold different data types.
(A string data type, like "Alice"
, in pandas is usually reported as being an “object” data type.)
The fact that pandas DataFrames can have different columns with different data types is one of the big advantages of a pandas DataFrame over a NumPy array.
df.dtypes
Cost float64
Quantity int64
Seller object
Rating float64
dtype: object
Define a variable
col
to be equal to the “Rating” column.
There are many types of indexing in pandas, and you should expect to need some practice before you get used to them. By Column Label: Access a specific column using its label.
col = df['Rating']
col
0 4.7
1 4.4
2 2.0
Name: Rating, dtype: float64
What is the type of
df
? What is the type ofcol
? These are the two most important data types in pandas.
type(df)
pandas.core.frame.DataFrame
type(col)
pandas.core.series.Series
How many rows and columns are there in
df
? Just like in NumPy, pandas DataFrames have ashape
attribute.
df.shape
(3, 4)
What the average of the values in the “Rating” column? Use the
mean
method.
Notice the exact same numerical precision issue is showing up (even with the same 6
at the end of the decimal). The library pandas has lots of advantages, but it does not solve numerical precision issues.
df['Rating'].mean()
3.7000000000000006
Ways to index in pandas#
There are two commonly used ways to index in pandas: using labels (with loc
) and using integer position (with iloc
).
df
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
1 | 5.81 | 15 | Bob | 4.4 |
2 | 4.99 | 2 | Eve | 2.0 |
By Row Index: Access a specific row using the .loc accessor with its index label.
df.loc[0]
Cost 6.43
Quantity 10
Seller Alice
Rating 4.7
Name: 0, dtype: object
How can we access the entry in the row at index 2 and the column named “Seller” using
loc
?
.loc
Accessor#
Single Label: Select data at a particular row or column label.
df.loc[2,'Seller']
'Eve'
List of Labels: Select multiple rows or columns using a list of labels.
df.loc[[0, 1], ['Seller', 'Rating']]
Seller | Rating | |
---|---|---|
0 | Alice | 4.7 |
1 | Bob | 4.4 |
Slice of Labels: Select a range of rows or columns using label slices.
The slice()
function returns a slice object.
slice(start (default 0), end)
df.loc[:1, 'Quantity':'Rating']
#equivalent to
#df.loc[slice(0,1), slice('Quantity','Rating')]
Quantity | Seller | Rating | |
---|---|---|---|
0 | 10 | Alice | 4.7 |
1 | 15 | Bob | 4.4 |
Boolean Array: Select rows or columns based on a boolean condition.
df.loc[df['Quantity'] < 12] #select rows
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
2 | 4.99 | 2 | Eve | 2.0 |
How can we access that same value using
iloc
?
.iloc
Accessor#
Single Integer: Select data at a particular row or column integer index.
df.iloc[2,2]
'Eve'
List of Integers: Select multiple rows or columns using a list of integer indices.
df.iloc[[0, 2], [0, 2]]
Cost | Seller | |
---|---|---|
0 | 6.43 | Alice |
2 | 4.99 | Eve |
Slice of Integers: Select a range of rows or columns using integer index slices.
Remember that .loc
includes both the start and the end of the slice, while .iloc
includes the start but excludes the end.
df.iloc[0:1, 0:1]
Cost | |
---|---|
0 | 6.43 |
Boolean indexing in pandas#
Boolean indexing in pandas works very similarly to how it works in NumPy.
Define
sub_df
to be the sub-DataFrame ofdf
containing all the rows for which the quantity is strictly less than12
.
df['Quantity'] < 12 #indexing using a Boolean Series
0 True
1 False
2 True
Name: Quantity, dtype: bool
type(df['Quantity'] < 12)
pandas.core.series.Series
The Boolean Series says to keep rows 0
and 2
but to discard row 1
.
sub_df = df[df['Quantity'] < 12]
sub_df
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
2 | 4.99 | 2 | Eve | 2.0 |
# alternative way
df.loc[df['Quantity'] < 12]
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
2 | 4.99 | 2 | Eve | 2.0 |
Multiple Conditions: Combine multiple conditions using &
(and), |
(or), and ~
(not).
df[(df['Quantity'] > 2) & (df['Quantity'] < 12)]
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
We can use 2
with loc, since there is row with label 2
. However, there is not a row at integer location 2
.
sub_df.loc[2,'Rating']
2.0
sub_df.iloc[2,3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [16], line 1
----> 1 sub_df.iloc[2,3]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:889, in _LocationIndexer.__getitem__(self, key)
886 with suppress(KeyError, IndexError, AttributeError):
887 # AttributeError for IntervalTree get_value
888 return self.obj._get_value(*key, takeable=self._takeable)
--> 889 return self._getitem_tuple(key)
890 else:
891 # we by definition only have the 0th axis
892 axis = self.axis or 0
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1450, in _iLocIndexer._getitem_tuple(self, tup)
1448 def _getitem_tuple(self, tup: Tuple):
-> 1450 self._has_valid_tuple(tup)
1451 with suppress(IndexingError):
1452 return self._getitem_lowerdim(tup)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:723, in _LocationIndexer._has_valid_tuple(self, key)
721 for i, k in enumerate(key):
722 try:
--> 723 self._validate_key(k, i)
724 except ValueError as err:
725 raise ValueError(
726 "Location based indexing can only have "
727 f"[{self._valid_types}] types"
728 ) from err
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1358, in _iLocIndexer._validate_key(self, key, axis)
1356 return
1357 elif is_integer(key):
-> 1358 self._validate_integer(key, axis)
1359 elif isinstance(key, tuple):
1360 # a tuple should already have been caught by this point
1361 # so don't treat a tuple as a valid indexer
1362 raise IndexingError("Too many indexers")
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1444, in _iLocIndexer._validate_integer(self, key, axis)
1442 len_axis = len(self.obj._get_axis(axis))
1443 if key >= len_axis or key < -len_axis:
-> 1444 raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
What is the average cost for those rows?
sub_df['Cost'].mean()
5.71
Slicing#
Row Slicing: Slice rows using integer indices.
Take a look at first two rows
df[:2]
Cost | Quantity | Seller | Rating | |
---|---|---|---|---|
0 | 6.43 | 10 | Alice | 4.7 |
1 | 5.81 | 15 | Bob | 4.4 |
Column Slicing: Slice columns using the .loc
accessor and label slices.
df.loc[:, 'Quantity':'Rating']
Quantity | Seller | Rating | |
---|---|---|---|
0 | 10 | Alice | 4.7 |
1 | 15 | Bob | 4.4 |
2 | 2 | Eve | 2.0 |
df.iloc[:, 1:3]
Quantity | Seller | |
---|---|---|
0 | 10 | Alice |
1 | 15 | Bob |
2 | 2 | Eve |
Remember that .loc
includes both the start and the end of the slice, while .iloc
includes the start but excludes the end.
Exploring the taxis dataset#
Load the attached file
taxis.csv
using the pandas functionread_csv
. Store the resulting DataFrame in the variabledf
.
import pandas as pd
df = pd.read_csv("taxis.csv")
Get a sense for the contents of df
using the following.
The
shape
attribute.The
head
method, to view the first few rows.The
sample
method, to view a random collection of rows.The
info
method.The
describe
method.
The shape attribute reports the number of rows (6433
in this case) and the number of columns (14
).
df.shape
(6433, 14)
df.head(10) # we look at its first 10 rows
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.49 | 7.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
7 | 2019-03-22 12:47:13 | 2019-03-22 12:58:17 | 0 | 1.40 | 8.5 | 0.00 | 0.0 | 11.80 | yellow | NaN | Murray Hill | Flatiron | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.63 | 15.0 | 1.00 | 0.0 | 19.30 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.52 | 8.0 | 1.00 | 0.0 | 13.30 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
df[:5] #first 5 rows
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
The sample method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.
df.sample(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2768 | 2019-03-14 18:46:48 | 2019-03-14 18:54:48 | 1 | 0.60 | 6.5 | 2.70 | 0.0 | 13.50 | yellow | credit card | SoHo | Little Italy/NoLiTa | Manhattan | Manhattan |
6147 | 2019-03-09 22:22:28 | 2019-03-09 22:46:25 | 1 | 4.03 | 18.0 | 4.41 | 0.0 | 26.46 | green | credit card | Central Harlem | Times Sq/Theatre District | Manhattan | Manhattan |
4962 | 2019-03-09 15:41:40 | 2019-03-09 15:56:57 | 1 | 2.54 | 12.5 | 3.16 | 0.0 | 18.96 | yellow | credit card | Upper West Side South | Lenox Hill West | Manhattan | Manhattan |
The info
method contains lots of iinformation. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you’ve loaded a huge DataFrame from some external source). For each row, it tells us the data type (in the usual rough sense, for example, strings are reported as the “object” data type).
A huge issue in data science is missing data (and how to deal with it). This info
method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44
values are missing from the “payment” column.)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 6433 non-null object
1 dropoff 6433 non-null object
2 passengers 6433 non-null int64
3 distance 6433 non-null float64
4 fare 6433 non-null float64
5 tip 6433 non-null float64
6 tolls 6433 non-null float64
7 total 6433 non-null float64
8 color 6433 non-null object
9 payment 6389 non-null object
10 pickup_zone 6407 non-null object
11 dropoff_zone 6388 non-null object
12 pickup_borough 6407 non-null object
13 dropoff_borough 6388 non-null object
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB
The describe
method is returns information about the distribution of numbers in numeric columns. For example, we can see that the median fare is 6.5
(the 50-th percentile), whereas the mean fare is 13.09
.
df.describe()
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
count | 6433.000000 | 6433.000000 | 6433.000000 | 6433.00000 | 6433.000000 | 6433.000000 |
mean | 1.539251 | 3.024617 | 13.091073 | 1.97922 | 0.325273 | 18.517794 |
std | 1.203768 | 3.827867 | 11.551804 | 2.44856 | 1.415267 | 13.815570 |
min | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 1.300000 |
25% | 1.000000 | 0.980000 | 6.500000 | 0.00000 | 0.000000 | 10.800000 |
50% | 1.000000 | 1.640000 | 9.500000 | 1.70000 | 0.000000 | 14.160000 |
75% | 2.000000 | 3.210000 | 15.000000 | 2.80000 | 0.000000 | 20.300000 |
max | 6.000000 | 36.700000 | 150.000000 | 33.20000 | 24.020000 | 174.820000 |