Week 4 Friday#
Announcements#
The midterm is Monday (10/30). A sample midterm and solutions are posted on Canvas.
The best way to study is to go over the HW, quizzes, and the sample midterm. Next priority would be the lecture notes.
HW4 is due Monday night (the usual schedule). Working on it will help you prepare for the midterm.
Plan:#
Lecture ~ 30 mins. Last 20 mins: work on HW or study for the midterm.
Load the “taxis” dataset from Seaborn and drop rows with missing values.
import seaborn as sns
import altair as alt
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
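As a quick sketch of what `dropna(axis=0)` does, here is a hypothetical mini-DataFrame (not the taxis data) with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-DataFrame with one missing value in column "a".
mini = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# dropna(axis=0) removes every row that contains at least one missing value.
cleaned = mini.dropna(axis=0)
print(len(cleaned))  # the middle row is dropped
```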
Here is a reminder of how the dataset looks.
df.head()
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Multiply the values in the columns from "distance" to "tolls" by 10.

Notice the `:` inside the square brackets, which says we want every row.
df.loc[:,"distance":"tolls"]*10
| | distance | fare | tip | tolls |
|---|---|---|---|---|
| 0 | 16.0 | 70.0 | 21.5 | 0.0 |
| 1 | 7.9 | 50.0 | 0.0 | 0.0 |
| 2 | 13.7 | 75.0 | 23.6 | 0.0 |
| 3 | 77.0 | 270.0 | 61.5 | 0.0 |
| 4 | 21.6 | 90.0 | 11.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 6428 | 7.5 | 45.0 | 10.6 | 0.0 |
| 6429 | 187.4 | 580.0 | 0.0 | 0.0 |
| 6430 | 41.4 | 160.0 | 0.0 | 0.0 |
| 6431 | 11.2 | 60.0 | 0.0 | 0.0 |
| 6432 | 38.5 | 150.0 | 33.6 | 0.0 |
6341 rows Ă— 4 columns
Here is another way, using a lambda function and `applymap`. This is definitely more confusing than simply writing `*10` like we did above; we're just showing it as an example of using `applymap` to perform this basic task.
df.loc[:,"distance":"tolls"].applymap(lambda x: x*10)
| | distance | fare | tip | tolls |
|---|---|---|---|---|
| 0 | 16.0 | 70.0 | 21.5 | 0.0 |
| 1 | 7.9 | 50.0 | 0.0 | 0.0 |
| 2 | 13.7 | 75.0 | 23.6 | 0.0 |
| 3 | 77.0 | 270.0 | 61.5 | 0.0 |
| 4 | 21.6 | 90.0 | 11.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 6428 | 7.5 | 45.0 | 10.6 | 0.0 |
| 6429 | 187.4 | 580.0 | 0.0 | 0.0 |
| 6430 | 41.4 | 160.0 | 0.0 | 0.0 |
| 6431 | 11.2 | 60.0 | 0.0 | 0.0 |
| 6432 | 38.5 | 150.0 | 33.6 | 0.0 |
6341 rows Ă— 4 columns
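One compatibility note, as an aside: in pandas 2.1 and later, `DataFrame.applymap` was renamed to `DataFrame.map`. Here is a minimal sketch on a hypothetical two-column frame (not the taxis data), checking that elementwise application agrees with plain arithmetic:

```python
import pandas as pd

# Hypothetical mini-DataFrame (not the taxis data).
mini = pd.DataFrame({"distance": [1.6, 0.79], "fare": [7.0, 5.0]})

# In pandas >= 2.1, DataFrame.applymap was renamed to DataFrame.map;
# fall back to applymap on older versions.
elementwise = getattr(mini, "map", None) or mini.applymap
result = elementwise(lambda x: x * 10)

# Elementwise application agrees with plain arithmetic.
assert result.equals(mini * 10)
```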
Here is an example of slicing using `iloc`. We get the columns from integer location 3 up to, but not including, integer location 7.
df.iloc[:,3:7]*10
| | distance | fare | tip | tolls |
|---|---|---|---|---|
| 0 | 16.0 | 70.0 | 21.5 | 0.0 |
| 1 | 7.9 | 50.0 | 0.0 | 0.0 |
| 2 | 13.7 | 75.0 | 23.6 | 0.0 |
| 3 | 77.0 | 270.0 | 61.5 | 0.0 |
| 4 | 21.6 | 90.0 | 11.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 6428 | 7.5 | 45.0 | 10.6 | 0.0 |
| 6429 | 187.4 | 580.0 | 0.0 | 0.0 |
| 6430 | 41.4 | 160.0 | 0.0 | 0.0 |
| 6431 | 11.2 | 60.0 | 0.0 | 0.0 |
| 6432 | 38.5 | 150.0 | 33.6 | 0.0 |
6341 rows Ă— 4 columns
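To make the exclusive endpoint concrete, here is a minimal sketch on a hypothetical five-column frame. (By contrast, label-based slicing with `loc`, like `"distance":"tolls"` above, includes both endpoints.)

```python
import pandas as pd

# Hypothetical five-column frame to illustrate the exclusive endpoint.
mini = pd.DataFrame([[0, 1, 2, 3, 4]], columns=list("abcde"))

# iloc slicing excludes the right endpoint: locations 1 through 3 give
# columns "b", "c", "d" (but not "e").
sub = mini.iloc[:, 1:4]
print(list(sub.columns))
```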
Practice with groupby and f-strings#
Make a new column named "Day" in the DataFrame, containing the day of the pickup (as a string, like "Monday").
Here we add a new column on the far right side of the DataFrame. Notice that, unusually, we did not need to use `pd.to_datetime` first.
df['Day'] = df['pickup'].dt.day_name()
df
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | Saturday |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | Monday |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | Wednesday |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | Sunday |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | Saturday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | Sunday |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | Sunday |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | Saturday |
| 6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | Monday |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | Wednesday |
6341 rows Ă— 15 columns
Why didn't we have to use `pd.to_datetime`? Evaluate `df.dtypes` and `df.dtypes["pickup"]` to see the reason.
The reason is that when we imported the DataFrame from Seaborn (as opposed to from an attached csv file), Seaborn automatically specified the appropriate data types for these columns.
df.dtypes
pickup datetime64[ns]
dropoff datetime64[ns]
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
Day object
dtype: object
Here is an example of getting the value corresponding to the index entry "pickup". (The value is displayed a little strangely: `<M8[ns]` is NumPy's shorthand for the `datetime64[ns]` data type.)
df.dtypes['pickup']
dtype('<M8[ns]')
If we look at the value corresponding to "distance", it's easier to read, because we recognize the term `float` for this numeric column.
df.dtypes['distance']
dtype('float64')
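To see the contrast, here is a sketch using made-up timestamp strings like those that would come from a csv file. The `.dt` accessor only becomes available after converting with `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical strings, as they would be read from a csv file.
raw = pd.Series(["2019-03-23 20:21:09", "2019-03-04 16:11:55"])
assert raw.dtype == object  # plain strings: no .dt accessor available

# After conversion, the Series has a datetime dtype and .dt works.
converted = pd.to_datetime(raw)
assert "datetime64" in str(converted.dtype)
print(list(converted.dt.day_name()))
```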
What day occurs most often in this dataset?
vc = df['Day'].value_counts()
vc
Friday 1091
Saturday 1028
Wednesday 958
Thursday 896
Sunday 857
Tuesday 811
Monday 700
Name: Day, dtype: int64
Here we check the index explicitly, using the `index` attribute.
vc.index
Index(['Friday', 'Saturday', 'Wednesday', 'Thursday', 'Sunday', 'Tuesday',
'Monday'],
dtype='object')
If instead we want the values, we can use the `values` attribute.
vc.values
array([1091, 1028, 958, 896, 857, 811, 700])
If we want to know the most frequent entry, we can get the initial element of the `value_counts` index.

vc.index[0]  # most frequent
'Friday'
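This works because `value_counts` sorts by count in descending order. A minimal sketch on a hypothetical Series, also showing two equivalent ways to get the most frequent entry:

```python
import pandas as pd

# Hypothetical mini-Series (not the taxis data).
s = pd.Series(["Fri", "Sat", "Fri", "Wed", "Fri"])
vc = s.value_counts()

# value_counts sorts by count in descending order, so index[0] is the
# most frequent entry; idxmax and mode give the same answer.
print(vc.index[0], vc.idxmax(), s.mode()[0])
```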
Can you answer the same thing using Altair?
df.shape
(6341, 15)
By default, Altair does not allow us to use DataFrames with more than 5000 rows; if you try, Altair raises a `MaxRowsError` explaining the limit.
Here we tell Altair to allow up to 7,000 rows. Warning: don't do this with huge datasets; I wouldn't try it with more than about 20,000 rows at most. (Don't worry about learning this `data_transformers` syntax for the midterms or quizzes.)

alt.data_transformers.enable('default', max_rows=7000)
DataTransformerRegistry.enable('default')
Here we put the days of the week along the x-axis, and we use the special Altair syntax `"count()"` to make the height of each bar equal to the number of entries for that day. Notice that, as we saw above, Friday is the most common.

It's a little annoying that the days are alphabetized rather than given in consecutive order.
import altair as alt
alt.Chart(df).mark_bar().encode(
    x="Day",
    y="count()"
)
Here is a way to reorder the days.
alt.Chart(df).mark_bar().encode(
    x=alt.X("Day", sort=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]),
    y="count()"
)
For each day of the week, what was the median tip for that day? Put the results into a pandas Series using `groupby`.
df.groupby("Day")["tip"].median()
Day
Friday 1.750
Monday 1.770
Saturday 1.500
Sunday 1.560
Thursday 1.835
Tuesday 1.760
Wednesday 1.850
Name: tip, dtype: float64
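Here is a sketch of the same pattern on a hypothetical mini-DataFrame. Selecting the column before calling `median` restricts the computation to just that column, which also avoids taking medians of non-numeric columns:

```python
import pandas as pd

# Hypothetical mini-version of the computation above.
mini = pd.DataFrame({"Day": ["Mon", "Mon", "Tue"], "tip": [1.0, 3.0, 2.0]})

# Group the rows by "Day", then take the median of "tip" within each group.
med = mini.groupby("Day")["tip"].median()
print(med["Mon"], med["Tue"])
```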
Can you answer the same thing using Altair?
Instead of `"count()"`, we could use another aggregation like `"median"`, but in that case we need to specify which value we are taking the median of. Here we take the median of the "tip" column.
Using the tooltip, we can see that the median value is 1.76 for the "tip" column on Tuesdays.
alt.Chart(df).mark_bar().encode(
    x="Day",
    y="median(tip)",
    tooltip=["median(tip)"]
)
For each day of the week, print out a string stating the median tip amount for that day. Use `:.2f` in the string formatting so that only two decimal places get printed.
median_tip = df.groupby("Day")["tip"].median()
median_tip
Day
Friday 1.750
Monday 1.770
Saturday 1.500
Sunday 1.560
Thursday 1.835
Tuesday 1.760
Wednesday 1.850
Name: tip, dtype: float64
median_tip.index
Index(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
'Wednesday'],
dtype='object', name='Day')
median_tip.values
array([1.75 , 1.77 , 1.5 , 1.56 , 1.835, 1.76 , 1.85 ])
for i in range(len(median_tip)):
    day = median_tip.index[i]
    m_tip = median_tip.values[i]
    print(f"The median tip on {day} is {m_tip:.2f}")
The median tip on Friday is 1.75
The median tip on Monday is 1.77
The median tip on Saturday is 1.50
The median tip on Sunday is 1.56
The median tip on Thursday is 1.83
The median tip on Tuesday is 1.76
The median tip on Wednesday is 1.85
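An alternative to positional indexing is `Series.items()`, which yields (index, value) pairs directly. A sketch with a hypothetical two-day Series standing in for `median_tip`:

```python
import pandas as pd

# Hypothetical Series standing in for median_tip above.
medians = pd.Series([1.75, 1.5], index=["Friday", "Saturday"])

# Series.items() yields (index, value) pairs, avoiding positional indexing.
lines = [f"The median tip on {day} is {m:.2f}" for day, m in medians.items()]
print(lines[0])
```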