Week 4 Friday#

Announcements#

  • The midterm is Monday (10/30). A sample midterm and solutions are posted on Canvas.

  • The best way to study is to go over the HW, quizzes, and the sample midterm. Next priority would be the lecture notes.

  • HW4 is due Monday night (the usual schedule). Working on it will help you prepare for the midterm.

Plan#

Lecture ~ 30 mins. Last 20 mins: work on HW or study for the midterm.

  • Load the “taxis” dataset from Seaborn and drop rows with missing values.

import seaborn as sns
import altair as alt
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
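
If you're curious how many rows dropna removed, you can compare lengths before and after. Here is a quick sketch using a small made-up DataFrame (the real taxis DataFrame works the same way):

```python
import numpy as np
import pandas as pd

# A tiny made-up DataFrame with some missing values
df_demo = pd.DataFrame({"fare": [7.0, np.nan, 9.0], "tip": [2.15, 0.0, np.nan]})

rows_before = len(df_demo)
df_demo = df_demo.dropna(axis=0)  # drop every row containing at least one missing value
rows_after = len(df_demo)

print(rows_before - rows_after)  # number of rows that were dropped
```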

Here is a reminder of how the dataset looks.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
  • Multiply the values in the columns from “distance” to “tolls” by 10.

Notice the : inside the square brackets, which says we want “every row”. Note also that, unlike Python’s usual slicing, this label-based slice includes both endpoints, so the “tolls” column is part of the result.

df.loc[:,"distance":"tolls"]*10
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns

Here is another way, using a lambda function and applymap. This is definitely more confusing than simply writing *10 as we did above; we show it just as an example of using applymap to perform this basic task.

df.loc[:,"distance":"tolls"].applymap(lambda x: x*10)
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns
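
For reference, here are a few equivalent ways to scale every entry, compared on a tiny made-up DataFrame. (On newer versions of pandas, applymap raises a deprecation warning in favor of DataFrame.map, which behaves the same way here.)

```python
import pandas as pd

df_demo = pd.DataFrame({"distance": [1.60, 0.79], "fare": [7.0, 5.0]})

a = df_demo * 10                        # vectorized, the simplest option
b = df_demo.applymap(lambda x: x * 10)  # element-wise with a lambda
c = df_demo.mul(10)                     # the method form of *

print(a.equals(b) and a.equals(c))  # True
```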

Here is an example of slicing using iloc. We get the columns at integer locations 3 through 7 (exclusive), i.e., the columns at integer locations 3, 4, 5, and 6.

df.iloc[:,3:7]*10
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns
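
One way to find those integer locations (rather than counting columns by hand) is the get_loc method of the columns index; here is a sketch with a made-up DataFrame that mirrors the taxis column order:

```python
import pandas as pd

# Made-up DataFrame with the same column order as the taxis data
cols = ["pickup", "dropoff", "passengers", "distance", "fare", "tip", "tolls", "total"]
df_demo = pd.DataFrame([[0] * len(cols)], columns=cols)

start = df_demo.columns.get_loc("distance")
stop = df_demo.columns.get_loc("tolls") + 1  # +1 because iloc slices exclude the right endpoint

print(start, stop)  # 3 7
```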

Practice with groupby and f-strings#

  • Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like "Monday").

Here we add a new column to the far right side of the DataFrame. Notice that, unlike in our usual workflow, we did not need to use pd.to_datetime.

df['Day'] = df['pickup'].dt.day_name()
df
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough Day
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan Saturday
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan Monday
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan Wednesday
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan Sunday
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan Saturday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 1.06 0.0 6.36 green credit card East Harlem North Central Harlem North Manhattan Manhattan Sunday
6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 0.00 0.0 58.80 green credit card Jamaica East Concourse/Concourse Village Queens Bronx Sunday
6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 0.00 0.0 17.30 green cash Crown Heights North Bushwick North Brooklyn Brooklyn Saturday
6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 0.00 0.0 6.80 green credit card East New York East Flatbush/Remsen Village Brooklyn Brooklyn Monday
6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 3.36 0.0 20.16 green credit card Boerum Hill Windsor Terrace Brooklyn Brooklyn Wednesday

6341 rows × 15 columns

  • Why didn’t we have to use pd.to_datetime? Evaluate df.dtypes and df.dtypes["pickup"] to see the reason.

The reason is that when we imported the DataFrame from Seaborn (as opposed to from an attached csv file), Seaborn automatically specified the appropriate data types for these columns.

df.dtypes
pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
Day                        object
dtype: object

Here is an example of getting the value corresponding to the index entry “pickup”. (The value is displayed a little strangely, but it’s indicating that this column has a datetime data type.)

df.dtypes['pickup']
dtype('<M8[ns]')

If we look at the value corresponding to “distance”, it’s easier to read, because we recognize the term float for this numeric column.

df.dtypes['distance']
dtype('float64')
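
For comparison, if this column had come from a csv file, it would have arrived as strings (dtype object), and we would convert it ourselves with pd.to_datetime. A sketch with a made-up column:

```python
import pandas as pd

# Simulate what a csv import would give us: strings, not datetimes
df_demo = pd.DataFrame({"pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"]})
print(df_demo.dtypes["pickup"])  # object

df_demo["pickup"] = pd.to_datetime(df_demo["pickup"])
print(df_demo.dtypes["pickup"])  # datetime64[ns]
print(df_demo["pickup"].dt.day_name().tolist())  # ['Saturday', 'Monday']
```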
  • What day occurs most often in this dataset?

vc = df['Day'].value_counts()
vc
Friday       1091
Saturday     1028
Wednesday     958
Thursday      896
Sunday        857
Tuesday       811
Monday        700
Name: Day, dtype: int64

Here we check the index explicitly, using the index attribute.

vc.index
Index(['Friday', 'Saturday', 'Wednesday', 'Thursday', 'Sunday', 'Tuesday',
       'Monday'],
      dtype='object')

If instead we want the values, we can use the values attribute.

vc.values
array([1091, 1028,  958,  896,  857,  811,  700])

If we want to know the most frequent entry, we can take the first element of the value_counts index, since value_counts sorts from most frequent to least frequent.

vc.index[0] #most frequent
'Friday'
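
An equivalent shortcut is the idxmax method, which returns the index label where the Series takes its largest value; here is a sketch on a small made-up Series:

```python
import pandas as pd

days = pd.Series(["Friday", "Friday", "Monday", "Saturday", "Friday"])
vc_demo = days.value_counts()

print(vc_demo.index[0])  # 'Friday'
print(vc_demo.idxmax())  # 'Friday' -- same answer, without relying on the sort order
```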
  • Can you answer the same thing using Altair?

df.shape
(6341, 15)

By default, Altair does not allow us to use DataFrames with more than 5000 rows. If you try, you’ll get a MaxRowsError telling you the dataset exceeds the limit.

Here we tell Altair to allow up to 7,000 rows. Warning: don’t try this with huge datasets; I wouldn’t go beyond about 20,000 rows.

Don’t worry about learning this data_transformers syntax for the midterms or quizzes.

alt.data_transformers.enable('default', max_rows = 7000)
DataTransformerRegistry.enable('default')

Here we put the days of the week along the x-axis, and we use the special Altair syntax "count()" to make the height of each bar equal to the number of entries for that day. Notice how, as we saw above, Friday is the most common.

It’s a little annoying that the days are alphabetized, rather than given in consecutive order.

import altair as alt

alt.Chart(df).mark_bar().encode(
    x = "Day",
    y = "count()"
)

Here is a way to reorder the days.

alt.Chart(df).mark_bar().encode(
    x=alt.X("Day", sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday','Sunday']),
    y="count()"
)
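
The same reordering idea works on the pandas side: we can reindex the value_counts Series using the days in weekday order. A sketch using the counts we found above:

```python
import pandas as pd

# The day counts computed above
vc_demo = pd.Series({"Friday": 1091, "Saturday": 1028, "Wednesday": 958,
                     "Thursday": 896, "Sunday": 857, "Tuesday": 811, "Monday": 700})

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
print(vc_demo.reindex(days_order))
```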
  • For each day of the week, what was the median tip for that day? Put the results into a pandas Series using groupby.

df.groupby("Day").median()["tip"]
Day
Friday       1.750
Monday       1.770
Saturday     1.500
Sunday       1.560
Thursday     1.835
Tuesday      1.760
Wednesday    1.850
Name: tip, dtype: float64
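
A side note: on newer versions of pandas, calling .median() on the whole grouped DataFrame can fail because of the non-numeric columns (like “payment”). Selecting the column before aggregating avoids this and is also more efficient. A sketch with made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Day": ["Friday", "Friday", "Monday"],
    "tip": [1.0, 2.5, 1.77],
    "payment": ["cash", "credit card", "cash"],  # non-numeric column
})

# Select the column first, then aggregate
print(df_demo.groupby("Day")["tip"].median())
```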
  • Can you answer the same thing using Altair?

Instead of "count()", we can use another function like "median", but in that case we need to specify which column’s values to take the median of. Here we take the median of the “tip” column.

Using the tooltip, we can see that the median value is 1.76 for the “tip” column on Tuesdays.

alt.Chart(df).mark_bar().encode(
    x = "Day",
    y = "median(tip)",
    tooltip = ["median(tip)"]
)
  • For each day of the week, print out a string stating the median tip amount for that day. Use :.2f in the string formatting part so that only two decimal places get printed.

median_tip = df.groupby("Day").median()["tip"]
median_tip
Day
Friday       1.750
Monday       1.770
Saturday     1.500
Sunday       1.560
Thursday     1.835
Tuesday      1.760
Wednesday    1.850
Name: tip, dtype: float64
median_tip.index
Index(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
       'Wednesday'],
      dtype='object', name='Day')
median_tip.values
array([1.75 , 1.77 , 1.5  , 1.56 , 1.835, 1.76 , 1.85 ])
for i in range(len(median_tip)):
    day = median_tip.index[i]
    m_tip = median_tip.values[i]
    print(f"The median tip on {day} is {m_tip:.2f}")
The median tip on Friday is 1.75
The median tip on Monday is 1.77
The median tip on Saturday is 1.50
The median tip on Sunday is 1.56
The median tip on Thursday is 1.83
The median tip on Tuesday is 1.76
The median tip on Wednesday is 1.85
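
An alternative to indexing with range(len(...)) is to iterate over the Series directly with the items method, which yields (index, value) pairs. A sketch with two of the values from above:

```python
import pandas as pd

median_tip_demo = pd.Series({"Friday": 1.75, "Monday": 1.77}, name="tip")

for day, m_tip in median_tip_demo.items():
    print(f"The median tip on {day} is {m_tip:.2f}")
```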