Week 4 Friday#

Announcements#

  • The midterm is Monday (10/30). A sample midterm and solutions are posted on Canvas.

  • The best way to study is to go over the HW, quizzes, and the sample midterm. Next priority would be the lecture notes.

  • HW4 is due Monday night (the usual schedule). Working on it will help you prepare for the midterm.

Plan#

Lecture ~ 30 mins. Last 20 mins: work on HW or study for the midterm.

  • Load the “taxis” dataset from Seaborn and drop rows with missing values.

import seaborn as sns
import altair as alt
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
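
If you're curious how many rows dropna removed, you can compare lengths before and after. Here is a quick sketch using a small made-up DataFrame (the real taxis DataFrame works the same way):

```python
import numpy as np
import pandas as pd

# A tiny made-up DataFrame with some missing values
df_demo = pd.DataFrame({"fare": [7.0, np.nan, 9.0], "tip": [2.15, 0.0, np.nan]})

rows_before = len(df_demo)
df_demo = df_demo.dropna(axis=0)  # drop every row containing at least one missing value
rows_after = len(df_demo)

print(rows_before - rows_after)  # number of rows that were dropped
```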

Here is a reminder of how the dataset looks.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
  • Multiply the values in the columns from “distance” to “tolls” by 10.

Notice the : inside the square brackets, which says we want “every row”. Note also that, unlike Python’s usual slicing, this label-based slice includes both endpoints, so the “tolls” column is part of the result.

df.loc[:,"distance":"tolls"]*10
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns

Here is another way, using a lambda function and applymap. This is definitely more confusing than simply writing *10 as we did above; we show it just as an example of using applymap to perform this basic task.

df.loc[:,"distance":"tolls"].applymap(lambda x: x*10)
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns
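
For reference, here are a few equivalent ways to scale every entry, compared on a tiny made-up DataFrame. (On newer versions of pandas, applymap raises a deprecation warning in favor of DataFrame.map, which behaves the same way here.)

```python
import pandas as pd

df_demo = pd.DataFrame({"distance": [1.60, 0.79], "fare": [7.0, 5.0]})

a = df_demo * 10                        # vectorized, the simplest option
b = df_demo.applymap(lambda x: x * 10)  # element-wise with a lambda
c = df_demo.mul(10)                     # the method form of *

print(a.equals(b) and a.equals(c))  # True
```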

Here is an example of slicing using iloc. We get the columns at integer locations 3 through 7 (exclusive), i.e., the columns at integer locations 3, 4, 5, and 6.

df.iloc[:,3:7]*10
distance fare tip tolls
0 16.0 70.0 21.5 0.0
1 7.9 50.0 0.0 0.0
2 13.7 75.0 23.6 0.0
3 77.0 270.0 61.5 0.0
4 21.6 90.0 11.0 0.0
... ... ... ... ...
6428 7.5 45.0 10.6 0.0
6429 187.4 580.0 0.0 0.0
6430 41.4 160.0 0.0 0.0
6431 11.2 60.0 0.0 0.0
6432 38.5 150.0 33.6 0.0

6341 rows × 4 columns
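
One way to find those integer locations (rather than counting columns by hand) is the get_loc method of the columns index; here is a sketch with a made-up DataFrame that mirrors the taxis column order:

```python
import pandas as pd

# Made-up DataFrame with the same column order as the taxis data
cols = ["pickup", "dropoff", "passengers", "distance", "fare", "tip", "tolls", "total"]
df_demo = pd.DataFrame([[0] * len(cols)], columns=cols)

start = df_demo.columns.get_loc("distance")
stop = df_demo.columns.get_loc("tolls") + 1  # +1 because iloc slices exclude the right endpoint

print(start, stop)  # 3 7
```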

Practice with groupby and f-strings#

  • Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like "Monday").

Here we add a new column to the far right side of the DataFrame. Notice that, unlike in our usual workflow, we did not need to use pd.to_datetime.

df['Day'] = df['pickup'].dt.day_name()
df
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough Day
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan Saturday
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan Monday
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan Wednesday
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan Sunday
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan Saturday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 1.06 0.0 6.36 green credit card East Harlem North Central Harlem North Manhattan Manhattan Sunday
6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 0.00 0.0 58.80 green credit card Jamaica East Concourse/Concourse Village Queens Bronx Sunday
6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 0.00 0.0 17.30 green cash Crown Heights North Bushwick North Brooklyn Brooklyn Saturday
6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 0.00 0.0 6.80 green credit card East New York East Flatbush/Remsen Village Brooklyn Brooklyn Monday
6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 3.36 0.0 20.16 green credit card Boerum Hill Windsor Terrace Brooklyn Brooklyn Wednesday

6341 rows × 15 columns

  • Why didn’t we have to use pd.to_datetime? Evaluate df.dtypes and df.dtypes["pickup"] to see the reason.

The reason is that when we imported the DataFrame from Seaborn (as opposed to from an attached csv file), Seaborn automatically specified the appropriate data types for these columns.

df.dtypes
pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
Day                        object
dtype: object

Here is an example of getting the value corresponding to the index entry “pickup”. (The value is displayed a little strangely, but it’s indicating that this column has a datetime data type.)

df.dtypes['pickup']
dtype('<M8[ns]')

If we look at the value corresponding to “distance”, it’s easier to read, because we recognize the term float for this numeric column.

df.dtypes['distance']
dtype('float64')
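
For comparison, if this column had come from a csv file, it would have arrived as strings (dtype object), and we would convert it ourselves with pd.to_datetime. A sketch with a made-up column:

```python
import pandas as pd

# Simulate what a csv import would give us: strings, not datetimes
df_demo = pd.DataFrame({"pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"]})
print(df_demo.dtypes["pickup"])  # object

df_demo["pickup"] = pd.to_datetime(df_demo["pickup"])
print(df_demo.dtypes["pickup"])  # datetime64[ns]
print(df_demo["pickup"].dt.day_name().tolist())  # ['Saturday', 'Monday']
```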
  • What day occurs most often in this dataset?

vc = df['Day'].value_counts()
vc
Friday       1091
Saturday     1028
Wednesday     958
Thursday      896
Sunday        857
Tuesday       811
Monday        700
Name: Day, dtype: int64

Here we check the index explicitly, using the index attribute.

vc.index
Index(['Friday', 'Saturday', 'Wednesday', 'Thursday', 'Sunday', 'Tuesday',
       'Monday'],
      dtype='object')

If instead we want the values, we can use the values attribute.

vc.values
array([1091, 1028,  958,  896,  857,  811,  700])

If we want to know the most frequent entry, we can take the first element of the value_counts index, since value_counts sorts from most frequent to least frequent.

vc.index[0] #most frequent
'Friday'
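
An equivalent shortcut is the idxmax method, which returns the index label where the Series takes its largest value; here is a sketch on a small made-up Series:

```python
import pandas as pd

days = pd.Series(["Friday", "Friday", "Monday", "Saturday", "Friday"])
vc_demo = days.value_counts()

print(vc_demo.index[0])  # 'Friday'
print(vc_demo.idxmax())  # 'Friday' -- same answer, without relying on the sort order
```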
  • Can you answer the same thing using Altair?

df.shape
(6341, 15)

By default, Altair does not allow us to use DataFrames with more than 5000 rows. If you try, you’ll get a MaxRowsError telling you the dataset exceeds the limit.

Here we tell Altair to allow up to 7,000 rows. Warning: don’t try this with huge datasets; I wouldn’t go beyond about 20,000 rows.

Don’t worry about learning this data_transformers syntax for the midterms or quizzes.

alt.data_transformers.enable('default', max_rows = 7000)
DataTransformerRegistry.enable('default')

Here we put the days of the week along the x-axis, and we use the special Altair syntax "count()" to make the height of each bar equal to the number of entries for that day. Notice how, as we saw above, Friday is the most common.

It’s a little annoying that the days are alphabetized, rather than given in consecutive order.

import altair as alt

alt.Chart(df).mark_bar().encode(
    x = "Day",
    y = "count()"
)

Here is a way to reorder the days.

alt.Chart(df).mark_bar().encode(
    x=alt.X("Day", sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday','Sunday']),
    y="count()"
)
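
The same reordering idea works on the pandas side: we can reindex the value_counts Series using the days in weekday order. A sketch using the counts we found above:

```python
import pandas as pd

# The day counts computed above
vc_demo = pd.Series({"Friday": 1091, "Saturday": 1028, "Wednesday": 958,
                     "Thursday": 896, "Sunday": 857, "Tuesday": 811, "Monday": 700})

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
print(vc_demo.reindex(days_order))
```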
  • For each day of the week, what was the median tip for that day? Put the results into a pandas Series using groupby.

df.groupby("Day").median()["tip"]
Day
Friday       1.750
Monday       1.770
Saturday     1.500
Sunday       1.560
Thursday     1.835
Tuesday      1.760
Wednesday    1.850
Name: tip, dtype: float64
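
A side note: on newer versions of pandas, calling .median() on the whole grouped DataFrame can fail because of the non-numeric columns (like “payment”). Selecting the column before aggregating avoids this and is also more efficient. A sketch with made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Day": ["Friday", "Friday", "Monday"],
    "tip": [1.0, 2.5, 1.77],
    "payment": ["cash", "credit card", "cash"],  # non-numeric column
})

# Select the column first, then aggregate
print(df_demo.groupby("Day")["tip"].median())
```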
  • Can you answer the same thing using Altair?

Instead of "count()", we can use another function like "median", but in that case we need to specify which column’s values to take the median of. Here we take the median of the “tip” column.

Using the tooltip, we can see that the median value is 1.76 for the “tip” column on Tuesdays.

alt.Chart(df).mark_bar().encode(
    x = "Day",
    y = "median(tip)",
    tooltip = ["median(tip)"]
)
  • For each day of the week, print out a string stating the median tip amount for that day. Use :.2f in the string formatting part so that only two decimal places get printed.

median_tip = df.groupby("Day").median()["tip"]
median_tip
Day
Friday       1.750
Monday       1.770
Saturday     1.500
Sunday       1.560
Thursday     1.835
Tuesday      1.760
Wednesday    1.850
Name: tip, dtype: float64
median_tip.index
Index(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
       'Wednesday'],
      dtype='object', name='Day')
median_tip.values
array([1.75 , 1.77 , 1.5  , 1.56 , 1.835, 1.76 , 1.85 ])
for i in range(len(median_tip)):
    day = median_tip.index[i]
    m_tip = median_tip.values[i]
    print(f"The median tip on {day} is {m_tip:.2f}")
The median tip on Friday is 1.75
The median tip on Monday is 1.77
The median tip on Saturday is 1.50
The median tip on Sunday is 1.56
The median tip on Thursday is 1.83
The median tip on Tuesday is 1.76
The median tip on Wednesday is 1.85
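
An alternative to indexing with range(len(...)) is to iterate over the Series directly with the items method, which yields (index, value) pairs. A sketch with two of the values from above:

```python
import pandas as pd

median_tip_demo = pd.Series({"Friday": 1.75, "Monday": 1.77}, name="tip")

for day, m_tip in median_tip_demo.items():
    print(f"The median tip on {day} is {m_tip:.2f}")
```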