Week 4 Wednesday
Announcements
The midterm is Monday (10/30). A sample midterm is posted on Canvas.
The best way to study is to go over the HW, quizzes, and the sample midterm. The next priority would be the lecture notes.
Practice with the DataFrame method apply
Load the “taxis” dataset from Seaborn and drop rows with missing values.
import seaborn as sns
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
help(df.dropna)
Help on method dropna in module pandas.core.frame:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
See Also
--------
DataFrame.isna: Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
Examples
--------
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy'])
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
As a warm-up with apply, try applying the len function using first axis=0 and then axis=1. We'll see more serious uses below and in the homework.
df.shape
(6341, 14)
Notice in the output below how the row labels have disappeared but the column labels are still here. With axis=0, one column at a time is plugged into len, and each of these columns has length 6341.
df.apply(len, axis=0)
pickup 6341
dropoff 6341
passengers 6341
distance 6341
fare 6341
tip 6341
tolls 6341
total 6341
color 6341
payment 6341
pickup_zone 6341
dropoff_zone 6341
pickup_borough 6341
dropoff_borough 6341
dtype: int64
If we use axis=1 instead, then we are plugging in one row at a time. Those rows all have length 14, which is where the following values come from.
df.apply(len, axis=1)
0 14
1 14
2 14
3 14
4 14
..
6428 14
6429 14
6430 14
6431 14
6432 14
Length: 6341, dtype: int64
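To double-check the axis convention away from the taxis data, here is a minimal sketch on a made-up 2-row, 3-column DataFrame (the toy values are invented for illustration):

```python
import pandas as pd

# Toy 2x3 DataFrame, invented just to illustrate the axis convention.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

cols = toy.apply(len, axis=0)  # each column (a Series of length 2) is plugged into len
rows = toy.apply(len, axis=1)  # each row (a Series of length 3) is plugged into len

print(cols.tolist())  # [2, 2, 2] -- one value per column
print(rows.tolist())  # [3, 3] -- one value per row
```

So with axis=0 we get one output per column, and with axis=1 we get one output per row, exactly as with the taxis DataFrame above.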
Here is an example where we see different values for different columns: the input is a column, and the output is its dtype.
df.apply(lambda x: x.dtype, axis=0)
pickup datetime64[ns]
dropoff datetime64[ns]
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
dtype: object
Compare the above use of apply to map, which is used on a single pandas Series, rather than on a whole pandas DataFrame.
ser = df["payment"]
ser
0 credit card
1 cash
2 credit card
3 credit card
4 credit card
...
6428 credit card
6429 credit card
6430 cash
6431 credit card
6432 credit card
Name: payment, Length: 6341, dtype: object
Here we are trying to add 10 to each entry, but we get an error because the entries in ser are strings.
ser.map(lambda x: x+10) # error: x is a string
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 ser.map(lambda x: x+10)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:3909, in Series.map(self, arg, na_action)
3834 def map(self, arg, na_action=None) -> "Series":
3835 """
3836 Map values of Series according to input correspondence.
3837
(...)
3907 dtype: object
3908 """
-> 3909 new_values = super()._map_values(arg, na_action=na_action)
3910 return self._constructor(new_values, index=self.index).__finalize__(
3911 self, method="map"
3912 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/base.py:937, in IndexOpsMixin._map_values(self, mapper, na_action)
934 raise ValueError(msg)
936 # mapper is a function
--> 937 new_values = map_f(values, mapper)
939 return new_values
File pandas/_libs/lib.pyx:2467, in pandas._libs.lib.map_infer()
Cell In [8], line 1, in <lambda>(x)
----> 1 ser.map(lambda x: x+10)
TypeError: can only concatenate str (not "int") to str
If we convert 10 to a string (we could also put it in quotation marks), then we don't get an error. Notice how the function here is a little simpler than what we used above, in the sense that the input to this lambda function is a string, whereas the inputs above were pandas Series.
ser.map(lambda x: x+str(10))
0 credit card10
1 cash10
2 credit card10
3 credit card10
4 credit card10
...
6428 credit card10
6429 credit card10
6430 cash10
6431 credit card10
6432 credit card10
Name: payment, Length: 6341, dtype: object
ser.map(lambda x: f"{x}10")
0 credit card10
1 cash10
2 credit card10
3 credit card10
4 credit card10
...
6428 credit card10
6429 credit card10
6430 cash10
6431 credit card10
6432 credit card10
Name: payment, Length: 6341, dtype: object
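As a side note, for an elementwise function like this one, Series.apply gives the same result as Series.map. Here is a minimal sketch on a made-up two-entry Series (the values are invented stand-ins for df["payment"]):

```python
import pandas as pd

# Tiny invented stand-in for df["payment"].
ser = pd.Series(["credit card", "cash"])

via_map = ser.map(lambda x: x + "10")
via_apply = ser.apply(lambda x: x + "10")  # same elementwise result

print(via_map.equals(via_apply))  # True
```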
Write a function has_airport which takes as input a row from df (not a row label but an entire row) and as output returns True if the "pickup_zone" or "dropoff_zone" entry contains the substring "Airport".
def has_airport(row):
return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])
z = df.iloc[30] # select the 31st row by position
z
pickup 2019-03-01 02:55:55
dropoff 2019-03-01 02:57:59
passengers 3
distance 0.74
fare 4.0
tip 0.0
tolls 0.0
total 7.8
color yellow
payment cash
pickup_zone Clinton East
dropoff_zone West Chelsea/Hudson Yards
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 31, dtype: object
z["pickup_zone"]
'Clinton East'
Notice that neither the "pickup_zone" value nor the "dropoff_zone" value contains "Airport", so that is why our function returns False.
has_airport(z)
False
df.head(11)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.49 | 7.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.63 | 15.0 | 1.00 | 0.0 | 19.30 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.52 | 8.0 | 1.00 | 0.0 | 13.30 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.90 | 17.0 | 0.00 | 0.0 | 17.80 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
df.loc[10] # select the row with an index label of 10.
pickup 2019-03-16 10:02:25
dropoff 2019-03-16 10:22:29
passengers 1
distance 3.9
fare 17.0
tip 0.0
tolls 0.0
total 17.8
color yellow
payment cash
pickup_zone LaGuardia Airport
dropoff_zone Astoria
pickup_borough Queens
dropoff_borough Queens
Name: 10, dtype: object
has_airport(df.loc[10])
True
df.iloc[10] # select the 11th row by position; it is now the same as df.loc[11] because we removed rows with missing values
pickup 2019-03-20 19:39:42
dropoff 2019-03-20 19:45:36
passengers 1
distance 1.53
fare 6.5
tip 2.16
tolls 0.0
total 12.96
color yellow
payment credit card
pickup_zone Upper West Side South
dropoff_zone Manhattan Valley
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 11, dtype: object
df.loc[11]
pickup 2019-03-20 19:39:42
dropoff 2019-03-20 19:45:36
passengers 1
distance 1.53
fare 6.5
tip 2.16
tolls 0.0
total 12.96
color yellow
payment credit card
pickup_zone Upper West Side South
dropoff_zone Manhattan Valley
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 11, dtype: object
Make a new column "Airport0" which contains True if the "pickup_zone" or "dropoff_zone" contains the substring "Airport", and otherwise contains False. Use the has_airport function and df.apply(???, axis=???).
Because we are evaluating this one row at a time, we are using axis=1.
df["Airport0"] = df.apply(has_airport, axis = 1)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False |
6341 rows × 15 columns
If we try to apply using axis=0 in this case, we get an error, because the code included row["pickup_zone"], which doesn't make sense for a column in this DataFrame. (There is no row with label "pickup_zone".)
df.apply(has_airport, axis = 0)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
3080 try:
-> 3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:98, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index_class_helper.pxi:89, in pandas._libs.index.Int64Engine._check_type()
KeyError: 'pickup_zone'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [20], line 1
----> 1 df.apply(has_airport, axis = 0)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/frame.py:7768, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwds)
7757 from pandas.core.apply import frame_apply
7759 op = frame_apply(
7760 self,
7761 func=func,
(...)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:185, in FrameApply.get_result(self)
182 elif self.raw:
183 return self.apply_raw()
--> 185 return self.apply_standard()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:276, in FrameApply.apply_standard(self)
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
278 # wrap results
279 return self.wrap_results(results, res_index)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:290, in FrameApply.apply_series_generator(self)
287 with option_context("mode.chained_assignment", None):
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
293 # series_generator will swap out the underlying data
294 results[i] = results[i].copy(deep=False)
Cell In [11], line 2, in has_airport(row)
1 def has_airport(row):
----> 2 return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:853, in Series.__getitem__(self, key)
850 return self._values[key]
852 elif key_is_scalar:
--> 853 return self._get_value(key)
855 if is_hashable(key):
856 # Otherwise index.get_value will raise InvalidIndexError
857 try:
858 # For labels that don't resolve as scalars like tuples and frozensets
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:961, in Series._get_value(self, label, takeable)
958 return self._values[label]
960 # Similar to Index.get_value, but we do not fall back to positional
--> 961 loc = self.index.get_loc(label)
962 return self.index._get_values_for_loc(self, loc, label)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
-> 3083 raise KeyError(key) from err
3085 if tolerance is not None:
3086 tolerance = self._convert_tolerance(tolerance, np.asarray(key))
KeyError: 'pickup_zone'
Make a column "Airport1" with the same values, again using apply, but this time using a lambda function.
The following is exactly the same as above, just using an anonymous lambda function instead of our named has_airport function. I've broken it up over multiple lines just to make it a little easier to read.
df["Airport1"] = df.apply(
lambda row: ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]),
axis=1
)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | Airport1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False | False |
6341 rows × 16 columns
Here is a fancier but more confusing approach. Notice that df[["pickup_zone", "dropoff_zone"]] is a DataFrame containing just the two columns we're interested in. Use apply this time to apply a lambda function to the columns rather than the rows, then use any. Name the result "Airport2".
We start off by getting a two-column DataFrame that contains only the two columns of interest.
df_sub = df[["pickup_zone", "dropoff_zone"]]
df_sub
pickup_zone | dropoff_zone | |
---|---|---|
0 | Lenox Hill West | UN/Turtle Bay South |
1 | Upper West Side South | Upper West Side South |
2 | Alphabet City | West Village |
3 | Hudson Sq | Yorkville West |
4 | Midtown East | Yorkville West |
... | ... | ... |
6428 | East Harlem North | Central Harlem North |
6429 | Jamaica | East Concourse/Concourse Village |
6430 | Crown Heights North | Bushwick North |
6431 | East New York | East Flatbush/Remsen Village |
6432 | Boerum Hill | Windsor Terrace |
6341 rows × 2 columns
We now use apply again, but this time on the columns, not the rows. This approach is nicer in some ways because there is less repetition, unlike the code ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]), where nearly the same expression appears twice.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0)
pickup_zone | dropoff_zone | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
... | ... | ... |
6428 | False | False |
6429 | False | False |
6430 | False | False |
6431 | False | False |
6432 | False | False |
6341 rows × 2 columns
As an example of how axis works, if we apply any(axis=0), we are checking if there are any True values in the columns. This Series has length two, because there are two columns.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=0)
pickup_zone True
dropoff_zone True
dtype: bool
But what we really want to do is apply any(axis=1), to find out if any of the rows have a True in them. This produces a much longer Series, and we store it in the "Airport2" column.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
6428 False
6429 False
6430 False
6431 False
6432 False
Length: 6341, dtype: bool
df["Airport2"] = df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=1)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | Airport1 | Airport2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False | False | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False | False | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False | False | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False | False | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False | False | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False | False | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False | False | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False | False | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False | False | False |
6341 rows × 17 columns
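Side note: the same True/False values can be computed with no apply at all, by combining two str.contains Series with the | operator. Here is a minimal sketch on a made-up two-row frame (these zone values are invented, not taken from the taxis data):

```python
import pandas as pd

# Invented stand-in for the two zone columns of the taxis data.
df_toy = pd.DataFrame({
    "pickup_zone": ["LaGuardia Airport", "Midtown East"],
    "dropoff_zone": ["Astoria", "West Village"],
})

# Vectorized boolean OR of two str.contains results -- no apply needed.
airport = (df_toy["pickup_zone"].str.contains("Airport")
           | df_toy["dropoff_zone"].str.contains("Airport"))

print(airport.tolist())  # [True, False]
```

This vectorized version is typically faster than apply on a large DataFrame, since the looping happens inside pandas rather than in a Python lambda.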
Make a list air_cols containing the names of those three columns we just made. Use a list comprehension and the string startswith method.
Before we do that, here is an example of including an if statement at the end of a list comprehension. It indicates which values we will use in our list. For example, in the following we are iterating over x = 0, 1, 2, 3, 4 (remember that the right endpoint is typically not included in Python), but we are only using the values which are strictly greater than 2. So we are putting 3**2 and 4**2 into a list. (Side comment: in Python, we do not use the caret symbol ^ for powers; instead we use **.)
[x**2 for x in range(5) if x > 2]
[9, 16]
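For comparison, the comprehension above is shorthand for the following ordinary for loop (same filter, same squaring):

```python
# Equivalent loop form of [x**2 for x in range(5) if x > 2].
squares = []
for x in range(5):
    if x > 2:  # keep only values strictly greater than 2
        squares.append(x**2)

print(squares)  # [9, 16]
```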
[c for c in df.columns if c == "Airport1"]
['Airport1']
We are making a list containing all the column names which start with the string “Airport”.
[c for c in df.columns if c.startswith("Airport")]
['Airport0', 'Airport1', 'Airport2']
Here is another way to make the same list, using f-strings. Notice that this version does not look at df at all.
[f"Airport{i}" for i in range(3)]
['Airport0', 'Airport1', 'Airport2']
Check that the values in these three columns are the same. One approach: compute row.unique() for every row in the DataFrame, and check that the length is always 1.
df[[f"Airport{i}" for i in range(3)]].apply(lambda row: len(row.unique()) == 1, axis = 1)
0 True
1 True
2 True
3 True
4 True
...
6428 True
6429 True
6430 True
6431 True
6432 True
Length: 6341, dtype: bool
df[[f"Airport{i}" for i in range(3)]].apply(lambda row: len(row.unique()) == 1, axis = 1).all(axis = 0)
True
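Another way to make the same check uses nunique(axis=1), which counts the distinct values in each row: the three columns agree everywhere exactly when every count is 1. A minimal sketch on a made-up stand-in for the three Airport columns (invented values, not the taxis data):

```python
import pandas as pd

# Invented stand-in for df[["Airport0", "Airport1", "Airport2"]].
air = pd.DataFrame({
    "Airport0": [True, False, True],
    "Airport1": [True, False, True],
    "Airport2": [True, False, True],
})

# One distinct value per row <=> the three columns agree on that row.
agree = air.nunique(axis=1).eq(1).all()
print(agree)  # True
```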