Week 4 Wednesday
Announcements
The midterm is Monday (10/30). A sample midterm is posted on Canvas.
The best way to study is to go over the HW, quizzes, and the sample midterm. The next priority would be the lecture notes.
Practice with the DataFrame method apply
Load the “taxis” dataset from Seaborn and drop rows with missing values.
import seaborn as sns
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
help(df.dropna)
Help on method dropna in module pandas.core.frame:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
See Also
--------
DataFrame.isna: Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
Examples
--------
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy'])
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
As a warm-up with apply, try applying the len function using first axis=0 and then axis=1. We'll see more serious uses below and in the homework.
df.shape
(6341, 14)
Notice in the output below how the row labels have disappeared but the column labels are still here. With axis=0, one column at a time is plugged into len, and each of these columns has length 6341.
df.apply(len, axis=0)
pickup 6341
dropoff 6341
passengers 6341
distance 6341
fare 6341
tip 6341
tolls 6341
total 6341
color 6341
payment 6341
pickup_zone 6341
dropoff_zone 6341
pickup_borough 6341
dropoff_borough 6341
dtype: int64
If we use axis=1 instead, then we are plugging in one row at a time. Those rows all have length 14, which is where the following values come from.
df.apply(len, axis=1)
0 14
1 14
2 14
3 14
4 14
..
6428 14
6429 14
6430 14
6431 14
6432 14
Length: 6341, dtype: int64
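To double-check the axis convention away from the taxis data, here is a minimal sketch on a made-up 2-row, 3-column DataFrame (the toy values are invented for illustration):

```python
import pandas as pd

# Toy 2x3 DataFrame, invented just to illustrate the axis convention.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

cols = toy.apply(len, axis=0)  # each column (a Series of length 2) is plugged into len
rows = toy.apply(len, axis=1)  # each row (a Series of length 3) is plugged into len

print(cols.tolist())  # [2, 2, 2] -- one value per column
print(rows.tolist())  # [3, 3] -- one value per row
```

So with axis=0 we get one output per column, and with axis=1 we get one output per row, exactly as with the taxis DataFrame above.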
Here is an example where we see different values for different columns: the input is a column, and the output is its dtype.
df.apply(lambda x: x.dtype, axis=0)
pickup datetime64[ns]
dropoff datetime64[ns]
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
dtype: object
Compare the above use of apply to map, which is used on a single pandas Series, rather than on a whole pandas DataFrame.
ser = df["payment"]
ser
0 credit card
1 cash
2 credit card
3 credit card
4 credit card
...
6428 credit card
6429 credit card
6430 cash
6431 credit card
6432 credit card
Name: payment, Length: 6341, dtype: object
Here we are trying to add 10 to each entry, but we get an error because the entries in ser are strings.
ser.map(lambda x: x+10) # error: x is a string
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 ser.map(lambda x: x+10)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:3909, in Series.map(self, arg, na_action)
3834 def map(self, arg, na_action=None) -> "Series":
3835 """
3836 Map values of Series according to input correspondence.
3837
(...)
3907 dtype: object
3908 """
-> 3909 new_values = super()._map_values(arg, na_action=na_action)
3910 return self._constructor(new_values, index=self.index).__finalize__(
3911 self, method="map"
3912 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/base.py:937, in IndexOpsMixin._map_values(self, mapper, na_action)
934 raise ValueError(msg)
936 # mapper is a function
--> 937 new_values = map_f(values, mapper)
939 return new_values
File pandas/_libs/lib.pyx:2467, in pandas._libs.lib.map_infer()
Cell In [8], line 1, in <lambda>(x)
----> 1 ser.map(lambda x: x+10)
TypeError: can only concatenate str (not "int") to str
If we convert 10 to a string (we could also put it in quotation marks), then we don't get an error. Notice how the function here is a little simpler than what we used above, in the sense that the input to this lambda function is a string, whereas the inputs above were pandas Series.
ser.map(lambda x: x+str(10))
0 credit card10
1 cash10
2 credit card10
3 credit card10
4 credit card10
...
6428 credit card10
6429 credit card10
6430 cash10
6431 credit card10
6432 credit card10
Name: payment, Length: 6341, dtype: object
ser.map(lambda x: f"{x}10")
0 credit card10
1 cash10
2 credit card10
3 credit card10
4 credit card10
...
6428 credit card10
6429 credit card10
6430 cash10
6431 credit card10
6432 credit card10
Name: payment, Length: 6341, dtype: object
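As a side note, for an elementwise function like this one, Series.apply gives the same result as Series.map. Here is a minimal sketch on a made-up two-entry Series (the values are invented stand-ins for df["payment"]):

```python
import pandas as pd

# Tiny invented stand-in for df["payment"].
ser = pd.Series(["credit card", "cash"])

via_map = ser.map(lambda x: x + "10")
via_apply = ser.apply(lambda x: x + "10")  # same elementwise result

print(via_map.equals(via_apply))  # True
```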
Write a function has_airport which takes as input a row from df (not a row label but an entire row) and as output returns True if the "pickup_zone" or "dropoff_zone" entry contains the substring "Airport".
def has_airport(row):
return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])
z = df.iloc[30] # select the 31st row by position
z
pickup 2019-03-01 02:55:55
dropoff 2019-03-01 02:57:59
passengers 3
distance 0.74
fare 4.0
tip 0.0
tolls 0.0
total 7.8
color yellow
payment cash
pickup_zone Clinton East
dropoff_zone West Chelsea/Hudson Yards
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 31, dtype: object
z["pickup_zone"]
'Clinton East'
Notice that neither the "pickup_zone" value nor the "dropoff_zone" value contains "Airport", so that is why our function returns False.
has_airport(z)
False
df.head(11)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.49 | 7.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.63 | 15.0 | 1.00 | 0.0 | 19.30 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.52 | 8.0 | 1.00 | 0.0 | 13.30 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.90 | 17.0 | 0.00 | 0.0 | 17.80 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
df.loc[10] # select the row with an index label of 10.
pickup 2019-03-16 10:02:25
dropoff 2019-03-16 10:22:29
passengers 1
distance 3.9
fare 17.0
tip 0.0
tolls 0.0
total 17.8
color yellow
payment cash
pickup_zone LaGuardia Airport
dropoff_zone Astoria
pickup_borough Queens
dropoff_borough Queens
Name: 10, dtype: object
has_airport(df.loc[10])
True
df.iloc[10] # select the 11th row by position; it is now the same as df.loc[11] because we removed rows with missing values
pickup 2019-03-20 19:39:42
dropoff 2019-03-20 19:45:36
passengers 1
distance 1.53
fare 6.5
tip 2.16
tolls 0.0
total 12.96
color yellow
payment credit card
pickup_zone Upper West Side South
dropoff_zone Manhattan Valley
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 11, dtype: object
df.loc[11]
pickup 2019-03-20 19:39:42
dropoff 2019-03-20 19:45:36
passengers 1
distance 1.53
fare 6.5
tip 2.16
tolls 0.0
total 12.96
color yellow
payment credit card
pickup_zone Upper West Side South
dropoff_zone Manhattan Valley
pickup_borough Manhattan
dropoff_borough Manhattan
Name: 11, dtype: object
Make a new column "Airport0" which contains True if the "pickup_zone" or "dropoff_zone" contains the substring "Airport", and otherwise contains False. Use the has_airport function and df.apply(???, axis=???).
Because we are evaluating this one row at a time, we are using axis=1.
df["Airport0"] = df.apply(has_airport, axis = 1)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False |
6341 rows × 15 columns
If we try to apply using axis=0 in this case, we get an error, because the code included row["pickup_zone"], which doesn't make sense for a column in this DataFrame. (There is no row with label "pickup_zone".)
df.apply(has_airport, axis = 0)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
3080 try:
-> 3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:98, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index_class_helper.pxi:89, in pandas._libs.index.Int64Engine._check_type()
KeyError: 'pickup_zone'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [20], line 1
----> 1 df.apply(has_airport, axis = 0)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/frame.py:7768, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwds)
7757 from pandas.core.apply import frame_apply
7759 op = frame_apply(
7760 self,
7761 func=func,
(...)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:185, in FrameApply.get_result(self)
182 elif self.raw:
183 return self.apply_raw()
--> 185 return self.apply_standard()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:276, in FrameApply.apply_standard(self)
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
278 # wrap results
279 return self.wrap_results(results, res_index)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/apply.py:290, in FrameApply.apply_series_generator(self)
287 with option_context("mode.chained_assignment", None):
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
293 # series_generator will swap out the underlying data
294 results[i] = results[i].copy(deep=False)
Cell In [11], line 2, in has_airport(row)
1 def has_airport(row):
----> 2 return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:853, in Series.__getitem__(self, key)
850 return self._values[key]
852 elif key_is_scalar:
--> 853 return self._get_value(key)
855 if is_hashable(key):
856 # Otherwise index.get_value will raise InvalidIndexError
857 try:
858 # For labels that don't resolve as scalars like tuples and frozensets
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:961, in Series._get_value(self, label, takeable)
958 return self._values[label]
960 # Similar to Index.get_value, but we do not fall back to positional
--> 961 loc = self.index.get_loc(label)
962 return self.index._get_values_for_loc(self, loc, label)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
-> 3083 raise KeyError(key) from err
3085 if tolerance is not None:
3086 tolerance = self._convert_tolerance(tolerance, np.asarray(key))
KeyError: 'pickup_zone'
Make a column "Airport1" with the same values, again using apply, but this time using a lambda function.
The following is exactly the same as above, just using an anonymous lambda function instead of our named has_airport function. I've broken it up over multiple lines just to make it a little easier to read.
df["Airport1"] = df.apply(
lambda row: ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]),
axis=1
)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | Airport1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False | False |
6341 rows × 16 columns
Here is a fancier but more confusing approach. Notice that df[["pickup_zone", "dropoff_zone"]] is a DataFrame containing just the two columns we're interested in. Use apply this time to apply a lambda function to the columns rather than the rows, then use any. Name the result "Airport2".
We start off by getting a two-column DataFrame that contains only the two columns of interest.
df_sub = df[["pickup_zone", "dropoff_zone"]]
df_sub
pickup_zone | dropoff_zone | |
---|---|---|
0 | Lenox Hill West | UN/Turtle Bay South |
1 | Upper West Side South | Upper West Side South |
2 | Alphabet City | West Village |
3 | Hudson Sq | Yorkville West |
4 | Midtown East | Yorkville West |
... | ... | ... |
6428 | East Harlem North | Central Harlem North |
6429 | Jamaica | East Concourse/Concourse Village |
6430 | Crown Heights North | Bushwick North |
6431 | East New York | East Flatbush/Remsen Village |
6432 | Boerum Hill | Windsor Terrace |
6341 rows × 2 columns
We now use apply again, but this time on the columns, not the rows. This approach is nicer in some ways because there is less repetition, unlike the code ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]), where nearly the same expression appears twice.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0)
pickup_zone | dropoff_zone | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
... | ... | ... |
6428 | False | False |
6429 | False | False |
6430 | False | False |
6431 | False | False |
6432 | False | False |
6341 rows × 2 columns
As an example of how axis works, if we apply any(axis=0), we are checking if there are any True values in the columns. This Series has length two, because there are two columns.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=0)
pickup_zone True
dropoff_zone True
dtype: bool
But what we really want to do is apply any(axis=1), to find out if any of the rows have a True in them. This produces a much longer Series, and we store it in the "Airport2" column.
df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
6428 False
6429 False
6430 False
6431 False
6432 False
Length: 6341, dtype: bool
df["Airport2"] = df_sub.apply(lambda col: col.str.contains("Airport"), axis=0).any(axis=1)
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Airport0 | Airport1 | Airport2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | False | False | False |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | False | False | False |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | False | False | False |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | False | False | False |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | False | False | False |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | False | False | False |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | False | False | False |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | False | False | False |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | False | False | False |
6341 rows × 17 columns
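Side note: the same True/False values can be computed with no apply at all, by combining two str.contains Series with the | operator. Here is a minimal sketch on a made-up two-row frame (these zone values are invented, not taken from the taxis data):

```python
import pandas as pd

# Invented stand-in for the two zone columns of the taxis data.
df_toy = pd.DataFrame({
    "pickup_zone": ["LaGuardia Airport", "Midtown East"],
    "dropoff_zone": ["Astoria", "West Village"],
})

# Vectorized boolean OR of two str.contains results -- no apply needed.
airport = (df_toy["pickup_zone"].str.contains("Airport")
           | df_toy["dropoff_zone"].str.contains("Airport"))

print(airport.tolist())  # [True, False]
```

This vectorized version is typically faster than apply on a large DataFrame, since the looping happens inside pandas rather than in a Python lambda.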
Make a list air_cols containing the names of those three columns we just made. Use a list comprehension and the string startswith method.
Before we do that, here is an example of including an if statement at the end of a list comprehension. It indicates which values we will use in our list. For example, in the following we are iterating over x = 0, 1, 2, 3, 4 (remember that the right endpoint is typically not included in Python), but we are only using the values which are strictly greater than 2. So we are putting 3**2 and 4**2 into a list. (Side comment: in Python, we do not use the caret symbol ^ for powers; instead we use **.)
[x**2 for x in range(5) if x > 2]
[9, 16]
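For comparison, the comprehension above is shorthand for the following ordinary for loop (same filter, same squaring):

```python
# Equivalent loop form of [x**2 for x in range(5) if x > 2].
squares = []
for x in range(5):
    if x > 2:  # keep only values strictly greater than 2
        squares.append(x**2)

print(squares)  # [9, 16]
```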
[c for c in df.columns if c == "Airport1"]
['Airport1']
We are making a list containing all the column names which start with the string “Airport”.
[c for c in df.columns if c.startswith("Airport")]
['Airport0', 'Airport1', 'Airport2']
Here is another way to make the same list, using f-strings. Notice that this version does not look at df at all.
[f"Airport{i}" for i in range(3)]
['Airport0', 'Airport1', 'Airport2']
Check that the values in these three columns are the same. One approach: compute row.unique() for every row in the DataFrame, and check that the length is always 1.
df[[f"Airport{i}" for i in range(3)]].apply(lambda row: len(row.unique()) == 1, axis = 1)
0 True
1 True
2 True
3 True
4 True
...
6428 True
6429 True
6430 True
6431 True
6432 True
Length: 6341, dtype: bool
df[[f"Airport{i}" for i in range(3)]].apply(lambda row: len(row.unique()) == 1, axis = 1).all(axis = 0)
True
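Another way to make the same check uses nunique(axis=1), which counts the distinct values in each row: the three columns agree everywhere exactly when every count is 1. A minimal sketch on a made-up stand-in for the three Airport columns (invented values, not the taxis data):

```python
import pandas as pd

# Invented stand-in for df[["Airport0", "Airport1", "Airport2"]].
air = pd.DataFrame({
    "Airport0": [True, False, True],
    "Airport1": [True, False, True],
    "Airport2": [True, False, True],
})

# One distinct value per row <=> the three columns agree on that row.
agree = air.nunique(axis=1).eq(1).all()
print(agree)  # True
```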