Week 2 Friday#

You can find the notesbooks at course notes.

import pandas as pd
df = pd.read_csv("taxis.csv")
  • Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function to_datetime. Save the results as new columns in df, named pickuptime and “droptime”.

df['pickuptime'] = pd.to_datetime(df['pickup'])
df['droptime'] = pd.to_datetime(df['dropoff'])
df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan 2019-03-23 20:21:09 2019-03-23 20:27:24
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan 2019-03-04 16:11:55 2019-03-04 16:19:00
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan 2019-03-27 17:53:01 2019-03-27 18:00:25
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan 2019-03-10 01:23:59 2019-03-10 01:49:51
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan 2019-03-30 13:27:42 2019-03-30 13:37:14
  • Make a new column, “day”, containing the day of the week of the pickup. Use the accessor dt and then the method day_name.

If we try to call day_name on the Series directly, even though it contains datetime values, we get an error. We need to use the dt accessor to apply these kind of methods to the entries in the Series.

df['pickuptime'].day_name()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [6], line 1
----> 1 df['pickuptime'].day_name()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
   5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464     return self[name]
-> 5465 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'day_name'
df['day'] = df['pickuptime'].dt.day_name() #accessor dt
df['day']
0        Saturday
1          Monday
2       Wednesday
3          Sunday
4        Saturday
          ...    
6428       Sunday
6429       Sunday
6430     Saturday
6431       Monday
6432    Wednesday
Name: day, Length: 6433, dtype: object
help(df['pickuptime'].dt)
 |          - 'raise' will raise an NonExistentTimeError if there are
 |            nonexistent times.
 |      
 |          .. versionadded:: 0.24.0
 |      
 |      Returns
 |      -------
 |      DatetimeIndex, TimedeltaIndex, or Series
 |          Index of the same type for a DatetimeIndex or TimedeltaIndex,
 |          or a Series with the same index for a Series.
 |      
 |      Raises
 |      ------
 |      ValueError if the `freq` cannot be converted.
 |      
 |      Examples
 |      --------
 |      **DatetimeIndex**
 |      
 |      >>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
 |      >>> rng
 |      DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
 |                     '2018-01-01 12:01:00'],
 |                    dtype='datetime64[ns]', freq='T')
 |      >>> rng.round('H')
 |      DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
 |                     '2018-01-01 12:00:00'],
 |                    dtype='datetime64[ns]', freq=None)
 |      
 |      **Series**
 |      
 |      >>> pd.Series(rng).dt.round("H")
 |      0   2018-01-01 12:00:00
 |      1   2018-01-01 12:00:00
 |      2   2018-01-01 12:00:00
 |      dtype: datetime64[ns]
 |  
 |  strftime(self, *args, **kwargs)
 |      Convert to Index using specified date_format.
 |      
 |      Return an Index of formatted strings specified by date_format, which
 |      supports the same string format as the python standard library. Details
 |      of the string format can be found in `python string format
 |      doc <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior>`__.
 |      
 |      Parameters
 |      ----------
 |      date_format : str
 |          Date format string (e.g. "%Y-%m-%d").
 |      
 |      Returns
 |      -------
 |      ndarray
 |          NumPy ndarray of formatted strings.
 |      
 |      See Also
 |      --------
 |      to_datetime : Convert the given argument to datetime.
 |      DatetimeIndex.normalize : Return DatetimeIndex with times to midnight.
 |      DatetimeIndex.round : Round the DatetimeIndex to the specified freq.
 |      DatetimeIndex.floor : Floor the DatetimeIndex to the specified freq.
 |      
 |      Examples
 |      --------
 |      >>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
 |      ...                     periods=3, freq='s')
 |      >>> rng.strftime('%B %d, %Y, %r')
 |      Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
 |             'March 10, 2018, 09:00:02 AM'],
 |            dtype='object')
 |  
 |  to_period(self, *args, **kwargs)
 |      Cast to PeriodArray/Index at a particular frequency.
 |      
 |      Converts DatetimeArray/Index to PeriodArray/Index.
 |      
 |      Parameters
 |      ----------
 |      freq : str or Offset, optional
 |          One of pandas' :ref:`offset strings <timeseries.offset_aliases>`
 |          or an Offset object. Will be inferred by default.
 |      
 |      Returns
 |      -------
 |      PeriodArray/Index
 |      
 |      Raises
 |      ------
 |      ValueError
 |          When converting a DatetimeArray/Index with non-regular values,
 |          so that a frequency cannot be inferred.
 |      
 |      See Also
 |      --------
 |      PeriodIndex: Immutable ndarray holding ordinal values.
 |      DatetimeIndex.to_pydatetime: Return DatetimeIndex as object.
 |      
 |      Examples
 |      --------
 |      >>> df = pd.DataFrame({"y": [1, 2, 3]},
 |      ...                   index=pd.to_datetime(["2000-03-31 00:00:00",
 |      ...                                         "2000-05-31 00:00:00",
 |      ...                                         "2000-08-31 00:00:00"]))
 |      >>> df.index.to_period("M")
 |      PeriodIndex(['2000-03', '2000-05', '2000-08'],
 |                  dtype='period[M]', freq='M')
 |      
 |      Infer the daily frequency
 |      
 |      >>> idx = pd.date_range("2017-01-01", periods=2)
 |      >>> idx.to_period()
 |      PeriodIndex(['2017-01-01', '2017-01-02'],
 |                  dtype='period[D]', freq='D')
 |  
 |  to_pydatetime(self) -> numpy.ndarray
 |      Return the data as an array of native Python datetime objects.
 |      
 |      Timezone information is retained if present.
 |      
 |      .. warning::
 |      
 |         Python's datetime uses microsecond resolution, which is lower than
 |         pandas (nanosecond). The values are truncated.
 |      
 |      Returns
 |      -------
 |      numpy.ndarray
 |          Object dtype array containing native Python datetime objects.
 |      
 |      See Also
 |      --------
 |      datetime.datetime : Standard library value for a datetime.
 |      
 |      Examples
 |      --------
 |      >>> s = pd.Series(pd.date_range('20180310', periods=2))
 |      >>> s
 |      0   2018-03-10
 |      1   2018-03-11
 |      dtype: datetime64[ns]
 |      
 |      >>> s.dt.to_pydatetime()
 |      array([datetime.datetime(2018, 3, 10, 0, 0),
 |             datetime.datetime(2018, 3, 11, 0, 0)], dtype=object)
 |      
 |      pandas' nanosecond precision is truncated to microseconds.
 |      
 |      >>> s = pd.Series(pd.date_range('20180310', periods=2, freq='ns'))
 |      >>> s
 |      0   2018-03-10 00:00:00.000000000
 |      1   2018-03-10 00:00:00.000000001
 |      dtype: datetime64[ns]
 |      
 |      >>> s.dt.to_pydatetime()
 |      array([datetime.datetime(2018, 3, 10, 0, 0),
 |             datetime.datetime(2018, 3, 10, 0, 0)], dtype=object)
 |  
 |  tz_convert(self, *args, **kwargs)
 |      Convert tz-aware Datetime Array/Index from one time zone to another.
 |      
 |      Parameters
 |      ----------
 |      tz : str, pytz.timezone, dateutil.tz.tzfile or None
 |          Time zone for time. Corresponding timestamps would be converted
 |          to this time zone of the Datetime Array/Index. A `tz` of None will
 |          convert to UTC and remove the timezone information.
 |      
 |      Returns
 |      -------
 |      Array or Index
 |      
 |      Raises
 |      ------
 |      TypeError
 |          If Datetime Array/Index is tz-naive.
 |      
 |      See Also
 |      --------
 |      DatetimeIndex.tz : A timezone that has a variable offset from UTC.
 |      DatetimeIndex.tz_localize : Localize tz-naive DatetimeIndex to a
 |          given time zone, or remove timezone from a tz-aware DatetimeIndex.
 |      
 |      Examples
 |      --------
 |      With the `tz` parameter, we can change the DatetimeIndex
 |      to other time zones:
 |      
 |      >>> dti = pd.date_range(start='2014-08-01 09:00',
 |      ...                     freq='H', periods=3, tz='Europe/Berlin')
 |      
 |      >>> dti
 |      DatetimeIndex(['2014-08-01 09:00:00+02:00',
 |                     '2014-08-01 10:00:00+02:00',
 |                     '2014-08-01 11:00:00+02:00'],
 |                    dtype='datetime64[ns, Europe/Berlin]', freq='H')
 |      
 |      >>> dti.tz_convert('US/Central')
 |      DatetimeIndex(['2014-08-01 02:00:00-05:00',
 |                     '2014-08-01 03:00:00-05:00',
 |                     '2014-08-01 04:00:00-05:00'],
 |                    dtype='datetime64[ns, US/Central]', freq='H')
 |      
 |      With the ``tz=None``, we can remove the timezone (after converting
 |      to UTC if necessary):
 |      
 |      >>> dti = pd.date_range(start='2014-08-01 09:00', freq='H',
 |      ...                     periods=3, tz='Europe/Berlin')
 |      
 |      >>> dti
 |      DatetimeIndex(['2014-08-01 09:00:00+02:00',
 |                     '2014-08-01 10:00:00+02:00',
 |                     '2014-08-01 11:00:00+02:00'],
 |                      dtype='datetime64[ns, Europe/Berlin]', freq='H')
 |      
 |      >>> dti.tz_convert(None)
 |      DatetimeIndex(['2014-08-01 07:00:00',
 |                     '2014-08-01 08:00:00',
 |                     '2014-08-01 09:00:00'],
 |                      dtype='datetime64[ns]', freq='H')
 |  
 |  tz_localize(self, *args, **kwargs)
 |      Localize tz-naive Datetime Array/Index to tz-aware
 |      Datetime Array/Index.
 |      
 |      This method takes a time zone (tz) naive Datetime Array/Index object
 |      and makes this time zone aware. It does not move the time to another
 |      time zone.
 |      Time zone localization helps to switch from time zone aware to time
 |      zone unaware objects.
 |      
 |      Parameters
 |      ----------
 |      tz : str, pytz.timezone, dateutil.tz.tzfile or None
 |          Time zone to convert timestamps to. Passing ``None`` will
 |          remove the time zone information preserving local time.
 |      ambiguous : 'infer', 'NaT', bool array, default 'raise'
 |          When clocks moved backward due to DST, ambiguous times may arise.
 |          For example in Central European Time (UTC+01), when going from
 |          03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
 |          00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
 |          `ambiguous` parameter dictates how ambiguous times should be
 |          handled.
 |      
 |          - 'infer' will attempt to infer fall dst-transition hours based on
 |            order
 |          - bool-ndarray where True signifies a DST time, False signifies a
 |            non-DST time (note that this flag is only applicable for
 |            ambiguous times)
 |          - 'NaT' will return NaT where there are ambiguous times
 |          - 'raise' will raise an AmbiguousTimeError if there are ambiguous
 |            times.
 |      
 |      nonexistent : 'shift_forward', 'shift_backward, 'NaT', timedelta, default 'raise'
 |          A nonexistent time does not exist in a particular timezone
 |          where clocks moved forward due to DST.
 |      
 |          - 'shift_forward' will shift the nonexistent time forward to the
 |            closest existing time
 |          - 'shift_backward' will shift the nonexistent time backward to the
 |            closest existing time
 |          - 'NaT' will return NaT where there are nonexistent times
 |          - timedelta objects will shift nonexistent times by the timedelta
 |          - 'raise' will raise an NonExistentTimeError if there are
 |            nonexistent times.
 |      
 |          .. versionadded:: 0.24.0
 |      
 |      Returns
 |      -------
 |      Same type as self
 |          Array/Index converted to the specified time zone.
 |      
 |      Raises
 |      ------
 |      TypeError
 |          If the Datetime Array/Index is tz-aware and tz is not None.
 |      
 |      See Also
 |      --------
 |      DatetimeIndex.tz_convert : Convert tz-aware DatetimeIndex from
 |          one time zone to another.
 |      
 |      Examples
 |      --------
 |      >>> tz_naive = pd.date_range('2018-03-01 09:00', periods=3)
 |      >>> tz_naive
 |      DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
 |                     '2018-03-03 09:00:00'],
 |                    dtype='datetime64[ns]', freq='D')
 |      
 |      Localize DatetimeIndex in US/Eastern time zone:
 |      
 |      >>> tz_aware = tz_naive.tz_localize(tz='US/Eastern')
 |      >>> tz_aware
 |      DatetimeIndex(['2018-03-01 09:00:00-05:00',
 |                     '2018-03-02 09:00:00-05:00',
 |                     '2018-03-03 09:00:00-05:00'],
 |                    dtype='datetime64[ns, US/Eastern]', freq=None)
 |      
 |      With the ``tz=None``, we can remove the time zone information
 |      while keeping the local time (not converted to UTC):
 |      
 |      >>> tz_aware.tz_localize(None)
 |      DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
 |                     '2018-03-03 09:00:00'],
 |                    dtype='datetime64[ns]', freq=None)
 |      
 |      Be careful with DST changes. When there is sequential data, pandas can
 |      infer the DST time:
 |      
 |      >>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00',
 |      ...                               '2018-10-28 02:00:00',
 |      ...                               '2018-10-28 02:30:00',
 |      ...                               '2018-10-28 02:00:00',
 |      ...                               '2018-10-28 02:30:00',
 |      ...                               '2018-10-28 03:00:00',
 |      ...                               '2018-10-28 03:30:00']))
 |      >>> s.dt.tz_localize('CET', ambiguous='infer')
 |      0   2018-10-28 01:30:00+02:00
 |      1   2018-10-28 02:00:00+02:00
 |      2   2018-10-28 02:30:00+02:00
 |      3   2018-10-28 02:00:00+01:00
 |      4   2018-10-28 02:30:00+01:00
 |      5   2018-10-28 03:00:00+01:00
 |      6   2018-10-28 03:30:00+01:00
 |      dtype: datetime64[ns, CET]
 |      
 |      In some cases, inferring the DST is impossible. In such cases, you can
 |      pass an ndarray to the ambiguous parameter to set the DST explicitly
 |      
 |      >>> s = pd.to_datetime(pd.Series(['2018-10-28 01:20:00',
 |      ...                               '2018-10-28 02:36:00',
 |      ...                               '2018-10-28 03:46:00']))
 |      >>> s.dt.tz_localize('CET', ambiguous=np.array([True, True, False]))
 |      0   2018-10-28 01:20:00+02:00
 |      1   2018-10-28 02:36:00+02:00
 |      2   2018-10-28 03:46:00+01:00
 |      dtype: datetime64[ns, CET]
 |      
 |      If the DST transition causes nonexistent times, you can shift these
 |      dates forward or backwards with a timedelta object or `'shift_forward'`
 |      or `'shift_backwards'`.
 |      
 |      >>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
 |      ...                               '2015-03-29 03:30:00']))
 |      >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
 |      0   2015-03-29 03:00:00+02:00
 |      1   2015-03-29 03:30:00+02:00
 |      dtype: datetime64[ns, Europe/Warsaw]
 |      
 |      >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
 |      0   2015-03-29 01:59:59.999999999+01:00
 |      1   2015-03-29 03:30:00+02:00
 |      dtype: datetime64[ns, Europe/Warsaw]
 |      
 |      >>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
 |      0   2015-03-29 03:30:00+02:00
 |      1   2015-03-29 03:30:00+02:00
 |      dtype: datetime64[ns, Europe/Warsaw]
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties defined here:
 |  
 |  freq
 |  
 |  week
 |      The week ordinal of the year.
 |      
 |      .. deprecated:: 1.1.0
 |      
 |      Series.dt.weekofyear and Series.dt.week have been deprecated.
 |      Please use Series.dt.isocalendar().week instead.
 |  
 |  weekofyear
 |      The week ordinal of the year.
 |      
 |      .. deprecated:: 1.1.0
 |      
 |      Series.dt.weekofyear and Series.dt.week have been deprecated.
 |      Please use Series.dt.isocalendar().week instead.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  date
 |      Returns numpy array of python datetime.date objects (namely, the date
 |      part of Timestamps without timezone information).
 |  
 |  day
 |      The day of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="D")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01
 |      1   2000-01-02
 |      2   2000-01-03
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.day
 |      0    1
 |      1    2
 |      2    3
 |      dtype: int64
 |  
 |  day_of_week
 |      The day of the week with Monday=0, Sunday=6.
 |      
 |      Return the day of the week. It is assumed the week starts on
 |      Monday, which is denoted by 0 and ends on Sunday which is denoted
 |      by 6. This method is available on both Series with datetime
 |      values (using the `dt` accessor) or DatetimeIndex.
 |      
 |      Returns
 |      -------
 |      Series or Index
 |          Containing integers indicating the day number.
 |      
 |      See Also
 |      --------
 |      Series.dt.dayofweek : Alias.
 |      Series.dt.weekday : Alias.
 |      Series.dt.day_name : Returns the name of the day of the week.
 |      
 |      Examples
 |      --------
 |      >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
 |      >>> s.dt.dayofweek
 |      2016-12-31    5
 |      2017-01-01    6
 |      2017-01-02    0
 |      2017-01-03    1
 |      2017-01-04    2
 |      2017-01-05    3
 |      2017-01-06    4
 |      2017-01-07    5
 |      2017-01-08    6
 |      Freq: D, dtype: int64
 |  
 |  day_of_year
 |      The ordinal day of the year.
 |  
 |  dayofweek
 |      The day of the week with Monday=0, Sunday=6.
 |      
 |      Return the day of the week. It is assumed the week starts on
 |      Monday, which is denoted by 0 and ends on Sunday which is denoted
 |      by 6. This method is available on both Series with datetime
 |      values (using the `dt` accessor) or DatetimeIndex.
 |      
 |      Returns
 |      -------
 |      Series or Index
 |          Containing integers indicating the day number.
 |      
 |      See Also
 |      --------
 |      Series.dt.dayofweek : Alias.
 |      Series.dt.weekday : Alias.
 |      Series.dt.day_name : Returns the name of the day of the week.
 |      
 |      Examples
 |      --------
 |      >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
 |      >>> s.dt.dayofweek
 |      2016-12-31    5
 |      2017-01-01    6
 |      2017-01-02    0
 |      2017-01-03    1
 |      2017-01-04    2
 |      2017-01-05    3
 |      2017-01-06    4
 |      2017-01-07    5
 |      2017-01-08    6
 |      Freq: D, dtype: int64
 |  
 |  dayofyear
 |      The ordinal day of the year.
 |  
 |  days_in_month
 |      The number of days in the month.
 |  
 |  daysinmonth
 |      The number of days in the month.
 |  
 |  hour
 |      The hours of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="h")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01 00:00:00
 |      1   2000-01-01 01:00:00
 |      2   2000-01-01 02:00:00
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.hour
 |      0    0
 |      1    1
 |      2    2
 |      dtype: int64
 |  
 |  is_leap_year
 |      Boolean indicator if the date belongs to a leap year.
 |      
 |      A leap year is a year, which has 366 days (instead of 365) including
 |      29th of February as an intercalary day.
 |      Leap years are years which are multiples of four with the exception
 |      of years divisible by 100 but not by 400.
 |      
 |      Returns
 |      -------
 |      Series or ndarray
 |           Booleans indicating if dates belong to a leap year.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y")
 |      >>> idx
 |      DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31'],
 |                    dtype='datetime64[ns]', freq='A-DEC')
 |      >>> idx.is_leap_year
 |      array([ True, False, False])
 |      
 |      >>> dates_series = pd.Series(idx)
 |      >>> dates_series
 |      0   2012-12-31
 |      1   2013-12-31
 |      2   2014-12-31
 |      dtype: datetime64[ns]
 |      >>> dates_series.dt.is_leap_year
 |      0     True
 |      1    False
 |      2    False
 |      dtype: bool
 |  
 |  is_month_end
 |      Indicates whether the date is the last day of the month.
 |      
 |      Returns
 |      -------
 |      Series or array
 |          For Series, returns a Series with boolean values.
 |          For DatetimeIndex, returns a boolean array.
 |      
 |      See Also
 |      --------
 |      is_month_start : Return a boolean indicating whether the date
 |          is the first day of the month.
 |      is_month_end : Return a boolean indicating whether the date
 |          is the last day of the month.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
 |      >>> s
 |      0   2018-02-27
 |      1   2018-02-28
 |      2   2018-03-01
 |      dtype: datetime64[ns]
 |      >>> s.dt.is_month_start
 |      0    False
 |      1    False
 |      2    True
 |      dtype: bool
 |      >>> s.dt.is_month_end
 |      0    False
 |      1    True
 |      2    False
 |      dtype: bool
 |      
 |      >>> idx = pd.date_range("2018-02-27", periods=3)
 |      >>> idx.is_month_start
 |      array([False, False, True])
 |      >>> idx.is_month_end
 |      array([False, True, False])
 |  
 |  is_month_start
 |      Indicates whether the date is the first day of the month.
 |      
 |      Returns
 |      -------
 |      Series or array
 |          For Series, returns a Series with boolean values.
 |          For DatetimeIndex, returns a boolean array.
 |      
 |      See Also
 |      --------
 |      is_month_start : Return a boolean indicating whether the date
 |          is the first day of the month.
 |      is_month_end : Return a boolean indicating whether the date
 |          is the last day of the month.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
 |      >>> s
 |      0   2018-02-27
 |      1   2018-02-28
 |      2   2018-03-01
 |      dtype: datetime64[ns]
 |      >>> s.dt.is_month_start
 |      0    False
 |      1    False
 |      2    True
 |      dtype: bool
 |      >>> s.dt.is_month_end
 |      0    False
 |      1    True
 |      2    False
 |      dtype: bool
 |      
 |      >>> idx = pd.date_range("2018-02-27", periods=3)
 |      >>> idx.is_month_start
 |      array([False, False, True])
 |      >>> idx.is_month_end
 |      array([False, True, False])
 |  
 |  is_quarter_end
 |      Indicator for whether the date is the last day of a quarter.
 |      
 |      Returns
 |      -------
 |      is_quarter_end : Series or DatetimeIndex
 |          The same type as the original data with boolean values. Series will
 |          have the same name and index. DatetimeIndex will have the same
 |          name.
 |      
 |      See Also
 |      --------
 |      quarter : Return the quarter of the date.
 |      is_quarter_start : Similar property indicating the quarter start.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
 |      ...                    periods=4)})
 |      >>> df.assign(quarter=df.dates.dt.quarter,
 |      ...           is_quarter_end=df.dates.dt.is_quarter_end)
 |             dates  quarter    is_quarter_end
 |      0 2017-03-30        1             False
 |      1 2017-03-31        1              True
 |      2 2017-04-01        2             False
 |      3 2017-04-02        2             False
 |      
 |      >>> idx = pd.date_range('2017-03-30', periods=4)
 |      >>> idx
 |      DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
 |                    dtype='datetime64[ns]', freq='D')
 |      
 |      >>> idx.is_quarter_end
 |      array([False,  True, False, False])
 |  
 |  is_quarter_start
 |      Indicator for whether the date is the first day of a quarter.
 |      
 |      Returns
 |      -------
 |      is_quarter_start : Series or DatetimeIndex
 |          The same type as the original data with boolean values. Series will
 |          have the same name and index. DatetimeIndex will have the same
 |          name.
 |      
 |      See Also
 |      --------
 |      quarter : Return the quarter of the date.
 |      is_quarter_end : Similar property for indicating the quarter start.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
 |      ...                   periods=4)})
 |      >>> df.assign(quarter=df.dates.dt.quarter,
 |      ...           is_quarter_start=df.dates.dt.is_quarter_start)
 |             dates  quarter  is_quarter_start
 |      0 2017-03-30        1             False
 |      1 2017-03-31        1             False
 |      2 2017-04-01        2              True
 |      3 2017-04-02        2             False
 |      
 |      >>> idx = pd.date_range('2017-03-30', periods=4)
 |      >>> idx
 |      DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
 |                    dtype='datetime64[ns]', freq='D')
 |      
 |      >>> idx.is_quarter_start
 |      array([False, False,  True, False])
 |  
 |  is_year_end
 |      Indicate whether the date is the last day of the year.
 |      
 |      Returns
 |      -------
 |      Series or DatetimeIndex
 |          The same type as the original data with boolean values. Series will
 |          have the same name and index. DatetimeIndex will have the same
 |          name.
 |      
 |      See Also
 |      --------
 |      is_year_start : Similar property indicating the start of the year.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
 |      >>> dates
 |      0   2017-12-30
 |      1   2017-12-31
 |      2   2018-01-01
 |      dtype: datetime64[ns]
 |      
 |      >>> dates.dt.is_year_end
 |      0    False
 |      1     True
 |      2    False
 |      dtype: bool
 |      
 |      >>> idx = pd.date_range("2017-12-30", periods=3)
 |      >>> idx
 |      DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
 |                    dtype='datetime64[ns]', freq='D')
 |      
 |      >>> idx.is_year_end
 |      array([False,  True, False])
 |  
 |  is_year_start
 |      Indicate whether the date is the first day of a year.
 |      
 |      Returns
 |      -------
 |      Series or DatetimeIndex
 |          The same type as the original data with boolean values. Series will
 |          have the same name and index. DatetimeIndex will have the same
 |          name.
 |      
 |      See Also
 |      --------
 |      is_year_end : Similar property indicating the last day of the year.
 |      
 |      Examples
 |      --------
 |      This method is available on Series with datetime values under
 |      the ``.dt`` accessor, and directly on DatetimeIndex.
 |      
 |      >>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
 |      >>> dates
 |      0   2017-12-30
 |      1   2017-12-31
 |      2   2018-01-01
 |      dtype: datetime64[ns]
 |      
 |      >>> dates.dt.is_year_start
 |      0    False
 |      1    False
 |      2    True
 |      dtype: bool
 |      
 |      >>> idx = pd.date_range("2017-12-30", periods=3)
 |      >>> idx
 |      DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
 |                    dtype='datetime64[ns]', freq='D')
 |      
 |      >>> idx.is_year_start
 |      array([False, False,  True])
 |  
 |  microsecond
 |      The microseconds of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="us")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01 00:00:00.000000
 |      1   2000-01-01 00:00:00.000001
 |      2   2000-01-01 00:00:00.000002
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.microsecond
 |      0       0
 |      1       1
 |      2       2
 |      dtype: int64
 |  
 |  minute
 |      The minutes of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="T")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01 00:00:00
 |      1   2000-01-01 00:01:00
 |      2   2000-01-01 00:02:00
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.minute
 |      0    0
 |      1    1
 |      2    2
 |      dtype: int64
 |  
 |  month
 |      The month as January=1, December=12.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="M")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-31
 |      1   2000-02-29
 |      2   2000-03-31
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.month
 |      0    1
 |      1    2
 |      2    3
 |      dtype: int64
 |  
 |  nanosecond
 |      The nanoseconds of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="ns")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01 00:00:00.000000000
 |      1   2000-01-01 00:00:00.000000001
 |      2   2000-01-01 00:00:00.000000002
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.nanosecond
 |      0       0
 |      1       1
 |      2       2
 |      dtype: int64
 |  
 |  quarter
 |      The quarter of the date.
 |  
 |  second
 |      The seconds of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="s")
 |      ... )
 |      >>> datetime_series
 |      0   2000-01-01 00:00:00
 |      1   2000-01-01 00:00:01
 |      2   2000-01-01 00:00:02
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.second
 |      0    0
 |      1    1
 |      2    2
 |      dtype: int64
 |  
 |  time
 |      Returns numpy array of datetime.time. The time part of the Timestamps.
 |  
 |  timetz
 |      Returns numpy array of datetime.time also containing timezone
 |      information. The time part of the Timestamps.
 |  
 |  tz
 |      Return timezone, if any.
 |      
 |      Returns
 |      -------
 |      datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None
 |          Returns None when the array is tz-naive.
 |  
 |  weekday
 |      The day of the week with Monday=0, Sunday=6.
 |      
 |      Return the day of the week. It is assumed the week starts on
 |      Monday, which is denoted by 0 and ends on Sunday which is denoted
 |      by 6. This method is available on both Series with datetime
 |      values (using the `dt` accessor) or DatetimeIndex.
 |      
 |      Returns
 |      -------
 |      Series or Index
 |          Containing integers indicating the day number.
 |      
 |      See Also
 |      --------
 |      Series.dt.dayofweek : Alias.
 |      Series.dt.weekday : Alias.
 |      Series.dt.day_name : Returns the name of the day of the week.
 |      
 |      Examples
 |      --------
 |      >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
 |      >>> s.dt.dayofweek
 |      2016-12-31    5
 |      2017-01-01    6
 |      2017-01-02    0
 |      2017-01-03    1
 |      2017-01-04    2
 |      2017-01-05    3
 |      2017-01-06    4
 |      2017-01-07    5
 |      2017-01-08    6
 |      Freq: D, dtype: int64
 |  
 |  year
 |      The year of the datetime.
 |      
 |      Examples
 |      --------
 |      >>> datetime_series = pd.Series(
 |      ...     pd.date_range("2000-01-01", periods=3, freq="Y")
 |      ... )
 |      >>> datetime_series
 |      0   2000-12-31
 |      1   2001-12-31
 |      2   2002-12-31
 |      dtype: datetime64[ns]
 |      >>> datetime_series.dt.year
 |      0    2000
 |      1    2001
 |      2    2002
 |      dtype: int64
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from Properties:
 |  
 |  __init__(self, data: 'Series', orig)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from pandas.core.accessor.PandasDelegate:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.base.PandasObject:
 |  
 |  __repr__(self) -> str
 |      Return a string representation for a particular object.
 |  
 |  __sizeof__(self)
 |      Generates the total memory usage for an object that returns
 |      either a value or Series of values
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from pandas.core.base.PandasObject:
 |  
 |  __annotations__ = {'_cache': typing.Dict[str, typing.Any]}
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.accessor.DirNamesMixin:
 |  
 |  __dir__(self) -> List[str]
 |      Provide method name lookup and completion.
 |      
 |      Notes
 |      -----
 |      Only provide 'public' methods.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.base.NoNewAttributesMixin:
 |  
 |  __setattr__(self, key: str, value)
 |      Implement setattr(self, name, value).
  • Make a new column “duration” containing the amount of time the ride took. (Now that pandas recognizes these as dates, you can just subtract.)

This sort of subtraction of times would never work if we hadn’t converted them to datetime types using pd.to_datetime.

df["duration"] = df["droptime"] - df["pickuptime"]
df["duration"]
0      0 days 00:06:15
1      0 days 00:07:05
2      0 days 00:07:24
3      0 days 00:25:52
4      0 days 00:09:32
             ...      
6428   0 days 00:03:34
6429   0 days 00:56:23
6430   0 days 00:19:07
6431   0 days 00:05:04
6432   0 days 00:16:40
Name: duration, Length: 6433, dtype: timedelta64[ns]

Check out the first three rows and first two columns. Does the top row look like a 6 minute taxi ride? Yes, because it went from 8:21pm to 8:27pm.

df.iloc[:3,:2] 
pickup dropoff
0 2019-03-23 20:21:09 2019-03-23 20:27:24
1 2019-03-04 16:11:55 2019-03-04 16:19:00
2 2019-03-27 17:53:01 2019-03-27 18:00:25
  • What are the data types of these four columns we just made?

Notice how the first two we made are called “datetime64[ns]”, whereas the last one is called a “timedelta64[ns]”. When you see “delta”, you should usually think “difference”, and that’s exactly what this column represents. The middle column, “day”, is reported as object, because they are just strings like "Monday".

df.dtypes
pickup                      object
dropoff                     object
passengers                   int64
distance                   float64
fare                       float64
tip                        float64
tolls                      float64
total                      float64
color                       object
payment                     object
pickup_zone                 object
dropoff_zone                object
pickup_borough              object
dropoff_borough             object
pickuptime          datetime64[ns]
droptime            datetime64[ns]
day                         object
duration           timedelta64[ns]
dtype: object

Working with strings in pandas#

  • How many rows in df have the word “Airport” in either the “pickup_zone” column or the “dropoff_zone” column? Do any rows have the word “Airport” in both columns?

Use the str accessor and the contains method.

df[["pickup_zone","dropoff_zone"]]
pickup_zone dropoff_zone
0 Lenox Hill West UN/Turtle Bay South
1 Upper West Side South Upper West Side South
2 Alphabet City West Village
3 Hudson Sq Yorkville West
4 Midtown East Yorkville West
... ... ...
6428 East Harlem North Central Harlem North
6429 Jamaica East Concourse/Concourse Village
6430 Crown Heights North Bushwick North
6431 East New York East Flatbush/Remsen Village
6432 Boerum Hill Windsor Terrace

6433 rows Ă— 2 columns

df['pickup_zone'][:3] #check out the first three entries in the “pickup_zone” column
0          Lenox Hill West
1    Upper West Side South
2            Alphabet City
Name: pickup_zone, dtype: object

An example of how str.contains works:

df['pickup_zone'][:3].str.contains('City')
0    False
1    False
2     True
Name: pickup_zone, dtype: bool

Here we use Boolean indexing to find all the rows where either the “pickup_zone” or the “dropoff_zone” column contains the substring “Airport”. Notice how the logical or is made using a vertical bar. ( Spelled out or is used in some contexts in Python, but not with NumPy Boolean arrays or pandas Boolean Series.)

ps = df['pickup_zone'].str.contains("Airport")
ds = df['dropoff_zone'].str.contains("Airport")
df[ps | ds] #or  #df[ps & ds] #and
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
10 2019-03-16 10:02:25 2019-03-16 10:22:29 1 3.90 17.00 0.00 0.00 17.80 yellow cash LaGuardia Airport Astoria Queens Queens 2019-03-16 10:02:25 2019-03-16 10:22:29 Saturday 0 days 00:20:04
22 2019-03-28 17:20:54 2019-03-28 17:55:57 1 9.82 31.50 8.31 5.76 49.87 yellow credit card LaGuardia Airport Sutton Place/Turtle Bay North Queens Manhattan 2019-03-28 17:20:54 2019-03-28 17:55:57 Thursday 0 days 00:35:03
43 2019-03-27 19:39:16 2019-03-27 20:00:26 1 9.50 28.00 0.00 5.76 35.06 yellow cash LaGuardia Airport Washington Heights South Queens Manhattan 2019-03-27 19:39:16 2019-03-27 20:00:26 Wednesday 0 days 00:21:10
53 2019-03-20 18:49:24 2019-03-20 19:12:40 1 13.45 36.00 0.00 0.00 37.80 yellow cash JFK Airport Sunnyside Queens Queens 2019-03-20 18:49:24 2019-03-20 19:12:40 Wednesday 0 days 00:23:16
56 2019-03-05 17:57:00 2019-03-05 18:45:25 2 19.59 52.00 0.00 5.76 65.56 yellow credit card JFK Airport Upper East Side North Queens Manhattan 2019-03-05 17:57:00 2019-03-05 18:45:25 Tuesday 0 days 00:48:25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6243 2019-03-12 11:40:49 2019-03-12 11:49:23 1 3.50 12.00 0.00 0.00 12.80 green cash Steinway LaGuardia Airport Queens Queens 2019-03-12 11:40:49 2019-03-12 11:49:23 Tuesday 0 days 00:08:34
6280 2019-03-27 12:25:04 2019-03-27 12:31:15 1 2.74 10.00 0.00 0.00 10.80 green credit card Steinway LaGuardia Airport Queens Queens 2019-03-27 12:25:04 2019-03-27 12:31:15 Wednesday 0 days 00:06:11
6322 2019-03-01 17:03:11 2019-03-01 17:17:11 1 5.39 17.00 3.76 0.00 22.56 green credit card Jamaica JFK Airport Queens Queens 2019-03-01 17:03:11 2019-03-01 17:17:11 Friday 0 days 00:14:00
6396 2019-03-15 17:44:09 2019-03-15 18:20:25 1 11.44 41.63 0.00 0.00 42.13 green credit card Bushwick South LaGuardia Airport Brooklyn Queens 2019-03-15 17:44:09 2019-03-15 18:20:25 Friday 0 days 00:36:16
6398 2019-03-17 13:07:29 2019-03-17 13:51:56 1 18.39 50.50 7.70 0.00 59.00 green credit card DUMBO/Vinegar Hill JFK Airport Brooklyn Queens 2019-03-17 13:07:29 2019-03-17 13:51:56 Sunday 0 days 00:44:27

407 rows Ă— 18 columns

Missing values in pandas#

  • Look at df.info() again. Which columns have missing values?

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype          
---  ------           --------------  -----          
 0   pickup           6433 non-null   object         
 1   dropoff          6433 non-null   object         
 2   passengers       6433 non-null   int64          
 3   distance         6433 non-null   float64        
 4   fare             6433 non-null   float64        
 5   tip              6433 non-null   float64        
 6   tolls            6433 non-null   float64        
 7   total            6433 non-null   float64        
 8   color            6433 non-null   object         
 9   payment          6389 non-null   object         
 10  pickup_zone      6407 non-null   object         
 11  dropoff_zone     6388 non-null   object         
 12  pickup_borough   6407 non-null   object         
 13  dropoff_borough  6388 non-null   object         
 14  pickuptime       6433 non-null   datetime64[ns] 
 15  droptime         6433 non-null   datetime64[ns] 
 16  day              6433 non-null   object         
 17  duration         6433 non-null   timedelta64[ns]
dtypes: datetime64[ns](2), float64(5), int64(1), object(9), timedelta64[ns](1)
memory usage: 904.8+ KB
  • Define df_sub to be the sub-DataFrame containing all the rows for which the “pickup_zone” entry is missing. Use the isna method.

df_sub = df[df["pickup_zone"].isna()]
df_sub
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
42 2019-03-30 23:59:14 2019-03-30 23:59:17 1 0.00 80.0 20.08 0.00 100.38 yellow credit card NaN NaN NaN NaN 2019-03-30 23:59:14 2019-03-30 23:59:17 Saturday 0 days 00:00:03
606 2019-03-19 15:04:40 2019-03-19 15:06:15 1 0.40 3.5 1.35 0.00 8.15 yellow credit card NaN Upper East Side South NaN Manhattan 2019-03-19 15:04:40 2019-03-19 15:06:15 Tuesday 0 days 00:01:35
622 2019-03-12 19:52:36 2019-03-12 19:52:55 3 0.00 120.0 33.20 12.50 166.00 yellow credit card NaN NaN NaN NaN 2019-03-12 19:52:36 2019-03-12 19:52:55 Tuesday 0 days 00:00:19
671 2019-03-09 22:45:46 2019-03-09 22:46:50 1 0.40 3.5 0.00 0.00 7.30 yellow cash NaN Lincoln Square East NaN Manhattan 2019-03-09 22:45:46 2019-03-09 22:46:50 Saturday 0 days 00:01:04
712 2019-03-03 01:24:45 2019-03-03 01:29:32 1 0.80 5.0 1.75 0.00 10.55 yellow credit card NaN West Chelsea/Hudson Yards NaN Manhattan 2019-03-03 01:24:45 2019-03-03 01:29:32 Sunday 0 days 00:04:47
970 2019-03-29 00:02:00 2019-03-29 00:02:05 2 0.00 45.0 5.00 0.00 50.30 yellow credit card NaN NaN NaN NaN 2019-03-29 00:02:00 2019-03-29 00:02:05 Friday 0 days 00:00:05
1108 2019-03-10 09:33:13 2019-03-10 09:52:04 1 13.40 52.0 11.70 5.76 70.26 yellow credit card NaN NaN NaN NaN 2019-03-10 09:33:13 2019-03-10 09:52:04 Sunday 0 days 00:18:51
1961 2019-03-19 16:43:59 2019-03-19 17:37:22 1 17.82 52.0 13.11 5.76 78.67 yellow credit card NaN Garment District NaN Manhattan 2019-03-19 16:43:59 2019-03-19 17:37:22 Tuesday 0 days 00:53:23
2137 2019-03-08 13:27:32 2019-03-08 13:46:23 1 1.40 12.0 3.05 0.00 18.35 yellow credit card NaN NaN NaN NaN 2019-03-08 13:27:32 2019-03-08 13:46:23 Friday 0 days 00:18:51
2742 2019-03-23 06:20:27 2019-03-23 06:20:59 1 0.00 55.0 0.00 0.00 55.80 yellow credit card NaN NaN NaN NaN 2019-03-23 06:20:27 2019-03-23 06:20:59 Saturday 0 days 00:00:32
3085 2019-03-24 15:43:02 2019-03-24 16:02:57 1 3.20 14.5 0.00 0.00 15.30 yellow cash NaN NaN NaN NaN 2019-03-24 15:43:02 2019-03-24 16:02:57 Sunday 0 days 00:19:55
3259 2019-03-30 00:39:09 2019-03-30 00:43:34 1 0.84 5.0 1.70 0.00 10.50 yellow credit card NaN NaN NaN NaN 2019-03-30 00:39:09 2019-03-30 00:43:34 Saturday 0 days 00:04:25
3644 2019-03-19 23:30:50 2019-03-19 23:32:33 2 0.00 103.0 20.65 0.00 123.95 yellow credit card NaN NaN NaN NaN 2019-03-19 23:30:50 2019-03-19 23:32:33 Tuesday 0 days 00:01:43
3793 2019-03-24 06:07:30 2019-03-24 06:12:22 1 1.50 6.5 0.00 0.00 9.80 yellow NaN NaN Midtown East NaN Manhattan 2019-03-24 06:07:30 2019-03-24 06:12:22 Sunday 0 days 00:04:52
3889 2019-03-30 06:42:55 2019-03-30 06:43:00 1 0.00 2.5 0.00 0.00 3.30 yellow cash NaN NaN NaN NaN 2019-03-30 06:42:55 2019-03-30 06:43:00 Saturday 0 days 00:00:05
4118 2019-03-15 14:01:54 2019-03-15 14:01:59 1 0.00 25.0 5.06 0.00 30.36 yellow credit card NaN NaN NaN NaN 2019-03-15 14:01:54 2019-03-15 14:01:59 Friday 0 days 00:00:05
4127 2019-03-01 05:18:21 2019-03-01 05:18:30 2 0.04 2.5 0.00 0.00 3.80 yellow cash NaN NaN NaN NaN 2019-03-01 05:18:21 2019-03-01 05:18:30 Friday 0 days 00:00:09
4281 2019-03-18 23:29:49 2019-03-18 23:38:00 1 2.10 9.0 2.06 0.00 12.36 yellow credit card NaN NaN NaN NaN 2019-03-18 23:29:49 2019-03-18 23:38:00 Monday 0 days 00:08:11
4414 2019-03-29 20:42:47 2019-03-29 21:03:50 1 4.20 17.0 4.80 5.76 28.86 yellow credit card NaN NaN NaN NaN 2019-03-29 20:42:47 2019-03-29 21:03:50 Friday 0 days 00:21:03
4772 2019-03-21 19:51:13 2019-03-21 19:57:03 1 0.72 5.5 1.46 0.00 8.76 yellow credit card NaN NaN NaN NaN 2019-03-21 19:51:13 2019-03-21 19:57:03 Thursday 0 days 00:05:50
4941 2019-03-01 08:12:28 2019-03-01 08:17:19 1 0.83 5.0 2.50 0.00 10.80 yellow credit card NaN NaN NaN NaN 2019-03-01 08:12:28 2019-03-01 08:17:19 Friday 0 days 00:04:51
5263 2019-03-09 20:37:53 2019-03-09 21:07:20 1 4.99 22.0 5.16 0.00 30.96 yellow credit card NaN NaN NaN NaN 2019-03-09 20:37:53 2019-03-09 21:07:20 Saturday 0 days 00:29:27
5493 2019-03-06 12:14:00 2019-03-06 12:14:00 1 0.00 10.0 0.00 0.00 10.00 green credit card NaN NaN NaN NaN 2019-03-06 12:14:00 2019-03-06 12:14:00 Wednesday 0 days 00:00:00
5624 2019-03-11 14:04:50 2019-03-11 14:04:50 1 0.00 2.5 0.00 0.00 3.30 green cash NaN NaN NaN NaN 2019-03-11 14:04:50 2019-03-11 14:04:50 Monday 0 days 00:00:00
5638 2019-03-13 12:22:00 2019-03-13 12:22:00 1 0.00 10.0 0.00 0.00 10.00 green credit card NaN NaN NaN NaN 2019-03-13 12:22:00 2019-03-13 12:22:00 Wednesday 0 days 00:00:00
6083 2019-03-19 15:34:00 2019-03-19 15:34:00 1 0.00 5.0 0.00 0.00 5.00 green credit card NaN NaN NaN NaN 2019-03-19 15:34:00 2019-03-19 15:34:00 Tuesday 0 days 00:00:00
df_sub = df[df["pickup_zone"].isnull()]
df_sub
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
42 2019-03-30 23:59:14 2019-03-30 23:59:17 1 0.00 80.0 20.08 0.00 100.38 yellow credit card NaN NaN NaN NaN 2019-03-30 23:59:14 2019-03-30 23:59:17 Saturday 0 days 00:00:03
606 2019-03-19 15:04:40 2019-03-19 15:06:15 1 0.40 3.5 1.35 0.00 8.15 yellow credit card NaN Upper East Side South NaN Manhattan 2019-03-19 15:04:40 2019-03-19 15:06:15 Tuesday 0 days 00:01:35
622 2019-03-12 19:52:36 2019-03-12 19:52:55 3 0.00 120.0 33.20 12.50 166.00 yellow credit card NaN NaN NaN NaN 2019-03-12 19:52:36 2019-03-12 19:52:55 Tuesday 0 days 00:00:19
671 2019-03-09 22:45:46 2019-03-09 22:46:50 1 0.40 3.5 0.00 0.00 7.30 yellow cash NaN Lincoln Square East NaN Manhattan 2019-03-09 22:45:46 2019-03-09 22:46:50 Saturday 0 days 00:01:04
712 2019-03-03 01:24:45 2019-03-03 01:29:32 1 0.80 5.0 1.75 0.00 10.55 yellow credit card NaN West Chelsea/Hudson Yards NaN Manhattan 2019-03-03 01:24:45 2019-03-03 01:29:32 Sunday 0 days 00:04:47
970 2019-03-29 00:02:00 2019-03-29 00:02:05 2 0.00 45.0 5.00 0.00 50.30 yellow credit card NaN NaN NaN NaN 2019-03-29 00:02:00 2019-03-29 00:02:05 Friday 0 days 00:00:05
1108 2019-03-10 09:33:13 2019-03-10 09:52:04 1 13.40 52.0 11.70 5.76 70.26 yellow credit card NaN NaN NaN NaN 2019-03-10 09:33:13 2019-03-10 09:52:04 Sunday 0 days 00:18:51
1961 2019-03-19 16:43:59 2019-03-19 17:37:22 1 17.82 52.0 13.11 5.76 78.67 yellow credit card NaN Garment District NaN Manhattan 2019-03-19 16:43:59 2019-03-19 17:37:22 Tuesday 0 days 00:53:23
2137 2019-03-08 13:27:32 2019-03-08 13:46:23 1 1.40 12.0 3.05 0.00 18.35 yellow credit card NaN NaN NaN NaN 2019-03-08 13:27:32 2019-03-08 13:46:23 Friday 0 days 00:18:51
2742 2019-03-23 06:20:27 2019-03-23 06:20:59 1 0.00 55.0 0.00 0.00 55.80 yellow credit card NaN NaN NaN NaN 2019-03-23 06:20:27 2019-03-23 06:20:59 Saturday 0 days 00:00:32
3085 2019-03-24 15:43:02 2019-03-24 16:02:57 1 3.20 14.5 0.00 0.00 15.30 yellow cash NaN NaN NaN NaN 2019-03-24 15:43:02 2019-03-24 16:02:57 Sunday 0 days 00:19:55
3259 2019-03-30 00:39:09 2019-03-30 00:43:34 1 0.84 5.0 1.70 0.00 10.50 yellow credit card NaN NaN NaN NaN 2019-03-30 00:39:09 2019-03-30 00:43:34 Saturday 0 days 00:04:25
3644 2019-03-19 23:30:50 2019-03-19 23:32:33 2 0.00 103.0 20.65 0.00 123.95 yellow credit card NaN NaN NaN NaN 2019-03-19 23:30:50 2019-03-19 23:32:33 Tuesday 0 days 00:01:43
3793 2019-03-24 06:07:30 2019-03-24 06:12:22 1 1.50 6.5 0.00 0.00 9.80 yellow NaN NaN Midtown East NaN Manhattan 2019-03-24 06:07:30 2019-03-24 06:12:22 Sunday 0 days 00:04:52
3889 2019-03-30 06:42:55 2019-03-30 06:43:00 1 0.00 2.5 0.00 0.00 3.30 yellow cash NaN NaN NaN NaN 2019-03-30 06:42:55 2019-03-30 06:43:00 Saturday 0 days 00:00:05
4118 2019-03-15 14:01:54 2019-03-15 14:01:59 1 0.00 25.0 5.06 0.00 30.36 yellow credit card NaN NaN NaN NaN 2019-03-15 14:01:54 2019-03-15 14:01:59 Friday 0 days 00:00:05
4127 2019-03-01 05:18:21 2019-03-01 05:18:30 2 0.04 2.5 0.00 0.00 3.80 yellow cash NaN NaN NaN NaN 2019-03-01 05:18:21 2019-03-01 05:18:30 Friday 0 days 00:00:09
4281 2019-03-18 23:29:49 2019-03-18 23:38:00 1 2.10 9.0 2.06 0.00 12.36 yellow credit card NaN NaN NaN NaN 2019-03-18 23:29:49 2019-03-18 23:38:00 Monday 0 days 00:08:11
4414 2019-03-29 20:42:47 2019-03-29 21:03:50 1 4.20 17.0 4.80 5.76 28.86 yellow credit card NaN NaN NaN NaN 2019-03-29 20:42:47 2019-03-29 21:03:50 Friday 0 days 00:21:03
4772 2019-03-21 19:51:13 2019-03-21 19:57:03 1 0.72 5.5 1.46 0.00 8.76 yellow credit card NaN NaN NaN NaN 2019-03-21 19:51:13 2019-03-21 19:57:03 Thursday 0 days 00:05:50
4941 2019-03-01 08:12:28 2019-03-01 08:17:19 1 0.83 5.0 2.50 0.00 10.80 yellow credit card NaN NaN NaN NaN 2019-03-01 08:12:28 2019-03-01 08:17:19 Friday 0 days 00:04:51
5263 2019-03-09 20:37:53 2019-03-09 21:07:20 1 4.99 22.0 5.16 0.00 30.96 yellow credit card NaN NaN NaN NaN 2019-03-09 20:37:53 2019-03-09 21:07:20 Saturday 0 days 00:29:27
5493 2019-03-06 12:14:00 2019-03-06 12:14:00 1 0.00 10.0 0.00 0.00 10.00 green credit card NaN NaN NaN NaN 2019-03-06 12:14:00 2019-03-06 12:14:00 Wednesday 0 days 00:00:00
5624 2019-03-11 14:04:50 2019-03-11 14:04:50 1 0.00 2.5 0.00 0.00 3.30 green cash NaN NaN NaN NaN 2019-03-11 14:04:50 2019-03-11 14:04:50 Monday 0 days 00:00:00
5638 2019-03-13 12:22:00 2019-03-13 12:22:00 1 0.00 10.0 0.00 0.00 10.00 green credit card NaN NaN NaN NaN 2019-03-13 12:22:00 2019-03-13 12:22:00 Wednesday 0 days 00:00:00
6083 2019-03-19 15:34:00 2019-03-19 15:34:00 1 0.00 5.0 0.00 0.00 5.00 green credit card NaN NaN NaN NaN 2019-03-19 15:34:00 2019-03-19 15:34:00 Tuesday 0 days 00:00:00
df_sub = df[df["pickup_zone"].notna()] #"pickup_zone" entry is not missing
df_sub
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan 2019-03-23 20:21:09 2019-03-23 20:27:24 Saturday 0 days 00:06:15
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan 2019-03-04 16:11:55 2019-03-04 16:19:00 Monday 0 days 00:07:05
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan 2019-03-27 17:53:01 2019-03-27 18:00:25 Wednesday 0 days 00:07:24
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan 2019-03-10 01:23:59 2019-03-10 01:49:51 Sunday 0 days 00:25:52
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan 2019-03-30 13:27:42 2019-03-30 13:37:14 Saturday 0 days 00:09:32
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 1.06 0.0 6.36 green credit card East Harlem North Central Harlem North Manhattan Manhattan 2019-03-31 09:51:53 2019-03-31 09:55:27 Sunday 0 days 00:03:34
6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 0.00 0.0 58.80 green credit card Jamaica East Concourse/Concourse Village Queens Bronx 2019-03-31 17:38:00 2019-03-31 18:34:23 Sunday 0 days 00:56:23
6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 0.00 0.0 17.30 green cash Crown Heights North Bushwick North Brooklyn Brooklyn 2019-03-23 22:55:18 2019-03-23 23:14:25 Saturday 0 days 00:19:07
6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 0.00 0.0 6.80 green credit card East New York East Flatbush/Remsen Village Brooklyn Brooklyn 2019-03-04 10:09:25 2019-03-04 10:14:29 Monday 0 days 00:05:04
6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 3.36 0.0 20.16 green credit card Boerum Hill Windsor Terrace Brooklyn Brooklyn 2019-03-13 19:31:22 2019-03-13 19:48:02 Wednesday 0 days 00:16:40

6407 rows Ă— 18 columns

  • What if you tried to check for equality with np.nan?

import numpy as np
df[df["pickup_zone"] == np.nan] #empty
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
np.nan == np.nan #not equal to itself
False
np.nan != np.nan 
True

Worse approach: Using a for loop#

We used str and contains to find the sub-DataFrame containing the rows where both the “pickup_zone” and the “dropoff_zone” involved an airport.

Here we’re going to go through the 6433 rows, one at a time. We will do that by going through the integers from 0 (inclusive) to 6433 (exclusive) one at a time. We can do that using for i in range(6433):. (In Matlab, we would do something like for i = 1:6433.)

# problem from missing values

good_inds = []

for i in range(len(df)):
    if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
        good_inds.append(i)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [14], line 6
      3 good_inds = []
      5 for i in range(len(df)):
----> 6     if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
      7         good_inds.append(i)

TypeError: argument of type 'float' is not iterable

If we try to check whether "Airport" in np.nan, we indeed get the exact same error.

"Airport" in np.nan
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 1
----> 1 "Airport" in np.nan

TypeError: argument of type 'float' is not iterable

How to identify identify np.nan values? We can use np.isnan() to identify np.nan values whether with individual values or arrays.

np.isnan(np.nan)
True

However, np.isnan() cannot handle strings.

np.isnan('String')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [25], line 1
----> 1 np.isnan('String')

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

For a pandas DataFrame or Series, you can use the isna() or isnull() methods to check for missing values.

df.loc[0,"pickup_zone"].isna()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [26], line 1
----> 1 df.loc[0,"pickup_zone"].isna()

AttributeError: 'str' object has no attribute 'isna'

Pandas methods like isna() are typically available on DataFrame and Series objects. When you’re dealing with individual elements from a DataFrame or Series, it’s often safer to use top-level pandas functions like pd.isna() to ensure compatibility.

pd.isna(df.loc[0, "pickup_zone"])
False

We are using and instead of & in the following, because it is being used with True and False values directly, rather than with arrays or Series of True and False values.

index_id = []

for i in range(len(df)):
    if pd.isna(df.loc[i, "pickup_zone"]) or pd.isna(df.loc[i, "dropoff_zone"]):
        continue #skip the rest of the loop iteration
    elif ("Airport" in df.loc[i, "pickup_zone"]) and ("Airport" in df.loc[i, "dropoff_zone"]):
        index_id.append(i)

print(index_id)
[120, 770, 1080, 1089, 1416, 1929, 2387, 3571, 4358, 5095, 5364]

There are many other ways we could get around this (such as by dropping rows with missing values in these columns). Let’s use aother simple approach, of wrapping the whole problematic portion in a try block. The except TypeError portion is saying that even if a TypeError is raised, do not stop the execution of the code, instead just continue to the next step in the for loop.

index_id = []

for i in range(len(df)):
    try:
        if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
            index_id.append(i)
    except TypeError:
        continue

index_id
[120, 770, 1080, 1089, 1416, 1929, 2387, 3571, 4358, 5095, 5364]
df.iloc[index_id,:] #or use df.iloc[index_id]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime droptime day duration
120 2019-03-21 17:21:44 2019-03-21 17:21:49 1 0.00 2.50 0.00 0.00 4.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-21 17:21:44 2019-03-21 17:21:49 Thursday 0 days 00:00:05
770 2019-03-02 03:16:59 2019-03-02 03:17:06 0 9.40 2.50 0.00 0.00 3.80 yellow NaN JFK Airport JFK Airport Queens Queens 2019-03-02 03:16:59 2019-03-02 03:17:06 Saturday 0 days 00:00:07
1080 2019-03-04 14:17:05 2019-03-04 14:17:13 1 0.00 2.50 0.00 0.00 3.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-04 14:17:05 2019-03-04 14:17:13 Monday 0 days 00:00:08
1089 2019-03-10 01:43:32 2019-03-10 01:45:22 1 0.37 3.50 0.00 0.00 4.80 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-10 01:43:32 2019-03-10 01:45:22 Sunday 0 days 00:01:50
1416 2019-03-09 13:16:32 2019-03-09 13:46:11 2 12.39 35.00 0.00 0.00 35.80 yellow cash LaGuardia Airport JFK Airport Queens Queens 2019-03-09 13:16:32 2019-03-09 13:46:11 Saturday 0 days 00:29:39
1929 2019-03-13 22:35:35 2019-03-13 22:35:49 1 0.00 2.50 0.00 0.00 3.80 yellow NaN JFK Airport JFK Airport Queens Queens 2019-03-13 22:35:35 2019-03-13 22:35:49 Wednesday 0 days 00:00:14
2387 2019-03-28 15:58:52 2019-03-28 15:59:25 1 1.80 69.06 20.80 0.00 90.16 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-28 15:58:52 2019-03-28 15:59:25 Thursday 0 days 00:00:33
3571 2019-03-22 16:47:41 2019-03-22 16:47:50 1 0.81 66.00 0.00 0.00 66.80 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-22 16:47:41 2019-03-22 16:47:50 Friday 0 days 00:00:09
4358 2019-03-06 18:24:00 2019-03-06 18:24:13 2 0.01 2.50 0.00 0.00 4.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-06 18:24:00 2019-03-06 18:24:13 Wednesday 0 days 00:00:13
5095 2019-03-30 20:14:44 2019-03-30 21:01:28 3 18.91 52.00 8.78 5.76 67.34 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-30 20:14:44 2019-03-30 21:01:28 Saturday 0 days 00:46:44
5364 2019-03-17 16:59:17 2019-03-17 18:04:08 2 36.70 150.00 0.00 24.02 174.82 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-17 16:59:17 2019-03-17 18:04:08 Sunday 0 days 01:04:51