Week 2 Friday#
You can find the notesbooks at course notes.
import pandas as pd
df = pd.read_csv("taxis.csv")
Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function
to_datetime
. Save the results as new columns indf
, named pickuptime and “droptime”.
df['pickuptime'] = pd.to_datetime(df['pickup'])
df['droptime'] = pd.to_datetime(df['dropoff'])
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 |
Make a new column, “day”, containing the day of the week of the pickup. Use the accessor
dt
and then the methodday_name
.
If we try to call day_name
on the Series directly, even though it contains datetime values, we get an error. We need to use the dt
accessor to apply these kind of methods to the entries in the Series.
df['pickuptime'].day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [6], line 1
----> 1 df['pickuptime'].day_name()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'day_name'
df['day'] = df['pickuptime'].dt.day_name() #accessor dt
df['day']
0 Saturday
1 Monday
2 Wednesday
3 Sunday
4 Saturday
...
6428 Sunday
6429 Sunday
6430 Saturday
6431 Monday
6432 Wednesday
Name: day, Length: 6433, dtype: object
help(df['pickuptime'].dt)
| - 'raise' will raise an NonExistentTimeError if there are
| nonexistent times.
|
| .. versionadded:: 0.24.0
|
| Returns
| -------
| DatetimeIndex, TimedeltaIndex, or Series
| Index of the same type for a DatetimeIndex or TimedeltaIndex,
| or a Series with the same index for a Series.
|
| Raises
| ------
| ValueError if the `freq` cannot be converted.
|
| Examples
| --------
| **DatetimeIndex**
|
| >>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
| >>> rng
| DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
| '2018-01-01 12:01:00'],
| dtype='datetime64[ns]', freq='T')
| >>> rng.round('H')
| DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
| '2018-01-01 12:00:00'],
| dtype='datetime64[ns]', freq=None)
|
| **Series**
|
| >>> pd.Series(rng).dt.round("H")
| 0 2018-01-01 12:00:00
| 1 2018-01-01 12:00:00
| 2 2018-01-01 12:00:00
| dtype: datetime64[ns]
|
| strftime(self, *args, **kwargs)
| Convert to Index using specified date_format.
|
| Return an Index of formatted strings specified by date_format, which
| supports the same string format as the python standard library. Details
| of the string format can be found in `python string format
| doc <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior>`__.
|
| Parameters
| ----------
| date_format : str
| Date format string (e.g. "%Y-%m-%d").
|
| Returns
| -------
| ndarray
| NumPy ndarray of formatted strings.
|
| See Also
| --------
| to_datetime : Convert the given argument to datetime.
| DatetimeIndex.normalize : Return DatetimeIndex with times to midnight.
| DatetimeIndex.round : Round the DatetimeIndex to the specified freq.
| DatetimeIndex.floor : Floor the DatetimeIndex to the specified freq.
|
| Examples
| --------
| >>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
| ... periods=3, freq='s')
| >>> rng.strftime('%B %d, %Y, %r')
| Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
| 'March 10, 2018, 09:00:02 AM'],
| dtype='object')
|
| to_period(self, *args, **kwargs)
| Cast to PeriodArray/Index at a particular frequency.
|
| Converts DatetimeArray/Index to PeriodArray/Index.
|
| Parameters
| ----------
| freq : str or Offset, optional
| One of pandas' :ref:`offset strings <timeseries.offset_aliases>`
| or an Offset object. Will be inferred by default.
|
| Returns
| -------
| PeriodArray/Index
|
| Raises
| ------
| ValueError
| When converting a DatetimeArray/Index with non-regular values,
| so that a frequency cannot be inferred.
|
| See Also
| --------
| PeriodIndex: Immutable ndarray holding ordinal values.
| DatetimeIndex.to_pydatetime: Return DatetimeIndex as object.
|
| Examples
| --------
| >>> df = pd.DataFrame({"y": [1, 2, 3]},
| ... index=pd.to_datetime(["2000-03-31 00:00:00",
| ... "2000-05-31 00:00:00",
| ... "2000-08-31 00:00:00"]))
| >>> df.index.to_period("M")
| PeriodIndex(['2000-03', '2000-05', '2000-08'],
| dtype='period[M]', freq='M')
|
| Infer the daily frequency
|
| >>> idx = pd.date_range("2017-01-01", periods=2)
| >>> idx.to_period()
| PeriodIndex(['2017-01-01', '2017-01-02'],
| dtype='period[D]', freq='D')
|
| to_pydatetime(self) -> numpy.ndarray
| Return the data as an array of native Python datetime objects.
|
| Timezone information is retained if present.
|
| .. warning::
|
| Python's datetime uses microsecond resolution, which is lower than
| pandas (nanosecond). The values are truncated.
|
| Returns
| -------
| numpy.ndarray
| Object dtype array containing native Python datetime objects.
|
| See Also
| --------
| datetime.datetime : Standard library value for a datetime.
|
| Examples
| --------
| >>> s = pd.Series(pd.date_range('20180310', periods=2))
| >>> s
| 0 2018-03-10
| 1 2018-03-11
| dtype: datetime64[ns]
|
| >>> s.dt.to_pydatetime()
| array([datetime.datetime(2018, 3, 10, 0, 0),
| datetime.datetime(2018, 3, 11, 0, 0)], dtype=object)
|
| pandas' nanosecond precision is truncated to microseconds.
|
| >>> s = pd.Series(pd.date_range('20180310', periods=2, freq='ns'))
| >>> s
| 0 2018-03-10 00:00:00.000000000
| 1 2018-03-10 00:00:00.000000001
| dtype: datetime64[ns]
|
| >>> s.dt.to_pydatetime()
| array([datetime.datetime(2018, 3, 10, 0, 0),
| datetime.datetime(2018, 3, 10, 0, 0)], dtype=object)
|
| tz_convert(self, *args, **kwargs)
| Convert tz-aware Datetime Array/Index from one time zone to another.
|
| Parameters
| ----------
| tz : str, pytz.timezone, dateutil.tz.tzfile or None
| Time zone for time. Corresponding timestamps would be converted
| to this time zone of the Datetime Array/Index. A `tz` of None will
| convert to UTC and remove the timezone information.
|
| Returns
| -------
| Array or Index
|
| Raises
| ------
| TypeError
| If Datetime Array/Index is tz-naive.
|
| See Also
| --------
| DatetimeIndex.tz : A timezone that has a variable offset from UTC.
| DatetimeIndex.tz_localize : Localize tz-naive DatetimeIndex to a
| given time zone, or remove timezone from a tz-aware DatetimeIndex.
|
| Examples
| --------
| With the `tz` parameter, we can change the DatetimeIndex
| to other time zones:
|
| >>> dti = pd.date_range(start='2014-08-01 09:00',
| ... freq='H', periods=3, tz='Europe/Berlin')
|
| >>> dti
| DatetimeIndex(['2014-08-01 09:00:00+02:00',
| '2014-08-01 10:00:00+02:00',
| '2014-08-01 11:00:00+02:00'],
| dtype='datetime64[ns, Europe/Berlin]', freq='H')
|
| >>> dti.tz_convert('US/Central')
| DatetimeIndex(['2014-08-01 02:00:00-05:00',
| '2014-08-01 03:00:00-05:00',
| '2014-08-01 04:00:00-05:00'],
| dtype='datetime64[ns, US/Central]', freq='H')
|
| With the ``tz=None``, we can remove the timezone (after converting
| to UTC if necessary):
|
| >>> dti = pd.date_range(start='2014-08-01 09:00', freq='H',
| ... periods=3, tz='Europe/Berlin')
|
| >>> dti
| DatetimeIndex(['2014-08-01 09:00:00+02:00',
| '2014-08-01 10:00:00+02:00',
| '2014-08-01 11:00:00+02:00'],
| dtype='datetime64[ns, Europe/Berlin]', freq='H')
|
| >>> dti.tz_convert(None)
| DatetimeIndex(['2014-08-01 07:00:00',
| '2014-08-01 08:00:00',
| '2014-08-01 09:00:00'],
| dtype='datetime64[ns]', freq='H')
|
| tz_localize(self, *args, **kwargs)
| Localize tz-naive Datetime Array/Index to tz-aware
| Datetime Array/Index.
|
| This method takes a time zone (tz) naive Datetime Array/Index object
| and makes this time zone aware. It does not move the time to another
| time zone.
| Time zone localization helps to switch from time zone aware to time
| zone unaware objects.
|
| Parameters
| ----------
| tz : str, pytz.timezone, dateutil.tz.tzfile or None
| Time zone to convert timestamps to. Passing ``None`` will
| remove the time zone information preserving local time.
| ambiguous : 'infer', 'NaT', bool array, default 'raise'
| When clocks moved backward due to DST, ambiguous times may arise.
| For example in Central European Time (UTC+01), when going from
| 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
| 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
| `ambiguous` parameter dictates how ambiguous times should be
| handled.
|
| - 'infer' will attempt to infer fall dst-transition hours based on
| order
| - bool-ndarray where True signifies a DST time, False signifies a
| non-DST time (note that this flag is only applicable for
| ambiguous times)
| - 'NaT' will return NaT where there are ambiguous times
| - 'raise' will raise an AmbiguousTimeError if there are ambiguous
| times.
|
| nonexistent : 'shift_forward', 'shift_backward, 'NaT', timedelta, default 'raise'
| A nonexistent time does not exist in a particular timezone
| where clocks moved forward due to DST.
|
| - 'shift_forward' will shift the nonexistent time forward to the
| closest existing time
| - 'shift_backward' will shift the nonexistent time backward to the
| closest existing time
| - 'NaT' will return NaT where there are nonexistent times
| - timedelta objects will shift nonexistent times by the timedelta
| - 'raise' will raise an NonExistentTimeError if there are
| nonexistent times.
|
| .. versionadded:: 0.24.0
|
| Returns
| -------
| Same type as self
| Array/Index converted to the specified time zone.
|
| Raises
| ------
| TypeError
| If the Datetime Array/Index is tz-aware and tz is not None.
|
| See Also
| --------
| DatetimeIndex.tz_convert : Convert tz-aware DatetimeIndex from
| one time zone to another.
|
| Examples
| --------
| >>> tz_naive = pd.date_range('2018-03-01 09:00', periods=3)
| >>> tz_naive
| DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
| '2018-03-03 09:00:00'],
| dtype='datetime64[ns]', freq='D')
|
| Localize DatetimeIndex in US/Eastern time zone:
|
| >>> tz_aware = tz_naive.tz_localize(tz='US/Eastern')
| >>> tz_aware
| DatetimeIndex(['2018-03-01 09:00:00-05:00',
| '2018-03-02 09:00:00-05:00',
| '2018-03-03 09:00:00-05:00'],
| dtype='datetime64[ns, US/Eastern]', freq=None)
|
| With the ``tz=None``, we can remove the time zone information
| while keeping the local time (not converted to UTC):
|
| >>> tz_aware.tz_localize(None)
| DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
| '2018-03-03 09:00:00'],
| dtype='datetime64[ns]', freq=None)
|
| Be careful with DST changes. When there is sequential data, pandas can
| infer the DST time:
|
| >>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00',
| ... '2018-10-28 02:00:00',
| ... '2018-10-28 02:30:00',
| ... '2018-10-28 02:00:00',
| ... '2018-10-28 02:30:00',
| ... '2018-10-28 03:00:00',
| ... '2018-10-28 03:30:00']))
| >>> s.dt.tz_localize('CET', ambiguous='infer')
| 0 2018-10-28 01:30:00+02:00
| 1 2018-10-28 02:00:00+02:00
| 2 2018-10-28 02:30:00+02:00
| 3 2018-10-28 02:00:00+01:00
| 4 2018-10-28 02:30:00+01:00
| 5 2018-10-28 03:00:00+01:00
| 6 2018-10-28 03:30:00+01:00
| dtype: datetime64[ns, CET]
|
| In some cases, inferring the DST is impossible. In such cases, you can
| pass an ndarray to the ambiguous parameter to set the DST explicitly
|
| >>> s = pd.to_datetime(pd.Series(['2018-10-28 01:20:00',
| ... '2018-10-28 02:36:00',
| ... '2018-10-28 03:46:00']))
| >>> s.dt.tz_localize('CET', ambiguous=np.array([True, True, False]))
| 0 2018-10-28 01:20:00+02:00
| 1 2018-10-28 02:36:00+02:00
| 2 2018-10-28 03:46:00+01:00
| dtype: datetime64[ns, CET]
|
| If the DST transition causes nonexistent times, you can shift these
| dates forward or backwards with a timedelta object or `'shift_forward'`
| or `'shift_backwards'`.
|
| >>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
| ... '2015-03-29 03:30:00']))
| >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
| 0 2015-03-29 03:00:00+02:00
| 1 2015-03-29 03:30:00+02:00
| dtype: datetime64[ns, Europe/Warsaw]
|
| >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
| 0 2015-03-29 01:59:59.999999999+01:00
| 1 2015-03-29 03:30:00+02:00
| dtype: datetime64[ns, Europe/Warsaw]
|
| >>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
| 0 2015-03-29 03:30:00+02:00
| 1 2015-03-29 03:30:00+02:00
| dtype: datetime64[ns, Europe/Warsaw]
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| freq
|
| week
| The week ordinal of the year.
|
| .. deprecated:: 1.1.0
|
| Series.dt.weekofyear and Series.dt.week have been deprecated.
| Please use Series.dt.isocalendar().week instead.
|
| weekofyear
| The week ordinal of the year.
|
| .. deprecated:: 1.1.0
|
| Series.dt.weekofyear and Series.dt.week have been deprecated.
| Please use Series.dt.isocalendar().week instead.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| date
| Returns numpy array of python datetime.date objects (namely, the date
| part of Timestamps without timezone information).
|
| day
| The day of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="D")
| ... )
| >>> datetime_series
| 0 2000-01-01
| 1 2000-01-02
| 2 2000-01-03
| dtype: datetime64[ns]
| >>> datetime_series.dt.day
| 0 1
| 1 2
| 2 3
| dtype: int64
|
| day_of_week
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias.
| Series.dt.day_name : Returns the name of the day of the week.
|
| Examples
| --------
| >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
| >>> s.dt.dayofweek
| 2016-12-31 5
| 2017-01-01 6
| 2017-01-02 0
| 2017-01-03 1
| 2017-01-04 2
| 2017-01-05 3
| 2017-01-06 4
| 2017-01-07 5
| 2017-01-08 6
| Freq: D, dtype: int64
|
| day_of_year
| The ordinal day of the year.
|
| dayofweek
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias.
| Series.dt.day_name : Returns the name of the day of the week.
|
| Examples
| --------
| >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
| >>> s.dt.dayofweek
| 2016-12-31 5
| 2017-01-01 6
| 2017-01-02 0
| 2017-01-03 1
| 2017-01-04 2
| 2017-01-05 3
| 2017-01-06 4
| 2017-01-07 5
| 2017-01-08 6
| Freq: D, dtype: int64
|
| dayofyear
| The ordinal day of the year.
|
| days_in_month
| The number of days in the month.
|
| daysinmonth
| The number of days in the month.
|
| hour
| The hours of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="h")
| ... )
| >>> datetime_series
| 0 2000-01-01 00:00:00
| 1 2000-01-01 01:00:00
| 2 2000-01-01 02:00:00
| dtype: datetime64[ns]
| >>> datetime_series.dt.hour
| 0 0
| 1 1
| 2 2
| dtype: int64
|
| is_leap_year
| Boolean indicator if the date belongs to a leap year.
|
| A leap year is a year, which has 366 days (instead of 365) including
| 29th of February as an intercalary day.
| Leap years are years which are multiples of four with the exception
| of years divisible by 100 but not by 400.
|
| Returns
| -------
| Series or ndarray
| Booleans indicating if dates belong to a leap year.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y")
| >>> idx
| DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31'],
| dtype='datetime64[ns]', freq='A-DEC')
| >>> idx.is_leap_year
| array([ True, False, False])
|
| >>> dates_series = pd.Series(idx)
| >>> dates_series
| 0 2012-12-31
| 1 2013-12-31
| 2 2014-12-31
| dtype: datetime64[ns]
| >>> dates_series.dt.is_leap_year
| 0 True
| 1 False
| 2 False
| dtype: bool
|
| is_month_end
| Indicates whether the date is the last day of the month.
|
| Returns
| -------
| Series or array
| For Series, returns a Series with boolean values.
| For DatetimeIndex, returns a boolean array.
|
| See Also
| --------
| is_month_start : Return a boolean indicating whether the date
| is the first day of the month.
| is_month_end : Return a boolean indicating whether the date
| is the last day of the month.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
| >>> s
| 0 2018-02-27
| 1 2018-02-28
| 2 2018-03-01
| dtype: datetime64[ns]
| >>> s.dt.is_month_start
| 0 False
| 1 False
| 2 True
| dtype: bool
| >>> s.dt.is_month_end
| 0 False
| 1 True
| 2 False
| dtype: bool
|
| >>> idx = pd.date_range("2018-02-27", periods=3)
| >>> idx.is_month_start
| array([False, False, True])
| >>> idx.is_month_end
| array([False, True, False])
|
| is_month_start
| Indicates whether the date is the first day of the month.
|
| Returns
| -------
| Series or array
| For Series, returns a Series with boolean values.
| For DatetimeIndex, returns a boolean array.
|
| See Also
| --------
| is_month_start : Return a boolean indicating whether the date
| is the first day of the month.
| is_month_end : Return a boolean indicating whether the date
| is the last day of the month.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
| >>> s
| 0 2018-02-27
| 1 2018-02-28
| 2 2018-03-01
| dtype: datetime64[ns]
| >>> s.dt.is_month_start
| 0 False
| 1 False
| 2 True
| dtype: bool
| >>> s.dt.is_month_end
| 0 False
| 1 True
| 2 False
| dtype: bool
|
| >>> idx = pd.date_range("2018-02-27", periods=3)
| >>> idx.is_month_start
| array([False, False, True])
| >>> idx.is_month_end
| array([False, True, False])
|
| is_quarter_end
| Indicator for whether the date is the last day of a quarter.
|
| Returns
| -------
| is_quarter_end : Series or DatetimeIndex
| The same type as the original data with boolean values. Series will
| have the same name and index. DatetimeIndex will have the same
| name.
|
| See Also
| --------
| quarter : Return the quarter of the date.
| is_quarter_start : Similar property indicating the quarter start.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
| ... periods=4)})
| >>> df.assign(quarter=df.dates.dt.quarter,
| ... is_quarter_end=df.dates.dt.is_quarter_end)
| dates quarter is_quarter_end
| 0 2017-03-30 1 False
| 1 2017-03-31 1 True
| 2 2017-04-01 2 False
| 3 2017-04-02 2 False
|
| >>> idx = pd.date_range('2017-03-30', periods=4)
| >>> idx
| DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
| dtype='datetime64[ns]', freq='D')
|
| >>> idx.is_quarter_end
| array([False, True, False, False])
|
| is_quarter_start
| Indicator for whether the date is the first day of a quarter.
|
| Returns
| -------
| is_quarter_start : Series or DatetimeIndex
| The same type as the original data with boolean values. Series will
| have the same name and index. DatetimeIndex will have the same
| name.
|
| See Also
| --------
| quarter : Return the quarter of the date.
| is_quarter_end : Similar property for indicating the quarter start.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
| ... periods=4)})
| >>> df.assign(quarter=df.dates.dt.quarter,
| ... is_quarter_start=df.dates.dt.is_quarter_start)
| dates quarter is_quarter_start
| 0 2017-03-30 1 False
| 1 2017-03-31 1 False
| 2 2017-04-01 2 True
| 3 2017-04-02 2 False
|
| >>> idx = pd.date_range('2017-03-30', periods=4)
| >>> idx
| DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
| dtype='datetime64[ns]', freq='D')
|
| >>> idx.is_quarter_start
| array([False, False, True, False])
|
| is_year_end
| Indicate whether the date is the last day of the year.
|
| Returns
| -------
| Series or DatetimeIndex
| The same type as the original data with boolean values. Series will
| have the same name and index. DatetimeIndex will have the same
| name.
|
| See Also
| --------
| is_year_start : Similar property indicating the start of the year.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
| >>> dates
| 0 2017-12-30
| 1 2017-12-31
| 2 2018-01-01
| dtype: datetime64[ns]
|
| >>> dates.dt.is_year_end
| 0 False
| 1 True
| 2 False
| dtype: bool
|
| >>> idx = pd.date_range("2017-12-30", periods=3)
| >>> idx
| DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
| dtype='datetime64[ns]', freq='D')
|
| >>> idx.is_year_end
| array([False, True, False])
|
| is_year_start
| Indicate whether the date is the first day of a year.
|
| Returns
| -------
| Series or DatetimeIndex
| The same type as the original data with boolean values. Series will
| have the same name and index. DatetimeIndex will have the same
| name.
|
| See Also
| --------
| is_year_end : Similar property indicating the last day of the year.
|
| Examples
| --------
| This method is available on Series with datetime values under
| the ``.dt`` accessor, and directly on DatetimeIndex.
|
| >>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
| >>> dates
| 0 2017-12-30
| 1 2017-12-31
| 2 2018-01-01
| dtype: datetime64[ns]
|
| >>> dates.dt.is_year_start
| 0 False
| 1 False
| 2 True
| dtype: bool
|
| >>> idx = pd.date_range("2017-12-30", periods=3)
| >>> idx
| DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
| dtype='datetime64[ns]', freq='D')
|
| >>> idx.is_year_start
| array([False, False, True])
|
| microsecond
| The microseconds of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="us")
| ... )
| >>> datetime_series
| 0 2000-01-01 00:00:00.000000
| 1 2000-01-01 00:00:00.000001
| 2 2000-01-01 00:00:00.000002
| dtype: datetime64[ns]
| >>> datetime_series.dt.microsecond
| 0 0
| 1 1
| 2 2
| dtype: int64
|
| minute
| The minutes of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="T")
| ... )
| >>> datetime_series
| 0 2000-01-01 00:00:00
| 1 2000-01-01 00:01:00
| 2 2000-01-01 00:02:00
| dtype: datetime64[ns]
| >>> datetime_series.dt.minute
| 0 0
| 1 1
| 2 2
| dtype: int64
|
| month
| The month as January=1, December=12.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="M")
| ... )
| >>> datetime_series
| 0 2000-01-31
| 1 2000-02-29
| 2 2000-03-31
| dtype: datetime64[ns]
| >>> datetime_series.dt.month
| 0 1
| 1 2
| 2 3
| dtype: int64
|
| nanosecond
| The nanoseconds of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="ns")
| ... )
| >>> datetime_series
| 0 2000-01-01 00:00:00.000000000
| 1 2000-01-01 00:00:00.000000001
| 2 2000-01-01 00:00:00.000000002
| dtype: datetime64[ns]
| >>> datetime_series.dt.nanosecond
| 0 0
| 1 1
| 2 2
| dtype: int64
|
| quarter
| The quarter of the date.
|
| second
| The seconds of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="s")
| ... )
| >>> datetime_series
| 0 2000-01-01 00:00:00
| 1 2000-01-01 00:00:01
| 2 2000-01-01 00:00:02
| dtype: datetime64[ns]
| >>> datetime_series.dt.second
| 0 0
| 1 1
| 2 2
| dtype: int64
|
| time
| Returns numpy array of datetime.time. The time part of the Timestamps.
|
| timetz
| Returns numpy array of datetime.time also containing timezone
| information. The time part of the Timestamps.
|
| tz
| Return timezone, if any.
|
| Returns
| -------
| datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None
| Returns None when the array is tz-naive.
|
| weekday
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias.
| Series.dt.day_name : Returns the name of the day of the week.
|
| Examples
| --------
| >>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
| >>> s.dt.dayofweek
| 2016-12-31 5
| 2017-01-01 6
| 2017-01-02 0
| 2017-01-03 1
| 2017-01-04 2
| 2017-01-05 3
| 2017-01-06 4
| 2017-01-07 5
| 2017-01-08 6
| Freq: D, dtype: int64
|
| year
| The year of the datetime.
|
| Examples
| --------
| >>> datetime_series = pd.Series(
| ... pd.date_range("2000-01-01", periods=3, freq="Y")
| ... )
| >>> datetime_series
| 0 2000-12-31
| 1 2001-12-31
| 2 2002-12-31
| dtype: datetime64[ns]
| >>> datetime_series.dt.year
| 0 2000
| 1 2001
| 2 2002
| dtype: int64
|
| ----------------------------------------------------------------------
| Methods inherited from Properties:
|
| __init__(self, data: 'Series', orig)
| Initialize self. See help(type(self)) for accurate signature.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from pandas.core.accessor.PandasDelegate:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.base.PandasObject:
|
| __repr__(self) -> str
| Return a string representation for a particular object.
|
| __sizeof__(self)
| Generates the total memory usage for an object that returns
| either a value or Series of values
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from pandas.core.base.PandasObject:
|
| __annotations__ = {'_cache': typing.Dict[str, typing.Any]}
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.accessor.DirNamesMixin:
|
| __dir__(self) -> List[str]
| Provide method name lookup and completion.
|
| Notes
| -----
| Only provide 'public' methods.
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.base.NoNewAttributesMixin:
|
| __setattr__(self, key: str, value)
| Implement setattr(self, name, value).
Make a new column “duration” containing the amount of time the ride took. (Now that pandas recognizes these as dates, you can just subtract.)
This sort of subtraction of times would never work if we hadn’t converted them to datetime types using pd.to_datetime
.
df["duration"] = df["droptime"] - df["pickuptime"]
df["duration"]
0 0 days 00:06:15
1 0 days 00:07:05
2 0 days 00:07:24
3 0 days 00:25:52
4 0 days 00:09:32
...
6428 0 days 00:03:34
6429 0 days 00:56:23
6430 0 days 00:19:07
6431 0 days 00:05:04
6432 0 days 00:16:40
Name: duration, Length: 6433, dtype: timedelta64[ns]
Check out the first three rows and first two columns. Does the top row look like a 6 minute taxi ride? Yes, because it went from 8:21pm to 8:27pm.
df.iloc[:3,:2]
pickup | dropoff | |
---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
What are the data types of these four columns we just made?
Notice how the first two we made are called “datetime64[ns]”
, whereas the last one is called a “timedelta64[ns]”
. When you see “delta”, you should usually think “difference”, and that’s exactly what this column represents. The middle column, “day”, is reported as object, because they are just strings like "Monday"
.
df.dtypes
pickup object
dropoff object
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
pickuptime datetime64[ns]
droptime datetime64[ns]
day object
duration timedelta64[ns]
dtype: object
Working with strings in pandas#
How many rows in
df
have the word “Airport” in either the “pickup_zone” column or the “dropoff_zone” column? Do any rows have the word “Airport” in both columns?
Use the str
accessor and the contains
method.
df[["pickup_zone","dropoff_zone"]]
pickup_zone | dropoff_zone | |
---|---|---|
0 | Lenox Hill West | UN/Turtle Bay South |
1 | Upper West Side South | Upper West Side South |
2 | Alphabet City | West Village |
3 | Hudson Sq | Yorkville West |
4 | Midtown East | Yorkville West |
... | ... | ... |
6428 | East Harlem North | Central Harlem North |
6429 | Jamaica | East Concourse/Concourse Village |
6430 | Crown Heights North | Bushwick North |
6431 | East New York | East Flatbush/Remsen Village |
6432 | Boerum Hill | Windsor Terrace |
6433 rows Ă— 2 columns
df['pickup_zone'][:3] #check out the first three entries in the “pickup_zone” column
0 Lenox Hill West
1 Upper West Side South
2 Alphabet City
Name: pickup_zone, dtype: object
An example of how str.contains
works:
df['pickup_zone'][:3].str.contains('City')
0 False
1 False
2 True
Name: pickup_zone, dtype: bool
Here we use Boolean indexing to find all the rows where either the “pickup_zone” or the “dropoff_zone” column contains the substring “Airport”. Notice how the logical or is made using a vertical bar. ( Spelled out or is used in some contexts in Python, but not with NumPy Boolean arrays or
pandas Boolean Series.)
ps = df['pickup_zone'].str.contains("Airport")
ds = df['dropoff_zone'].str.contains("Airport")
df[ps | ds] #or #df[ps & ds] #and
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.90 | 17.00 | 0.00 | 0.00 | 17.80 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | Saturday | 0 days 00:20:04 |
22 | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | 1 | 9.82 | 31.50 | 8.31 | 5.76 | 49.87 | yellow | credit card | LaGuardia Airport | Sutton Place/Turtle Bay North | Queens | Manhattan | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | Thursday | 0 days 00:35:03 |
43 | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | 1 | 9.50 | 28.00 | 0.00 | 5.76 | 35.06 | yellow | cash | LaGuardia Airport | Washington Heights South | Queens | Manhattan | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | Wednesday | 0 days 00:21:10 |
53 | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | 1 | 13.45 | 36.00 | 0.00 | 0.00 | 37.80 | yellow | cash | JFK Airport | Sunnyside | Queens | Queens | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | Wednesday | 0 days 00:23:16 |
56 | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | 2 | 19.59 | 52.00 | 0.00 | 5.76 | 65.56 | yellow | credit card | JFK Airport | Upper East Side North | Queens | Manhattan | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | Tuesday | 0 days 00:48:25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6243 | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | 1 | 3.50 | 12.00 | 0.00 | 0.00 | 12.80 | green | cash | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | Tuesday | 0 days 00:08:34 |
6280 | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | 1 | 2.74 | 10.00 | 0.00 | 0.00 | 10.80 | green | credit card | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | Wednesday | 0 days 00:06:11 |
6322 | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | 1 | 5.39 | 17.00 | 3.76 | 0.00 | 22.56 | green | credit card | Jamaica | JFK Airport | Queens | Queens | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | Friday | 0 days 00:14:00 |
6396 | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | 1 | 11.44 | 41.63 | 0.00 | 0.00 | 42.13 | green | credit card | Bushwick South | LaGuardia Airport | Brooklyn | Queens | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | Friday | 0 days 00:36:16 |
6398 | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | 1 | 18.39 | 50.50 | 7.70 | 0.00 | 59.00 | green | credit card | DUMBO/Vinegar Hill | JFK Airport | Brooklyn | Queens | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | Sunday | 0 days 00:44:27 |
407 rows Ă— 18 columns
Missing values in pandas#
Look at
df.info()
again. Which columns have missing values?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 6433 non-null object
1 dropoff 6433 non-null object
2 passengers 6433 non-null int64
3 distance 6433 non-null float64
4 fare 6433 non-null float64
5 tip 6433 non-null float64
6 tolls 6433 non-null float64
7 total 6433 non-null float64
8 color 6433 non-null object
9 payment 6389 non-null object
10 pickup_zone 6407 non-null object
11 dropoff_zone 6388 non-null object
12 pickup_borough 6407 non-null object
13 dropoff_borough 6388 non-null object
14 pickuptime 6433 non-null datetime64[ns]
15 droptime 6433 non-null datetime64[ns]
16 day 6433 non-null object
17 duration 6433 non-null timedelta64[ns]
dtypes: datetime64[ns](2), float64(5), int64(1), object(9), timedelta64[ns](1)
memory usage: 904.8+ KB
Define
df_sub
to be the sub-DataFrame containing all the rows for which the “pickup_zone” entry is missing. Use theisna
method.
df_sub = df[df["pickup_zone"].isna()]
df_sub
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42 | 2019-03-30 23:59:14 | 2019-03-30 23:59:17 | 1 | 0.00 | 80.0 | 20.08 | 0.00 | 100.38 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-30 23:59:14 | 2019-03-30 23:59:17 | Saturday | 0 days 00:00:03 |
606 | 2019-03-19 15:04:40 | 2019-03-19 15:06:15 | 1 | 0.40 | 3.5 | 1.35 | 0.00 | 8.15 | yellow | credit card | NaN | Upper East Side South | NaN | Manhattan | 2019-03-19 15:04:40 | 2019-03-19 15:06:15 | Tuesday | 0 days 00:01:35 |
622 | 2019-03-12 19:52:36 | 2019-03-12 19:52:55 | 3 | 0.00 | 120.0 | 33.20 | 12.50 | 166.00 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-12 19:52:36 | 2019-03-12 19:52:55 | Tuesday | 0 days 00:00:19 |
671 | 2019-03-09 22:45:46 | 2019-03-09 22:46:50 | 1 | 0.40 | 3.5 | 0.00 | 0.00 | 7.30 | yellow | cash | NaN | Lincoln Square East | NaN | Manhattan | 2019-03-09 22:45:46 | 2019-03-09 22:46:50 | Saturday | 0 days 00:01:04 |
712 | 2019-03-03 01:24:45 | 2019-03-03 01:29:32 | 1 | 0.80 | 5.0 | 1.75 | 0.00 | 10.55 | yellow | credit card | NaN | West Chelsea/Hudson Yards | NaN | Manhattan | 2019-03-03 01:24:45 | 2019-03-03 01:29:32 | Sunday | 0 days 00:04:47 |
970 | 2019-03-29 00:02:00 | 2019-03-29 00:02:05 | 2 | 0.00 | 45.0 | 5.00 | 0.00 | 50.30 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-29 00:02:00 | 2019-03-29 00:02:05 | Friday | 0 days 00:00:05 |
1108 | 2019-03-10 09:33:13 | 2019-03-10 09:52:04 | 1 | 13.40 | 52.0 | 11.70 | 5.76 | 70.26 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-10 09:33:13 | 2019-03-10 09:52:04 | Sunday | 0 days 00:18:51 |
1961 | 2019-03-19 16:43:59 | 2019-03-19 17:37:22 | 1 | 17.82 | 52.0 | 13.11 | 5.76 | 78.67 | yellow | credit card | NaN | Garment District | NaN | Manhattan | 2019-03-19 16:43:59 | 2019-03-19 17:37:22 | Tuesday | 0 days 00:53:23 |
2137 | 2019-03-08 13:27:32 | 2019-03-08 13:46:23 | 1 | 1.40 | 12.0 | 3.05 | 0.00 | 18.35 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-08 13:27:32 | 2019-03-08 13:46:23 | Friday | 0 days 00:18:51 |
2742 | 2019-03-23 06:20:27 | 2019-03-23 06:20:59 | 1 | 0.00 | 55.0 | 0.00 | 0.00 | 55.80 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-23 06:20:27 | 2019-03-23 06:20:59 | Saturday | 0 days 00:00:32 |
3085 | 2019-03-24 15:43:02 | 2019-03-24 16:02:57 | 1 | 3.20 | 14.5 | 0.00 | 0.00 | 15.30 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-24 15:43:02 | 2019-03-24 16:02:57 | Sunday | 0 days 00:19:55 |
3259 | 2019-03-30 00:39:09 | 2019-03-30 00:43:34 | 1 | 0.84 | 5.0 | 1.70 | 0.00 | 10.50 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-30 00:39:09 | 2019-03-30 00:43:34 | Saturday | 0 days 00:04:25 |
3644 | 2019-03-19 23:30:50 | 2019-03-19 23:32:33 | 2 | 0.00 | 103.0 | 20.65 | 0.00 | 123.95 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-19 23:30:50 | 2019-03-19 23:32:33 | Tuesday | 0 days 00:01:43 |
3793 | 2019-03-24 06:07:30 | 2019-03-24 06:12:22 | 1 | 1.50 | 6.5 | 0.00 | 0.00 | 9.80 | yellow | NaN | NaN | Midtown East | NaN | Manhattan | 2019-03-24 06:07:30 | 2019-03-24 06:12:22 | Sunday | 0 days 00:04:52 |
3889 | 2019-03-30 06:42:55 | 2019-03-30 06:43:00 | 1 | 0.00 | 2.5 | 0.00 | 0.00 | 3.30 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-30 06:42:55 | 2019-03-30 06:43:00 | Saturday | 0 days 00:00:05 |
4118 | 2019-03-15 14:01:54 | 2019-03-15 14:01:59 | 1 | 0.00 | 25.0 | 5.06 | 0.00 | 30.36 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-15 14:01:54 | 2019-03-15 14:01:59 | Friday | 0 days 00:00:05 |
4127 | 2019-03-01 05:18:21 | 2019-03-01 05:18:30 | 2 | 0.04 | 2.5 | 0.00 | 0.00 | 3.80 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-01 05:18:21 | 2019-03-01 05:18:30 | Friday | 0 days 00:00:09 |
4281 | 2019-03-18 23:29:49 | 2019-03-18 23:38:00 | 1 | 2.10 | 9.0 | 2.06 | 0.00 | 12.36 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-18 23:29:49 | 2019-03-18 23:38:00 | Monday | 0 days 00:08:11 |
4414 | 2019-03-29 20:42:47 | 2019-03-29 21:03:50 | 1 | 4.20 | 17.0 | 4.80 | 5.76 | 28.86 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-29 20:42:47 | 2019-03-29 21:03:50 | Friday | 0 days 00:21:03 |
4772 | 2019-03-21 19:51:13 | 2019-03-21 19:57:03 | 1 | 0.72 | 5.5 | 1.46 | 0.00 | 8.76 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-21 19:51:13 | 2019-03-21 19:57:03 | Thursday | 0 days 00:05:50 |
4941 | 2019-03-01 08:12:28 | 2019-03-01 08:17:19 | 1 | 0.83 | 5.0 | 2.50 | 0.00 | 10.80 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-01 08:12:28 | 2019-03-01 08:17:19 | Friday | 0 days 00:04:51 |
5263 | 2019-03-09 20:37:53 | 2019-03-09 21:07:20 | 1 | 4.99 | 22.0 | 5.16 | 0.00 | 30.96 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-09 20:37:53 | 2019-03-09 21:07:20 | Saturday | 0 days 00:29:27 |
5493 | 2019-03-06 12:14:00 | 2019-03-06 12:14:00 | 1 | 0.00 | 10.0 | 0.00 | 0.00 | 10.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-06 12:14:00 | 2019-03-06 12:14:00 | Wednesday | 0 days 00:00:00 |
5624 | 2019-03-11 14:04:50 | 2019-03-11 14:04:50 | 1 | 0.00 | 2.5 | 0.00 | 0.00 | 3.30 | green | cash | NaN | NaN | NaN | NaN | 2019-03-11 14:04:50 | 2019-03-11 14:04:50 | Monday | 0 days 00:00:00 |
5638 | 2019-03-13 12:22:00 | 2019-03-13 12:22:00 | 1 | 0.00 | 10.0 | 0.00 | 0.00 | 10.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-13 12:22:00 | 2019-03-13 12:22:00 | Wednesday | 0 days 00:00:00 |
6083 | 2019-03-19 15:34:00 | 2019-03-19 15:34:00 | 1 | 0.00 | 5.0 | 0.00 | 0.00 | 5.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-19 15:34:00 | 2019-03-19 15:34:00 | Tuesday | 0 days 00:00:00 |
df_sub = df[df["pickup_zone"].isnull()]
df_sub
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42 | 2019-03-30 23:59:14 | 2019-03-30 23:59:17 | 1 | 0.00 | 80.0 | 20.08 | 0.00 | 100.38 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-30 23:59:14 | 2019-03-30 23:59:17 | Saturday | 0 days 00:00:03 |
606 | 2019-03-19 15:04:40 | 2019-03-19 15:06:15 | 1 | 0.40 | 3.5 | 1.35 | 0.00 | 8.15 | yellow | credit card | NaN | Upper East Side South | NaN | Manhattan | 2019-03-19 15:04:40 | 2019-03-19 15:06:15 | Tuesday | 0 days 00:01:35 |
622 | 2019-03-12 19:52:36 | 2019-03-12 19:52:55 | 3 | 0.00 | 120.0 | 33.20 | 12.50 | 166.00 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-12 19:52:36 | 2019-03-12 19:52:55 | Tuesday | 0 days 00:00:19 |
671 | 2019-03-09 22:45:46 | 2019-03-09 22:46:50 | 1 | 0.40 | 3.5 | 0.00 | 0.00 | 7.30 | yellow | cash | NaN | Lincoln Square East | NaN | Manhattan | 2019-03-09 22:45:46 | 2019-03-09 22:46:50 | Saturday | 0 days 00:01:04 |
712 | 2019-03-03 01:24:45 | 2019-03-03 01:29:32 | 1 | 0.80 | 5.0 | 1.75 | 0.00 | 10.55 | yellow | credit card | NaN | West Chelsea/Hudson Yards | NaN | Manhattan | 2019-03-03 01:24:45 | 2019-03-03 01:29:32 | Sunday | 0 days 00:04:47 |
970 | 2019-03-29 00:02:00 | 2019-03-29 00:02:05 | 2 | 0.00 | 45.0 | 5.00 | 0.00 | 50.30 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-29 00:02:00 | 2019-03-29 00:02:05 | Friday | 0 days 00:00:05 |
1108 | 2019-03-10 09:33:13 | 2019-03-10 09:52:04 | 1 | 13.40 | 52.0 | 11.70 | 5.76 | 70.26 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-10 09:33:13 | 2019-03-10 09:52:04 | Sunday | 0 days 00:18:51 |
1961 | 2019-03-19 16:43:59 | 2019-03-19 17:37:22 | 1 | 17.82 | 52.0 | 13.11 | 5.76 | 78.67 | yellow | credit card | NaN | Garment District | NaN | Manhattan | 2019-03-19 16:43:59 | 2019-03-19 17:37:22 | Tuesday | 0 days 00:53:23 |
2137 | 2019-03-08 13:27:32 | 2019-03-08 13:46:23 | 1 | 1.40 | 12.0 | 3.05 | 0.00 | 18.35 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-08 13:27:32 | 2019-03-08 13:46:23 | Friday | 0 days 00:18:51 |
2742 | 2019-03-23 06:20:27 | 2019-03-23 06:20:59 | 1 | 0.00 | 55.0 | 0.00 | 0.00 | 55.80 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-23 06:20:27 | 2019-03-23 06:20:59 | Saturday | 0 days 00:00:32 |
3085 | 2019-03-24 15:43:02 | 2019-03-24 16:02:57 | 1 | 3.20 | 14.5 | 0.00 | 0.00 | 15.30 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-24 15:43:02 | 2019-03-24 16:02:57 | Sunday | 0 days 00:19:55 |
3259 | 2019-03-30 00:39:09 | 2019-03-30 00:43:34 | 1 | 0.84 | 5.0 | 1.70 | 0.00 | 10.50 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-30 00:39:09 | 2019-03-30 00:43:34 | Saturday | 0 days 00:04:25 |
3644 | 2019-03-19 23:30:50 | 2019-03-19 23:32:33 | 2 | 0.00 | 103.0 | 20.65 | 0.00 | 123.95 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-19 23:30:50 | 2019-03-19 23:32:33 | Tuesday | 0 days 00:01:43 |
3793 | 2019-03-24 06:07:30 | 2019-03-24 06:12:22 | 1 | 1.50 | 6.5 | 0.00 | 0.00 | 9.80 | yellow | NaN | NaN | Midtown East | NaN | Manhattan | 2019-03-24 06:07:30 | 2019-03-24 06:12:22 | Sunday | 0 days 00:04:52 |
3889 | 2019-03-30 06:42:55 | 2019-03-30 06:43:00 | 1 | 0.00 | 2.5 | 0.00 | 0.00 | 3.30 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-30 06:42:55 | 2019-03-30 06:43:00 | Saturday | 0 days 00:00:05 |
4118 | 2019-03-15 14:01:54 | 2019-03-15 14:01:59 | 1 | 0.00 | 25.0 | 5.06 | 0.00 | 30.36 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-15 14:01:54 | 2019-03-15 14:01:59 | Friday | 0 days 00:00:05 |
4127 | 2019-03-01 05:18:21 | 2019-03-01 05:18:30 | 2 | 0.04 | 2.5 | 0.00 | 0.00 | 3.80 | yellow | cash | NaN | NaN | NaN | NaN | 2019-03-01 05:18:21 | 2019-03-01 05:18:30 | Friday | 0 days 00:00:09 |
4281 | 2019-03-18 23:29:49 | 2019-03-18 23:38:00 | 1 | 2.10 | 9.0 | 2.06 | 0.00 | 12.36 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-18 23:29:49 | 2019-03-18 23:38:00 | Monday | 0 days 00:08:11 |
4414 | 2019-03-29 20:42:47 | 2019-03-29 21:03:50 | 1 | 4.20 | 17.0 | 4.80 | 5.76 | 28.86 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-29 20:42:47 | 2019-03-29 21:03:50 | Friday | 0 days 00:21:03 |
4772 | 2019-03-21 19:51:13 | 2019-03-21 19:57:03 | 1 | 0.72 | 5.5 | 1.46 | 0.00 | 8.76 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-21 19:51:13 | 2019-03-21 19:57:03 | Thursday | 0 days 00:05:50 |
4941 | 2019-03-01 08:12:28 | 2019-03-01 08:17:19 | 1 | 0.83 | 5.0 | 2.50 | 0.00 | 10.80 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-01 08:12:28 | 2019-03-01 08:17:19 | Friday | 0 days 00:04:51 |
5263 | 2019-03-09 20:37:53 | 2019-03-09 21:07:20 | 1 | 4.99 | 22.0 | 5.16 | 0.00 | 30.96 | yellow | credit card | NaN | NaN | NaN | NaN | 2019-03-09 20:37:53 | 2019-03-09 21:07:20 | Saturday | 0 days 00:29:27 |
5493 | 2019-03-06 12:14:00 | 2019-03-06 12:14:00 | 1 | 0.00 | 10.0 | 0.00 | 0.00 | 10.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-06 12:14:00 | 2019-03-06 12:14:00 | Wednesday | 0 days 00:00:00 |
5624 | 2019-03-11 14:04:50 | 2019-03-11 14:04:50 | 1 | 0.00 | 2.5 | 0.00 | 0.00 | 3.30 | green | cash | NaN | NaN | NaN | NaN | 2019-03-11 14:04:50 | 2019-03-11 14:04:50 | Monday | 0 days 00:00:00 |
5638 | 2019-03-13 12:22:00 | 2019-03-13 12:22:00 | 1 | 0.00 | 10.0 | 0.00 | 0.00 | 10.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-13 12:22:00 | 2019-03-13 12:22:00 | Wednesday | 0 days 00:00:00 |
6083 | 2019-03-19 15:34:00 | 2019-03-19 15:34:00 | 1 | 0.00 | 5.0 | 0.00 | 0.00 | 5.00 | green | credit card | NaN | NaN | NaN | NaN | 2019-03-19 15:34:00 | 2019-03-19 15:34:00 | Tuesday | 0 days 00:00:00 |
df_sub = df[df["pickup_zone"].notna()] #"pickup_zone" entry is not missing
df_sub
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | Saturday | 0 days 00:06:15 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | Monday | 0 days 00:07:05 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | Wednesday | 0 days 00:07:24 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | Sunday | 0 days 00:25:52 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | Saturday | 0 days 00:09:32 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | Sunday | 0 days 00:03:34 |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | Sunday | 0 days 00:56:23 |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | Saturday | 0 days 00:19:07 |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | Monday | 0 days 00:05:04 |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | Wednesday | 0 days 00:16:40 |
6407 rows Ă— 18 columns
What if you tried to check for equality with
np.nan
?
import numpy as np
df[df["pickup_zone"] == np.nan] #empty
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration |
---|
np.nan == np.nan #not equal to itself
False
np.nan != np.nan
True
Worse approach: Using a for loop#
We used str
and contains
to find the sub-DataFrame containing the rows where both the “pickup_zone” and the “dropoff_zone” involved an airport.
Here we’re going to go through the 6433 rows, one at a time. We will do that by going through the integers from 0 (inclusive) to 6433 (exclusive) one at a time. We can do that using for i in range(6433):
. (In Matlab, we would do something like for i = 1:6433
.)
# problem from missing values
good_inds = []
for i in range(len(df)):
if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
good_inds.append(i)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [14], line 6
3 good_inds = []
5 for i in range(len(df)):
----> 6 if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
7 good_inds.append(i)
TypeError: argument of type 'float' is not iterable
If we try to check whether "Airport" in np.nan
, we indeed get the exact same error.
"Airport" in np.nan
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [15], line 1
----> 1 "Airport" in np.nan
TypeError: argument of type 'float' is not iterable
How to identify identify np.nan
values? We can use np.isnan()
to identify np.nan values whether with individual values or arrays.
np.isnan(np.nan)
True
However, np.isnan()
cannot handle strings.
np.isnan('String')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [25], line 1
----> 1 np.isnan('String')
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
For a pandas DataFrame or Series, you can use the isna() or isnull() methods to check for missing values.
df.loc[0,"pickup_zone"].isna()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [26], line 1
----> 1 df.loc[0,"pickup_zone"].isna()
AttributeError: 'str' object has no attribute 'isna'
Pandas methods like isna()
are typically available on DataFrame and Series objects. When you’re dealing with individual elements from a DataFrame or Series, it’s often safer to use top-level pandas functions like pd.isna()
to ensure compatibility.
pd.isna(df.loc[0, "pickup_zone"])
False
We are using and
instead of &
in the following, because it is being used with True
and False
values directly, rather than with arrays or Series of True
and False
values.
index_id = []
for i in range(len(df)):
if pd.isna(df.loc[i, "pickup_zone"]) or pd.isna(df.loc[i, "dropoff_zone"]):
continue #skip the rest of the loop iteration
elif ("Airport" in df.loc[i, "pickup_zone"]) and ("Airport" in df.loc[i, "dropoff_zone"]):
index_id.append(i)
print(index_id)
[120, 770, 1080, 1089, 1416, 1929, 2387, 3571, 4358, 5095, 5364]
There are many other ways we could get around this (such as by dropping rows with missing values in these columns). Let’s use aother simple approach, of wrapping the whole problematic portion in a try
block. The except TypeError
portion is saying that even if a TypeError is raised, do not stop the execution of the code, instead just continue to the next step in the for loop.
index_id = []
for i in range(len(df)):
try:
if ("Airport" in df.loc[i,"pickup_zone"]) and ("Airport" in df.loc[i,"dropoff_zone"]):
index_id.append(i)
except TypeError:
continue
index_id
[120, 770, 1080, 1089, 1416, 1929, 2387, 3571, 4358, 5095, 5364]
df.iloc[index_id,:] #or use df.iloc[index_id]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | Thursday | 0 days 00:00:05 |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | Saturday | 0 days 00:00:07 |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | Monday | 0 days 00:00:08 |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | Sunday | 0 days 00:01:50 |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | Saturday | 0 days 00:29:39 |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | Wednesday | 0 days 00:00:14 |
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | Thursday | 0 days 00:00:33 |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | Friday | 0 days 00:00:09 |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | Wednesday | 0 days 00:00:13 |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | Saturday | 0 days 00:46:44 |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | Sunday | 0 days 01:04:51 |