Week 2 Wednesday#

You can find the notesbooks at course notes.

Exploring the taxis dataset#

  • Load the attached file taxis.csv using the pandas function read_csv. Store the resulting DataFrame in the variable df.

import pandas as pd
df = pd.read_csv("taxis.csv")

How many different values are there in the “pickup_borough” column? First get the pandas Series containing that column, then

  • use the unique method;

  • use the value_counts method.

df['pickup_borough']
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
df.loc[:,'pickup_borough']
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

If we want to use iloc, we need to find the integer position of the “pickup_borough” column. We can use the columns attribute to see all the columns.

df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

Convert to a list and then use the index method.

# list
mylist = list(df.columns)
print(mylist)
mylist.index('pickup_borough')
['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls', 'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']
12
df.iloc[:,12]
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

Get a Boolean array and then use np.nonzero (don’t forget to import NumPy before calling np.nonzero.)

# boolean
label_ba = df.columns == 'pickup_borough'
print(label_ba)
print(type(label_ba))
[False False False False False False False False False False False False
  True False]
<class 'numpy.ndarray'>

The function np.nonzero returns the indices of non-zero elements in the input array. The output is a tuple of arrays, one for each dimension of the input, containing the indices of the non-zero elements in that dimension.

For a 1D array, the output will be a tuple with a single array inside it (such as (array([12]),)). For a 2D array, the output would be a tuple with two arrays, one for each dimension (such as (array([2]), array([2]))). The output (array([12]), 1) would not be a valid output for np.nonzero() or np.where.

import numpy as np
print(np.nonzero(label_ba))
print(np.where(label_ba))
(array([12]),)
(array([12]),)
df.loc[:,label_ba]
pickup_borough
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
... ...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn

6433 rows Ă— 1 columns

Use the get_loc method of a pandas Index.

df.columns.get_loc('pickup_borough')
12
type(df.columns)
pandas.core.indexes.base.Index
help(df.columns.get_loc)
Help on method get_loc in module pandas.core.indexes.base:

get_loc(key, method=None, tolerance=None) method of pandas.core.indexes.base.Index instance
    Get integer location, slice or boolean mask for requested label.
    
    Parameters
    ----------
    key : label
    method : {None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
        * default: exact matches only.
        * pad / ffill: find the PREVIOUS index value if no exact match.
        * backfill / bfill: use NEXT index value if no exact match
        * nearest: use the NEAREST index value if no exact match. Tied
          distances are broken by preferring the larger index value.
    tolerance : int or float, optional
        Maximum distance from index value for inexact matches. The value of
        the index at the matching location must satisfy the equation
        ``abs(index[loc] - key) <= tolerance``.
    
    Returns
    -------
    loc : int if unique index, slice if monotonic index, else mask
    
    Examples
    --------
    >>> unique_index = pd.Index(list('abc'))
    >>> unique_index.get_loc('b')
    1
    
    >>> monotonic_index = pd.Index(list('abbc'))
    >>> monotonic_index.get_loc('b')
    slice(1, 3, None)
    
    >>> non_monotonic_index = pd.Index(list('abcb'))
    >>> non_monotonic_index.get_loc('b')
    array([False,  True, False,  True])

Use the get_indexer method of a pandas Index.

#df.columns.get_loc(['pickup_borough','pickup']) #fail for multiple locations
df.columns.get_indexer(['pickup_borough','pickup']) # get multiple locations at the same time
array([12,  0])

unique gets us the distinct values which occur in the Series (including the missing value, represented as nan, which stands for “not a number”). Notice that unique is a Series method, not a DataFrame method.

col_pb = df['pickup_borough']
col_pb.unique()
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)
len(col_pb.unique())
5

(5,) represents a one-dimensional array (such as np.array([1, 2, 3, 4, 5])), while the shape (5, 1) represents a two-dimensional array with one column (such as np.array([[1], [2], [3], [4], [5]])).

col_pb.unique().shape
(5,)

value_counts returns a Series containing counts of unique values. Excludes nan by default.

print(col_pb.value_counts())
len(col_pb.value_counts())
Manhattan    5268
Queens        657
Brooklyn      383
Bronx          99
Name: pickup_borough, dtype: int64
4
help(col_pb.value_counts)
Help on method value_counts in module pandas.core.base:

value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins=None, dropna: bool = True) method of pandas.core.series.Series instance
    Return a Series containing counts of unique values.
    
    The resulting object will be in descending order so that the
    first element is the most frequently-occurring element.
    Excludes NA values by default.
    
    Parameters
    ----------
    normalize : bool, default False
        If True then the object returned will contain the relative
        frequencies of the unique values.
    sort : bool, default True
        Sort by frequencies.
    ascending : bool, default False
        Sort in ascending order.
    bins : int, optional
        Rather than count values, group them into half-open bins,
        a convenience for ``pd.cut``, only works with numeric data.
    dropna : bool, default True
        Don't include counts of NaN.
    
    Returns
    -------
    Series
    
    See Also
    --------
    Series.count: Number of non-NA elements in a Series.
    DataFrame.count: Number of non-NA elements in a DataFrame.
    DataFrame.value_counts: Equivalent method on DataFrames.
    
    Examples
    --------
    >>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
    >>> index.value_counts()
    3.0    2
    2.0    1
    4.0    1
    1.0    1
    dtype: int64
    
    With `normalize` set to `True`, returns the relative frequency by
    dividing all values by the sum of values.
    
    >>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
    >>> s.value_counts(normalize=True)
    3.0    0.4
    2.0    0.2
    4.0    0.2
    1.0    0.2
    dtype: float64
    
    **bins**
    
    Bins can be useful for going from a continuous variable to a
    categorical variable; instead of counting unique
    apparitions of values, divide the index in the specified
    number of half-open bins.
    
    >>> s.value_counts(bins=3)
    (0.996, 2.0]    2
    (2.0, 3.0]      2
    (3.0, 4.0]      1
    dtype: int64
    
    **dropna**
    
    With `dropna` set to `False` we can also see NaN index values.
    
    >>> s.value_counts(dropna=False)
    3.0    2
    2.0    1
    NaN    1
    4.0    1
    1.0    1
    dtype: int64
col_pb.value_counts(dropna=False)
Manhattan    5268
Queens        657
Brooklyn      383
Bronx          99
NaN            26
Name: pickup_borough, dtype: int64

Working with dates in pandas#

  • Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function to_datetime. Save the results as new columns in df, named “picktime” and “droptime”.

We can recognize “pickup” and “dropoff” columns as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)

df[['pickup','dropoff']]
pickup dropoff
0 2019-03-23 20:21:09 2019-03-23 20:27:24
1 2019-03-04 16:11:55 2019-03-04 16:19:00
2 2019-03-27 17:53:01 2019-03-27 18:00:25
3 2019-03-10 01:23:59 2019-03-10 01:49:51
4 2019-03-30 13:27:42 2019-03-30 13:37:14
... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27
6429 2019-03-31 17:38:00 2019-03-31 18:34:23
6430 2019-03-23 22:55:18 2019-03-23 23:14:25
6431 2019-03-04 10:09:25 2019-03-04 10:14:29
6432 2019-03-13 19:31:22 2019-03-13 19:48:02

6433 rows Ă— 2 columns

Let’s see in a smaller example how the pd.to_datetime works. We start out by defining a string representing today’s date.

today = 'Oct 11, 2023'
type(today)
str

Convert it to a Timestamp using pd.to_datetime.

ts = pd.to_datetime(today)
print(ts)
type(ts)
2023-10-11 00:00:00
pandas._libs.tslibs.timestamps.Timestamp

We now have access to all sorts of additional information related to this date. It’s a little unpredictable which are methods. Whether day_name or day_of_year is a method is not something you would be tested on.

ts.day_of_year
284
ts.day_name()
'Wednesday'
help(ts.day_name)
Help on built-in function day_name:

day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
    Return the day name of the Timestamp with specified locale.
    
    Parameters
    ----------
    locale : str, default None (English locale)
        Locale determining the language in which to return the day name.
    
    Returns
    -------
    str

Here we are adding one new column to the DataFrame. We name it “picktime”.

df['pickuptime'] = pd.to_datetime(df['pickup'])
df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough pickuptime
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan 2019-03-23 20:21:09
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan 2019-03-04 16:11:55
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan 2019-03-27 17:53:01
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan 2019-03-10 01:23:59
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan 2019-03-30 13:27:42