Week 2 Wednesday

Week 2 Wednesday#

You can find the notesbooks at course notes.

Exploring the taxis dataset#

Load the attached file taxis.csv using the pandas function read_csv. Store the resulting DataFrame in the variable df.

import pandas as pd
df = pd.read_csv("taxis.csv")

How many different values are there in the “pickup_borough” column? First get the pandas Series containing that column, then

use the unique method;
use the value_counts method.

df['pickup_borough']

     Manhattan
     Manhattan
     Manhattan
     Manhattan
     Manhattan
          ...    
  Manhattan
     Queens
   Brooklyn
   Brooklyn
   Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

df.loc[:,'pickup_borough']

     Manhattan
     Manhattan
     Manhattan
     Manhattan
     Manhattan
          ...    
  Manhattan
     Queens
   Brooklyn
   Brooklyn
   Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

If we want to use iloc, we need to find the integer position of the “pickup_borough” column. We can use the columns attribute to see all the columns.

df.columns

Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

Convert to a list and then use the index method.

# list
mylist = list(df.columns)
print(mylist)
mylist.index('pickup_borough')

['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls', 'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']

df.iloc[:,12]

     Manhattan
     Manhattan
     Manhattan
     Manhattan
     Manhattan
          ...    
  Manhattan
     Queens
   Brooklyn
   Brooklyn
   Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

Get a Boolean array and then use np.nonzero (don’t forget to import NumPy before calling np.nonzero.)

# boolean
label_ba = df.columns == 'pickup_borough'
print(label_ba)
print(type(label_ba))

[False False False False False False False False False False False False
  True False]
<class 'numpy.ndarray'>

The function np.nonzero returns the indices of non-zero elements in the input array. The output is a tuple of arrays, one for each dimension of the input, containing the indices of the non-zero elements in that dimension.

For a 1D array, the output will be a tuple with a single array inside it (such as (array([12]),)). For a 2D array, the output would be a tuple with two arrays, one for each dimension (such as (array([2]), array([2]))). The output (array([12]), 1) would not be a valid output for np.nonzero() or np.where.

import numpy as np
print(np.nonzero(label_ba))
print(np.where(label_ba))

(array([12]),)
(array([12]),)

df.loc[:,label_ba]

	pickup_borough
0	Manhattan
1	Manhattan
2	Manhattan
3	Manhattan
4	Manhattan
...	...
6428	Manhattan
6429	Queens
6430	Brooklyn
6431	Brooklyn
6432	Brooklyn

6433 rows × 1 columns

Use the get_loc method of a pandas Index.

df.columns.get_loc('pickup_borough')

type(df.columns)

pandas.core.indexes.base.Index

help(df.columns.get_loc)

Help on method get_loc in module pandas.core.indexes.base:

get_loc(key, method=None, tolerance=None) method of pandas.core.indexes.base.Index instance
    Get integer location, slice or boolean mask for requested label.
    
    Parameters
    ----------
    key : label
    method : {None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
        * default: exact matches only.
        * pad / ffill: find the PREVIOUS index value if no exact match.
        * backfill / bfill: use NEXT index value if no exact match
        * nearest: use the NEAREST index value if no exact match. Tied
          distances are broken by preferring the larger index value.
    tolerance : int or float, optional
        Maximum distance from index value for inexact matches. The value of
        the index at the matching location must satisfy the equation
        ``abs(index[loc] - key) <= tolerance``.
    
    Returns
    -------
    loc : int if unique index, slice if monotonic index, else mask
    
    Examples
    --------
    >>> unique_index = pd.Index(list('abc'))
    >>> unique_index.get_loc('b')
    1
    
    >>> monotonic_index = pd.Index(list('abbc'))
    >>> monotonic_index.get_loc('b')
    slice(1, 3, None)
    
    >>> non_monotonic_index = pd.Index(list('abcb'))
    >>> non_monotonic_index.get_loc('b')
    array([False,  True, False,  True])

Use the get_indexer method of a pandas Index.

#df.columns.get_loc(['pickup_borough','pickup']) #fail for multiple locations
df.columns.get_indexer(['pickup_borough','pickup']) # get multiple locations at the same time

array([12,  0])

unique gets us the distinct values which occur in the Series (including the missing value, represented as nan, which stands for “not a number”). Notice that unique is a Series method, not a DataFrame method.

col_pb = df['pickup_borough']
col_pb.unique()

array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)

len(col_pb.unique())

(5,) represents a one-dimensional array (such as np.array([1, 2, 3, 4, 5])), while the shape (5, 1) represents a two-dimensional array with one column (such as np.array([[1], [2], [3], [4], [5]])).

col_pb.unique().shape

(5,)

value_counts returns a Series containing counts of unique values. Excludes nan by default.

print(col_pb.value_counts())
len(col_pb.value_counts())

Manhattan    5268
Queens        657
Brooklyn      383
Bronx          99
Name: pickup_borough, dtype: int64

help(col_pb.value_counts)

Help on method value_counts in module pandas.core.base:

value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins=None, dropna: bool = True) method of pandas.core.series.Series instance
    Return a Series containing counts of unique values.
    
    The resulting object will be in descending order so that the
    first element is the most frequently-occurring element.
    Excludes NA values by default.
    
    Parameters
    ----------
    normalize : bool, default False
        If True then the object returned will contain the relative
        frequencies of the unique values.
    sort : bool, default True
        Sort by frequencies.
    ascending : bool, default False
        Sort in ascending order.
    bins : int, optional
        Rather than count values, group them into half-open bins,
        a convenience for ``pd.cut``, only works with numeric data.
    dropna : bool, default True
        Don't include counts of NaN.
    
    Returns
    -------
    Series
    
    See Also
    --------
    Series.count: Number of non-NA elements in a Series.
    DataFrame.count: Number of non-NA elements in a DataFrame.
    DataFrame.value_counts: Equivalent method on DataFrames.
    
    Examples
    --------
    >>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
    >>> index.value_counts()
    3.0    2
    2.0    1
    4.0    1
    1.0    1
    dtype: int64
    
    With `normalize` set to `True`, returns the relative frequency by
    dividing all values by the sum of values.
    
    >>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
    >>> s.value_counts(normalize=True)
    3.0    0.4
    2.0    0.2
    4.0    0.2
    1.0    0.2
    dtype: float64
    
    **bins**
    
    Bins can be useful for going from a continuous variable to a
    categorical variable; instead of counting unique
    apparitions of values, divide the index in the specified
    number of half-open bins.
    
    >>> s.value_counts(bins=3)
    (0.996, 2.0]    2
    (2.0, 3.0]      2
    (3.0, 4.0]      1
    dtype: int64
    
    **dropna**
    
    With `dropna` set to `False` we can also see NaN index values.
    
    >>> s.value_counts(dropna=False)
    3.0    2
    2.0    1
    NaN    1
    4.0    1
    1.0    1
    dtype: int64

col_pb.value_counts(dropna=False)

Manhattan    5268
Queens        657
Brooklyn      383
Bronx          99
NaN            26
Name: pickup_borough, dtype: int64

Working with dates in pandas#

Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function to_datetime. Save the results as new columns in df, named “picktime” and “droptime”.

We can recognize “pickup” and “dropoff” columns as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)

df[['pickup','dropoff']]

	pickup	dropoff
0	2019-03-23 20:21:09	2019-03-23 20:27:24
1	2019-03-04 16:11:55	2019-03-04 16:19:00
2	2019-03-27 17:53:01	2019-03-27 18:00:25
3	2019-03-10 01:23:59	2019-03-10 01:49:51
4	2019-03-30 13:27:42	2019-03-30 13:37:14
...	...	...
6428	2019-03-31 09:51:53	2019-03-31 09:55:27
6429	2019-03-31 17:38:00	2019-03-31 18:34:23
6430	2019-03-23 22:55:18	2019-03-23 23:14:25
6431	2019-03-04 10:09:25	2019-03-04 10:14:29
6432	2019-03-13 19:31:22	2019-03-13 19:48:02

6433 rows × 2 columns

Let’s see in a smaller example how the pd.to_datetime works. We start out by defining a string representing today’s date.

today = 'Oct 11, 2023'
type(today)

str

Convert it to a Timestamp using pd.to_datetime.

ts = pd.to_datetime(today)
print(ts)
type(ts)

2023-10-11 00:00:00

pandas._libs.tslibs.timestamps.Timestamp

We now have access to all sorts of additional information related to this date. It’s a little unpredictable which are methods. Whether day_name or day_of_year is a method is not something you would be tested on.

ts.day_of_year

ts.day_name()

'Wednesday'

help(ts.day_name)

Help on built-in function day_name:

day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
    Return the day name of the Timestamp with specified locale.
    
    Parameters
    ----------
    locale : str, default None (English locale)
        Locale determining the language in which to return the day name.
    
    Returns
    -------
    str

Here we are adding one new column to the DataFrame. We name it “picktime”.

df['pickuptime'] = pd.to_datetime(df['pickup'])
df.head()

	pickup	dropoff	passengers	distance	fare	tip	total	color	payment	pickup_zone	dropoff_zone	pickup_borough	dropoff_borough	pickuptime
0	2019-03-23 20:21:09	2019-03-23 20:27:24	1	1.60	7.0	2.15	12.95	yellow	credit card	Lenox Hill West	UN/Turtle Bay South	Manhattan	Manhattan	2019-03-23 20:21:09
1	2019-03-04 16:11:55	2019-03-04 16:19:00	1	0.79	5.0	0.00	9.30	yellow	cash	Upper West Side South	Upper West Side South	Manhattan	Manhattan	2019-03-04 16:11:55
2	2019-03-27 17:53:01	2019-03-27 18:00:25	1	1.37	7.5	2.36	14.16	yellow	credit card	Alphabet City	West Village	Manhattan	Manhattan	2019-03-27 17:53:01
3	2019-03-10 01:23:59	2019-03-10 01:49:51	1	7.70	27.0	6.15	36.95	yellow	credit card	Hudson Sq	Yorkville West	Manhattan	Manhattan	2019-03-10 01:23:59
4	2019-03-30 13:27:42	2019-03-30 13:37:14	3	2.16	9.0	1.10	13.40	yellow	credit card	Midtown East	Yorkville West	Manhattan	Manhattan	2019-03-30 13:27:42

Week 2 Wednesday

Contents

Week 2 Wednesday#

Exploring the taxis dataset#

Working with dates in pandas#