Week 2 Wednesday#
You can find the notesbooks at course notes.
Exploring the taxis dataset#
Load the attached file
taxis.csv
using the pandas functionread_csv
. Store the resulting DataFrame in the variabledf
.
import pandas as pd
df = pd.read_csv("taxis.csv")
How many different values are there in the “pickup_borough” column? First get the pandas Series containing that column, then
use the
unique
method;use the
value_counts
method.
df['pickup_borough']
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
df.loc[:,'pickup_borough']
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
If we want to use iloc
, we need to find the integer position of the “pickup_borough” column. We can use the columns
attribute to see all the columns.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Convert to a list
and then use the index
method.
# list
mylist = list(df.columns)
print(mylist)
mylist.index('pickup_borough')
['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls', 'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']
12
df.iloc[:,12]
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
Get a Boolean array and then use np.nonzero
(don’t forget to import NumPy before calling np.nonzero
.)
# boolean
label_ba = df.columns == 'pickup_borough'
print(label_ba)
print(type(label_ba))
[False False False False False False False False False False False False
True False]
<class 'numpy.ndarray'>
The function np.nonzero
returns the indices of non-zero elements in the input array. The output is a tuple of arrays, one for each dimension of the input, containing the indices of the non-zero elements in that dimension.
For a 1D array, the output will be a tuple with a single array inside it (such as (array([12]),)
). For a 2D array, the output would be a tuple with two arrays, one for each dimension (such as (array([2]), array([2]))
). The output (array([12]), 1)
would not be a valid output for np.nonzero()
or np.where
.
import numpy as np
print(np.nonzero(label_ba))
print(np.where(label_ba))
(array([12]),)
(array([12]),)
df.loc[:,label_ba]
pickup_borough | |
---|---|
0 | Manhattan |
1 | Manhattan |
2 | Manhattan |
3 | Manhattan |
4 | Manhattan |
... | ... |
6428 | Manhattan |
6429 | Queens |
6430 | Brooklyn |
6431 | Brooklyn |
6432 | Brooklyn |
6433 rows Ă— 1 columns
Use the get_loc
method of a pandas Index.
df.columns.get_loc('pickup_borough')
12
type(df.columns)
pandas.core.indexes.base.Index
help(df.columns.get_loc)
Help on method get_loc in module pandas.core.indexes.base:
get_loc(key, method=None, tolerance=None) method of pandas.core.indexes.base.Index instance
Get integer location, slice or boolean mask for requested label.
Parameters
----------
key : label
method : {None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
* default: exact matches only.
* pad / ffill: find the PREVIOUS index value if no exact match.
* backfill / bfill: use NEXT index value if no exact match
* nearest: use the NEAREST index value if no exact match. Tied
distances are broken by preferring the larger index value.
tolerance : int or float, optional
Maximum distance from index value for inexact matches. The value of
the index at the matching location must satisfy the equation
``abs(index[loc] - key) <= tolerance``.
Returns
-------
loc : int if unique index, slice if monotonic index, else mask
Examples
--------
>>> unique_index = pd.Index(list('abc'))
>>> unique_index.get_loc('b')
1
>>> monotonic_index = pd.Index(list('abbc'))
>>> monotonic_index.get_loc('b')
slice(1, 3, None)
>>> non_monotonic_index = pd.Index(list('abcb'))
>>> non_monotonic_index.get_loc('b')
array([False, True, False, True])
Use the get_indexer
method of a pandas Index.
#df.columns.get_loc(['pickup_borough','pickup']) #fail for multiple locations
df.columns.get_indexer(['pickup_borough','pickup']) # get multiple locations at the same time
array([12, 0])
unique
gets us the distinct values which occur in the Series (including the missing value, represented as nan
, which stands for “not a number”). Notice that unique
is a Series method, not a DataFrame method.
col_pb = df['pickup_borough']
col_pb.unique()
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)
len(col_pb.unique())
5
(5,)
represents a one-dimensional array (such as np.array([1, 2, 3, 4, 5])
), while the shape (5, 1)
represents a two-dimensional array with one column (such as np.array([[1], [2], [3], [4], [5]])
).
col_pb.unique().shape
(5,)
value_counts
returns a Series containing counts of unique values. Excludes nan
by default.
print(col_pb.value_counts())
len(col_pb.value_counts())
Manhattan 5268
Queens 657
Brooklyn 383
Bronx 99
Name: pickup_borough, dtype: int64
4
help(col_pb.value_counts)
Help on method value_counts in module pandas.core.base:
value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins=None, dropna: bool = True) method of pandas.core.series.Series instance
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the
first element is the most frequently-occurring element.
Excludes NA values by default.
Parameters
----------
normalize : bool, default False
If True then the object returned will contain the relative
frequencies of the unique values.
sort : bool, default True
Sort by frequencies.
ascending : bool, default False
Sort in ascending order.
bins : int, optional
Rather than count values, group them into half-open bins,
a convenience for ``pd.cut``, only works with numeric data.
dropna : bool, default True
Don't include counts of NaN.
Returns
-------
Series
See Also
--------
Series.count: Number of non-NA elements in a Series.
DataFrame.count: Number of non-NA elements in a DataFrame.
DataFrame.value_counts: Equivalent method on DataFrames.
Examples
--------
>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
>>> index.value_counts()
3.0 2
2.0 1
4.0 1
1.0 1
dtype: int64
With `normalize` set to `True`, returns the relative frequency by
dividing all values by the sum of values.
>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0 0.4
2.0 0.2
4.0 0.2
1.0 0.2
dtype: float64
**bins**
Bins can be useful for going from a continuous variable to a
categorical variable; instead of counting unique
apparitions of values, divide the index in the specified
number of half-open bins.
>>> s.value_counts(bins=3)
(0.996, 2.0] 2
(2.0, 3.0] 2
(3.0, 4.0] 1
dtype: int64
**dropna**
With `dropna` set to `False` we can also see NaN index values.
>>> s.value_counts(dropna=False)
3.0 2
2.0 1
NaN 1
4.0 1
1.0 1
dtype: int64
col_pb.value_counts(dropna=False)
Manhattan 5268
Queens 657
Brooklyn 383
Bronx 99
NaN 26
Name: pickup_borough, dtype: int64
Working with dates in pandas#
Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function
to_datetime
. Save the results as new columns indf
, named “picktime” and “droptime”.
We can recognize “pickup” and “dropoff” columns as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)
df[['pickup','dropoff']]
pickup | dropoff | |
---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 |
... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 |
6433 rows Ă— 2 columns
Let’s see in a smaller example how the pd.to_datetime
works. We start out by defining a string representing today’s date.
today = 'Oct 11, 2023'
type(today)
str
Convert it to a Timestamp using pd.to_datetime
.
ts = pd.to_datetime(today)
print(ts)
type(ts)
2023-10-11 00:00:00
pandas._libs.tslibs.timestamps.Timestamp
We now have access to all sorts of additional information related to this date. It’s a little unpredictable which are methods. Whether day_name
or day_of_year
is a method is not something you would be tested on.
ts.day_of_year
284
ts.day_name()
'Wednesday'
help(ts.day_name)
Help on built-in function day_name:
day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
Return the day name of the Timestamp with specified locale.
Parameters
----------
locale : str, default None (English locale)
Locale determining the language in which to return the day name.
Returns
-------
str
Here we are adding one new column to the DataFrame. We name it “picktime”.
df['pickuptime'] = pd.to_datetime(df['pickup'])
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickuptime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 2019-03-23 20:21:09 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 2019-03-04 16:11:55 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 2019-03-27 17:53:01 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 2019-03-10 01:23:59 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 2019-03-30 13:27:42 |