Week 3 Friday#
You can find the notebooks at course notes.
Rename columns or rows#
In the penguins dataset, change the column named “island” so it is named “location” and change the column named “sex” so it is named “gender”. Use the pandas DataFrame method rename, and input a Python dictionary.
import seaborn as sns
df = sns.load_dataset("penguins")
df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex'],
dtype='object')
The object inside the parentheses in the call below is an example of a Python dictionary. This is a very important data type that is built into Python.
But the following has no effect.
df.rename({'island':'location','sex':'gender'})
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
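The argument passed to rename above is an ordinary Python dictionary. As a minimal sketch of how dictionaries behave on their own (the variable names mapping, new_name, old_names, new_names, and unchanged are chosen here only for illustration):

```python
# A dictionary maps keys to values; here old column names map to new ones.
mapping = {'island': 'location', 'sex': 'gender'}

# Look up the value associated with a key using square brackets.
new_name = mapping['island']

# The keys and values can be listed separately.
old_names = list(mapping.keys())
new_names = list(mapping.values())

# get returns a fallback value instead of raising KeyError for a missing key.
unchanged = mapping.get('species', 'species')

print(new_name)
print(old_names, new_names)
print(unchanged)
```

Keys that are not in the dictionary, like 'species' here, have no entry at all, which matches the help text's statement that labels not contained in the dict are left as-is.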
help(df.rename)
Help on method rename in module pandas.core.frame:
rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') method of pandas.core.frame.DataFrame instance
Alter axes labels.
Function / dict values must be unique (1-to-1). Labels not contained in
a dict / Series will be left as-is. Extra labels listed don't throw an
error.
See the :ref:`user guide <basics.rename>` for more.
Parameters
----------
mapper : dict-like or function
Dict-like or function transformations to apply to
that axis' values. Use either ``mapper`` and ``axis`` to
specify the axis to target with ``mapper``, or ``index`` and
``columns``.
index : dict-like or function
Alternative to specifying axis (``mapper, axis=0``
is equivalent to ``index=mapper``).
columns : dict-like or function
Alternative to specifying axis (``mapper, axis=1``
is equivalent to ``columns=mapper``).
axis : {0 or 'index', 1 or 'columns'}, default 0
Axis to target with ``mapper``. Can be either the axis name
('index', 'columns') or number (0, 1). The default is 'index'.
copy : bool, default True
Also copy underlying data.
inplace : bool, default False
Whether to return a new DataFrame. If True then value of copy is
ignored.
level : int or level name, default None
In case of a MultiIndex, only rename labels in the specified
level.
errors : {'ignore', 'raise'}, default 'ignore'
If 'raise', raise a `KeyError` when a dict-like `mapper`, `index`,
or `columns` contains labels that are not present in the Index
being transformed.
If 'ignore', existing keys will be renamed and extra keys will be
ignored.
Returns
-------
DataFrame or None
DataFrame with the renamed axis labels or None if ``inplace=True``.
Raises
------
KeyError
If any of the labels is not found in the selected axis and
"errors='raise'".
See Also
--------
DataFrame.rename_axis : Set the name of the axis.
Examples
--------
``DataFrame.rename`` supports two calling conventions
* ``(index=index_mapper, columns=columns_mapper, ...)``
* ``(mapper, axis={'index', 'columns'}, ...)``
We *highly* recommend using keyword arguments to clarify your
intent.
Rename columns using a mapping:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
Rename index using a mapping:
>>> df.rename(index={0: "x", 1: "y", 2: "z"})
A B
x 1 4
y 2 5
z 3 6
Cast index labels to a different type:
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
Traceback (most recent call last):
KeyError: ['C'] not found in axis
Using axis-style parameters:
>>> df.rename(str.lower, axis='columns')
a b
0 1 4
1 2 5
2 3 6
>>> df.rename({1: 2, 2: 4}, axis='index')
A B
0 1 4
2 2 5
4 3 6
axis=0 is the default: rename looks through all of the row labels (0, 1, through 343), and if it finds a row named “island” or “sex”, it changes that name. No row has either label, so nothing is renamed.
We are trying to change the column names, so we should instead use the argument axis=1. Notice how the “island” and “sex” columns have been renamed.
df.rename({'island':'location','sex':'gender'},axis=1)
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
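The help text above recommends keyword arguments to make the intent clear. The following sketch compares the axis=1 form with the columns= keyword form on a tiny stand-in DataFrame (the name small and its three columns are assumptions made for illustration, not the full penguins data):

```python
import pandas as pd

# Tiny stand-in for the penguins data, with only three of its columns.
small = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe'],
    'sex': ['Male', 'Female'],
})
mapping = {'island': 'location', 'sex': 'gender'}

by_axis = small.rename(mapping, axis=1)      # positional mapper plus axis
by_keyword = small.rename(columns=mapping)   # keyword form, no axis needed
by_default = small.rename(mapping)           # axis=0: only row labels are checked

print(list(by_keyword.columns))
print(by_axis.equals(by_keyword))
print(list(by_default.columns))
```

The keyword form says directly which axis is being renamed, so it is harder to fall into the default axis=0 behavior by accident.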
Here is an example of renaming one of the row labels (switch to axis=0).
df.rename({3: 'Today'},axis=0)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
Today | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
It’s important to point out that we haven’t changed df itself. A hint that we haven’t changed df is that DataFrames were displayed as the result of our code. The code was creating new DataFrames, not changing the original DataFrame.
df
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
If we want to change df itself, we should use the inplace keyword argument or assign the result to df. (Warning: this can be dangerous; it would be safer to call this DataFrame something different, like df2.)
df.rename({'island':'location','sex':'gender'},axis=1,inplace=True)
df
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
df2 = df.rename({'island':'location','sex':'gender'},axis=1)
df2
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
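One pitfall when mixing the two approaches: according to the help text, rename returns None when inplace=True, so writing df = df.rename(..., inplace=True) would replace df with None. A small sketch on a one-row stand-in DataFrame (the names small, renamed, and result are assumptions for illustration):

```python
import pandas as pd

small = pd.DataFrame({'island': ['Torgersen'], 'sex': ['Male']})

# Without inplace, rename returns a new DataFrame and leaves small alone.
renamed = small.rename(columns={'island': 'location'})
print(list(small.columns))
print(list(renamed.columns))

# With inplace=True, small itself is modified and the return value is None.
result = small.rename(columns={'sex': 'gender'}, inplace=True)
print(result)
print(list(small.columns))
```

So use either inplace=True on its own line, or an assignment without inplace, but not both together.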
Delete all the rows which contain missing values#
Note: Altair, by default, will remove rows with missing values that are being used for the visualization.
Apply the isna method with any and a suitable axis keyword argument to determine which rows have any missing data.
If we use axis=0, we are keeping the column labels the same, and finding out which columns have any missing values.
df.isna().any(axis = 0)
species False
location False
bill_length_mm True
bill_depth_mm True
flipper_length_mm True
body_mass_g True
gender True
dtype: bool
In this case, we want to keep the row labels the same and find out which rows have any missing values.
df.isna().any(axis = 1) #which row has any missing values
0 False
1 False
2 False
3 True
4 False
...
339 True
340 False
341 False
342 False
343 False
Length: 344, dtype: bool
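To see the effect of the axis argument on a DataFrame small enough to check by eye, here is a sketch using a made-up three-row example (the name small and its values are assumptions, not penguins data):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [4.0, 5.0, 6.0],
})

missing = small.isna()            # same shape as small, True where a value is missing
by_column = missing.any(axis=0)   # one entry per column label
by_row = missing.any(axis=1)      # one entry per row label

print(by_column.to_dict())
print(by_row.to_dict())
```

The labels of the result tell you which axis was kept: column names for axis=0 and row labels for axis=1.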
There is also an all method (in contrast to the any we are using). Notice how row 3 is now False, because it is not the case that all of the values are missing in this row.
df.isna().all(axis=1)
0 False
1 False
2 False
3 False
4 False
...
339 False
340 False
341 False
342 False
343 False
Length: 344, dtype: bool
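The difference between any and all shows up more clearly when some row really is entirely missing. A minimal sketch with a made-up DataFrame (the name small and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
})

any_missing = small.isna().any(axis=1)   # True if at least one value in the row is missing
all_missing = small.isna().all(axis=1)   # True only if every value in the row is missing

print(any_missing.tolist())
print(all_missing.tolist())
```

Row 1 is flagged by any but not by all, while row 2 is flagged by both.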
Now use Boolean indexing like usual. You might need to take a negation, using tilde ~.
If we plug in exactly what we have above, we will be doing the exact opposite of what we want: keeping the rows that have any missing values.
Be sure to save the resulting DataFrame with the same name df. It should now have 333 rows. (The code below saves the result as df3 instead, so that df itself is left unchanged.)
df.shape
(344, 7)
df3 = df[~df.isna().any(axis=1)] # get rid of rows with missing data
df3.shape
(333, 7)
We can also use notna() together with all() to remove rows with missing data.
df4 = df[df.notna().all(axis = 1)]
df4.shape
(333, 7)
Or use dropna() to directly remove rows with missing values.
df5 = df.dropna()
df5.shape
(333, 7)
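The three approaches above should select exactly the same rows. Here is a sketch that checks this on a made-up four-row DataFrame, and also shows what happens if the negation is forgotten (the name small, its values, and the kept_* names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, None],
})

has_missing = small.isna().any(axis=1)

kept_wrong = small[has_missing]                 # no negation: keeps rows WITH missing values
kept_tilde = small[~has_missing]                # negated mask
kept_notna = small[small.notna().all(axis=1)]   # equivalent mask without a tilde
kept_dropna = small.dropna()                    # built-in shortcut

print(list(kept_wrong.index))
print(list(kept_tilde.index))
print(kept_tilde.equals(kept_notna), kept_tilde.equals(kept_dropna))
```

The two masks agree because “not (some value is missing)” is the same statement as “every value is present”.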
Facet charts#
Display an Altair scatter chart showing bill length for the x-axis, flipper length for the y-axis, and color using species.
df5.columns
Index(['species', 'location', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'gender'],
dtype='object')
import altair as alt
alt.Chart(df5).mark_circle().encode(
x = 'bill_length_mm',
y = 'flipper_length_mm',
color = 'species'
)
Use the domain 30-to-60 for the x-axis and 170-to-240 for the y-axis. (I had some trouble this morning installing the new version of Altair, so let’s use the old syntax.)
alt.Chart(df5).mark_circle().encode(
x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
color = 'species'
)
What data encoding type makes the most sense for “species”: Quantitative, Ordinal, or Nominal? Does adding that abbreviation change the appearance of the chart?
Changing from "species" to "species:N" does not have any effect, because when there are strings in the column, Altair automatically defaults to a Nominal data type.
alt.Chart(df5).mark_circle().encode(
x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
color = 'species:N'
)
What happens if you try to use the “Ordinal” encoding type for the x-axis? (Get rid of the scale part for this.)
Notice how different this looks. Also notice how the gap between 34 and 34.4 is the same as the gap between 36.6 and 36.7. By using the Ordinal data type, ":O", we are telling Altair to treat these as distinct categories, and that the numerical difference between the values is not important.
This chart definitely looks worse than with the default Quantitative encoding.
alt.Chart(df5).mark_circle().encode(
x = 'bill_length_mm:O',
y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
color = 'species:N'
)
Make a facet chart where the penguins are divided according to gender. (Go back to “Quantitative” encoding for the x channel.)
Here the data is divided by “gender”, and the different genders are put into different rows. That is what the row="gender" part means.
alt.Chart(df5).mark_circle().encode(
x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
color = 'species:N',
row = 'gender'
)
Here is the same thing, but putting different genders into different columns. This would be a good choice if you wanted to compare the flipper lengths between genders. If instead you wanted to compare the bill lengths between genders, then I think it would make more sense to use the above vertical facet chart.
alt.Chart(df5).mark_circle().encode(
x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
color = 'species:N',
column = 'gender'
)