Week 3 Friday#

You can find the notebooks at course notes.

Rename columns or rows#

  • In the penguins dataset, change the column named “island” so it is named “location” and change the column named “sex” so it is named “gender”. Use the pandas DataFrame method rename, and input a Python dictionary.

import seaborn as sns
df = sns.load_dataset("penguins")
df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

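The object inside the parentheses in the call below, {'island':'location','sex':'gender'}, is an example of a Python dictionary. This is a very important data type that is built into Python.

However, the call as written has no effect: the column names in the output are unchanged.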

df.rename({'island':'location','sex':'gender'})
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

help(df.rename)
Help on method rename in module pandas.core.frame:

rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') method of pandas.core.frame.DataFrame instance
    Alter axes labels.
    
    Function / dict values must be unique (1-to-1). Labels not contained in
    a dict / Series will be left as-is. Extra labels listed don't throw an
    error.
    
    See the :ref:`user guide <basics.rename>` for more.
    
    Parameters
    ----------
    mapper : dict-like or function
        Dict-like or function transformations to apply to
        that axis' values. Use either ``mapper`` and ``axis`` to
        specify the axis to target with ``mapper``, or ``index`` and
        ``columns``.
    index : dict-like or function
        Alternative to specifying axis (``mapper, axis=0``
        is equivalent to ``index=mapper``).
    columns : dict-like or function
        Alternative to specifying axis (``mapper, axis=1``
        is equivalent to ``columns=mapper``).
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis to target with ``mapper``. Can be either the axis name
        ('index', 'columns') or number (0, 1). The default is 'index'.
    copy : bool, default True
        Also copy underlying data.
    inplace : bool, default False
        Whether to return a new DataFrame. If True then value of copy is
        ignored.
    level : int or level name, default None
        In case of a MultiIndex, only rename labels in the specified
        level.
    errors : {'ignore', 'raise'}, default 'ignore'
        If 'raise', raise a `KeyError` when a dict-like `mapper`, `index`,
        or `columns` contains labels that are not present in the Index
        being transformed.
        If 'ignore', existing keys will be renamed and extra keys will be
        ignored.
    
    Returns
    -------
    DataFrame or None
        DataFrame with the renamed axis labels or None if ``inplace=True``.
    
    Raises
    ------
    KeyError
        If any of the labels is not found in the selected axis and
        "errors='raise'".
    
    See Also
    --------
    DataFrame.rename_axis : Set the name of the axis.
    
    Examples
    --------
    ``DataFrame.rename`` supports two calling conventions
    
    * ``(index=index_mapper, columns=columns_mapper, ...)``
    * ``(mapper, axis={'index', 'columns'}, ...)``
    
    We *highly* recommend using keyword arguments to clarify your
    intent.
    
    Rename columns using a mapping:
    
    >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    >>> df.rename(columns={"A": "a", "B": "c"})
       a  c
    0  1  4
    1  2  5
    2  3  6
    
    Rename index using a mapping:
    
    >>> df.rename(index={0: "x", 1: "y", 2: "z"})
       A  B
    x  1  4
    y  2  5
    z  3  6
    
    Cast index labels to a different type:
    
    >>> df.index
    RangeIndex(start=0, stop=3, step=1)
    >>> df.rename(index=str).index
    Index(['0', '1', '2'], dtype='object')
    
    >>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
    Traceback (most recent call last):
    KeyError: ['C'] not found in axis
    
    Using axis-style parameters:
    
    >>> df.rename(str.lower, axis='columns')
       a  b
    0  1  4
    1  2  5
    2  3  6
    
    >>> df.rename({1: 2, 2: 4}, axis='index')
       A  B
    0  1  4
    2  2  5
    4  3  6

axis=0 is the default, so rename looks through all of the row labels (0, 1, through 343), and if it finds a row named “island” or “sex”, it will change that name. No row has either name, so nothing changes.

We are trying to change the column names, so we should instead use the argument axis=1. Notice how the “island” and “sex” columns have been renamed.

df.rename({'island':'location','sex':'gender'},axis=1)
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns
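
The help text above says that passing the dictionary with axis=1 is equivalent to passing it as columns=. Here is a minimal sketch checking that on a tiny hand-made DataFrame. The name small and its values are invented for illustration; they are not the penguins data.

```python
import pandas as pd

# A tiny stand-in for the penguins data (values invented for illustration).
small = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "sex": ["Male", "Female"],
})
new_names = {"island": "location", "sex": "gender"}

# Default axis=0: the keys are looked up among the row labels (0 and 1),
# so no label matches and the columns keep their names.
by_rows = small.rename(new_names)

# Two equivalent ways to target the column labels instead.
by_axis = small.rename(new_names, axis=1)
by_keyword = small.rename(columns=new_names)

print(list(by_rows.columns))
print(list(by_axis.columns))
print(list(by_keyword.columns))
```

The columns= form says directly which labels are being renamed, which is why the help text recommends keyword arguments.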

Here is an example of renaming one of the row labels (switch to axis=0).

df.rename({3: 'Today'},axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
Today Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

It’s important to point out that we haven’t changed df itself. A hint that we haven’t changed df is that DataFrames were displayed as the result of our code. The code was creating new DataFrames, not changing the original DataFrame.

df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns
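
Besides displaying df, we can check in code that rename built a separate DataFrame. This is a small sketch on an invented two-row DataFrame (small is a hypothetical name, not part of the penguins example).

```python
import pandas as pd

# Invented two-row example, not the penguins data.
small = pd.DataFrame({"island": ["Torgersen", "Biscoe"], "sex": ["Male", "Female"]})

renamed = small.rename({"island": "location"}, axis=1)

print(renamed is small)      # False: rename returned a different object
print(list(small.columns))   # the original column labels are untouched
print(list(renamed.columns))
```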

If we want to change df itself, we can either use the inplace keyword argument or assign the result back to df. (Warning: overwriting df like this can be dangerous, because the original version is lost; it would be safer to give the new DataFrame a different name, like df2.)

df.rename({'island':'location','sex':'gender'},axis=1,inplace=True)

df
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

df2 = df.rename({'island':'location','sex':'gender'},axis=1)
df2
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns
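
One pitfall when mixing the two approaches: according to the help text above, rename returns None when inplace=True, so writing df = df.rename(..., inplace=True) would replace the DataFrame with None. The sketch below shows both options side by side on an invented DataFrame. The names small, small2, and result are hypothetical, and the None return value is assumed to match the pandas version whose help text is shown above.

```python
import pandas as pd

# Invented two-row example, not the penguins data.
small = pd.DataFrame({"island": ["Torgersen", "Biscoe"], "sex": ["Male", "Female"]})

# Option 1: leave the original alone and store the renamed copy under a new name.
small2 = small.rename({"sex": "gender"}, axis=1)

# Option 2: change the original in place; the call itself returns None.
result = small.rename({"island": "location"}, axis=1, inplace=True)

print(result)
print(list(small.columns))
print(list(small2.columns))
```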

Delete all the rows which contain missing values#

Notice: Altair, by default, will remove rows with missing values in the columns that are being used for the visualization.

  • Apply the isna method together with any and a suitable axis keyword argument to determine which rows have any missing data.

If we use axis=0, we are keeping the column labels the same, and finding out which columns have any missing values.

df.isna().any(axis = 0)
species              False
location             False
bill_length_mm        True
bill_depth_mm         True
flipper_length_mm     True
body_mass_g           True
gender                True
dtype: bool

In this case, we want to keep the row labels the same, and find out which rows have any missing values.

df.isna().any(axis = 1) #which row has any missing values
0      False
1      False
2      False
3       True
4      False
       ...  
339     True
340    False
341    False
342    False
343    False
Length: 344, dtype: bool

There is also an all method (in contrast to the any we are using). Notice how row 3 is now False, because it is not the case that all of the values are missing in this row.

df.isna().all(axis=1)
0      False
1      False
2      False
3      False
4      False
       ...  
339    False
340    False
341    False
342    False
343    False
Length: 344, dtype: bool
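
To see the difference between the axis choices and between any and all at a glance, here is a minimal sketch on a three-row DataFrame. The name small and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented three-row example with a few missing values.
small = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.8],
    "sex": ["Male", np.nan, np.nan],
})

missing = small.isna()  # same shape as small, True where a value is missing

print(missing.any(axis=0))  # one Boolean per column: does the column have a missing value?
print(missing.any(axis=1))  # one Boolean per row: does the row have a missing value?
print(missing.all(axis=1))  # one Boolean per row: is every value in the row missing?
```

Every row here has a species, so all(axis=1) is False for every row, just like in the penguins output above.
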
  • Now use Boolean indexing like usual. You might need to take a negation, using tilde ~.

If we plug in exactly what we have above, df[df.isna().any(axis=1)], we will be doing the exact opposite of what we want: that keeps only the rows that have missing values.

  • Be sure to save the resulting DataFrame with the same name df. It should now have 333 rows.

df.shape
(344, 7)
df3 = df[~df.isna().any(axis=1)] # get rid of rows with missing data
df3.shape 
(333, 7)

We can also use notna(), this time with all(), to remove rows with missing data.

df4 = df[df.notna().all(axis = 1)]
df4.shape
(333, 7)

Or use dropna() to directly remove missing values.

df5 = df.dropna()
df5.shape
(333, 7)
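
The three approaches above gave the same shape on the penguins data. The sketch below checks on an invented four-row DataFrame that they give the same DataFrame, and also shows what happens if the tilde is forgotten. All variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented four-row example; rows 1 and 2 contain missing values.
small = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.8, 50.4],
    "sex": ["Male", np.nan, np.nan, "Male"],
})

has_missing = small.isna().any(axis=1)

kept_wrong = small[has_missing]                 # no tilde: keeps only the incomplete rows
kept_tilde = small[~has_missing]                # negated mask: keeps only the complete rows
kept_notna = small[small.notna().all(axis=1)]   # same rows, phrased positively
kept_dropna = small.dropna()                    # same rows, using the built-in method

print(kept_wrong.index.tolist())
print(kept_tilde.index.tolist())
print(kept_tilde.equals(kept_notna) and kept_tilde.equals(kept_dropna))
```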

Facet charts#

  • Display an Altair scatter chart showing bill length for the x-axis, flipper length for the y-axis, and color using species.

df5.columns
Index(['species', 'location', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'gender'],
      dtype='object')
import altair as alt
alt.Chart(df5).mark_circle().encode(
    x = 'bill_length_mm',
    y = 'flipper_length_mm',
    color = 'species'
)
  • Use the domain 30-to-60 for the x-axis and 170-to-240 for the y-axis. (I had some trouble this morning installing the new version of Altair, so let’s use the old syntax.)

alt.Chart(df5).mark_circle().encode(
    x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
    y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
    color = 'species'
)
  • What data encoding type makes the most sense for “species”: Quantitative, Ordinal, or Nominal? Does adding that abbreviation change the appearance of the chart?

Changing from "species" to "species:N" does not have any effect, because when there are strings in the column, Altair automatically defaults to a Nominal data type.

alt.Chart(df5).mark_circle().encode(
    x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
    y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
    color = 'species:N'
)
  • What happens if you try to use the “Ordinal” encoding type for the x-axis? (Get rid of the scale part for this.)

Notice how different this looks. Also notice how the gap between 34 and 34.4 is the same as the gap between 36.6 and 36.7. By using the Ordinal data type, ":O", we are telling Altair to treat these as distinct categories, and that the numerical difference between the values is not important.

This chart definitely looks worse than with the default Quantitative encoding.

alt.Chart(df5).mark_circle().encode(
    x = 'bill_length_mm:O',
    y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
    color = 'species:N'
)
  • Make a facet chart where the penguins are divided according to gender. (Go back to “Quantitative” encoding for the x channel.)

Here the data is divided by “gender”, and the different genders are put into different rows. That is what the row="gender" part means.

alt.Chart(df5).mark_circle().encode(
    x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
    y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
    color = 'species:N',
    row = 'gender'
)

Here is the same thing, but putting different genders into different columns. This would be a good choice if you wanted to compare the flipper lengths between genders. If instead you wanted to compare the bill lengths between genders, then I think it would make more sense to use the above vertical facet chart.

alt.Chart(df5).mark_circle().encode(
    x = alt.X('bill_length_mm',scale = alt.Scale(domain=(30,60))),
    y = alt.Y('flipper_length_mm',scale = alt.Scale(domain=(170,240))),
    color = 'species:N',
    column = 'gender'
)