Week 3 Wednesday

  • Encoding data types

  • Other types of charts in Altair

Encoding data types

(This notion of quantitative data vs categorical data will also be very important when we get to the Machine Learning portion of Math 10.) Altair chooses different default values depending on the type of the data being encoded. These are the 5 types of data distinguished by Altair:

Data Type      Shorthand Code   Description
quantitative   Q                a continuous real-valued quantity
ordinal        O                a discrete ordered quantity
nominal        N                a discrete unordered category
temporal       T                a time or date value
geojson        G                a geographic shape

A quantitative data type is just an ordinary numeric data type, like floats. Ordinal and Nominal data types are categorical data types, where the values represent discrete categories or classes. We use the Ordinal designation if the categories have a natural ordering and we use Nominal if the categories do not have a natural ordering. A Temporal data type is used for data representing datetime-like values. The last encoding data type is for geographic values (like for maps).
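As a rough illustration of how defaults follow from the data, here is a minimal sketch that guesses a shorthand code from a pandas column's dtype. The helper guess_encoding_type and the "sold" column are hypothetical names invented for this example; this is a simplified approximation, not Altair's actual implementation.

```python
import pandas as pd
from pandas.api import types as pdt


def guess_encoding_type(series: pd.Series) -> str:
    """Hypothetical helper: approximate the default encoding type for a column."""
    # Booleans count as numeric in pandas, so check them first.
    if pdt.is_bool_dtype(series):
        return "N"
    if pdt.is_datetime64_any_dtype(series):
        return "T"
    if pdt.is_numeric_dtype(series):
        return "Q"
    # Strings and anything else fall back to nominal.
    return "N"


# Tiny made-up DataFrame; "sold" is an invented datetime column.
demo = pd.DataFrame({
    "mpg": [15.0, 20.5, 28.4],
    "model_year": [70, 77, 79],
    "origin": ["usa", "usa", "japan"],
    "sold": pd.to_datetime(["1970-01-01", "1977-06-01", "1979-03-15"]),
})

guessed = {col: guess_encoding_type(demo[col]) for col in demo.columns}
print(guessed)
```

Notice that an integer column like "model_year" is guessed as quantitative, and that ordinal is never guessed in this sketch. You ask for it explicitly, for example with :O.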

  • Load the “mpg” dataset (sns.load_dataset) from Seaborn and name the DataFrame df.

import altair as alt
print(alt.__version__) # check the version of altair 
4.2.2

Notice: the newest version 5 of Altair has different syntax.

import seaborn as sns
df = sns.load_dataset('mpg')
  • Find the sub-DataFrame for which the name of the car contains the substring “skylark”. Name the sub-DataFrame df_sub. (Reminder: use str and contains.)

df.head(5)
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
#Boolean indexing to get the appropriate sub-DataFrame
df_sub = df[df["name"].str.contains("skylark")]
df_sub
mpg cylinders displacement horsepower weight acceleration model_year origin name
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
226 20.5 6 231.0 105.0 3425 16.9 77 usa buick skylark
305 28.4 4 151.0 90.0 2670 16.0 79 usa buick skylark limited
339 26.6 4 151.0 84.0 2635 16.4 81 usa buick skylark
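The names shown above are all lowercase, so a plain str.contains works here. As a hedged sketch of what to do with messier data, the example below uses a small made-up DataFrame (not the mpg dataset) with mixed case and a missing name, and passes the case and na keyword arguments of str.contains.

```python
import pandas as pd

# Made-up data: mixed capitalization and one missing name.
cars = pd.DataFrame(
    {"name": ["buick skylark 320", "Buick Skylark", None, "ford torino"]}
)

# Default behavior: case-sensitive, and the missing name gives a missing value,
# which cannot be used directly for Boolean indexing.
mask_default = cars["name"].str.contains("skylark")

# Ignore case and treat missing names as "no match".
mask_robust = cars["name"].str.contains("skylark", case=False, na=False)

cars_sub = cars[mask_robust]
print(cars_sub)
```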
  • Make a scatter plot in Altair from this sub-DataFrame using the “model_year” for both the x-coordinate and the color, and using “mpg” for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)

alt.Chart(df_sub).mark_circle().encode(
    x = "model_year",
    y = "mpg",
    color = "model_year"
)
# increase the point size
alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = "model_year",
    y = "mpg",
    color = "model_year"
)
alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year", scale = alt.Scale(zero = False)), #x-axis does not start from zero
    y = "mpg",
    color = "model_year"
)

It still does not look very good. Let’s see what effect changing the encoding types will have.

  • What changes if you specify different encoding types for “model_year”? (The difference in color between quantitative and ordinal will be more clear if you use a different color scheme: options.)

Here we switch the x-axis to the “Ordinal” encoding data type, using :O. Notice how the values 70, 77, 79, 81 are now treated like discrete categories, and the spacing between them is ignored.

In Altair, the data type of an encoding can be specified using one of the following data type shorthand characters: ‘Q’ for quantitative (continuous) data, ‘O’ for ordinal data, ‘N’ for nominal data, and ‘T’ for temporal data.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #`:O` for ordinal data type
    y = "mpg",
    color = "model_year"
)

Here we change the color scheme (see the above link for options). We are specifying that the color channel should use a “Quantitative” encoding, but that is the default for this column, so you will see the same chart if you leave out the :Q.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
    y = "mpg",
    color = alt.Color("model_year:Q", scale = alt.Scale(scheme="lightgreyred"))
)

Here is the exact same chart, but where we switch the color to the “Ordinal” encoding data type. Do you see the differences? One difference is that the “Quantitative” legend shows a continuous progression of numbers. A more subtle difference is that the colors for 77, 79, 81 are grouped closer to each other in the “Quantitative” version, whereas everything, including the 70, is equally spaced colorwise in the “Ordinal” version.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
    y = "mpg",
    color = alt.Color("model_year:O", scale = alt.Scale(scheme="lightgreyred"))
)
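The color-spacing difference described above can be checked with a little arithmetic. The following is a simplified model, not the exact computation the renderer performs: for a quantitative scale we place each year proportionally between the smallest and largest value, and for an ordinal scale we place the categories at equal steps.

```python
years = [70, 77, 79, 81]
lo, hi = min(years), max(years)

# Quantitative: position along the color ramp is proportional to the value.
quantitative_pos = [round((y - lo) / (hi - lo), 3) for y in years]

# Ordinal: only the order matters, so the categories are equally spaced.
ordinal_pos = [round(i / (len(years) - 1), 3) for i in range(len(years))]

print(quantitative_pos)  # [0.0, 0.636, 0.818, 1.0]
print(ordinal_pos)       # [0.0, 0.333, 0.667, 1.0]
```

In the quantitative list, 77, 79, and 81 are bunched toward the top of the ramp, far from 70, which matches what we see in the chart.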

If you switch the color data type to “Nominal” (which means unordered) and use the default color scheme, you can see that there is no ordering or progression to the colors used. The colors in this case are chosen to make the values as distinct as possible.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
    y = "mpg",
    color = alt.Color("model_year:N")
)

Other types of charts in Altair

Here we switch back to the full DataFrame, df. There are many types of charts in Altair (browse the example gallery to see some of the possibilities).

  • Make a bar chart using “cylinders” for the x-coordinate, using the median of the mpg values for the y-coordinate.

There is no column called "median(mpg)" in our DataFrame. Instead this syntax is telling Altair to compute the median of the "mpg" values within each group and to plot the bar heights based on the result.

alt.Chart(df).mark_bar().encode(
    x = "cylinders:O",
    y = "median(mpg)"
)
  • Add a tooltip so we can find the precise median values.

For example, if you put your mouse over the 4-cylinders bar, it will report a median value of 28.25. That is telling us that the median miles-per-gallon across 4-cylinder cars in the dataset is 28.25.

alt.Chart(df).mark_bar().encode(
    x = "cylinders:O",
    y = "median(mpg)",
    tooltip = ["median(mpg)", "cylinders"]
)
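  • Can you find these same median values using df.groupby? Pass the keyword argument numeric_only=True when computing the median, so that the non-numeric columns ("origin" and "name") are skipped. Deepnote hides the warning that older versions of pandas emit without it; in pandas 2.0 and later, leaving it out raises an error instead.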

df.groupby("cylinders").median(numeric_only=True)
mpg displacement horsepower weight acceleration model_year
cylinders
3 20.25 70.0 98.5 2375.0 13.5 75.0
4 28.25 105.0 78.0 2232.0 16.2 78.0
5 25.40 131.0 77.0 2950.0 19.9 79.0
6 19.00 228.0 100.0 3201.5 16.1 76.0
8 14.00 350.0 150.0 4140.0 13.0 73.0
df.groupby("cylinders")["mpg"].median()
cylinders
3    20.25
4    28.25
5    25.40
6    19.00
8    14.00
Name: mpg, dtype: float64
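Here is a minimal sketch of the two approaches on a tiny made-up DataFrame (not the mpg dataset), to show what numeric_only=True does and why selecting the column before calling median avoids the issue altogether.

```python
import pandas as pd

# Made-up data with one non-numeric column, like "name" in the mpg dataset.
demo = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [26.0, 30.0, 28.0, 14.0, 16.0],
    "name": ["a", "b", "c", "d", "e"],
})

# Option 1: take the median of every numeric column; "name" is skipped.
medians_all = demo.groupby("cylinders").median(numeric_only=True)

# Option 2: select the column first, so only "mpg" is ever aggregated.
medians_mpg = demo.groupby("cylinders")["mpg"].median()

print(medians_all)
print(medians_mpg)
```

The second option also does less work, since the median is only computed for the one column we care about.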
  • Make a “rectangle chart” using mark_rect with “model_year” along the x-axis, with “cylinders” along the y-axis, and with the rectangles colored by "count()".

Note that here "count()" is something defined by Altair, not one of the columns in df.

Judging by the colors, the most common combination in this dataset appears to be cars from model year 82 with 4 cylinders. The tooltip lets us check that there are 28 such cars in the dataset.

Here we are storing the chart in the variable c1 so we can refer to it below.

c1 = alt.Chart(df).mark_rect().encode(
    x = "model_year:O",
    y = "cylinders:O",
    color = "count()",
    tooltip = ["count()"]
)
c1 
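The "count()" aggregate counts the rows that fall into each (model_year, cylinders) cell. As a sketch of the same computation in pandas, the example below groups a small made-up DataFrame by both columns and uses size; running the same groupby on the real df should let you check the tooltip values.

```python
import pandas as pd

# Made-up data: three cars from year 70 and two from year 82.
demo = pd.DataFrame({
    "model_year": [70, 70, 70, 82, 82],
    "cylinders": [8, 8, 4, 4, 4],
})

# One count per (model_year, cylinders) pair that occurs in the data.
counts = demo.groupby(["model_year", "cylinders"]).size()

# Rearrange into a grid like the rectangle chart, with 0 for empty cells.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```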
  • Make a “text chart” using mark_text with the same parameters as above, but remove the color encoding, and add a text encoding based on "count()".

By itself, this looks a little strange. (We are using text for the marks in this case, which is why the method is called mark_text.)

c2 = alt.Chart(df).mark_text().encode(
    x = "model_year:O",
    y = "cylinders:O",
    text = "count()"    
)
c2
  • Layer these last two charts together, either using + or using alt.layer.

c1 + c2

The + notation in Altair is just shorthand for layering using alt.layer. Here we are getting both charts displayed at once.

alt.layer(c1, c2)