Week 3 Wednesday
Encoding data types
(This notion of quantitative data vs categorical data will also be very important when we get to the Machine Learning portion of Math 10.) Altair chooses different default values depending on the type of the data being encoded. These are the 5 types of data distinguished by Altair:
Data Type | Shorthand Code | Description |
---|---|---|
quantitative | Q | a continuous real-valued quantity |
ordinal | O | a discrete ordered quantity |
nominal | N | a discrete unordered category |
temporal | T | a time or date value |
geojson | G | a geographic shape |
A quantitative data type is just an ordinary numeric data type, like floats. Ordinal and Nominal data types are categorical data types, where the values represent discrete categories or classes. We use the Ordinal designation if the categories have a natural ordering and we use Nominal if the categories do not have a natural ordering. A Temporal data type is used for data representing datetime-like values. The last encoding data type is for geographic values (like for maps).
Load the "mpg" dataset (sns.load_dataset) from Seaborn and name the DataFrame df.
import altair as alt
print(alt.__version__) # check the version of altair
4.2.2
Note: the newest major version of Altair, version 5, uses some syntax that differs from the version 4 syntax used here.
import seaborn as sns
df = sns.load_dataset('mpg')
Find the sub-DataFrame for which the name of the car contains the substring "skylark". Name the sub-DataFrame df_sub. (Reminder: use str and contains.)
df.head(5)
 | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
#Boolean indexing to get the appropriate sub-DataFrame
df_sub = df[df["name"].str.contains("skylark")]
df_sub
 | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
226 | 20.5 | 6 | 231.0 | 105.0 | 3425 | 16.9 | 77 | usa | buick skylark |
305 | 28.4 | 4 | 151.0 | 90.0 | 2670 | 16.0 | 79 | usa | buick skylark limited |
339 | 26.6 | 4 | 151.0 | 84.0 | 2635 | 16.4 | 81 | usa | buick skylark |
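To see what the Boolean indexing step is doing, here is a minimal sketch on a made-up three-row DataFrame (the rows are chosen for illustration only). It separates the two steps: first build a Boolean Series with str.contains, then use that Series to select rows.

```python
import pandas as pd

# Hypothetical toy data, not the full mpg dataset
toy = pd.DataFrame({
    "name": ["buick skylark 320", "ford torino", "buick skylark"],
    "mpg": [15.0, 17.0, 20.5],
})

# Step 1: a Boolean Series, True where the name contains the substring
mask = toy["name"].str.contains("skylark")
print(mask.tolist())  # [True, False, True]

# Step 2: keep only the rows where the mask is True
toy_sub = toy[mask]
print(toy_sub)
```

Notice that the original index labels are kept in the sub-DataFrame, which is why df_sub above has index values like 226 and 305.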
Make a scatter plot in Altair from this sub-DataFrame using "model_year" for both the x-coordinate and the color, and using "mpg" for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)
alt.Chart(df_sub).mark_circle().encode(
x = "model_year",
y = "mpg",
color = "model_year"
)
# increase the point size
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = "model_year",
y = "mpg",
color = "model_year"
)
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = alt.X("model_year", scale = alt.Scale(zero = False)), #x-axis does not start from zero
y = "mpg",
color = "model_year"
)
It still does not look very good. Let's see what effect changing the encoding types will have.
What changes if you specify different encoding types for "model_year"? (The difference in color between quantitative and ordinal will be clearer if you use a different color scheme; see the color scheme options.)
Here we switch the x-axis to the "Ordinal" encoding data type, using :O. Notice how the values 70, 77, 79, 81 are now treated like discrete categories, and the spacing between them is ignored.
In Altair, the data type of an encoding can be specified using one of the following data type shorthand characters: "Q" for quantitative (continuous) data, "O" for ordinal data, "N" for nominal data, and "T" for temporal data.
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #`:O` for ordinal data type
y = "mpg",
color = "model_year"
)
Here we change the color scheme (see the options mentioned above). We also specify that the color channel should use the "Quantitative" encoding, but that is already the default for this numeric column, so the chart looks the same without :Q.
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
y = "mpg",
color = alt.Color("model_year:Q", scale = alt.Scale(scheme="lightgreyred"))
)
Here is the exact same chart, but with the color channel switched to the "Ordinal" encoding data type. Do you see the differences? One difference is that the "Quantitative" legend shows a continuous progression of numbers. A more subtle difference is that the colors for 77, 79, 81 are grouped closer to each other in the "Quantitative" version, whereas everything, including the 70, is equally spaced colorwise in the "Ordinal" version.
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
y = "mpg",
color = alt.Color("model_year:O", scale = alt.Scale(scheme="lightgreyred"))
)
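The color spacing can be made concrete with a little arithmetic. The sketch below is a simplification: it assumes the quantitative version places each year on the color ramp in proportion to its numeric value (ignoring details such as how Altair rounds the ends of the domain), while the ordinal version spreads the four categories evenly.

```python
years = [70, 77, 79, 81]

# Quantitative: position along the color ramp is proportional to the value
lo, hi = min(years), max(years)
quantitative = [(y - lo) / (hi - lo) for y in years]

# Ordinal: only the order matters, so the categories are evenly spaced
ordinal = [i / (len(years) - 1) for i in range(len(years))]

print([round(p, 2) for p in quantitative])  # [0.0, 0.64, 0.82, 1.0]
print([round(p, 2) for p in ordinal])       # [0.0, 0.33, 0.67, 1.0]
```

Under this assumption, 77, 79, and 81 all land in the upper part of the ramp in the quantitative version, far from 70, which matches the grouping of colors described above.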
If you switch the color data type to "Nominal" (which means unordered) and use the default color scheme, you can see that there is no ordering or progression to the colors used. The colors in this case are chosen to make the values as distinct as possible.
alt.Chart(df_sub).mark_circle(size = 150).encode(
x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #x-axis does not start from zero
y = "mpg",
color = alt.Color("model_year:N")
)
Other types of charts in Altair
Here we switch back to the full DataFrame, df. There are many types of charts in Altair (browse the example gallery to see some of the possibilities).
Make a bar chart using "cylinders" for the x-coordinate, using the median of the mpg values for the y-coordinate.
There is no column called "median(mpg)" in our DataFrame. Instead, this syntax tells Altair to compute the median of the mpg values for each value of cylinders and to set the bar heights based on the result.
alt.Chart(df).mark_bar().encode(
x = "cylinders:O",
y = "median(mpg)"
)
Add a tooltip so we can find the precise median values.
For example, if you put your mouse over the 4-cylinders bar, it will report a median value of 28.25. That is telling us that the median miles-per-gallon across 4-cylinder cars in the dataset is 28.25.
alt.Chart(df).mark_bar().encode(
x = "cylinders:O",
y = "median(mpg)",
tooltip = ["median(mpg)", "cylinders"]
)
Can you find these same median values using df.groupby? Use the keyword argument numeric_only when computing the median. Deepnote hides the warning, but without this argument pandas warns about the non-numeric columns, and newer versions of pandas raise an error instead.
df.groupby("cylinders").median(numeric_only=True)
cylinders | mpg | displacement | horsepower | weight | acceleration | model_year |
---|---|---|---|---|---|---|
3 | 20.25 | 70.0 | 98.5 | 2375.0 | 13.5 | 75.0 |
4 | 28.25 | 105.0 | 78.0 | 2232.0 | 16.2 | 78.0 |
5 | 25.40 | 131.0 | 77.0 | 2950.0 | 19.9 | 79.0 |
6 | 19.00 | 228.0 | 100.0 | 3201.5 | 16.1 | 76.0 |
8 | 14.00 | 350.0 | 150.0 | 4140.0 | 13.0 | 73.0 |
df.groupby("cylinders").median(numeric_only=True)["mpg"]
cylinders
3 20.25
4 28.25
5 25.40
6 19.00
8 14.00
Name: mpg, dtype: float64
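An alternative that sidesteps the non-numeric columns is to select the "mpg" column before computing the median. The sketch below checks, on a made-up toy DataFrame with one string column, that the two approaches agree. The data values are for illustration only.

```python
import pandas as pd

# Hypothetical toy data with a string column, like "name" in the mpg dataset
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [26.0, 28.0, 31.0, 14.0, 16.0],
    "name": ["a", "b", "c", "d", "e"],
})

# Select the column first, then aggregate: the string column is never involved
medians = toy.groupby("cylinders")["mpg"].median()

# Aggregate all numeric columns first, then select the column
same = toy.groupby("cylinders").median(numeric_only=True)["mpg"]

print(medians)
print(medians.equals(same))
```

Selecting the column first also does less work, because the median is only computed for the one column we care about.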
Make a "rectangle chart" using mark_rect with "model_year" along the x-axis, with "cylinders" along the y-axis, and with the rectangles colored by "count()".
Note that here "count()" is something defined by Altair, not one of the columns in df.
Judging by the colors, the most common combination in this dataset is model year 82 with 4 cylinders. The tooltip lets us check that there are 28 such cars in the dataset.
Here we are storing the chart in the variable c1 so we can refer to it below.
c1 = alt.Chart(df).mark_rect().encode(
x = "model_year:O",
y = "cylinders:O",
color = "count()",
tooltip = ["count()"]
)
c1
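The counts that Altair computes for each rectangle can also be reproduced in pandas. Here is a minimal sketch on a made-up toy DataFrame (the rows are for illustration only), using groupby with size and, as a second option, pd.crosstab, which lays the counts out in a grid similar to the chart.

```python
import pandas as pd

# Hypothetical toy data, not the full mpg dataset
toy = pd.DataFrame({
    "model_year": [70, 70, 82, 82, 82],
    "cylinders": [8, 8, 4, 4, 6],
})

# Number of rows for each (model_year, cylinders) pair that occurs
counts = toy.groupby(["model_year", "cylinders"]).size()
print(counts)

# The same counts as a grid: cylinders down the side, model_year across the top
table = pd.crosstab(toy["cylinders"], toy["model_year"])
print(table)
```

One difference from the chart: pd.crosstab fills in 0 for combinations that never occur, whereas the rectangle chart simply leaves those cells blank.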
Make a "text chart" using mark_text with the same parameters as above, but remove the color encoding, and add a text encoding based on "count()".
By itself, this looks a little strange. (We are using text for the marks in this case, which is why it is called mark_text.)
c2 = alt.Chart(df).mark_text().encode(
x = "model_year:O",
y = "cylinders:O",
text = "count()"
)
c2
Layer these last two charts together, either using + or using alt.layer.
c1 + c2
The + notation in Altair is just shorthand for layering using alt.layer. Here we are getting both charts displayed at once.
alt.layer(c1, c2)