Week 3 Wednesday

Encoding data types
Other types of charts in Altair

Encoding data types

Altair chooses different default values depending on the type of the data being encoded. (This distinction between quantitative data and categorical data will also be very important when we get to the Machine Learning portion of Math 10.) These are the five types of data distinguished by Altair:

Data Type	Shorthand Code	Description
quantitative	Q	a continuous real-valued quantity
ordinal	O	a discrete ordered quantity
nominal	N	a discrete unordered category
temporal	T	a time or date value
geojson	G	a geographic shape

A quantitative data type is just an ordinary numeric data type, like floats. Ordinal and Nominal data types are categorical data types, where the values represent discrete categories or classes. We use the Ordinal designation if the categories have a natural ordering and we use Nominal if the categories do not have a natural ordering. A Temporal data type is used for data representing datetime-like values. The last encoding data type is for geographic values (like for maps).

Load the “mpg” dataset (sns.load_dataset) from Seaborn and name the DataFrame df.

import altair as alt

print(alt.__version__) # check the version of altair 

4.2.2

Note: the newest version of Altair (version 5) uses different syntax in places.

import seaborn as sns
df = sns.load_dataset('mpg')

Find the sub-DataFrame for which the name of the car contains the substring “skylark”. Name the sub-DataFrame df_sub. (Reminder. Use str and contains.)

df.head(5)

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	usa	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320
2	18.0	8	318.0	150.0	3436	11.0	70	usa	plymouth satellite
3	16.0	8	304.0	150.0	3433	12.0	70	usa	amc rebel sst
4	17.0	8	302.0	140.0	3449	10.5	70	usa	ford torino

#Boolean indexing to get the appropriate sub-DataFrame
df_sub = df[df["name"].str.contains("skylark")]
df_sub

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
1	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320
226	20.5	6	231.0	105.0	3425	16.9	77	usa	buick skylark
305	28.4	4	151.0	90.0	2670	16.0	79	usa	buick skylark limited
339	26.6	4	151.0	84.0	2635	16.4	81	usa	buick skylark

Make a scatter plot in Altair from this sub-DataFrame using the “model_year” for both the x-coordinate and the color, and using “mpg” for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)

alt.Chart(df_sub).mark_circle().encode(
    x = "model_year",
    y = "mpg",
    color = "model_year"
)

# increase the point size
alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = "model_year",
    y = "mpg",
    color = "model_year"
)

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year", scale = alt.Scale(zero = False)), #x-axis does not start from zero
    y = "mpg",
    color = "model_year"
)

It still does not look very good. Let’s see what effect changing the encoding types will have.

What changes if you specify different encoding types for “model_year”? (The difference in color between quantitative and ordinal will be clearer if you use a different color scheme; Altair offers many color scheme options.)

In Altair, the data type of an encoding can be specified using one of the following data type shorthand characters:

‘Q’ for quantitative (continuous) data
‘O’ for ordinal data
‘N’ for nominal data
‘T’ for temporal data

Here we switch the x-axis to the “Ordinal” encoding data type, using :O. Notice how the values 70, 77, 79, 81 are now treated like discrete categories, and the spacing between them is ignored.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), #`:O` for ordinal data type
    y = "mpg",
    color = "model_year"
)

Here we change the color scheme to "lightgreyred" (one of the color scheme options mentioned above). We are specifying that the color channel should use a “Quantitative” encoding, but that is the default for this numeric column, so you will see the same thing if you leave out the :Q.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), # x-axis uses the ordinal type
    y = "mpg",
    color = alt.Color("model_year:Q", scale = alt.Scale(scheme="lightgreyred"))
)

Here is the exact same chart, but where we switch the color to the “Ordinal” encoding data type. Do you see the differences? One difference is that the “Quantitative” legend shows a continuous progression of numbers. A more subtle difference is that the colors for 77, 79, 81 are grouped closer to each other in the “Quantitative” version, whereas everything, including the 70, is equally spaced colorwise in the “Ordinal” version.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), # x-axis uses the ordinal type
    y = "mpg",
    color = alt.Color("model_year:O", scale = alt.Scale(scheme="lightgreyred"))
)
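
To see why the color spacing differs, here is a rough sketch in plain Python. This is not Altair's actual color computation. It only assumes that a quantitative scale positions each value in proportion to its numeric size, while an ordinal scale uses just the rank of each value. The variable names are our own.

```python
years = [70, 77, 79, 81]

# Quantitative idea: position along the color ramp is proportional to the value.
lo, hi = min(years), max(years)
quantitative = [(y - lo) / (hi - lo) for y in years]

# Ordinal idea: position depends only on the rank among the distinct values.
ranks = {y: i for i, y in enumerate(sorted(set(years)))}
ordinal = [ranks[y] / (len(ranks) - 1) for y in years]

for y, q, o in zip(years, quantitative, ordinal):
    print(y, round(q, 2), round(o, 2))
# 70 0.0 0.0
# 77 0.64 0.33
# 79 0.82 0.67
# 81 1.0 1.0
```

In the quantitative column, 77, 79 and 81 are bunched toward the top of the range, while in the ordinal column all four values are evenly spaced.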

If you switch the color data type to “Nominal” (which means unordered) and use the default color scheme, you can see that there is no ordering or progression to the colors used. The colors in this case are chosen to make the values as distinct as possible.

alt.Chart(df_sub).mark_circle(size = 150).encode(
    x = alt.X("model_year:O", scale = alt.Scale(zero = False)), # x-axis uses the ordinal type
    y = "mpg",
    color = alt.Color("model_year:N")
)

Other types of charts in Altair

Here we switch back to the full DataFrame, df. There are many types of charts in Altair (browse the example gallery to see some of the possibilities).

Make a bar chart using “cylinders” for the x-coordinate, using the median of the mpg values for the y-coordinate.

There is no column called "median(mpg)" in our DataFrame. Instead this syntax is telling Altair to compute the median and plot the bar heights based on the result.

alt.Chart(df).mark_bar().encode(
    x = "cylinders:O",
    y = "median(mpg)"
)

Add a tooltip so we can find the precise median values.

For example, if you put your mouse over the 4-cylinders bar, it will report a median value of 28.25. That is telling us that the median miles-per-gallon across 4-cylinder cars in the dataset is 28.25.

alt.Chart(df).mark_bar().encode(
    x = "cylinders:O",
    y = "median(mpg)",
    tooltip = ["median(mpg)", "cylinders"]
)

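Can you find these same median values using df.groupby? Pass the keyword argument numeric_only=True when computing the median, so that non-numeric columns such as “name” and “origin” are skipped. (Deepnote hides the warning you get without it; in newer versions of pandas, omitting the argument may raise an error instead of a warning.)

df.groupby("cylinders").median(numeric_only=True)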

	mpg	displacement	horsepower	weight	acceleration	model_year
cylinders
3	20.25	70.0	98.5	2375.0	13.5	75.0
4	28.25	105.0	78.0	2232.0	16.2	78.0
5	25.40	131.0	77.0	2950.0	19.9	79.0
6	19.00	228.0	100.0	3201.5	16.1	76.0
8	14.00	350.0	150.0	4140.0	13.0	73.0

df.groupby("cylinders").median(numeric_only=True)["mpg"]

cylinders
3    20.25
4    28.25
5    25.40
6    19.00
8    14.00
Name: mpg, dtype: float64
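
Here is a small self-contained sketch of what numeric_only does, using a made-up toy DataFrame (not the real mpg data) that has one string column. It also shows an alternative: selecting the “mpg” column before computing the median, so the string column never gets involved.

```python
import pandas as pd

# Made-up rows for illustration, with one non-numeric column.
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [26.0, 28.0, 33.0, 13.0, 15.0],
    "name": ["car a", "car b", "car c", "car d", "car e"],
})

# numeric_only=True skips the string column "name".
medians = toy.groupby("cylinders").median(numeric_only=True)
print(medians)

# Selecting the column first means only "mpg" is ever aggregated.
mpg_medians = toy.groupby("cylinders")["mpg"].median()
print(mpg_medians)
```

Both approaches give a median of 28.0 for the 4-cylinder rows and 14.0 for the 8-cylinder rows in this toy data.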

Make a “rectangle chart” using mark_rect with “model_year” along the x-axis, with “cylinders” along the y-axis, and with the rectangles colored by "count()".

Note that here "count()" is something defined by Altair, not one of the columns in df.

Judging by the colors, the most common combination in this dataset appears to be cars from model year 82 with 4 cylinders. The tooltip lets us check that there are 28 such cars in the dataset.

Here we are storing the chart in the variable c1 so we can refer to it below.

c1 = alt.Chart(df).mark_rect().encode(
    x = "model_year:O",
    y = "cylinders:O",
    color = "count()",
    tooltip = ["count()"]
)
c1 
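
For comparison, here is a sketch of how similar counts could be computed directly in pandas, by grouping on both columns and counting the rows in each group. The toy DataFrame contains made-up rows (not the real mpg data), and the names toy, counts and grid are our own.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the mpg dataset.
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 82, 82],
    "cylinders": [8, 8, 4, 4, 4],
})

# Number of rows for each (model_year, cylinders) pair that occurs.
counts = toy.groupby(["model_year", "cylinders"]).size()
print(counts)

# Reshape into a grid, filling in 0 for combinations that never occur.
grid = counts.unstack(fill_value=0)
print(grid)
```

Each entry of the grid plays the role of one rectangle's color value in the chart, with one difference: this grid shows a 0 for a combination that never occurs, whereas "count()" only produces values for combinations that actually appear in the data.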

Make a “text chart” using mark_text with the same parameters as above, but remove the color encoding, and add a text encoding based on "count()".

By itself, this looks a little strange. (We are using text for the marks in this case, that is why it is called mark_text.)

c2 = alt.Chart(df).mark_text().encode(
    x = "model_year:O",
    y = "cylinders:O",
    text = "count()"    
)
c2

Layer these last two charts together, either using + or using alt.layer.

c1 + c2

The + notation in Altair is just shorthand for layering using alt.layer. Here we are getting both charts displayed at once.

alt.layer(c1, c2)
