Analyzing Top Rubik’s Cube Times#

Author: Cyrus Phan

Course Project, UC Irvine, Math 10, F23

Introduction#

For about 7 years at this point, I’ve been very interested in Rubik’s Cubes, and particularly in speedsolving, or speedcubing, in which people compete to solve a puzzle in the shortest time possible. Official times are recorded by the World Cube Association (WCA) in many events, but I will be focusing on the 3x3 event. I would like to analyze patterns in the times of the top performers in the world.

Importing the Datasets and Manipulating Them#

The datasets I will be using are courtesy of this Kaggle post, which compiles the top competitors’ best average and single solve times as of September 5, 2021. Many records have been broken since then, so some of this data is outdated, but the file should still be a good indicator of patterns in the times.

For context, WCA competition rounds work as follows: each round, a competitor does five solves, each measured to three decimal places and truncated to the hundredth of a second. Of the five solves in the round, the fastest and slowest are discarded, and the mean of the three remaining times is recorded as the average for that round. This is all to say, the WCA’s use of the term “average” does not mean the true mean of five solves. So, from this point forward, when I say “average”, I mean “the mean of the three middle times in a round of five solves.” The dataframe averages contains the best such result for each of the top 1000 performers in the 3x3 average category.
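
As a small illustration (not part of the analysis), here is how that “average of 5” could be computed for a hypothetical round of five solves:

# Illustrative helper: WCA-style "average of 5".
# Drop the fastest and slowest solves, average the middle three,
# and truncate to two decimal places.
def wca_average(times):
    middle_three = sorted(times)[1:-1]        # discard best and worst
    mean = sum(middle_three) / 3
    return int(mean * 100) / 100              # truncate, don't round

wca_average([5.21, 6.83, 6.05, 5.97, 7.44])   # -> 6.28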

Similarly, the dataframe singles contains the best result for the top 1000 performers in the 3x3 single category. That is, the best solve time ever recorded for that competitor throughout any competition.

import pandas as pd
averages = pd.read_csv('best_averages.csv')
singles = pd.read_csv('best_singles.csv')
singles
Unnamed: 0 rank name time country competition
0 0 1 Yusheng Du (杜宇生) 3.47 China Wuhu Open 2018
1 1 2 Ruihang Xu (许瑞航) 4.06 China Wuhan Open 2021
2 2 3 Feliks Zemdegs 4.16 Australia Auckland Summer Open 2020
3 3 4 Patrick Ponce 4.24 United States Northeast Championship 2019
4 4 5 Nicolás Sánchez 4.38 United States GA Cubers Feet Fest 2019
... ... ... ... ... ... ...
995 995 994 Francis Wong Jia Yen 7.18 Malaysia Melaka Open 2014
996 996 994 Henry Lichner 7.18 United States Wiscube 2019
997 997 994 Mharr Justhinne Ampong 7.18 Philippines UCFO 2017
998 998 994 Sakib Ibn Rashid Rhivu 7.18 Bangladesh City of Joy 2017
999 999 994 Vasilis Vasileris 7.18 Greece Thessalonicube 2016

1000 rows × 6 columns

A major component of what I wanted to analyze was the year of each recorded entry, but unfortunately, the dataset did not have a column for year. However, I noticed that, per WCA regulation, the year is required at the end of every competition name, so if I extract the last four digits of every string in the competition column of averages and singles, I can get the year each entry was recorded. Also, since there seem to be two spaces before each competition name and one after, I removed the first two characters and the last one from each string in competition for both dataframes.

averages['year'] = averages['competition'].str[-5:-1] #-5 to -1 since there seems to be a space after every year
averages['year'] = pd.to_numeric(averages['year'])
averages['competition'] = averages['competition'].str[2:-1]
singles['year'] = singles['competition'].str[-5:-1]
singles['year'] = pd.to_numeric(singles['year'])
singles['competition'] = singles['competition'].str[2:-1]
averages
Unnamed: 0 rank name time country competition year
0 0 1 Ruihang Xu (许瑞航) 5.48 China Wuhan Open 2021 2021
1 1 2 Feliks Zemdegs 5.53 Australia Odd Day in Sydney 2019 2019
2 2 3 Tymon Kolasiński 5.54 Poland LLS II Biała Podlaska 2021 2021
3 3 4 Yezhen Han (韩业臻) 5.57 China Guangdong Open 2021 2021
4 4 5 Max Park 5.59 United States Houston Winter 2020 2020
... ... ... ... ... ... ... ...
995 995 993 Junze Zhao (赵俊泽) 9.05 China Xuzhou Open 2018 2018
996 996 993 Marwan Kamal 9.05 Canada National Capital Region 2019 2019
997 997 993 Mattheo de Wit 9.05 Netherlands Hoogland Open 2019 2019
998 998 993 Pierre Meunier 9.05 France Elsass Open 2019 2019
999 999 993 Zhiyong Liu (柳智勇) 9.05 China Xiamen Autumn 2019 2019

1000 rows × 7 columns
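
As an aside, a slightly more robust way to do the same extraction (a sketch, not what I used above) is to pull the trailing four-digit year out with a regular expression and strip the surrounding whitespace, rather than relying on fixed character positions:

# Regex-based alternative to the fixed-position slicing (sketch), applied to
# hypothetical raw competition strings with two leading spaces and one trailing.
raw = pd.Series(['  Wuhan Open 2021 ', '  Odd Day in Sydney 2019 '])
year = pd.to_numeric(raw.str.extract(r'(\d{4})\s*$', expand=False))  # trailing 4-digit year
name = raw.str.strip()                                               # trim surrounding spaces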

I also want to analyze data about all listed competitions in the WCA database. However, some WCA competitions last several days, so the date column contains values that cannot be converted to datetime directly, such as “Aug 28 - 29, 2021” or “Jul 31 - Aug 1, 2021”. Date ranges, as far as I know, cannot be converted into a single datetime value, so my strategy is to extract the first six characters of the string (the starting month and day) and the last four (the year), then concatenate them with a space in between to get a format pandas can parse.
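
To sanity-check that idea on the two example strings above (a quick illustrative snippet, not from the notebook):

# Quick check of the slicing strategy on the multi-day examples above.
for s in ['Aug 28 - 29, 2021', 'Jul 31 - Aug 1, 2021']:
    print(pd.to_datetime(s[:6] + ' ' + s[-4:]))   # first day of the competition
# expected: 2021-08-28 and 2021-07-31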

comps = pd.read_csv('all_comp.csv')
comps['date'] = comps['date'].str[:6] + ' ' + comps['date'].str[-4:]
comps['date'] = pd.to_datetime(comps['date'])
comps
Unnamed: 0 date name country venue
0 0 2021-09-05 Flen - Lilla Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan
1 1 2021-09-04 SST Naprawa 2021 Poland, Naprawa Zespół Szkolno-Przedszkolny w Naprawie
2 2 2021-09-04 Flen - Stora Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan
3 3 2021-08-29 Chaotic Craigie 2021 Australia, Perth, Western Australia Craigie Leisure Centre
4 4 2021-08-28 Bologna Back from Holidays 2021 Italy, Castenaso, Bologna PalaGioBat
... ... ... ... ... ...
7089 7089 2004-04-03 Caltech Spring 2004 United States, Pasadena, California California Institute of Technology
7090 7090 2004-01-24 Caltech Winter 2004 United States, Pasadena, California California Institute of Technology
7091 7091 2003-10-11 Dutch Open 2003 Netherlands, Veldhoven ASML
7092 7092 2003-08-23 World Championship 2003 Canada, Toronto, Ontario Ontario Science Center
7093 7093 1982-06-05 World Championship 1982 Hungary, Budapest NaN

7094 rows × 5 columns

I’d also like to remove rows that have missing values in the venue column.

comps = comps.dropna()
comps
Unnamed: 0 date name country venue
0 0 2021-09-05 Flen - Lilla Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan
1 1 2021-09-04 SST Naprawa 2021 Poland, Naprawa Zespół Szkolno-Przedszkolny w Naprawie
2 2 2021-09-04 Flen - Stora Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan
3 3 2021-08-29 Chaotic Craigie 2021 Australia, Perth, Western Australia Craigie Leisure Centre
4 4 2021-08-28 Bologna Back from Holidays 2021 Italy, Castenaso, Bologna PalaGioBat
... ... ... ... ... ...
7088 7088 2004-04-16 France 2004 France, Paris Novotel Les Halles
7089 7089 2004-04-03 Caltech Spring 2004 United States, Pasadena, California California Institute of Technology
7090 7090 2004-01-24 Caltech Winter 2004 United States, Pasadena, California California Institute of Technology
7091 7091 2003-10-11 Dutch Open 2003 Netherlands, Veldhoven ASML
7092 7092 2003-08-23 World Championship 2003 Canada, Toronto, Ontario Ontario Science Center

7076 rows × 5 columns

I also want to split the country column to keep only the country name, rather than the country together with the city where the competition was hosted.

comps['trueCountry'] = comps['country'].str.split(',').str.get(0)
comps
/tmp/ipykernel_932/2980643713.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  comps['trueCountry'] = comps['country'].str.split(',').str.get(0)
Unnamed: 0 date name country venue trueCountry
0 0 2021-09-05 Flen - Lilla Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan Sweden
1 1 2021-09-04 SST Naprawa 2021 Poland, Naprawa Zespół Szkolno-Przedszkolny w Naprawie Poland
2 2 2021-09-04 Flen - Stora Manschetten 2021 Sweden, Flen Aktivitetshuset Skjortan Sweden
3 3 2021-08-29 Chaotic Craigie 2021 Australia, Perth, Western Australia Craigie Leisure Centre Australia
4 4 2021-08-28 Bologna Back from Holidays 2021 Italy, Castenaso, Bologna PalaGioBat Italy
... ... ... ... ... ... ...
7088 7088 2004-04-16 France 2004 France, Paris Novotel Les Halles France
7089 7089 2004-04-03 Caltech Spring 2004 United States, Pasadena, California California Institute of Technology United States
7090 7090 2004-01-24 Caltech Winter 2004 United States, Pasadena, California California Institute of Technology United States
7091 7091 2003-10-11 Dutch Open 2003 Netherlands, Veldhoven ASML Netherlands
7092 7092 2003-08-23 World Championship 2003 Canada, Toronto, Ontario Ontario Science Center Canada

7076 rows × 6 columns
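
As an aside, the SettingWithCopyWarning above appears because the column is being assigned on the result of dropna; a common way to avoid it (a sketch, under the assumption that comps is meant to be an independent dataframe) is to take an explicit copy first:

# Taking an explicit copy makes comps an independent dataframe, so the
# column assignment no longer triggers SettingWithCopyWarning.
comps = comps.dropna().copy()
comps['trueCountry'] = comps['country'].str.split(',').str.get(0)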

The singles and averages dataframes do not include the date each result came from, but they do have the competition name, and all competition names are stored in the comps dataframe. So I can take the dates from comps and put them into a new column in singles and averages using the apply method and a lambda function. Also, since I want to use a few of scikit-learn’s tools, which do not have built-in support for datetime values, I will preprocess the values in the date column into a new column called daysSinceMin, which, as the name suggests, gives the number of days between that result and the oldest date value in date.

from datetime import datetime

averages['date'] = averages['competition'].apply(lambda x : comps[comps['name'] == x].iloc[0,1])
singles['date'] = singles['competition'].apply(lambda x : comps[comps['name'] == x].iloc[0,1])

averages['daysSinceMin'] = (averages['date'] - averages['date'].min()).dt.days
singles['daysSinceMin'] = (singles['date'] - singles['date'].min()).dt.days

singles
Unnamed: 0 rank name time country competition year date daysSinceMin
0 0 1 Yusheng Du (杜宇生) 3.47 China Wuhu Open 2018 2018 2018-11-24 2506
1 1 2 Ruihang Xu (许瑞航) 4.06 China Wuhan Open 2021 2021 2021-06-05 3430
2 2 3 Feliks Zemdegs 4.16 Australia Auckland Summer Open 2020 2020 2020-02-29 2968
3 3 4 Patrick Ponce 4.24 United States Northeast Championship 2019 2019 2019-02-16 2590
4 4 5 Nicolás Sánchez 4.38 United States GA Cubers Feet Fest 2019 2019 2019-12-14 2891
... ... ... ... ... ... ... ... ... ...
995 995 994 Francis Wong Jia Yen 7.18 Malaysia Melaka Open 2014 2014 2014-07-19 917
996 996 994 Henry Lichner 7.18 United States Wiscube 2019 2019 2019-12-21 2898
997 997 994 Mharr Justhinne Ampong 7.18 Philippines UCFO 2017 2017 2017-12-02 2149
998 998 994 Sakib Ibn Rashid Rhivu 7.18 Bangladesh City of Joy 2017 2017 2017-06-03 1967
999 999 994 Vasilis Vasileris 7.18 Greece Thessalonicube 2016 2016 2016-07-30 1659

1000 rows × 9 columns
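
An equivalent and typically faster way to attach the competition dates (a sketch, not what the notebook does) is a merge on the competition name; the drop_duplicates call mirrors the “first match” behavior of iloc[0] in the lambda above:

# Sketch of a merge-based alternative to the apply/lambda lookup.
comp_dates = comps.drop_duplicates('name')[['name', 'date']].rename(
    columns={'name': 'competition', 'date': 'comp_date'})
singles_alt = singles.merge(comp_dates, on='competition', how='left')
singles_alt['daysSinceMin'] = (singles_alt['comp_date'] - singles_alt['comp_date'].min()).dt.days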

Visualizing the Data#

In this section, I will be using Altair to visualize the data.

The first thing I want to do is visualize the distribution of the fastest single and average solve times in a bar chart. Since the data in time is continuous, I’d like to group it into bins; rather than binning manually with pandas’ cut function, I let Altair’s bin encoding handle the grouping in the charts below.
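
For reference, a rough pandas equivalent of that binning (a sketch using pd.cut; the analysis itself just relies on Altair):

# Illustrative pandas equivalent of the Altair binning used below:
# split the single times into 20 equal-width bins and count results per bin.
bins = pd.cut(singles['time'], bins=20)
bin_counts = bins.value_counts().sort_index()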

import altair as alt
sg = alt.Chart(singles).mark_bar(color='red').encode(
    alt.X('time', bin=alt.Bin(maxbins=20)),
    alt.Y('count(time)')
)
sg
alt.Chart(averages).mark_bar(color='blue').encode(
    alt.X('time', bin=alt.Bin(maxbins=20)),
    alt.Y('count(time)')
)

The unusually low count for the rightmost bin (around 9.0 seconds) in the averages chart is explained by the fact that each dataframe only includes 1000 rows: the actual number of recorded times in that range is cut off at the 1000th-ranked result.

I’m also interested in what year each competitor’s result came from.

alt.Chart(singles).mark_bar(color='red').encode(
    alt.X('year:O'),
    alt.Y('count(year)')
)
alt.Chart(averages).mark_bar(color='blue').encode(
    alt.X('year:O'),
    alt.Y('count(year)')
)

As expected, there are few results from 2017 or earlier; many competitors who still participate in this event are probably still improving their times, and the results from earlier years mostly come from competitors who are no longer active. The effect of the COVID-19 pandemic is also very obvious: the year-over-year growth in the number of top results suddenly dipped in 2020 for both averages and singles.

I’d also like to see the trend in the number of competitions per year.

alt.data_transformers.disable_max_rows()
alt.Chart(comps).mark_bar().encode(
    alt.X('year(date)', bin=alt.Bin(maxbins=20)),
    alt.Y('count(date)')
)

Once again, the effect of the COVID-19 lockdown is obvious. The number of competitions hosted in 2020 and 2021 was much lower than in the years prior.

Now, I would like to see which countries hosted the most cubing competitions.

alt.Chart(comps).mark_bar().encode(
    alt.X('trueCountry', sort='-y'),
    alt.Y('count(trueCountry)')  # count is quantitative, so no ':N' type override
)
len(comps['trueCountry'].unique())
115

Amazingly, the United States has hosted about 1400 competitions, China about 600, and in total, 115 countries around the world have hosted Rubik’s Cube competitions.
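
To read the exact counts off rather than eyeballing the bar heights, one could inspect value_counts directly (a sketch; output omitted):

# Exact number of competitions hosted per country, largest first.
comps['trueCountry'].value_counts().head(10)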

Let’s also see if there’s any noticeable correlation between the date at which a top result occurred and the time result itself.

alt.Chart(singles).mark_circle(color='red').encode(
    alt.X('date:T'),
    alt.Y('time', scale=alt.Scale(domain=(3.3,7.5)))
)
alt.Chart(averages).mark_circle(color='blue').encode(
    alt.X('date:T'),
    alt.Y('time', scale=alt.Scale(domain=(5.3,9.2)))
)

There are a few noticeable patterns in both graphs. First, as before, there is a huge gap in mid-2020, as expected from the pandemic lockdown. The other is that the points are sparse before 2018, since most top competitors are probably still active and have since improved on their older results.

Predicting Times Using Regression Models#

Seemingly, there is no linear correlation between date and time (the result), but just for curiosity’s sake, let’s fit a LinearRegression object to the data, using daysSinceMin as the only input feature and time as the target, then visualize the regression line against the actual data. I’ll do this for both singles and averages.

from sklearn.linear_model import LinearRegression
singlesReg = LinearRegression()
singlesReg.fit(singles[['daysSinceMin']],singles['time'])
singles['pred_lin'] = singlesReg.predict(singles[['daysSinceMin']])

c1 = alt.Chart(singles).mark_circle(color='red').encode(
    alt.X('daysSinceMin'),
    alt.Y('time', scale=alt.Scale(domain=(3.3,7.5)))
)

c2 = alt.Chart(singles).mark_line(color='lime').encode(
    x='daysSinceMin',
    y='pred_lin'
)
c1 + c2
averagesReg = LinearRegression()
averagesReg.fit(averages[['daysSinceMin']],averages['time'])
averages['pred_lin'] = averagesReg.predict(averages[['daysSinceMin']])

c3 = alt.Chart(averages).mark_circle(color='blue').encode(
    alt.X('daysSinceMin'),
    alt.Y('time', scale=alt.Scale(domain=(5.3,9.2)))
)

c4 = alt.Chart(averages).mark_line(color='orange').encode(
    x='daysSinceMin',
    y='pred_lin'
)
c3 + c4
from sklearn.metrics import r2_score
r2_score(singles['time'],singles['pred_lin'])
0.00044524837645920634
r2_score(averages['time'],averages['pred_lin'])
0.021245491845280462

As expected, the \(R^2\) value for both models is close to 0, which is terrible. Clearly, a linear regression model would not have been appropriate for this data anyway.

Again, just for fun, we’ll try to perform degree 8 polynomial regression on the data in singles and averages, then plot the results on an Altair chart against the actual data points.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

singles_copy = singles.copy()
singles_sub = singles_copy.sample(n=200, random_state=0)
modelSingle = Pipeline([
    ("poly", PolynomialFeatures(degree=8)),
    ("reg", LinearRegression())
])
modelSingle.fit(singles_sub[['daysSinceMin']], singles_sub['time'])
singles_copy['pred_poly'] = modelSingle.predict(singles_copy[['daysSinceMin']])

c5 = alt.Chart(singles_copy).mark_line(clip=True, color = 'lime').encode(
    alt.X('daysSinceMin'),
    alt.Y('pred_poly', scale=alt.Scale(domain=(3.3,7.5)))
)
c1 + c5
averages_copy = averages.copy()
averages_sub = averages_copy.sample(n=200, random_state=0)
modelAverage = Pipeline([
    ("poly", PolynomialFeatures(degree=8)),
    ("reg", LinearRegression())
])
modelAverage.fit(averages_sub[['daysSinceMin']], averages_sub['time'])
averages_copy['pred_poly'] = modelAverage.predict(averages_copy[['daysSinceMin']])

c6 = alt.Chart(averages_copy).mark_line(clip=True, color='orange').encode(
    alt.X('daysSinceMin'),
    alt.Y('pred_poly', scale=alt.Scale(domain=(5.3,9.2)))  # match the averages axis used in c3
)
c3 + c6

Once again, as expected, a polynomial regression model is not appropriate for this data.

Classifying Results Using Decision Trees#

Since logistic regression and decision trees are classification methods, they are ill-suited to this particular regression problem. However, there is a chance we can use them (particularly decision trees) to classify which country a competitor is from based on their time and the date of their result. I predict it will not be very successful, though. I’ll only do this for averages.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(averages[['time', 'daysSinceMin']], averages['country'], train_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
clf.score(X_test, y_test)
0.2985714285714286
averages['country'].value_counts().iloc[0]/len(averages)
0.23

Unsurprisingly, the prediction score is quite poor, though it is better than always guessing the most commonly occurring country (the United States). Let’s visualize the decision tree.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(20,10))
_ = plot_tree(clf, feature_names=clf.feature_names_in_, filled=True)
clf.classes_
array(['  Algeria ', '  Argentina ', '  Australia ', '  Austria ',
       '  Belarus ', '  Bolivia ', '  Brazil ', '  Canada ', '  Chile ',
       '  China ', '  Colombia ', '  Denmark ', '  Finland ', '  France ',
       '  Germany ', '  Hong Kong ', '  Hungary ', '  India ',
       '  Indonesia ', '  Iran ', '  Israel ', '  Italy ', '  Japan ',
       '  Malaysia ', '  Mexico ', '  Mongolia ', '  Netherlands ',
       '  New Zealand ', '  Norway ', '  Peru ', '  Philippines ',
       '  Poland ', '  Portugal ', '  Republic of Korea ', '  Russia ',
       '  Singapore ', '  Slovenia ', '  Spain ', '  Sweden ',
       '  Switzerland ', '  Taiwan ', '  Turkey ', '  Ukraine ',
       '  United Kingdom ', '  United States ', '  Vietnam '],
      dtype=object)

Seemingly, the decision tree is only predicting United States and China, which is expected since those two are the most commonly occurring countries. The tree is also only creating decision boundaries perpendicular to the daysSinceMin axis.

As for the discussion about overfitting: I believe the minimum of the U-shaped test error curve will be quite high no matter how we split the data into training and test sets. There is very little visible correlation between time, date, and country, so a decision tree is also of limited use for this classification problem. Perhaps if I had access to information about which hardware (cube) was used to achieve each result, a pattern would emerge based on the popularity of certain products in certain countries, but that data is unfortunately not logged in the WCA database.
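
To make that concrete, here is a quick sketch (not in the original notebook) of the train/test accuracy curves, the inverse view of the error curve, as the tree is allowed to grow, using the X_train/X_test split defined above:

# Vary the tree size and record train/test accuracy for each setting.
scores = []
for leaves in range(2, 40):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=2)
    tree.fit(X_train, y_train)
    scores.append({'leaves': leaves,
                   'train': tree.score(X_train, y_train),
                   'test': tree.score(X_test, y_test)})
score_df = pd.DataFrame(scores)

# Plot both curves; the test curve should flatten or dip while train keeps rising.
alt.Chart(score_df.melt('leaves', var_name='set', value_name='accuracy')).mark_line().encode(
    x='leaves', y='accuracy', color='set'
)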

Generating a Random Set of Times#

Although our previously discussed regression and classification methods are ineffective with this dataset, one plausible assumption is that the solve-time data is approximately normally distributed. We unfortunately do not have access to the full WCA database with every competitor’s fastest single solve time, so we cannot calculate the mean and standard deviation directly.

However, my idea is this: I will see how many total competitors have participated in the 3x3 Rubik’s Cube event in the entire history of the WCA, then generate a random set of mock times following the Gaussian curve. I will play around with the mean and standard deviation parameters until I get data that vaguely matches our real data from the top 1000 results, then extrapolate to hopefully predict the real mean and standard deviation, or at least draw some conclusions about every top single 3x3 result.

First, I’ll define a function which takes two inputs, mean and std, and uses them to generate a random set of 200000 values (I chose 200000 because that is the total number of competitors in the WCA 3x3 event, per their website). I’ll make a dataframe from those randomly generated times, then filter out the results greater than 7.4 or less than 3.4 to more easily visualize the mock data side by side with the real data.

import numpy as np
def random_times(mean, std):
    # Draw 200000 mock "best single" times from a normal distribution
    data_points = np.random.normal(mean, std, 200000)
    df = pd.DataFrame({'time': data_points, 'mock': 'Mock'})
    # Keep only the range covered by the real top-1000 data (about 3.4-7.4 s)
    filtered_df = df[(df['time'] <= 7.4) & (df['time'] >= 3.4)]
    return filtered_df
single_copy = singles.copy()

Since my goal is to create some Altair charts, I’d like to label the mock and real data so they can be distinguished, hence the code below. Then I concatenate the mock data and the real data to get resultdf.

single_copy['mock'] = 'Real'
resultdf = pd.concat([single_copy, random_times(15.1,3.05)], ignore_index=True)

I played around with a few values for mean and std, and landed on 15.1 and 3.05 respectively, which I found was similar enough to the real data.

Finally, to put the two histograms side-by-side to compare them:

alt.Chart(resultdf).mark_bar(color='red').encode(
    alt.X('time', bin=alt.Bin(maxbins=20)),
    alt.Y('count(time)'),
    color='mock:N',
    column='mock:N'
)

The mock data looks quite similar to the real data, other than the fact that, among other things, there are fewer values at the far left for the real data. So, given the similarity of the histograms, I will conclude (without rigor) that the mean of all competitors’ best single solve times in the WCA is about 15.1 seconds, with standard deviation about 3.05 seconds. How does this compare to the actual data?

I decided to search the WCA rankings and found that the median value is approximately 21.8 seconds. This is quite far off from our estimate of 15.1! But this is not a total loss: it suggests the distribution of single solve times is not actually normal but right-skewed, which makes sense logically; there is a very high likelihood of outliers at the right extreme (best solve times that are more than a minute long).
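
Purely to illustrate what “right-skewed” could look like here, a sketch with hand-picked, hypothetical lognormal parameters (chosen only so the median lands near the 21.8-second figure above, not fit to any data):

# Illustrative only: a right-skewed alternative to the symmetric Gaussian mock data.
skewed = np.random.lognormal(mean=np.log(21.8), sigma=0.45, size=200000)  # median ~21.8 s, long right tail
pd.Series(skewed).describe()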

Summary#

First, I did a lot of data cleaning so that I could visualize the data, mainly using Altair. After the visualization, I tried applying some of the regression and classification tools we learned in the latter weeks of class, but they were unsuccessful because there was little visible correlation between the variables I was analyzing. Even though those attempts were not fruitful, I used randomly generated normal data to conclude that the distribution of Rubik’s Cube solve times is right-skewed, which is consistent with the actual data provided by the WCA.

References#

https://www.kaggle.com/code/suharkov/evolution-of-rubik-s-cube-eda: This analysis of the same dataset that I was working with gave me a good headstart of what statistics to look at.

https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/: This is where I learned that I should preprocess the datetime data to work with sklearn.

https://altair-viz.github.io/altair-tutorial/notebooks/03-Binning-and-aggregation.html: This is where I learned to do binning in Altair.

https://www.worldcubeassociation.org/results/rankings/333/single: The WCA, which I referenced numerous times in the project. They provided all the data for this project.