dragracer
Packagelibrary(dragracer)
#> If you can't love these data, how in the hell are you gonna love yourself? Can I get an amen?
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)The dragracer package has three data sets. The first is
episode-level data (rpdr_ep). These data contain some more
granular information about each episode that may not be discernible from
how episodes are typically summarized on Wikipedia (e.g. mini-challenge
winners, runway themes [where applicable], lip-sync song and artist).
The second data set is contestant-level (rpdr_contestants).
This data frame includes the contestant name, hometown, and purported
date of birth and age by the start of the show. The third data set is
episode-contestant-level data (rpdr_contep). This is the
most familiar form of the data that a reader of the show’s Wikipedia
entries could discern. They include information about how a contestant
fared in a particular episode (i.e. whether they won, scored high, were
safe, scored low, or were in the bottom). The show’s fans are accustomed
to seeing this form of the data as akin to a pyramid. However, I convert
the data from wide to long, making the data akin to a survival
data-generating process.
Here are some potential uses of the data.
A user can learn about how to summarize data. Here, we can get the
average age of the contestants by season from the
rpdr_contestants data.
rpdr_contestants %>%
group_by(season) %>%
summarize(mean_age = mean(age))
#> # A tibble: 14 × 2
#> season mean_age
#> <chr> <dbl>
#> 1 S01 31
#> 2 S02 27.6
#> 3 S03 28.2
#> 4 S04 29.2
#> 5 S05 28
#> 6 S06 29.3
#> 7 S07 30.2
#> 8 S08 29.8
#> 9 S09 30.4
#> 10 S10 28.3
#> 11 S11 29.3
#> 12 S12 28.4
#> 13 S13 29.6
#> 14 S14 26.4A user can also see which musical artists have appeared most for lip-syncs. The answer here is, unsurprisingly, RuPaul.
rpdr_ep %>%
group_by(lipsyncartist) %>%
summarize(n = n()) %>%
na.omit %>%
arrange(-n) %>% head(10)
#> # A tibble: 10 × 2
#> lipsyncartist n
#> <chr> <int>
#> 1 RuPaul 11
#> 2 Britney Spears 6
#> 3 Madonna 5
#> 4 Aretha Franklin 4
#> 5 Lady Gaga 4
#> 6 Whitney Houston 4
#> 7 Ariana Grande 3
#> 8 Cher 3
#> 9 Donna Summer 3
#> 10 Blondie 2A user can also see how Jinkx Monsoon, the GOAT, fared in all her episodes.
rpdr_contep %>%
filter(contestant == "Jinkx Monsoon") %>%
select(season, contestant, episode, outcome, finale)
#> # A tibble: 12 × 5
#> season contestant episode outcome finale
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 S05 Jinkx Monsoon 1 SAFE 0
#> 2 S05 Jinkx Monsoon 2 HIGH 0
#> 3 S05 Jinkx Monsoon 3 HIGH 0
#> 4 S05 Jinkx Monsoon 4 HIGH 0
#> 5 S05 Jinkx Monsoon 5 WIN 0
#> 6 S05 Jinkx Monsoon 6 HIGH 0
#> 7 S05 Jinkx Monsoon 7 HIGH 0
#> 8 S05 Jinkx Monsoon 8 HIGH 0
#> 9 S05 Jinkx Monsoon 9 WIN 0
#> 10 S05 Jinkx Monsoon 10 HIGH 0
#> 11 S05 Jinkx Monsoon 11 BTM 0
#> 12 S05 Jinkx Monsoon 14 WIN 1Previous versions of the data included all sorts of information at the contestant-level. For release, I decided to strip that information from the data in order to allow the user to learn how to do this. For example, if you were interested in summarizing how each contestant did in their particular season on various metrics, here’s how you might do that.
First, let’s merge in the mini-challenge data. Mini-challenges are irregular; not every episode has them. Indeed, Season 12 had very few of them. So, they get special treatment in the episode-level data.
rpdr_ep %>%
select(season, minicw1:minicw3) %>%
group_by(season) %>%
gather(Category, contestant, minicw1:minicw3) %>%
na.omit %>%
group_by(season, contestant) %>%
summarize(minicwins = n()) %>%
left_join(rpdr_contestants, .) %>%
mutate(minicwins = ifelse(is.na(minicwins), 0, minicwins)) -> D
#> `summarise()` has regrouped the output.
#> Joining with `by = join_by(season, contestant)`
#> ℹ Summaries were computed grouped by season and contestant.
#> ℹ Output is grouped by season.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(season, contestant))` for per-operation grouping
#> (`?dplyr::dplyr_by`) instead.Now, let’s merge in data from the episode-contestant-level about how each contestant fared, excluding finales and specials. We’ll calculate all sorts of things here, including estimated “points per episode” and “Dusted or Busted” scores.
rpdr_contep %>%
filter(participant == 1 & finale == 0 & penultimate == 0) %>%
mutate(high = ifelse(outcome == "HIGH", 1, 0),
win = ifelse(outcome == "WIN", 1, 0),
low = ifelse(outcome == "LOW", 1, 0),
safe = ifelse(outcome == "SAFE", 1, 0),
highsafe = ifelse(outcome %in% c("HIGH", "SAFE"), 1, 0),
winhigh = ifelse(outcome %in% c("HIGH", "WIN"), 1, 0),
btm = ifelse(outcome == "BTM", 1, 0),
lowbtm = ifelse(outcome %in% c("BTM", "LOW"), 1, 0)) %>%
group_by(season,contestant,rank) %>%
mutate(numcontests = n()) %>%
group_by(season,contestant, numcontests, rank) %>%
summarize(perc_high = sum(high)/unique(numcontests),
perc_win = sum(win)/unique(numcontests),
perc_winhigh = sum(winhigh)/unique(numcontests),
perc_low = sum(low)/unique(numcontests),
perc_btm = sum(btm)/unique(numcontests),
perc_lowbtm = sum(lowbtm)/unique(numcontests),
num_high = sum(high),
num_win = sum(win),
num_winhigh = sum(winhigh),
num_btm = sum(btm),
num_low = sum(low),
num_lowbtm = sum(lowbtm),
db_score = 2*sum(win, na.rm=T) +
1*sum(high, na.rm=T) +
(sum(safe, na.rm=T)*0) +
(sum(low, na.rm=T)*-1) + (sum(btm, na.rm=T)*-2)) %>%
ungroup() %>%
mutate(points = (2*num_win + num_high - num_low + (-2)*num_btm),
ppe = points/numcontests) %>%
full_join(D, .) -> D
#> `summarise()` has regrouped the output.
#> Joining with `by = join_by(season, contestant)`
#> ℹ Summaries were computed grouped by season, contestant, numcontests, and rank.
#> ℹ Output is grouped by season, contestant, and numcontests.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(season, contestant, numcontests, rank))` for
#> per-operation grouping (`?dplyr::dplyr_by`) instead.How, let’s look at who had the highest “Dusted or Busted” score across all seasons.
D %>%
arrange(-db_score) %>%
head(10) %>%
select(season, contestant, rank, db_score)
#> # A tibble: 10 × 4
#> season contestant rank db_score
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 10
#> 2 S05 Jinkx Monsoon 1 9
#> 3 S09 Shea Couleé 3 9
#> 4 S13 Gottmik 3 9
#> 5 S13 Rosé 3 9
#> 6 S09 Sasha Velour 1 8
#> 7 S13 Symone 1 8
#> 8 S02 Tyra Sanchez 1 7
#> 9 S03 Raja 1 7
#> 10 S03 Manila Luzon 2 7Let’s also see who has the highest “points per episode” score.
D %>%
arrange(-ppe) %>%
head(10) %>%
select(season, contestant, rank, ppe)
#> # A tibble: 10 × 4
#> season contestant rank ppe
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 1
#> 2 S05 Jinkx Monsoon 1 0.818
#> 3 S09 Shea Couleé 3 0.818
#> 4 S01 Ongina 5 0.8
#> 5 S02 Tyra Sanchez 1 0.778
#> 6 S09 Sasha Velour 1 0.727
#> 7 S01 Nina Flowers 2 0.667
#> 8 S13 Gottmik 3 0.643
#> 9 S13 Rosé 3 0.643
#> 10 S04 Sharon Needles 1 0.636Feel free to use the data for your own ends or learn R from it.