We pull in krangl (similar to pandas) and lets-plot with %use, which is easy and also sets up rich output.
Fuel is 'officially supported' by %use but was causing some problems, so we declare it as a manual dependency. We also need to manually import jsoup and moshi, because Kotlin-Jupyter doesn't support them out of the box.
@file:Repository("https://repo1.maven.org/maven2/")
@file:DependsOn("com.github.kittinunf.fuel:fuel:2.2.3")
@file:DependsOn("com.github.kittinunf.fuel:fuel-coroutines:2.2.3")
@file:DependsOn("org.jsoup:jsoup:1.13.1")
@file:DependsOn("com.squareup.moshi:moshi-kotlin:1.9.3")
import java.io.File
import kotlinx.coroutines.*
import com.github.kittinunf.result.Result
import com.github.kittinunf.fuel.Fuel
import com.github.kittinunf.fuel.core.FuelManager
import com.github.kittinunf.fuel.coroutines.*
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
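import com.squareup.moshi.*
import com.squareup.moshi.kotlin.reflect.KotlinJsonAdapterFactory // the factory used below lives in this subpackage, which the star import does not cover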
%use krangl, lets-plot
Note that normally we could simply annotate each class with @JsonClass(generateAdapter = true) to tell Moshi to generate JSON adapters at compile time.
I don't believe that's possible with Kotlin-Jupyter (happy to be wrong about this), so we will instead ask Moshi for each adapter explicitly.
// @JsonClass(generateAdapter = true)
data class ScoringPlay(
    val quarter : Int,
    val timeString : String,
    val secondsElapsed : Int,
    val team : String,
    val detail : String,
    val awayscore : Int,
    val homescore : Int
)
// @JsonClass(generateAdapter = true)
data class PFRWeek(val season : Int, val weeknumber : Int, val pfrURLs : List<String>)
// @JsonClass(generateAdapter = true)
data class PFRGame(
    val season : Int,
    val week : Int,
    val pfrURL : String,
    val hometeam : String,
    val awayteam : String,
    val homescore : Int,
    val awayscore : Int,
    val scoringplays : List<ScoringPlay>
)
// @JsonClass(generateAdapter = true)
data class TeamRecord(
    val season : Int,
    val teamname : String,
    val url : String,
    val abbr : String,
    val wins : Int,
    val losses : Int,
    val ties : Int,
    val pointsFor : Int,
    val pointsAgainst : Int,
    val pfrOSRS : Float,
    val pfrDSRS : Float
)
data class PFRData(val games : List<PFRGame>, val records : List<TeamRecord>)
// in Kotlin-Jupyter, I don't think we can use Moshi's codegen (@JsonClass) to generate JSON adapters
// easy enough to request each adapter here explicitly; KotlinJsonAdapterFactory builds them via reflection
val moshi : Moshi = Moshi.Builder().add(KotlinJsonAdapterFactory()).build()
val adapterScoringPlay : JsonAdapter<ScoringPlay> = moshi.adapter(ScoringPlay::class.java)
val adapterPFRGame : JsonAdapter<PFRGame> = moshi.adapter(PFRGame::class.java)
val adapterPFRWeek : JsonAdapter<PFRWeek> = moshi.adapter(PFRWeek::class.java)
val adapterTeamRecords : JsonAdapter<TeamRecord> = moshi.adapter(TeamRecord::class.java)
val adapterPFRData : JsonAdapter<PFRData> = moshi.adapter(PFRData::class.java)
We need a list of historical games. But more than that, we are looking for every instance of a similar 20+ point deficit in the middle of the 3rd quarter. So the final score of each game isn't enough; we actually need to scrape every scoring play of every game. Fortunately, pro-football-reference has that data.
Also, we only need to scrape once. A few cells down, we try to load a .json file, and only run the scraping if that .json file doesn't exist.
Weeks are just a conduit to get a list of all the game URLs, so we won't save them.
From the weeks we can get the games, which we will persist.
We also need team records, which we also persist.
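As a rough illustration of the scrape-once idea, here is a minimal sketch using only the standard library. The helper name `loadOrCompute` and the temp file are hypothetical and not part of this notebook; the real cell further down does the same thing with the Moshi adapter and the actual data file.

```kotlin
import java.io.File

// Hypothetical helper: return the cached text if the file exists,
// otherwise run `produce`, write its result to the file, and return it.
fun loadOrCompute(cache: File, produce: () -> String): String {
    if (!cache.exists()) {
        cache.writeText(produce())
    }
    return cache.readText()
}

// Usage with a throwaway temp file standing in for the real .json file
val demoCache = File.createTempFile("pfr_demo", ".json").also { it.delete() }
var scrapeRuns = 0
val firstLoad = loadOrCompute(demoCache) { scrapeRuns++; "{\"games\":[]}" }
val secondLoad = loadOrCompute(demoCache) { scrapeRuns++; "{\"games\":[]}" }
println("scrapeRuns = $scrapeRuns, same content = ${firstLoad == secondLoad}")
demoCache.delete()
```

The expensive step runs only on the first call; the second call reads the file back.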
// scraping NFL weeks from PFR - really we are only interested in the URL to each boxscore
fun getWeeks(seasonRange : IntRange, weekRange : IntRange = IntRange(1,3)) : List<PFRWeek> {
    return seasonRange.fold(mutableListOf<PFRWeek>(), { accumulator , year ->
        println("season: ${year}")
        weekRange.map {w ->
            println("- week: ${w}")
            val (_, _, result) = Fuel.get("https://www.pro-football-reference.com/years/${year}/week_${w}.htm")
                .responseString()
            when (result) {
                // we don't want to try to continue if there's been an error
                is Result.Failure -> throw result.getException()
                is Result.Success -> {
                    val pfrPage = result.get()
                    val doc : Document = Jsoup.parse(pfrPage)
                    val hrefs : List<String> =
                        doc.select(".game_summaries .game_summary .gamelink a")
                            .map {element -> element.attr("href")}
                    accumulator.add(PFRWeek(season = year, weeknumber = w, pfrURLs = hrefs))
                }
            }
        }
        accumulator
    })
}
// scraping from PFR - we are getting the teams, final score, and scoring plays to be able to calculate all in-game point margins
// this is an ASYNC function (using Kotlin coroutines)
// note the `suspend fun`, `.awaitStringResponseResult()` and `coroutineScope`, `async` and `awaitAll`
// this is the only function that takes long enough to be worth async-ing
suspend fun getGames(weeks : List<PFRWeek>) : List<PFRGame> {
    val games = mutableListOf<PFRGame>()
    coroutineScope {
        weeks.forEach { week ->
            week.pfrURLs.map { url ->
                async {
                    println("Game: season = ${week.season}, week = ${week.weeknumber}, url = ${url}")
                    val (_, _, result) = Fuel.get("https://www.pro-football-reference.com${url}")
                        .awaitStringResponseResult()
                    when (result) {
                        // we don't want to try to continue if there's been an error
                        is Result.Failure -> throw result.getException()
                        is Result.Success -> {
                            val pfrPage = result.get()
                            val doc : Document = Jsoup.parse(pfrPage)
                            val scoreboxes = doc.select(".scorebox > div")
                            val scorerows = doc.select("table#scoring tbody tr")
                            var currentQuarter = 1 // PFR only "announces" the quarter once (not on every row) so we need a stateholder
                            val scores = scorerows.map { r ->
                                currentQuarter = r.select("th[data-stat='quarter']").text().let {
                                    when(it.trim()) {
                                        "OT" -> 5
                                        "OT2" -> 6
                                        "" -> currentQuarter // when there's no value, we use the latest value
                                        else -> it.toInt() // when a numerical value is present, (obviously) that's the new value
                                    }
                                }
                                val secondsElapsed : Int = r.select("td[data-stat='time']").text().split(":").let {
                                    (currentQuarter - 1) * 900 +
                                        (14 - it[0].toInt()) * 60 +
                                        (60 - it[1].toInt())
                                }
                                ScoringPlay(
                                    quarter = currentQuarter,
                                    timeString = r.select("td[data-stat='time']").text(),
                                    secondsElapsed = secondsElapsed, // r.select("td[data-stat='time']").text(),
                                    team = r.select("td[data-stat='team']").text(),
                                    detail = r.select("td[data-stat='description']").text(),
                                    awayscore = r.select("td[data-stat='vis_team_score']").text().toInt(),
                                    homescore = r.select("td[data-stat='home_team_score']").text().toInt()
                                )
                            }
                            games.add(PFRGame(
                                season = week.season,
                                week = week.weeknumber,
                                pfrURL = url,
                                hometeam = scoreboxes[0].select("strong a").text(),
                                awayteam = scoreboxes[1].select("strong a").text(),
                                homescore = scoreboxes[0].select(".scores .score").text().toInt(),
                                awayscore = scoreboxes[1].select(".scores .score").text().toInt(),
                                scoringplays = scores
                            ))
                            println("new game added!")
                        }
                    }
                }
            }.awaitAll()
        }
    }
    return games
}
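The clock arithmetic inside getGames is easy to get wrong, so here is a small standalone sketch of it. The function name `secondsElapsed` as a free function is my own for illustration; it assumes 15-minute periods throughout, which matches how the scraper numbers overtime as quarters 5 and 6.

```kotlin
// Hypothetical standalone version of the clock conversion used when scraping:
// PFR shows the time remaining in the quarter, and we want seconds since kickoff.
// Assumes every period (including overtime, numbered 5 and 6) is 15 minutes long.
fun secondsElapsed(quarter: Int, clock: String): Int {
    val (minutes, seconds) = clock.split(":").map { it.trim().toInt() }
    val remainingInQuarter = minutes * 60 + seconds
    return (quarter - 1) * 900 + (900 - remainingInQuarter)
}

println(secondsElapsed(1, "10:23")) // 277
println(secondsElapsed(3, "6:05"))  // 2335
println(secondsElapsed(3, "9:00"))  // 2160
```

The last two values are the boundaries that matter later: 9:00 and 6:00 left in the 3rd quarter correspond to 2160 and 2340 seconds elapsed.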
// scraping final season records from PFR - we want to know the record for the teams with large deficits
fun getTeamRecords(seasonRange : IntRange) : List<TeamRecord> {
    val teamRecords = mutableListOf<TeamRecord>()
    seasonRange.forEach { year ->
        println("season: ${year}")
        val (_, _, result) = Fuel.get("https://www.pro-football-reference.com/years/${year}/").responseString()
        when (result) {
            // we don't want to try to continue if there's been an error
            is Result.Failure -> throw result.getException()
            is Result.Success -> {
                val pfrPage = result.get()
                val doc : Document = Jsoup.parse(pfrPage)
                val recordRows = doc.select(".content_grid tbody tr:not([class*=thead])")
                recordRows.forEach { r ->
                    println(r.select("th a").text())
                    teamRecords.add(TeamRecord(
                        season = year,
                        teamname = r.select("th a").text(),
                        abbr = r.select("th a").attr("href").substringBeforeLast("/").substringAfterLast("/"),
                        url = r.select("th a").attr("href"),
                        wins = r.select("td[data-stat='wins']").text().toInt(),
                        losses = r.select("td[data-stat='losses']").text().toInt(),
                        ties = r.select("td[data-stat='ties']").text().let { if (it.isBlank()) 0 else it.toInt() },
                        pointsFor = r.select("td[data-stat='points']").text().toInt(),
                        pointsAgainst = r.select("td[data-stat='points_opp']").text().toInt(),
                        pfrOSRS = r.select("td[data-stat='srs_offense']").text().toFloat(),
                        pfrDSRS = r.select("td[data-stat='srs_defense']").text().toFloat(),
                    ))
                }
            }
        }
    }
    return teamRecords
}
val dataFile : File = File("e:/pfrdata_async.json")
if (!dataFile.exists()) {
    println("...scraping data from Pro-Football-Reference...")
    val pfrWeeks : List<PFRWeek> = getWeeks(seasonRange = IntRange(2015,2019), weekRange = IntRange(1,21))
    runBlocking {
        val pfrGames : List<PFRGame> = getGames(pfrWeeks) // this is the only async function
        val teamRecords : List<TeamRecord> = getTeamRecords(seasonRange = IntRange(2015,2019))
        val pfrData = PFRData(games = pfrGames , records = teamRecords)
        dataFile.writeText( adapterPFRData.toJson(pfrData) )
    }
} else {
    println("...loading previously-scraped data...")
}
val (rawGames, teamRecords) = adapterPFRData.fromJson(dataFile.readText())!!
...loading previously-scraped data...
rawGames.size // count of all games in data set, including playoffs... should equal (256 + 11 playoffs) * 5 = 1335
1335
teamRecords.size // 5 seasons * 32 teams = 160 season records
160
rawGames[999].scoringplays.forEach { println(it) } // example of a list of scoring plays
ScoringPlay(quarter=1, timeString=10:23, secondsElapsed=277, team=49ers, detail=Robbie Gould 40 yard field goal, awayscore=0, homescore=3)
ScoringPlay(quarter=2, timeString=13:01, secondsElapsed=1019, team=49ers, detail=Robbie Gould 29 yard field goal, awayscore=0, homescore=6)
ScoringPlay(quarter=2, timeString=9:47, secondsElapsed=1213, team=49ers, detail=George Kittle 85 yard pass from Nick Mullens (Robbie Gould kick), awayscore=0, homescore=13)
ScoringPlay(quarter=2, timeString=0:08, secondsElapsed=1792, team=49ers, detail=Dante Pettis 1 yard pass from Nick Mullens (Robbie Gould kick), awayscore=0, homescore=20)
ScoringPlay(quarter=3, timeString=6:05, secondsElapsed=2335, team=Broncos, detail=Phillip Lindsay 3 yard rush (Brandon McManus kick), awayscore=7, homescore=20)
ScoringPlay(quarter=4, timeString=3:53, secondsElapsed=3367, team=Broncos, detail=DaeSean Hamilton 1 yard pass from Case Keenum (Brandon McManus kick), awayscore=14, homescore=20)
class GameAnalysis(val source : PFRGame, val teamRecords : List<TeamRecord>) {
    val isPlayoff : Boolean = source.week >= 18
    // we're going to check all scores within the exact period (9 to 6 min left in 3Q)
    // we're also going to check the last score prior to the exact period, because that flows into the period.
    // example... TD scored at 13:00 left in 3Q to make score 24-0
    // if no more scoring for a while, that means that at 9:00 left in 3Q then the score is 24-0, and this game qualifies.
    val relevantScores : List<ScoringPlay> =
        // all scores within the period
        source.scoringplays.filter { p -> (p.secondsElapsed >= 2160 && p.secondsElapsed <= 2340) }.toMutableList().apply {
            // if there were any scores prior to the period, add the last one to the list
            if (source.scoringplays.filter { p -> p.secondsElapsed <= 2160 }.isNotEmpty())
                this.add(0, source.scoringplays.filter { p -> p.secondsElapsed <= 2160 }.last())
        }
    fun qualifies() : Boolean {
        return relevantScores.filter {p -> abs(p.homescore - p.awayscore) >= 23 && abs(p.homescore - p.awayscore) <= 26}.isNotEmpty()
    }
    // this would fail if a team was trailing at the start of the period but managed to take a 23+ point lead by the end of it
    // that's a minimum 24-point swing in 3 minutes, we will ignore this scenario
    fun leadingTeamWon() : Boolean? {
        return when {
            !qualifies() -> null
            relevantScores.first().homescore > relevantScores.first().awayscore -> source.homescore > source.awayscore
            relevantScores.first().homescore < relevantScores.first().awayscore -> source.homescore < source.awayscore
            else -> throw Exception("leadingTeamWon error")
        }
    }
    fun leader(away : Int, home : Int) : String = when {
        away > home -> "away"
        away < home -> "home"
        away == home -> "tie"
        else -> "uh-oh"
    }
    fun opponent(side : String) : String = when(side) {
        "away" -> "home"
        "home" -> "away"
        "tie" -> "tie"
        else -> "uh-oh"
    }
    fun teamName(side : String) : String = when(side) {
        "away" -> source.awayteam
        "home" -> source.hometeam
        "tie" -> "tie"
        else -> "uh-oh"
    }
    val winner : String = leader(source.awayscore, source.homescore)
    val winningTeam : String = teamName(winner)
    val losingTeam : String = teamName(opponent(winner))
    fun teamRecord(team : String, season : Int) : TeamRecord =
        teamRecords.filter {r -> r.season == season && r.teamname == team }.first()
    val display = if (!qualifies()) "doesn't qualify" else
        relevantScores.filter {p -> abs(p.homescore - p.awayscore) >= 23 && abs(p.homescore - p.awayscore) <= 26}.last().let {
            val scoreText = if (it.homescore > it.awayscore) "${it.homescore}-${it.awayscore}" else "${it.awayscore}-${it.homescore}"
            val finalScoreText = if (source.homescore > source.awayscore) "${source.homescore}-${source.awayscore}" else "${source.awayscore}-${source.homescore}"
            "s${source.season}-w${source.week.toString().padStart(2, '0')} " +
                if (leadingTeamWon() ?: false) "$losingTeam trailed $winningTeam $scoreText and lost $finalScoreText" else
                    "$winningTeam trailed $losingTeam $scoreText but came back to win $finalScoreText"
        }
    // val displayx = largestFirstHalfMargin?.let { fhm -> "s${source.season}-w${source.week.toString().padStart(2, '0')} " +
    //     "${fhm.trailingTeam} (${opponent(fhm.leadingSide)}) trailed by " +
    //     "${fhm.points} to ${fhm.leadingTeam} " +
    //     "and ${if (winner == fhm.leadingSide) "lost" else if (winner == "tie") "tied" else "won"} :: " +
    //     "final record: ${teamRecord(fhm.trailingTeam, source.season).wins} wins"
    // } ?: "no first-half scoring"
}
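The reason the class also looks at the last score before the window is that the score at any moment is simply whatever the most recent scoring play left it at. Here is a minimal sketch of that idea in isolation. `Play` and `scoreAt` are hypothetical names for illustration, not part of the class above, and the sketch assumes the plays are sorted by time (as they are when scraped in page order).

```kotlin
// Simplified stand-in for ScoringPlay, keeping only what this sketch needs
data class Play(val secondsElapsed: Int, val awayscore: Int, val homescore: Int)

// Score in effect at time t: the last scoring play at or before t, or 0-0 if none.
// Assumes `plays` is sorted by secondsElapsed.
fun scoreAt(plays: List<Play>, t: Int): Pair<Int, Int> =
    plays.lastOrNull { it.secondsElapsed <= t }
        ?.let { it.awayscore to it.homescore }
        ?: (0 to 0)

// A touchdown with 13:00 left in the 3rd quarter (1800 + 120 = 1920 s) makes it 0-24
val samplePlays = listOf(Play(600, 0, 7), Play(1500, 0, 17), Play(1920, 0, 24), Play(3000, 7, 24))
println(scoreAt(samplePlays, 2160)) // (0, 24): a 24-point margin at 9:00 left in the 3rd quarter
```

No play falls inside the 2160 to 2340 window in this example, yet the game still qualifies because the earlier score carries into it.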
val qualifiedGames : List<GameAnalysis> = rawGames.map { g -> GameAnalysis(g, teamRecords)}.filter {g -> g.qualifies()}
qualifiedGames.size
78
qualifiedGames.filter { g -> g.leadingTeamWon() ?: false }.size
77
qualifiedGames.filter { g -> !(g.leadingTeamWon() ?: false) }.first().display
s2016-w21 New England Patriots trailed Atlanta Falcons 28-3 but came back to win 34-28
val trailerWinTotals = qualifiedGames.filter { g -> g.leadingTeamWon() ?: false }.map {g -> g.teamRecord(g.losingTeam, g.source.season).wins}
trailerWinTotals
[8, 7, 5, 11, 9, 7, 3, 8, 5, 6, 7, 3, 10, 6, 10, 2, 1, 7, 3, 10, 7, 6, 5, 8, 4, 12, 10, 4, 5, 9, 5, 0, 9, 8, 0, 5, 8, 6, 3, 6, 6, 5, 7, 7, 4, 13, 4, 7, 5, 5, 6, 3, 4, 6, 7, 9, 3, 13, 4, 4, 8, 7, 3, 6, 8, 5, 7, 7, 2, 5, 10, 6, 13, 7, 6, 7, 9]
trailerWinTotals.average()
6.311688311688312
val p = lets_plot() { x = trailerWinTotals } + ggsize(640, 240)
p + geom_bar(stat=Stat.count()) +
xlab("season total wins") + ylab("qualifying games") +
xlim(IntRange(0,16)) + ggtitle("distribution of total season wins by large-deficit teams")
// note Stat.count() is the default for bar charts (geom_bar) so we can leave it out
qualifiedGames.filter {g -> g.teamRecord(g.losingTeam, g.source.season).wins >= 10}.forEach {g -> println(g.display)}
s2015-w03 Kansas City Chiefs trailed Green Bay Packers 31-7 and lost 38-28
s2015-w16 Green Bay Packers trailed Arizona Cardinals 31-8 and lost 38-8
s2015-w19 Seattle Seahawks trailed Carolina Panthers 31-7 and lost 31-24
s2016-w13 Miami Dolphins trailed Baltimore Ravens 24-0 and lost 38-6
s2016-w17 Oakland Raiders trailed Denver Broncos 24-0 and lost 24-6
s2016-w20 Green Bay Packers trailed Atlanta Falcons 31-7 and lost 44-21
s2016-w21 New England Patriots trailed Atlanta Falcons 28-3 but came back to win 34-28
s2017-w20 Minnesota Vikings trailed Philadelphia Eagles 31-7 and lost 38-7
s2018-w17 New Orleans Saints trailed Carolina Panthers 23-0 and lost 33-14
s2019-w11 Houston Texans trailed Baltimore Ravens 24-0 and lost 41-7
s2019-w12 Green Bay Packers trailed San Francisco 49ers 23-0 and lost 37-8
val trailerWinTotals = qualifiedGames.filter { g -> !g.isPlayoff && g.leadingTeamWon() ?: false }.map {g -> g.teamRecord(g.losingTeam, g.source.season).wins}
trailerWinTotals.average()
6.121621621621622
We'd use data frames in Python (pandas) or R (dplyr). Kotlin's krangl is less mature and, as stated earlier, our data isn't exactly tabular. Because all our data is in defined classes, we can use those for plots instead.
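For example, a plain list of win totals can be turned into plot-ready counts with the standard library alone. This is only a sketch: `sampleWins` is a made-up short list and `winHistogram` is a hypothetical helper, an alternative to the fold used in the cells that follow.

```kotlin
// Hypothetical, shortened list of season win totals (the real ones come from teamRecords)
val sampleWins = listOf(8, 7, 5, 7, 3, 8, 7)

// Count occurrences of each total, then fill in every key 0..16 so the x-axis is complete
fun winHistogram(wins: List<Int>): Map<Int, Int> {
    val counts = wins.groupingBy { it }.eachCount()
    return (0..16).associateWith { counts[it] ?: 0 }
}

val histogram = winHistogram(sampleWins)
println(histogram[7])  // 3
println(histogram[16]) // 0
// histogram.keys and histogram.values can then serve as the x and y series of a bar chart
```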
// key is the number of wins (0-16), value is the number of times that was the final win total
// 5 seasons * 32 teams = 160 season win totals
val seasonWinTotals : Map<Int, Int> = IntRange(0,16).fold(mutableMapOf<Int, Int>(), { acc, i ->
    acc[i] = teamRecords.filter {tr -> tr.wins == i}.size
    acc
})
seasonWinTotals
{0=1, 1=1, 2=2, 3=8, 4=9, 5=16, 6=16, 7=23, 8=12, 9=20, 10=17, 11=11, 12=10, 13=11, 14=2, 15=1, 16=0}
val p = lets_plot() { x = seasonWinTotals.keys } + ggsize(640, 240)
p + geom_bar(stat=Stat.identity) { y=seasonWinTotals.values } +
xlab("season total wins") + ylab("seasons") +
xlim(IntRange(0,16)) + ggtitle("distribution of total season wins by all teams, 2015-2019")
// same data as the chart above, but we will need it in Map form next
val trailerWinCounts : Map<Int, Int> = IntRange(0,16).fold(mutableMapOf<Int, Int>(), { acc, i ->
    acc[i] = trailerWinTotals.filter {twt -> twt == i}.size
    acc
})
trailerWinCounts
{0=2, 1=1, 2=2, 3=7, 4=7, 5=11, 6=11, 7=14, 8=7, 9=5, 10=3, 11=1, 12=1, 13=2, 14=0, 15=0, 16=0}
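val trailerProbabilites : Map<Int, Double> = IntRange(0,16).fold(mutableMapOf<Int, Double>(), { acc, i ->
    val seasons = seasonWinTotals[i]!! // we won't have nulls because both maps have same keys 0-16
    // no team finished with 16 wins, so guard against 0 / 0.0 = NaN and use 0.0 instead
    acc[i] = if (seasons == 0) 0.0 else trailerWinCounts[i]!!.div(16.0 * seasons)
    acc
})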
val p = lets_plot() { x = trailerProbabilites.keys } + ggsize(640, 240)
p + geom_bar(stat=Stat.identity) { y = trailerProbabilites.values } +
xlab("season total wins") + ylab("P(qualifying)") +
xlim(IntRange(0,16)) + ggtitle("Probability of being a large-deficit team")
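Spelled out, the plotted quantity is a per-game rate. Writing q_w for the number of qualifying regular-season losses by teams that finished with w wins (trailerWinCounts) and n_w for the number of team-seasons with w wins (seasonWinTotals), and assuming 16 regular-season games per team-season, two worked values from the counts printed above are:

```latex
P(w) = \frac{q_w}{16 \, n_w}, \qquad
P(7) = \frac{14}{16 \cdot 23} = \frac{14}{368} \approx 0.038, \qquad
P(0) = \frac{2}{16 \cdot 1} = 0.125
```

For w = 16 there were no team-seasons in 2015-2019 (n_16 = 0), so the ratio is undefined for that bar.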