Kotlin/JVM package dependencies + imports

We pull in krangl (similar to pandas) and lets-plot with %use, it's easy and it sets up rich output

Fuel is 'officially supported' but was causing some problems. Anyway we need to manually import jsoup and moshi, they aren't supported by Kotlin-Jupyter.

In [1]:
@file:Repository("https://repo1.maven.org/maven2/")
@file:DependsOn("com.github.kittinunf.fuel:fuel:2.2.3")
@file:DependsOn("com.github.kittinunf.fuel:fuel-coroutines:2.2.3")
@file:DependsOn("org.jsoup:jsoup:1.13.1")
@file:DependsOn("com.squareup.moshi:moshi-kotlin:1.9.3")
@file:DependsOn("de.mpicbg.scicomp:krangl:0.13")
In [2]:
import java.io.File
import kotlinx.coroutines.*
import com.github.kittinunf.result.Result
import com.github.kittinunf.fuel.Fuel
import com.github.kittinunf.fuel.core.FuelManager
import com.github.kittinunf.fuel.coroutines.*
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import com.squareup.moshi.*
In [3]:
%use krangl, lets-plot

define our classes

Note that normally we could simply annotate each class with @JsonClass to tell Moshi to auto-build Json adapters. I don't believe that's possible with Kotlin-Jupyter (happy to be wrong about this) so we will create adapters by hand

In [4]:
// @JsonClass(generateAdapter = true)
data class ScoringPlay(
    val quarter : Int,
    val timeString : String,
    val secondsElapsed : Int,
    val team : String,
    val detail : String,
    val awayscore : Int,
    val homescore : Int
)

// @JsonClass(generateAdapter = true)
data class PFRWeek(val season : Int, val weeknumber : Int, val pfrURLs : List<String>)

// @JsonClass(generateAdapter = true)
data class PFRGame(
    val season : Int,
    val week : Int,
    val pfrURL : String,
    val hometeam : String, 
    val awayteam : String, 
    val homescore : Int,
    val awayscore : Int,
    val scoringplays : List<ScoringPlay>
)

// @JsonClass(generateAdapter = true)
data class TeamRecord(
    val season : Int,
    val teamname : String,
    val url : String,
    val abbr : String, 
    val wins : Int,
    val losses : Int,
    val ties : Int,
    val pointsFor : Int,
    val pointsAgainst : Int,
    val pfrOSRS : Float,
    val pfrDSRS : Float
)

data class PFRData(val games : List<PFRGame>, val records : List<TeamRecord>)
In [5]:
// in Kotlin-Jupyter, I don't think we can use codegen or reflection to auto-generate json adapters
// easy enough to do it here manually
val moshi : Moshi = Moshi.Builder().add(KotlinJsonAdapterFactory()).build()
val adapterScoringPlay : JsonAdapter<ScoringPlay> = moshi.adapter(ScoringPlay::class.java)
val adapterPFRGame : JsonAdapter<PFRGame> = moshi.adapter(PFRGame::class.java)
val adapterPFRWeek : JsonAdapter<PFRWeek> = moshi.adapter(PFRWeek::class.java)
val adapterTeamRecords : JsonAdapter<TeamRecord> = moshi.adapter(TeamRecord::class.java)
val adapterPFRData : JsonAdapter<PFRData> = moshi.adapter(PFRData::class.java)

define scraping functions

weeks is just a conduit to get a list of all the game URLS -- we won't save it

from weeks, we can get games, which we will persist

we also need team records, which we also persist

In [6]:
// scraping NFL weeks from PFR - really we are only interested in the URL to each boxscore
fun getWeeks(seasonRange : IntRange, weekRange : IntRange = IntRange(1,3)) : List<PFRWeek> {
    
    return seasonRange.fold(mutableListOf<PFRWeek>(), { accumulator , year ->
    
        println("season: ${year}")
        weekRange.map {w ->

            println("- week: ${w}")
            val (_, _, result) = Fuel.get("https://www.pro-football-reference.com/years/${year}/week_${w}.htm")
                .responseString()

            when (result) {
                // we don't want to try to continue if there's been an error
                is Result.Failure -> throw result.getException()  
                is Result.Success -> {
                    val pfrPage = result.get()
                    val doc : Document = Jsoup.parse(pfrPage)
                    val hrefs : List<String> = 
                        doc.select(".game_summaries .game_summary .gamelink a")
                            .map {element -> element.attr("href")}
                    accumulator.add(PFRWeek(season = year, weeknumber = w, pfrURLs = hrefs))
                }
            }
        }
        accumulator
    })
}
In [7]:
// scraping from PFR - we are getting the teams, final score, and scoring plays to be able to calculate all in-game point margins

// this is an ASYNC function (using Kotlin coroutines)
// note the `suspend fun`, `.awaitStringResponseResult()` and `coroutineScope`, `async` and `awaitAll`

// this is the only function that takes long enough to be worth async-ing

suspend fun getGames(weeks : List<PFRWeek>) : List<PFRGame> {
    
    val games = mutableListOf<PFRGame>()
    
    coroutineScope {
        
        weeks.forEach { week -> 

            week.pfrURLs.map { url ->
                async {
                    println("Game: season = ${week.season}, week = ${week.weeknumber}, url = ${url}")
                    val (_, _, result) = Fuel.get("https://www.pro-football-reference.com${url}")
                        .awaitStringResponseResult()
                    
                    when (result) {
                        // we don't want to try to continue if there's been an error
                        is Result.Failure -> throw result.getException()  
                        is Result.Success -> {
                            val pfrPage = result.get()
                            val doc : Document = Jsoup.parse(pfrPage)
                            val scoreboxes = doc.select(".scorebox > div")
                            val scorerows = doc.select("table#scoring tbody tr")
                            var currentQuarter = 1 // PFR only "announces" the quarter once (not on every row) so we need a stateholder
                            val scores = scorerows.map { r -> 
                                currentQuarter = r.select("th[data-stat='quarter']").text().let {
                                    when(it.trim()) {
                                        "OT" -> 5
                                        "OT2" -> 6
                                        "" -> currentQuarter // when there's no value, we use the latest value
                                        else -> it.toInt() // when a numerical value is present, (obviously) that's the new value
                                    }
                                }
                                val secondsElapsed : Int = r.select("td[data-stat='time']").text().split(":").let {
                                    (currentQuarter - 1) * 900 + 
                                        (14 - it[0].toInt()) * 60 + 
                                            (60 - it[1].toInt())
                                }
                                ScoringPlay(
                                    quarter = currentQuarter,
                                    timeString = r.select("td[data-stat='time']").text(),
                                    secondsElapsed = secondsElapsed, // r.select("td[data-stat='time']").text(),
                                    team = r.select("td[data-stat='team']").text(),
                                    detail = r.select("td[data-stat='description']").text(),
                                    awayscore = r.select("td[data-stat='vis_team_score']").text().toInt(),
                                    homescore = r.select("td[data-stat='home_team_score']").text().toInt()
                                ) 
                            }
                            games.add(PFRGame(
                                        season = week.season,
                                        week = week.weeknumber,
                                        pfrURL = url,
                                        hometeam = scoreboxes[0].select("strong a").text(),
                                        awayteam = scoreboxes[1].select("strong a").text(),
                                        homescore = scoreboxes[0].select(".scores .score").text().toInt(),
                                        awayscore = scoreboxes[1].select(".scores .score").text().toInt(),
                                        scoringplays = scores
                                    )
                            )
                            println("new game added!")
                        }
                    }
                }
            }.awaitAll()
        }
    
    }
    return games
}
In [8]:
// scraping final season records from PFR - we want to know the record for the teams with large deficits

fun getTeamRecords(seasonRange : IntRange) : List<TeamRecord> {
    
    val teamRecords = mutableListOf<TeamRecord>()
    
    seasonRange.forEach { year ->
    
        println("season: ${year}")
        val (_, _, result) = Fuel.get("https://www.pro-football-reference.com/years/${year}/").responseString()

        when (result) {
            // we don't want to try to continue if there's been an error
            is Result.Failure -> throw result.getException()  
            is Result.Success -> {
                val pfrPage = result.get()
                val doc : Document = Jsoup.parse(pfrPage)
                val recordRows = doc.select(".content_grid tbody tr:not([class*=thead])")
                recordRows.forEach { r -> 
                    println(r.select("th a").text())
                    teamRecords.add(TeamRecord(
                        season = year,
                        teamname = r.select("th a").text(),
                        abbr = r.select("th a").attr("href").substringBeforeLast("/").substringAfterLast("/"), 
                        url = r.select("th a").attr("href"),
                        wins = r.select("td[data-stat='wins']").text().toInt(),
                        losses = r.select("td[data-stat='losses']").text().toInt(),
                        ties = r.select("td[data-stat='ties']").text().let { if (it.isBlank()) 0 else it.toInt() },
                        pointsFor = r.select("td[data-stat='points']").text().toInt(),
                        pointsAgainst = r.select("td[data-stat='points_opp']").text().toInt(),
                        pfrOSRS = r.select("td[data-stat='srs_offense']").text().toFloat(),
                        pfrDSRS = r.select("td[data-stat='srs_defense']").text().toFloat(),
                    )) 
                }
            }
        }
    }
    
    return teamRecords
}

pull in data... load Json file if it exists, or perform scrape

In [9]:
val dataFile : File = File("e:/pfrdata_async.json")
if (!dataFile.exists()) {
    println("...scraping data from Pro-Football-Reference...")
    val pfrWeeks : List<PFRWeek> = getWeeks(seasonRange = IntRange(2015,2019), weekRange = IntRange(1,21))    
    runBlocking {
        val pfrGames : List<PFRGame> = getGames(pfrWeeks) // this is the only async function
        val teamRecords : List<TeamRecord> = getTeamRecords(seasonRange = IntRange(2015,2019))
        val pfrData = PFRData(games = pfrGames , records = teamRecords)
        dataFile.writeText( adapterPFRData.toJson(pfrData) )        
    }
} else {
    println("...loading previously-scraped data...")
}

val (rawGames, teamRecords) = adapterPFRData.fromJson(dataFile.readText())!!
...loading previously-scraped data...

some basic understanding of the data

In [10]:
rawGames.size  // count of all games in data set, including playoffs
Out[10]:
1335
In [11]:
teamRecords.size  // 5 seasons * 32 teams = 160 season records
Out[11]:
160
In [22]:
rawGames[2].scoringplays  // example of a list of scoring plays
Out[22]:
[ScoringPlay(quarter=1, timeString=9:28, secondsElapsed=332, team=Bears, detail=Robbie Gould 28 yard field goal, awayscore=0, homescore=3), ScoringPlay(quarter=1, timeString=0:43, secondsElapsed=857, team=Packers, detail=James Jones 13 yard pass from Aaron Rodgers (Mason Crosby kick), awayscore=7, homescore=3), ScoringPlay(quarter=2, timeString=7:49, secondsElapsed=1331, team=Bears, detail=Matt Forte 1 yard rush (Robbie Gould kick), awayscore=7, homescore=10), ScoringPlay(quarter=2, timeString=2:32, secondsElapsed=1648, team=Packers, detail=Mason Crosby 37 yard field goal, awayscore=10, homescore=10), ScoringPlay(quarter=2, timeString=0:08, secondsElapsed=1792, team=Bears, detail=Robbie Gould 50 yard field goal, awayscore=10, homescore=13), ScoringPlay(quarter=3, timeString=11:56, secondsElapsed=1984, team=Packers, detail=James Jones 1 yard pass from Aaron Rodgers (Mason Crosby kick), awayscore=17, homescore=13), ScoringPlay(quarter=3, timeString=4:57, secondsElapsed=2403, team=Bears, detail=Robbie Gould 44 yard field goal, awayscore=17, homescore=16), ScoringPlay(quarter=4, timeString=10:26, secondsElapsed=2974, team=Packers, detail=Randall Cobb 5 yard pass from Aaron Rodgers (Mason Crosby kick), awayscore=24, homescore=16), ScoringPlay(quarter=4, timeString=1:55, secondsElapsed=3485, team=Packers, detail=Eddie Lacy 2 yard rush (Mason Crosby kick), awayscore=31, homescore=16), ScoringPlay(quarter=4, timeString=0:34, secondsElapsed=3566, team=Bears, detail=Martellus Bennett 24 yard pass from Jay Cutler (Robbie Gould kick), awayscore=31, homescore=23)]

write a class with a function to find games that qualify... 23-26 pt lead at 9 to 6 min left in 3rd Qtr

In [116]:
class GameAnalysis(val source : PFRGame, val teamRecords : List<TeamRecord>) {
    
    val isPlayoff : Boolean = source.week >= 18
    
    // we're going to check all scores within the exact period (9 to 6 min left in 3Q)
    // we're also going to check the last score prior to the exact period, because that flows into the period.
    // example... TD scored at 13:00 left in 3Q to make score 24-0
    //   if no more scoring for a while, that means that at 9:00 left in 3Q then the score is 24-0, and this game qualifies.
    val relevantScores : List<ScoringPlay> = 
        // all scores within the period
        source.scoringplays.filter { p -> (p.secondsElapsed >= 2160 && p.secondsElapsed <= 2340) }.toMutableList().apply {
            // if there were any scores prior to the period, add the last one to the list
            if (source.scoringplays.filter { p -> p.secondsElapsed <= 2160 }.isNotEmpty()) 
                this.add(0, source.scoringplays.filter { p -> p.secondsElapsed <= 2160 }.last())
        }
    
    fun qualifies() : Boolean {
        return relevantScores.filter {p -> abs(p.homescore - p.awayscore) >= 23 && abs(p.homescore - p.awayscore) <= 26}.isNotEmpty()
    }
    
    // this would fail if a team was trailing at the start of the period but managed to take a 23+ point lead by the end of it
    // that's a minimum 24-point swing in 3 minutes, we will ignore this scenario
    fun leadingTeamWon() : Boolean? {
        return when {
            !qualifies() -> null
            relevantScores.first().homescore > relevantScores.first().awayscore -> source.homescore > source.awayscore
            relevantScores.first().homescore < relevantScores.first().awayscore -> source.homescore < source.awayscore
            else -> throw Exception("leadingTeamWon error")
        }
    }
    
    
    fun leader(away : Int, home : Int) : String = when {
        away > home -> "away" 
        away < home -> "home" 
        away == home -> "tie" 
        else -> "uh-oh"
    }
    
    fun opponent(side : String) : String = when(side) {
        "away" -> "home"
        "home" -> "away"
        "tie" -> "tie"
        else -> "uh-oh"
    }
    
    fun teamName(side : String) : String = when(side) {
        "away" -> source.awayteam
        "home" -> source.hometeam
        "tie" -> "tie"
        else -> "uh-oh"
    }
    
    val winner : String = leader(source.awayscore, source.homescore)
    val winningTeam : String = teamName(winner)
    val losingTeam : String = teamName(opponent(winner))
    
    fun teamRecord(team : String, season : Int) : TeamRecord = 
        teamRecords.filter {r -> r.season == season && r.teamname == team }.first()
    
    val display = if (!qualifies()) "doesn't qualify" else
        relevantScores.filter {p -> abs(p.homescore - p.awayscore) >= 23 && abs(p.homescore - p.awayscore) <= 26}.last().let {
            val scoreText = if (it.homescore > it.awayscore) "${it.homescore}-${it.awayscore}" else "${it.awayscore}-${it.homescore}"
            val finalScoreText = if (source.homescore > source.awayscore) "${source.homescore}-${source.awayscore}" else "${source.awayscore}-${source.homescore}"
            "s${source.season}-w${source.week.toString().padStart(2, '0')} " +
            if (leadingTeamWon() ?: false) "$losingTeam trailed $winningTeam $scoreText and lost $finalScoreText" else
                "$winningTeam trailed $losingTeam $scoreText but came back to win $finalScoreText"
                
        }
    
//     val displayx = largestFirstHalfMargin?.let { fhm -> "s${source.season}-w${source.week.toString().padStart(2, '0')} " +
//         "${fhm.trailingTeam} (${opponent(fhm.leadingSide)}) trailed by " +
//         "${fhm.points} to ${fhm.leadingTeam} " + 
//         "and ${if (winner == fhm.leadingSide) "lost" else if (winner == "tie") "tied" else "won"} :: " +
//         "final record: ${teamRecord(fhm.trailingTeam, source.season).wins} wins"
//     } ?: "no first-half scoring"
     
}

count the number of qualifying games

In [117]:
val qualifiedGames : List<GameAnalysis> = rawGames.map { g -> GameAnalysis(g, teamRecords)}.filter {g -> g.qualifies()}
qualifiedGames.size
Out[117]:
78

how many games did the leading team win?

In [118]:
qualifiedGames.filter { g -> g.leadingTeamWon() ?: false }.size
Out[118]:
77

only one game when the leading team did not win, what game is it? probably that Super Bowl...

In [119]:
qualifiedGames.filter { g -> !(g.leadingTeamWon() ?: false) }.first().display
Out[119]:
s2016-w21 New England Patriots trailed Atlanta Falcons 28-3 but came back to win 34-28
In [120]:
val trailerWinTotals = qualifiedGames.filter { g -> g.leadingTeamWon() ?: false }.map {g -> g.teamRecord(g.losingTeam, g.source.season).wins}
trailerWinTotals
Out[120]:
[8, 7, 5, 11, 9, 7, 3, 8, 5, 6, 7, 3, 10, 6, 10, 2, 1, 7, 3, 10, 7, 6, 5, 8, 4, 12, 10, 4, 5, 9, 5, 0, 9, 8, 0, 5, 8, 6, 3, 6, 6, 5, 7, 7, 4, 13, 4, 7, 5, 5, 6, 3, 4, 6, 7, 9, 3, 13, 4, 4, 8, 7, 3, 6, 8, 5, 7, 7, 2, 5, 10, 6, 13, 7, 6, 7, 9]
In [121]:
trailerWinTotals.average()
Out[121]:
6.311688311688312
In [122]:
val p = lets_plot() { x = trailerWinTotals } + ggsize(640, 240)

p + geom_bar(stat=Stat.count()) +
    xlab("season total wins") + ylab("qualifying games") + 
    xlim(IntRange(0,16)) + ggtitle("distribution of total season wins by large-deficit teams")
    
// note Stat.count() is the default for bar charts (geom_bar) so we can leave it out
Out[122]:

interesting, in 10 games the trailing team finished the season with 10+ wins, let's take a closer look at those

In [135]:
qualifiedGames.filter {g -> g.teamRecord(g.losingTeam, g.source.season).wins >= 10}.forEach {g -> println(g.display)}
s2015-w03 Kansas City Chiefs trailed Green Bay Packers 31-7 and lost 38-28
s2015-w16 Green Bay Packers trailed Arizona Cardinals 31-8 and lost 38-8
s2015-w19 Seattle Seahawks trailed Carolina Panthers 31-7 and lost 31-24
s2016-w13 Miami Dolphins trailed Baltimore Ravens 24-0 and lost 38-6
s2016-w17 Oakland Raiders trailed Denver Broncos 24-0 and lost 24-6
s2016-w20 Green Bay Packers trailed Atlanta Falcons 31-7 and lost 44-21
s2016-w21 New England Patriots trailed Atlanta Falcons 28-3 but came back to win 34-28
s2017-w20 Minnesota Vikings trailed Philadelphia Eagles 31-7 and lost 38-7
s2018-w17 New Orleans Saints trailed Carolina Panthers 23-0 and lost 33-14
s2019-w11 Houston Texans trailed Baltimore Ravens 24-0 and lost 41-7
s2019-w12 Green Bay Packers trailed San Francisco 49ers 23-0 and lost 37-8

4 of these games were in the playoffs, NO vs Carolina was a "rest the starters" game, Oak vs Denver was post-Derek Carr injury, almost all others were losses to eventual playoff teams

In [136]:
val trailerWinTotals = qualifiedGames.filter { g -> !g.isPlayoff && g.leadingTeamWon() ?: false }.map {g -> g.teamRecord(g.losingTeam, g.source.season).wins}
trailerWinTotals.average()
Out[136]:
6.121621621621622

"wrangle" or rearrange data into digestible form for easy plotting

We'd use data frames in Python (pandas) or R (dplyr). Kotlin's krangl is less mature and earier we stated our data isn't exactly tabular. Because all our data is in defined classes, we can use those for plots instead.

In [137]:
// key is the number of wins (0-16), value is the number of times that was the final win total
// 5 seasons * 32 teams = 160 season win totals

val seasonWinTotals : Map<Int, Int> = IntRange(0,16).fold(mutableMapOf<Int, Int>(), { acc, i -> 
    acc[i] = teamRecords.filter {tr -> tr.wins == i}.size
    acc
})
seasonWinTotals
Out[137]:
{0=1, 1=1, 2=2, 3=8, 4=9, 5=16, 6=16, 7=23, 8=12, 9=20, 10=17, 11=11, 12=10, 13=11, 14=2, 15=1, 16=0}
In [138]:
val p = lets_plot() { x = seasonWinTotals.keys } + ggsize(640, 240)

p + geom_bar(stat=Stat.identity) { y=seasonWinTotals.values } +
    xlab("season total wins") + ylab("seasons") + 
    xlim(IntRange(0,16)) + ggtitle("distribution of total season wins by all teams, 2015-2019")
Out[138]:
In [140]:
// same data as above chart. but we will need it in a Map form a next...

val trailerWinCounts : Map<Int, Int> = IntRange(0,16).fold(mutableMapOf<Int, Int>(), { acc, i -> 
    acc[i] = trailerWinTotals.filter {twt -> twt == i}.size
    acc
})
trailerWinCounts
Out[140]:
{0=2, 1=1, 2=2, 3=7, 4=7, 5=11, 6=11, 7=14, 8=7, 9=5, 10=3, 11=1, 12=1, 13=2, 14=0, 15=0, 16=0}
In [133]:
val trailerProbabilites : Map<Int, Double> = IntRange(0,16).fold(mutableMapOf<Int, Double>(), { acc, i -> 
    acc[i] = trailerWinCounts[i]!!.div(16.0 * seasonWinTotals[i]!!) // we won't have nulls because both maps have same keys 0-16
    acc
})

val p = lets_plot() { x = trailerProbabilites.keys } + ggsize(640, 240)

p + geom_bar(stat=Stat.identity) { y = trailerProbabilites.values } +
    xlab("season total wins") + ylab("P(qualifying)") + 
    xlim(IntRange(0,16)) + ggtitle("Probability of being a large-deficit team")
Out[133]:

... again, the likelihood of being in a big deficit similar to what the Patriots faced in Super Bowl 51 is much larger for weak teams.

23-26 point deficit with between 9 and 6 minutes left in the 3rd Quarter

In [ ]: