Filling the Gaps: A Multiple Imputation Approach to
Estimating Aging Curves in Baseball
Quang Nguyen and Gregory J. Matthews
March 11, 2024
Abstract
In sports, an aging curve depicts the relationship between average performance and
age in athletes’ careers. This paper investigates the aging curves for offensive players in
Major League Baseball. We study this problem in a missing data context and account
for different types of dropouts of baseball players during their careers. We employ a
multiple imputation framework for multilevel data to impute the player performance
associated with the missing seasons, and estimate the aging curves based on the imputed
datasets. We then evaluate the effects of different dropout mechanisms on the aging
curves through simulation, before applying our method to analyze MLB player data
from past seasons. Results suggest an overestimation of the aging curves constructed
without considering the unobserved seasons, whereas estimates obtained from multiple
imputation address this shortcoming.
Keywords: aging curve; baseball; multiple imputation; survival bias
Department of Statistics & Data Science, Carnegie Mellon University
Department of Mathematics and Statistics, Loyola University Chicago
arXiv:2210.02383v3 [stat.AP] 11 Mar 2024
1 Introduction
The rise and fall of an athlete is a popular topic of discussion in the sports media today.
Questions regarding whether a player has reached their peak, is past their prime, or is good
enough to remain in their respective professional league are often seen in different media
outlets such as news articles, television debate shows, and podcasts. The average performance
of players by age throughout their careers is visually represented by an aging curve. This
graph typically consists of a horizontal axis representing a time variable (usually age or
season) and a vertical axis showing a performance metric at each time point in a player’s
career.
One significant challenge associated with the study of aging curves in sports is survival
bias, as pointed out by Lichtman (2009), Turtoro (2019), Judge (2020a), and Schuckers et al.
(2023). In particular, aging effects are often not estimated from the full population of
athletes (i.e., all players who have ever played) in a given league. That is, only players who
are good enough to remain are observed, whereas those who might be involved but do not
actually participate, or are not talented enough to compete, are completely disregarded.
This very likely results in an overestimation of the aging curves.
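The direction of this bias can be illustrated with a small simulation. The Python sketch below is only an illustration under assumed values: the quadratic true curve, the talent and noise distributions, the cutoff, and the dropout rule (a player disappears from the first season in which performance falls below the cutoff) are all hypothetical and are not taken from the data or the dropout mechanisms studied in this paper.

```python
import numpy as np


def simulate_survivor_bias(n_players=2000, cutoff=-0.5, seed=0):
    """Compare the mean curve over all players with the mean over survivors only."""
    rng = np.random.default_rng(seed)
    ages = np.arange(21, 40)

    # Assumed true mean curve: quadratic in age with a peak at 28.
    true_curve = -0.01 * (ages - 28.0) ** 2

    # Each player has a fixed talent offset plus season-to-season noise.
    talent = rng.normal(0.0, 1.0, size=(n_players, 1))
    noise = rng.normal(0.0, 0.5, size=(n_players, ages.size))
    performance = true_curve + talent + noise  # shape: (n_players, n_ages)

    # Assumed dropout rule: the first season below the cutoff, and every
    # season after it, is unobserved.
    dropped = np.cumsum(performance < cutoff, axis=1) > 0
    observed = np.where(dropped, np.nan, performance)

    full_mean = performance.mean(axis=0)          # curve if everyone played
    observed_mean = np.nanmean(observed, axis=0)  # curve from survivors only
    return ages, full_mean, observed_mean


ages, full_mean, observed_mean = simulate_survivor_bias()
for age, full, obs in zip(ages[::6], full_mean[::6], observed_mean[::6]):
    print(f"age {age}: all players {full:+.2f}, observed only {obs:+.2f}")
```

Because only seasons above the cutoff are ever observed in this setup, the survivor-only average sits above the full-population average at every age, which is the overestimation described above.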
As such, player survivorship and dropout can be viewed as a missing data problem. There
are several distinct cases of player absence from professional sport at different points in their
careers. Early on, teams may elect to assign their young prospects to their minor/development
league affiliates to develop for several years. Many of those players end up receiving a
call-up to join the senior squad when the team believes they are ready. During the middle of
a player’s career, a nonappearance could occur for various reasons. Injury is unavoidable
in sports, and it could cost a player at least one year of playing time. Personal reasons,
such as contract situations and, more recently, concerns regarding a global pandemic, could
also lead to athletes sitting out a season. Later on, a player, by either their choice or their
team’s choice, might head for retirement because they can no longer perform at the level they
once did, leading to unobserved seasons that could have been played.
The primary aim of this paper is to apply missing data techniques to the estimation of
aging curves. In doing so, we focus on baseball and pose the following research question:
What would the aging curve look like if all players competed in every season within a fixed
range of age? In other words, what would have happened if a player who was forced to retire
from their league at a certain age had played a full career? The manuscript continues with a
review of existing literature on aging curves in baseball and other sports in Section 2. Next,
we describe our data and methods used to perform our analyses in Section 3. After that, our
approach is implemented through simulation and analyses of real baseball data in Sections 4
and 5. Finally, in Section 6, we conclude with a discussion of the results, limitations, and
directions for future work.
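For readers unfamiliar with multiple imputation, the final step of any such analysis is to combine the estimates obtained from each imputed dataset. A minimal Python sketch of the standard pooling rules (Rubin's rules) for a scalar estimate follows. The function name `pool_rubin` and the numbers in the usage example are hypothetical, and the sketch is not meant to describe the specific estimation procedure used later in this paper.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool a scalar estimate across m imputed datasets using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size

    pooled = estimates.mean()         # pooled point estimate
    within = variances.mean()         # average within-imputation variance
    between = estimates.var(ddof=1)   # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical estimates of mean performance at one age from m = 5 imputed datasets.
pooled, total_var = pool_rubin(
    [0.320, 0.318, 0.325, 0.321, 0.316],
    [1e-5, 1e-5, 1e-5, 1e-5, 1e-5],
)
print(pooled, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing seasons: the more the imputed datasets disagree, the larger the total variance.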
2 Literature Review
To date, we find a considerable amount of previous work related to aging curves and career
trajectory of athletes. This body of work consists of several key themes, a wide array of
statistical methods, and applications in many sports besides baseball such as basketball,
hockey, and track and field, to name a few.
A common assumption in the baseball aging curve literature is a quadratic
form for the relationship between performance and age. Morris (1983) looked at
Ty Cobb’s batting average trajectory using parametric empirical Bayes and used shrinkage
methods to obtain a parabolic curve for Cobb’s career performance. Albert (1992) proposed a
quadratic random effects log-linear model for smoothing a batter’s home run rates throughout
their career. Berry et al. (1999) implemented a nonparametric method to estimate the age
effect on performance in baseball, hockey, and golf. However, Albert (1999) weighed in on
this nonparametric approach and questioned the assumptions that the peak age and periods
of growth and decline are the same for all players. Albert (1999) ultimately preferred a
second-degree polynomial function for estimating age effect in baseball, which is a parametric
model. Continuing his series of work on aging trajectories, Albert (2002) proposed a Bayesian
exchangeable model for baseball hitting performance. This approach combined quadratic
regression estimates and assumed similar careers for players born in the same decade. Fair
(2008) and Bradbury (2009) both used fixed-effects regression to examine age effects in
MLB, also assuming a quadratic aging curve form. A major drawback of the study by Bradbury
(2009) is that the analysis only considered players with longer baseball careers.
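To make the quadratic assumption concrete: if mean performance at age x is modeled as c0 + c1 x + c2 x^2 with c2 < 0, the implied peak age is -c1 / (2 c2). The Python sketch below fits such a curve by ordinary least squares to a single synthetic series. The function name and all numbers are hypothetical, and plain least squares stands in for the random-effects, fixed-effects, and Bayesian models used in the studies cited above.

```python
import numpy as np


def fit_quadratic_aging_curve(ages, performance):
    """Fit performance = c0 + c1*age + c2*age^2 by least squares.

    Returns the coefficients (c0, c1, c2) and the implied peak age.
    """
    # np.polyfit returns coefficients from the highest degree down.
    c2, c1, c0 = np.polyfit(ages, performance, deg=2)
    peak_age = -c1 / (2.0 * c2)  # vertex of the parabola (a maximum if c2 < 0)
    return (c0, c1, c2), peak_age


# Synthetic example: a curve peaking at age 28 plus a small amount of noise.
rng = np.random.default_rng(1)
ages = np.arange(21, 40, dtype=float)
performance = 0.330 - 0.0005 * (ages - 28.0) ** 2 + rng.normal(0.0, 0.002, ages.size)

coefs, peak_age = fit_quadratic_aging_curve(ages, performance)
print(f"estimated peak age: {peak_age:.1f}")
```

A single parabola forces growth and decline to be symmetric around the peak, which is one reason the more flexible approaches discussed in this section have been proposed.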
In addition to baseball, studies on aging curves have also been conducted for other sports.
Early on, Moore (1975) looked at the association between age and running speed in track
and field and produced aging curves for different running distances using an exponential
model. Fair (1994) and Fair (2007) studied the age effects in track and field, swimming, chess,
and running, in addition to their later work in baseball, as mentioned earlier. In triathlon,
Villaroel et al. (2011) assumed a quadratic relationship between performance and age, as
many have previously considered. As for basketball, Page et al. (2013) used a Gaussian
process regression in a hierarchical Bayesian framework to model age effect in the NBA.
Additionally, Lailvaux et al. (2014) used NBA and WNBA data to investigate and test for
potential sex differences in the aging of comparable performance indicators. Vaci et al. (2019)
applied Bayesian cognitive latent variable modeling to explore aging and career performance
in the NBA, accounting for player position and activity level. In tennis, Kovalchik (2014)
studied age and performance trends in men’s tennis using change point analysis.
Another convention in the aging curve modeling literature is the assumption of discrete
observations. Specifically, most researchers use regression modeling and consider a data
measurement for each season played throughout a player’s career. In contrast to previous
approaches, Wakim and Jin (2014) took a different route and considered functional data
analysis as the primary tool for modeling MLB and NBA aging curves. This is a continuous
framework which treats the entire career performance of an athlete as a smooth function. In
a similar functional data setting, Leroy et al. (2018) investigated the performance progression
curves in swimming.
A subset of the literature on aging and performance in sports specifically studies the
question: At what age do athletes peak? Schulz and Curnow (1988) looked at the age of
peak performance for track and field, swimming, baseball, tennis, and golf. A follow-up study
to this work was done by Schulz et al. (1994), where the authors focused on baseball and
found that the average peak age for baseball players is between 27 and 30, considering several
performance measures. Later findings on baseball peak age also showed consistency with the
results in Schulz et al. (1994). Fair (2008) determined the peak-performance age in baseball
to be 28, whereas Bradbury (2009) determined that baseball hitters and pitchers reach the
top of their careers at about 29 years old. In soccer, Dendir (2016) determined that the peak
age for footballers in the top leagues falls within the range of 25 to 27.
The idea of player survivorship is only mentioned in a small number of articles. To our