Deep Inventory Management
Dhruv Madeka
Amazon, maded@amazon.com
Kari Torkkola
Amazon, karito@amazon.com
Carson Eisenach
Amazon, ceisen@amazon.com
Anna Luo
Pinterest*, annaluo676@gmail.com
Dean P. Foster
Amazon, foster@amazon.com
Sham M. Kakade
Amazon, Harvard University, shamisme@amazon.com

Acknowledgments: The authors would like to thank Romain Menegaux, Robert Stine, Alvaro Maggiar, Salal Humair, Ping Xu, Vafa Khoshaein, Yash Kanoria, and numerous others from the Supply Chain Optimization Technologies group for their invaluable feedback.

* Work done while at Amazon.
This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control
system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this
dynamic program has historically been considered intractable, our results show that several policy learning
approaches are competitive with or outperform classical methods. In order to train these algorithms, we
develop novel techniques to convert historical data into a simulator. On the theoretical side, we present
learnability results on a subclass of inventory control problems, where we provide a provable reduction of
the reinforcement learning problem to that of supervised learning. On the algorithmic side, we present a
model-based reinforcement learning procedure (Direct Backprop) to solve the periodic review inventory
control problem by constructing a differentiable simulator. Under a variety of metrics, Direct Backprop
outperforms model-free RL and newsvendor baselines in both simulations and real-world deployments.
Key words : reinforcement learning, inventory control, differentiable simulation
1. Introduction
A periodic review inventory control system determines the optimal inventory level that should be
held for different products by attempting to balance the cost of meeting customer demand with
the cost of holding too much inventory. Inventory level is reviewed periodically and adjustments
are made by procuring more inventory or removing existing inventory through various means. The
inventory management problem can be abstracted as a Markov Decision Process (MDP) [38], though a number of complexities make it difficult or impossible to solve using traditional dynamic programming (DP) methods.
A major complexity is that demand is not a constant value or a deterministic function, but a
random variable with unknown dynamics that exhibit seasonalities, temporal correlations, trends
and spikes. Demand that is not met by the inventory in the warehouse or store (called on-hand
inventory) is lost since customers tend to go to competitors. This results in a non-linearity in the
state evolution dynamics when demand is lost, and a censoring (i.e., a lack of observability) of the historical data.
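To make this non-linearity concrete, consider a standard single-product lost-sales update (illustrative notation, not the paper's formulation), where $I_t$ is on-hand inventory, $D_t$ is random demand, and $a_t$ is an order placed at time $t$ that is assumed to arrive by the start of the next period:

$$ \text{sales}_t = \min(I_t, D_t), \qquad I_{t+1} = (I_t - D_t)^+ + a_t. $$

The positive part operator introduces a kink in the dynamics whenever $D_t > I_t$, and in exactly those periods the retailer observes only $\text{sales}_t = I_t$ rather than the true demand, which is what censors the historical data.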
Another complexity that arises is that an order placed by a retailer to a vendor incurs a time lag
between the actual placement of the order and the arrival of the items in the warehouse (usually
called the Vendor Lead Time or VLT), another random variable. For a modern retailer, price matching is another concern [24], as exogenous changes in prices by competitors force the price to have properties outside of being just another decision variable.
In this work, we present the first Deep Reinforcement Learning (DRL) based periodic review
inventory system that is able to handle many of the challenges that make the DP solution intractable.
Our model is able to handle lost sales, correlated demand, stochastic vendor lead-times and
exogenous price matching. We also present new techniques for correcting historical data to mitigate
the issues arising from the fact that this data is “off-policy”. These techniques allow us to use historical
data directly as a simulator, as opposed to building models of the various state variables. Finally,
we observe that the “state-evolution” of our target application is both known and differentiable.
To that end, we propose a model-based reinforcement learning (RL) approach (DirectBackprop)
where we directly optimize the policy objective by constructing a differentiable simulator. This
approach can be used to solve a class of decision making problems with continuous reward and
transition structure (of which inventory control is a special case). To motivate our work, we present a
collection of results in Section 7 that illustrate why the inventory management problem is efficiently learnable from historical data. Finally, we show in Section 9 that, for the inventory management problem, DirectBackprop outperforms model-free RL methods, which in turn outperform newsvendor baselines.
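To give a sense of how a differentiable simulator can train a neural policy, here is a minimal sketch of the DirectBackprop idea rather than the paper's actual implementation: it assumes PyTorch, a single product, an order that becomes available one period after it is placed, synthetic exogenous demand/price/cost paths, and invented names (Policy, simulate).

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps the observed state to a non-negative order quantity."""
    def __init__(self, state_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keeps orders non-negative
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

def simulate(policy, demand, price, cost, holding_cost, T):
    """Differentiable rollout: the reward's gradient flows back through the
    inventory dynamics into the policy parameters."""
    inventory = torch.zeros(())
    total_reward = torch.zeros(())
    for t in range(T):
        state = torch.stack([inventory, demand[t], price[t], cost[t]])
        order = policy(state)                        # action from the neural policy
        sales = torch.minimum(inventory, demand[t])  # unmet demand is lost
        inventory = torch.relu(inventory - demand[t]) + order  # order arrives next period
        reward_t = price[t] * sales - cost[t] * order - holding_cost * inventory
        total_reward = total_reward + reward_t
    return total_reward

T = 52
demand = torch.rand(T) * 10.0                # placeholder exogenous processes
price = torch.full((T,), 2.0)
cost = torch.full((T,), 1.0)
policy = Policy(state_dim=4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    loss = -simulate(policy, demand, price, cost, holding_cost=0.1, T=T)
    loss.backward()                          # backpropagate directly through the simulator
    optimizer.step()
```

In contrast to model-free policy-gradient methods, no likelihood-ratio estimator is needed in this sketch: the rollout itself is differentiable, so the pathwise gradient of the objective with respect to the policy parameters is available directly.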
Our contributions. Our main contribution is to provide a Deep Reinforcement Learning approach
to solving a periodic review inventory control system with stochastic vendor lead times, lost sales,
correlated demand, and price matching. Specifically:
1. By modifying historical data, we are able to model the inventory management problem as a largely exogenous Interactive Decision Process.
2. We present a novel algorithm, ‘DirectBackprop’, that utilizes the differentiable nature of the problem to achieve state-of-the-art performance.
3. We present a collection of learnability results that motivate our approach and provide understanding of its empirical efficacy.
4. Finally, we present empirical results from the first large-scale deployed end-to-end Deep RL system that considers a realistic discrete-time periodic review inventory management system with non-independent demand, price matching, random lead times, and lost sales at each decision epoch.
Organization. The first part of the paper (Section 3) sets up the mathematical formulation and framework for the Interactive Decision Process (IDP) we are solving. Section 4.2 describes the techniques we use to convert our historical data into a reliable simulator. Sections 6 and 6.1 cast the periodic review inventory system as a differentiable control problem and describe how this differentiable simulator can be used to train neural policies.
Section 7 presents a collection of theoretical results that illustrate how our framework can be used to backtest any policy, not just those based on Reinforcement Learning. Theorem 3 describes how these results apply even when some of the historical data is unobserved. In Section 9, we present a collection of experimental results in both simulated and real-world settings.
2. Related Work
Inventory models [29, 26, 43] often assume that any sales that are missed due to a lack of inventory
(usually called Out of Stock or OOS) are backlogged and filled in the next period. It is estimated
through the analysis of large scale surveys that only 15% - 23% of customers are willing to delay
a purchase when confronted by an out-of-stock item [18, 50]. This indicates that a lost sales
assumption is more realistic in competitive environments than a full-backlogging assumption. See
[6] for a comprehensive review of the lost sales inventory literature.
Karlin et al. [26] study an inventory system with a continuous demand density, lost sales, and a constant lead time of 1. They show that a base stock policy (i.e., a policy that orders up to a target inventory level; see [14] and references within for a formal definition) is suboptimal in this case even with linear cost structures, pointing out that
the way “a lag in delivery influences the decision process is analogous to that of a convex ordering
cost.” This indicates that a newsvendor model with a base stock policy is already suboptimal for a
modern retailer’s inventory system. Under certain conditions, such as a fixed lead time with a large
lost demand penalty cost, it can be shown that the base stock policy for a backordered model can
be asymptotically optimal for a lost sales model [21]. However, as in our problem, with longer lead
times and a fixed penalty, a constant order policy may outperform a base stock policy [41].
Our problem further deviates from the literature in that our lead times are modeled as a random
variable. Inventory systems with lost sales and stochastic lead times are scarcely studied in the
literature. Zipkin [55] showed that even for a simple setting of lead times that are constant (but larger than one time unit), stochastic demands, and lost sales, base stock policies tend to be numerically worse than myopic or constant order policies as the lead time increases. Kaplan [25] demonstrates how, when the maximum delay period is constrained, the state space for a random lead time model can be reduced to a single dimension. Nahmias et al. [34] formulate myopic policies for an inventory system where excess sales are lost and the lead time is random. Janakiraman et al. [23] develop convexity results for an inventory management system with lost sales, non-crossing lead times, and order-up-to policies.
Demand censoring has been studied in the context of the spiral down effect in revenue management
[8] as well as for the censored newsvendor problem (see [5] and citations within). We have not seen
a historical data correction similar to the one in Section 4.2 being applied to the policy learning
problem framed in Section 6.
For large retailers in competitive markets, price matching causes abrupt and random changes in
the price of a product during a decision epoch. This causes the price to behave like an exogenous
stochastic process, as opposed to an endogenous decision variable. There is no literature we could
find that addresses the problem of a periodic review inventory system with price matching effects.
Reinforcement learning has been applied in many sequential decision making problems where
learning from a large number of simulations is possible, such as games and simulated physics models
of multi-joint systems. While the use of deep learning for forecasting demand has developed recently
[52, 28, 51, 13, 53], the usage of reinforcement learning to directly produce decisions in Inventory Management has been limited. Giannoccaro et al. [15] consider a periodic review system with fixed prices, costs, and full backlogging of demand and show that the SMART RL Algorithm [10] can outperform baseline integrated policies. Oroojlooyjadid et al. [36] integrate the forecasting and optimization steps in a newsvendor problem by using a Deep Neural Network (DNN) to directly predict a quantity that minimizes the newsvendor loss function. Balaji et al. [3] study the multi-period newsvendor problem with uniformly distributed price and cost, constant VLT, and stationary, Poisson-distributed demand and show that Proximal Policy Optimization [45] can
eventually beat standard newsvendor benchmarks.
Gijsbrechts et al. [16] consider lost sales, dual sourcing, and multi-echelon problems and show that modern Deep Reinforcement Learning algorithms such as A3C [31] are competitive with state-of-the-art heuristics. Qi et al. [39] train a neural network to predict the outputs of an (ex-post) “oracle” policy. While this means that their approach does not follow a Predict-then-Optimize (PTO) framework, the oracle policy does not allow them to handle how the variability of the exogenous variables (such as demand, price, etc.) might influence the optimal ex-ante policy. (By taking gradients against the true reward function, our approach is able to handle this variability.)
Differentiable simulators have been studied [47] and applied to problems ranging from physics [20] to protein folding [22]. In the context of inventory management, they have been applied [17] to studying the sensitivity of inventory costs to optimal parameters for base stock levels. They have
not been studied in the context of directly learning (neural) policies through the gradients from the
simulator.
Exogenous Markov Decision Processes and their learnability have been studied [46]; our work allows the policy to depend on the entire trajectory of the exogenous random variables.
3. Problem Formulation
3.1. Mathematical Notation
Denote by $\mathbb{R}$ and $\mathbb{N}$ the sets of real and natural numbers, respectively. Let $(\cdot)^+$ refer to the classical positive part operator, i.e., $(\cdot)^+ = \max(\cdot, 0)$. The inventory management problem seeks to find the optimal inventory level for each product $i$ in the set of retailer's products, which we denote by $\mathcal{A}$. We assume all random variables are defined on a canonical probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Let $\theta \in \Theta$ denote a parameter in some parameter set $\Theta$. We use $\mathbb{E}$ to denote the expectation operator of a random variable with respect to some probability measure $\mathbb{P}$. Let $\|Q_1 - Q_2\|_{TV}$ denote the total variation distance between two probability measures $Q_1$ and $Q_2$.
3.2. Construction of the Interactive Decision Process
We will construct our Interactive Decision Process (IDP) in two steps. First, we describe the driving stochastic “noise” processes which govern our problem: quantities such as demand, price changes, and cost that are outside of our control. The assumption is that nothing in this set is influenced by our actions. Second, we describe our decision process, which can depend on all the previous information contained in the above processes.
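As a rough structural illustration of this two-step construction (a sketch with assumed names and types, not the paper's formal definitions), the exogenous trajectory can be generated or replayed first, and the decision process then consumes its history at each decision epoch:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExogenousStep:
    """Step 1: per-period quantities assumed to be unaffected by our actions."""
    demand: float
    price: float
    cost: float
    lead_time: int

# (exogenous history so far, on-hand inventory) -> order quantity
Policy = Callable[[List[ExogenousStep], float], float]

def rollout(exogenous: List[ExogenousStep], policy: Policy) -> List[float]:
    """Step 2: the decision process. Each action may depend on the entire
    exogenous history observed so far (plus on-hand inventory), but the
    exogenous trajectory itself is never influenced by the actions taken."""
    inventory, actions = 0.0, []
    for t in range(len(exogenous)):
        action = policy(exogenous[: t + 1], inventory)
        actions.append(action)
        # Inventory bookkeeping (arrivals after the lead time, sales, lost
        # demand) is elided here; see the simulator sketch in the introduction.
    return actions
```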