Deep Inventory Management
Dhruv Madeka
Amazon, maded@amazon.com
Kari Torkkola
Amazon, karito@amazon.com
Carson Eisenach
Amazon, ceisen@amazon.com
Anna Luo
Pinterest*, annaluo676@gmail.com
Dean P. Foster
Amazon, foster@amazon.com
Sham M. Kakade
Amazon, Harvard University, shamisme@amazon.com

Acknowledgments: The authors would like to thank Romain Menegaux, Robert Stine, Alvaro Maggiar, Salal Humair, Ping Xu, Vafa Khoshaein, Yash Kanoria, and numerous others from the Supply Chain Optimization Technologies group for their invaluable feedback.

* Work done while at Amazon.
This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control
system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this
dynamic program has historically been considered intractable, our results show that several policy learning
approaches are competitive with or outperform classical methods. In order to train these algorithms, we
develop novel techniques to convert historical data into a simulator. On the theoretical side, we present
learnability results on a subclass of inventory control problems, where we provide a provable reduction of
the reinforcement learning problem to that of supervised learning. On the algorithmic side, we present a
model-based reinforcement learning procedure (Direct Backprop) to solve the periodic review inventory
control problem by constructing a differentiable simulator. Under a variety of metrics, Direct Backprop
outperforms model-free RL and newsvendor baselines in both simulations and real-world deployments.
Key words : reinforcement learning, inventory control, differentiable simulation
1. Introduction
A periodic review inventory control system determines the optimal inventory level that should be
held for different products by attempting to balance the cost of meeting customer demand with
the cost of holding too much inventory. Inventory level is reviewed periodically and adjustments
are made by procuring more inventory or removing existing inventory through various means. The
inventory management problem can be abstracted as a Markov Decision Process (MDP) [38], though a number of complexities make it difficult or impossible to solve using traditional dynamic programming (DP) methods.
A major complexity is that demand is not a constant value or a deterministic function, but a
random variable with unknown dynamics that exhibit seasonalities, temporal correlations, trends
and spikes. Demand that is not met by the inventory in the warehouse or store (called on-hand
inventory) is lost since customers tend to go to competitors. This results in a non-linearity in the
state evolution dynamics when demand is lost, and a censoring (i.e., a lack of observability) of the historical data.
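To make this non-linearity concrete, consider a standard single-product lost-sales update (illustrative notation, not the paper's formulation), where $I_t$ is on-hand inventory, $D_t$ is random demand, and $a_t$ is an order placed at time $t$ that is assumed to arrive by the start of the next period:

$$ \text{sales}_t = \min(I_t, D_t), \qquad I_{t+1} = (I_t - D_t)^+ + a_t. $$

The positive part operator introduces a kink in the dynamics whenever $D_t > I_t$, and in exactly those periods the retailer observes only $\text{sales}_t = I_t$ rather than the true demand, which is what censors the historical data.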
Another complexity that arises is that an order placed by a retailer to a vendor incurs a time lag
between the actual placement of the order and the arrival of the items in the warehouse (usually
called the Vendor Lead Time or VLT), another random variable. For a modern retailer, price matching is another concern [24], as exogenous changes in prices by competitors force the price to have properties outside of being just another decision variable.
In this work, we present the first Deep Reinforcement Learning (DRL) based periodic review
inventory system that is able to handle many of the challenges that make the DP solution intractable.
Our model is able to handle lost sales, correlated demand, stochastic vendor lead-times and
exogenous price matching. We also present new techniques for correcting historical data to mitigate
the issues arising from the fact that this data is “off-policy”. These techniques allow us to use historical
data directly as a simulator, as opposed to building models of the various state variables. Finally,
we observe that the “state-evolution” of our target application is both known and differentiable.
To that end, we propose a model-based reinforcement learning (RL) approach (DirectBackprop)
where we directly optimize the policy objective by constructing a differentiable simulator. This
approach can be used to solve a class of decision making problems with continuous reward and
transition structure (of which inventory control is a special case). To motivate our work, we present a
collection of results in Section 7 that illustrate why the inventory management problem is efficiently learnable from historical data. Finally, we show in Section 9 that, for the inventory management problem, DirectBackprop outperforms model-free RL methods, which in turn outperform newsvendor baselines.
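To give a sense of how a differentiable simulator can train a neural policy, here is a minimal sketch of the DirectBackprop idea rather than the paper's actual implementation: it assumes PyTorch, a single product, an order that becomes available one period after it is placed, synthetic exogenous demand/price/cost paths, and invented names (Policy, simulate).

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps the observed state to a non-negative order quantity."""
    def __init__(self, state_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keeps orders non-negative
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

def simulate(policy, demand, price, cost, holding_cost, T):
    """Differentiable rollout: the reward's gradient flows back through the
    inventory dynamics into the policy parameters."""
    inventory = torch.zeros(())
    total_reward = torch.zeros(())
    for t in range(T):
        state = torch.stack([inventory, demand[t], price[t], cost[t]])
        order = policy(state)                        # action from the neural policy
        sales = torch.minimum(inventory, demand[t])  # unmet demand is lost
        inventory = torch.relu(inventory - demand[t]) + order  # order arrives next period
        reward_t = price[t] * sales - cost[t] * order - holding_cost * inventory
        total_reward = total_reward + reward_t
    return total_reward

T = 52
demand = torch.rand(T) * 10.0                # placeholder exogenous processes
price = torch.full((T,), 2.0)
cost = torch.full((T,), 1.0)
policy = Policy(state_dim=4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    loss = -simulate(policy, demand, price, cost, holding_cost=0.1, T=T)
    loss.backward()                          # backpropagate directly through the simulator
    optimizer.step()
```

In contrast to model-free policy-gradient methods, no likelihood-ratio estimator is needed in this sketch: the rollout itself is differentiable, so the pathwise gradient of the objective with respect to the policy parameters is available directly.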
Our contributions. Our main contribution is to provide a Deep Reinforcement Learning approach
to solving a periodic review inventory control system with stochastic vendor lead times, lost sales,
correlated demand, and price matching. Specifically:
1. By modifying historical data, we are able to model the inventory management problem as a largely exogenous Interactive Decision Process.
2. We present a novel algorithm, ‘DirectBackprop’, that utilizes the differentiable nature of the problem to achieve state-of-the-art performance.
3. We present a collection of learnability results that motivate our approach and provide understanding of its empirical efficacy.
4. Finally, we present empirical results from the first large-scale deployed end-to-end Deep RL system that considers a realistic discrete-time periodic review inventory management system with non-independent demand, price matching, random lead times, and lost sales at each decision epoch.
Organization. The first part of the paper (Section 3) sets up the mathematical formulation and framework for the Interactive Decision Process (IDP) we are solving. Section 4.2 describes the techniques we use to convert our historical data into a reliable simulator. Sections 6 and 6.1 cast the periodic review inventory system as a differentiable control problem and describe how this differentiable simulator can be used to train neural policies.
Section 7 presents a collection of theoretical results that illustrate how our framework can be used to backtest any policy, not just those based on Reinforcement Learning. Theorem 3 describes how these results apply even when some of the historical data is unobserved. In Section 9, we present a collection of experimental results in both simulated and real-world settings.
2. Related Work
Inventory models [29, 26, 43] often assume that any sales that are missed due to a lack of inventory
(usually called Out of Stock or OOS) are backlogged and filled in the next period. It is estimated
through the analysis of large scale surveys that only 15% - 23% of customers are willing to delay
a purchase when confronted by an out-of-stock item [18, 50]. This indicates that a lost sales
assumption is more realistic in competitive environments than a full-backlogging assumption. See
[6] for a comprehensive review of the lost sales inventory literature.
Karlin et al. [26] study an inventory system with a continuous demand density, lost sales, and a constant lead time of 1. They show that a base stock policy (i.e., a policy that orders up to a target inventory level; see [14] and references within for a formal definition) is suboptimal in this case even with linear cost structures, pointing out that
the way “a lag in delivery influences the decision process is analogous to that of a convex ordering
cost.” This indicates that a newsvendor model with a base stock policy is already suboptimal for a
modern retailer’s inventory system. Under certain conditions, such as a fixed lead time with a large
lost demand penalty cost, it can be shown that the base stock policy for a backordered model can
be asymptotically optimal for a lost sales model [21]. However, as in our problem, with longer lead
times and a fixed penalty, a constant order policy may outperform a base stock policy [41].
Our problem further deviates from the literature in that our lead times are modeled as a random
variable. Inventory systems with lost sales and stochastic lead times are scarcely studied in the
literature. Zipkin [55] showed that even for a simple setting of lead times that are constant (but larger than one time unit), stochastic demands, and lost sales, base stock policies tend to be numerically worse than myopic or constant order policies as the lead time increases. Kaplan [25] demonstrates how, when the maximum delay period is constrained, the state space for a random lead time model can be reduced to a single dimension. Nahmias et al. [34] formulate myopic policies for an inventory system where excess sales are lost and the lead time is random. Janakiraman et al. [23] develop convexity results for an inventory management system with lost sales, non-crossing lead times, and order-up-to policies.
Demand censoring has been studied in the context of the spiral down effect in revenue management
[8] as well as for the censored newsvendor problem (see [5] and citations within). We have not seen
a historical data correction similar to the one in Section 4.2 being applied to the policy learning
problem framed in Section 6.
For large retailers in competitive markets, price matching causes abrupt and random changes in
the price of a product during a decision epoch. This causes the price to behave like an exogenous
stochastic process, as opposed to an endogenous decision variable. There is no literature we could
find that addresses the problem of a periodic review inventory system with price matching effects.
Reinforcement learning has been applied in many sequential decision making problems where
learning from a large number of simulations is possible, such as games and simulated physics models
of multi-joint systems. While the use of deep learning for forecasting demand has developed recently
[52, 28, 51, 13, 53], the usage of reinforcement learning to directly produce decisions in Inventory Management has been limited. Giannoccaro et al. [15] consider a periodic review system with fixed prices, costs, and full backlogging of demand and show that the SMART RL Algorithm [10] can outperform baseline integrated policies. Oroojlooyjadid et al. [36] integrate the forecasting and optimization steps in a newsvendor problem by using a Deep Neural Network (DNN) to directly predict a quantity that minimizes the newsvendor loss function. Balaji et al. [3] study the multi-period newsvendor problem with uniformly distributed price and cost, constant VLT, and stationary, Poisson-distributed demand and show that Proximal Policy Optimization [45] can
eventually beat standard newsvendor benchmarks.
Gijsbrechts et al. [16] consider lost sales, dual sourcing, and multi-echelon problems and show that modern Deep Reinforcement Learning algorithms such as A3C [31] are competitive with state-of-the-art heuristics. Qi et al. [39] train a neural network to predict the outputs of an (ex-post) “oracle” policy. While this means that their approach does not follow a Predict-then-Optimize (PTO) framework, the oracle policy does not allow them to handle how the variability of the exogenous variables (such as demand, price, etc.) might influence the optimal ex-ante policy. (By taking gradients against the true reward function, our approach is able to handle this variability.)
Differentiable simulators have been studied [47] and applied to problems ranging from physics [20] to protein folding [22]. In the context of inventory management, they have been applied [17] to studying the sensitivity of inventory costs to optimal parameters for base stock levels. They have
not been studied in the context of directly learning (neural) policies through the gradients from the
simulator.
Exogenous Markov Decision Processes and their learnability have been studied [46]; our work allows the policy to depend on the entire trajectory of the exogenous random variables.
3. Problem Formulation
3.1. Mathematical Notation
Denote by $\mathbb{R}$ and $\mathbb{N}$ the sets of real and natural numbers, respectively. Let $(\cdot)^+$ refer to the classical positive part operator, i.e., $(\cdot)^+ = \max(\cdot, 0)$. The inventory management problem seeks to find the optimal inventory level for each product $i$ in the set of retailer's products, which we denote by $\mathcal{A}$. We assume all random variables are defined on a canonical probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Let $\theta \in \Theta$ denote a parameter in some parameter set $\Theta$. We use $\mathbb{E}$ to denote the expectation operator of a random variable with respect to some probability measure $\mathbb{P}$. Let $\|Q_1 - Q_2\|_{TV}$ denote the total variation distance between two probability measures $Q_1$ and $Q_2$.
3.2. Construction of the Interactive Decision Process
We will construct our Interactive Decision Process (IDP) in two steps. First, we describe the driving stochastic “noise” processes which govern our problem: quantities such as demand, price changes, and cost that are outside of our control. The assumption is that nothing in this set is influenced by our actions. Second, we describe our decision process, which can depend on all the previous information contained in the above processes.
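As a rough structural illustration of this two-step construction (a sketch with assumed names and types, not the paper's formal definitions), the exogenous trajectory can be generated or replayed first, and the decision process then consumes its history at each decision epoch:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExogenousStep:
    """Step 1: per-period quantities assumed to be unaffected by our actions."""
    demand: float
    price: float
    cost: float
    lead_time: int

# (exogenous history so far, on-hand inventory) -> order quantity
Policy = Callable[[List[ExogenousStep], float], float]

def rollout(exogenous: List[ExogenousStep], policy: Policy) -> List[float]:
    """Step 2: the decision process. Each action may depend on the entire
    exogenous history observed so far (plus on-hand inventory), but the
    exogenous trajectory itself is never influenced by the actions taken."""
    inventory, actions = 0.0, []
    for t in range(len(exogenous)):
        action = policy(exogenous[: t + 1], inventory)
        actions.append(action)
        # Inventory bookkeeping (arrivals after the lead time, sales, lost
        # demand) is elided here; see the simulator sketch in the introduction.
    return actions
```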