Preprint submitted to Elsevier
arXiv:2210.10203v1 [eess.SY] 18 Oct 2022
From Model-Based to Model-Free: Learning Building Control for Demand Response
David Biagionia,c, Xiangyu Zhanga,, Christiane Adcocka,b, Michael Sinnera, Peter Grafa, Jennifer Kinga
aNational Renewable Energy Laboratory, Golden, CO 80401, U.S.A.
bStanford University, Stanford, CA 94305, U.S.A.
cMaplewell Energy, Broomfield, CO 80021, U.S.A.
Abstract
Grid-interactive building control is a challenging and important problem for reducing carbon emissions, increasing energy efficiency, and supporting the electric power grid. Currently, researchers and practitioners are confronted with a choice of control
strategies ranging from model-free (purely data-driven) to model-based (directly incorporating physical knowledge) to hybrid
methods that combine data and models. In this work, we identify state-of-the-art methods that span this methodological spec-
trum and evaluate their performance for multi-zone building HVAC control in the context of three demand response programs. We
demonstrate, in this context, that hybrid methods offer many benefits over both purely model-free and model-based methods as long as certain requirements are met. In particular, hybrid controllers are relatively sample efficient, fast online, and highly accurate so
long as the test case falls within the distribution of training data. Like all data-driven methods, hybrid controllers are still subject
to generalization errors when applied to out-of-sample scenarios. Key takeaways for control strategies are summarized and the
developed software framework is open-sourced.
Keywords: Grid-interactive building, demand response, differentiable optimization layers, differentiable programming, hybrid control, reinforcement learning, model predictive control, differentiable predictive control
1. Introduction
The increasingly frequent occurrence of extreme weather
events and their severe social and economic consequences make
climate change one of the most pressing problems currently
faced by human society [1]. To address this, many coun-
tries have proposed quantitative targets to reverse the outcome
caused by human-induced climate change and guide the transi-
tion to a decarbonized society. For example, the U.S. govern-
ment aims to decarbonize the power sector by 2035 and reach
net zero emissions by 2050 [2]. Similarly, China’s “30-60” plan
targets to hit “carbon peak” and “carbon neutral” by 2030 and
2060, respectively [3]. A key part of reaching these goals will
be reducing building energy consumption, which in the U.S.,
for example, accounts for 70% of national electricity use [4]
and 30% of national carbon emissions [5]. One effective set of tools for reducing building energy consumption is technologies for grid-interactive efficient buildings (GEBs). According to
[6], adopting these technologies across the U.S. would reduce
power sector emissions by 6% by 2030 and save $100-200 bil-
lion by 2040. To achieve these reductions, it is urgent and nec-
essary to develop technologies that enable buildings to be grid-
interactive, such as through demand response (DR) programs.
Corresponding author
Email addresses: dave@maplewelleng.com (David Biagioni),
Xiangyu.Zhang@nrel.gov (Xiangyu Zhang), janiad@stanford.edu
(Christiane Adcock), Michael.Sinner@nrel.gov (Michael Sinner),
Peter.Graf@nrel.gov (Peter Graf), Jennifer.King@nrel.gov (Jennifer
King)
One successful method for building-grid interaction through
DR is model predictive control (MPC). MPC has been widely
studied for optimizing building-side control objectives [7,8];
see [9] for a comprehensive review. Many recent works add
grid service objectives to the MPC formulation to deliver control that balances the trade-offs between thermal comfort, energy cost, and DR requirements. For example, Tang et al. [10]
propose leveraging thermal energy storage in buildings to pro-
vide fast DR while maintaining indoor comfort and Hu et al.
[11] design a mixed integer linear programming-based MPC
controller to make building floor heating systems grid respon-
sive. However, as pointed out by [9], while MPC controllers
have been researched for more than a decade, they have not yet
been widely adopted in buildings owing to challenges in cre-
ating accurate yet simple predictive models and solving poten-
tially compute-intensive optimization problems during online
control.
To circumvent the difficulties with MPC, researchers have
tried purely data-driven methods, such as model-free reinforce-
ment learning (RL). Building simulators such as EnergyPlus
or thermal dynamics models learned from smart thermostat
data can be used for RL controller training, as demonstrated
in [12,13,14,15]. As RL has less stringent requirements for
building models than MPC, it is easier to obtain models for new
buildings, although several reviews highlight the need for more
research on transfer learning to further improve generalization
[16,17]. RL also has the benefit that online operation only re-
quires a single forward pass through a policy, such as a neural
network, which is generally much faster than the online optimization required by MPC. For a more in-depth review on using RL for DR, refer to [18]. The use of model-free RL comes
at a price, however: training can require large amounts of data
and time, and it can be challenging to enforce safety or physical
constraints. The primary approach for the latter has been adding soft penalties for constraint violation, which only encourages but does not strictly enforce the constraints. While recent research has attempted to address some of these challenges for building control using transfer learning [19,20,21] and safe RL [22], these challenges remain an active area of research.
Recently, hybrids of MPC and model-free RL have shown
promise for building control. Intuitively, a model-based method
such as MPC embodies physical domain knowledge and thus
can guide a data-driven learning process while a learning-based
method such as RL can adapt the model to a specific system.
Based on this intuition, hybrid control methods should train of-
fline faster than RL, run online faster than MPC, and identify
at least as good control actions as either method. Several hy-
brid methods have been proposed in recent years, notably MPC
with a learned terminal cost (RL-MPC) [23,24], approximate
MPC (aMPC) based on imitation learning [25,26], and differentiable predictive control (DPC) [27] based on differentiable programming [28,29]. aMPC and DPC use a similar approach and have comparable control performance, but DPC is simpler to implement and has lower training cost [27].
These works include some comparisons of hybrid methods
to MPC and/or RL for building control. For single building
control, [24] compares RL-MPC to MPC and RL and [27] com-
pares DPC to MPC and RL. For multi-zone building control,
[25] compares aMPC to MPC. These studies each compare
one hybrid controller to one or two baseline controllers rather
than comparing hybrid controllers to each other and they make
these comparisons on a limited set of metrics, such as only
control performance and/or online evaluation time. Also, they
only consider baseline building energy control rather than grid-
interactive control. In addition, some of the hybrid methods are
applied to building control in a limited capacity: RL-MPC ei-
ther uses a restrictive form for the terminal cost [23] or uses
a computationally expensive sampling method [24] and DPC
has only been trained using a linearized building model [27].
Finally, the above comparisons use a mix of building models
and learning frameworks, some of which are not released open-
source, making it challenging to reproduce, build upon, or com-
pare between these works.
To address the knowledge gaps described above, we present
the first paper providing a comprehensive comparison of model-
based, learning-based, and hybrid methods for building control.
The novel contributions of this study are that we:
1. Compare a full spectrum of state-of-the-art control meth-
ods from model-based to model-free and from learning-based
to learning-free, as shown in Figure 1.
2. Evaluate control methods on a thorough set of met-
rics: training performance, control performance, generalizabil-
ity, and online computational time. Based on our experiments,
we summarize the pros and cons of each method and for which test cases each method is suitable.
3. Include DR programs in the analysis of building controllers and extend these methods to grid-interactive control.
4. Use robust, general forms of hybrid methods to ensure a fair comparison between methods. Unlike previous works, we apply RL-MPC using a general form for the cost function and a computationally efficient training approach, and we train DPC on a nonlinear building model.
5. Implement seven control methods, two building models, and three DR programs in a modular, documented, open-source code base [30] (URL: https://github.com/NREL/learning-building-control) to enable future comparisons to and extensions of this work.

Figure 1: A quadrant illustration of the control methods studied in this paper based on two feature axes: (1) use of learning and (2) explicit use of a model. See the nomenclature section for controllers' full names and Section 3 for detailed formulations.

Nomenclature

Controller Abbreviations
OPT      Optimal open-loop control
MPC      Nonconvex model predictive control
MPC-C    Convex model predictive control
MPC-CL   Convex model predictive control with learned terminal cost
DPC      Differentiable predictive control
RLC      Reinforcement learning control
RBC      Rule-based control

Indices and Related Constants
τ        Duration of control step
k; K     Lookahead step index; total lookahead steps
t; N     Control step index; total control steps
z        Number of thermal zones in the building

Variables
u        Control action
w        Exogenous inputs
x        State variable in building dynamics
The rest of the paper is arranged as follows: Section 2 covers the problem formulation by describing our building model, cost function, DR programs, general building control problem, and measured signals. Section 3 explains the control methods compared in this work. Training methods for the data-driven methods are described in Section 4. Experimental details and results are presented in Section 5 and conclusions in Section 6.
2. Model and Control Problem Formulation
To investigate building control under demand response
(DR), we focus on controlling the heating, ventilation and air-conditioning (HVAC) system during the summer when only
cooling is necessary; this work naturally extends to heating. In
this section we present the high-level HVAC model and con-
trol problem formulation. Details for setting up experiments
are presented in Section 5.
2.1. Building Thermal Dynamics
Studying building HVAC control requires a building thermal
dynamics model for 1) applying model-based control methods,
2) generating data to train learning-based control methods, and
3) evaluating any control method. We choose the discrete-time
model for a z-zone building,

$T(t+1) = A\,T(t) + B\,\mathrm{diag}(\dot{m}(t))\left(\mathbf{1}_z T_{\mathrm{supply}}(t) - T(t)\right) + G\,w(t)$,   (1)

where $t$ is the control step index, $T$ is the indoor temperature in each zone $i \in \{1, \dots, z\}$: $T = [T_1, \dots, T_z]^{\top}$, $\dot{m}$ is the mass flow rate of air to each zone: $\dot{m} = [\dot{m}_1, \dots, \dot{m}_z]^{\top}$, and $T_{\mathrm{supply}}$ is the temperature of the air supplied to all zones. The control action $u \in \mathbb{R}^{z+1}$ consists of $\dot{m}$ and $T_{\mathrm{supply}}$. The exogenous input $w \in \mathbb{R}^{z+1}$ consists of the outdoor temperature and per-zone solar heat gain. The time-invariant system matrices $A \in \mathbb{R}^{z \times z}$, $B \in \mathbb{R}^{z \times z}$, and $G \in \mathbb{R}^{z \times (z+1)}$ are identified from a higher-fidelity EnergyPlus [31] model, as done in [32]. Frequently used terms are defined again in the nomenclature table for easy referencing.
The term $\mathrm{diag}(\dot{m}(t))\mathbf{1}_z T_{\mathrm{supply}}$ is a product among elements of $u$, so the building model is bilinear and thus nonlinear and non-convex¹. To simplify notation, we subsequently refer to the dynamics as $F$, where

$x(t+1) = F(x(t), u(t), w(t))$,   (2)

and $x(t)$ is the building state, obtained via a linear transformation of $T$.
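For intuition, the bilinear update (1) can be simulated directly. The sketch below uses small, hypothetical system matrices purely to show the structure of the model; the paper's actual A, B, and G are identified from an EnergyPlus model and are not reproduced here.

```python
import numpy as np

def step_dynamics(T, m_dot, T_supply, w, A, B, G):
    """One step of the bilinear zone model, eq. (1):
    T(t+1) = A T(t) + B diag(m_dot(t)) (1_z T_supply(t) - T(t)) + G w(t)."""
    ones_z = np.ones(T.shape[0])
    return A @ T + B @ (np.diag(m_dot) @ (ones_z * T_supply - T)) + G @ w

# Hypothetical 2-zone system; illustrative values only.
z = 2
A = 0.9 * np.eye(z)             # passive thermal dynamics
B = 0.05 * np.eye(z)            # supply-air coupling
G = 0.01 * np.ones((z, z + 1))  # outdoor temperature and per-zone solar gain
T = np.array([26.0, 27.0])      # zone temperatures [deg C]
w = np.array([32.0, 0.5, 0.4])  # [outdoor temp, solar gain zone 1, zone 2]

T_cooled = step_dynamics(T, np.array([0.5, 0.5]), 12.0, w, A, B, G)
T_idle = step_dynamics(T, np.array([0.0, 0.0]), 12.0, w, A, B, G)
print(T_cooled < T_idle)  # supplying 12 deg C air lowers both zone temperatures
```

With zero mass flow the update reduces to the linear part $A\,T + G\,w$, which makes the bilinear coupling between $\dot{m}$ and $T_{\mathrm{supply}}$ easy to see.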
2.2. Power Consumption and Problem Cost
The HVAC power consumption is the sum of chiller and fan power,

$P_{\mathrm{total}}(t) = P_{\mathrm{chiller}}(t) + P_{\mathrm{fan}}(t)$,   (3)

$P_{\mathrm{chiller}}(t) = \frac{\mathbf{1}_z^{\top} \dot{m}(t)}{\mathrm{COP}} \left(T_{\mathrm{out}}(t) - T_{\mathrm{supply}}(t)\right)$,   (4)

$P_{\mathrm{fan}}(t) = k_1 \left(\mathbf{1}_z^{\top} \dot{m}(t)\right)^3 + k_2$,   (5)

where COP is the chiller's coefficient of performance, $T_{\mathrm{out}}$ is the outdoor temperature, and $k_1$ and $k_2$ are known fan parameters.
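Equations (3)-(5) translate directly to code. In the minimal sketch below, the values of COP, $k_1$, and $k_2$ are illustrative placeholders, not the identified parameters used in the paper.

```python
import numpy as np

def hvac_power(m_dot, T_out, T_supply, COP=3.0, k1=0.08, k2=0.5):
    """Total HVAC power, eqs. (3)-(5). COP, k1, k2 are placeholder values."""
    flow = float(np.sum(m_dot))                  # 1_z^T m_dot
    p_chiller = flow / COP * (T_out - T_supply)  # chiller power, eq. (4)
    p_fan = k1 * flow ** 3 + k2                  # cubic fan law, eq. (5)
    return p_chiller + p_fan                     # eq. (3)

p = hvac_power(np.array([0.5, 0.5]), T_out=32.0, T_supply=12.0)
print(round(p, 3))  # 7.247
```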
The cost $c$ of one control step $t$ is the weighted sum of the electricity cost and a penalty for violating the thermal comfort of building residents,

$c(x(t), u(t), w(t)) = \lambda(t)\, P_{\mathrm{total}}(t)\, \tau + \mu \sum_{i=1}^{z} P_{[\underline{T}(t), \overline{T}(t)]}\left(T_i(t)\right)$,   (6)

where $\lambda$ is the price of electricity discussed in Section 2.3 and $\tau$ is the duration of a control step. The penalty uses the band deviation function which, for a range of desired values $\langle \underline{x}, \overline{x} \rangle$, is

$P_{[\underline{x}, \overline{x}]}(x) := \Big( \underbrace{\max(0, \underline{x} - x)}_{\text{lower constraint violation}} + \underbrace{\max(0, x - \overline{x})}_{\text{upper constraint violation}} \Big)^2$.   (7)

The thermal comfort band for our control problem, $\langle \underline{T}(t), \overline{T}(t) \rangle$, varies in time. Deviations from the band are penalized at a fixed per-unit price $\mu$.

¹ $\mathrm{diag}(\dot{m}(t))$ denotes a diagonal matrix with the elements of $\dot{m}(t)$ along the diagonal and $\mathbf{1}_z$ denotes a $z \times 1$ vector of ones.
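The band-deviation penalty (7) and step cost (6) can be sketched as follows; the values of $\tau$ and $\mu$ below are illustrative placeholders rather than the paper's experimental settings.

```python
import numpy as np

def band_deviation(x, lo, hi):
    """Squared band-deviation penalty, eq. (7): zero inside [lo, hi]."""
    return (np.maximum(0.0, lo - x) + np.maximum(0.0, x - hi)) ** 2

def step_cost(p_total, price, T, T_lo, T_hi, tau=0.25, mu=50.0):
    """Step cost, eq. (6): energy cost plus comfort penalty.
    tau (hours per step) and mu are illustrative placeholder values."""
    comfort = float(np.sum(band_deviation(T, T_lo, T_hi)))
    return price * p_total * tau + mu * comfort

print(band_deviation(24.0, 22.0, 26.0))  # 0.0 inside the comfort band
print(band_deviation(27.5, 22.0, 26.0))  # 2.25 = (27.5 - 26)^2
c = step_cost(p_total=10.0, price=0.2, T=np.array([24.0, 27.0]),
              T_lo=22.0, T_hi=26.0)
print(c)  # 50.5: 0.5 of energy cost plus 50 * 1.0 of comfort penalty
```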
2.3. Demand Response (DR) Programs
To thoroughly compare controller performance for building
control under DR, we consider three common DR programs:
time-of-use pricing (TOU), real-time pricing (RTP), and power-constrained (PC). Each program results in a different price for
electricity λ(t). TOU uses a preset price for each time of day,
RTP uses a dynamically-varying price that depends on the hour-
by-hour wholesale electricity market, and PC uses a constant
price but includes an extra term in (6) to penalize consump-
tion above a given power limit during a power-constraint event.
Details of each DR program are presented in Section 5.1.1.
2.4. Building Control Problem Formulation
Building control minimizes the cost of keeping the zone temperatures within a comfort band over an N-step horizon while satisfying the building thermal dynamics $F$ and constraints on HVAC actuation:

$\min_{u_0, \dots, u_{N-1}} \quad C = \sum_{t=0}^{N-1} c(x_t, u_t, w_t)$   (8a)

s.t. $\; x_{t+1} = F(x_t, u_t, w_t)$,   (8b)

$\quad\;\; x_t \in \mathcal{X}, \; u_t \in \mathcal{U}$,   (8c)

$\quad\;\; x_0 = x(0)$.   (8d)

The initial building state is $x(0)$. In general, we constrain $x$ and $u$ to lie in sets $\mathcal{X}$ and $\mathcal{U}$, respectively. In this work, $x$ is unconstrained, $\mathcal{X} = \mathbb{R}^z$, and each element of $u$ is constrained by upper and lower bounds, making $\mathcal{U}$ a box polyhedron in $\mathbb{R}^{z+1}$.
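For intuition, a toy single-zone instance of problem (8) can be solved by single shooting: roll the dynamics forward through the cost and optimize the control sequence directly. The sketch below uses scipy's bounded quasi-Newton solver as a stand-in for the IPOPT solver used in the paper, and every coefficient (dynamics, prices, comfort band) is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

N = 8
a, b, g = 0.95, 0.08, 0.045        # scalar analogues of A, B, G (hypothetical)
w = np.full(N, 30.0)               # outdoor-temperature forecast
T0, T_lo, T_hi = 27.0, 22.0, 26.0  # initial state and comfort band
price, tau, mu = 0.2, 0.25, 10.0
T_sup = 12.0                       # fixed supply-air temperature

def total_cost(m):                 # decision variable: mass flow at each step
    T, cost = T0, 0.0
    for t in range(N):
        p = m[t] / 3.0 * (w[t] - T_sup) + 0.1 * m[t] ** 3    # eqs. (4)-(5)
        viol = max(0.0, T_lo - T) + max(0.0, T - T_hi)
        cost += price * p * tau + mu * viol ** 2             # eq. (6)
        T = a * T + b * m[t] * (T_sup - T) + g * w[t]        # scalar eq. (1)
    return cost

# Box bounds on u implement the polyhedral set U; x stays unconstrained.
res = minimize(total_cost, x0=np.full(N, 0.3), bounds=[(0.0, 1.0)] * N)
print(res.fun <= total_cost(np.full(N, 0.3)))  # no worse than the initial guess
```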
As the first step for comparing a spectrum of building controllers and to focus on their core capabilities, we assume:

A.1 There is no error in the building model. This translates to using the same model in our controllers and building simulation.

A.2 Forecasts are accurate up to a K-step lookahead. This means at step t the disturbances $\{w(t), w(t+1), \dots, w(t+K-1)\}$ are accurately known. We investigate how the choice of K affects each controller.
Given our baseline comparison, future work can investigate the effect of model and forecast error, such as by integrating [33]. By first comparing the control methods without these errors, we make it possible for future work to isolate to what extent one method outperforms another due to differences in baseline performance, sensitivity to model error, and sensitivity to forecast error.
2.5. Measured signals
All controllers in this study rely on signals from the building and environment. To enable a fair comparison, at step t, all controllers use the following signals for decision-making:

• the current zone temperature $T(t)$ or system state $x(t)$,
• the current time step $t$,
• K-step forecasts of disturbances from the current time, $w_K(t) = \{w(t), w(t+1), \dots, w(t+K-1)\}$,
• DR-specific information $\Lambda$, as detailed in Section 5.

As will be presented in Section 3, some control methods use these signals directly in an optimization formulation, while others transform them into a feature vector, $s(t) = g(x(t), w_K(t), \Lambda(t), t)$. Details of the mapping function $g(\cdot)$ are given in Section 5.1.2.
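The paper's exact mapping $g(\cdot)$ is given in its Section 5.1.2; as a purely hypothetical illustration of its role, one could concatenate the state, the flattened forecast, the DR signal, and a normalized time index:

```python
import numpy as np

def g(x, w_forecast, dr_info, t, N):
    """Hypothetical feature map s(t) = g(x(t), w_K(t), Lambda(t), t); the
    paper's actual mapping may differ."""
    return np.concatenate([
        np.atleast_1d(x),                 # current building state
        np.asarray(w_forecast).ravel(),   # K-step disturbance lookahead
        np.atleast_1d(dr_info),           # e.g. upcoming prices or power limit
        [t / N],                          # normalized control step index
    ])

s = g(x=np.array([24.0, 25.0]), w_forecast=np.ones((3, 3)),
      dr_info=np.array([0.2, 0.2]), t=12, N=96)
print(s.shape)  # (14,)
```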
3. Control Methods
3.1. Open-loop Optimal Control Baseline (OPT)
The control problem (8) can be solved in an open-loop man-
ner with no feedback by solving the problem once at time step
t=0 and then applying the resulting control actions to the sys-
tem — we do so using the numerical solver IPOPT [34]. Since
we assume there is no error in the building model and forecasts
of exogenous inputs (recall assumptions A.1 and A.2), this ap-
proach gives the optimal control actions. As such, we refer to
it as the optimal control baseline (OPT) and use it to evaluate
other control methods, which are approximations to OPT.
On a real system with modeling and forecast errors, solving
the open-loop problem (8) would work poorly, due to accumu-
lation of errors. The subsequent sections present the controllers
studied in this paper, which approximate OPT as shown in Fig-
ure 2. The approximation methods vary, in particular in their
use of the building model and learning.
3.2. Model Predictive Control (MPC)
Where OPT solves the control problem (8) once over the full
control horizon, model predictive control (MPC) instead solves
an analogous problem at every control step index $t$ over a lookahead horizon $K \leq N$. This updates the optimal control action online to account for any modeling mismatches and respond to updated disturbance forecasts. The MPC problem is
$\min_{u_t, \dots, u_{t+K-1}} \quad \sum_{k=t}^{t+K-1} c(x_k, u_k, w_k)$   (9a)

s.t. $\; x_{k+1} = F(x_k, u_k, w_k)$,   (9b)

$\quad\;\; x_k \in \mathcal{X}, \; u_k \in \mathcal{U}$,   (9c)

$\quad\;\; x_t = x(t)$,   (9d)

where $k$ is the lookahead step index and $t$ is the control step index. MPC solves problem (9) once for each value of $t$ from 0 to $N-1$. The non-convex optimization problem (9) is solved online using IPOPT; once the solution, $\{u^*_t, \dots, u^*_{t+K-1}\}$, is found, the first action is applied to the system ($u(t) = u^*_t$). Then the new state, $x(t+1)$, is observed and problem (9) is solved again, starting from time $t+1$.
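The receding-horizon logic just described is independent of the particular solver. A minimal skeleton might look like the following, where a trivial proportional rule stands in for the IPOPT horizon solve and the scalar system is entirely hypothetical:

```python
import numpy as np

def mpc_loop(x0, N, K, solve_horizon, dynamics, forecast):
    """Receding-horizon skeleton of problem (9): at each control step t,
    solve over a K-step lookahead, apply only the first planned action,
    then re-solve from the newly observed state."""
    x, actions = x0, []
    for t in range(N):
        w_K = forecast(t, K)           # K-step disturbance forecast
        u_plan = solve_horizon(x, w_K) # returns u*_t, ..., u*_{t+K-1}
        u = u_plan[0]                  # apply only the first action
        actions.append(u)
        x = dynamics(x, u, w_K[0])     # observe the next state
    return np.array(actions), x

# Hypothetical scalar system exercising the skeleton.
acts, xN = mpc_loop(
    x0=27.0, N=4, K=3,
    solve_horizon=lambda x, w: [0.1 * max(0.0, x - 24.0)] * len(w),
    dynamics=lambda x, u, w: 0.9 * x - u + 0.02 * w,
    forecast=lambda t, K: [30.0] * K,
)
print(len(acts))  # 4: one applied action per control step
```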
MPC does not give the same control actions as OPT, because
the objective functions in problems (8) and (9) are not the same;
MPC is an approximation to OPT, meaning it gives suboptimal
control performance. Choosing the total number of lookahead
steps K requires balancing control performance and computational cost. When we neglect error, a larger K improves the approximation to OPT and thus improves control performance but also increases computational cost. Note that when error is included, increasing K does not necessarily improve control performance, since it improves the approximation to OPT but also increases the accumulation of error.
Beyond the choice of K, some MPC implementations aim to
improve control performance by adding a terminal cost function, $c_{t+K}(x_{t+K})$, which models the cost of ending in state $x_{t+K}$. The optimal terminal cost can in theory be found through dynamic programming (DP) [35], where a sequence of one-step control problems is solved backwards in time (starting with $t = N-1$) using the Bellman equation. However, even for discrete-time finite-horizon problems such as (8), DP requires optimizing over the space of functions and is thus an infinite-dimensional problem. This makes it challenging to determine an appropriate $c_{t+K}(x_{t+K})$; for simplicity, we do not include a terminal cost in our MPC implementation. However, we explore using learning to find the terminal cost in Section 3.4.
3.3. Convex Model Predictive Control (MPC-C)
Besides choosing a small lookahead horizon K, one can re-
duce the computational cost of MPC by using a simpler model,
such as a convex one. We denote MPC with a convex building
model as MPC-C. The convex model is the same as that de-
scribed in Section 2.1 except that the non-convex bilinear terms
in $c$ and $F$ are linearized around the system's current operating
point. The MPC-C problem is
$\min_{u_t, \dots, u_{t+K-1}} \quad \sum_{k=t}^{t+K-1} \tilde{c}(x_k, u_k, w_k)$   (10a)

s.t. $\; x_{k+1} = \tilde{F}(x_k, u_k, w_k)$,   (10b)

$\quad\;\; x_k \in \mathcal{X}, \; u_k \in \mathcal{U}$,   (10c)

$\quad\;\; x_t = x(t)$,   (10d)

where $\tilde{c}$ and $\tilde{F}$ are the convex cost and dynamics.
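As one way to see where $\tilde{F}$ might come from, the per-zone bilinear term $\dot{m}_i (T_{\mathrm{supply}} - T_i)$ can be replaced by its first-order Taylor expansion around the current operating point. This is a sketch of the idea only; the paper's exact linearization may differ.

```python
def linearize_bilinear(m0, Ts0, T0):
    """First-order Taylor expansion of b(m, Ts, T) = m * (Ts - T) around
    an operating point (m0, Ts0, T0), yielding a term that is linear in
    each decision variable, as used to build a convex surrogate model."""
    def b_lin(m, Ts, T):
        return (m0 * (Ts0 - T0)          # value at the operating point
                + (Ts0 - T0) * (m - m0)  # d/dm
                + m0 * (Ts - Ts0)        # d/dTs
                - m0 * (T - T0))         # d/dT
    return b_lin

b_lin = linearize_bilinear(m0=0.4, Ts0=12.0, T0=25.0)
err = abs(b_lin(0.41, 12.1, 25.2) - 0.41 * (12.1 - 25.2))
print(err < 1e-2)  # True: the approximation is tight near the operating point
```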
3.4. Convex Model Predictive Control with Learned Terminal
Cost (MPC-CL)
Reinforced MPC (RL-MPC) is a hybrid control approach that addresses the challenge of determining an appropriate terminal cost function, $c_{t+K}(x_{t+K})$; this challenge was described in Section 3.2. RL-MPC includes the terminal cost in MPC by parameterizing the terminal cost, $c_{t+K}(x_{t+K}; \theta)$, and learning the optimal parameters $\theta$ offline through training. During training, for a batch of data and a control timestep, one solves the MPC problem over the lookahead horizon, evaluates the actual cost