
II. RELATED WORK
In [3], Paudel and Gooi proposed a pricing strategy for a
peer-to-peer (P2P) energy trading platform using the alternating
direction method of multipliers (ADMM). They introduced a P2P market
operator that offers prices and provides a platform for energy
trading from prosumers to consumers. Given the many factors
influencing the energy trading price, the authors chose to
focus more on the social welfare of the community microgrid.
They modeled the personal satisfaction of each household
considering consumption and used it to define a function
of welfare that depends on the per unit price of energy.
The optimization process determines the best energy platform
price that maximizes the aggregated welfare of the community
microgrid. However, this work did not consider the dynamic
nature of energy pricing and provided only a static solution
that does not vary over time.
In contrast, Kim, Zhang, et al. [4] were among the first to
apply reinforcement learning (RL) to set an optimal retail price
based on the dynamics of customer behavior and changes in
electricity cost. More specifically, they formulated the dynamic
pricing problem as a Markov decision process (MDP), where a
service provider chooses a retail energy price at each
time step t. They defined the cost as the weighted sum of
the service provider’s cost and the customers’ cost at each
time step. They solved this MDP problem by adopting a Q-
learning algorithm with some proposed improvements. Apart
from dynamic pricing, the authors also considered the case
where customers can schedule their energy consumption based
on the observed energy price to minimize their long-term cost,
which turns this problem into a multi-agent learning case.
However, the authors did not consider the prosumers’ energy
generation capability that largely influences the smart-grid
dynamics and impacts the retail energy price. Furthermore,
the Q-learning algorithm used in this work requires substantial
memory to store the state-action values and converges slowly,
which makes it inefficient for larger state spaces. Our
formulation of the reward function is inspired by this work.
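To make the MDP view above concrete, the following is a minimal sketch of a Q-learning update for a cost-minimizing pricing agent. It is an illustration only, not the algorithm of [4]: the weight `rho`, the learning rate `alpha`, the discount `gamma`, the exploration rate `epsilon`, and the discrete price set `actions` are all hypothetical names and values introduced here.

```python
import random
from collections import defaultdict


def weighted_cost(sp_cost, customer_cost, rho=0.5):
    """Weighted sum of provider and customer cost (rho is an assumed weight)."""
    return rho * sp_cost + (1.0 - rho) * customer_cost


def q_update(q, state, action, cost, next_state, actions, alpha=0.1, gamma=0.95):
    """Apply one Q-learning step; costs are minimized, so the target uses min."""
    best_next = min(q[(next_state, a)] for a in actions)
    target = cost + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


def choose_price(q, state, actions, epsilon=0.1, rng=random):
    """Epsilon-greedy choice over a discrete set of candidate retail prices."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return min(actions, key=lambda a: q[(state, a)])


# Hypothetical usage: two candidate price indices, all Q-values start at zero.
q_table = defaultdict(float)
step_cost = weighted_cost(sp_cost=4.0, customer_cost=2.0, rho=0.5)
q_update(q_table, state=0, action=0, cost=step_cost, next_state=1, actions=[0, 1])
```

The table holds one entry per visited state-action pair, which is the memory cost noted above: the number of entries grows with the product of the state and action counts.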
Likewise, the authors of [7] formulated multi-timescale dispatch
and scheduling for a smart-grid model as an MDP problem
considering the uncertainty of wind generation and energy
demand. Specifically, they proposed dispatching and pricing
at two timescales: real-time and day-ahead scheduling.
While the authors made a substantial contribution to the
integration of wind power into the bulk power grid, they did not
consider customers who can generate wind power and actively trade
energy with other customers within a smart grid.
Other approaches propose statistical regression models,
which identify the set of independent variables required for the
complex process of forecasting the electricity price. The authors
of [6] argue that there is no one-size-fits-all set of variables
and hence narrowed their scope by selecting 19 variables
based on the characteristics of the UK energy market. They
performed a multivariable regression using gradient boosting,
random forests, and XGBoost, where the task of each
model was to forecast the electricity price 1 to 12 hours
ahead.
Instead of focusing on maximizing social welfare, Joe-Wong,
Sen, et al. [5] approached the price-offering optimization
problem from the service provider’s point of view, maximizing
its revenue. By assessing consumers’ device-specific scheduling
flexibility and modeling their willingness to shift
energy consumption to off-peak periods, the authors formulated
an optimization problem to determine cost-minimizing
prices for service providers. The authors also argue that real-
time pricing is less customer friendly than day-ahead price
scheduling since it does not allow customers to plan their
activities in advance and thus creates more uncertainty.
Table I compares our proposed framework with previous
studies on dynamic pricing mechanisms for smart grid
scenarios.
III. PROBLEM FORMULATION
In this work, we define a microgrid composed of a service
provider (SP), a set of prosumers P, a set of consumers C,
and a community battery. We consider a temporally dynamic
microgrid, where at each time step t, the SP adopts a retail
energy price $a_t : \mathbb{R}_+ \mapsto \mathbb{R}_+$ and a purchase energy price
$p_t : \mathbb{R}_+ \mapsto \mathbb{R}_+$. SP uses $a_t$ to charge both consumers and
prosumers depending on their total load demand and uses
$p_t$ to calculate how much it has to pay to the prosumers
for their energy surplus. In other words, the SP sets both
the retail price at which it sells energy and the purchase price
at which it buys surplus energy from prosumers. Furthermore, the SP can
also purchase the microgrid’s energy requirements from the
utility grid (UG) using a fixed cost function. We also consider
a shared community battery that facilitates energy trading
within the microgrid by storing the surplus energy and partially
covering the customers’ demands when requested.
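As an illustration of the setup above, the following minimal sketch settles a single time step from the SP's perspective. Several details are assumptions made only for this example and are not taken from the formulation: the pricing functions are linear (`LinearPrice`), each function maps an energy amount to a monetary amount, the UG charges a flat per-kWh rate (`ug_unit_cost`), and the battery is charged or discharged greedily.

```python
from dataclasses import dataclass


@dataclass
class LinearPrice:
    """Hypothetical pricing function f(x) = slope * x + intercept."""
    slope: float
    intercept: float = 0.0

    def __call__(self, energy_kwh: float) -> float:
        return self.slope * energy_kwh + self.intercept


def sp_step(retail, purchase, demands, surpluses,
            battery_kwh, battery_cap_kwh, ug_unit_cost):
    """Settle one time step; return (SP profit, new battery level)."""
    revenue = sum(retail(d) for d in demands)              # charged via a_t
    payout = sum(purchase(s) for s in surpluses if s > 0)  # paid via p_t
    net = sum(surpluses) - sum(demands)                    # > 0 means excess
    if net >= 0:
        # Store the excess, limited by the remaining battery capacity.
        new_battery = min(battery_cap_kwh, battery_kwh + net)
        ug_energy = 0.0
    else:
        # Cover the deficit from the battery first, then from the UG.
        discharge = min(battery_kwh, -net)
        new_battery = battery_kwh - discharge
        ug_energy = -net - discharge
    profit = revenue - payout - ug_unit_cost * ug_energy
    return profit, new_battery


# Hypothetical usage with made-up numbers.
profit, level = sp_step(
    retail=LinearPrice(0.2), purchase=LinearPrice(0.1),
    demands=[5.0, 3.0], surpluses=[2.0],
    battery_kwh=1.0, battery_cap_kwh=10.0, ug_unit_cost=0.3,
)
```

In this sketch the battery only shifts energy within the microgrid; any shortfall it cannot cover is bought from the UG at the assumed flat rate.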
We assume that the set of retail pricing functions’ and
the set of purchase pricing functions’ coefficients are both
Table I: Summary of Related Work (X: considered, -: not considered)
Work      Approach             Price       Real  Prosumers’ energy        Shared battery
                               prediction  data  generation capabilities  system
[3]       Optimization (ADMM)  X           -     X                        -
[5]       Optimization         X           X     -                        -
[6]       ML Regression        X           X     X                        -
[7]       MDP                  X           -     -                        -
[4]       RL (Q-Learning)      X           X     -                        -
Our work  RL (DQN)             X           X     X                        X