Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories
Qinqing Zheng 1, Mikael Henaff 1, Brandon Amos 1, Aditya Grover 2

1 Meta AI Research, 2 UCLA. Correspondence to: Qinqing Zheng <zhengqinqing@gmail.com>. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
Natural agents can effectively learn from multiple
data sources that differ in size, quality, and types
of measurements. We study this heterogeneity in
the context of offline reinforcement learning (RL)
by introducing a new, practically motivated semi-
supervised setting. Here, an agent has access to
two sets of trajectories: labelled trajectories con-
taining state, action and reward triplets at every
timestep, along with unlabelled trajectories that
contain only state and reward information. For
this setting, we develop and study a simple meta-
algorithmic pipeline that learns an inverse dynam-
ics model on the labelled data to obtain proxy-
labels for the unlabelled data, followed by the use
of any offline RL algorithm on the true and proxy-
labelled trajectories. Empirically, we find this
simple pipeline to be highly successful — on sev-
eral D4RL benchmarks (Fu et al.,2020), certain
offline RL algorithms can match the performance
of variants trained on a fully labelled dataset even
when we label only 10% of trajectories which are
highly suboptimal. To strengthen our understand-
ing, we perform a large-scale controlled empirical
study investigating the interplay of data-centric
properties of the labelled and unlabelled datasets,
with algorithmic design choices (e.g., choice of
inverse dynamics, offline RL algorithm) to iden-
tify general trends and best practices for training
RL agents on semi-supervised offline datasets.
1 Introduction
One of the key challenges with deploying reinforcement
learning (RL) agents is their prohibitive sample complexity
for real-world applications. Offline reinforcement learn-
ing (RL) can significantly reduce the sample complexity
by exploiting logged demonstrations from auxiliary data
sources (Levine et al.,2020). Standard offline RL as-
sumes fully logged datasets: the trajectories are complete
sequences of observations, actions, and rewards. However,
contrary to curated benchmarks in use today, the nature of
offline demonstrations in the real world can be highly varied.
For example, the demonstrations could be misaligned due
to frequency mismatch (Burns et al.,2022), use different
sensors, actuators, or dynamics (Reed et al.,2022;Lee et al.,
2022), or lack partial state (Ghosh et al.,2022;Rafailov
et al.,2021;Mazoure et al.,2021) or reward information (Yu
et al.,2022). Successful offline RL in the real world requires
embracing these heterogeneous aspects for maximal data
efficiency, similar to learning in humans.
In this work, we propose a new and practically motivated
semi-supervised setup for offline RL: the offline dataset
consists of some action-free trajectories (which we call un-
labelled) in addition to the standard action-complete trajec-
tories (which we call labelled). In particular, we are mainly
interested in the case where a significant majority of the
trajectories in the offline dataset are unlabelled, and the un-
labelled data might have different qualities than the labelled
ones. One motivating example for this setup is learning
from videos (Schmeckpeper et al.,2020a;b) or third-person
demonstrations (Stadie et al.,2017;Sharma et al.,2019).
There are tremendous amounts of internet videos that can
be potentially used to train RL agents, yet they are without
action labels and are of varying quality. Notably, our setup
has two key properties that differentiate it from traditional
semi-supervised learning:
First, we do not assume that the distribution of the labelled
and unlabelled trajectories are necessarily identical. In
realistic scenarios, we expect these to be different, with unlabelled data having higher returns than labelled data: e.g., videos of a human professional are easy to obtain, whereas precisely measuring their actions is challenging. We repli-
cate such varied data quality setups in some of our experi-
ments; Figure 1.1 shows an illustration of the difference in
returns between the labelled and unlabelled dataset splits
using the hopper-medium-expert D4RL dataset.
Second, our end goal goes beyond labelling the actions in the unlabelled trajectories: we intend to use the unlabelled data to learn a downstream policy that is better than the behavioral policies used for generating the offline datasets.
Figure 1.1: An example of the return distribution of the
labelled and unlabelled datasets.
Correspondingly, there are two kinds of generalization chal-
lenges in the proposed setup: (i) generalizing from the la-
belled to the unlabelled data distribution and then (ii) going
beyond the offline data distributions to get closer to the
expert distribution. Regular offline RL is only concerned
with the latter, and standard algorithms such as Conservative Q-Learning (CQL; Kumar et al., 2020), TD3BC (Fujimoto & Gu, 2021), or Decision Transformer (DT; Chen et al., 2021) cannot directly operate on such unlabelled trajectories. At the same time, naïvely throwing out the unlabelled trajectories can be wasteful, especially when they
have high returns. Thus, our paper seeks to answer the
following question:
How can we best leverage the unlabelled data to im-
prove the performance of offline RL algorithms?
To answer this question, we study different approaches to
train policies in the semi-supervised setup described above,
and propose a meta-algorithmic pipeline, Semi-Supervised Offline Reinforcement Learning (SS-ORL). SS-ORL contains three simple steps: (1) train an inverse dynamics model (IDM) on the labelled data, which predicts actions based on transition sequences, (2) fill in proxy-actions for the unlabelled data, and finally (3) train an offline RL agent on the combined dataset.
The main takeaway of our paper is:
Given low-quality labelled data, SS-ORL agents can exploit unlabelled data containing high-quality trajectories to improve performance. The absolute performance of SS-ORL is close to or even matches that of the oracle agents, which have access to complete action information for both labelled and unlabelled trajectories.
From a technical standpoint, we address the limitations of the classic IDM (Pathak et al., 2017) by proposing a novel stochastic multi-transition IDM that incorporates previous states to account for non-Markovian behavior policies. To enable compute- and data-efficient learning, we conduct thorough ablation studies to understand how the performance of SS-ORL agents is affected by the algorithmic design choices, and how it varies as a function of data-centric properties such as the size and return distributions of the labelled and unlabelled datasets. We highlight a few predominant trends from our experimental findings below:
1. Proxy-labelling is an effective way to utilize unlabelled data. For example, SS-ORL instantiated with DT as the offline RL method significantly outperforms an alternative DT-based approach without proxy-labelling.

2. Simply training the IDM on the labelled dataset outperforms more sophisticated semi-supervised protocols such as self-training (Fralick, 1967).

3. Incorporating past information into the IDM improves generalization.

4. The performance of SS-ORL agents critically depends on factors such as the size and quality of the labelled and unlabelled datasets, but the effect magnitudes depend on the offline RL method. For example, we found that TD3BC is less sensitive to missing actions than DT and CQL.
2 Related Work
Offline RL  The goal of offline RL is to learn effective policies from fixed datasets generated by unknown behavior policies. There are two main categories of model-free offline RL methods: value-based methods and behavior cloning (BC) based methods.
Value-based methods attempt to learn value functions based
on temporal difference (TD) updates. There is a line of
work that aims to port existing off-policy value-based on-
line RL methods to the offline setting, with various types
of additional regularization components that encourage the
learned policy to stay close to the behavior policy. Several
representative techniques include specifically tailored pol-
icy parameterizations (Fujimoto et al.,2019;Ghasemipour
et al.,2021), divergence-based regularization on the learned
policy (Wu et al.,2019;Jaques et al.,2019;Kumar et al.,
2019), and regularized value function estimation (Nachum
et al.,2019;Kumar et al.,2020;Kostrikov et al.,2021a;
Fujimoto & Gu,2021;Kostrikov et al.,2021b).
A growing body of recent work formulates offline RL as
a supervised learning problem (Chen et al.,2021;Janner
et al.,2021;Emmons et al.,2021). Compared with value-
based methods, these supervised methods enjoy several
appealing properties including algorithmic simplicity and
training stability. Generally speaking, these approaches can
be viewed as conditional behavior cloning methods (Bain &
Sammut,1995), where the conditioning is based on goals
or returns. Similar to value-based methods, these can be
extended to the online setup as well (Zheng et al.,2022)
and demonstrate excellent performance in hybrid setups
involving both offline data and online interactions.
Semi-Supervised Learning Semi-supervised learning
(SSL) is a sub-area of machine learning that studies ap-
proaches to train predictors from a small amount of labelled
data combined with a large amount of unlabelled data. In
supervised learning, predictors only learn from labelled data.
However, labelled training examples often require human
annotation efforts and are thus hard to obtain, whereas un-
labelled data can be comparatively easy to collect. The
research on semi-supervised learning spans several decades.
One of the oldest SSL techniques, self-training, was orig-
inally proposed in the 1960s (Fralick,1967). There, the
predictor is first trained on the labelled data. Then, at each
training round, according to certain selection criteria such
as model uncertainty, a portion of the unlabelled data is
annotated by the predictor and added into the training set
for the next round. This process is repeated multiple times.
We refer the readers to Zhu (2005); Chapelle et al. (2006);
Ouali et al. (2020); Van Engelen & Hoos (2020) for com-
prehensive literature surveys.
Imitation Learning from Observations There have
been several works in imitation learning (IL) which do
not assume access to the full set of actions, such as
BCO (Torabi et al.,2018a), MoBILE (Kidambi et al.,2021),
GAIfO (Torabi et al.,2018b) or third-person IL approaches
(Stadie et al.,2017;Sharma et al.,2019). The recent work
of Baker et al. (2022) also considered a setup where a small
number of labelled actions are available in addition to a large
unlabelled dataset. A key difference with our work is that
the IL setup typically assumes that all trajectories are gen-
erated by an expert, unlike our offline setup. Further, some
of these methods even permit reward-free interactions with
the environment which is not possible in the offline setup.
Learning from Videos Several works consider training
agents with human video demonstrations (Schmeckpeper
et al.,2020a;b), which are without action annotations. Dis-
tinct from our setup, some of these works allow for online
interactions, assume expert videos, and more broadly, video
data typically specifies agents with different embodiments.
3 Semi-Supervised Offline RL
Preliminaries  We model our environment as a Markov decision process (MDP) (Bellman, 1957) denoted by $\langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$. The environment then moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and also returns the agent a reward $r_t = R(s_t, a_t)$.
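To make this interface concrete, the following is a minimal sketch of collecting one trajectory of $(s_t, a_t, r_t, s_{t+1})$ transitions. It assumes the Gymnasium package with the MuJoCo Hopper environment is installed and uses a random behavior policy purely for illustration; it is not part of the paper's method.

```python
# Minimal illustration of the MDP interface above: at each timestep the agent
# observes s_t, executes a_t, and receives r_t and the next state s_{t+1}.
# Assumes `gymnasium` with MuJoCo environments is available (illustrative setup).
import gymnasium as gym

env = gym.make("Hopper-v4")
state, _ = env.reset(seed=0)

trajectory = []  # list of (s_t, a_t, r_t, s_{t+1}) transitions
for t in range(1000):
    action = env.action_space.sample()  # a_t drawn from a (random) behavior policy
    next_state, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
    if terminated or truncated:
        break
```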
3.1 Proposed Setup
We assume the agent has access to a static offline dataset $\mathcal{T}_{\text{offline}}$. The dataset consists of trajectories collected by unknown policies, which are generally suboptimal. Let $\tau$ denote a trajectory and $|\tau|$ denote its length. We assume that all the trajectories in $\mathcal{T}_{\text{offline}}$ contain complete rewards and states. However, only a small subset of them contain actions. We are interested in learning a policy by leveraging the offline dataset without interacting with the environment. This setup is analogous to semi-supervised learning, where actions serve the role of labels. Hence, we also refer to the complete trajectories as labelled data (denoted by $\mathcal{T}_{\text{labelled}}$) and the action-free trajectories as unlabelled data (denoted by $\mathcal{T}_{\text{unlabelled}}$). Further, we assume the labelled and unlabelled data are sampled from two distributions $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$, respectively. In general, the two distributions can be different. One case we are particularly interested in is when $P_{\text{labelled}}$ generates low-to-moderate quality trajectories, whereas $P_{\text{unlabelled}}$ generates trajectories of diverse qualities including ones with high returns, as shown in Fig 1.1.
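To make the data assumptions concrete, here is a minimal sketch of one possible representation of labelled and unlabelled trajectories: every trajectory carries states and rewards, and only labelled ones carry actions. The `Trajectory` container and field names are our own illustration, not the paper's released code.

```python
# Illustrative containers for the semi-supervised offline dataset.
from dataclasses import dataclass
from typing import Optional, List, Tuple
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray                     # shape (T + 1, state_dim)
    rewards: np.ndarray                    # shape (T,)
    actions: Optional[np.ndarray] = None   # shape (T, action_dim), or None if unlabelled

    @property
    def is_labelled(self) -> bool:
        return self.actions is not None

def split_dataset(trajectories: List[Trajectory]) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split an offline dataset into its labelled and unlabelled parts."""
    labelled = [tau for tau in trajectories if tau.is_labelled]
    unlabelled = [tau for tau in trajectories if not tau.is_labelled]
    return labelled, unlabelled
```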
Our setup shares some similarities with state-only imitation learning (Ijspeert et al., 2002; Bentivegna et al., 2002; Torabi et al., 2019) in the use of action-unlabelled trajectories. However, there are two fundamental differences. First, in state-only IL, the unlabelled demonstrations are from the same distribution as the labelled demonstrations, and both are generated by a near-optimal expert policy. In our setting, $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$ can be different and are not assumed to be optimal. Second, many state-only imitation learning algorithms (e.g., Gupta et al. (2017); Torabi et al. (2018a;b); Liu et al. (2018); Sermanet et al. (2018)) permit (reward-free) interactions with the environments similar to their original counterparts (e.g., Ho & Ermon (2016); Kim et al. (2020)). This is not allowed in our offline setup, where the agents are only provided with $\mathcal{T}_{\text{labelled}}$ and $\mathcal{T}_{\text{unlabelled}}$.
3.2 Training Pipeline
RL policies trained on low to moderate quality offline tra-
jectories are often sub-optimal, as many of the trajectories
might not have high returns and only cover a limited part
of the state space. Our goal is to find a way to combine the
action labelled trajectories and the unlabelled action-free
trajectories, so that the offline agent can exploit structures
in the unlabelled data to improve performance.
One natural strategy is to fill in proxy actions for those unla-
belled trajectories, and use the proxy-labelled data together
with the labelled data as a whole to train an offline RL agent.
Since we assume both the labelled and unlabelled trajec-
tories contain the states, we can train an inverse dynamics
model (IDM) ϕthat predicts actions using the states. Once
we obtain the IDM, we use it to generate the proxy actions
for the unlabelled trajectories. Finally, we combine those
proxy-labelled trajectories with the labelled trajectories, and
train an agent using the offline RL algorithm of choice. Our
meta-algorithmic pipeline is summarized in Algorithm 1.
Algorithm 1: Semi-supervised offline RL (SS-ORL)
Input: trajectories $\mathcal{T}_{\text{labelled}}$ and $\mathcal{T}_{\text{unlabelled}}$, IDM transition size $k$, offline RL algorithm ORL
// train a stochastic multi-transition IDM using the labelled data
$\hat{\theta} \leftarrow \arg\min_\theta \sum_{(a_t, s_{t,k}) \in \mathcal{T}_{\text{labelled}}} \big[ -\log \phi_\theta(a_t \mid s_{t,k}) \big]$
// fill in the proxy actions for the unlabelled data
$\mathcal{T}_{\text{proxy}} \leftarrow \emptyset$
for each trajectory $\tau \in \mathcal{T}_{\text{unlabelled}}$ do
    $\hat{a}_t \leftarrow \mu_{\hat{\theta}}(s_{t,k})$, i.e. the mean of $\mathcal{N}\big(\mu_{\hat{\theta}}(s_{t,k}), \Sigma_{\hat{\theta}}(s_{t,k})\big)$, for $t = 1, \ldots, |\tau|$
    $\tau_{\text{proxy}} \leftarrow \tau$ with proxy actions $\{\hat{a}_t\}_{t=1}^{|\tau|}$ filled in
    $\mathcal{T}_{\text{proxy}} \leftarrow \mathcal{T}_{\text{proxy}} \cup \{\tau_{\text{proxy}}\}$
// train an offline RL agent using the combined data
$\pi \leftarrow$ policy trained by ORL using the dataset $\mathcal{T}_{\text{labelled}} \cup \mathcal{T}_{\text{proxy}}$
Output: $\pi$
In particular, we propose a novel stochastic multi-transition IDM that incorporates past information to enhance the treatment of stochastic MDPs and non-Markovian behavior policies. Section 3.2.1 discusses the details.
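For concreteness, the following Python sketch mirrors the three stages of Algorithm 1 at a high level. The helper functions `train_idm`, `predict_actions`, and `train_offline_rl`, and the `with_actions` method, are hypothetical placeholders standing in for the components described in the text; they are not APIs from the released code.

```python
# High-level sketch of the SS-ORL pipeline (Algorithm 1). All helpers are
# illustrative placeholders for the components described in the text.

def ss_orl(labelled_trajs, unlabelled_trajs, k, offline_rl_algo):
    # (1) Train a stochastic multi-transition IDM on the labelled data only,
    #     by minimizing the negative log-likelihood of the observed actions.
    idm = train_idm(labelled_trajs, transition_size=k)

    # (2) Fill in proxy actions for every unlabelled trajectory, using the
    #     mean of the predicted Gaussian as the proxy label.
    proxy_trajs = []
    for tau in unlabelled_trajs:
        proxy_actions = predict_actions(idm, tau.states)   # mean of N(mu, Sigma)
        proxy_trajs.append(tau.with_actions(proxy_actions))

    # (3) Train the offline RL agent of choice (e.g., DT, CQL, TD3BC) on the
    #     union of the truly labelled and proxy-labelled trajectories.
    policy = train_offline_rl(offline_rl_algo, labelled_trajs + proxy_trajs)
    return policy
```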
Of note, SS-ORL is a multi-stage pipeline, where the IDM is trained only on the labelled data in a single round. There are other possible ways to combine the labelled and unlabelled data. In Section 3.2.2, we discuss several alternative design choices and the key reasons why we do not employ them. Additionally, we present the ablation experiments in Section 4.2.
3.2.1 STOCHASTIC MULTI-TRANSITION IDM
In past work (Pathak et al., 2017; Burda et al., 2019; Henaff et al., 2022), the IDM typically learns to map two subsequent states of the $t$-th transition, $(s_t, s_{t+1})$, to $a_t$. In theory, this is sufficient when the offline dataset is generated by a single Markovian policy in a deterministic environment; see Appendix D for the analysis. However, in practice, the offline dataset might contain trajectories logged from multiple sources.
To provide better treatment for multiple behavior policies, we introduce a multi-transition IDM that predicts the distribution of $a_t$ using the most recent $k+1$ transitions. More precisely, let $s_{t,k}$ denote the sequence $s_{\max(1,\,t-k)}, \ldots, s_t, s_{t+1}$. We model $P(a_t \mid s_{t,k})$ as a multivariate Gaussian with a diagonal covariance matrix:

$$a_t \sim \mathcal{N}\big(\mu_\theta(s_{t,k}),\, \Sigma_\theta(s_{t,k})\big). \qquad (1)$$

Let $\phi_\theta(a_t \mid s_{t,k})$ be the probability density function of $\mathcal{N}\big(\mu_\theta(s_{t,k}), \Sigma_\theta(s_{t,k})\big)$. Given the labelled trajectories $\mathcal{T}_{\text{labelled}}$, we minimize the negative log-likelihood loss $\sum_{(a_t, s_{t,k}) \in \mathcal{T}_{\text{labelled}}} \big[ -\log \phi_\theta(a_t \mid s_{t,k}) \big]$. We call $k$ the transition size parameter. Note that the standard IDM, which predicts $a_t$ from $(s_t, s_{t+1})$ under the $\ell_2$ loss, is a special case subsumed by our model: it is equivalent to the case $k = 0$ with the diagonal entries of $\Sigma_\theta$ (i.e., the variances of each action dimension) all equal.
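As an illustration of Eq. (1) and the NLL objective, the sketch below implements a diagonal-Gaussian IDM in PyTorch. The two-head architecture (separate MLPs for the mean and the log-variances) mirrors the experimental description in Section 4, but the exact sizes, clamping, and boundary handling are our own assumptions, not the paper's implementation.

```python
# Sketch of the stochastic multi-transition IDM: the input is the concatenated
# window s_{t,k} = (s_{max(1,t-k)}, ..., s_t, s_{t+1}) and the output is a
# diagonal Gaussian over a_t. At the start of a trajectory the window is shorter
# and would need padding; that detail is omitted here for brevity.
import torch
import torch.nn as nn

class GaussianIDM(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, k: int, hidden: int = 1024):
        super().__init__()
        in_dim = (k + 2) * state_dim  # k+2 states in the window s_{t,k}

        def mlp() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim),
            )

        self.mean_net = mlp()     # predicts mu_theta(s_{t,k})
        self.logvar_net = mlp()   # predicts log of the diagonal of Sigma_theta(s_{t,k})

    def forward(self, state_window: torch.Tensor):
        mean = self.mean_net(state_window)
        logvar = self.logvar_net(state_window).clamp(-10.0, 2.0)  # keep variances sane
        return mean, logvar

    def nll_loss(self, state_window: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of a_t under N(mu, diag(exp(logvar))).
        mean, logvar = self(state_window)
        dist = torch.distributions.Normal(mean, torch.exp(0.5 * logvar))
        return -dist.log_prob(action).sum(dim=-1).mean()
```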
In essence, we approximate $p(a_t \mid s_{t+1}, \ldots, s_1)$ by $p(a_t \mid s_{t,k})$, and choosing $k > 0$ allows us to take past state information into account. Meanwhile, the theory also indicates that incorporating future states like $s_{t+2}$ would not help to predict $a_t$ (see the analysis in Appendix D for details). For all the experiments in this paper, we use $k = 1$. We ablate this design choice in Section 4.2. Moreover, our IDM naturally extends to non-Markovian policies and stochastic MDPs. This is beyond the scope of this paper, but we consider them as potential directions for future work.
3.2.2 ALTERNATIVE DESIGN CHOICES
Training without Proxy-Labelling  SS-ORL fills in proxy actions for the unlabelled trajectories before training the agent. There, the policy learning task is defined on the combined dataset of the labelled and unlabelled data. An alternative approach is to only use the labelled data to define the policy learning task, but create certain auxiliary tasks using the unlabelled data. These auxiliary tasks do not depend on actions, so proxy-labelling is not needed. Multitask learning approaches can be employed to train an agent that solves those tasks together. For example, Reed et al. (2022) train a generalist agent that processes diverse sequences with a single transformer model. In a similar vein, we consider DT-Joint, a variant of DT that trains on both labelled and unlabelled data simultaneously. In a nutshell, DT-Joint predicts actions for the labelled trajectories, and states and rewards for both labelled and unlabelled trajectories. See Appendix F for the implementation details. Nonetheless, our ablation experiment in Section 4.2 shows that SS-ORL significantly outperforms DT-Joint.
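As a rough illustration of the DT-Joint idea (the exact objective is in Appendix F, which we do not reproduce here), action prediction can be masked out on unlabelled trajectories while state and reward prediction is supervised everywhere. The tensor names, the mean-squared-error choice, and the equal loss weighting below are assumptions for the sketch only.

```python
# Illustrative masked loss for a DT-Joint-style model: action prediction is only
# supervised where actions exist; state and reward prediction is supervised everywhere.
# Dummy (e.g., zero) action targets can be supplied where the mask is zero.
import torch
import torch.nn.functional as F

def joint_loss(pred_actions, pred_states, pred_rewards,
               actions, states, rewards, action_mask):
    # action_mask: 1.0 at timesteps of labelled trajectories, 0.0 at unlabelled ones
    action_err = F.mse_loss(pred_actions, actions, reduction="none").mean(dim=-1)
    action_loss = (action_err * action_mask).sum() / action_mask.sum().clamp(min=1.0)
    state_loss = F.mse_loss(pred_states, states)
    reward_loss = F.mse_loss(pred_rewards, rewards)
    return action_loss + state_loss + reward_loss
```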
Self-Training for the IDM  The annotation process in SS-ORL, which involves training an IDM on the labelled data and generating proxy actions for the unlabelled trajectories, is similar to one step of self-training (Fralick, 1967; cf. Section 2), a commonly used approach in standard semi-supervised learning. However, a key difference is that we do not retrain the IDM but directly move to the next stage of training the agent using the combined data. There are a few reasons why we do not employ self-training for the IDM. First, it is computationally expensive to execute multiple rounds of training. More importantly, our end goal is to obtain a downstream policy with improved performance by utilizing the proxy-labelled data. As a baseline, we consider self-training for the IDM, where after each training round we add the proxy-labelled data with low predictive uncertainties into the training set for the next round. Empirically, we found that this variant underperforms our approach. See Section 4.2 and Appendix E for more details.
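For reference, one round of the self-training baseline described above could look like the following sketch, where the selection criterion is the IDM's total predictive variance. The threshold, criterion, and data layout are illustrative assumptions, not the paper's exact protocol.

```python
# One round of IDM self-training (the baseline we compare against, not SS-ORL itself):
# pseudo-label the unlabelled windows, keep only low-uncertainty predictions, and
# add them to the training set used for the next IDM training round.
def self_training_round(idm, labelled_windows, unlabelled_windows, uncertainty_threshold):
    # labelled_windows: list of (state_window, action) pairs
    # unlabelled_windows: list of state_window tensors
    new_training_set = list(labelled_windows)
    for window in unlabelled_windows:
        mean, logvar = idm(window)
        uncertainty = logvar.exp().sum().item()  # total predictive variance as the criterion
        if uncertainty < uncertainty_threshold:
            new_training_set.append((window, mean.detach()))  # proxy action = predicted mean
    return new_training_set
```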
4 Experiments
Our main objectives are to answer four sets of questions:
Q1. How closely can SS-ORL agents match the performance of fully supervised offline RL agents, especially when only a small subset of trajectories is labelled?

Q2. How do the SS-ORL agents perform under different design choices for training the IDM, or even when avoiding proxy-labelling completely?

Q3. How does the performance of SS-ORL agents vary as a function of the size and quality of the labelled and unlabelled datasets?

Q4. Do different offline RL methods respond differently to various setups of the dataset size and quality?
We focus on two Gym locomotion tasks, hopper and walker, with the v2 medium-expert, medium and medium-replay datasets from the D4RL benchmark (Fu et al., 2020). Due to space constraints, the results on the medium and medium-replay datasets are deferred to Appendix C. We respond to the above questions in Sections 4.1, 4.2, 4.3 and 4.4, respectively. We also include additional experiments on the maze2d environments in Appendix H. For all experiments, we train 5 instances of each method with different seeds, and for each instance we roll out 30 evaluation trajectories. Our code is available at https://github.com/facebookresearch/ssorl/.
4.1 Main Evaluation (Q1)
Data Setup  We subsample 10% of the total offline trajectories from those whose returns lie in the bottom $q\%$ and use them as the labelled trajectories, where $10 \le q \le 100$. The actions of the remaining trajectories are discarded to create the unlabelled ones. We refer to this setup as the coupled setup, since the labelled data distribution $P_{\text{labelled}}$ and the unlabelled data distribution $P_{\text{unlabelled}}$ change simultaneously as we vary the value of $q$. As $q$ increases, the labelled data quality increases and the distributions $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$ become closer. When $q = 100$, our setup is equivalent to sampling the labelled trajectories uniformly, and $P_{\text{labelled}} = P_{\text{unlabelled}}$. Note that under our setup, we always have 10% of the trajectories labelled and 90% unlabelled, and the total amount of data used to train the offline RL agent is the same as the original offline dataset. This allows for easy comparison with results under the standard, fully labelled setup. In Section 4.3, we will decouple $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$ for an in-depth understanding of their individual influences on the SS-ORL agents.
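The coupled split can be constructed with a few lines of NumPy. The sketch below assumes per-trajectory returns have already been computed, and the `without_actions` helper is a hypothetical method that drops a trajectory's actions; both are our own naming, not the released code.

```python
# Sketch of the coupled labelled/unlabelled split: sample 10% of all trajectories
# from the bottom-q% by return as labelled data, and strip actions from the rest.
import numpy as np

def coupled_split(trajectories, returns, q, labelled_frac=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    order = np.argsort(returns)                   # trajectory indices, ascending by return
    bottom_q = order[: max(1, int(n * q / 100))]  # indices of the bottom-q% trajectories
    n_labelled = max(1, int(n * labelled_frac))
    labelled_idx = set(rng.choice(bottom_q, size=n_labelled, replace=False).tolist())

    labelled = [trajectories[i] for i in labelled_idx]
    # Discard actions from the remaining trajectories to create the unlabelled set.
    unlabelled = [traj.without_actions() for i, traj in enumerate(trajectories)
                  if i not in labelled_idx]
    return labelled, unlabelled
```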
Inverse Dynamics Model We train an IDM as described
in Section 3with
k= 1
. That is, the IDM predicts
at
using
3 consecutive states:
st1, st
and
st+1
, where the mean
and the covariance matrix are predicted by two independent
multilayer perceptrons (MLPs), each containing two
hidden layers and
1024
hidden units per layer. To prevent
overfitting, we randomly sample
10%
of the labelled
trajectories as the validation set, and use the IDM that
yields the best validation error within 100k iterations.
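A sketch of this validation-based model selection (hold out 10% of the labelled trajectories, keep the best checkpoint seen within 100k iterations) is given below. The `sample_batch` and `validation_nll` helpers and the evaluation interval are placeholders we introduce for illustration.

```python
# Sketch of IDM training with hold-out validation, as described above.
# `sample_batch` and `validation_nll` are hypothetical helpers, not released code.
import copy

def fit_idm(idm, optimizer, train_windows, val_windows,
            max_iters=100_000, eval_every=1_000):
    best_val, best_state = float("inf"), None
    for it in range(1, max_iters + 1):
        windows, actions = sample_batch(train_windows)   # minibatch of (s_{t,k}, a_t)
        loss = idm.nll_loss(windows, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % eval_every == 0:
            val = validation_nll(idm, val_windows)
            if val < best_val:                           # keep the best checkpoint so far
                best_val, best_state = val, copy.deepcopy(idm.state_dict())
    idm.load_state_dict(best_state)
    return idm
```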
Offline RL Methods  We instantiate Algorithm 1 with DT, CQL and TD3BC as the underlying offline RL methods. DT is a recently proposed conditional behaviour cloning (BC) method that uses sequence modelling tools to model the trajectories. CQL is a representative value-based offline RL method. TD3BC is a hybrid method which adds a BC term to regularize the Q-learning updates. We refer to these instantiations as SS-DT, SS-CQL and SS-TD3BC, respectively. See Appendix A for the implementation details.
Results  We compare the performance of the SS-ORL agents with corresponding baseline and oracle agents. The baseline agents are trained on the labelled trajectories only, and the oracle agents are trained on the full offline dataset with complete action labels. Intuitively, the performance of the baseline and the oracle agents can be considered as the (estimated) lower and upper bounds for the performance of the SS-ORL agents. We consider 6 different values of $q$: 10, 30, 50, 70, 90 and 100, and we report the average return and standard deviation after 200k iterations. Figure 4.1 plots the results on the medium-expert datasets. On both datasets, the SS-ORL agents consistently improve upon the baselines. Remarkably, even when the labelled data quality is low, the SS-ORL agents are able to obtain decent returns. As $q$ increases, the performance of the SS-ORL agents keeps increasing and finally matches the performance of the oracle agents.
To quantitatively measure how well an SS-ORL agent tracks the performance of the corresponding oracle agent, we define the relative performance gap of SS-ORL agents as

$$\frac{\text{Perf}(\text{Oracle-ORL}) - \text{Perf}(\text{SS-ORL})}{\text{Perf}(\text{Oracle-ORL})}, \qquad (2)$$

and similarly for the baseline agents. Figure 4.2 plots the average relative performance gap of these agents. Compared with the baselines, the SS-ORL agents notably reduce the relative performance gap.
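Equation (2) translates directly to a one-line helper; the function name below is our own.

```python
def relative_performance_gap(oracle_return: float, agent_return: float) -> float:
    """Relative performance gap of Eq. (2): how far an agent falls short of the oracle."""
    return (oracle_return - agent_return) / oracle_return

# Example: an agent scoring 92 against an oracle scoring 100 has a gap of 0.08.
print(relative_performance_gap(100.0, 92.0))
```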
Our results generalize to even smaller percentages of labelled data. Figure 4.3 plots the relative performance gap of the agents trained on the walker-medium-expert datasets when only 1% of the total trajectories are labelled. See Appendix C.3 for more experiments. Similar observations can be found in the results for the medium and medium-replay datasets; see Figures C.1 and C.2.
4.2 Comparison with Alternative Design Choices (Q2)
Training without Proxy-Labelling  Figure 4.4 plots the performance of DT-Joint and the SS-ORL agents on the hopper-medium-expert dataset, using the coupled