Preprint. Under Review.
SIMSCOOD: SYSTEMATIC ANALYSIS OF OUT-OF-
DISTRIBUTION GENERALIZATION IN FINE-TUNED
SOURCE CODE MODELS
Hossein Hajipour1, Ning Yu2, Cristian-Alexandru Staicu1, Mario Fritz1
1CISPA Helmholtz Center for Information Security
2Salesforce Research
1{hossein.hajipour, staicu, fritz}@cispa.de
2ning.yu@salesforce.com
ABSTRACT
Large code datasets have become increasingly accessible for pre-training source
code models. However, for the fine-tuning phase, obtaining representative training
data that fully covers the code distribution for specific downstream tasks remains
challenging due to the task-specific nature and limited labeling resources. More-
over, fine-tuning pretrained models can result in forgetting previously acquired
pre-training knowledge. These lead to out-of-distribution (OOD) generalization
issues with unexpected model inference behaviors that have not been systemati-
cally studied yet. In this paper, we contribute the first systematic approach that
simulates various OOD scenarios along different dimensions of source code data
properties and study the fine-tuned model behaviors in such scenarios. We inves-
tigate the behaviors of models under different fine-tuning methodologies, includ-
ing full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods. Our
comprehensive analysis, conducted on four state-of-the-art pretrained models and
applied to two code generation tasks, exposes multiple failure modes attributed to
OOD generalization issues. Additionally, our analysis uncovers that LoRA fine-
tuning consistently exhibits significantly better OOD generalization performance
than full fine-tuning across various scenarios.1
1 INTRODUCTION
Figure 1: Our approach simulates out-of-
distribution (OOD) scenarios and analyzes the
corresponding behaviors of models. (I) Origi-
nal source code distribution along a certain di-
mension. (II) OOD simulation by masking out
a sub-region of the distribution. (III) Model
fine-tuning. (IV) Evaluation on OOD data.
There has been increasing success in apply-
ing Large Language Models (LLMs) to vari-
ous source code understanding and generation
tasks. LLMs for code such as CodeBERT Feng et al. (2020), GraphCodeBERT Guo et al. (2021), CodeT5+ Wang et al. (2023), CodeGen Nijkamp et al. (2023), and Code Llama Rozière et al. (2023) are pretrained using large-scale source code
datasets and serve as universal initialization for a
variety of downstream tasks. These tasks include
code summarization (Alon et al., 2019; LeClair
et al., 2020), text-to-code (Iyer et al., 2018), code
translation (Nguyen et al., 2013; Rozière et al.,
2020), and program repair (Tufano et al., 2018;
Chen et al., 2019; Hajipour et al., 2021).
The emerging abilities of LLMs, such as in-context learning, demonstrate their potential to handle
a wide range of tasks (Wei et al., 2022; Brown et al., 2020). However, it has been shown that not
all tasks can be effectively addressed by relying only on the pretrained LLMs Anil et al. (2022).
1The code and data will be available at https://github.com/hajipour/SimSCOOD
arXiv:2210.04802v2 [cs.SE] 30 Oct 2023
To adapt pretrained models for specific tasks, they can be fine-tuned with specific datasets for each
downstream task. This fine-tuning process can involve optimizing all parameters or adopting a
parameter-efficient approach (Houlsby et al., 2019; Hu et al., 2022), such as Low-Rank Adaptation
(LoRA) Hu et al. (2022). Considering that fine-tuning is prone to catastrophic forgetting (Chen et al., 2020; Li et al., 2022a), and that these models play crucial roles in automatic software development, it is equally important, if not more so, to foresee and understand any unexpected model behaviors in scenarios beyond the in-distribution fine-tuning data.
Despite having access to the large code datasets to pre-train these models, it remains challenging in
practice to fully cover the code distribution, specifically in fine-tuning datasets, where the availability
of labeled data is limited. This mainly stems from the compositional structures of programs and the
complexity of software. Furthermore, it has been shown that fine-tuned models forget previously
learned knowledge Chen et al. (2020), and fully fine-tuning the parameters of the pretrained models
can distort the pretrained features Kumar et al. (2022).
Therefore, it is unclear how fine-tuned code models generalize to scenarios that are unseen or rare in the fine-tuning distribution Shen et al. (2021). For example, there is a lack of existing studies to
uncover how these models generalize to programs with specific language elements or semantics not
seen in fine-tuning datasets. A common way to study model behaviors in various OOD scenarios
is to collect testing datasets in the complementary domains of the fine-tuning dataset domain (Shen
et al., 2021). However, because the underlying true distribution of source code is intractable, it
is hardly feasible to justify whether two raw datasets share a domain, not to mention the substantial costs of enumerating and constituting a variety of OOD testing datasets.
Simulating various OOD scenarios by masking out sub-regions of training data distribution is an
alternative way to systematically study the model behaviors (Schott et al., 2022; Wiles et al., 2022).
There are several distribution dimensions based on data properties. In the source code domain, we
have access to the structural information to model the source code distribution based on the length,
syntax, and semantics of programs. For example, in terms of the syntax dimension, we can mask
out all the data containing unary expressions or a specific API to create a syntax-based OOD scenario.
In this work, we propose a systematic approach to analyzing the behaviors of fine-tuned source code
models in various OOD and few-data regime scenarios. We achieve this by harnessing the token
size, syntax information, and contextual embeddings of programs to simulate the OOD scenarios in
terms of length, syntax, and semantics dimensions, as illustrated in Figure 1. By utilizing these data
dimensions and control over the data, we can systematically examine the performance of fine-tuned
models in OOD scenarios and investigate the generalization capabilities of different fine-tuning
methods.
To summarize, the main contributions of this paper are as follows: 1. Our work pioneers in inves-
tigating the behaviors of the fine-tuned source code models in OOD scenarios. 2. We propose a
systematic approach to simulate various OOD scenarios by masking out sub-regions of source code
distribution along the length, syntax, and semantics dimensions. 3. We find that the performance of
the fine-tuned models can significantly deteriorate in various OOD scenarios despite the models en-
countering similar examples during the pre-training phase. In particular, in syntax and length-based
OOD scenarios, the drop can be as substantial as 90%. 4. Our systematic analysis shows that, while
full fine-tuning and LoRA fine-tuning perform comparably on in-distribution code data, LoRA fine-tuning demonstrates significantly better performance on OOD data. 5. Our analysis of data/model
properties provides insights into model fine-tuning and shapes future datasets/research to focus on
the OOD of code models, which has the potential to enhance generalization accuracy across various
code generation tasks.
2 RELATED WORK
Large Language Models for Codes. With the availability of large-scale code datasets (Husain
et al., 2019; Kocetkov et al., 2022), there is growing interest in employing large language mod-
els to develop a unified pre-training model for source code understanding and generation. Code-
BERT Feng et al. (2020) is one of the first models that use pre-training in the source code domain.
CodeBERT extends the RoBERTa-based model Liu et al. (2019) to understand and generate source
code in various programming languages. Guo et al. (2021) extend CodeBERT by using a semantic-
aware objective function. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) are developed based on
encoder-decoder architecture, making them versatile models for addressing a wide range of code
understanding and code generation tasks. Svyatkovskiy et al. (2020) employ a GPT-based model (Radford et al., 2019), which uses a decoder-only architecture, for the code completion task. CodeGen Nijkamp
et al. (2023), StarCoder Li et al. (2023), and Code Llama Rozière et al. (2023) also employ a decoder-only architecture to pre-train code generation models; these models demonstrate impressive results
across a variety of code generation tasks. While these models show remarkable results by following
natural language instructions, it has been demonstrated that LLMs still have difficulty in understanding code Austin et al. (2021); Li et al. (2022b), specifically in domain-specific tasks Anil et al.
(2022). In our work, we focus on generation tasks to spot weak and strong points of the fine-tuned
LLMs in generating rare and unseen programs.
Out-of-Distribution Analysis in Natural Languages and Programming Languages. Despite
the importance of OOD analysis and detection in production Shen et al. (2021), there are surprisingly few efforts to investigate the OOD behaviors of NLP and PL approaches (Arora et al., 2021).
Hendrycks et al. (2020); Kong et al. (2020) study the behavior of pretrained large language models in
OOD scenarios. Even though they show that the pretrained models are better calibrated, the provided
results indicate that there is still room for improvement. Bui & Yu (2021) propose an energy-
bounded-based approach to detect OOD data in source code classification tasks. Recently, Shi et al.
(2022) proposed a set of pre-defined scenarios to investigate the compositional generalization of
neural program synthesizers. It is important to note that their investigation was limited to domain-
specific languages, such as SCAN (Lake & Baroni, 2018), and did not incorporate pretrained code
models. In contrast, we propose the first systematic study to investigate the behavior of fine-tuned
code models across different OOD scenarios.
Fine-tuning LLMs and Catastrophic Forgetting. LLMs have demonstrated impressive capabili-
ties in handling various tasks using zero-shot and few-shot learning approaches (Brown et al., 2020;
Kojima et al., 2022). However, not all tasks can be effectively handled by relying on pretrained
LLMs (Anil et al., 2022; Scialom et al., 2022). For such tasks, we can employ fine-tuning tech-
niques with the datasets for the targeted downstream tasks. Furthermore, recent works indicate that
fine-tuning LLMs with instructions can enhance their capabilities Ouyang et al. (2022); Xu et al.
(2023); Dai et al. (2023). Despite the effectiveness of the fine-tuning procedure, recent work shows
that after fine-tuning, the LLMs can experience catastrophic forgetting in various NLP tasks (Luo
et al., 2023; Chen et al., 2020). Furthermore, Kumar et al. (2022) validate that fully fine-tuning the models can distort the pretrained features and adversely impact the OOD generalization performance in image classification tasks. In this work, for the first time, we systematically investigate the
behavior of the fine-tuned source code models by carefully designing various OOD scenarios.
3 SIMSCOOD: SIMULATION OF SOURCE CODE OUT-OF-DISTRIBUTION
SCENARIOS
In this work, we propose a systematic approach to investigate the fine-tuned code model behaviors on
OOD data by simulating the OOD scenarios in multiple dimensions. Our simulation strategy allows
us to construct measurable OOD scenarios without the additional costs of accessing another dataset.
More importantly, by simulating the OOD scenarios, we have control over different properties of
OOD scenarios. We achieve this by masking out specific sub-regions of data distribution.
These OOD scenarios span three data dimensions: length, syntax, and semantics.
These dimensions cover different aspects of the programs. In length-based OOD scenarios where
we model the program length based on their token sizes (Meneely et al., 2013), we can study the
length-generalization ability of the fine-tuned models, for example, whether the models can produce longer code with high quality and how well they can interpolate over distribution gaps.
Syntax-based scenarios enable us to study the models by masking out specific language elements.
More interestingly, using syntax-based scenarios, we can analyze to what extent each model can
generate unseen language elements. Using semantic-based scenarios, we can investigate how the
models behave if we mask out the data with specific functionalities (e.g., getter functions in Java).
Benefiting from these scenarios, we can also implicitly quantify how well the models compose
different code language elements to achieve unseen or rare functionality.
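As an illustration of the syntax dimension, a program's abstract syntax tree exposes the language elements a scenario could mask out. The following is a minimal sketch using Python's built-in `ast` module as a stand-in parser (the paper's scenarios are defined over the grammar of the target language, so this is an assumption for illustration only):

```python
import ast

def has_unary_expression(source: str) -> bool:
    """Check whether a program's AST contains a unary expression
    (e.g., -x or not x) -- a language element that a syntax-based
    OOD scenario could mask out of the fine-tuning data."""
    tree = ast.parse(source)
    return any(isinstance(node, ast.UnaryOp) for node in ast.walk(tree))

print(has_unary_expression("y = -x"))     # True: -x is a UnaryOp
print(has_unary_expression("y = a - b"))  # False: a - b is a BinOp
```

The same walk-and-match pattern extends to any grammar construct (ternary expressions, specific API calls, comprehensions) by testing for the corresponding node type.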
Figure 2: Overview of different out-of-distribution scenarios. Part of the data that needs to be
masked out from the training distribution is highlighted by the red rectangles.
Modeling the Distribution of Source Code. Here, we experiment with different pretrained models and probe their behaviors in each scenario. We achieve this using our new approach, which systematically constructs various scenarios to challenge the OOD performance of each model. The distribution of source code can be characterized using the aforementioned dimensions, which we
call properties in the following. We model the joint distribution of source code as $q(p_1, \ldots, p_n)$, where each $p_i$ is a specific property of the source code in distribution $q$. Given this distribution, we can sample a dataset $D = \{x_1, \ldots, x_N \mid x_i \sim q(p_1, \ldots, p_n)\}$. To create each OOD scenario, we need to sample a new dataset $\hat{D} = \{x_1, \ldots, x_N \mid x_i \sim \hat{q}(p_1, \ldots, p_n)\}$, where $\hat{q}(p_f, \ldots, p_k) = 0$, meaning that samples with properties $p_f, \ldots, p_k$ are masked out. Note that we have formulated OOD scenarios with categorical properties; the formulation also holds for continuous properties via $p(a < p_i < b)$ with $a < b$ and $a, b \in \mathbb{R}$.

To sample the dataset $\hat{D}$, we take inspiration from the rejection sampling technique (Casella et al., 2004). Here, $\hat{q}(p_1, \ldots, p_n)$ is our target distribution, and we consider $q(p_1, \ldots, p_n)$ as our proposal distribution. We reject or accept a sample $x \sim q(p_1, \ldots, p_n)$ using the following step function:

$$f(x) = \begin{cases} 1 & \text{if } P(x) \notin \tilde{P} \\ 0 & \text{if } P(x) \in \tilde{P} \end{cases} \qquad (1)$$

where $P(x)$ returns the properties of data $x$, and $\tilde{P}$ is the set of properties that we do not want the sampled data $x$ to contain. Using the rejection sampling technique with this hard-decision function (Equation 1), we can construct the dataset $\hat{D} = \{x_1, \ldots, x_N \mid x \sim \hat{q}(p_1, \ldots, p_n)\}$ from the accepted samples, and we also obtain the dataset $\tilde{D} = \{x_1, \ldots, x_N \mid x \sim \tilde{q}(p_1, \ldots, p_n)\}$, which consists of all the rejected samples. To examine model behaviors in each OOD scenario, we fine-tune models on $\hat{D}$ and test them on the test set of $\tilde{D}$. Figure 2 depicts an overview of the length-, syntax-, and semantic-based scenarios. In the following, we provide the details of how we simulate each OOD scenario (subsection 4.1).
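A minimal sketch of this construction, with a generic `properties` callback standing in for $P(x)$ (the names `split_ood` and the toy property labels are hypothetical, for illustration only):

```python
def split_ood(dataset, properties, masked):
    """Partition a dataset with a hard-decision rule in the spirit of
    Equation 1: a sample x is accepted iff none of its properties P(x)
    lie in the masked set. Accepted samples form the fine-tuning data
    (D-hat); rejected samples form the OOD test pool (D-tilde)."""
    accepted, rejected = [], []
    for x in dataset:
        if properties(x) & masked:  # P(x) overlaps the masked properties
            rejected.append(x)
        else:
            accepted.append(x)
    return accepted, rejected

# Toy usage: mask out every program that calls a "sort" API.
programs = ["xs.sort()", "y = a + b", "print(min(xs))"]
props = lambda p: {"sort_api"} if "sort" in p else set()
fine_tune_data, ood_test_data = split_ood(programs, props, {"sort_api"})
# fine_tune_data == ["y = a + b", "print(min(xs))"]
# ood_test_data  == ["xs.sort()"]
```

In this deterministic setting, rejection sampling reduces to filtering: every proposal draw is either surely kept or surely discarded, so a single pass over the dataset yields both splits.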
3.1 LENGTH-BASED OOD SCENARIOS
To simulate length-based scenarios, we use the histogram of program token sizes to represent the
distribution of a given dataset. See Figure 2 left as an example. To create each OOD scenario,
according to the rejection sampling technique, we draw samples from the distribution and reject
only the samples in the histogram’s specified sub-region.
As an example, in one of the OOD scenarios, we can consider programs with token sizes between 120 and 135 as OOD testing data. Then $\hat{D} = \{x \sim \hat{q}(p_1, \ldots, p_n)\}$, where $\hat{q}(120 < p_i < 135) = 0$, is the accepted data in the rejection sampling technique. Experimenting with the length-based OOD scenarios enables us to analyze how fine-tuned source code models generalize to interpolate and extrapolate over distribution gaps.
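A sketch of such a length-based split, assuming whitespace tokenization as a stand-in for the models' tokenizers (the function name and thresholds here are illustrative, not the paper's implementation):

```python
def length_ood_split(programs, lo, hi, token_count=lambda p: len(p.split())):
    """Mask out programs whose token size falls strictly inside (lo, hi):
    they become the OOD test pool; the rest remain fine-tuning data."""
    fine_tune = [p for p in programs if not lo < token_count(p) < hi]
    ood_test = [p for p in programs if lo < token_count(p) < hi]
    return fine_tune, ood_test

# Toy usage with small thresholds; the paper's scenarios use ranges
# such as 120-135 tokens.
short, long_ = "a = 1", "def f ( x ) : return x + 1"
ft, ood = length_ood_split([short, long_], lo=4, hi=20)
# ft == ["a = 1"] (3 tokens); ood holds the 10-token program
```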
3.2 SYNTAX-BASED OOD SCENARIOS
Each programming language has its own grammar, which is a set of rules to define valid program
statements. Using the grammar, we can parse each program into an abstract syntax tree Guo et al.
(2021) and have access to all of the language elements used in the program. For example, we can