
aware objective function. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) are built on an
encoder-decoder architecture, making them versatile models for a wide range of code
understanding and code generation tasks. Svyatkovskiy et al. (2020) employ a GPT-based model
(Radford et al., 2019), which uses a decoder-only architecture, for the code completion task.
CodeGen (Nijkamp et al., 2023), StarCoder (Li et al., 2023), and Code Llama (Rozière et al., 2023)
also employ decoder-only architectures to pre-train code generation models; these models
demonstrate impressive results across a variety of code generation tasks. While these models show
remarkable results when following natural language instructions, it has been demonstrated that
LLMs still have difficulty understanding code (Austin et al., 2021; Li et al., 2022b), particularly
in domain-specific tasks (Anil et al., 2022). In our work, we focus on generation tasks to identify
the strengths and weaknesses of fine-tuned LLMs in generating rare and unseen programs.
Out-of-Distribution Analysis in Natural Languages and Programming Languages. Despite
the importance of OOD analysis and detection in production (Shen et al., 2021), there have been
surprisingly few efforts to investigate the OOD behavior of NLP and PL approaches (Arora et al.,
2021). Hendrycks et al. (2020) and Kong et al. (2020) study the behavior of pretrained large
language models in OOD scenarios. Even though they show that pretrained models are better
calibrated, their results indicate that there is still room for improvement. Bui & Yu (2021) propose
an energy-bounded approach to detect OOD data in source code classification tasks. Recently,
Shi et al. (2022) proposed a set of pre-defined scenarios to investigate the compositional
generalization of neural program synthesizers. It is important to note that their investigation was
limited to domain-specific languages, such as SCAN (Lake & Baroni, 2018), and did not incorporate
pretrained code models. In contrast, we propose the first systematic study of the behavior of
fine-tuned code models across different OOD scenarios.
Fine-tuning LLMs and Catastrophic Forgetting. LLMs have demonstrated impressive capabilities
in handling various tasks using zero-shot and few-shot learning approaches (Brown et al., 2020;
Kojima et al., 2022). However, not all tasks can be handled effectively by relying on pretrained
LLMs alone (Anil et al., 2022; Scialom et al., 2022). For such tasks, we can fine-tune the models
on datasets for the targeted downstream tasks. Furthermore, recent works indicate that fine-tuning
LLMs with instructions can enhance their capabilities (Ouyang et al., 2022; Xu et al., 2023;
Dai et al., 2023). Despite the effectiveness of fine-tuning, recent work shows that fine-tuned
LLMs can experience catastrophic forgetting across various NLP tasks (Luo et al., 2023; Chen
et al., 2020). Furthermore, Kumar et al. (2022) show that fully fine-tuning a model can distort
its pretrained features and adversely impact OOD generalization performance in image
classification tasks. In this work, for the first time, we systematically investigate the behavior
of fine-tuned source code models by carefully designing various OOD scenarios.
3 SIMSCOOD: SIMULATION OF SOURCE CODE OUT-OF-DISTRIBUTION SCENARIOS
In this work, we propose a systematic approach to investigate the behavior of fine-tuned code
models on OOD data by simulating OOD scenarios along multiple dimensions. Our simulation
strategy allows us to construct measurable OOD scenarios without the additional cost of accessing
another dataset. More importantly, by simulating the OOD scenarios, we retain control over their
different properties. We achieve this by masking out specific sub-regions of the data distribution,
as sketched below.
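
To make the setup concrete, the following is a minimal Python sketch of how such a scenario
could be constructed, assuming a corpus of programs and a hypothetical predicate
in_masked_region that stands in for the dimension-specific criteria introduced next; it is an
illustration of the idea, not our exact implementation.

    from typing import Callable, Iterable, List, Tuple

    def simulate_ood_split(
        programs: Iterable[str],
        in_masked_region: Callable[[str], bool],
    ) -> Tuple[List[str], List[str]]:
        # Programs inside the masked sub-region are withheld from
        # fine-tuning and serve as the OOD test set; the remaining
        # programs form the in-distribution training data.
        train, ood_test = [], []
        for program in programs:
            (ood_test if in_masked_region(program) else train).append(program)
        return train, ood_test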
These OOD scenarios span three data dimensions: length, syntax, and semantics. These dimensions
cover different aspects of the programs. In length-based OOD scenarios, where we model program
length by token count (Meneely et al., 2013), we can study the length-generalization ability of
the fine-tuned models, for example, whether the models can produce longer programs of high
quality and how well they can interpolate over distribution gaps. Syntax-based scenarios enable
us to study the models by masking out specific language elements. More interestingly, using
syntax-based scenarios, we can analyze to what extent each model can generate unseen language
elements. Using semantic-based scenarios, we can investigate how the models behave if we mask
out data with specific functionalities (e.g., getter functions in Java). Benefiting from these
scenarios, we can also implicitly quantify how well the models compose different code language
elements to achieve unseen or rare functionality.
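
As an illustration, the three dimensions could be instantiated with masking predicates such as
the following sketch; the concrete criteria here (whitespace tokenization, keyword matching, and
a name-based getter heuristic) are simplifying assumptions for exposition, not the exact rules
used to build our scenarios.

    import re

    def masked_by_length(program: str, low: int = 200, high: int = 300) -> bool:
        # Length dimension: mask programs whose token count falls in a
        # chosen band; whitespace splitting approximates a real tokenizer.
        return low <= len(program.split()) <= high

    def masked_by_syntax(program: str, element: str = "switch") -> bool:
        # Syntax dimension: mask programs containing a specific language
        # element; a real implementation would match AST nodes, not text.
        return re.search(rf"\b{re.escape(element)}\b", program) is not None

    def masked_by_semantics(program: str) -> bool:
        # Semantics dimension: mask programs with a target functionality,
        # e.g., Java getters, detected via a crude method-name heuristic.
        return re.search(r"\bget[A-Z]\w*\s*\(\s*\)", program) is not None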