
aware objective function. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) are built on an
encoder-decoder architecture, making them versatile models for a wide range of code
understanding and code generation tasks. Svyatkovskiy et al. (2020) employ a GPT-based model
(Radford et al., 2019), which uses a decoder-only architecture, for the code completion task.
CodeGen (Nijkamp et al., 2023), StarCoder (Li et al., 2023), and Code Llama (Rozière et al., 2023)
also employ decoder-only architectures to pre-train code generation models; these models
demonstrate impressive results across a variety of code generation tasks. While these models show
remarkable results when following natural language instructions, it has been demonstrated that
LLMs still have difficulty understanding code (Austin et al., 2021; Li et al., 2022b), particularly
in domain-specific tasks (Anil et al., 2022). In our work, we focus on generation tasks to identify
the strengths and weaknesses of fine-tuned LLMs in generating rare and unseen programs.
Out-of-Distribution Analysis in Natural Languages and Programming Languages. Despite
the importance of OOD analysis and detection in production (Shen et al., 2021), there have been
surprisingly few efforts to investigate the OOD behavior of NLP and PL approaches (Arora et al.,
2021). Hendrycks et al. (2020) and Kong et al. (2020) study the behavior of pretrained large
language models in OOD scenarios. Even though they show that pretrained models are better
calibrated, their results indicate that there is still room for improvement. Bui & Yu (2021) propose
an energy-bounded approach to detect OOD data in source code classification tasks. Recently,
Shi et al. (2022) proposed a set of pre-defined scenarios to investigate the compositional
generalization of neural program synthesizers. It is important to note that their investigation was
limited to domain-specific languages, such as SCAN (Lake & Baroni, 2018), and did not incorporate
pretrained code models. In contrast, we propose the first systematic study of the behavior of
fine-tuned code models across different OOD scenarios.
Fine-tuning LLMs and Catastrophic Forgetting. LLMs have demonstrated impressive capabilities
in handling various tasks using zero-shot and few-shot learning approaches (Brown et al., 2020;
Kojima et al., 2022). However, not all tasks can be handled effectively by relying on pretrained
LLMs alone (Anil et al., 2022; Scialom et al., 2022). For such tasks, we can fine-tune the models
on datasets for the targeted downstream tasks. Furthermore, recent works indicate that fine-tuning
LLMs with instructions can enhance their capabilities (Ouyang et al., 2022; Xu et al., 2023;
Dai et al., 2023). Despite the effectiveness of fine-tuning, recent work shows that fine-tuned
LLMs can experience catastrophic forgetting across various NLP tasks (Luo et al., 2023; Chen
et al., 2020). Furthermore, Kumar et al. (2022) show that fully fine-tuning a model can distort
its pretrained features and adversely impact OOD generalization performance in image
classification tasks. In this work, for the first time, we systematically investigate the behavior
of fine-tuned source code models by carefully designing various OOD scenarios.
3 SIMSCOOD: SIMULATION OF SOURCE CODE OUT-OF-DISTRIBUTION SCENARIOS
In this work, we propose a systematic approach to investigate the behavior of fine-tuned code
models on OOD data by simulating OOD scenarios along multiple dimensions. Our simulation
strategy allows us to construct measurable OOD scenarios without the additional cost of accessing
another dataset. More importantly, by simulating the OOD scenarios, we retain control over their
different properties. We achieve this by masking out specific sub-regions of the data distribution,
as sketched below.
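
To make the setup concrete, the following is a minimal Python sketch of how such a scenario
could be constructed, assuming a corpus of programs and a hypothetical predicate
in_masked_region that stands in for the dimension-specific criteria introduced next; it is an
illustration of the idea, not our exact implementation.

    from typing import Callable, Iterable, List, Tuple

    def simulate_ood_split(
        programs: Iterable[str],
        in_masked_region: Callable[[str], bool],
    ) -> Tuple[List[str], List[str]]:
        # Programs inside the masked sub-region are withheld from
        # fine-tuning and serve as the OOD test set; the remaining
        # programs form the in-distribution training data.
        train, ood_test = [], []
        for program in programs:
            (ood_test if in_masked_region(program) else train).append(program)
        return train, ood_test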
These OOD scenarios span three data dimensions: length, syntax, and semantics. These dimensions
cover different aspects of the programs. In length-based OOD scenarios, where we model program
length by token count (Meneely et al., 2013), we can study the length-generalization ability of
the fine-tuned models, for example, whether the models can produce longer programs of high
quality and how well they can interpolate over distribution gaps. Syntax-based scenarios enable
us to study the models by masking out specific language elements. More interestingly, using
syntax-based scenarios, we can analyze to what extent each model can generate unseen language
elements. Using semantic-based scenarios, we can investigate how the models behave if we mask
out data with specific functionalities (e.g., getter functions in Java). Benefiting from these
scenarios, we can also implicitly quantify how well the models compose different code language
elements to achieve unseen or rare functionality.
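
As an illustration, the three dimensions could be instantiated with masking predicates such as
the following sketch; the concrete criteria here (whitespace tokenization, keyword matching, and
a name-based getter heuristic) are simplifying assumptions for exposition, not the exact rules
used to build our scenarios.

    import re

    def masked_by_length(program: str, low: int = 200, high: int = 300) -> bool:
        # Length dimension: mask programs whose token count falls in a
        # chosen band; whitespace splitting approximates a real tokenizer.
        return low <= len(program.split()) <= high

    def masked_by_syntax(program: str, element: str = "switch") -> bool:
        # Syntax dimension: mask programs containing a specific language
        # element; a real implementation would match AST nodes, not text.
        return re.search(rf"\b{re.escape(element)}\b", program) is not None

    def masked_by_semantics(program: str) -> bool:
        # Semantics dimension: mask programs with a target functionality,
        # e.g., Java getters, detected via a crude method-name heuristic.
        return re.search(r"\bget[A-Z]\w*\s*\(\s*\)", program) is not None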