
Code Librarian: A Software Package
Recommendation System
Lili Tao, Alexandru-Petre Cazan, Senad Ibraimoski and Sean Moran
JP Morgan Chase
Email: {lili.tao,alexandru-petre.cazan,senad.ibraimoski,sean.j.moran}@jpmchase.com
Abstract—The use of packaged libraries can significantly
shorten the software development life cycle by improving the
quality and readability of code. In this paper, we present a
recommendation engine called Code Librarian for open source
libraries. A candidate library package is recommended for a
given context if: 1) it has been frequently used with the imported
libraries in the program; 2) it has similar functionality to the
imported libraries in the program; 3) it has similar functionality
to the developer’s implementation, and 4) it can be used efficiently
in the context of the provided code. We apply a state-of-the-art
CodeBERT-based model to analyse the context of the source
code and deliver relevant library recommendations to users.
Index Terms—artificial intelligence, software engineering, rec-
ommender systems
I. INTRODUCTION
Reusing existing software libraries brings many benefits,
including faster software development and an increase in the
quality and readability of code. In this paper, we introduce
Code Librarian, a software library recommendation system
that uses machine learning techniques to suggest relevant
open source libraries based on the context of the code
already written by a developer [1], [2]. For Python developers,
more than 350,000 libraries [3] are available on PyPI, and new
library packages are added frequently. In addition, standard
library practices evolve rapidly across various tasks. For
example, in the field of Natural Language Processing (NLP),
the commonly used libraries quickly expanded from
scikit-learn and gensim to bertopic, top2vec, and octis,
following recent advances in NLP. Librarian is an intelligent
coding assistant that helps developers find and reuse quality
code and components.
II. APPROACH AND METHODOLOGY
Figure 1 shows the approach: a) recommendation of
complementary libraries by learning which libraries are used
most frequently with those already imported; b) recommendation
of alternative libraries that can replace functionally similar code.
A. Complementary library recommendation
Learning embeddings for library packages: To discover
complementary libraries, we learnt a contextual embedding of
libraries based on their co-occurrence in the same scripts. We
followed [4] for learning the vector representation of library
packages, in which a skip-gram model [5] is used to learn
embeddings for libraries based on their usage context. A pair
of imported libraries is deemed a positive example when the
target library co-occurs with the context library within a file
of at least one project. A negative pair consists of libraries that
were rarely imported together in any source file of any project in
the dataset. Cosine similarity between the embeddings is used
to find the most similar, and therefore complementary, libraries.
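The pair construction and similarity lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the embeddings have already been trained (skip-gram training is omitted), and the helper names `positive_pairs` and `most_similar`, the toy vectors, and the toy file list are all hypothetical.

```python
import math
from itertools import permutations


def positive_pairs(files):
    """Return (target, context) pairs of libraries imported in the same file."""
    pairs = set()
    for imports in files:
        # Every ordered pair of distinct libraries in one file is a positive example.
        pairs.update(permutations(sorted(set(imports)), 2))
    return pairs


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, top_k=3):
    """Rank all other libraries by cosine similarity to the query library."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]


# Toy, made-up embeddings standing in for trained skip-gram vectors.
embeddings = {
    "numpy": [0.9, 0.1, 0.0],
    "pandas": [0.8, 0.2, 0.1],
    "requests": [0.0, 0.1, 0.9],
}
# Each inner list is the set of imports found in one (hypothetical) source file.
files = [["numpy", "pandas"], ["requests"]]

pairs = positive_pairs(files)
ranking = most_similar("numpy", embeddings, top_k=2)
```

Here `ranking` places `pandas` ahead of `requests` for the query `numpy`, because the toy vectors were chosen to mimic libraries that are frequently imported together.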
Out of sample extension: For new library packages not
included in the training data, rather than re-train the model,
an embedding of the new package is learnt by projecting it
into the latent space. The new embedding can be calculated
as the weighted average of the N co-occurring packages in the
same file, with the weight representing the number of times
the pair appeared together:
$\frac{\sum_{i=1}^{N} w_i P_i}{\sum_{i=1}^{N} w_i}$,
where the weight $w_i$ is the number of times the unseen library
co-occurred with library $P_i$.
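The weighted average used for the out-of-sample extension can be sketched as below. The function name `project_new_library`, the two-dimensional vectors, and the co-occurrence counts are hypothetical; libraries without a trained embedding are simply skipped, which is an assumption of this sketch rather than something stated in the paper.

```python
def project_new_library(cooccurrence_counts, embeddings):
    """Embed an unseen library as the count-weighted mean of its neighbours.

    cooccurrence_counts maps a known library name to w_i, the number of times
    it appeared in the same file as the unseen library.
    embeddings maps a known library name to its trained vector P_i.
    """
    weighted_sum = None
    total_weight = 0
    for name, weight in cooccurrence_counts.items():
        vec = embeddings.get(name)
        if vec is None or weight <= 0:
            continue  # assumption: ignore neighbours with no trained embedding
        if weighted_sum is None:
            weighted_sum = [0.0] * len(vec)
        for j, value in enumerate(vec):
            weighted_sum[j] += weight * value
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no co-occurring library has a known embedding")
    return [value / total_weight for value in weighted_sum]


# Toy, made-up embeddings and counts.
known = {"numpy": [1.0, 0.0], "pandas": [0.0, 1.0]}
counts = {"numpy": 3, "pandas": 1, "unknown_pkg": 5}
new_vector = project_new_library(counts, known)
```

With these toy inputs the new vector lies three quarters of the way towards `numpy`, reflecting that the unseen library co-occurred with it three times as often as with `pandas`.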
B. Alternative library recommendation
Understanding the topic and functionality of source code
assists with the selection of relevant libraries. We leverage
CodeBERT [6] to learn the contextual representation of the
code and capture the semantic connection between natural
language and programming language. CodeBERT is applied
to generate a text description of each function or, for
IPython notebook files, of each Jupyter notebook cell. We
concatenate the text descriptions and use the result as a query.
The query is used to retrieve matching libraries based on their
descriptions using a vector-space retrieval method
(bag-of-words, TF-IDF).
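The retrieval step can be illustrated with a small bag-of-words TF-IDF ranker. This is a sketch under stated assumptions: the query string stands in for the concatenated CodeBERT-generated descriptions (no model is called here), the library descriptions are made up rather than real package metadata, and the smoothed IDF formula is one common variant, not necessarily the one used in the system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simple bag of words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(descriptions):
    """Compute smoothed IDF weights and a TF-IDF vector per library description."""
    docs = {name: Counter(tokenize(text)) for name, text in descriptions.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    vectors = {
        name: {t: c * idf[t] for t, c in counts.items()}
        for name, counts in docs.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, idf, vectors, top_k=3):
    """Rank libraries by TF-IDF cosine similarity to the query text."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    q = {t: c * idf[t] for t, c in counts.items()}
    ranked = sorted(
        ((name, cosine(q, vec)) for name, vec in vectors.items()),
        key=lambda s: s[1],
        reverse=True,
    )
    return ranked[:top_k]


# Made-up descriptions for illustration only.
descriptions = {
    "bertopic": "topic modeling with transformer embeddings",
    "requests": "simple http requests for humans",
    "matplotlib": "plotting and visualization of data",
}
idf, vectors = build_index(descriptions)
# Stand-in for the concatenated generated descriptions of the user's code.
results = retrieve("fit a topic model on document embeddings", idf, vectors)
```

Only `bertopic` shares vocabulary with the query in this toy corpus, so it is ranked first and the other two libraries receive a score of zero.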
C. Deployment of Librarian
We developed a demo, shown in Figure 2. CodeBERT was
packaged and deployed on AWS SageMaker, while the main
application was deployed as a service on a Kubernetes cluster.
The user receives library recommendations after uploading a
Jupyter notebook file. For a more seamless user experience,
we built a Jupyter notebook extension that subscribes to
cell change events and recommends libraries in real time. On
every cell update event, the current notebook source code is
sent to a CodeBERT model for inference, and the results are
shown in a sub-panel (Figure 3).
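Before any inference, the service has to pull the code out of the uploaded notebook. The sketch below shows one way this could be done using only the standard notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The function names are hypothetical, and nothing here reflects the actual extension or service code.

```python
import ast
import json


def notebook_code(notebook_json):
    """Return the source of each code cell in a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of lines.
        cells.append("".join(source) if isinstance(source, list) else source)
    return cells


def imported_packages(code):
    """Return top-level package names imported by a snippet of Python code."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Cells with IPython magics or incomplete code are skipped in this sketch.
        return set()
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


# A tiny, made-up notebook with one markdown cell and one code cell.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["import numpy as np\n", "from sklearn.cluster import KMeans\n"],
        },
    ]
})
code_cells = notebook_code(demo)
packages = imported_packages(code_cells[0])
```

The extracted cell sources would be the input to description generation, while the imported package names are the context for complementary recommendations.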
III. EXPERIMENTAL RESULTS
The proposed system has been evaluated on 375,128 publicly
available Python files from GitHub, and on 11,893 files
from a proprietary repository.
Complementary library recommendation: to evaluate
complementary library recommendation, we randomly remove one
arXiv:2210.05406v2 [cs.SE] 7 Feb 2023