ProtoFold Neighborhood Inspector
Nicolas F. Chaves-de-Plaza*, Klaus Hildebrandt* and Anna Vilanova**
*CGV group, TU Delft | **Mathematics and Computer Science, TU Eindhoven
1 INTRODUCTION
Post-translational modifications (PTMs) affecting a protein's
residues (amino acids) can disturb its function, leading to illness.
Whether or not a PTM is pathogenic depends on its type and the
status of neighboring residues. In this paper, we present the
ProtoFold Neighborhood Inspector (PFNI), a visualization system
for analyzing residue neighborhoods. The main contribution is a
visualization idiom, the Residue Constellation (RC), for identifying
and comparing three-dimensional neighborhoods based on per-
residue features and spatial characteristics. The RC leverages two-
dimensional representations of the protein's three-dimensional
structure to overcome problems like occlusion, easing the analysis
of neighborhoods that often have complicated spatial arrangements.
Using the PFNI, we explored proteins’ structural PTM data, which
allowed us to identify patterns in the distribution and quantity of
per-neighborhood PTMs that might be related to their pathogenic
status. In the following, we define the tasks that guided the
development of the PFNI and describe the data sources we derived
and used. Then, we introduce the PFNI and illustrate its usage
through an example of an analysis workflow. We conclude by
reflecting on preliminary findings obtained while using the tool on
the provided data and future directions concerning the development
of the PFNI.
2 DATA
The challenge organizers provided two datasets that describe how
PTMs relate to proteins’ structures. First, the structures dataset
includes, for each residue (amino acid) of three proteins across two
organisms, the 3D coordinates of its alpha, beta, and carboxyl
carbon atoms and its nitrogen atom. It also includes, for each
residue, an estimate of the prediction confidence (pLDDT) and the
type of structure. Second, the modifications dataset lists the PTMs
that affect the residues in the structures dataset. Each row in the
modifications dataset describes a PTM that affects a protein’s
residue, including the PTM’s name/type, classification, and
pathogenic status. A residue in the structures dataset can be related
to multiple modifications.
Finally, we derived a third dataset, neighbors, from the structures
dataset. The neighbors dataset lists the thirty nearest neighbors of
every residue within every protein structure. We used the Euclidean
distance between the residues’ alpha carbons as the metric to select
the neighbors.
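The derivation of the neighbors dataset can be sketched as follows. This is a minimal brute-force illustration, not the implementation used for the PFNI. The function name `k_nearest_neighbors`, the input layout (alpha-carbon coordinates listed in chain order), and the exclusion of a residue from its own neighbor list are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coordinate = Tuple[float, float, float]


def k_nearest_neighbors(
    ca_coords: Sequence[Coordinate], k: int = 30
) -> List[List[Tuple[int, float]]]:
    """For each residue, return its k nearest residues as (index, distance).

    Distances are Euclidean distances between alpha-carbon positions.
    Brute force over all pairs; assumed adequate for single protein chains.
    """
    neighbors = []
    for i, origin in enumerate(ca_coords):
        # Distance from residue i to every other residue.
        # Excluding the residue itself is an assumption of this sketch.
        candidates = [
            (j, math.dist(origin, other))
            for j, other in enumerate(ca_coords)
            if j != i
        ]
        # Sort by distance; ties are broken by chain index for determinism.
        candidates.sort(key=lambda pair: (pair[1], pair[0]))
        neighbors.append(candidates[:k])
    return neighbors
```

With `k=30`, each entry of the result would correspond to the rows of the neighbors dataset for one residue.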
3 MODELING AND TASKS
We define the general goal of the visualization system as enabling
the analysis of three-dimensional neighborhoods of proteins'
residues. Before proceeding, we formulate this goal explicitly using
the language of graph theory.
We model proteins as graphs where residues are nodes indexed
by the position in the protein's primary structure. To simplify
modeling, we use the position of the residue’s alpha carbon as a
proxy for the residue’s position. Links between nodes are of
two types: topological links between consecutive residues
and spatial ones between residues that are close in three-
dimensional space. The PFNI focuses on the latter. Finally, nodes
and links have attributes. Examples of the former include the
number and type of PTMs and the pLDDT score. As for the latter,
we consider the Euclidean distance between the nodes that the link
connects.
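A minimal sketch of this graph model is given below. The class names `ResidueNode` and `Link`, the function `build_links`, and the `spatial_cutoff` parameter are hypothetical. In particular, defining spatial links by a distance cutoff is an assumption made for illustration; spatial links could equally be taken from the neighbors dataset.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class ResidueNode:
    index: int                               # position in the primary structure
    ca_position: Tuple[float, float, float]  # alpha carbon as position proxy
    attributes: Dict[str, float] = field(default_factory=dict)  # e.g. PTM count


@dataclass
class Link:
    source: int
    target: int
    kind: str        # "topological" or "spatial"
    distance: float  # Euclidean distance between the linked nodes


def build_links(nodes: Sequence[ResidueNode], spatial_cutoff: float) -> List[Link]:
    """Build both link types; nodes are assumed to be sorted by chain index."""
    links: List[Link] = []
    # Topological links connect consecutive residues in the chain.
    for a, b in zip(nodes, nodes[1:]):
        d = math.dist(a.ca_position, b.ca_position)
        links.append(Link(a.index, b.index, "topological", d))
    # Spatial links connect non-consecutive residues that are close in 3D.
    # The cutoff criterion is an assumption of this sketch.
    for i, a in enumerate(nodes):
        for b in nodes[i + 2:]:
            d = math.dist(a.ca_position, b.ca_position)
            if d <= spatial_cutoff:
                links.append(Link(a.index, b.index, "spatial", d))
    return links
```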
Considering this definition of the problem and the overall goal,
we now list the tasks that the PFNI should support:
• (T1) Select nodes (residues) for neighborhood analysis
based on node-level and neighborhood-level characteristics.
• (T2) Identify salient patterns in the distribution of a
neighborhood's node- and link-level characteristics.
• (T3) Compare neighborhoods to assess their similarity, or
dissimilarity, under a feature of interest.
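To make T1 concrete, the sketch below selects residues using one node-level and one neighborhood-level criterion. The function `select_residues`, its thresholds, and the choice of PTM counts as the characteristic are illustrative assumptions; the PFNI supports this task interactively rather than through such a function.

```python
from typing import Dict, List, Sequence


def select_residues(
    ptm_counts: Sequence[int],
    neighbors: Dict[int, List[int]],
    min_own: int,
    min_neighborhood: int,
) -> List[int]:
    """Return indices of residues that pass both thresholds.

    ptm_counts[i] is the number of PTMs on residue i (node-level).
    neighbors[i] lists the indices of residue i's spatial neighbors.
    """
    selected = []
    for idx, own in enumerate(ptm_counts):
        # Neighborhood-level characteristic: total PTMs on the neighbors.
        neighborhood_total = sum(ptm_counts[j] for j in neighbors.get(idx, []))
        if own >= min_own and neighborhood_total >= min_neighborhood:
            selected.append(idx)
    return selected
```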
4 VISUALIZATION SYSTEM: PFNI
Figure 1 presents an overview of the ProtoFold Neighborhood
Inspector (PFNI). We developed the prototype using web
technologies like D3 [1] and Three.js [2]. The following paragraphs
describe the system’s main idiom, the Residue Constellation, and
the supporting Three-Dimensional and Bulk Selection widgets.
4.1 Residue Constellation
4.1.1 Primary Structure Orbits (PSO)
We use a radial layout to depict the protein’s primary structure in
two dimensions. The main goal of the PSO is to let users identify
residues for subsequent neighborhood-level analysis. The PSO
comprises three orbits, each divided into as many arcs of equal
length as there are residues in the protein. Figure 2 presents the three orbits
in detail. The inner orbit displays the position of each residue in the
chain using a sequential color scale. In the middle orbit, the colors
of the arcs change depending on the user-selected categorical
variable of interest. Finally, the outer orbit uses the arcs’ thickness
to encode a user-selected numeric variable. In Figure 2, the numeric
variable is the number of modifications each residue has.
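The geometry of the outer orbit can be sketched as follows: each residue receives an arc of equal angular length, and the arc's thickness is scaled from the numeric variable. The function name `pso_arcs`, the linear scale, and the default thickness range are assumptions, not values taken from the PFNI.

```python
import math
from typing import Dict, List, Sequence


def pso_arcs(
    values: Sequence[float],
    min_thickness: float = 2.0,
    max_thickness: float = 20.0,
) -> List[Dict[str, float]]:
    """Compute one arc per residue for an orbit of the radial layout.

    All arcs span the same angle; thickness encodes the numeric variable
    through a linear scale (the scale type is an assumption).
    """
    n = len(values)
    if n == 0:
        return []
    step = 2 * math.pi / n  # equal angular length per residue
    lo, hi = min(values), max(values)
    span = hi - lo
    arcs = []
    for i, value in enumerate(values):
        # Normalize to [0, 1]; a constant variable maps to the minimum.
        t = (value - lo) / span if span > 0 else 0.0
        arcs.append(
            {
                "start_angle": i * step,
                "end_angle": (i + 1) * step,
                "thickness": min_thickness + t * (max_thickness - min_thickness),
            }
        )
    return arcs
```

Start and end angles of this form could then be passed to an arc generator such as the one D3 provides, with the thickness added to the orbit's inner radius to obtain the outer radius.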
Figure 5 shows an additional layer of information that the PSO
provides. When the user hovers over a residue, ribbons connecting
the residue to its neighbors appear. Using this transient ribbon view,
the user can get a sense of the residue’s neighborhood before adding
it to the Neighborhoods Force Layout.
We chose a radial layout to depict the protein’s primary structure
for two reasons. First, it places the residues of long chains
side by side, so the thickness of the orbits’ arcs can be used to
compare numerical values. Second, because the
center of the PSO remains empty, it allows an efficient overview of
the neighborhoods via the ribbon view.
4.1.2 Neighborhoods Force Layout (NFL)
In the center of the Residue Constellation lies the Neighborhoods
Force Layout (NFL), which provides a 2D representation of the
protein’s 3D neighborhoods. Specifically, this view organizes user-
selected residues (primary nodes) and their neighbors (secondary
nodes) using a force layout, similar to [3]. The NFL depicts
primary and secondary nodes differently to reduce clutter and allow
multiple levels of neighborhood analysis, as Figures 3 and 4
illustrate. On the one hand, the NFL uses the neighborhood
summarization glyph (following subsection) to depict primary