of biomarkers of tumors on MRI images [12, 30, 46], and organ
structure analysis on MRI images [17, 20, 13, 41]. DNNs
have been recently applied to the task of classifying embryo
stage development. Existing methods [21, 23, 24, 25, 26]
treat time-lapse embryo videos as sequences of images
and use 2D-CNNs to perform per-frame classification,
then apply a dynamic programming post-processing step
to force the predictions to satisfy the monotonic
non-decreasing constraint. Such approaches struggle with the
high imbalance across classes and cannot exploit
temporal information. Other works [24, 23] introduce
two-stream networks that incorporate temporal information to
address the imbalance issue while building the monotonic
constraint into the learning stage. Despite showing
promising results, these methods process a fixed-length
frame sequence at a time, which can miss the global
context of the entire video and also increases the inference time.
In this work, we build on the deformable Transformer [49] and
propose an encoder-decoder deformable Transformer network
for embryo stage development classification. Our proposed
network contains three heads for classification,
segmentation, and refinement. Our contribution is two-fold:
• Dataset: We have collected an Embryos Human dataset
with a total of 440 time-lapse videos comprising 148,918 images,
gathered from a real-world environment and from
a diverse set of patients. The dataset has been carefully
pre-processed and annotated by three embryologists.
The data will be made available to the research
community on request to the authors.
• Methodology: We propose EmbryosFormer, an effective
framework for monitoring embryo stage development. Our
network is built on a U-Net-like architecture with
deformable transformer blocks and contains two paths. A
contracting path (i.e., deformable transformer encoder) aims to
predict a label for each image, whereas an expanding path (i.e.,
deformable transformer decoder) models the video at the stage level
by taking temporal consistency into account. The feature
encoding in the contracting path is optimized by a classification
head, and the temporal coherency in the expanding path is
trained by a segmentation head. Both paths are
cooperatively learned by a collaboration head.
We empirically validate the effectiveness of our proposed
EmbryosFormer by showing that, to the best of our
knowledge, it outperforms all current
state-of-the-art methods benchmarked on two datasets,
Embryos Mouse and Embryos Human.
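The dynamic programming post-processing used by the per-frame methods discussed above can be illustrated with a minimal sketch. It is not the implementation of any cited work: the function name `monotonic_decode`, the input layout `probs[t][k]` (probability of stage k at frame t), and the log-probability objective are all assumptions made for illustration. The sketch selects the non-decreasing label sequence with the highest total log-probability.

```python
import math


def monotonic_decode(probs):
    """Return the non-decreasing label sequence with the highest total log-probability.

    probs[t][k] is a hypothetical per-frame probability of stage k at frame t,
    e.g. the softmax output of a 2D-CNN classifier (an assumption of this sketch).
    """
    if not probs:
        return []
    num_frames, num_stages = len(probs), len(probs[0])
    eps = 1e-12  # avoids log(0) for zero probabilities
    logp = [[math.log(p + eps) for p in row] for row in probs]

    # score[t][k]: best total log-probability of frames 0..t with frame t labelled k
    score = [[0.0] * num_stages for _ in range(num_frames)]
    # back[t][k]: label of frame t-1 on that best path
    back = [[0] * num_stages for _ in range(num_frames)]
    score[0] = logp[0][:]

    for t in range(1, num_frames):
        best_prev, best_val = 0, -math.inf
        for k in range(num_stages):
            # Running maximum over previous labels j <= k enforces monotonicity.
            if score[t - 1][k] > best_val:
                best_val, best_prev = score[t - 1][k], k
            score[t][k] = best_val + logp[t][k]
            back[t][k] = best_prev

    # Backtrack from the best final label.
    k = max(range(num_stages), key=lambda s: score[-1][s])
    labels = [k]
    for t in range(num_frames - 1, 0, -1):
        k = back[t][k]
        labels.append(k)
    labels.reverse()
    return labels


# Per-frame argmax would give 0, 1, 0, 2, which violates the constraint.
example = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print(monotonic_decode(example))  # prints [0, 1, 1, 2]
```

The running maximum keeps the cost at O(T·K) for T frames and K stages. Because the constraint is applied only after training in this scheme, the classifier itself never sees it, which is the limitation noted above.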
2. Related Work
2.1. Detection Transformer
The core idea behind the Transformer architecture [42] is
the self-attention mechanism, which captures long-range
relationships. Transformers have been successfully applied to
enrich global information in computer vision [47, 4, 43, 40].
For object detection, Detection Transformer
(DETR) [2] is one of the most well-known approaches. Unlike
traditional CNN-based methods [34, 10], DETR performs the
task as a set prediction.
Although DETR obtains good performance while providing an
efficient way to represent each detected element, it suffers
from high computational complexity, which grows quadratically with
image size, and from slow convergence of the global attention
mechanism. The more recent Deformable Transformer [49] was proposed
to address these limitations while achieving better performance
by incorporating multi-scale feature representation and
attending to sparse spatial locations of images. Beyond
the image domain, DETR has also been successfully applied to
the video domain, e.g., dense video captioning with PDVC [44].
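To make the quadratic cost of global attention concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the input tokens themselves (identity projections), whereas a real Transformer layer uses learned projection matrices and multiple heads. The function names `softmax` and `self_attention` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x is a list of n tokens, each a list of d floats. Returns the attended
    tokens and the n x n attention weight matrix.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    outputs, all_weights = [], []
    for q in x:
        # One score per (query, key) pair: n * n scores in total.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in x]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output token is a weighted average of all value tokens.
        outputs.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(tokens)
print(len(weights), len(weights[0]))  # prints 3 3
```

The weight matrix has one entry per pair of tokens, so its size grows with the square of the number of tokens. Attending only to a small set of sampled locations per query, as the Deformable Transformer does, avoids materializing this full matrix.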
2.2. Embryo stage development classification
Classifying embryo development stages aims to provide
a cue for quality assessment of fertilized blastocysts, which
requires complex analyses of time-lapse imaging videos
besides identifying development stages. Traditionally,
embryologists must review the embryo images to determine the
time of division for each cell development stage. This
process requires not only expert knowledge but also
experience, and it is time-consuming. With the emergence of
DNNs, CNNs have been used to assess embryo images.
Generally, DNN-based embryo stage development
classification can be divided into two categories: image-based
and sequence-based. In the first group, Khan et al. [14]
utilize CNNs (i.e., AlexNet) [16] and a Conditional Random
Field (CRF) [37] to count human embryonic cells over
the first five cell stages. Ng et al. [29] use ResNet [11]
coupled with a dynamic programming algorithm for
post-processing to predict morphokinetic annotations in human
embryos. Later, Lau et al. extend [29] with region of
interest (ROI) detection and LSTM [8] for sequential
classification. Rad et al. [33] propose Cell-Net, which uses
ResNet-50 [11] to parse the centroid of each cell from an embryo
image. Leahy et al. [21] extract five key features from
time-lapse videos, including stage classification, which
uses ResNeXt101 [45] to predict per-class probabilities for
each image. Malmsten et al. [25] use Inception-V3 [39]
to classify human embryo images into different cell
division stages, up to eight cells. While showing promising
results on automatically classifying embryonic cell stage
development with DNNs, image-based prediction approaches
ignore the temporal coherence between time-lapse images and
the monotonic development order constraint during
training. In the second group, Lukyanenko et al. [24]
incorporate CRFs [37] to include the monotonic condition into the
learning process for sequential stage prediction. Lockhart