
Development and validation of deep learning based
embryo selection across multiple days of transfer
Jacob Theilgaard Lassen1,*, Mikkel Fly Kragh1, Jens Rimestad1, Martin Nygård
Johansen1, and Jørgen Berntsen1
1Vitrolife A/S, Aarhus, Denmark
*jtlassen@vitrolife.com
ABSTRACT
This work describes the development and validation of a fully automated deep learning model, iDAScore v2.0, for the evaluation
of embryos incubated for 2, 3, and 5 or more days. The model is trained and evaluated on an extensive and diverse dataset
including 181,428 embryos from 22 IVF clinics across the world. For discriminating transferred embryos with known outcome
(KID), we show AUCs ranging from 0.621 to 0.708 depending on the day of transfer. Predictive performance increased over
time and showed a strong correlation with morphokinetic parameters. The model has equivalent performance to KIDScore D3
on day 3 embryos while significantly surpassing the performance of KIDScore D5 v3 on day 5+ embryos. This model provides
an analysis of time-lapse sequences without the need for user input and a reliable method for ranking embryos by
likelihood of implantation at both cleavage and blastocyst stages. This greatly improves embryo grading consistency and saves
time compared to traditional embryo evaluation methods.
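As an illustration of the discrimination metric reported above, the AUC can be read as the probability that a randomly chosen embryo with a positive outcome receives a higher score than a randomly chosen embryo with a negative outcome, with ties counted as one half. The following is a minimal sketch of that pairwise definition, not the evaluation code used in this work; the function name `auc` and the example scores and labels are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a positive outcome, 0 for a negative outcome.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive embryo ranked above negative embryo
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores and outcomes for six transferred embryos (illustration only).
scores = [9.1, 7.4, 6.8, 5.0, 3.2, 2.5]
labels = [1, 0, 1, 1, 0, 0]
example_auc = auc(scores, labels)  # 7 of 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is chosen for readability; a rank-based computation would be preferable for datasets of the size described here.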
1 Introduction
Prioritizing embryos for transfer and cryopreservation is a long-standing challenge in the field of in vitro fertilization (IVF)
with both academic and commercial research dedicated to its resolution. When multiple good quality embryos are available,
selection of the embryo with the highest likelihood of implantation will shorten time to pregnancy and ultimately live birth.
Traditionally, embryo evaluation has been carried out by manual inspection of either static microscope images or time-lapse
videos of developing embryos. Scoring systems based on morphological and morphokinetic annotations have been used to
rank embryos within patient cohorts and to decide which embryos to discard and which to transfer and/or cryopreserve. In
recent years, however, the use of artificial intelligence (AI) to evaluate embryos has shown promise in both automating the
assessment and potentially surpassing the ranking performance of manual inspection [1].
Increasingly, the blastocyst stage has become the preferred development stage for transfer [2], and most AI models for
embryo evaluation specifically address embryos cultured to day 5 or later [3, 4, 5, 6]. However, blastocyst culture generally
results in a lower number of embryos to choose from, and for patients with poor embryo development, cleavage-stage transfers
may be preferred if there is a risk of a cancelled cycle [2]. Few studies exist that focus on both cleavage-stage and blastocyst
transfers [7, 8]. Erlich et al. [7] propose a combined model for handling day 3 and day 5 transfers, by predicting a score for
each image in a time-lapse sequence. Scores from previous images in the sequence are then aggregated temporally. The authors
claim that the method provides continuous scoring regardless of development stage and time, and that it outperforms the manual
morphokinetic model, KIDScore D3 [9]. However, as they only evaluate on day 5 transfers, they ignore the possibility that
different embryo characteristics may not be equally important for day 3 and day 5 transfers. Kan-Tor et al. [8] also propose a
combined model for handling day 3 and day 5 transfers, by first predicting scores for non-overlapping temporal windows, and
then aggregating scores from previous windows in the sequence using logistic regression. The authors show both discrimination
and calibration results together with subgroup analyses on patient age and clinics for day 5 transfers. However, for day 3
transfers, only the overall discrimination performance is presented. Therefore, the calibration and generalization performance
on day 3 embryos across subgroups such as patient age and clinics remains to be seen.
In addition to day of transfer, current AI models often differ in how they approach automation. Some methods assume
manual preselection by embryologists and can thus be categorized as semi-automated. These are methods that have only been
trained on transferred embryos and therefore generally have not seen embryos of poor quality [4, 5, 6]. Other methods approach
full automation by training on all embryos, regardless of whether they were transferred or not. These methods rely on
labels other than pregnancy for the non-transferred embryos, such as manual deselection by embryologists (discards), results of
preimplantation genetic testing for aneuploidy (PGT-A), or morphokinetic and/or morphological annotations [7, 10, 3]. For an
AI model to be both fully automated and superior in ranking performance on previously transferred embryos, both aspects need
arXiv:2210.02120v1 [q-bio.QM] 5 Oct 2022