
Development and validation of deep learning based embryo selection across multiple days of transfer

Jacob Theilgaard Lassen1,*, Mikkel Fly Kragh1, Jens Rimestad1, Martin Nygård Johansen1, and Jørgen Berntsen1

1Vitrolife A/S, Aarhus, Denmark

*jtlassen@vitrolife.com

ABSTRACT

This work describes the development and validation of a fully automated deep learning model, iDAScore v2.0, for the evaluation of embryos incubated for 2, 3, and 5 or more days. The model is trained and evaluated on an extensive and diverse dataset including 181,428 embryos from 22 IVF clinics across the world. For discriminating transferred embryos with known outcome (KID), we show AUCs ranging from 0.621 to 0.708 depending on the day of transfer. Predictive performance increased over time and showed a strong correlation with morphokinetic parameters. The model has equivalent performance to KIDScore D3 on day 3 embryos while significantly surpassing the performance of KIDScore D5 v3 on day 5+ embryos. This model provides an analysis of time-lapse sequences without the need for user input, and provides a reliable method for ranking embryos for likelihood to implant, at both cleavage and blastocyst stages. This greatly improves embryo grading consistency and saves time compared to traditional embryo evaluation methods.

1 Introduction

Prioritizing embryos for transfer and cryopreservation is a long-standing challenge in the field of in vitro fertilization (IVF), with both academic and commercial research dedicated to its resolution. When multiple good quality embryos are available, selection of the embryo with the highest likelihood of implantation will shorten time to pregnancy and ultimately live birth. Traditionally, embryo evaluation has been carried out by manual inspection of either static microscope images or time-lapse videos of developing embryos. Scoring systems based on morphological and morphokinetic annotations have been used to rank embryos within patient cohorts and to decide which embryos to discard and which to transfer and/or cryopreserve. In recent years, however, the use of artificial intelligence (AI) to evaluate embryos has shown promise in both automating the assessment and potentially surpassing the ranking performance of manual inspection [1].

Increasingly, the blastocyst stage has become the preferred development stage for transfer [ ], and most AI models for embryo evaluation specifically address embryos cultured to day 5 or later [ ]. However, blastocyst culture generally results in a lower number of embryos to choose from, and for patients with poor embryo development, cleavage-stage transfers may be preferred if there is a risk of a cancelled cycle [ ]. Few studies exist that focus on both cleavage-stage and blastocyst transfers [ ]. Erlich et al. [7] propose a combined model for handling day 3 and day 5 transfers by predicting a score for each image in a time-lapse sequence. Scores from previous images in the sequence are then aggregated temporally. The authors claim that the method provides continuous scoring regardless of development stage and time, and that it outperforms the manual morphokinetic model, KIDScore D3 [ ]. However, as they only evaluate on day 5 transfers, they ignore the possibility that different embryo characteristics may not be equally important for day 3 and day 5 transfers. Kan-Tor et al. [8] also propose a combined model for handling day 3 and day 5 transfers, by first predicting scores for non-overlapping temporal windows and then aggregating scores from previous windows in the sequence using logistic regression. The authors show both discrimination and calibration results together with subgroup analyses on patient age and clinics for day 5 transfers. However, for day 3 transfers, only the overall discrimination performance is presented. Therefore, the calibration and generalization performance on day 3 embryos across subgroups such as patient age and clinics remains to be seen.

In addition to day of transfer, current AI models often differ in how they approach automation. Some methods assume manual preselection by embryologists and can thus be categorized as semi-automated. These are methods that have only been trained on transferred embryos and therefore generally have not seen embryos of poor quality [ ]. Other methods approach full automation by training on all embryos, regardless of whether they were transferred or not. These methods rely on labels other than pregnancy for the non-transferred embryos, such as manual deselection by embryologists (discards), results of preimplantation genetic testing for aneuploidy (PGT-A), or morphokinetic and/or morphological annotations [ ]. For an AI model to be both fully automated and superior in ranking performance on previously transferred embryos, both aspects need to be evaluated [ ]. Performance on both transferred embryos with known implantation data (KID) and non-transferred embryos of different qualities and development stages needs to be evaluated in order to ensure general prospective use.

arXiv:2210.02120v1 [q-bio.QM] 5 Oct 2022

[Figure 1: flowchart. Inputs: Day 2, Day 3, Day 5+. Components: Day 2/3 model, Direct cleavage model, Combination model, Day 5+ model; Day 2, Day 3 and Day 5 calibration; Scaling. Legend: CNN, logistic regression, linear transformation.]

Figure 1. iDAScore v2.0. Two separate tracks handle day 2/3 and day 5+ embryos. The first track consists of two 3D convolutional neural networks (CNN) that predict implantation potential and direct cleavages, followed by separate calibration models for day 2 and 3. The second track consists of a 3D CNN that predicts implantation potential followed by a day 5+ calibration model. Finally, scores from both tracks are scaled linearly to the range 1.0–9.9.

In this study, we describe the development and validation of a fully automated AI model, iDAScore v2.0, for embryo evaluation on day 2, day 3 and day 5+ embryos. As in our previous work [3], the model is based on 3D convolutions that simultaneously identify both spatial (morphological) and temporal (morphokinetic) patterns in time-lapse image sequences. However, whereas our previous work only dealt with ranking performance, in this study we also calibrate the model to obtain a linear relationship between model predictions and implantation rates. We train and evaluate our model on an extensive and diverse dataset including 181,428 embryos from 22 IVF clinics across the world. On independent test data, we present both discrimination and calibration performance for embryos transferred after 2, 3 and 5 or more days of incubation, individually, and compare with iDAScore v1 [3] and the manual morphokinetic models, KIDScore D3 [ ] and KIDScore D5 v3 [14]. We also present discrimination performance for a range of subgroups including patient age, insemination method, transfer protocol, year of treatment, and fertility clinic. Finally, we perform temporal analyses on score developments from day 2 to 5 to illustrate improvements over time in discrimination performance, temporal changes in ranking, and relation to common morphokinetic parameters used for traditional embryo selection. To the best of our knowledge, our work presents the first AI-based model for ranking embryos from day 2 to day 5+, and is the first study to present calibration curves and subgroup analyses on transferred cleavage-stage embryos.

2 Materials / Methods

Study design

The study was a multi-center retrospective cohort study consisting of 249,635 embryos from 34,620 IVF treatments carried out across 22 clinics from 2011 to 2020. As the study focused on day 2, day 3 and day 5+ transfers, day 1 (n=1,243) and day 4 (n=182) embryos were excluded, corresponding to embryos incubated less than 36 hours post insemination (hpi) and embryos incubated between 84 hpi and 108 hpi. Furthermore, embryos without known clinical fate were excluded, either because of loss to follow-up (n=3,192) or because they were still cryopreserved at the time of data collection and thus had pending outcomes (n=50,392). After data exclusion, 181,428 embryos remained, of which 33,687 were transferred embryos with known implantation data (KID), measured by the presence of a fetal heartbeat, and 147,741 were discarded by embryologists due to arrested development, failed fertilization, aneuploidy, or other clinical deselection criteria. Finally, the dataset was split into training (85%) and testing (15%) subsets at treatment level, ensuring that all embryos within a given treatment were allocated either to training or to testing. While this split strategy allows cohort analyses on the test set, it also mitigates certain types of bias, as the AI model cannot benefit from overfitting to individual patients in the training set. A flow diagram illustrating patients, exclusion of data points (embryos), and division into training and test subsets is shown in Figure 2.
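The treatment-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the `treatment_id` key, the function name `split_by_treatment`, and the choice to apply the test fraction to treatments (rather than to embryos) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_treatment(embryos, test_fraction=0.15, seed=0):
    """Split embryo records so that each treatment lands wholly in train or test.

    `embryos` is a list of dicts with a hypothetical "treatment_id" key.
    The fraction is applied to the number of treatments, so the embryo-level
    proportions are only approximately 85/15.
    """
    by_treatment = defaultdict(list)
    for embryo in embryos:
        by_treatment[embryo["treatment_id"]].append(embryo)

    # Shuffle treatment identifiers reproducibly, then pick the test treatments.
    treatment_ids = sorted(by_treatment)
    random.Random(seed).shuffle(treatment_ids)
    n_test = round(len(treatment_ids) * test_fraction)
    test_treatments = set(treatment_ids[:n_test])

    train, test = [], []
    for tid, group in by_treatment.items():
        (test if tid in test_treatments else train).extend(group)
    return train, test


# Synthetic toy data (not study data): 20 treatments with 1 to 5 embryos each.
toy = [
    {"embryo_id": f"t{t}-e{e}", "treatment_id": t}
    for t in range(20)
    for e in range(1 + t % 5)
]
train, test = split_by_treatment(toy)
train_ids = {e["treatment_id"] for e in train}
test_ids = {e["treatment_id"] for e in test}
```

Because whole treatments are assigned to one side, no patient cohort is shared between the subsets, which is the property the text relies on for cohort analyses on the test set.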

[Figure 2: flowchart.
  Embryos (n=249,635; treatments: 34,620; clinics: 22)
  Excluded embryos (n=1,425): Day 1: 1,243; Day 4: 182
  Follow-up on clinical fate (n=248,210)
  Excluded embryos (n=66,782): Pending: 50,392; Unknown: 3,192
  Full dataset (n=181,428): KIDp: 8,465; KIDn: 25,222; Discard: 147,741
  Train (85%) (n=154,875); Test (15%) (n=26,553)]

Figure 2. Flowchart of the study design.

             Day 2     Day 3     Day 5+     Total
Discarded   12,627    14,121     99,288   126,036
KID-         7,095     4,876      9,656    21,627
KID+         1,453     1,075      4,684     7,212
Total       21,175    20,072    113,628   154,875

(a) Training data

             Day 2     Day 3     Day 5+     Total
Discarded    2,029     2,491     17,185    21,705
KID-         1,165       809      1,621     3,595
KID+           258       194        801     1,253
Total        3,452     3,494     19,607    26,553

(b) Test data

Table 1. Datasets for training and testing the model.

Table 1 shows the specific number of discarded embryos and KID embryos with positive (KID+) and negative (KID-) outcomes for each day in the training and test sets. Table 6, Table 7 and Table 8 in the appendix contain further details on patient age, clinical procedures, and embryos for each clinic in the data subsets of day 2, day 3, and day 5+ embryos, respectively.

Image data

All embryos were cultured in EmbryoScope™, EmbryoScope™+, or EmbryoScope™ Flex incubators (Vitrolife A/S, Aarhus, Denmark). The incubators acquired time-lapse images during embryo development according to specific settings in each clinic. For EmbryoScope™ incubators, microscope images of 3–9 focal planes of size 500 × 500 pixels were acquired every 10–30 minutes. For EmbryoScope™+ or EmbryoScope™ Flex incubators, microscope images of 11 focal planes of size 800 × 800 pixels were acquired every 10 minutes.

Model development

To predict embryo implantation on day 2, 3 and 5+, a combined AI model consisting of several components was developed. Figure 1 shows a flowchart of the model. If an embryo is incubated more than 84 hpi, raw time-lapse images from 20–148 hpi are fed to a 3D convolutional neural network (CNN) that outputs a scalar between 0 and 1 (Day 5+ model). If, however, the embryo is incubated less than 84 hpi, images from 20–84 hpi are fed to two separate CNN models that evaluate overall implantation potential (Day 2/3 model) and the presence of direct cleavages from one to three cells and from two to five cells (Direct cleavage model). The day 2/3 model outputs a scalar between 0 and 1, and the direct cleavage model outputs two scalars between 0 and 1, one for each type of direct cleavage. A logistic regression model then combines the three outputs into a single scalar. Finally, outputs from either the day 2/3 or the day 5+ track are calibrated individually for each day to obtain a linear relationship between scores and implantation rates. At this point, the scores are estimates of pregnancy probabilities representative of the average patient population (including various diagnostic profiles), as opposed to individualized probabilities for each patient. Therefore, to avoid the scores being mistaken for individualized probabilities, the calibrated scores are ultimately rescaled to the range 1.0–9.9, similar to the range used in our previous work [3] and by the manual morphokinetic model, KIDScore D5 v3 [14].
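The routing, combination and rescaling steps described above can be sketched as follows. This is an illustrative outline only: the CNNs and the per-day calibration models are omitted, the function names are hypothetical, the logistic regression weights are made-up placeholders, the treatment of exactly 84 hpi is an assumption, and the rescaling assumes that calibrated probabilities in [0, 1] are mapped onto 1.0–9.9 (the text states only that the scaling is linear).

```python
import math

CUTOFF_HPI = 84.0  # from the text: more than 84 hpi goes to the day 5+ track


def select_track(incubation_hpi):
    """Choose the model track from the incubation time (exactly 84 hpi is an assumption)."""
    return "day5+" if incubation_hpi > CUTOFF_HPI else "day2/3"


def combine_day23(implantation, dc_1_to_3, dc_2_to_5, weights, bias):
    """Logistic regression over the three CNN outputs of the day 2/3 track.

    `weights` and `bias` are placeholders, not fitted values from the paper.
    """
    z = (bias
         + weights[0] * implantation
         + weights[1] * dc_1_to_3
         + weights[2] * dc_2_to_5)
    return 1.0 / (1.0 + math.exp(-z))


def rescale(calibrated, p_min=0.0, p_max=1.0, lo=1.0, hi=9.9):
    """Linear map from a calibrated probability to the 1.0-9.9 score range.

    The input interval [p_min, p_max] is an assumption of this sketch.
    """
    fraction = (calibrated - p_min) / (p_max - p_min)
    return lo + fraction * (hi - lo)


# Hypothetical CNN outputs for one cleavage-stage embryo.
placeholder_weights = (4.0, -2.0, -2.0)
combined = combine_day23(0.6, 0.1, 0.05, placeholder_weights, bias=-2.0)
with_direct_cleavage = combine_day23(0.6, 0.9, 0.05, placeholder_weights, bias=-2.0)
score = rescale(combined)
```

With negative placeholder weights on the direct cleavage outputs, a higher direct cleavage probability lowers the combined output, which matches the intent of penalizing direct cleavages but says nothing about the magnitude learned by the actual model.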

For more details on model architectures, training methodology including data sampling, preprocessing and augmentation strategies, as well as individual results for the components, see Appendix A.

Model validation

Internal validation was used to evaluate the predictive performance of the model on test data in terms of discrimination and calibration [ ]. The area under the receiver operating characteristic curve (AUC) was used to quantify discrimination and reported with 95% confidence intervals using DeLong’s algorithm [ ]. Tests for significant differences in AUC were performed using either paired or unpaired two-tailed DeLong’s test [ ]. Bonferroni-adjusted p-values were used for reporting significant differences between subgroups. Calibration was assessed graphically using observed implantation rates in grouped observations of similar predictions (quantiles) and Loess smoothing [17].
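The two evaluation quantities can be illustrated with a small sketch. It computes the AUC through its rank interpretation and groups predictions into quantile bins for a calibration plot. It does not implement DeLong's confidence intervals or tests, nor Loess smoothing; the function names and toy data are hypothetical, and the pairwise AUC loop is quadratic and meant only for small examples.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_bins(labels, scores, n_bins=5):
    """Group samples by predicted-score quantile.

    Returns one (mean prediction, observed positive rate, count) tuple per bin.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            mean_pred = sum(s for s, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            bins.append((mean_pred, observed, len(chunk)))
    return bins


# Synthetic toy data: 1 = implanted (KID+), 0 = not implanted (KID-).
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_bins = calibration_bins(toy_labels, toy_scores, n_bins=4)
```

Plotting the observed rate against the mean prediction for each bin gives the grouped calibration points; a well-calibrated model places them near the diagonal.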


3 Results

The combined discriminatory performance in terms of AUC for iDAScore v2.0 is presented in Table 2a along with intermediate results by each component from Figure 1. For each day (2, 3 and 5+), the table lists an AUC on all embryos (KID+ vs. KID- and discarded) and on KID embryos (KID+ vs. KID-). The AUCs on day 2, 3 and 5+ were 0.862, 0.873 and 0.954 for all embryos and 0.669, 0.621 and 0.708 for KID embryos. Table 2b provides a comparison with two manual scoring systems, KIDScore D3 [ ] and KIDScore D5 v3 [14], as well as our previous work, iDAScore v1 [3], on embryos in the test set that had the manual morphological and morphokinetic annotations required by KIDScore. iDAScore v1 was evaluated on iDAScore v2.0’s test set. As this includes training samples from v1, the iDAScore v1 performance may be overestimated. On day 3, the AUCs of iDAScore v2.0 and KIDScore D3 on KID embryos were 0.608 and 0.610, with no significant difference according to a paired DeLong’s test (p=0.92). As such, iDAScore v2.0 seems to perform as well as KIDScore D3 at selecting embryos for transfer on day 3, but without requiring any manual annotations. On day 5+, however, the AUCs of iDAScore v2.0 and KIDScore D5 v3 on KID embryos were 0.694 and 0.644, a significant difference (p<0.001). When comparing iDAScore v2.0 against the previous version, iDAScore v1, on KID embryos, the AUCs were 0.694 and 0.672, also a significant difference (p=0.047). This suggests that the increased amount of training data and slightly modified training strategies from v1 to v2.0 have improved the model performance significantly.

The calibration performance of iDAScore v2.0 is shown in Figure 3 for day 2, 3 and 5+, individually. In general, there is good agreement between predicted probabilities and observed implantation rates. Comparing the three curves, we see that both the ranges of predictions and success rates increased from day 2 to 3 and from day 3 to 5+. That is, the best day 3 embryos had higher scores and a higher implantation rate than the best day 2 embryos. On day 5+, we observe both the highest and lowest scores as well as the highest and lowest implantation rates. This suggests that with more information available at the blastocyst stage, the model can more confidently assign a probability of implantation, ranging from around 6% for the lowest scores up to 65% for the highest scores on day 5+. As these predictions were made based on time-lapse images alone, however, they represent average patient probabilities and not individualized patient probabilities. To predict probabilities at the patient level, additional characteristics such as patient demographics and clinical practice should be included in the calibration procedure and analysis [1,18]. However, these aspects are outside the scope of this work.

                Day 2                                 Day 3                                 Day 5+
                All (n=3,452)      KID (n=1,423)      All (n=3,494)      KID (n=1,003)      All (n=19,607)     KID (n=2,422)
Day 2/3         .856 [.840-.872]   .663 [.630-.697]   .862 [.844-.879]   .611 [.569-.654]   -                  -
Combination     .862 [.845-.878]   .669 [.635-.703]   .873 [.854-.891]   .621 [.580-.662]   -                  -
Day 5+          -                  -                  -                  -                  .954 [.950-.958]   .708 [.686-.728]
iDAScore v2.0   .862 [.845-.878]   .669 [.635-.703]   .873 [.854-.891]   .621 [.580-.662]   .954 [.950-.958]   .708 [.686-.728]

(a) Full test set

                Day 3              Day 5+
                KID (n=800)        KID (n=1,175)
KIDScore D3     .610 [.569-.651]   -
KIDScore D5     -                  .644 [.613-.676]
iDAScore v1     -                  .672 [.641-.703]
iDAScore v2.0   .608 [.562-.654]   .694 [.664-.724]

(b) Comparisons on subset of test set with annotations required by KIDScore D3 and D5.

Table 2. AUCs on the test set for the different model components across days of incubation. All denotes KID+ vs. KID- and discarded embryos, whereas KID denotes KID+ vs. KID- embryos. All AUCs are reported with 95% confidence intervals in brackets. (a) lists results on the full test set from Table 1b, whereas (b) compares performance with KIDScore D3, KIDScore D5 and iDAScore v1 models on embryos that have manual annotations available as required by KIDScore.
