
Development and validation of deep learning based
embryo selection across multiple days of transfer
Jacob Theilgaard Lassen1,*, Mikkel Fly Kragh1, Jens Rimestad1, Martin Nygård Johansen1, and Jørgen Berntsen1
1Vitrolife A/S, Aarhus, Denmark
*jtlassen@vitrolife.com
ABSTRACT
This work describes the development and validation of a fully automated deep learning model, iDAScore v2.0, for the evaluation
of embryos incubated for 2, 3, and 5 or more days. The model is trained and evaluated on an extensive and diverse dataset
including 181,428 embryos from 22 IVF clinics across the world. For discriminating transferred embryos with known outcome
(KID), we show AUCs ranging from 0.621 to 0.708 depending on the day of transfer. Predictive performance increased over
time and showed a strong correlation with morphokinetic parameters. The model has equivalent performance to KIDScore D3
on day 3 embryos while significantly surpassing the performance of KIDScore D5 v3 on day 5+ embryos. This model provides
an analysis of time-lapse sequences without the need for user input, and provides a reliable method for ranking embryos for
likelihood to implant, at both cleavage and blastocyst stages. This greatly improves embryo grading consistency and saves
time compared to traditional embryo evaluation methods.
1 Introduction
Prioritizing embryos for transfer and cryopreservation is a long-standing challenge in the field of in vitro fertilization (IVF)
with both academic and commercial research dedicated to its resolution. When multiple good quality embryos are available,
selection of the embryo with the highest likelihood of implantation will shorten time to pregnancy and ultimately live birth.
Traditionally, embryo evaluation has been carried out by manual inspection of either static microscope images or time-lapse
videos of developing embryos. Scoring systems based on morphological and morphokinetic annotations have been used to
rank embryos within patient cohorts as well as to decide which embryos to discard and which to transfer and/or cryopreserve. In
recent years, however, the use of artificial intelligence (AI) to evaluate embryos has shown promise in both automating the
assessment and potentially surpassing the ranking performance of manual inspection [1].
Increasingly, the blastocyst stage has become the preferred development stage for transfer [2], and most AI models for
embryo evaluation specifically address embryos cultured to day 5 or later [3, 4, 5, 6]. However, blastocyst culture generally
results in a lower number of embryos to choose from, and for patients with poor embryo development, cleavage-stage transfers
may be preferred if there is a risk of a cancelled cycle [2]. Few studies exist that focus on both cleavage-stage and blastocyst
transfers [7, 8]. Erlich et al. [7] propose a combined model for handling day 3 and day 5 transfers, by predicting a score for
each image in a time-lapse sequence. Scores from previous images in the sequence are then aggregated temporally. The authors
claim that the method provides continuous scoring regardless of development stage and time, and that it outperforms the manual
morphokinetic model, KIDScore D3 [9]. However, as they only evaluate on day 5 transfers, they do not address the possibility
that embryo characteristics may differ in importance between day 3 and day 5 transfers. Kan-Tor et al. [8] also propose a
combined model for handling day 3 and day 5 transfers, by first predicting scores for non-overlapping temporal windows, and
then aggregating scores from previous windows in the sequence using logistic regression. The authors show both discrimination
and calibration results together with subgroup analyses on patient age and clinics for day 5 transfers. However, for day 3
transfers, only the overall discrimination performance is presented. Therefore, the calibration and generalization performance
on day 3 embryos across subgroups such as patient age and clinics remains to be seen.
In addition to day of transfer, current AI models often differ in how they approach automation. Some methods assume
manual preselection by embryologists and can thus be categorized as semi-automated. These are methods that have only been
trained on transferred embryos and therefore generally have not seen embryos of poor quality [4, 5, 6]. Other methods approach
full automation by training on all embryos, regardless of whether they were transferred or not. These methods rely on other
labels than pregnancy for the non-transferred embryos such as manual deselection by embryologists (discards), results of
preimplantation genetic testing for aneuploidy (PGT-A), or morphokinetic and/or morphological annotations [7, 10, 3]. For an
AI model to be both fully automated and superior in ranking performance on previously transferred embryos, both aspects need
to be evaluated [1, 11]. The performance on both transferred embryos with known implantation data (KID) and non-transferred
embryos of different qualities and development stages needs to be evaluated in order to ensure general prospective use.
arXiv:2210.02120v1 [q-bio.QM] 5 Oct 2022
[Figure 1 diagram. Inputs: Day 2, Day 3, Day 5+. Components: Day 2/3 model, Direct cleavage model, Combination model,
Day 5+ model, Day 2 calibration, Day 3 calibration, Day 5 calibration, Scaling. Legend: CNN, logistic regression, linear transformation.]
Figure 1. iDAScore v2.0. Two separate tracks handle day 2/3 and day 5+ embryos. The first track consists of two 3D
convolutional neural networks (CNN) that predict implantation potential and direct cleavages, followed by separate calibration
models for day 2 and 3. The second track consists of a 3D CNN that predicts implantation potential followed by a day 5+
calibration model. Finally, scores from both tracks are scaled linearly to the range 1.0–9.9.
In this study, we describe the development and validation of a fully automated AI model, iDAScore v2.0, for embryo
evaluation on day 2, day 3 and day 5+ embryos. As in our previous work [3], the model is based on 3D convolutions that
simultaneously identify both spatial (morphological) and temporal (morphokinetic) patterns in time-lapse image sequences.
However, whereas our previous work only dealt with ranking performance, in this study, we also calibrate the model to obtain a
linear relationship between model predictions and implantation rates. We train and evaluate our model on an extensive and
diverse dataset including 181,428 embryos from 22 IVF clinics across the world. On independent test data, we present both
discrimination and calibration performance for embryos transferred after 2, 3 and 5 or more days of incubation, individually,
and compare with iDAScore v1 [3, 12, 13] and the manual morphokinetic models, KIDScore D3 [9] and KIDScore D5 v3 [14].
We also present discrimination performance for a range of subgroups including patient age, insemination method, transfer
protocol, year of treatment, and fertility clinic. Finally, we perform temporal analyses on score developments from day 2 to
5 to illustrate improvements over time in discrimination performance, temporal changes in ranking, and relation to common
morphokinetic parameters used for traditional embryo selection. To the best of our knowledge, our work presents the first
AI-based model for ranking embryos from day 2 to day 5+, and is the first study to present calibration curves and subgroup
analyses on transferred cleavage-stage embryos.
2 Materials / Methods
Study design
The study was a multi-center retrospective cohort study consisting of 249,635 embryos from 34,620 IVF treatments carried out
across 22 clinics from 2011 to 2020. As the study focused on day 2, day 3 and day 5+ transfers, day 1 (n=1,243) and day 4
(n=182) embryos were excluded, corresponding to embryos incubated less than 36 hours post insemination (hpi) and embryos
incubated between 84 hpi and 108 hpi. Furthermore, embryos without a known clinical fate were excluded, either because their
outcomes were lost to follow-up (n=3,192) or because they were still cryopreserved at the time of data collection
and thus had pending outcomes (n=50,392). After data exclusion, 181,428 embryos remained, of which 33,687 were transferred
embryos with known implantation data (KID) measured by the presence of a fetal heartbeat, and 147,741 were discarded by
embryologists either due to arrested development, failed fertilization, aneuploidy, or other clinical deselection criteria. Finally,
the dataset was split into training (85%) and testing (15%) on treatment level, ensuring that all embryos within a given treatment
were either allocated to training or testing. While this split strategy allows cohort analyses on the test set, it also mitigates
certain types of biases, as the AI model cannot benefit from overfitting to individual patients in the training set. A flow diagram
illustrating patients, exclusion of data points (embryos), and division into training and test subsets is shown in Figure 2.
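The treatment-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(embryo_id, treatment_id)`, the function name `split_by_treatment`, and the choice to draw 15% of treatments (rather than 15% of embryos) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_treatment(embryos, test_fraction=0.15, seed=0):
    """Split embryos so that each treatment lands entirely in train or test.

    `embryos` is a list of (embryo_id, treatment_id) pairs. Both identifiers
    are hypothetical stand-ins for whatever keys the real dataset uses.
    """
    # Group embryo ids by the treatment they belong to.
    by_treatment = defaultdict(list)
    for embryo_id, treatment_id in embryos:
        by_treatment[treatment_id].append(embryo_id)

    # Shuffle treatments (not embryos) with a fixed seed for reproducibility.
    treatment_ids = sorted(by_treatment)
    random.Random(seed).shuffle(treatment_ids)

    # Assumption: the test fraction is taken over treatments.
    n_test = round(test_fraction * len(treatment_ids))
    test_treatments = set(treatment_ids[:n_test])

    train, test = [], []
    for treatment_id, ids in by_treatment.items():
        (test if treatment_id in test_treatments else train).extend(ids)
    return train, test, test_treatments
```

Because whole treatments are assigned to one side, sibling embryos from the same cohort never straddle the split, which is what makes cohort-level ranking analyses on the test set meaningful.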
[Figure 2 flowchart, in order of the steps shown:]
Embryos (n=249,635): 34,620 treatments, 22 clinics
  Excluded embryos (n=1,425): Day 1: 1,243; Day 4: 182
Follow-up on clinical fate (n=248,210)
  Excluded embryos (n=66,782): Pending: 50,392; Unknown: 3,192
Full dataset (n=181,428): KIDp: 8,465; KIDn: 25,222; Discard: 147,741
  Train (85%) (n=154,875)
  Test (15%) (n=26,553)
Figure 2. Flowchart of the study design.
(a) Training data
            Day 2    Day 3    Day 5+     Total
Discarded  12,627   14,121    99,288   126,036
KID-        7,095    4,876     9,656    21,627
KID+        1,453    1,075     4,684     7,212
Total      21,175   20,072   113,628   154,875

(b) Test data
            Day 2    Day 3    Day 5+     Total
Discarded   2,029    2,491    17,185    21,705
KID-        1,165      809     1,621     3,595
KID+          258      194       801     1,253
Total       3,452    3,494    19,607    26,553

Table 1. Datasets for training and testing the model.
Table 1 shows the specific number of discarded embryos and KID embryos with positive (KID+) and negative (KID-)
outcomes for each day in the training and test sets. Table 6, Table 7 and Table 8 in the appendix contain further details
on patient age, clinical procedures, and embryos for each clinic in the data subsets of day 2, day 3, and day 5+ embryos,
respectively.
Image data
All embryos were cultured in EmbryoScope, EmbryoScope+, or EmbryoScope Flex incubators (Vitrolife A/S, Aarhus,
Denmark). The incubators acquired time-lapse images during embryo development according to specific settings in each clinic.
For EmbryoScope incubators, microscope images of 3–9 focal planes of size 500 × 500 pixels were acquired every 10–30
minutes. For EmbryoScope+ or EmbryoScope Flex incubators, microscope images of 11 focal planes of size 800 × 800
pixels were acquired every 10 minutes.
Model development
To predict embryo implantation on day 2, 3 and 5+, a combined AI model consisting of several components was developed.
Figure 1 shows a flowchart of the model. If an embryo is incubated for more than 84 hpi, raw time-lapse images from 20–148 hpi
are fed to a 3D convolutional neural network (CNN) that outputs a scalar between 0 and 1 (Day 5+ model). If, however, the embryo
is incubated for less than 84 hpi, images from 20–84 hpi are fed to two separate CNN models that evaluate overall implantation
potential (Day 2/3 model) and the presence of direct cleavages from one to three cells and from two to five cells (Direct cleavage
model). The day 2/3 model outputs a scalar between 0 and 1, and the direct cleavage model outputs two scalars between 0 and 1
(one for each type of direct cleavage). A logistic regression model then combines the three outputs into a single scalar. Finally,
outputs from either the day 2/3 track or the day 5+ track are calibrated individually for each day to obtain a linear relationship
between scores and implantation rates. At this point, the scores are estimates of pregnancy probabilities representative of the
average patient population (including various diagnostic profiles), as opposed to individualized probabilities for each patient.
Therefore, to keep the scores from being mistaken for individualized probabilities, the calibrated scores are ultimately rescaled
to the range 1.0–9.9, similar to the range used in our previous work [3] and by the manual morphokinetic model, KIDScore D5 v3 [14].
For more details on model architectures, training methodology including data sampling, preprocessing and augmentation
strategies as well as individual results for the components, see Appendix A.
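The routing, combination, calibration and scaling steps described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: the CNN outputs are taken as given numbers, the logistic regression weights and bias are made-up placeholders, the per-day calibrators are passed in as arbitrary callables, the linear scaling is assumed to map probability 0 to 1.0 and probability 1 to 9.9, and an embryo at exactly 84 hpi is routed to the day 2/3 track because the text does not specify that case.

```python
import math

CUTOFF_HPI = 84.0  # from the text: above this, the day 5+ track is used


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def combine_day23(implantation, dc_1_to_3, dc_2_to_5, weights, bias):
    """Logistic regression over the three day 2/3 CNN outputs.

    `weights` and `bias` are hypothetical; the fitted values are not given.
    """
    z = (bias
         + weights[0] * implantation
         + weights[1] * dc_1_to_3
         + weights[2] * dc_2_to_5)
    return sigmoid(z)


def scale_score(probability, low=1.0, high=9.9):
    """Linear map to the reported score range (endpoints are an assumption)."""
    return low + (high - low) * probability


def score_embryo(hours_incubated, day, cnn_outputs, calibrators,
                 weights=(4.0, -2.0, -2.0), bias=-2.0):
    """Route to a track, calibrate per day, and scale to 1.0-9.9."""
    if hours_incubated > CUTOFF_HPI:
        raw = cnn_outputs["day5plus"]
    else:
        raw = combine_day23(cnn_outputs["day23"],
                            cnn_outputs["dc_1_to_3"],
                            cnn_outputs["dc_2_to_5"],
                            weights, bias)
    calibrated = calibrators[day](raw)  # one calibrator per day of incubation
    return round(scale_score(calibrated), 1)
```

With negative placeholder weights on the two direct cleavage outputs, a predicted direct cleavage lowers the combined score, which matches the role the direct cleavage model plays in the flowchart.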
Model validation
Internal validation was used to evaluate the predictive performance of the model on test data in terms of discrimination and
calibration [15, 1]. The area under the receiver operating characteristic curve (AUC) was used to quantify discrimination
and reported with 95% confidence intervals using DeLong’s algorithm [16]. Tests for significant differences in AUC were
performed using either paired or unpaired two-tailed DeLong’s test [16]. Bonferroni-adjusted p-values were used for reporting
significant differences between subgroups. Calibration was assessed graphically using observed implantation rates in grouped
observations of similar predictions (quantiles) and Loess smoothing [17].
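The two validation quantities can be illustrated with a short sketch. It computes the AUC from its rank interpretation (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half) and groups predictions into quantile bins to compare mean predicted probability with observed implantation rate. The function names are hypothetical, the pairwise AUC is quadratic in sample size and meant only for illustration, and DeLong confidence intervals and Loess smoothing are deliberately left out.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_table(probabilities, labels, n_groups=5):
    """Quantile-grouped calibration: (mean prediction, observed rate, n) per bin."""
    pairs = sorted(zip(probabilities, labels))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        # Equal-sized bins over the sorted predictions.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed, len(chunk)))
    return rows
```

A well-calibrated model yields rows in which the mean prediction and the observed rate are close in every bin; plotting one against the other gives the kind of calibration curve referred to in the text.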
3 Results
The combined discriminatory performance in terms of AUC for iDAScore v2.0 is presented in Table 2a along with intermediate
results by each component from Figure 1. For each day (2, 3 and 5+), the table lists an AUC on all embryos (KID+ vs. KID-
and discarded) and on KID embryos (KID+ vs. KID-). The AUCs on day 2, 3 and 5+ were 0.862, 0.873 and 0.954 for all
embryos and 0.669, 0.621 and 0.708 for KID embryos. Table 2b provides a comparison with two manual scoring systems,
KIDScore D3 [9] and KIDScore D5 v3 [14], as well as our previous work, iDAScore v1 [3], on embryos in the test set that had
manual morphological and morphokinetic annotations required by KIDScore. iDAScore v1 was evaluated on iDAScore v2.0’s
test set. As this includes training samples from v1, the iDAScore v1 performance may be overestimated. On day 3, the AUCs
of iDAScore v2.0 and KIDScore D3 on KID embryos were 0.608 and 0.610, with no significant difference according to a
paired DeLong’s test (p=0.92). As such, iDAScore v2.0 seems to perform as well as KIDScore D3 in selecting embryos for
transfer on day 3, but without requiring any manual annotations. On day 5+, however, the AUCs of iDAScore v2.0 and
KIDScore D5 v3 on KID embryos were 0.694 and 0.644 and significantly different (p<0.001). When comparing iDAScore
v2.0 against the previous version, iDAScore v1, on KID embryos, AUCs were 0.694 and 0.672 and also significantly different
(p=0.047). This suggests that the increased amount of training data and slightly modified training strategies from v1 to v2.0
have improved the model performance significantly.
The calibration performance of iDAScore v2.0 is shown in Figure 3 for day 2, 3 and 5+, individually. In general, there is
good agreement between predicted probabilities and observed implantation rates. Comparing the three curves, we see that both
the ranges of predictions and success rates increased from day 2 to 3 and from day 3 to 5+. That is, the best day 3 embryos had
higher scores and a higher implantation rate than the best day 2 embryos. On day 5+, we observe both the highest and
lowest scores as well as the highest and lowest implantation rates. This suggests that with more information available at the
blastocyst stage, the model can more confidently assign a probability of implantation, ranging from around 6% for the lowest
scores up to 65% for the highest scores on day 5+. As these predictions were made based on time-lapse images alone, however,
they represent average patient probabilities and not individualized patient probabilities. To predict probabilities at the patient
level, additional characteristics such as patient demographics and clinical practice should be included in the calibration
procedure and analysis [1, 18]. However, these aspects are outside the scope of this work.
(a) Full test set
               Day 2                                Day 3                                Day 5+
               All (n=3,452)     KID (n=1,423)      All (n=3,494)     KID (n=1,003)      All (n=19,607)    KID (n=2,422)
Day 2/3        .856 [.840-.872]  .663 [.630-.697]   .862 [.844-.879]  .611 [.569-.654]   -                 -
Combination    .862 [.845-.878]  .669 [.635-.703]   .873 [.854-.891]  .621 [.580-.662]   -                 -
Day 5+         -                 -                  -                 -                  .954 [.950-.958]  .708 [.686-.728]
iDAScore v2.0  .862 [.845-.878]  .669 [.635-.703]   .873 [.854-.891]  .621 [.580-.662]   .954 [.950-.958]  .708 [.686-.728]

(b) Comparisons on subset of test set with annotations required by KIDScore D3 and D5.
               Day 3              Day 5+
               KID (n=800)        KID (n=1,175)
KIDScore D3    .610 [.569-.651]   -
KIDScore D5    -                  .644 [.613-.676]
iDAScore v1    -                  .672 [.641-.703]
iDAScore v2.0  .608 [.562-.654]   .694 [.664-.724]

Table 2. AUCs on the test set for the different model components across days of incubation. All denotes KID+ vs. KID- and
discarded embryos, whereas KID denotes KID+ vs. KID- embryos. All AUCs are reported with 95% confidence intervals in
brackets. (a) lists results on the full test set from Table 1b, whereas (b) compares performance with KIDScore D3, KIDScore
D5 and iDAScore v1 models on embryos that have manual annotations available as required by KIDScore.