Desiderata for next generation of ML model serving

Sherif Akoush ∗ †
sa@seldon.io

Andrei Paleyes ∗ † ‡
ap2169@cam.ac.uk

Arnaud Van Looveren †
avl@seldon.io

Clive Cox †
cc@seldon.io

∗ Equal contribution
† Seldon Technologies
‡ Department of Computer Science and Technology, University of Cambridge

NeurIPS 2022 Workshop on Challenges in Deploying and Monitoring Machine Learning Systems.
arXiv:2210.14665v2 [cs.LG] 22 Nov 2022
Abstract
Inference is a significant part of ML software infrastructure. Despite the variety
of inference frameworks available, the field as a whole can be considered in its
early days. This position paper puts forth a range of important qualities that the next
generation of inference platforms should aim for. We present our rationale
for the importance of each quality, and discuss ways to achieve it in practice. We
propose to focus on data-centricity as the overarching design pattern which enables
smarter ML system deployment and operation at scale.
1 Introduction
Model inference has become an important part of modern machine learning (ML) infrastructure.
Various sources estimate that up to 90% of ML compute resources are used for inference tasks [17, 18, 32, 42, 53]. To meet the growing demand for inference infrastructure, a number of model serving platforms have appeared on the market [4, 12, 13, 14]. In addition, cloud providers also offer services that simplify model serving for their users [1, 9].
While the existing ML model serving frameworks answer some of the initial challenges of ML deployment, we believe there are a number of important properties such frameworks need to have to ensure a seamless model deployment experience. For instance, effective monitoring and explainability of an entire inference pipeline is an open research direction. Efficient use of infrastructure that also optimises for energy consumption and environmental footprint is missing. Seamless ML deployment to different targets, such as edge devices, is required. While ML serving frameworks are taking steps to provide some of these features, we think that taking all desired aspects of the system into consideration is challenging [41] and requires leveraging different architecture patterns, but will in general lead to better solutions developed by the community.
In this position paper we present a set of desired features for ML deployment: a blueprint for the next generation of ML model serving frameworks. We discuss the motivation for each feature and provide initial pointers on ways to achieve it. We hope to bring the community and practitioners together to discuss these challenges, find ideal solutions, and shape the future of ML serving.
2 Desiderata
In this section we present nine qualities that are important for ML serving. We advocate that designing
the system with data-centricity [7, 35] as the highest priority enables these features.
2.1 Inference pipelines as dataflow graphs
Model inference is a complex data processing pipeline. It can include input and output data transformations, multiple ML models, monitoring components, custom business logic, and so on.
Figure 1: An inference pipeline of a ride-sharing service, motivated by an example from Uber [29]. It ingests multiple data sources, contains several business logic operations and ML models, and produces several outputs.
An example of a complex inference graph is shown in Figure 1. It is imperative to have a clear view of the flow of data through the entire pipeline, both for its developers and its users. Runtime access to any intermediate data in the pipeline allows for a better experimentation, troubleshooting and monitoring experience, all of which are discussed later in this paper.
There are two possible ways for a model serving platform to facilitate access to the dataflow graph at runtime. It can be discovered post hoc, for example with distributed tracing systems [11, 19]. This approach is applicable to platforms implemented with service-oriented approaches, such as microservices. While a popular paradigm for software system design, service orientation might be ill-suited for ML inference, as its control-flow nature poorly reflects the data-centricity of the inference process. Alternatively, inference pipelines can be built with a dataflow-first approach, such as the flow-based programming (FBP) paradigm, as is already the case for ML training pipelines [20, 34]. FBP provides access to the dataflow graph naturally [40], and thus might be a more appropriate choice for implementing inference graphs.
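To make this concrete, below is a minimal sketch of an inference pipeline declared as an explicit dataflow graph, in the spirit of FBP. It is written in plain Python; the Node and Pipeline abstractions, the tap mechanism, and the naive scheduling logic are assumptions made for illustration, not the API of any existing framework. The point it demonstrates is that when the graph is a first-class object, any intermediate edge can be observed at runtime without modifying the pipeline itself.

```python
# Minimal sketch of a dataflow-first inference pipeline. The Node and
# Pipeline abstractions are hypothetical, not a real framework's API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Node:
    name: str
    fn: Callable[[Any], Any]

@dataclass
class Pipeline:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)
    taps: Dict[Tuple[str, str], List[Callable[[Any], None]]] = field(default_factory=dict)

    def add(self, node: Node) -> "Pipeline":
        self.nodes[node.name] = node
        return self

    def connect(self, src: str, dst: str) -> "Pipeline":
        self.edges.append((src, dst))
        return self

    def tap(self, src: str, dst: str, observer: Callable[[Any], None]) -> "Pipeline":
        # Runtime access to intermediate data: observers see every payload
        # that flows along the (src, dst) edge, without changing the graph.
        self.taps.setdefault((src, dst), []).append(observer)
        return self

    def run(self, source: str, payload: Any) -> Any:
        # Naive walk of a linear graph declared in order; a real engine
        # would schedule branches, joins and fan-outs concurrently.
        current, value = source, self.nodes[source].fn(payload)
        for src, dst in self.edges:
            if src == current:
                for observer in self.taps.get((src, dst), []):
                    observer(value)          # e.g. a monitoring component
                value = self.nodes[dst].fn(value)
                current = dst
        return value

# Usage: preprocess -> model -> postprocess, with a monitor tapping the raw
# model output. The "model" here is a stand-in lambda, not a real model.
pipe = (Pipeline()
        .add(Node("preprocess", lambda x: [v / 255.0 for v in x]))
        .add(Node("model", lambda x: sum(x)))
        .add(Node("postprocess", lambda y: {"score": y}))
        .connect("preprocess", "model")
        .connect("model", "postprocess")
        .tap("model", "postprocess", lambda y: print("model output:", y)))
print(pipe.run("preprocess", [128, 64]))
```

Running the example prints the tapped model output followed by the final pipeline result; in a real system the observer would be a monitoring or drift-detection component rather than a print statement.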
2.2 Pipeline component abstractions
To allow for the construction of complex data pipelines such as the one shown in Figure 1, simple but flexible architectural building blocks are required. ML models usually run on dedicated, specialised servers. The models will generally have been created by data scientists, while the final serving infrastructure is usually handled by a separate, dedicated operations team. The definitions of models and of servers for inference should be kept separate, to allow for their distinct creation and for the possibility of sharing models across servers.
Given a set of models, many data pipelines may be built that share them at inference time. Consequently, a pipeline abstraction should be a higher-level concept that defines the flow of data between the functional steps, and how that data is joined and split as needed. Teams should be able to tap into any data source, consume any output stream, or extend the inference pipeline with extra processing.
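The sketch below illustrates this separation of concerns. The Server, Model and Pipeline classes, their fields, and the schedule function are all hypothetical, standing in for the kind of abstractions a serving platform could expose; they are not taken from any particular framework.

```python
# Sketch of separate Model / Server / Pipeline abstractions. All names,
# fields and URIs are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

@dataclass(frozen=True)
class Server:
    """A serving runtime with capacity; typically owned by operations."""
    name: str
    capabilities: FrozenSet[str]          # e.g. {"triton", "onnx"}
    replicas: int = 1

@dataclass(frozen=True)
class Model:
    """A versioned model artifact; typically owned by data scientists."""
    name: str
    uri: str                              # e.g. "s3://models/eta/v7"
    requirements: FrozenSet[str] = frozenset()

@dataclass
class Pipeline:
    """A higher-level dataflow over deployed models: it names the steps and
    how data is joined between them, but owns no servers itself."""
    name: str
    steps: List[str] = field(default_factory=list)
    joins: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

def schedule(model: Model, servers: List[Server]) -> Server:
    # Placement happens at deploy time, so several models can share one
    # server, and the same model can be reused by many pipelines.
    for server in servers:
        if model.requirements <= server.capabilities:
            return server
    raise RuntimeError(f"no server can host {model.name}")

servers = [Server("gpu-pool", frozenset({"triton", "onnx"}), replicas=2)]
eta = Model("eta", "s3://models/eta/v7", frozenset({"triton"}))
fare = Model("fare", "s3://models/fare/v2", frozenset({"onnx"}))
ride = Pipeline("ride-requests", steps=["eta", "fare"],
                joins={"fare": ("eta", "rider-features")})
print(schedule(eta, servers).name, schedule(fare, servers).name)  # shared server
```

Because placement is resolved at deploy time rather than hard-coded into the pipeline, several models can share one server, and the same model artifact can back many pipelines.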
2.3 Support for synchronous and asynchronous scenarios
Modern ML inference services should natively support two modes of operation: synchronous and
asynchronous. Synchronous inference is the traditional mode of doing inference, also known as
request-driven batch processing. In this mode a user makes a request with a batch of input data and waits for the corresponding predictions to be returned in the response.
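As a minimal illustration of the two modes, the sketch below serves the same stand-in model both synchronously and asynchronously. The queue-based worker is only one assumed shape for the asynchronous mode; a production system would more likely use a message broker or a stream such as Kafka.

```python
# Sketch contrasting synchronous and asynchronous inference. `model` is a
# stand-in callable; the queue-based consumer is an assumed design.
import queue
import threading

def model(batch):
    """Placeholder for a deployed model."""
    return [x * 2 for x in batch]

# Synchronous (request-driven): the caller blocks until predictions return.
def predict_sync(batch):
    return model(batch)

# Asynchronous (event-driven): requests arrive on an input queue/stream and
# predictions are published to an output one; producers never block on inference.
requests, results = queue.Queue(), queue.Queue()

def consume():
    while True:
        req_id, batch = requests.get()
        if req_id is None:                 # sentinel shuts the worker down
            break
        results.put((req_id, model(batch)))

worker = threading.Thread(target=consume)
worker.start()

print(predict_sync([1, 2, 3]))             # [2, 4, 6], returned in-line
requests.put(("req-1", [4, 5]))            # fire-and-forget submission
requests.put((None, None))
worker.join()
print(results.get())                       # ('req-1', [8, 10]), consumed later
```

In the synchronous path the caller blocks until predictions are returned, whereas in the asynchronous path requests and results travel on separate channels and can be produced and consumed at different rates.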