Desiderata for next generation of ML model serving

Sherif Akoush ∗ †
sa@seldon.io

Andrei Paleyes ∗ † ‡
ap2169@cam.ac.uk

Arnaud Van Looveren †
avl@seldon.io

Clive Cox †
cc@seldon.io

∗ Equal contribution
† Seldon Technologies
‡ Department of Computer Science and Technology, University of Cambridge

NeurIPS 2022 Workshop on Challenges in Deploying and Monitoring Machine Learning Systems.
arXiv:2210.14665v2 [cs.LG] 22 Nov 2022
Abstract
Inference is a significant part of ML software infrastructure. Despite the variety
of inference frameworks available, the field as a whole can be considered in its
early days. This position paper puts forth a range of important qualities that the next
generation of inference platforms should aim for. We present our rationale
for the importance of each quality, and discuss ways to achieve it in practice. We
propose to focus on data-centricity as the overarching design pattern which enables
smarter ML system deployment and operation at scale.
1 Introduction
Model inference has become an important part of modern machine learning (ML) infrastructure.
Various sources estimate that up to 90% of ML compute resources are used for inference tasks [17, 18, 32, 42, 53]. To meet the growing demand for inference infrastructure, a number of model serving platforms have appeared on the market [4, 12, 13, 14]. In addition, cloud providers also offer services that simplify model serving for their users [1, 9].
While the existing ML model serving frameworks answer some of the initial challenges of ML deployment, we believe there are a number of important properties such frameworks need to have to ensure a seamless model deployment experience. For instance, effective monitoring and explainability of an entire inference pipeline is an open research direction. Efficient use of infrastructure that also optimises for energy consumption and environmental footprint is missing. Seamless ML deployment to different targets, such as edge devices, is required. While ML serving frameworks are taking steps to provide some of these features, we think that taking all desired aspects of the system into consideration is challenging [41] and requires leveraging different architecture patterns, but will in general lead to better solutions developed by the community.
In this position paper we present a set of desired features for ML deployment: a blueprint for the next generation of ML model serving frameworks. We discuss the motivation for each feature and provide initial pointers on ways to achieve it. We hope to bring the community and practitioners together to discuss these challenges, find ideal solutions, and shape the future of ML serving.
2 Desiderata
In this section we present nine qualities that are important for ML serving. We advocate that designing
the system with data-centricity [7, 35] as the highest priority enables these features.
2.1 Inference pipelines as dataflow graphs
Model inference is a complex data processing pipeline. It can include input and output data transformations, multiple ML models, monitoring components, custom business logic, and so on.
Figure 1: An inference pipeline of a ride-sharing service, motivated by an example from Uber [29]. It ingests multiple data sources, contains several business logic operations and ML models, and produces several outputs.
An example of a complex inference graph is shown in Figure 1. It is imperative to have a clear view of the flow of data through the entire pipeline, both for its developers and its users. Runtime access to any intermediate data in the pipeline allows for a better experimentation, troubleshooting and monitoring experience, all of which are discussed later in this paper.
There are two possible ways for a model serving platform to facilitate access to the dataflow graph at runtime. It can be discovered post hoc, for example with distributed tracing systems [11, 19]. This approach is applicable to platforms implemented with service-oriented approaches, such as microservices. While a popular paradigm for software system design, service orientation might be ill-suited for ML inference, as its control-flow nature poorly reflects the data-centricity of the inference process. Alternatively, inference pipelines can be built with a dataflow-first approach, such as the flow-based programming (FBP) paradigm, as is already the case for ML training pipelines [20, 34]. FBP provides access to the dataflow graph naturally [40], and thus might be a more appropriate choice for implementing inference graphs.
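To make this concrete, below is a minimal sketch of an inference pipeline declared as an explicit dataflow graph, in the spirit of FBP. It is written in plain Python; the Node and Pipeline abstractions, the tap mechanism, and the naive scheduling logic are assumptions made for illustration, not the API of any existing framework. The point it demonstrates is that when the graph is a first-class object, any intermediate edge can be observed at runtime without modifying the pipeline itself.

```python
# Minimal sketch of a dataflow-first inference pipeline. The Node and
# Pipeline abstractions are hypothetical, not a real framework's API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Node:
    name: str
    fn: Callable[[Any], Any]

@dataclass
class Pipeline:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)
    taps: Dict[Tuple[str, str], List[Callable[[Any], None]]] = field(default_factory=dict)

    def add(self, node: Node) -> "Pipeline":
        self.nodes[node.name] = node
        return self

    def connect(self, src: str, dst: str) -> "Pipeline":
        self.edges.append((src, dst))
        return self

    def tap(self, src: str, dst: str, observer: Callable[[Any], None]) -> "Pipeline":
        # Runtime access to intermediate data: observers see every payload
        # that flows along the (src, dst) edge, without changing the graph.
        self.taps.setdefault((src, dst), []).append(observer)
        return self

    def run(self, source: str, payload: Any) -> Any:
        # Naive walk of a linear graph declared in order; a real engine
        # would schedule branches, joins and fan-outs concurrently.
        current, value = source, self.nodes[source].fn(payload)
        for src, dst in self.edges:
            if src == current:
                for observer in self.taps.get((src, dst), []):
                    observer(value)          # e.g. a monitoring component
                value = self.nodes[dst].fn(value)
                current = dst
        return value

# Usage: preprocess -> model -> postprocess, with a monitor tapping the raw
# model output. The "model" here is a stand-in lambda, not a real model.
pipe = (Pipeline()
        .add(Node("preprocess", lambda x: [v / 255.0 for v in x]))
        .add(Node("model", lambda x: sum(x)))
        .add(Node("postprocess", lambda y: {"score": y}))
        .connect("preprocess", "model")
        .connect("model", "postprocess")
        .tap("model", "postprocess", lambda y: print("model output:", y)))
print(pipe.run("preprocess", [128, 64]))
```

Running the example prints the tapped model output followed by the final pipeline result; in a real system the observer would be a monitoring or drift-detection component rather than a print statement.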
2.2 Pipeline component abstractions
To allow for the construction of complex data pipelines such as the one shown in Figure 1, simple but flexible architectural building blocks are required. ML models usually run on dedicated, specialised servers. The models will generally have been created by data scientists, while the final serving infrastructure is usually handled by a separate, dedicated operations team. The definitions of models and of servers for inference should be kept separate, to allow for their distinct creation and for the possibility of sharing models across servers.
Given a set of models, many data pipelines may be built that share them at inference time. Consequently, a pipeline abstraction should be a higher-level concept that defines the flow of data between the functional steps, and how that data is joined and split as needed. Teams should be able to tap into any data source, consume any output stream, or extend the inference pipeline with extra processing.
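The sketch below illustrates this separation of concerns. The Server, Model and Pipeline classes, their fields, and the schedule function are all hypothetical, standing in for the kind of abstractions a serving platform could expose; they are not taken from any particular framework.

```python
# Sketch of separate Model / Server / Pipeline abstractions. All names,
# fields and URIs are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

@dataclass(frozen=True)
class Server:
    """A serving runtime with capacity; typically owned by operations."""
    name: str
    capabilities: FrozenSet[str]          # e.g. {"triton", "onnx"}
    replicas: int = 1

@dataclass(frozen=True)
class Model:
    """A versioned model artifact; typically owned by data scientists."""
    name: str
    uri: str                              # e.g. "s3://models/eta/v7"
    requirements: FrozenSet[str] = frozenset()

@dataclass
class Pipeline:
    """A higher-level dataflow over deployed models: it names the steps and
    how data is joined between them, but owns no servers itself."""
    name: str
    steps: List[str] = field(default_factory=list)
    joins: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

def schedule(model: Model, servers: List[Server]) -> Server:
    # Placement happens at deploy time, so several models can share one
    # server, and the same model can be reused by many pipelines.
    for server in servers:
        if model.requirements <= server.capabilities:
            return server
    raise RuntimeError(f"no server can host {model.name}")

servers = [Server("gpu-pool", frozenset({"triton", "onnx"}), replicas=2)]
eta = Model("eta", "s3://models/eta/v7", frozenset({"triton"}))
fare = Model("fare", "s3://models/fare/v2", frozenset({"onnx"}))
ride = Pipeline("ride-requests", steps=["eta", "fare"],
                joins={"fare": ("eta", "rider-features")})
print(schedule(eta, servers).name, schedule(fare, servers).name)  # shared server
```

Because placement is resolved at deploy time rather than hard-coded into the pipeline, several models can share one server, and the same model artifact can back many pipelines.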
2.3 Support for synchronous and asynchronous scenarios
Modern ML inference services should natively support two modes of operation: synchronous and
asynchronous. Synchronous inference is the traditional mode of doing inference, also known as
request-driven batch processing. In this mode a user makes a request with a batch of input data and waits for the corresponding predictions to be returned in the response.
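As a minimal illustration of the two modes, the sketch below serves the same stand-in model both synchronously and asynchronously. The queue-based worker is only one assumed shape for the asynchronous mode; a production system would more likely use a message broker or a stream such as Kafka.

```python
# Sketch contrasting synchronous and asynchronous inference. `model` is a
# stand-in callable; the queue-based consumer is an assumed design.
import queue
import threading

def model(batch):
    """Placeholder for a deployed model."""
    return [x * 2 for x in batch]

# Synchronous (request-driven): the caller blocks until predictions return.
def predict_sync(batch):
    return model(batch)

# Asynchronous (event-driven): requests arrive on an input queue/stream and
# predictions are published to an output one; producers never block on inference.
requests, results = queue.Queue(), queue.Queue()

def consume():
    while True:
        req_id, batch = requests.get()
        if req_id is None:                 # sentinel shuts the worker down
            break
        results.put((req_id, model(batch)))

worker = threading.Thread(target=consume)
worker.start()

print(predict_sync([1, 2, 3]))             # [2, 4, 6], returned in-line
requests.put(("req-1", [4, 5]))            # fire-and-forget submission
requests.put((None, None))
worker.join()
print(results.get())                       # ('req-1', [8, 10]), consumed later
```

In the synchronous path the caller blocks until predictions are returned, whereas in the asynchronous path requests and results travel on separate channels and can be produced and consumed at different rates.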