Snapshot of Algebraic Vision
Joe Kileel and Kathlén Kohn
In honor of Bernd Sturmfels’ 60th birthday
Abstract. In this survey article, we present interactions between algebraic
geometry and computer vision, which have recently come under the header of
algebraic vision. The subject has given new insights in multiple view geometry
and its application to 3D scene reconstruction and carried a host of novel
problems and ideas back into algebraic geometry.

2020 Mathematics Subject Classification. Primary 68T45, 14Q20, 13P25; Secondary 13P15, 65H14, 13P10.
J.K. is supported in part by NSF awards DMS-2309782 and IIS-2312746, and start-up grants from the Department of Mathematics and Oden Institute at UT Austin.
K.K. is supported in part by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
Computer vision is the research field that studies how computers can gain un-
derstanding from 2D images and videos, similar to human cognitive abilities. Typ-
ical computer vision tasks include the automatic recognition of objects in images,
the detection of events in videos, and the reconstruction of 3D scenes from many
given 2D images. A general overview of computer vision is presented in textbook
form in [172]. The subject is a pillar in the AI revolution.
Algebraic vision is the symbiosis of computer vision and algebraic geometry.
Motivated by Chris Aholt’s Ph.D. thesis titled Polynomials in Multiview Geometry
[9] and earlier works, the term “algebraic vision” was coined during a particular
lunch held at a Seattle office of Google in early spring 2014, attended by Sameer
Agarwal, Chris Aholt, Joe Kileel, Hon-Leung Lee, Max Lieblich, Bernd Sturmfels,
and Rekha Thomas. The intent was to encourage interactions between the applied
algebraic geometry community and the 3D reconstruction community in computer
vision. A short discussion of algebraic vision can be found in the review [34] on
nonlinear algebra and its applications.
Historically, computer vision made substantial use of projective geometry and
computational algebra in parts of its foundations. Specifically multiple view geom-
etry, as described in the textbook [94] of Hartley and Zisserman, is modeled on
projective three-space and two-space and group-equivariant (multi-)linear transfor-
mations between these. Similar algebraic treatments of the subject are the text-
books [133] and [137]. Previously, this connection was not well-appreciated by the
computational algebraic geometry community. However, in the last decade, algebro-
geometric papers and workshops on 3D reconstruction have been appearing, leading
to novel results in multiple view geometry while motivating developments in applied
algebraic geometry.
The present article provides a survey of algebraic vision. No previous knowledge
of computer vision is assumed, and the prerequisites for computational algebraic
geometry are kept mostly to the level of undergraduate texts [51]. Due to space lim-
itations, the article makes no attempt to be comprehensive in any way, but instead
it focuses narrowly on the role of projective varieties and systems of polynomial
equations in 3D vision. An outline of the sections is as follows:
In Section 1, we introduce the problem of 3D scene reconstruction from
unknown cameras and its algebro-geometric nature.
In Section 2, we discuss a variety of commonly used camera models.
In Section 3, we study multiview varieties which characterize feasible im-
ages of points under fixed cameras. Their defining equations play a key
role in 3D reconstruction algorithms, and their Euclidean distance degrees
measure the intrinsic complexity of noisy triangulation (i.e., the task of
recovering the 3D coordinates of a point observed by known cameras).
In Section 4, we consider the space of all cameras. We explain how tuples
of cameras – up to changes of world coordinates – can be encoded via
multifocal tensors [94].
In Section 5, we overview the most popular algorithmic pipeline to solve
3D scene reconstruction, highlighting minimal problems that are the algebro-
geometric heart of the pipeline.
In Section 6, we describe polynomial solvers for minimal problems, focusing
on Gröbner basis methods using elimination templates and homotopy
continuation. These methods apply to zero-dimensional parameterized
polynomial systems in general.
In Section 7, we discuss algebro-geometric approaches to understand de-
generate world scenes and image data, where uniqueness of reconstruction
breaks down and algorithms can encounter difficulty.
After reading Sections 1 and 2, the other sections are essentially independent;
only Section 6 builds on Section 5. We provide specific pointers to earlier sections
in case of partial dependencies.
Some important topics in algebraic vision that are omitted include group syn-
chronization (e.g., [161, 127]), uses of polynomial optimization (e.g., [104, 50, 6,
190, 44]), and approaches based on differential invariants (e.g., [42, 30]). Readers
may consult [147] for a survey that covers numerical and large-scale optimization
aspects in 3D reconstruction.
Acknowledgements. We thank Sameer Agarwal, Paul Breiding, Luca Car-
lone, Tim Duff, Hongyi Fan, Fredrik Kahl, Anton Leykin, Tomas Pajdla, Jean
Ponce, Kristian Ranestad, Felix Rydell, Elima Shehu, Rekha Thomas, Matthew
Trager and Uli Walther for their comments on earlier versions of the manuscript.
1. Computer vision through the algebraic lens
One of the main challenges in computer vision is the structure-from-motion
(SfM) problem: given many 2D images, the task is to reconstruct the 3D scene
and also the positions of the cameras that took the pictures. This has many
applications, such as 3D mapping from images taken by drones [158], localizing and
navigating autonomous cars and robots in a 3D world [83], reconstructing 3D
backgrounds in the movie industry [107], photo tourism [2], and combining real
and virtual worlds [60].
The structure-from-motion problem is typically solved using the 3D reconstruc-
tion pipeline. We will now sketch a highly simplified version of that pipeline, il-
lustrated in Figure 1. We provide more details in Section 5.1. Given a set of 2D
images, the first step in the pipeline is to take a few of the given images and identify
geometric features, such as points or lines, that they have in common. In Figure 1b,
a detection algorithm has been used that only identifies points. In the second step
of the pipeline, we forget the original images and only keep the geometric features
we have identified. We reconstruct the 3D coordinates of those features and also the
camera poses, that is, the locations and orientations of the cameras. In Figure 1c,
five common points were identified on two images, so we aim to reconstruct the
five points in 3-space and the two cameras. Finally, we repeat this process several
times until we have recovered all cameras and also enough geometric features to
approximate the 3D scene.
Figure 1. 3D reconstruction pipeline (courtesy of Tomas Pajdla): (a) input images; (b) image matching; (c) reconstruction of cameras and 3D points; (d) output.
As the second step of the pipeline forgets the pictures and only works with
algebro-geometric features, such as points or lines, the reconstruction problem be-
comes purely algebraic. More specifically, we aim to compute a fiber of the joint
camera map:
\[ \Phi : \mathcal{X} \times \mathcal{C}^m \dashrightarrow \mathcal{Y}, \tag{1.1} \]
that maps an arrangement $X \in \mathcal{X}$ of 3D features and a tuple $(C_1, \dots, C_m) \in \mathcal{C}^m$ of
cameras to the $m$ 2D images of $X$ taken by the cameras. For instance in Figure 1c,
the joint camera map becomes
\[ \Phi : (\mathbb{R}^3)^5 \times \mathcal{C}^2 \dashrightarrow (\mathbb{R}^2)^5 \times (\mathbb{R}^2)^5. \tag{1.2} \]
A full specification of the joint camera map requires a choice of camera model.
The simplest model is a pinhole camera; see Figure 2. Such a camera simply takes
a picture of a point in space by projecting it onto a plane. A pinhole camera in
standard position is typically assumed to be centered at the origin such that its
image plane is $H = \{(x, y, z) \in \mathbb{R}^3 \mid z = 1\}$. In these coordinates, the pinhole
camera is the map
\[ \mathbb{R}^3 \dashrightarrow H, \quad (x, y, z) \longmapsto \left( \tfrac{x}{z}, \tfrac{y}{z}, 1 \right). \]
Figure 2. A pinhole camera in standard position is centered at $c = (0, 0, 0)$ and maps world points $(x, y, z)$ to image points $\left( \tfrac{x}{z}, \tfrac{y}{z}, 1 \right)$ on the image plane $H$.
Often homogeneous coordinates are used to model cameras. This means that
each point in the image plane is identified with the light ray passing through the
point and the origin. In homogeneous coordinates, the standard pinhole camera in
Figure 2 becomes
\[ \mathbb{P}^3_{\mathbb{R}} \dashrightarrow \mathbb{P}^2_{\mathbb{R}}, \quad [x : y : z : w] \longmapsto [x : y : z]. \]
This map is defined everywhere except at the camera center $[0 : 0 : 0 : 1]$, i.e., the
origin in the affine chart where $w = 1$.
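To make the two descriptions concrete, here is a minimal numerical sketch (ours, in Python with NumPy; not code from the article) checking that the affine pinhole map and its homogeneous version agree away from the camera center:

```python
import numpy as np

def project_affine(X):
    """Standard pinhole camera R^3 --> H, (x, y, z) |-> (x/z, y/z, 1); needs z != 0."""
    x, y, z = X
    return np.array([x / z, y / z, 1.0])

def project_homogeneous(X_hom):
    """Homogeneous version [x : y : z : w] |-> [x : y : z]; undefined at [0:0:0:1]."""
    return X_hom[:3]

X = np.array([2.0, 4.0, 2.0])        # a world point with z != 0
X_hom = np.append(X, 1.0)            # its homogeneous coordinates [x : y : z : 1]

img = project_affine(X)              # (1, 2, 1)
img_hom = project_homogeneous(X_hom) # [2 : 4 : 2], a representative in R^3 \ {0}

# Dehomogenizing [2 : 4 : 2] by its last coordinate recovers (1, 2, 1):
assert np.allclose(img, img_hom / img_hom[-1])
```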
The projective geometry approach to modeling cameras is thoroughly explained
in the textbook [94]. That book laid many foundations and conventions used in
modern computer vision and offers a great entry point for the algebraic community
into the field of computer vision. The main focus of the book [94] is multiview
geometry, where a 3D object is viewed by several cameras, such as in Figure 1.
In that setting, we cannot assume that all cameras are in standard position as
described above. Instead, a pinhole camera is more generally given by a $3 \times 4$
matrix $A$ of rank three. The corresponding camera map
\[ \mathbb{P}^3_{\mathbb{R}} \dashrightarrow \mathbb{P}^2_{\mathbb{R}}, \quad X \longmapsto AX \]
is defined everywhere except at the camera center, which is given by the kernel of $A$.
The standard camera in Figure 2 corresponds to the matrix $\left[ \begin{smallmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{smallmatrix} \right]$.
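As a quick sanity check (our sketch, not from the article), the camera center can be computed numerically as the kernel of $A$, e.g., from the singular value decomposition:

```python
import numpy as np

# For a rank-three 3x4 camera matrix A, ker(A) is one-dimensional and is
# spanned by the right-singular vector of A for its zero singular value.
A = np.array([[1.0, 0.0, 0.0, 0.0],   # the standard camera from Figure 2
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

_, _, Vt = np.linalg.svd(A)
center = Vt[-1]                        # last row of V^T spans ker(A)

assert np.allclose(A @ center, 0.0)    # center is proportional to (0, 0, 0, 1)
```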
Hence, when using pinhole cameras and homogeneous coordinates, the camera
variety $\mathcal{C}^m$ in (1.1) that describes all $m$-tuples of such cameras is
\[ \mathcal{C}^m = (\mathbb{P}\,\mathrm{Mat}^{3 \times 4}_3)^m, \]
where $\mathrm{Mat}^{3 \times 4}_3 \subseteq \mathbb{R}^{3 \times 4}$ denotes the set of $3 \times 4$ matrices of rank three. For instance,
the joint camera map in (1.2) becomes
\begin{align*} \Phi : (\mathbb{P}^3_{\mathbb{R}})^5 \times (\mathbb{P}\,\mathrm{Mat}^{3 \times 4}_3)^2 &\dashrightarrow (\mathbb{P}^2_{\mathbb{R}})^5 \times (\mathbb{P}^2_{\mathbb{R}})^5, \\ (X_1, \dots, X_5, A_1, A_2) &\longmapsto (A_1 X_1, \dots, A_1 X_5, A_2 X_1, \dots, A_2 X_5). \end{align*}
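The following sketch (our illustration; the random setup is ours) evaluates this joint camera map on generic data:

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_camera_map(points, cameras):
    """Phi: maps (X_1..X_5, A_1, A_2) to (A_1 X_1, ..., A_1 X_5, A_2 X_1, ..., A_2 X_5)."""
    return [A @ X for A in cameras for X in points]

# Five world points in homogeneous coordinates [x : y : z : 1].
points = [np.append(rng.standard_normal(3), 1.0) for _ in range(5)]

# Two random 3x4 camera matrices; these have rank three generically.
cameras = [rng.standard_normal((3, 4)) for _ in range(2)]

images = joint_camera_map(points, cameras)
print(len(images))   # 10 image points, each a representative of a point in P^2
```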
In the next section, we review common camera models and highlight algebraic
vision articles studying camera geometry. In the remaining sections, our focus
returns to the joint camera map in (1.1): We will see that many computer vision
problems can be formulated using the joint camera map – such as understanding
the image of a shape in space or reconstructing a 3D shape from several images –
and are thus natural to study through the algebraic lens. The recent paper [1] gives
a similar unifying algebro-geometric framework for computer vision problems.
2. Camera models
Calibrated cameras. The camera model described in the previous section
is known as the projective / uncalibrated pinhole camera. The calibrated pinhole
camera model assumes that every camera is obtained from the standard pinhole
camera in Figure 2 by translation and rotation. This means that every camera
matrix $A$ is of the form $[R \mid t]$ where $R \in \mathrm{SO}(3)$ is the relative rotation from the
standard pinhole camera to the camera with matrix $A$ and the relative translation
can be read off from the vector $t \in \mathbb{R}^3$: the camera center $c$, which is the origin
in Figure 2, is now $c = -R^\top t$ (note that the vector $(c, 1) \in \mathbb{R}^4$ spans the kernel
of the camera matrix $[R \mid t]$). In particular, every calibrated pinhole camera has 6
degrees of freedom (3 for $R$ and 3 for $t$), whereas a projective pinhole camera has
11 degrees of freedom.
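A minimal numerical check (our sketch) that $(c, 1)$ with $c = -R^\top t$ indeed spans the kernel of $[R \mid t]$:

```python
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],   # a rotation about the z-axis
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -1.0, 2.0])

A = np.hstack([R, t[:, None]])    # the 3x4 calibrated camera matrix [R | t]
c = -R.T @ t                      # camera center: R c + t = -R R^T t + t = 0

assert np.allclose(A @ np.append(c, 1.0), 0.0)
```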
Calibrated pinhole cameras are a commonly used model in applications, corre-
sponding to the case when the internal parameters of the cameras are known (such
as from meta data stored inside the image file). There is also a variety of partly
calibrated pinhole cameras, e.g., a camera with unknown focal length, that impose less
strict structural assumptions on the $3 \times 4$ camera matrices than the fully calibrated
model described above. Partly calibrated pinhole cameras are modeled as $K[R \mid t]$,
where $K$ is a $3 \times 3$ upper triangular calibration matrix whose entries are partially
known [94, Chapter 6].
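For concreteness, a common parameterization of the calibration matrix (cf. [94, Chapter 6]) is
\[ K = \begin{bmatrix} \alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}, \]
where $\alpha_x, \alpha_y$ are the focal lengths in pixel units along the two image axes, $s$ is the skew, and $(x_0, y_0)$ is the principal point. A camera with unknown focal length, for instance, corresponds to assuming $s = 0$, a known principal point, and $\alpha_x = \alpha_y = f$ with $f$ unknown.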
Distortion. In practice, cameras are not as ideal as in the calibrated model.
As seen in Figure 2, the pinhole cameras described so far assume that the world
point, the camera center, and the image point are collinear. This assumption does
not hold for real-life camera lenses, because they are affected by various kinds of
distortion. The main factor of deviation from the idealistic pinhole camera model
is typically radial distortion; see Figure 3.
Often, calibrated cameras are a sufficient approximation of real-life cameras.
However, sometimes the impact of radial distortion is too large to ignore, e.g., for fisheye
cameras. One approach to address radial distortion is to make the camera model
more complicated by adding distortion parameters that have to be estimated during
3D reconstruction (see [94, Chapter 7.4] for an overview and [106] for an algebraic
treatment of distortion varieties). Another approach is to simplify the camera model
by not estimating the radial distortion at all: Once the center of radial distortion on
a given image is determined, we know, for every 3D point, onto which line through
the distortion center it gets mapped by the camera (see Figure 3), although we