

A New Task: Deriving Semantic Class Targets for the Physical Sciences

Micah Bowles1∗, Hongming Tang2, Eleni Vardoulaki3,
Emma L. Alexander1, Yan Luo4, Lawrence Rudnick5, Mike Walmsley1,
Fiona Porter1, Anna M. M. Scaife1,6, Inigo Val Slijepcevic1, Gary Segal7,8

1Jodrell Bank Centre for Astrophysics, University of Manchester, Manchester, UK

2Department of Astronomy, Tsinghua University, Beijing, China

3Thüringer Landessternwarte, Tautenburg, Germany

4School of Physics and Astronomy, Sun Yat-sen University, Zhuhai, China

5Minnesota Institute for Astrophysics, University of Minnesota, Minneapolis, USA

6The Alan Turing Institute, London, UK

7School of Mathematics and Physics, University of Queensland, Brisbane, QLD, Australia

8CSIRO Space and Astronomy, Epping, NSW, Australia

Abstract

We define deriving semantic class targets as a novel multi-modal task. By doing so, we aim to improve classification schemes in the physical sciences, which can be severely abstracted and obfuscating. We address this task for upcoming radio astronomy surveys and present the derived semantic radio galaxy morphology class targets.

1 Introduction

Language evolves and changes, sometimes quite quickly. When a new idea arises and demands new terminology, terms are commonly invented [e.g. utopia; ], named after an early adopter or pioneer (e.g. Newtonian physics), or adopted from similar ideas [e.g. ‘modern’ in philosophy and art; ]. This is also common across various physical sciences, where use of language is key, as it is even believed to affect how we think [12].

In computer science applications, the language used to describe target classes is not a pressing issue at this point. Consider ImageNet [ ], whose class targets are built on WordNet [ ]. WordNet is explicitly constructed around the conceptual-semantic and lexical relations of the English language. Because of this, models trained on ImageNet carry inductive biases of the semantic (meaningful) terms used as class targets. These inductive biases do not necessarily hold true for data sets in the physical sciences, where in fewer than one hundred years new language has been developed to capture entirely novel and crucial concepts.

The physical sciences are quickly entering large data regimes where automation is essential. This includes classification. Supervised classification approaches usually have definitions for class targets and terms to describe to which class a given data point belongs.

To improve supervised models, it is common in both computer science and applied fields to attempt to optimise approaches and models for a fixed set of class targets. Often, computer-science-led state-of-the-art methods are implemented in the hope of surpassing previous benchmarks on a specified (fixed) supervised task. However, in dynamically evolving fields, fixing such class targets in place may not be ideal or provide meaningful results. For instance, in radio astronomy, the field has developed a detailed understanding of radio galaxies and how they form, yet the same abstract classes defined through the field’s understanding in the 1970s [ ] still persist. We therefore propose that rather than optimising predictions of ineffective class targets, in certain scenarios it may be more beneficial to change the target classes with the aim of developing more robust, generalisable and feature-rich models. Consequently, in this work, we propose a task to derive semantic class targets.

∗micah.bowles@postgrad.manchester.ac.uk; https://mb010.github.io/

Fifth Workshop on Machine Learning and the Physical Sciences (NeurIPS 2022).

arXiv:2210.14760v2 [astro-ph.IM] 27 Oct 2022

Sec. 2 details the proposed task and its potential consequences for the physical sciences. Sec. 3 presents the proposed method. An application of the method to radio astronomy is presented in Sec. 4 before conclusions are drawn in Sec. 5. Code and data used in this work are available at https://github.com/mb010/Text2Tag.

2 Task

To improve target classes in labelled data sets, we propose a multi-modal task which can be phrased as:

Given a set of documents describing labelled data samples, return a set of natural language terms / phrases which capture the semantic features of the labelled data set.

For any task, the derived set of class labels should be able to:

1. Map the science targets,

2. Map the semantic features of the data,

3. Use clear (non-technical) language.

The set of targets must be able to map to the previous set of classes, as otherwise a given scientific community will not be able to translate classifications into the historical classes that they are used to. For example, consider ‘fur length’ as the class target in the supervised task of classifying cats and dogs: although useful, it does not suffice to classify a given image back into the cat/dog scheme.
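The requirement above can be illustrated with a minimal sketch, which checks whether a candidate set of tags determines the historical class for every labelled sample. The function name `recoverable`, the tag ‘retractable claws’ and the toy samples are all hypothetical and chosen only for illustration; this is not the method proposed in this work.

```python
from collections import defaultdict

def recoverable(samples, tag_names):
    """Return True if the chosen tags determine the historical class.

    `samples` is a list of (tags, historical_class) pairs, where `tags`
    maps a tag name to its value. All names here are illustrative.
    """
    seen = defaultdict(set)
    for tags, historical_class in samples:
        key = tuple(tags[name] for name in tag_names)
        seen[key].add(historical_class)
    # Every observed tag combination must point to exactly one class.
    return all(len(classes) == 1 for classes in seen.values())

# Toy, made-up samples for the cat/dog illustration.
samples = [
    ({"fur length": "short", "retractable claws": True}, "cat"),
    ({"fur length": "long", "retractable claws": True}, "cat"),
    ({"fur length": "short", "retractable claws": False}, "dog"),
    ({"fur length": "long", "retractable claws": False}, "dog"),
]

fur_only = recoverable(samples, ["fur length"])
with_claws = recoverable(samples, ["fur length", "retractable claws"])
```

In this toy data, ‘fur length’ alone is shared by both classes and so cannot recover them, whereas adding a second tag can. A check of this kind only covers the samples it is given; it says nothing about unseen data.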

Targets which map the semantic features of the data are ideal, as populations which contain semantic feature differences may not be captured by abstract classes. For example, a classifier could be trained to predict features of buildings (spires, column designs, materials used, etc.) rather than architectural styles (gothic, baroque, neoclassical, brutalist, etc.). This would enable the model to generalise to architectural styles not included in the abstract target classes, and could even be used to highlight designs which include hybrid elements.
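As a rough sketch of the architectural example, suppose a classifier has already predicted a set of building features. The feature vocabularies in `STYLE_FEATURES`, the function names and the 0.5 threshold below are all assumptions made for illustration, not definitions from architecture or from this work.

```python
# Hypothetical feature vocabularies; real ones would come from domain experts.
STYLE_FEATURES = {
    "gothic": {"spires", "pointed arches", "stone"},
    "brutalist": {"exposed concrete", "blocky massing"},
}

def style_overlap(predicted_features):
    """Fraction of each style's features present in the prediction."""
    return {
        style: len(features & predicted_features) / len(features)
        for style, features in STYLE_FEATURES.items()
    }

def matching_styles(predicted_features, threshold=0.5):
    """Styles whose feature overlap reaches the (arbitrary) threshold."""
    scores = style_overlap(predicted_features)
    return sorted(s for s, score in scores.items() if score >= threshold)

# A building mixing elements of two styles matches both.
hybrid = matching_styles({"spires", "pointed arches", "exposed concrete"})
# A building from a style outside the vocabulary matches none,
# yet its predicted features still describe it.
unseen = matching_styles({"glass curtain wall"})
```

The point of the design is that the feature predictions remain informative in both cases: a design with hybrid elements is flagged by matching more than one style, and a design from an unlisted style is still described by its features even though no style matches.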

The benefit of clear non-technical language is the ability it provides experts in a given field to capture, communicate, and collaborate in and around their data. If the terms map the science targets sufficiently well, they could even replace the existing terms, reducing that community’s dependence on obtuse, and sometimes inconsistent, definitions of technical terminology. It could also lower barriers to entry for inter-disciplinary research, outreach, and citizen science projects.

3 Method

The methods we discuss here are based on annotations and science targets. We use the term annotations to describe short documents which each describe a feature of a single data sample using non-technical terminology. We use the term science targets to mean the traditional abstract classifications (or engineered features) for each annotated data sample.
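One way to picture these two inputs is as a small data structure, sketched below. The `Annotation` class, the function `terms_by_science_target`, the whitespace tokenisation and the example texts are all assumptions made for illustration; they are not taken from the released code and do not represent the method of this work.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    sample_id: str  # the single data sample being described
    text: str       # one short, non-technical description of one feature

def terms_by_science_target(annotations, science_targets):
    """Count annotation words per science target.

    `science_targets` maps a sample id to its traditional class.
    Splitting on whitespace is a simplifying assumption.
    """
    counts = defaultdict(Counter)
    for ann in annotations:
        target = science_targets[ann.sample_id]
        counts[target].update(ann.text.lower().split())
    return counts

# Invented placeholder data: two samples, three annotations.
annotations = [
    Annotation("s1", "two bright blobs"),
    Annotation("s1", "thin bridge"),
    Annotation("s2", "one faint blob"),
]
science_targets = {"s1": "class A", "s2": "class B"}

counts = terms_by_science_target(annotations, science_targets)
```

Grouping the plain-language words by science target in this way gives a simple view of which candidate terms co-occur with which traditional class, which is the kind of information any approach to the task would need to work from.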

There are many possible approaches to address this task. Two simple approaches include manual selection of plain English terms by a panel of experts, or using a large language model (LLM) for a zero-shot approach. We expect, given appropriate experts, that manual selection via an expert panel would be acceptable to a given community. However, expecting a panel of experts to agree on a set of plain English class targets may not be realistic depending on the background of each expert and/or their ability to distil abstract concepts into simple terms. Manual selection may also lack the reproducibility and tractability that the physical sciences should demand. Using an LLM in a zero-shot approach may work; however, it is not clear how prompts should be engineered in order to extract
