A New Task: Deriving Semantic Class Targets for the Physical Sciences
Micah Bowles (1), Hongming Tang (2), Eleni Vardoulaki (3), Emma L. Alexander (1), Yan Luo (4),
Lawrence Rudnick (5), Mike Walmsley (1), Fiona Porter (1), Anna M. M. Scaife (1, 6),
Inigo Val Slijepcevic (1), Gary Segal (7, 8)
(1) Jodrell Bank Centre for Astrophysics, University of Manchester, Manchester, UK
(2) Department of Astronomy, Tsinghua University, Beijing, China
(3) Thüringer Landessternwarte, Tautenburg, Germany
(4) School of Physics and Astronomy, Sun Yat-sen University, Zhuhai, China
(5) Minnesota Institute for Astrophysics, University of Minnesota, Minneapolis, USA
(6) The Alan Turing Institute, London, UK
(7) School of Mathematics and Physics, University of Queensland, Brisbane, QLD, Australia
(8) CSIRO Space and Astronomy, Epping, NSW, Australia
Abstract
We define deriving semantic class targets as a novel multi-modal task. By doing
so, we aim to improve classification schemes in the physical sciences which can
be severely abstracted and obfuscating. We address this task for upcoming radio
astronomy surveys and present the derived semantic radio galaxy morphology class
targets.
1 Introduction
Language evolves and changes - sometimes quite quickly. When a new idea arises and demands new
terminology, terms are commonly invented [e.g. utopia; 9], named after an early adopter or pioneer
(e.g. Newtonian physics), or adopted from similar ideas [e.g. ‘modern’ in philosophy and art; 11, 1].
This is also common across various physical sciences, where use of language is key, as it is even
believed to affect how we think [12].
In computer science applications, language used to describe target classes is not a pressing issue at
this point. Consider ImageNet [3], whose class targets are built on WordNet [5]. WordNet is explicitly
constructed around the conceptual-semantic and lexical relations of the English language. Because of
this, models trained on ImageNet carry inductive biases of the semantic (meaningful) terms used as
class targets. These inductive biases do not necessarily hold true for data sets in the physical sciences,
where in fewer than one hundred years new language has been developed to capture entirely novel
and crucial concepts.
The physical sciences are quickly entering large data regimes where automation is essential. This
includes classification. Supervised classification approaches usually have definitions for class targets
and terms to describe to which class a given data point belongs.
To improve supervised models, it is common in both computer science and applied fields to attempt to
optimise approaches and models for a fixed set of class targets. Often, state-of-the-art methods led
by computer science are implemented in the hope of surpassing previous benchmarks on a specified (fixed)
supervised task. However, in dynamically evolving fields, fixing such class targets in place may not
be ideal or provide meaningful results. For instance, in radio astronomy, the field has developed a
detailed understanding of radio galaxies and how they form, yet the same abstract classes defined
through the field’s understanding in the 1970s [4] still persist. We therefore propose that rather than
optimising predictions of ineffective class targets, in certain scenarios it may be more beneficial to
change the target classes with the aim of developing more robust, generalisable and feature-rich
models. Consequently, in this work, we propose a task to derive semantic class targets.
micah.bowles@postgrad.manchester.ac.uk; https://mb010.github.io/
Fifth Workshop on Machine Learning and the Physical Sciences (NeurIPS 2022).
arXiv:2210.14760v2 [astro-ph.IM] 27 Oct 2022
Sec. 2 details the proposed task and its potential consequences for the physical sciences. Sec. 3
presents the proposed method. An application of the method to radio astronomy is presented in
Sec. 4 before conclusions are drawn in Sec. 5. Code and data used in this work are available at
https://github.com/mb010/Text2Tag.
2 Task
To improve target classes in labelled data sets, we propose a multi-modal task which can be phrased
as:
Given a set of documents describing labelled data samples, return a set of natural
language terms / phrases which capture the semantic features of the labelled data
set.
For any task, the derived set of class labels should be able to:
1. Map the science targets,
2. Map the semantic features of the data,
3. Use clear (non-technical) language.
The set of targets must be able to map to the previous set of classes, as otherwise a given scientific
community will not be able to translate classifications into the historical classes that they are used
to. For example, consider ‘fur length’ as a class target in the supervised task of classifying cats and
dogs: although useful, it does not suffice to classify a given image back into the cat/dog scheme.
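The requirement that derived targets map back to the historical classes can be checked mechanically. The following is a minimal sketch, assuming each labelled sample is represented as a set of derived tags plus its historical class; the function name `recoverable` and the toy cat/dog tags are invented for illustration and are not taken from this work.

```python
from collections import defaultdict


def recoverable(samples):
    """Check whether each distinct tag combination maps to exactly one class.

    samples: iterable of (tags, science_class) pairs, where tags is a set of strings.
    Returns (ok, conflicts), where conflicts holds every tag combination
    that was observed with more than one class.
    """
    classes_by_tags = defaultdict(set)
    for tags, science_class in samples:
        classes_by_tags[frozenset(tags)].add(science_class)
    conflicts = {
        tags: classes
        for tags, classes in classes_by_tags.items()
        if len(classes) > 1
    }
    return len(conflicts) == 0, conflicts


# Toy data, invented for illustration only.
fur_only = [
    ({"long fur"}, "dog"),
    ({"long fur"}, "cat"),
    ({"short fur"}, "cat"),
]
richer = [
    ({"long fur", "retractable claws"}, "cat"),
    ({"long fur", "floppy ears"}, "dog"),
    ({"short fur", "retractable claws"}, "cat"),
]

fur_only_ok, fur_only_conflicts = recoverable(fur_only)
richer_ok, _ = recoverable(richer)
```

In this toy data, `fur_only_ok` is `False` because ‘long fur’ is observed for both cats and dogs, while the richer tag set keeps the two classes separable. Passing this check on one data set is necessary but not sufficient: unseen samples could still introduce conflicts.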
Targets which map the semantic features of the data are ideal, as populations which contain semantic
feature differences may not be captured by abstract classes. For example, a classifier could be trained
to predict features of buildings (spires, column designs, materials used, etc.) rather than architectural
styles (gothic, baroque, neoclassical, brutalist etc.). This would enable the model to generalise to
architectural styles not included in the abstract target classes, and could even be used to highlight
designs which include hybrid elements.
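The architecture example can be sketched as a lookup from predicted semantic features to known styles. This is an illustrative assumption only: the feature sets in `STYLE_FEATURES`, the function names, and the rule that a single shared feature counts as a match are all hypothetical simplifications, and the feature predictions themselves would come from a trained multi-label classifier that is not shown.

```python
# Hypothetical mapping from abstract styles to characteristic semantic features.
STYLE_FEATURES = {
    "gothic": {"spires", "pointed arches"},
    "neoclassical": {"columns", "pediment"},
    "brutalist": {"exposed concrete"},
}


def styles_from_features(predicted):
    """Return every known style that shares at least one feature with the prediction."""
    predicted = set(predicted)
    return sorted(
        style for style, features in STYLE_FEATURES.items() if features & predicted
    )


def describe(predicted):
    """Summarise a set of predicted features as a single style, a hybrid, or unknown."""
    matched = styles_from_features(predicted)
    if not matched:
        return "unknown style"
    if len(matched) == 1:
        return matched[0]
    return "hybrid: " + " + ".join(matched)
```

For instance, `describe({"spires", "columns"})` returns `"hybrid: gothic + neoclassical"`, and a building whose features match no listed style is still described by its features rather than forced into one of the abstract classes.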
The benefit of clear non-technical language is the ability it provides experts in a given field to
capture, communicate, and collaborate in and around their data. If the terms map the science targets
sufficiently well, they could even replace the existing technical terms, reducing that community’s
dependence on obtuse, and sometimes inconsistent, definitions of technical terminology. Such language
could also lower barriers to entry for inter-disciplinary research, outreach, and citizen science projects.
3 Method
The methods we discuss here are based on annotations and science targets. We use the term
annotations to describe short documents which each describe a feature of a single data sample
using non-technical terminology. We use the term science targets to mean the traditional abstract
classifications (or engineered features) for each annotated data sample.
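To make the two definitions concrete, the sketch below shows one possible representation of annotations paired with science targets, together with a naive count of how often each plain-English term appears per science target. It is an assumption for illustration, not the method of this work: the `Annotation` fields, the tokenisation, the placeholder class names, and the example texts are all hypothetical.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    sample_id: str       # identifier of the annotated data sample
    text: str            # short non-technical description of one feature
    science_target: str  # traditional abstract class of that sample


def term_counts_by_target(annotations):
    """Count, per science target, the number of annotations containing each term."""
    counts = defaultdict(Counter)
    for annotation in annotations:
        tokens = re.findall(r"[a-z]+", annotation.text.lower())
        # Use a set so that a term is counted once per annotation.
        counts[annotation.science_target].update(set(tokens))
    return counts


# Invented example annotations with placeholder class names.
annotations = [
    Annotation("s1", "two bright blobs at the ends", "class A"),
    Annotation("s2", "bright ends with a faint bridge", "class A"),
    Annotation("s3", "bright core fading outwards", "class B"),
]
counts = term_counts_by_target(annotations)
```

Terms such as ‘ends’, which appear for one science target but not the other, are candidates for semantic class targets, whereas a term like ‘bright’ that appears for both carries little discriminating information on its own.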
There are many possible approaches to address this task. Two simple approaches include manual
selection of plain English terms by a panel of experts, or using a large language model (LLM) for a
zero-shot approach. We expect, given appropriate experts, that manual selection via an expert panel
would be acceptable to a given community. However, expecting a panel of experts to agree on a
set of plain English class targets may not be realistic depending on the background of each expert
and/or their ability to distil abstract concepts into simple terms. Manual selection may also lack the
reproducibility and tractability that the physical sciences should demand. Using an LLM in a zero-shot
approach may work; however, it is not clear how prompts should be engineered in order to extract