Aggregator Reuse and Extension
for Richer Web Archive Interaction
Mat Kelly [ORCID: 0000-0002-0236-7389]
Drexel University, Philadelphia PA 19104, USA
mkelly@drexel.edu
https://matkelly.com
Abstract. Memento aggregators enable users to query multiple web
archives for captures of a URI in time through a single HTTP endpoint.
While this one-to-many access point is useful for researchers and end-users,
aggregators are in a position to provide additional functionality to
end-users beyond black-box-style aggregation. This paper identifies the
state of the art of Memento aggregation, abstracts its processes, highlights
shortcomings, and offers systematic enhancements.
1 Introduction
Web archives act as a historical record of the web. The Internet Archive (IA)
possesses the largest number of web archive holdings. These holdings are accessible
through a set of interfaces to the Wayback Machine. Beyond IA, other web
archives exhibit focused collection efforts, often providing unique captures within
IA’s temporal and spatial (i.e., URL [7]) voids [17]. A common usage pattern in
accessing IA’s captures is to request the archive’s web site at archive.org, submit
a URL of interest by providing it in a text input field, then select a date and
time from the set of available captures for that URL in the past. This pattern may
differ between web archives’ respective web interfaces. Memento [27] provides
the standards-based interoperable means, dynamics, syntax, and semantics for
representing identifiers for archival captures (mementos) from a set of web archives.
Each archive that supports the Memento Framework provides an HTTP endpoint
for retrieving mementos from their respective archival holdings. Users can send
a request for all captures of a URL to a variety of supporting archives through
a single endpoint by an accessible tool that performs the logic of querying and
combining results from multiple sources—a Memento aggregator.
Memento aggregators typically hold references to a set of endpoints of web
archives that implement the Memento Framework. An aggregator may express
these through a URI “template” like that in Figure 1 or as a URI with an implicit
append operation of a URI-R [27]. Upon receiving a request from a client with a
parameterized URL (e.g., the URI-R applied to the template URI), an aggregator
relays the argument received in this request as parameters for subsequent requests
to each archive. When the aggregator receives a sufficient response¹, as dictated
by the logic of the aggregator in practice, the aggregator combines the results
through a procedure that aligns with Memento syntax, often inclusive of temporal
sorting². The aggregator returns this “aggregated” response to the client. This
description broadly covers the conventional role of the aggregator. Its place as a
means for users to interface with multiple web archives through a single request
has the potential to be utilized further and made more generally useful.
¹ This criterion is implementation-specific and may be associated with a temporal
threshold, memento count, etc.
arXiv:2210.01196v1 [cs.DL] 3 Oct 2022
t0:{scheme & hostname}/{resource type}/{format}/{URI-R}
t1:https://myarchive.org/timemap/link/http://example.com
m0:{scheme & hostname}/{datetime}/{URI-R}
m1:http://archive.md/20210619183508/https://icadl.net/icadl2021/
m2:https://archive.ph/eoQRZ
Fig. 1: An aggregator must be configured to supply parameters to an HTTP
endpoint (like t1), often exhibited in the form of a “templated URI” (t0) for a
URI-T as shown here. The suffixed red portion represents a URI-R,
http://example.com, as used in practice. This URI templating is replicated (m0)
with URI-Ms (e.g., m1), though a web archive need not identify its captures in this
non-opaque manner (m2 and m1 identify the same memento).
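The templated URIs of Figure 1 can be illustrated with a minimal sketch of how an aggregator might derive one URI-T per archive from a client-supplied URI-R. The archive names, the second endpoint, and the `{uri_r}` placeholder spelling are assumptions made for illustration; only the first template mirrors t1 in Figure 1.

```python
# Hypothetical aggregator configuration: one TimeMap template per archive.
# "{uri_r}" marks where the client-supplied URI-R is appended.
ARCHIVE_TEMPLATES = {
    "myarchive": "https://myarchive.org/timemap/link/{uri_r}",
    "otherarchive": "https://otherarchive.example/timemap/link/{uri_r}",
}


def timemap_uris(uri_r, templates=ARCHIVE_TEMPLATES):
    """Apply the URI-R to every archive's template, yielding one URI-T each."""
    # str.replace is used instead of str.format so that braces or other
    # special characters inside the URI-R are passed through untouched.
    return {name: tpl.replace("{uri_r}", uri_r) for name, tpl in templates.items()}


uri_ts = timemap_uris("http://example.com")
for archive, uri_t in uri_ts.items():
    print(archive, uri_t)
```

The aggregator would then issue one request per resulting URI-T; how the URI-R is escaped, if at all, is left to each archive's conventions and is not addressed in this sketch.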
This paper examines the hierarchical (yet decoupled) relationship between a
Memento aggregator and Memento-compliant web archives. While an aggregator
and a set of archives often exhibit a static one-to-many relationship (respectively),
both more fundamental and potentially more complex hierarchies can be realized
using existing infrastructure. Accounting for this additional capability allows the
role of the aggregator in use cases for web archives to be enhanced strategically
and efficiently. We build on existing work in defining a framework for
aggregating public and private web archives [16]. Our focus will be on identifying
(Section 6) and mitigating (Section 7) some outstanding issues, both those introduced
by the framework and those that exist in the current practice of interfacing
with web archives using Memento aggregation.
2 Background
The Memento Framework [27] introduces the ability to perform temporal negotiation
on the web by relating the current and past representations of a web page.
Past representations are identified by “URI-Ms” and the original representation
by a “URI-R”, per Memento. Memento also introduces a resource to associate
URI-Ms and URI-Rs through a structured listing called a TimeMap, identified by
a “URI-T”. A web archive may return a TimeMap representing its holdings, inclusive
of URI-Ms, a URI-R, URI-Ts, and a URI-G for a “TimeGate”. A TimeGate
allows a client, through HTTP request headers, to specify a datetime basis for
a likewise included URI-R. This paper relates to the information retrieval and
relational aspects of Memento TimeMaps and not specifically to the temporal
negotiation of Memento, the latter being a feature of TimeGates. We focus on
the association of past and present URIs and not the ability to resolve the closest
datetime, both of which Memento provides.
² It is important to note here that TimeMaps do not need to be temporally sorted to
be Memento compliant.
Fig. 2: The “Time Travel” service provides a graphical, web-based endpoint
to interface with LANL’s Memento aggregator. After submitting a URI and
date range in the interface (2a), the results are displayed (Figure 2b), showing
the extent of the captures from a variety of pre-configured, server-defined web
archives.
The concept of aggregation goes beyond the Memento specification by leveraging
a similar structure to TimeMaps but allowing the URIs contained within the
aggregated TimeMap to identify resources at multiple archives instead of a single
archive. The Research Library at Los Alamos National Laboratory (LANL) deployed
the original Memento aggregator [8, 11], currently accessible through a web
interface via the Time Travel service at https://timetravel.mementoweb.org/.
This web service (Figure 2a) provides an HTML form field for a user to specify
the URI-R and a datetime, then uses temporal negotiation to query a set of
archives and return links to the results (Figure 2b).
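To make the idea of an aggregated TimeMap concrete, the following minimal sketch parses two link-format TimeMaps and merges their mementos into one temporally sorted list. The archive hostnames and sample TimeMaps are invented for illustration, the regular expressions cover only the simple quoted-attribute form used here, and temporal sorting is a choice of this sketch rather than a requirement of Memento.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one link-format entry: <URI> followed by ;name="value" attributes.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (datetime, URI-M) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, attr_text in ENTRY.findall(timemap_text):
        attrs = dict(ATTR.findall(attr_text))
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return mementos


def aggregate(timemaps):
    """Merge mementos from several archives, dropping duplicate URI-Ms."""
    merged = {}
    for text in timemaps:
        for when, uri_m in parse_mementos(text):
            merged[uri_m] = when
    return sorted((when, uri_m) for uri_m, when in merged.items())


# Hypothetical TimeMaps from two archives for the same URI-R.
ARCHIVE_A = (
    '<http://example.com>; rel="original",\n'
    '<https://a.example/20200101000000/http://example.com>; '
    'rel="first memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT",\n'
    '<https://a.example/20220101000000/http://example.com>; '
    'rel="last memento"; datetime="Sat, 01 Jan 2022 00:00:00 GMT"'
)
ARCHIVE_B = (
    '<http://example.com>; rel="original",\n'
    '<https://b.example/20210101000000/http://example.com>; '
    'rel="first last memento"; datetime="Fri, 01 Jan 2021 00:00:00 GMT"'
)

aggregated = aggregate([ARCHIVE_A, ARCHIVE_B])
for when, uri_m in aggregated:
    print(when.isoformat(), uri_m)
```

A complete aggregator would also recompute the "first" and "last" relations over the combined set and emit its own "self" URI-T; both steps are omitted here.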
A central point of access also implies a central point of failure: if the
aggregator goes down, no further aggregation may be performed, and users must again
resort to querying individual web archives. In response, Alam and Nelson created
MemGator [1], a portable, open-source, cross-platform, user-deployable Memento
aggregator. This tool enables individuals to no longer rely solely on a single
web-accessible aggregator but also to configure, use, and potentially deploy their
own. Also, unlike Time Travel, a user has the ability to control which web archives
are queried for mementos. This newfound ability made the aggregation capability
accessible for researchers to explore further.
Memento is an extension to the Hypertext Transfer Protocol (HTTP). HTTP
is a stateless, client-server based protocol on which the web is built. In the context
of Memento, a client provides an HTTP request for a TimeMap of a URI in the
past, often by appending a URI-R to a templated endpoint (Figure 1). Both
the identifiers for a TimeMap and a memento are returned with corresponding
Link [20] HTTP response headers giving additional context to the representation.
A user (e.g., person) will typically act as a client through a user-agent (e.g., web
browser, cURL³) and may send an HTTP request to a Memento aggregator with
the expectation of receiving an HTTP response. The aggregator, in turn, acts as
a client to the web archives, relaying the request for the URI-R in the past and
expecting HTTP responses. This use case of a Memento aggregator playing the
role of a server and a client is abridged in Section 7.4.
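The aggregator's dual role can be sketched as a function that serves one incoming request by acting as a client to several archives at once. This is a minimal sketch under stated assumptions: the archive functions are stand-ins for HTTP requests (no network is used), the names and URIs are hypothetical, and the cut-off is a simple temporal threshold, which is only one possible criterion for a sufficient response. The `cancel_futures` argument requires Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def relay(uri_r, archives, timeout_s):
    """Query every archive concurrently; keep only responses that arrive in time."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(archives)))
    futures = {executor.submit(fetch, uri_r): name for name, fetch in archives.items()}
    done, _late = wait(list(futures), timeout=timeout_s)
    # Do not block on archives that missed the threshold.
    executor.shutdown(wait=False, cancel_futures=True)
    return {futures[f]: f.result() for f in done if f.exception() is None}


# Stand-ins for HTTP requests to two archives (hypothetical, no network).
def fast_archive(uri_r):
    return [f"https://fast.example/20200101000000/{uri_r}"]


def slow_archive(uri_r):
    time.sleep(1.0)  # simulates an archive that responds after the threshold
    return [f"https://slow.example/20210101000000/{uri_r}"]


responses = relay(
    "http://example.com",
    {"fast": fast_archive, "slow": slow_archive},
    timeout_s=0.2,
)
print(responses)
```

Here the slow archive's mementos are simply omitted from the aggregated response; an implementation could equally wait longer, report partial results, or cache late responses for subsequent requests.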
3 Related Work
Most research involving Memento aggregation relates to usage of the aggregator
rather than enhancement of the aggregation process. Where researchers prior to
MemGator would state “we requested URIs from the Time Travel Service”, this
statement became “we used MemGator to request URIs”, indicating that it was
useful for researchers to run their own aggregator instance [21, 14, 4]. A facet
of this use case is the ability for researchers to customize the set of web archives
to be used as the basis for querying, which is performed prior to running MemGator
by modifying a configuration file⁴. This paper examines the aggregation process
beyond accessing an aggregator and does so at a more abstract level than the
ability to customize the archival sources.
3.1 Using Aggregators Beyond End-User Aggregation
As MemGator is free and open-source software (cf. Time Travel), many research
endeavors on evolving the aggregation process have centered on extending it
rather than the limited, endpoint-based Time Travel ecosystem. While the
set of archives to be aggregated is static, both when accessing the Time Travel
service and when using a deployed MemGator instance, other standards-based
mechanisms like HTTP Prefer [26] allow a client to specify the set of archives
to be aggregated to an “enhanced” aggregator, in this case an extended
version of MemGator [13]. This approach [13] made the set of archives
that normally resides in a server-side configuration file customizable at query
time. Custom archival sources are specified in the “Prefer” HTTP request header,
whose value is the self-describing, base64-encoded JSON representing the
aggregator’s configuration of endpoints. A prototypical extension
of MemGator referenced by the authors required the aggregator to read the
HTTP request header at runtime and respond accordingly by requesting captures
only from the archives specified by the client.
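The encoding step described above can be sketched as follows. The preference name `archives`, the data-URI wrapping, and the JSON keys (`id`, `timemap`) are assumptions made for illustration and are not taken from the cited work; only the use of the Prefer request header carrying base64-encoded JSON comes from the description above.

```python
import base64
import json


def build_prefer_value(archives):
    """Encode an archive list as a base64 JSON payload for a Prefer header."""
    payload = json.dumps(archives, separators=(",", ":")).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    # Assumed syntax: a hypothetical "archives" preference holding a data URI.
    return f'archives="data:application/json;charset=utf-8;base64,{encoded}"'


def parse_prefer_value(value):
    """Aggregator side: recover the archive list from the preference value."""
    _name, _, quoted = value.partition("=")
    data_uri = quoted.strip('"')
    _prefix, _, encoded = data_uri.partition("base64,")
    return json.loads(base64.b64decode(encoded))


# Hypothetical client-selected archives replacing the server-side configuration.
client_archives = [
    {"id": "myarchive", "timemap": "https://myarchive.org/timemap/link/"},
]

prefer_value = build_prefer_value(client_archives)
headers = {"Prefer": prefer_value}
print(headers)
print(parse_prefer_value(prefer_value))
```

On receipt, an extended aggregator would decode the value and query only the listed archives, falling back to its own configuration when the header is absent.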
³ https://curl.se/
⁴ An aside: researchers who need to control the process do so either by manipulating
their internal software (LANL experimenting with Time Travel [8]) or, for those
outside of LANL, by using MemGator.