

Aggregator Reuse and Extension

for Richer Web Archive Interaction

Mat Kelly[0000−0002−0236−7389]

Drexel University, Philadelphia PA 19104, USA

mkelly@drexel.edu

https://matkelly.com

Abstract. Memento aggregators enable users to query multiple web archives for captures of a URI in time through a single HTTP endpoint. While this one-to-many access point is useful for researchers and end-users, aggregators are in a position to provide additional functionality to end-users beyond black box style aggregation. This paper identifies the state-of-the-art of Memento aggregation, abstracts its processes, highlights shortcomings, and offers systematic enhancements.

1 Introduction

Web archives act as a historical record of the web. The Internet Archive (IA) possesses the largest number of web archive holdings. These holdings are accessible through a set of interfaces to the Wayback Machine. Beyond IA, other web archives exhibit focused collection efforts, often providing unique captures within IA’s temporal and spatial (i.e., URL [ ]) voids [ ]. A common usage pattern in accessing IA’s captures is to request the archive’s web site at archive.org, submit a URL of interest by providing it in a text input field, then select a date and time from the set of available captures for that URL in the past. This pattern may differ between web archives’ respective web interfaces.

Memento [27] provides the standards-based interoperable means, dynamics, syntax, and semantics for representing identifiers for archival captures (mementos) from a set of web archives. Each archive that supports the Memento Framework provides an HTTP endpoint for retrieving mementos from its archival holdings. Users can send a request for all captures of a URL to a variety of supporting archives through a single endpoint by way of an accessible tool that performs the logic of querying and combining results from multiple sources: a Memento aggregator.

Memento aggregators typically hold references to a set of endpoints for web archives that implement the Memento Framework. An aggregator may express this through a URI “template” like Figure 1 or as a URI with an implicit append operation of a URI-R [ ]. Upon receiving a request from a client with a parameterized URL (e.g., the URI-R applied to the template URI), an aggregator relays the argument received in this request as parameters for subsequent requests to each archive. When the aggregator receives a sufficient response¹, as dictated by the logic of the aggregator in practice, the aggregator combines the results through a procedure that aligns with Memento syntax, often inclusive of temporal sorting². The aggregator returns this “aggregated” response to the client. This description somewhat encompasses the conventional role of the aggregator. Its place as a means for users to interface with multiple web archives through a single request has the potential to be further utilized, exploited, and made more generally useful.

¹ These criteria are implementation-specific and may be associated with a temporal threshold, memento count, etc.

arXiv:2210.01196v1 [cs.DL] 3 Oct 2022

t0: {scheme & hostname}/{resource type}/{format}/{URI-R}

t1: https://myarchive.org/timemap/link/http://example.com

m0: {scheme & hostname}/{datetime}/{URI-R}

m1: http://archive.md/20210619183508/https://icadl.net/icadl2021/

m2: https://archive.ph/eoQRZ

Fig. 1: An aggregator must be configured to supply parameters to an HTTP endpoint (like t1), often exhibited in the form of a “templated URI” (t0) for a URI-T as shown here. The suffixed red portion represents a URI-R http://example.com as used in practice. This URI templating is replicated (m0) with URI-Ms (e.g., m1), though a web archive need not identify its captures in this non-opaque manner (m2 and m1 identify the same memento).
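The template expansion of Figure 1 and the combining step described above can be sketched roughly as follows. This is a minimal illustration and not the implementation of any particular aggregator: the `{uri_r}` placeholder syntax, the second endpoint, and the function names `expand` and `aggregate` are assumptions made for the example, and temporal sorting is shown only because it is a common choice, not a requirement.

```python
from datetime import datetime, timezone

# Hypothetical TimeMap endpoints; "{uri_r}" marks where the URI-R is appended.
TEMPLATES = [
    "https://myarchive.org/timemap/link/{uri_r}",
    "https://otherarchive.example/timemap/link/{uri_r}",
]


def expand(template, uri_r):
    """Apply a URI-R to a templated endpoint (t0 to t1 in Figure 1)."""
    return template.replace("{uri_r}", uri_r)


def aggregate(per_archive_mementos):
    """Combine per-archive lists of (datetime, URI-M) into one sorted list."""
    merged = [m for mementos in per_archive_mementos for m in mementos]
    return sorted(merged, key=lambda m: m[0])


if __name__ == "__main__":
    for template in TEMPLATES:
        print(expand(template, "http://example.com"))
    # Hand-written stand-ins for the responses of two archives.
    first = [(datetime(2021, 6, 19, tzinfo=timezone.utc), "urim-from-archive-1")]
    second = [(datetime(2019, 1, 2, tzinfo=timezone.utc), "urim-from-archive-2")]
    for when, uri_m in aggregate([first, second]):
        print(when.isoformat(), uri_m)
```

In a real deployment the per-archive lists would come from HTTP responses to the expanded URIs; they are written by hand here to keep the sketch free of network access.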

This paper examines the hierarchical (yet decoupled) relationship between a Memento aggregator and Memento-compliant web archives. While an aggregator and a set of archives often exhibit a static one-to-many relationship (respectively), both more fundamental and potentially more complex hierarchies may be exhibited using existing infrastructure. These hierarchies may be strategically and efficiently enhanced by taking this additional capability into account, with the aim of strengthening the role of the aggregator in use cases for web archives. We build on existing work in defining a framework for aggregating public and private web archives [ ]. Our focus will be on identifying (Section 6) and mitigating (Section 7) some outstanding issues, both those introduced by the framework and those that exist in the current practice of interfacing with web archives using Memento aggregation.

2 Background

The Memento Framework [ ] introduces the ability to perform temporal negotiation on the web by relating the current and past representations of a web page. Past representations are identified by “URI-Ms” and the original representation by a “URI-R”, per Memento. Memento also introduces a resource to associate URI-Ms and URI-Rs through a structured listing called a TimeMap, identified by a “URI-T”. A web archive may return a TimeMap representing its holdings, inclusive of URI-Ms, a URI-R, URI-Ts, and a URI-G for a “TimeGate”. A TimeGate allows a client, through HTTP request headers, to specify a datetime basis for a likewise included URI-R. This paper relates to the information retrieval and relational aspects of Memento TimeMaps and not specifically to the temporal negotiation of Memento, the latter being a feature of TimeGates. We focus on the association of past and present URIs and not the ability to resolve the closest datetime, both of which Memento provides.

² It is important to note here that TimeMaps do not need to be temporally sorted to be Memento compliant.

Fig. 2: The “Time Travel” service provides a graphical, web-based endpoint to interface with LANL’s Memento aggregator. After submitting a URI and date range in the interface (Figure 2a), the results are displayed (Figure 2b), showing the extent of the captures from a variety of pre-configured, server-defined web archives.
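To make the relational content of a TimeMap concrete, the sketch below reads a small link-format listing and separates the URI-R, URI-G, URI-Ts, and URI-Ms by their `rel` values. It is only an illustration: the sample listing, its URIs, and the name `parse_timemap` are invented for this example, and the parser assumes well-formed input in which every memento entry carries a `datetime` attribute.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by any number of ;name="value" attributes.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

# Hypothetical TimeMap body for the URI-R http://example.com.
SAMPLE = (
    '<http://example.com>;rel="original",\n'
    '<https://myarchive.org/timemap/link/http://example.com>'
    ';rel="self";type="application/link-format",\n'
    '<https://myarchive.org/timegate/http://example.com>;rel="timegate",\n'
    '<https://myarchive.org/20210619183508/http://example.com>'
    ';rel="first last memento";datetime="Sat, 19 Jun 2021 18:35:08 GMT"\n'
)


def parse_timemap(text):
    """Group the URIs of a link-format TimeMap by the role they play."""
    result = {"original": None, "timegate": None, "timemaps": [], "mementos": []}
    for uri, raw_attrs in ENTRY.findall(text):
        attrs = dict(ATTR.findall(raw_attrs))
        rels = attrs.get("rel", "").split()
        if "memento" in rels:
            when = parsedate_to_datetime(attrs["datetime"])
            result["mementos"].append((when, uri))
        elif "original" in rels:
            result["original"] = uri
        elif "timegate" in rels:
            result["timegate"] = uri
        elif "self" in rels or "timemap" in rels:
            result["timemaps"].append(uri)
    return result


if __name__ == "__main__":
    parsed = parse_timemap(SAMPLE)
    print(parsed["original"], len(parsed["mementos"]))
```

Matching whole entries with a regular expression, rather than splitting on commas, avoids breaking on the comma inside each `datetime` value.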

The concept of aggregation goes beyond the Memento specification by leveraging a similar structure to TimeMaps but allowing the URIs contained within the aggregated TimeMap to identify resources at multiple archives instead of a single archive. The Research Library at Los Alamos National Laboratory (LANL) deployed the original Memento aggregator [ ], currently accessible through a web interface via the Time Travel service at https://timetravel.mementoweb.org/. This web service (Figure 2a) provides an HTML form field for a user to specify the URI-R and a datetime, then uses temporal negotiation to query a set of archives and return links to the results (Figure 2b).

A central point of access also implies a central point of failure: if the aggregator goes down, no further aggregation may be performed, and users must again resort to querying individual web archives. In response, Alam and Nelson created MemGator [ ], a portable, open-source, cross-platform, user-deployable Memento aggregator. This tool means that individuals no longer need to rely solely on a single web-accessible aggregator and can instead configure, use, and potentially deploy their own. Also, unlike Time Travel, a user has the ability to control which web archives are queried for mementos. This newfound ability made the aggregation capability accessible for researchers to explore further.

Memento is an extension to the Hypertext Transfer Protocol (HTTP). HTTP is a stateless, client-server based protocol on which the web is built. In the context of Memento, a client provides an HTTP request for a TimeMap of a URI in the past, often by appending a URI-R to a templated endpoint (Figure 1). Both the identifiers for a TimeMap and a memento are returned with corresponding Link [20] HTTP response headers giving additional context to the representation. A user (e.g., person) will typically act as a client through a user-agent (e.g., web browser, cURL³) and may send an HTTP request to a Memento aggregator with the expectation of receiving an HTTP response. The aggregator, in turn, acts as a client to the web archives, relaying the request for the URI-R in the past and expecting HTTP responses. This use case of a Memento aggregator playing the role of a server and a client is abridged in Section 7.4.
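The dual role described above, a server toward the user and a client toward each archive, can be pictured with the rough sketch below. It is an assumption-laden illustration rather than how any existing aggregator works: each archive is stood in for by a plain Python callable instead of a real HTTP request, the temporal threshold is one possible reading of a "sufficient response", and the names `relay`, `archive_clients`, and `timeout_s` are invented here.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def relay(uri_r, archive_clients, timeout_s=5.0):
    """Query every archive for uri_r and keep what arrives before the threshold.

    archive_clients maps an archive name to a callable that takes a URI-R and
    returns that archive's mementos (a stand-in for an HTTP request).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(archive_clients)))
    futures = {
        pool.submit(fetch, uri_r): name
        for name, fetch in archive_clients.items()
    }
    # Act as a client: wait for the archives, but only up to the threshold.
    done, _pending = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)

    # Act as a server: report only the archives that answered without error.
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    def healthy(uri_r):
        return ["https://myarchive.org/20210619183508/" + uri_r]

    def failing(uri_r):
        raise RuntimeError("archive unavailable")

    clients = {"myarchive": healthy, "broken": failing}
    print(relay("http://example.com", clients, timeout_s=2.0))
```

Archives that fail or miss the threshold are simply left out of the response, which is one of several policies an aggregator could adopt.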

3 Related Work

Most research involving Memento aggregation relates to usage of the aggregator rather than enhancement of the aggregation process. Where researchers prior to MemGator would state “we requested URIs from the Time Travel Service”, this statement became “we used MemGator to request URIs”, indicating that it was useful for researchers to run their own aggregator instance [ ]. A facet of this use case is the ability for researchers to customize the set of web archives to be used as the basis for querying, which is performed prior to running MemGator by modifying a configuration file⁴. This paper examines the aggregation process beyond accessing an aggregator and does so at a more abstract level than the ability to customize the archival sources.

3.1 Using Aggregators Beyond End-User Aggregation

As MemGator is free and open-source software (cf. Time Travel), many research endeavors on evolving the aggregation process have centered around enhancing its development beyond the limited endpoint-based Time Travel ecosystem. While the set of archives to be aggregated is static, both in accessing the Time Travel service and in a deployed MemGator instance, other standards-based mechanisms like HTTP Prefer [ ] provide a means of allowing a client to specify the set of archives aggregated to an “enhanced” aggregator, in this case an extended version of MemGator [ ]. This approach [ ] entailed making the set of archives that normally resides in a server-side configuration file customizable at query time. The specification of custom archival sources utilizes the “Prefer” HTTP request header with a value being the self-describing, base64-encoded JSON representing the aggregator’s configuration of endpoints. A prototypical extension of MemGator referenced by the authors required the aggregator to read the HTTP request header and respond accordingly at runtime to request captures only from the archives specified by the client.
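As a rough picture of the encoding step described above, the following sketch packs a list of archive endpoints into a base64-encoded JSON value for a `Prefer` header and unpacks it again on the aggregator side. The exact header syntax is not given in this section, so the `archives` preference name, the data-URI wrapper, and the JSON field names are assumptions made for illustration only.

```python
import base64
import json

# Assumed wrapper: a preference named "archives" carrying a data URI.
PREFIX = 'archives="data:application/json;charset=utf-8;base64,'


def build_prefer_value(archives):
    """Client side: encode the archive configuration for a Prefer header."""
    payload = json.dumps(archives).encode("utf-8")
    return PREFIX + base64.b64encode(payload).decode("ascii") + '"'


def read_prefer_value(value):
    """Aggregator side: recover the archive configuration from the header."""
    if not (value.startswith(PREFIX) and value.endswith('"')):
        raise ValueError("unrecognized Prefer value")
    encoded = value[len(PREFIX):-1]
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical configuration entry; the field names are illustrative.
    config = [{"id": "myarchive", "timemap": "https://myarchive.org/timemap/link/"}]
    header_value = build_prefer_value(config)
    print("Prefer:", header_value)
    print(read_prefer_value(header_value) == config)
```

Carrying the media type and encoding inside the value is what makes it self-describing: the aggregator can decode it without any out-of-band agreement on the format.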

³ https://curl.se/

⁴ An aside: researchers that need to control the process do so either through manipulation of their internal software (LANL experimenting with Time Travel [8]) or, for those outside of LANL, by utilizing MemGator.
