There are many sides to preservation.
Web of Linked Data?
We are losing thousands of Libraries of Alexandria each day
We have lost so much of the early Web history, just as we have lost so much of early Human history.
Kalev H. Leetaru - University of Illinois
Link Rot
Illustration by the Project Twins
Content Drift
Significant change in content within a 3-month period
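Content drift can be observed by fingerprinting successive representations of the same URI. A minimal sketch (byte-level hashing flags any change; real drift studies use finer-grained similarity measures):

```python
import hashlib

def fingerprint(representation: bytes) -> str:
    """Hash a resource representation so snapshots can be compared cheaply."""
    return hashlib.sha256(representation).hexdigest()

def has_drifted(old: bytes, new: bytes) -> bool:
    """A representation has drifted when its fingerprint changes between observations."""
    return fingerprint(old) != fingerprint(new)

# Two observations of the same URI, e.g. three months apart:
print(has_drifted(b"<p>Release 1.0</p>", b"<p>Release 2.0</p>"))  # True
print(has_drifted(b"<p>Release 1.0</p>", b"<p>Release 1.0</p>"))  # False
```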
Strategies
Observational: perceived as discrete
Historical: perceived as continuous
Versioning systems
Transactional
Notification-based
Snapshot
Web archive
See: Open Wayback
Versioning systems
See: MediaWiki
Transactional
See: SiteStory apache plugin
If a representation changes and nobody is around to see it, should it be archived?
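The transactional idea can be sketched as a wrapper around an origin server that archives each representation at the moment it is served, so only versions somebody actually observed are kept. This is an illustrative sketch, not SiteStory's actual implementation; the handler and URIs are hypothetical:

```python
import datetime

class TransactionalArchive:
    """Sketch of transactional archiving: every served representation is
    stored as a side effect of the request, so only observed versions exist."""

    def __init__(self, handler):
        self.handler = handler  # function: uri -> representation
        self.store = []         # list of (uri, datetime, representation)

    def serve(self, uri):
        body = self.handler(uri)
        # Archive at the moment of serving the response.
        self.store.append((uri, datetime.datetime.now(datetime.timezone.utc), body))
        return body

# Hypothetical origin server:
archive = TransactionalArchive(lambda uri: "representation of " + uri)
archive.serve("http://example.org/resource")
print(len(archive.store))  # 1
```

A version that changes without ever being requested is, by construction, never archived.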
Notification-based
Archive or Archiving?
Linked Data archiving as the product
RDF indexes for versioning
Dydra, Virtuoso, x-RDF-3X, ...
Representations of versions, provenance & time:
Technical. (Increasingly) popular research tracks.
Linked Data archiving as the process
Some technological building blocks
Linked Data interfaces, change detection, publishing, crawling & querying
Technical, as well as infrastructural & societal. Rather unknown territory (but there are technologies).
What assumptions are there about data evolution?
Historical Data
Provenance is a timeline.
Only one truth can exist at a time.
Timeseries databases, Wikipedia
Versioned Data
Provenance is a directed acyclic graph.
Multiple truths can exist at the same time.
Decay becomes more complex
Link Rot
Content Drift
Concept Drift
"Please don't change your vocabulary" (Check out DRIFT-A-LOD workshop)
Problem in other domains as well (Machine Learning)
Study these issues within Linked Data
Link Rot
Subject or Object cannot be dereferenced
Dataset/Interface is gone
Content Drift
Context graph of Subject or Object has changed
Concept Drift
Predicate or Object change meaning
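The three decay types above can be told apart (partially) by observable signals. A rough sketch, assuming we have a dereference status code and hashes of the subject's context graph from two points in time; concept drift is deliberately left out, since a predicate silently changing meaning is not detectable from these signals alone:

```python
def classify_decay(status_code, old_context_hash, new_context_hash):
    """Classify decay of a dereferenced subject/object URI (rough sketch):
    - link rot: the URI no longer dereferences at all;
    - content drift: it dereferences, but its context graph changed;
    - intact: it dereferences and the context graph is unchanged."""
    if status_code is None or status_code >= 400:
        return "link rot"
    if old_context_hash != new_context_hash:
        return "content drift"
    return "intact"

print(classify_decay(404, "abc", "abc"))  # link rot
print(classify_decay(200, "abc", "def"))  # content drift
print(classify_decay(200, "abc", "abc"))  # intact
```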
Archiving for the Reproducibility of Query results
Sustain the validity of claims
Backwards compatibility of applications
Federated querying is highly affected
How to shape a decentralized Quality of Service?
The Hyperlink is the simplest form of decentralization, which we are already failing to preserve.
Persistent Identification
Figure by Herbert Van de Sompel
Persistent Identification
Dependency on publisher registering the PIDs
Possible loss of connection between PIDs and the original
Dependency on the PID provider
Possibly replacing one potential link rot problem with another
Who are you to tell me my URI is not persistent?
ISWC Resources track:
Consensus on and trust in persistence in a decentralized Web: community-driven? standardization? blockchain,...?
Robust Links
Figure by Herbert Van de Sompel
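A Robust Link decorates an ordinary hyperlink with a snapshot URI (`data-versionurl`) and the datetime of linking (`data-versiondate`), as proposed by the Robust Links approach. A minimal sketch generating such an anchor; the URIs are hypothetical:

```python
def robust_link(href, version_url, version_date, text):
    """Build a Robust Link: the live URI plus the attributes that let a
    client fall back to an archived snapshot when the link rots."""
    return (f'<a href="{href}" '
            f'data-versionurl="{version_url}" '
            f'data-versiondate="{version_date}">{text}</a>')

link = robust_link(
    "http://example.org/page",                                    # live URI (hypothetical)
    "https://web.archive.org/web/2017/http://example.org/page",   # snapshot (hypothetical)
    "2017-10-21",
    "example page")
print("data-versionurl" in link)  # True
```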
Open challenges with Memento
Real-time data
HTTP Datetime format is per second
Parallel truths
No solution for accessing Versioned Data
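The per-second limitation is easy to demonstrate: Memento's `Accept-Datetime` header carries an HTTP-date, which has one-second granularity, so two real-time versions created less than a second apart are indistinguishable. A small sketch using the standard library:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime(dt):
    """Format a datetime for Memento's Accept-Datetime header
    (HTTP-date, RFC 1123 style: one-second granularity)."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

t1 = datetime(2017, 10, 21, 12, 0, 0, 250_000, tzinfo=timezone.utc)
t2 = datetime(2017, 10, 21, 12, 0, 0, 750_000, tzinfo=timezone.utc)
# Two versions 500 ms apart map to the same header value:
print(accept_datetime(t1) == accept_datetime(t2))  # True
```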
Who will be responsible for archiving?
Publisher
Snapshot
Versioning systems
3rd party
Hybrid: Publisher and/or 3rd party
Transactional
Notification-based
Snapshot
Often "End of Term" archive (DBpedia version)
Exchangeable archives, e.g. file-based HDT
Versioning systems
Memento support can improve
Depends on query expressivity
Significant progress in the RDF domain
Web
MediaWiki
RDF
Storage: Dydra, Virtuoso, ...
Memento-supported publishing: DBpedia Wayback machine, Linked Data Fragments Server
Hybrid: Snapshot + Versioning
Discrete snapshots + index for continuous versions
Linked Data pages
Tailr, ...
Triple Patterns
Ostrich (offset-enabled), ...
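The hybrid idea can be sketched in a few lines: keep a discrete snapshot plus an index of per-version deltas (added/removed triples), and materialize any continuous version by replaying deltas. This is a toy illustration in the spirit of systems like Tailr and Ostrich, not their actual storage layout:

```python
def materialize(snapshot, deltas, version):
    """Hybrid archiving sketch: reconstruct a version from a discrete
    snapshot plus an ordered index of (added, removed) triple deltas."""
    triples = set(snapshot)
    for added, removed in deltas[:version]:
        triples |= set(added)
        triples -= set(removed)
    return triples

snapshot = {("alice", "knows", "bob")}
deltas = [
    ([("alice", "knows", "carol")], []),   # delta to version 1
    ([], [("alice", "knows", "bob")]),     # delta to version 2
]
print(materialize(snapshot, deltas, 2))
# {('alice', 'knows', 'carol')}
```

Snapshots bound the replay cost; the delta index keeps storage proportional to change, which is the trade-off the hybrid strategy exploits.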
Web archive
Not much in place yet
Indexes, but no notion of time
Many technologies: targeted crawling, Sindice, LOD Laundromat, Linked Data crawling, ...
No guarantees on completeness
Transactional
Decentralized, sustainable solution
A challenge for completeness
Dependence on resource granularity, e.g. SPARQL results or Linked Data pages?
Interested to see how far we would get...
Notification-based
Data archiving interests more than just curators & activists
For instance, data-driven journalism.
Product: transparency of the editorial process
Process: interaction with users, public
Scholarly communication, cultural heritage, legal publications, community databases (Wikipedia & Wikidata)
Archivability of Linked Data
Linked Data is in essence easier to archive.
Raw, self-contained data
Already machine processable/understandable
No obfuscation by client-side scripting
Accessibility of content to stimulate archiving.
The content in HTML+RDFa that dokieli produces is accessible (readable) without requiring any CSS or JavaScript, i.e. text-browser safe. Breaking this "rule" in future development should be considered an anti-pattern (or a bug) in dokieli.
dokieli documentation , Sarven Capadisli
Choices in the Linked Data interface increase or decrease archivability.
Intelligent Server: high resource granularity
Intelligent Client: data not as accessible; needs to participate in the archiving process
Prevent mistakes from the past in standardization
Query interfaces: what can be archived?
Protocols: is it accessible?
Domain Modeling: can the semantics be preserved? How to select the subgraph?
There are many sides to preservation.
We don't start from scratch, many technologies are there.
Start covering the uncovered sides.
Add archiving to the discussion.