# Content Claims Protocol

![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square)

## Editors

- [Alan Shaw](https://github.com/alanshaw), [DAG House](https://dag.house/)

## Authors

- [Irakli Gozalishvili](https://github.com/Gozala), [DAG House](https://dag.house/)
- [Mikeal Rogers](https://github.com/mikeal), [DAG House](https://dag.house/)

# Abstract

A UCAN-based protocol allowing actors to share information about specific content (identified by CID).

> We base the protocol on top of [UCAN invocation specification 0.2](https://github.com/ucan-wg/invocation/blob/v0.2/README.md#23-ipld-schema)
> [Original proposal](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view) (includes implementation notes)

## Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).

# Introduction

## Motivation

## NextGen IPFS Content Discovery

Content discovery in IPFS today is "peer based", meaning that an IPFS client first performs a "content discovery" step and then a "peer discovery" step to read data in the following way:

1. Client finds peer IDs that have announced they have a CID.
2. Client finds transport protocols for the given peer ID that it can use to read IPFS data from that peer.
3. Client retrieves data from that peer.

![Diagram: peer-based content discovery in IPFS](https://hackmd.io/_uploads/r1pmgLuVh.png)

File reads in this system require that two entities, the client and the
peer serving the data, both support an IPFS aware transport protocol.
The peer serving a given CID is required to serve data at a specific
endpoint and keep that endpoint operational for data to be available.

When a publisher wants to hire a remote storage device, this protocol design presents a challenge: these remote storage devices don’t speak native IPFS protocols, so IPFS clients can’t read content from them directly. We end up having to place a node between these remote storage devices and the client, proxying the data and incurring unnecessary egress.

![Diagram: a proxy node between remote storage devices and IPFS clients](https://i.imgur.com/Jan0Opg.png)

Content Claims change this.

Publishers post **verifiable claims** about IPFS data they publish in remote storage devices into Content Discovery networks and services.

From these claims, IPFS clients can request data directly from remote storage devices over standard transports like HTTP.

![Diagram: clients reading directly from remote storage using content claims](https://i.imgur.com/4iutSwA.png)

With content claims, any actor can provide data into the network without making the data available over an IPFS aware transport protocol. Rather than "serving" the data, content providers post verifiable claims about the data and its location.

Clients can use these claims to read directly, over any existing transport (mostly HTTP), and in the act of reading the data will verify the related claims.

This removes a **substantial** source of cost from providing content into the network. Content can be published into the network "at rest" on any permanent or temporary storage device that supports reading over HTTP.

# Protocol

## Claim Types

The requirements for content claims break down into a few isolated components, each representing a specific claim. These claims are assembled together to represent the proof information necessary for retrieving content.

All claim types map a **single** CID to "claim information."

While you can derive block indexes from these claims (see: Inclusion Claims), each individual claim is indexed by a single **significant** CID (file root, directory root, etc.) referenced by the `content` field.

These claims include examples of a UnixFS file encoded into CAR files, but the protocol itself makes heavy use of CIDs in order to support a variety of future protocols and other use cases.

Since this protocol builds upon the UCAN Invocation Specification, all of these claims are contained within a message from an *issuer* to a *destination*. This means it can be used to send specific actors unique route information apart from public content discovery networks, and can also be used to send messages to public content discovery networks by simply addressing them to a DID representing the discovery network.
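
For illustration, a claim sent to a public discovery network might travel in an envelope like the following. This is a rough sketch only: the `iss`/`aud` addressing follows UCAN conventions, but the exact envelope layout is defined by the UCAN invocation spec, and the DIDs and field layout shown here are hypothetical.

```javascript
{
  "iss": "did:key:zPublisher...", /* hypothetical publisher DID */
  "aud": "did:web:discovery.example", /* hypothetical discovery network DID */
  "run": {
    /* any of the claims described below, e.g. */
    "op": "assert/location",
    "rsc": "https://web3.storage",
    "input": { /* ... */ }
  }
}
```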

### Location Claims

* Claims that a CID is available at a URL.
* Block Level Interface,
  * GET Body MUST match multihash digest in CID.
  * Using CAR CID, or any future BlockSet() identifier, allows this to be a multi-block interface.

```javascript
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content" : CID /* CAR CID */,
    "location": "https://r2.cf/bag...car",
    "range"   : [ start, end ] /* Optional: Byte Range in URL */
  }
}
```

From this, we can derive a verifiable Block interface over HTTP or any other URL-based address. This could even be used with `BitSwap` using `bitswap://`.
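
As a sketch of what that verification could look like in JavaScript (assuming the `multiformats` library, a CID that carries a sha2-256 multihash, and the placeholder URL from the example above; not a normative implementation):

```javascript
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'

// Compare two byte arrays.
const bytesEqual = (a, b) => a.length === b.length && a.every((v, i) => v === b[i])

// Fetch the bytes a location claim points at and verify them against the
// CID, assuming the CID carries a sha2-256 multihash.
async function fetchVerified (cid, location) {
  const res = await fetch(location)
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${location}`)
  const bytes = new Uint8Array(await res.arrayBuffer())
  const digest = await sha256.digest(bytes)
  if (!bytesEqual(digest.digest, cid.multihash.digest)) {
    throw new Error(`bytes from ${location} do not match ${cid}`)
  }
  return bytes
}

// e.g. await fetchVerified(CID.parse('bag...'), 'https://r2.cf/bag...car')
```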

Filecoin Storage Providers (running boost) would be addressable via the [lowest cost read method available (HTTP GET Range in Piece)](https://boost.filecoin.io/http-retrieval#retrieving-a-full-piece).

And you can provide multiple locations for the same content using a list.

```javascript
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content" : CID /* CAR CID */,
    "location": [ "https://r2.cf/bag...car", "s3://bucket/bag...car" ],
    "range"   : [ start, end ] /* Optional: Byte Range in URL */
  }
}
```
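
Given several claimed locations, a client could simply race them and keep the first response that verifies, as in this sketch (reusing the hypothetical `fetchVerified` above; non-HTTP locations such as `s3://` would first need translating into a fetchable URL):

```javascript
// Try every claimed location concurrently and resolve with the first
// response whose bytes verify against the CID.
const fetchFromAny = (cid, locations) =>
  Promise.any(locations.map(url => fetchVerified(cid, url)))
```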

### Equivalency Claims

We also have cases in which the same data is referred to by another CID and/or multihash. Equivalency claims represent this association as a verifiable claim.

```javascript
{
  "op": "assert/equals",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* CAR CID */,
    "equals" : CID /* CommP CID */
  }
}
```

We should expect content discovery services to index these claims by `content` CID, since that is standard across all claims, but since this is a cryptographic equivalency, equivalency-claim-aware systems are encouraged to index both CIDs.
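
A minimal sketch of what indexing both sides could mean for an in-memory claim store (the claim shape follows the example above; the names are hypothetical):

```javascript
// Store each equivalency claim under both of its CIDs, so a lookup by
// either the CAR CID or the CommP CID returns the claim.
const claimIndex = new Map()

function indexEquivalencyClaim (claim) {
  const { content, equals } = claim.input
  for (const key of [String(content), String(equals)]) {
    const claims = claimIndex.get(key) ?? []
    claims.push(claim)
    claimIndex.set(key, claims)
  }
}
```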

### Inclusion Claims

* Claims that a CID includes the contents claimed in another CID.
* Multi-Block Level Interface,
  * One CID is included in another CID.
  * When that's a CARv2 CID, the CID also provides a block level index of the referenced CAR CID's contents.
* Using CAR CIDs and CARv2 Indexes, get a multi-block interface.
  * Combined with HTTP location information for the CAR CID, this means we can read individual block sections using HTTP Ranges.

```javascript
{
  "op": "assert/inclusion",
  "rsc": "https://web3.storage",
  "input": {
    "content" : CID /* CAR CID */,
    "includes": CID /* CARv2 Index CID */,
    "proof"   : CID /* Optional: zero-knowledge proof */
  }
}
```

This can also be used to provide verifiable claims for sub-deal inclusions in Filecoin.

```javascript
{
  "op": "assert/inclusion",
  "rsc": "https://web3.storage",
  "input": {
    "content" : CID /* PieceCID (CommP) */,
    "includes": CID /* Sub-Deal CID (CommP) */,
    "proof"   : CID /* Sub-Deal Inclusion Proof */
  }
}
```

### Partition Claims

* Claims that a CID’s graph can be read from the blocks found in parts,
  * `content` (Root CID)
  * `blocks` (List of ordered CIDs)
  * `parts` (List of archives [CAR CIDs] containing the blocks)

```javascript
{
  "op": "assert/partition",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* Content Root CID */,
    "blocks" : CID /* CIDs CID */,
    "parts"  : [
      CID /* CAR CID */,
      CID /* CAR CID */,
      ...
    ]
  }
}
```

# Consuming Claims

An IPFS client wishing to perform a verifiable `read()` of IPFS data can construct one from verifiable claims. Given the amount of cryptography and protocol expertise necessary to perform these operations, a few examples are detailed below.

## IPFS File Publishing

An IPFS File is a contiguous set of bytes (`source_file`) that has been encoded into a merkle DAG. The resulting tree is referenced by CID using the `dag-pb` codec.

When files are encoded and published into the network, they are often packed into CAR files. One CAR file might contain **many** IPFS Files, and one large file could be encoded into **many** CAR Files. CAR files can be referenced by CID (digest of the CAR) and are typically exchanged transactionally (block level interface). This can get confusing, as transaction (block level) interfaces now ***contain*** multi-block interfaces.

### Large File

Large file uploads can be problematic when performed as a single pass/fail transaction, so it has become routine to break large upload transactions into smaller chunks. The default encoder for web3.storage (w3up) is configurable but defaults to ~100MB: as the file is encoded into the UnixFS block structure, it is streamed into CAR encodings of the configured size. Each of these is uploaded transactionally, and if the operation were aborted and restarted it would effectively resume, as long as the file and settings hadn't changed.

When the encode is complete, the root CID of the UnixFS merkle tree will be available as a `dag-pb` CID. This means we need a claim structure that can cryptographically route from this `dag-pb` CID to the data ***inside*** the resulting CAR files.

Content Claims allow us to expose the cryptography in the IPFS encoding to clients, such that clients can read **directly from the encoded CAR**, rather than requiring an intermediary to *assemble* the UnixFS merkle tree for them.

So, after we stream encode a bunch of CARs and upload them, we encode and publish (see the sketch after this list):

* A **Partition Claim** for the ***Content Root CID***, which includes:
  * An ordered list of every CID we encoded.
  * A list of CAR CIDs where those CIDs can be found.
* **Inclusion Claims** for every ***CAR CID***, which include:
  * A CARv2 index of the CAR file.
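
A rough sketch of assembling those claims once the upload completes (the claim shapes follow the examples earlier in this document; `rootCID`, `blockListCID`, `carCIDs`, and `indexCIDs` are hypothetical outputs of the encoding step):

```javascript
// Build the partition claim plus one inclusion claim per CAR shard.
function claimsForUpload (rootCID, blockListCID, carCIDs, indexCIDs) {
  const partition = {
    op: 'assert/partition',
    rsc: 'https://web3.storage',
    input: { content: rootCID, blocks: blockListCID, parts: carCIDs }
  }
  const inclusions = carCIDs.map((carCID, i) => ({
    op: 'assert/inclusion',
    rsc: 'https://web3.storage',
    input: { content: carCID, includes: indexCIDs[i] }
  }))
  return [partition, ...inclusions]
}
```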

Now, if we publish these CAR files to Filecoin, we're going to want to capture another address (CommP) for the CAR file. We then include an

* **Equivalency Claim** for every ***CAR CID***, which claims this CAR CID is equivalent to CommP for that CAR file.

Once that CAR data is aggregated into Filecoin deals, the list of CommP addresses included in each deal can be used to compute a sub-inclusion proof. At this time you may also publish:

* **Inclusion Claims** for every ***Sub-Deal CID*** (CommP), claiming each Sub-Deal CID is "included" in its Piece CID (also CommP), which include:
  * The referenced sub-deal inclusion proof.

From this point forward, you can look up Filecoin Storage Providers on chain 😁

The preferred (lowest cost) method of reading data from Storage Providers is through an HTTP interface that requests data by Piece CID. You could describe these locations, perhaps playing the role of chain oracle, as:

* **Location Claims** for every ***CAR CID*** in every Storage Provider, which include:
  * The offsets in the Piece addressable by HTTP Range Header.

And of course, you can also use **Location Claims** for any other HTTP accessible storage system you ever decide to put a CAR into.

### *`read(DagPBCID, offset, length)`*

Now that we have a better idea of what sorts of claims are being published, let's construct a read operation from the claims.

It's out of scope for this specification to define the exact means by which you **discover** claims, but content discovery systems are expected to index these claims by `content` CID. (In theory you could index every block in the CARv2 indexes, but we should assume many actors don't want to pay for that, so everything in this specification requires indexing only the `content` CID in each claim.) However, once a publisher sends these claims to public networks, it should presume all addresses exposed in the claims are "discoverable" from a security and privacy perspective.

Once you have the claims for a given `dag-pb` CID, like the claim examples above, you can:

* Build a Set() of CAR CIDs claimed to be holding relevant blocks, and
* Build a Map() of inclusions indexed by CAR CID, and
  * depending on your CARv2 library, you may want to parse the CARv2 indexes into something you can read from quickly (see the sketch after this list), and
* Build a Map() of locations indexed by CAR CID,
* and you may even be able to build those concurrently.
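
Parsing the CARv2 indexes into something quick to read from might look like this sketch (`entries` stands for already-decoded index entries, in whatever shape your CARv2 library yields them):

```javascript
// Build a lookup from block CID to its byte position within a CAR,
// given decoded CARv2 index entries of { cid, blockOffset, blockLength }.
function indexToMap (entries) {
  const offsets = new Map()
  for (const { cid, blockOffset, blockLength } of entries) {
    offsets.set(String(cid), { blockOffset, blockLength })
  }
  return offsets
}
```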

Take the list of blocks from the partition claim and find the smallest number of CAR CIDs containing a complete Set.
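
Finding the true smallest set is the classic set-cover problem, so a greedy approximation is the usual approach. A sketch, with inputs derived from the partition and inclusion claims (names hypothetical):

```javascript
// Greedy set cover: repeatedly pick the CAR holding the most
// still-missing blocks until every block is accounted for.
function selectCars (blockCIDs, carBlockSets /* Map<string, Set<string>> */) {
  const missing = new Set(blockCIDs)
  const chosen = []
  while (missing.size > 0) {
    let best = null
    let bestCount = 0
    for (const [carCID, blocks] of carBlockSets) {
      let count = 0
      for (const cid of missing) if (blocks.has(cid)) count++
      if (count > bestCount) { best = carCID; bestCount = count }
    }
    if (best === null) throw new Error('some blocks are not in any claimed CAR')
    for (const cid of carBlockSets.get(best)) missing.delete(cid)
    chosen.push(best)
  }
  return chosen
}
```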

Using the CARv2 indexes, you can now determine the byte offsets within every CAR for every block. You should also take the time to improve read performance by closing the distance between many of those offsets, so that you have fewer individual reads for contiguous sections of data. This also reduces the burden your client puts on data providers, since the same data is fetched in fewer requests.
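
Closing that distance is a simple range-coalescing pass, sketched below (the `gap` threshold is a tuning parameter for how many unwanted bytes you tolerate per merged read):

```javascript
// Merge byte ranges that sit within `gap` bytes of each other so that
// contiguous sections of a CAR are fetched in a single request.
function coalesceRanges (ranges /* [start, end] pairs */, gap = 16384) {
  if (ranges.length === 0) return []
  const sorted = [...ranges].sort((a, b) => a[0] - b[0])
  const merged = [[...sorted[0]]]
  for (const [start, end] of sorted.slice(1)) {
    const last = merged[merged.length - 1]
    if (start <= last[1] + gap) last[1] = Math.max(last[1], end)
    else merged.push([start, end])
  }
  return merged
}
```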

You're now free to request this data from all CARs concurrently in a single round-trip, wherever they are.

#### From Remote HTTP Endpoint

You can use HTTP Range Headers to request the required block sections over HTTP. Keep in mind that Location Claims for CAR CIDs can carry a byte range; that base offset must be added to the offsets you calculate for blocks *within* the CAR.
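
For example, in this sketch `blockOffset` and `blockLength` come from the CARv2 index, and `claimRange` is the optional `range` from the Location Claim:

```javascript
// Read one block section via an HTTP Range request, shifting the
// in-CAR offset by the claim-level range the CAR occupies in the URL.
async function readBlockSection (location, claimRange, blockOffset, blockLength) {
  const base = claimRange ? claimRange[0] : 0
  const start = base + blockOffset
  const end = start + blockLength - 1 // HTTP byte ranges are inclusive
  const res = await fetch(location, {
    headers: { Range: `bytes=${start}-${end}` }
  })
  if (res.status !== 206) throw new Error(`expected 206, got ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```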

#### From Filecoin Sub-Inclusion Proofs

Pretty simple really:

* The **Equivalency Claims** tell you the sub-deal CID.
* The **Inclusion Claims** for those sub-deal CIDs give you the Piece CIDs that Storage Providers commit to on chain.

#### From BitSwap 🤯

One amazing thing about these protocols is that by simply using a BitSwap URL, we have BitSwap peers as "locations" for CAR CIDs in the same **Location Claim** protocol.

One could build a system/protocol for "loading" CAR files into BitSwap peers and then publish Location Claims.

Even over BitSwap, this would represent a performance improvement over the current client protocols as you avoid round-trips traversing the graph.

## IPLD Schema

```ipldsch
# For UCAN IPLD Schema see
# https://github.com/ucan-wg/ucan-ipld

type URL string
type URLs [URL]
type CIDs [&Any]

type Assertion union {
  | AssertLocation "assert/location"
  | AssertEquals "assert/equals"
  | AssertInclusion "assert/inclusion"
  | AssertPartition "assert/partition"
} representation inline {
  discriminantKey "op"
}

type AssertLocation struct {
  on URL
  input ContentLocation
}
type AssertEquals struct {
  on URL
  input ContentEquals
}
type AssertInclusion struct {
  on URL
  input ContentInclusion
}
type AssertPartition struct {
  on URL
  input ContentPartition
}

type Range struct {
  start Int
  end Int
} representation tuple

type Locations union {
  | URL string
  | URLs list
} representation kinded

type ContentLocation struct {
  content &Any
  location Locations
  range optional Range
}

type ContentEquals struct {
  content &Any
  equals &Any
}

type ContentInclusion struct {
  content &Any
  includes &Any
  proof optional &Any
}

type ContentPartition struct {
  # Content that is partitioned
  content &Any
  # Links to the CARs in which content is contained
  parts [&ContentArchive]
  # Block addresses in read order
  blocks &CIDs
}
```