Merkle reference multiformat #357

Gozala · 2024-10-01T15:37:55Z

This PR proposes adding a code for the merkle-reference multiformat, described here. Allocating a code would enable us to format merkle references as IPLD links and integrate them seamlessly into the rest of the IPLD ecosystem.

Format

A merkle reference can be viewed as a hashing algorithm defined over all of the IPLD types (as opposed to just bytes). However, just like CIDs, it may use different underlying hashing algorithms (ℹ️ although using the same algorithm across the full DAG is implied).

Therefore merkle-reference is proposed as a standalone multiformat with the following format:

<merkle-reference> ::= <merkle-reference-multicodec><content-multihash>
# or, expanded:
<merkle-reference> ::= <0x07, the code for merkle-reference><multihash of merkle-folded data>
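The byte layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the merkle-folding step from the linked spec is not implemented, so a SHA-256 digest of placeholder bytes stands in for the folded root, and the helper name `merkle_reference` is hypothetical. The values 0x12 and 0x20 are the multihash code and digest length for sha2-256.

```python
import hashlib

MERKLE_REFERENCE_CODE = 0x07  # code proposed in this PR (not an allocated code)
SHA2_256_CODE = 0x12          # multihash code for sha2-256
SHA2_256_LENGTH = 0x20        # sha2-256 digest length in bytes (32)


def merkle_reference(root_digest: bytes) -> bytes:
    """Hypothetical helper: <merkle-reference-multicodec><content-multihash>.

    `root_digest` is assumed to be the 32-byte sha2-256 output of the
    merkle-folding process, which is NOT implemented here.
    """
    if len(root_digest) != SHA2_256_LENGTH:
        raise ValueError("expected a 32-byte sha2-256 digest")
    multihash = bytes([SHA2_256_CODE, SHA2_256_LENGTH]) + root_digest
    return bytes([MERKLE_REFERENCE_CODE]) + multihash


# Stand-in digest; a real one would come from merkle-folding IPLD data.
placeholder_digest = hashlib.sha256(b"placeholder").digest()
ref = merkle_reference(placeholder_digest)
print(ref.hex())
```

With a 32-byte digest the result is 35 bytes: one byte for the proposed code, two for the multihash header, and 32 for the digest.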

Integration

There is a nice duality to merkle references: they can also be viewed as a lossy IPLD codec in which the bytes are derived through the merkle-folding process (described in the linked spec).

By prefixing a merkle reference with the varint 0x01, it can be formatted as a valid CIDv1, in which case the 0x07 code could be treated as an IPLD codec.

This could be used to integrate merkle references into the rest of the IPLD ecosystem by formatting them as IPLD links.
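A rough sketch of that CIDv1 framing, again in Python and with the same caveats: the digest is a stand-in for the real merkle-folded root, 0x07 is the proposed (not allocated) code, and the names `as_cid_v1` and `to_base32_string` are hypothetical. The string form assumes the usual multibase base32 convention (prefix `b`, lowercase RFC 4648 alphabet, no padding).

```python
import base64
import hashlib

CID_VERSION_1 = 0x01          # CIDv1 version varint
MERKLE_REFERENCE_CODE = 0x07  # proposed code, read here as the IPLD codec
SHA2_256_CODE = 0x12          # multihash code for sha2-256
SHA2_256_LENGTH = 0x20        # digest length in bytes


def as_cid_v1(root_digest: bytes) -> bytes:
    """Hypothetical helper: <0x01><0x07><multihash> laid out as a binary CIDv1."""
    multihash = bytes([SHA2_256_CODE, SHA2_256_LENGTH]) + root_digest
    merkle_ref = bytes([MERKLE_REFERENCE_CODE]) + multihash
    return bytes([CID_VERSION_1]) + merkle_ref


def to_base32_string(cid: bytes) -> str:
    """Multibase base32: 'b' prefix, lowercase alphabet, padding stripped."""
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


# Stand-in digest; a real one would come from the merkle-folding process.
digest = hashlib.sha256(b"placeholder").digest()
cid = as_cid_v1(digest)
print(to_base32_string(cid))
```

Any CID parser that reads version, codec, and multihash in that order would see codec 0x07 followed by an ordinary sha2-256 multihash.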

alanshaw · 2024-10-23T08:42:56Z

I like the idea. I don't know the process to have codes accepted here, do we require at least one implementation?

Gozala · 2024-10-24T06:52:55Z

I like the idea. I don't know the process to have codes accepted here, do we require at least one implementation?

I have an implementation that currently lives here https://github.com/Gozala/merkle-reference

bumblefudge · 2024-11-18T09:52:05Z

I like the idea. I don't know the process to have codes accepted here, do we require at least one implementation?

See Robin's process doc: "Provide evidence that the encoding is supported in at least two production implementations" is required for DRAFT status, and this would also be a requirement if the registry were administered at IANA or at W3C's new registry process. Since we're in the extremely scarce single-byte range, it might make more sense to wait until this is a little more developed/further along before robbing future-multiformats of one slot needed for future CID and/or other structured "third layer" tags?

Another thing that would help would be if the current spec were formatted as a complete, linter-passing IETF internet-draft (example from a multibase registration in progress) or W3C CG respec doc with test-vectors and all that, rather than an open PR with unresolved comments on the web3.storage IP process repo... not technically a requirement at DRAFT status but maybe one that would help allay concerns of single-byte squatting from the least-generous possible readers 😉

Gozala · 2024-11-19T01:28:30Z

See Robin's process doc: "Provide evidence that the encoding is supported in at least two production implementations" is required for DRAFT status, and this would also be a requirement if the registry were administered at IANA or at W3C's new registry process.

Ah I was not aware of the new process, thanks for pointing it out.

Since we're in the extremely scarce single-byte range, it might make more sense to wait until this is a little more developed/further along before robbing future-multiformats of one slot needed for future CID and/or other structured "third layer" tags?

Sounds reasonable, yet seems like a double standard when I see

multicodec/table.csv

Lines 4 to 5 in 352d05a

    
cidv2, cid, 0x02, draft, CIDv2

cidv3, cid, 0x03, draft, CIDv3

CIDv2 has been discussed forever, and I would be very surprised if there are two production implementations. I have not even heard of CIDv3; it is probably something new that happened since I fell off the interplanetary space 😅 I won't even mention that most of the codes in that table would fail to meet the new criteria.

As for a second implementation, there is one in development in Rust and I can update the thread here when it's ready.

Another thing that would help would be if the current spec were formatted as a complete, linter-passing IETF internet-draft (example from a multibase registration in progress) or W3C CG respec doc with test-vectors and all that, rather than an open PR with unresolved comments on the web3.storage IP process repo... not technically a requirement at DRAFT status but maybe one that would help allay concerns of single-byte squatting from the least-generous possible readers 😉

The most up-to-date spec lives here: https://github.com/Gozala/merkle-reference/blob/main/docs/spec.md. There is also an interactive version, https://observablehq.com/@gozala/merkle-references, that anyone can test with various data sets.

Some test fixtures are available here, and they'll likely move into a more portable form once the Rust implementation is there.

Reformatting it into IETF / W3C spec format is plausible, but as a one-man show I have to be pragmatic about where I spend time, and it seemed a lot more reasonable to budget that after the code is in the table as opposed to before.

Gozala · 2024-11-19T01:31:17Z

I should mention that presence in the multicodec table is nice to have mostly for backwards compatibility with the IPLD addressing scheme. In practice I don't expect multiformat prefixes to be used beyond bridging with the legacy (IPLD) system.

bumblefudge · 2024-11-25T14:05:18Z

Sounds reasonable, yet seems like a double standard when I see

multicodec/table.csv

Lines 4 to 5 in 352d05a

cidv2, cid, 0x02, draft, CIDv2

cidv3, cid, 0x03, draft, CIDv3

Those two are a special case of "reserved for future use", as far as I know, and there is no definite timeline on CIDv2 or CIDv3, I think the single-bytes are just being held in reserve for future generations. And in total transparency, much of what's in the multicodec table today, including many "final" registrations, particularly those things with a casual/light spec and no second implementation, would need to be summarily demoted to "vendor", "experimental", or "reserved" status if the table moved into IANA or W3C governance.

Reformatting it into IETF / W3C spec format is plausible, but as a one-man show I have to be pragmatic about where I spend time, and it seemed a lot more reasonable to budget that after the code is in the table as opposed to before.

Oh, of course, if I'm still engaged in this project when you get the running code and informal spec over the line, I can help with formalizing the spec, and if I'm not, hopefully someone at IPFS Foundation or Shipyard can help instead. Sprucing up a spec is a laborious chore you won't have to do alone if there is interest and other people are interested in using this alternative to conventional CIDv1s!

burdiyan · 2024-11-28T16:39:18Z

This new approach of hashing data structures seems pretty interesting. Actually, it's probably the closest thing to the original vision of IPLD out there. I wonder whether it can be made compatible with existing IPFS data though.

rvagg · 2024-11-29T04:54:50Z

Could we move this to the two-byte range, at least for now? I'm concerned that without a community of people developing this or some kind of institutional backing that it's not going to get enough traction to go anywhere beyond an interesting idea. I feel bad saying this, but PL's nucleation process has meant a lot of good ideas like this have had to be relegated to the personal experiments of individuals. Squatting 0x7 for something that's not baked in to someone's product or has an excited community of developers is hard to justify when the precious 1-byte range is the thing we're the most protective of in the table these days.
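For readers unfamiliar with why the one-byte range is scarce: multicodec codes are written as unsigned varints, so only values below 0x80 fit in a single byte. The sketch below is a generic unsigned-varint (LEB128-style) encoder written for illustration, not code taken from any multiformats library, and `encode_varint` is a hypothetical name.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint.

    Each output byte carries 7 bits of the value, least significant group
    first; the high bit is set on every byte except the last.
    """
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)         # final byte
            return bytes(out)


# Size of the prefix for codes at the range boundaries.
for code in (0x07, 0x7F, 0x80, 0x3FFF, 0x4000):
    print(hex(code), len(encode_varint(code)), "byte(s)")
```

Codes 0x00 through 0x7F take one byte and codes 0x80 through 0x3FFF take two, so moving out of the one-byte range costs one extra byte per reference.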

Gozala · 2024-11-30T00:29:13Z

This new approach of hashing data structures seems pretty interesting. Actually, it's probably the closest thing to the original vision of IPLD out there. I wonder whether it can be made compatible with existing IPFS data though.

There is an integration section in the repo that describes how compatibility with IPLD is currently managed.

I have used it successfully with various IPLD codecs that treat merkle references as IPLD links, a.k.a. CIDs. Happy to discuss this more, but let's do it in the linked repo instead to reduce noise here.

Gozala · 2024-11-30T00:40:48Z

Could we move this to the two-byte range, at least for now? I'm concerned that without a community of people developing this or some kind of institutional backing that it's not going to get enough traction to go anywhere beyond an interesting idea.

It is being used by a startup I'm currently employed by. I believe @hannahhoward was also looking at some point into using this for indexing blockchains, although I do not know if things have changed there.

I feel bad saying this, but PL's nucleation process has meant a lot of good ideas like this have had to be relegated to the personal experiments of individuals. Squatting 0x7 for something that's not baked in to someone's product or has an excited community of developers is hard to justify when the precious 1-byte range is the thing we're the most protective of in the table these days.

I understand the rationale, although at this point I don't find much value in pursuing that. In our current use we already treat merkle references as multihashes to avoid the 1-byte CID overhead, yet we still pay a 3-byte overhead.

If there is too little interest to justify a code allocation, that probably implies an equally low probability of collision, so I'd rather revisit this when and if interest / probability is higher.

Gozala · 2024-11-30T00:45:01Z

As a side note, it might be a good idea to switch to hierarchical tables like we did with varsig; that way domain-specific tables can exist without every single format having to compete for the canonical code.

In such a world merkle-reference would have been registered as an IPLD codec, which would have had a much smaller table.

burdiyan · 2024-12-04T15:05:34Z

As a side note, it might be a good idea to switch to hierarchical tables like we did with varsig; that way domain-specific tables can exist without every single format having to compete for the canonical code.

In such a world merkle-reference would have been registered as an IPLD codec, which would have had a much smaller table.

So agree with this. Not sure about hierarchical, but namespaces could be very useful.

bumblefudge · 2024-12-04T20:03:12Z

What's the status on varsig btw? see the discussion in #345

Gozala requested review from rvagg and vmx as code owners October 1, 2024 15:37

Gozala requested a review from alanshaw October 1, 2024 16:07

Propose merkle-reference multiformat

bafe3c9

Gozala force-pushed the merkle-reference branch from 44163fd to bafe3c9 Compare October 1, 2024 16:10

Gozala changed the title ~~Merkle Reference Code~~ Merkle reference multiformat Oct 1, 2024

Gozala requested a review from ribasushi October 1, 2024 16:26

bumblefudge mentioned this pull request Dec 6, 2024

Feedback darobin/dasl.ing#15

Open
