This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

bitcoin: add bitcoin docs (WIP) #270

Draft: wants to merge 2 commits into base: master

Member

@rvagg rvagg commented Jun 12, 2020

Not complete, but it's big enough, and tedious enough, that I just want to push something. If anyone feels like reviewing it as a WIP, feedback would be appreciated, but I've got a lot more to do to connect the pieces to IPLD. Will call for reviews when I think it's ~finished.

Contributor

@ribasushi ribasushi left a comment

Did a first pass over this. Exciting!


The Bitcoin format consistently uses a double-SHA2-256 hash to produce content digests. This algorithm is simply the SHA2-256 digest of a SHA2-256 digest of the raw bytes. These digests are also used publicly when referring to individual transactions and whole block graphs. The Bitcoin Core CLI as well as the many web-based block explorers allow data look-up by these addresses.

When publishing these addresses, they are typically presented as big-endian in hexadecimal. To represent these in byte form on a little-endian system, they therefore need to be reversed and the hexadecimal decoded.
Contributor

@ribasushi ribasushi Jun 12, 2020

Since endianness is usually defined over a multibyte integer type, I am for real not sure which type of "little endian" is meant here (and casual googling doesn't help). If I see the following 128-bit payload on disk:

  • `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff`

What is the actual value:
  1. 33 22 11 00 77 66 55 44 bb aa 99 88 ff ee dd cc
  2. 77 66 55 44 33 22 11 00 ff ee dd cc bb aa 99 88
  3. ff ee dd cc bb aa 99 88 77 66 55 44 33 22 11 00
  4. Something else?

Alternatively - if the on-disk structures are explicitly defined over >64bit integer types: this needs to be called out early, so folks like me get in the right mindset.

Member Author

Option 3: it's as if you read the whole thing as a single 32-byte unsigned integer, so the presented form is the complete byte-reversal of the little-endian bytes on disk. "Usually defined over a multibyte integer type" is what's being got at here, but the integer is the full 32 bytes, not some repeating sub-pattern.

The "as if" makes me think this is leaning too heavily on the "uint256" thing. I'm tempted to remove that language entirely and say it's just a byte string that, by convention, gets byte-reversed and hex-encoded when presented publicly.
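
To make option 3 concrete, here is a minimal Python sketch using the 16-byte payload from the question. It is illustrative only; the variable names are made up for this example and nothing here comes from the Bitcoin codebase.

```python
# The 16-byte example payload from the question, as stored on disk.
on_disk = bytes.fromhex("00112233445566778899aabbccddeeff")

# Read the whole payload as ONE little-endian unsigned integer...
as_int = int.from_bytes(on_disk, "little")

# ...then write it out in conventional big-endian hex notation.
presented = as_int.to_bytes(len(on_disk), "big").hex()

# The same result needs no integer at all: just reverse the bytes.
assert presented == on_disk[::-1].hex()

print(presented)  # ffeeddccbbaa99887766554433221100
```

Both routes give option 3, which is why treating the value as a plain byte string that gets reversed for display is equivalent to the "uint256" framing.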

Member Author

In truth, I never touch this "uint256" thing myself in any of my code. I treat all of these things as byte arrays and then reverse+hexadecimal whenever I need to present the value. Otherwise they're only useful as byte arrays. So I guess that fact in itself suggests backing out of this concept. It's really just window dressing to make the zeros go at the start of block addresses.
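
The byte-array convention described in this thread can be sketched roughly as follows, assuming Python's standard `hashlib`. The helper names `dbl_sha256`, `to_display` and `from_display` are hypothetical, chosen for this example, and the input is a placeholder rather than a real Bitcoin header.

```python
import hashlib


def dbl_sha256(raw: bytes) -> bytes:
    """SHA2-256 digest of the SHA2-256 digest of the raw bytes (32 bytes)."""
    return hashlib.sha256(hashlib.sha256(raw).digest()).digest()


def to_display(digest: bytes) -> str:
    """Byte-reverse, then hex-encode, as done when presenting publicly."""
    return digest[::-1].hex()


def from_display(text: str) -> bytes:
    """Inverse of to_display: hex-decode, then byte-reverse."""
    return bytes.fromhex(text)[::-1]


# Placeholder input, not real block data.
digest = dbl_sha256(b"example bytes, not a real Bitcoin header")
shown = to_display(digest)

# Round-tripping recovers the byte-array form used internally.
assert from_display(shown) == digest
```

Internally everything stays a byte array; the reversal only happens at the presentation boundary.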

block-layer/codecs/bitcoin.md

### Transactions

There is at least one transaction in a Bitcoin block graph. The first transaction is called the "coinbase" and represents the miner rewards. A block graph may contain _only_ a coinbase, or it may also contain a number of transactions representing the movement of coins between wallets. Each transaction contains a list of one or more "Transaction Ins" and a list of one or more "Transaction Outs" representing the flow of coins. The coinbase contains a single Transaction In containing the block reward, and its Transaction Outs list represents the destination of the rewards. Non-coinbase transactions contain Transaction Ins representing the source of the coins being transacted, linking to previous transactions, and a list of Transaction Outs containing the details of the destination wallets.
Contributor

> There is at least one transaction in a Bitcoin block graph.

Technically, past ~2140, when everyone working on this is dead, this may no longer be true ;P

Member Author

Do you mean that it's only true if people are transacting on Bitcoin and beyond ~2140 there may no longer be transactions? It's still going to be true as long as someone is mining Bitcoin because there's always a coinbase. There cannot exist a "bitcoin block graph" without at least one transaction!

I'm looking through Zcash right now and it's kind of sad how many coinbase-only transactions there are near the head. It makes it look like it now exists to be mined ...

}

type OutPoint struct {
hash Bytes # 256-bits
Contributor

This, together with the #int64 below, almost makes one want to say "ipld schema integers are of arbitrary precision", and leave it up to the codecs when to switch the wire-representation, and leave it to codecs when to use a language internal bigint and when to use a native integer.

This has probably been discussed already, so feel free to ignore with no further discussion.

* `version`: a signed 32-bit integer
* `segwit`: is implicit and `false` for all block graphs prior to the SegWit soft fork, which occurred at a height of 481,824. After this height, the two bytes following `version` are inspected: if they are equal to `[0x0, 0x1]`, the bytes are consumed and `segwit` is `true`. If the bytes are not exactly these values, `segwit` is `false` and the two bytes instead form the beginning of `vin`. (The first byte of `vin` is part of the compact size integer and, as `vin` must contain one or more elements, it cannot be `0x00`; this is what makes the `segwit` flag reliable while maintaining backward compatibility.)
* `vin`: one or more elements, prefixed by a compact size int, then, for each element up to the size:
  * `hash`: an unsigned 256-bit integer / a 32-byte binary string, the OutPoint transaction ID hash identifying the source transaction for the coins
Contributor

This goes together with the endianness discussion above: being an integer and a string at the same time can't be a thing.
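
For reference, the `version` / `segwit` / `vin` prefix described in the excerpt above could be read roughly like this. This is a sketch under stated assumptions, not the codec's implementation: the function names are hypothetical, little-endian integer encoding and the compact size layout (`0xfd`, `0xfe`, `0xff` prefixes for 16-, 32- and 64-bit values) are taken from Bitcoin's serialization conventions rather than from this document, and the block-height condition is ignored.

```python
import struct
from typing import Tuple


def read_compact_size(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a compact size int at pos; return (value, new_pos)."""
    first = buf[pos]
    if first < 0xFD:
        return first, pos + 1
    if first == 0xFD:
        return struct.unpack_from("<H", buf, pos + 1)[0], pos + 3
    if first == 0xFE:
        return struct.unpack_from("<I", buf, pos + 1)[0], pos + 5
    return struct.unpack_from("<Q", buf, pos + 1)[0], pos + 9


def read_tx_prefix(buf: bytes) -> Tuple[int, bool, int, int]:
    """Return (version, segwit, vin_count, position of the first vin element)."""
    # version: signed 32-bit integer (little-endian assumed).
    (version,) = struct.unpack_from("<i", buf, 0)
    pos = 4
    # segwit: the two bytes after version are consumed only if exactly 0x00 0x01.
    segwit = buf[pos:pos + 2] == b"\x00\x01"
    if segwit:
        pos += 2
    # vin: compact size count, which cannot be zero, so 0x00 here is unambiguous.
    vin_count, pos = read_compact_size(buf, pos)
    return version, segwit, vin_count, pos


print(read_tx_prefix(bytes.fromhex("0100000001")))      # (1, False, 1, 5)
print(read_tx_prefix(bytes.fromhex("02000000000102")))  # (2, True, 2, 7)
```

Each `vin` element would then begin with the 32-byte `hash`, read as a plain byte string.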

Member Author

rvagg commented Jun 15, 2020

Notes to self arising from discussion so far, for when I do revisions:

  • Probably remove the "uint256" concept, it's a confusing mess. In practice it's just a byte array that is presented to users as a byte-reversed hex string by convention
  • Notes about the various levels on the gradient of what it means to "decode" a bitcoin block and what you present at the data model - from strictly the elements you pull out of the binary format up to decoding all of the things including turning linkable things into CIDs and even decoding the script into its string format (or perhaps something more advanced?).
  • Work on something to clarify how ipldsch is being used—this applies broadly to our docs too, need better language to say "this defines a structure that could be conceived of as a data model thing, but we need more details as an adjunct to talk about specifics of binary representations", which gets tricky still because in these docs I'm even presenting different forms of data model things (see previous point)—the raw decoded pieces vs the more advanced version that can be presented to match the Bitcoin Core RPC (i.e. convention).

Contributor

mikeal commented Sep 17, 2020

what’s the status here?

i’d like to get something up on the specs website that i can link to

Member Author

rvagg commented Sep 18, 2020

status is that each time I sit down to attack this I'm overwhelmed by the size of the task to pull it together into a coherent form that covers everything that it needs to; but it does weigh on me that it's outstanding and I need to get it closed out along with js-multiformats reworks of the codec(s).

It's not in a worthy state to even merge as a draft tbh, so you're out of luck for now but I'll try and get to it asap.

@warpfork
Contributor

It would be really cool to merge this, even if we want to put some disclaimer texts in somewhere. This is way more and better information than we have on this topic anywhere else, as far as I know.

Member Author

rvagg commented Apr 21, 2021

Not merge-worthy IMO, it's so far from what it should have been. I think a better approach might be to start from the reverse end, like the Filecoin, and now Ethereum, data specs, and work backward. It turned out to be really hard to work forward like I was doing here.
tbh I'm not sure what to do with this, I don't see any time on the horizon for me to finish this out but it's one of those things that linger in the back of my head, along with the code that backs this work which is also out of date now with the ecosystem it sits within.
