Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: use BV class as python value for Cryptol bit sequences #116

Merged
merged 15 commits into from
Nov 13, 2020

Conversation

pnwamk
Copy link
Contributor

@pnwamk pnwamk commented Nov 10, 2020

Adds a python class BV to represent bitvectors / cryptol bit sequences and uses that for the canonical return value for bit sequences returned from the python cryptol API.

(The previous class we experimented with (BitVector) seemed optimized for use cases that did not line up well with our needs.)

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 10, 2020

This Cryptol branch has the corresponding submodule bump and tests updated to reflect these changes: https://github.com/GaloisInc/cryptol/tree/cryptol-bits-as-BV

Copy link
Contributor

@david-christiansen david-christiansen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run the code, but it looks generally good. There's some style stuff that I think we should follow up on, though.

python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
python/cryptol/bv.py Outdated Show resolved Hide resolved
@pnwamk pnwamk force-pushed the cryptol-bits-as-BV branch from 3cf029b to 96cc7eb Compare November 11, 2020 21:27
@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 11, 2020

This PR does not remove the previously added support for BitVector values. If we want to remove that dependency... we could do that in this PR or make it an issue and take care of it when this solution is merged and has some traction.

@weaversa
Copy link

This PR does not remove the previously added support for BitVector values. If we want to remove that dependency... we could do that in this PR or make it an issue and take care of it when this solution is merged and has some traction.

Is there support in the new “BV” module for moving to and from BitVector?

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 11, 2020

@weaversa Not currently, but there could be! Would that be useful?

@weaversa
Copy link

@weaversa Not currently, but there could be! Would that be useful?

Well, yes. If you’re planning to remove support for BitVector, it would be nice to have a ready way to merge the new stuff into existing code that uses BitVector already.

@pnwamk pnwamk force-pushed the cryptol-bits-as-BV branch from 0ee7dbd to a978288 Compare November 11, 2020 23:19
@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 12, 2020

@weaversa I think we've got functionality roughly equivalent to the prototypes you linked previously. If you see any omissions or have any other thoughts/comments I'm happy to incorporate them =)

- name: Build
run: |
cabal build all
- name: Cabal argo tests
run: cabal test argo
- name: Cabal file-echo-api tests
run: cabal test file-echo-api
- name: Python Cryptol unit tests
working-directory: ./python/cryptol
run: python3 -m unittest discover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final issue: can we set this up to run as part of cabal test? We already run Python scripts as part of our normal test suite, and having some tests require a different command to run than the others seems like a recipe for ongoing confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1539ba2 tweaks things so cabal test python will run the python unit test and discovery, and the success of any discovered python unit tests in the directory structure is a part of the existing Haskell test suite's success. Feels a little odd to have the haskell test framework invoking the python unit test framework... but I think it's fine for now and at least it's simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay... but the test are failing in a way I don't see locally and specifically running mypy on the files in question doesn't produce the error either... hmmmm....

@weaversa
Copy link

It would be nice for the cryptol.call function to accept strings as input. Otherwise, I've written this:

def stringToBVseq(string):
    return [BV(8, ord(element)) for character in string]

Also, BitVector accepts named parameters, so one can write BitVector(size = 10, intVal = 0xa). Would it be possible to add this feature to BV?

@weaversa
Copy link

This has already simplified my client code quite a bit by allowing me to pass results from cryptol.call to other cryptol.call invocations.

@david-christiansen
Copy link
Contributor

RE strings as arguments: we can make something like that work, but I don't think it fits well with Python's design. Python is pretty careful to keep a distinction between Unicode strings, and their realizations under various encodings as byte arrays. We could say "It's all UTF-8", but that seems a bit risky vs asking clients to call encode on it first.

@david-christiansen
Copy link
Contributor

👍 on keyword arguments to the constructor :-)

@weaversa
Copy link

weaversa commented Nov 12, 2020

RE strings as arguments: we can make something like that work, but I don't think it fits well with Python's design. Python is pretty careful to keep a distinction between Unicode strings, and their realizations under various encodings as byte arrays. We could say "It's all UTF-8", but that seems a bit risky vs asking clients to call encode on it first.

I see. I don't know Python all that well. What is a good way to handle passing strings? I'm fine with enforcing encode, and I'm also fine using the helper function I wrote, but perhaps something similar could be supported in the package?

So, am I correct to think we could have something like:

cryptol.call("myFunc", cryptol.StringToBVseq("cryptol")) # maybe this is argo.StringToBVseq, I don't know...also StringToBVseq is a terrible name

or

cryptol.call("myFunc, "cryptol".encode())

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 12, 2020

As of e6e76ed you should be able to use size and value keyword arguments, i.e. BV(8,0xff) == BV(value=255,size=8)

@david-christiansen
Copy link
Contributor

You're right (it's number 2).

Python strings can be converted to bytes by calling their .encode() method. It defaults to UTF-8, but an optional keyword argument lets you select other encodings. Example:

>>> c.call('Id::id', "UTF-8-kodning gør strenge længere".encode()).result()
b'UTF-8-kodning g\xc3\xb8r strenge l\xc3\xa6ngere'

(pardon my Danish - had to find something with a non-trivial representation in UTF-8)

@weaversa
Copy link

Ok! So that’s a fine solution. Thanks for showing me .encode()

@david-christiansen
Copy link
Contributor

david-christiansen commented Nov 13, 2020

I'm hesitant to have it call encode on strings by default, but certainly open to the possibility. The argument for not doing so is two-fold:

  1. It matters which encoding is selected - 16-bit Unicode is popular in Javascript and in Windows APIs, 8-bit is popular in the rest of the world, and 32-bit arises from time to time. Arbitrarily picking one of these can be a source of bugs.

  2. In recognition of point 1 above, Python 3 explicitly separated the string and bytes types, adding this encoding/decoding step. There's since been a culture of taking the difference between strings that represent text as Unicode and byte arrays that represent things like input buffers from disk seriously, and silently coercing between them is likely to be unexpected by Python programmers. We could probably add a better error message that points users at encode(), at least.

The best argument for doing a silent coercion is that Cryptol does one:

Cryptol> "UTF-8-kodning gør strenge længere"
[0x55, 0x54, 0x46, 0x2d, 0x38, 0x2d, 0x6b, 0x6f, 0x64, 0x6e, 0x69,
 0x6e, 0x67, 0x20, 0x67, 0xf8, 0x72, 0x20, 0x73, 0x74, 0x72, 0x65,
 0x6e, 0x67, 0x65, 0x20, 0x6c, 0xe6, 0x6e, 0x67, 0x65, 0x72, 0x65]

Unfortunately for us, most of the world these days expects silent coercions to be done in UTF-8, but Cryptol seems to be doing ISO-8859-1. This makes me think that it's in users' interest to be careful here.

@david-christiansen
Copy link
Contributor

Cryptol gives some wonderful error messages when it's provided with Unicode that isn't representable in ISO-8858-1!

Cryptol> "Œ"

[error] at <interactive>:1:1--1:4:
  Unsolvable constraints:
    • 8 >= 9
        arising from
        use of literal or demoted expression
        at <interactive>:1:1--1:4
    • Reason: It is not the case that 8 >= 9
Cryptol> "ẞ"

[error] at <interactive>:1:1--1:4:
  Unsolvable constraints:
    • 8 >= 13
        arising from
        use of literal or demoted expression
        at <interactive>:1:1--1:4
    • Reason: It is not the case that 8 >= 13

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 13, 2020

If this is common enough to warrant a package function, would something like this fit the bill?

    @staticmethod
    def byte_seq(source : Union[str, bytes], encoding : str = 'UTF-8') -> List['BV']:
        """Convert ``source`` into a sequence of 8-bit ``BV``s.
        
        If ``source`` is a string it is treated as UTF-8 encoded bytes by default. 
        Other encodings can be selected via the optional ``encoding`` keyword.
        See python's ``str.encode()`` function for supported encodings.
        """
        if isinstance(source, str):
            return [BV(8, c) for c in source.encode(encoding=encoding)]
        elif isinstance(source, bytes):
            return [BV(8, c) for c in source]
        else: # type: ignore
            raise ValueError(f'Cannot convert {source!r} into a byte sequence.')

Which could be called as follows:

$ python3
>>> from cryptol.bitvector import BV
>>> BV.byte_seq("12345")
[BV(8, 0x31), BV(8, 0x32), BV(8, 0x33), BV(8, 0x34), BV(8, 0x35)]
>>> BV.byte_seq("12345", encoding='ASCII')
[BV(8, 0x31), BV(8, 0x32), BV(8, 0x33), BV(8, 0x34), BV(8, 0x35)]
>>> 

@weaversa
Copy link

weaversa commented Nov 13, 2020

You know, that is interesting. Shouldn’t the call function interpret strings the same way Cryptol does?

@david-christiansen
Copy link
Contributor

You know, that is interesting. Shouldn’t the call function should interpret strings the same way Cryptol does?

Here, we get into a conflict between the default Python interpretation (UTF-8) and the default Cryptol one (ISO-8859-1). Cryptol string notation is really a convenience notation for finite sequences of bytes, rather than representing actual strings, and it seems that the fact that it allows non-ASCII characters may even be a bug.

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 13, 2020

The magical green CI checkmark is back! @david-christiansen should we consider merging this now and continue the other features/etc separately? Or delay?

@david-christiansen
Copy link
Contributor

I think we should go ahead and get this thing merged.

@pnwamk pnwamk merged commit ca97f5a into master Nov 13, 2020
@pnwamk pnwamk deleted the cryptol-bits-as-BV branch November 13, 2020 00:43
@weaversa
Copy link

You know, that is interesting. Shouldn’t the call function should interpret strings the same way Cryptol does?

Here, we get into a conflict between the default Python interpretation (UTF-8) and the default Cryptol one (ISO-8859-1). Cryptol string notation is really a convenience notation for finite sequences of bytes, rather than representing actual strings, and it seems that the fact that it allows non-ASCII characters may even be a bug.

Would you mind if we continued this as a Cryptol issue? More folks over there.

@david-christiansen
Copy link
Contributor

I brought it up in our internal chat and it's being discussed quite a lot! But yes, I think that's where it's most appropriate.

@david-christiansen
Copy link
Contributor

GaloisInc/cryptol#863 seems to be the place where it is most natural.

@pnwamk
Copy link
Contributor Author

pnwamk commented Nov 13, 2020

I made this issue to summarize the general problem and allow us to include Python-specific discussion if needed without mucking up the pure-cryptol questions: #119

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants