Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for print_docstring()'s docstring.find(quote) Type error #502

Merged
merged 3 commits into from
Oct 17, 2024

Conversation

gdesmar
Copy link
Contributor

@gdesmar gdesmar commented Oct 8, 2024

Description

The print_docstring function assumes the docstring is going to be a string, but it could be a bytestring.

How to reproduce

Five files can be found in brokendocstring.zip:
brokenbytedocstring.py: Docstring with actual bytes in the docstring
brokenbytedocstring.pyc: Previous file compiled with python 2.7
brokendocstring.py: Docstring with backslash escaping for bytes in docstring
brokendocstring.pyc: Previous file compiled with python 2.7
brokendocstring.cpython-38.pyc: Previous file compiled with python 3.8

The python 3.8 file is given for completeness, but from my investigation, it drops the docstring at compilation time, so not a problem.
The brokenbytedocstring files are examples that are technically not executable with a call like python brokenbytedocstring.py, but starting a python2.7 interpreter, we can invoke the function using import:

>>> import brokenbytedocstring
>>> brokenbytedocstring.a_func()
>>>

Output Given

Using those python 2.7 bytecode from the zip file, the following error should be reproducible:

Traceback (most recent call last):
  [...]
  File ".../spark_parser/ast.py", line 117, in preorder
    self.preorder(kid)
  File ".../uncompyle6/semantics/pysource.py", line 436, in preorder
    super(SourceWalker, self).preorder(node)
  File ".../spark_parser/ast.py", line 112, in preorder
    self.default(node)
  File ".../uncompyle6/semantics/pysource.py", line 896, in default
    self.template_engine(table[key.kind], node)
  File ".../uncompyle6/semantics/pysource.py", line 790, in template_engine
    self.preorder(node[index])
  File ".../uncompyle6/semantics/pysource.py", line 436, in preorder
    super(SourceWalker, self).preorder(node)
  File ".../spark_parser/ast.py", line 110, in preorder
    func(node)
  File ".../uncompyle6/semantics/n_actions.py", line 997, in n_mkfunc
    self.make_function(node, is_lambda=False, code_node=code_node)
  File ".../uncompyle6/semantics/pysource.py", line 550, in make_function
    make_function2(self, node, is_lambda, nested, code_node)
  File ".../uncompyle6/semantics/make_function2.py", line 179, in make_function2
    print_docstring(self, indent, code.co_consts[0])
  File ".../uncompyle6/semantics/helper.py", line 160, in print_docstring
    if docstring.find(quote) >= 0:
       ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument should be integer or bytes-like object, not 'str'

Solution

By just converting the docstring to a string if it is a byte array, the rest of the code is successful.

if isinstance(docstring, bytes):
    docstring = docstring.decode("utf8", errors="backslashreplace")

I decided to use backslashreplace so that we could still see which bytes were in the original docstring, in case it becomes important for analysis.

Potential problem with the solution

The recovered docstring in the examples are different than the ones that were compiled, both in the string, and the prefix (b"" vs r""). I decided to offer a solution that would keep code change to a minimum while still giving a result that is very close to reality. The alternative would be to use two different functions to handle bytedocstrings and normaldocstrings, or to convert the docstring from string to bytes and do everything in bytes.

Tests

I tried to run the tests with make check and they succeeded. Just in case, I tried to make the print_docstring function return True as its first line, and none of the tests broke. I am therefore not certain if there are no tests for the docstring, or I did something wrong to test.

Real life example

5d9fe2735d4399d98e6e6a792b1feb26d6f2d9a5d77944ecacb4b4837e5e5fca (Photo.scr, exe)
-> 9cacd1265b06c6853d7a034686d38689dedfdb77a3021604f766f31876a26785 (ftpcrack.pyc)

@rocky
Copy link
Owner

rocky commented Oct 8, 2024

I need to investigate and think about this a some.

In something like:

def a_func():
    b"""Got \xe7\xfe Bytes?"""
    pass

Is the binary string a docstring or an unused string expression analogous to:

def a_func():
    5 # unused literal expression
    pass

Of course, uncompyle6 should not be throwing an error, whichever way it interprets this.

@gdesmar
Copy link
Contributor Author

gdesmar commented Oct 8, 2024

Indeed! I blindly assumed that it was, as I was in the print_docstring function.
A quick search does look to be bad news for my Pull Request. I'm very happy I did not try to do a more complex solution as described in the potential problem part. 😅

@rocky
Copy link
Owner

rocky commented Oct 8, 2024

@gdesmar If you are up for research, I would appreciate enlightenment on when something is a real docstring and when it is some kind of string expression across the various Python versions.

Consider this program:

"""Module docstring"""
class A:
    b"""Got \xe7\xfe Bytes?"""
    print(__doc__)

def a_func():
    """function docstring?"""
    print(__doc__)
print(__doc__)
a_func()
A()

When run in Python 3.12, it outputs:

Module docstring
Module docstring

If you remove the "b" in the class string it turns into docstring and when run it we get:

Got çþ Bytes?
Module docstring
Module docstring

@gdesmar
Copy link
Contributor Author

gdesmar commented Oct 10, 2024

Considering the following code (an extension of what you provided):

"""Module docstring"""
class A:
    b"""Got \xe7\xfe Bytes?"""
    print(__doc__)

    def class_func(self):
        b"""Got \xe7\xfe Bytes?"""
        print(__doc__)

class B:
    """Got no Bytes?"""
    print(__doc__)

    def class_func(self):
        """Got no Bytes?"""
        print(__doc__)

def single_func():
    """single docstring?"""
    print(__doc__)

def single_byte_func():
    b"""Got \xe7\xfe Bytes?"""
    print(__doc__)

print("\nStart of program:")
print("Module:")
print(__doc__)

print("\nSingle function:")
print(single_func.__doc__)
single_func()

print("\nSingle Byte function:")
print(single_byte_func.__doc__)
single_byte_func()

print("\nClass A:")
print(A.__doc__)
print(A.class_func.__doc__)
a = A()
print(a.class_func.__doc__)
a.class_func()

print("\nClass B:")
print(B.__doc__)
print(B.class_func.__doc__)
b = B()
print(b.class_func.__doc__)
b.class_func()

The behaviour is different based on the python version used.
I tested with 3.11.7, and it gave the same result as 3.8.19:

Module docstring
Got no Bytes?

Start of program:
Module:
Module docstring

Single function:
single docstring?
Module docstring

Single Byte function:
None
Module docstring

Class A:
None
None
None
Module docstring

Class B:
Got no Bytes?
Got no Bytes?
Got no Bytes?
Module docstring

This confirms the logic that the docstring cannot be a byte string in those versions. It also raises an interesting discussion on which docstring the variable __doc__ is pointing to, at different time during the execution of the code, but that it out of scope! I'm mostly looking at those entries where we are calling <something>.__doc__ directly.

I've also ran it with python 2.7.18:

Got  Bytes?
Got no Bytes?

Start of program:
Module:
Module docstring

Single function:
single docstring?
Module docstring

Single Byte function:
Got  Bytes?
Module docstring

Class A:
Got  Bytes?
Got  Bytes?
Got  Bytes?
Module docstring

Class B:
Got no Bytes?
Got no Bytes?
Got no Bytes?
Module docstring

So it looks like python 2.7 supports byte docstrings. That first line, piped into hexdump would show the expected values:
00000000 47 6f 74 20 e7 fe 20 42 79 74 65 73 3f 0a 47 6f |Got .. Bytes?.Go|

This difference also explain my initial statement of "python 3.8 is dropping the [byte] docstring", while it may simply be considering it an unused string expression, and optimized the bytecode.

Compiling the script, and trying to use uncompyle6 on them may confirm this for python 3.8 (edited to remove some empty lines and uninteresting prints at the end). It interestingly shows how the __doc__ variable it overwritten in the class B object. I wondered if the __doc__ = should be there, since it's not in the function, but I do not have a preference/opinion.

"""Module docstring"""
class A:
    print(__doc__)

    def class_func(self):
        print(__doc__)

class B:
    __doc__ = "Got no Bytes?"
    print(__doc__)

    def class_func(self):
        """Got no Bytes?"""
        print(__doc__)

def single_func():
    """single docstring?"""
    print(__doc__)

def single_byte_func():
    print(__doc__)

For python 2.7, it showed another "problem", as it errored out. I fixed it by making the function n_docstring() in n_action.py call the print_docstring function, as it was identical.

"""Module docstring"""
class A:
    r"""Got \xe7\xfe Bytes?"""
    print __doc__

    def class_func(self):
        r"""Got \xe7\xfe Bytes?"""
        print __doc__

class B:
    """Got no Bytes?"""
    print __doc__

    def class_func(self):
        """Got no Bytes?"""
        print __doc__

def single_func():
    """single docstring?"""
    print __doc__

def single_byte_func():
    r"""Got \xe7\xfe Bytes?"""
    print __doc__

On another note, if someone absolutely want a byte docstring in python 3.8, they can change their code to the following 😢 :

class A:
    __doc__ = b"""Got \xe7\xfe Bytes?"""
    print(__doc__)

Which will gives this, considering I didn't change the docstring for the function in class A:

Class A:
b'Got \xe7\xfe Bytes?'
None
None
Module docstring

I'll try to keep investigating, but I need to catch up with other things first. I hope this gives a bit more context. I could execute the test with every minor release (2.x, 3.x) to see if they differ from the two examples here. Hopefully it doesn't and each minor versions in a major version handles it the same way.

@rocky
Copy link
Owner

rocky commented Oct 10, 2024

Many thanks for the great investigative work and fix.

To make sure this great work is not lost or forgotten, would you turn your Python code into a self-checking test, by turning the print statements into assert statements?

Then this would go into test/sample/stmts. Then in test you can run add-test.py to byte-compile the program for representative bytecode that has distinctive behavior. e.g. 2.7 vs 3.8.
Again, thanks.

@gdesmar
Copy link
Contributor Author

gdesmar commented Oct 16, 2024

I converted the script to use asserts instead of print statements.

I tried to figure out how to add the tests to the repository, but would appreciate confirmations:
I created them under test/simple_source/stmts (instead of test/sample/stmts) and named them 16_bytestring_docstring.py and 16_no_bytestring_docstring.py. I wasn't sure of the number prefix convention.
I ran add_test.py, but it didn't look to do anything. I think it was mostly because the root .gitignore was ignoring the newly created .pyc files.
I did the steps manually, and realized that the optimize=2 for python>3.2 is making all docstring disappear from the compiled bytecode. I decided to generate the python 3.8 version with the default optimization.

I'm not completely confident, but I ran the make check under a python3.8 environment and it did not error out!

@rocky
Copy link
Owner

rocky commented Oct 16, 2024

I converted the script to use asserts instead of print statements.

Thanks!

I tried to figure out how to add the tests to the repository, but would appreciate confirmations: I created them under test/simple_source/stmts (instead of test/sample/stmts) and named them 16_bytestring_docstring.py and 16_no_bytestring_docstring.py. I wasn't sure of the number prefix convention.

The number prefix allows you to specify roughly where in the order of testing this test should be run. Simpler tests have lower number and more complicated tests have higher numbers. This allows me when creating a new version to find simple tests to run and then add the more complicated ones later.

I ran add_test.py, but it didn't look to do anything. I think it was mostly because the root .gitignore was ignoring the newly created .pyc files. I did the steps manually, and realized that the optimize=2 for python>3.2 is making all docstring disappear from the compiled bytecode. I decided to generate the python 3.8 version with the default optimization.

I'm not completely confident, but I ran the make check under a python3.8 environment and it did not error out!

Looking at now. Thanks again for doing this.

@rocky rocky merged commit 27c869b into rocky:master Oct 17, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants