Does SOK support inline data? #9

Open
thaddywu opened this issue Aug 2, 2021 · 5 comments

thaddywu commented Aug 2, 2021

Hi, I'm curious whether SOK can handle inline data.

Though gcc and clang won't place any jump tables or constants in .text, there are inevitably cases in real-world projects where data and code are interleaved in the .text section. I tried embedding data in the gaps between instructions using inline assembly. The result is that SOK misidentifies those inline data bytes (from 0x40055f to 0x4005a7) as instructions. For the attached program, compiled with gcc -O0, SOK even throws an error. The root cause is that SOK treats data bytes as instructions.

For your convenience, I have posted the source code here. The log file and the executable are in the attachments.

#include <stdio.h>
#include <stdlib.h>

int func() {
    int filter;
    asm volatile(
        "  leaq _filter(%%rip), %%rax\n\t"
        "  jmp _out\n\t"
        ".global _filter\n"
        ".type _filter,@object\n"
        "_filter:\n\t"
        ".ascii \""
        "\\040\\000\\000\\000\\000\\000\\000\\000"  // 0. BPF_STMT
        "\\025\\000\\000\\005\\015\\000\\000\\000"  // 1. BPF_JUMP
        "\\040\\000\\000\\000\\020\\000\\000\\000"  // 2. BPF_STMT
        "\\025\\000\\004\\000\\005\\000\\000\\000"  // 3. BPF_JUMP
        "\\025\\000\\003\\000\\012\\000\\000\\000"  // 4. BPF_JUMP
        "\\025\\000\\002\\000\\013\\000\\000\\000"  // 5. BPF_JUMP
        "\\025\\000\\001\\000\\004\\000\\000\\000"  // 6. BPF_JUMP
        "\\006\\000\\000\\000\\000\\000\\377\\177"  // 7. BPF_STMT
        "\\006\\000\\000\\000\\000\\000\\005\\000"  // 8. BPF_STMT
        "\"\n\t"
        "_out:"
        : "=rax"(filter)
        :
        :);
    return filter;
}
int main() {
    printf("%d", func());
    return 0;
}

Even leaving the former problem aside, there may be potential problems in how overlapping instructions are handled.

Traceback (most recent call last):
  File "./extract_gt/extractBB.py", line 1213, in <module>
    dumpGroundTruth(essInfo, module, outFile, options.binary, options.split)
  File "./extract_gt/extractBB.py", line 804, in dumpGroundTruth
    handleNotIncludedBB(pbModule)
  File "./extract_gt/extractBB.py", line 970, in handleNotIncludedBB
    addedBB2.size = bb.instructions[0].va + bb.instructions[0].size - overlapping_target
ValueError: Value out of range: -5
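
One possible reading of the -5, offered as a guess rather than a diagnosis: the expression in the last frame computes the bytes remaining from overlapping_target to the end of the block's first instruction, and it goes negative when the target lies past that instruction's end, which could happen once data bytes have been decoded as instructions. The Python sketch below shows that arithmetic with an explicit range check. The helper name tail_size and the addresses are hypothetical; this is not code from extractBB.py.

```python
def tail_size(insn_va: int, insn_size: int, overlapping_target: int) -> int:
    """Bytes from overlapping_target to the end of the instruction.

    Hypothetical helper: it mirrors the arithmetic in the traceback but
    rejects targets that do not fall strictly inside the instruction.
    """
    insn_end = insn_va + insn_size
    if not (insn_va < overlapping_target < insn_end):
        raise ValueError(
            f"target {overlapping_target:#x} is not inside "
            f"[{insn_va:#x}, {insn_end:#x}); raw size would be "
            f"{insn_end - overlapping_target}"
        )
    return insn_end - overlapping_target


# Made-up addresses: a 2-byte instruction at 0x40055f.
print(tail_size(0x40055F, 2, 0x400560))  # target inside the instruction
try:
    tail_size(0x40055F, 2, 0x400566)     # target 5 bytes past the end
except ValueError as err:
    print("rejected:", err)
```

A check like this would turn the out-of-range assignment into an explicit diagnostic naming the offending target.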

No matter what, thanks so much for your amazing work!

Collaborator

bin2415 commented Aug 3, 2021

Hi, assembly code is a problem for our tool when collecting ground truth, because compilers do not have basic block information for it. There are two categories of assembly code: (1) assembly files and (2) inline assembly in C files. Our solution is to wrap these regions with specific labels and to perform recursive disassembly along the control flow to identify the code and data regions inside them.

For this example, the assembly generated for the inline assembly region is:

        .bbInfo_INLINEB
#APP
# 6 "test.c" 1
          leaq _filter(%rip), %rax
          jmp _out
        .global _filter
.type _filter,@object
_filter:
        .ascii "\040\000\000\000\000\000\000\000\025\000\000\005\015\000\000\000\040\000\000\000\020\000\000\000\025\000\004\000\005\000\000\000\025\000\003\000\012\000\000\000\025\000\002\000\013\000\000\000\025\000\001\000\004\000\000\000\006\000\000\000\000\000\377\177\006\000\000\000\000\000\005\000"
        _out:
# 0 "" 2
#NO_APP
        .bbInfo_INLINEE

We use .bbInfo_INLINEB and .bbInfo_INLINEE to mark the start and end of the assembly region, and we try to disassemble recursively to identify the code and data regions within it. It seems there is a bug in how we handle this region. Thanks for reporting!
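
To illustrate the idea of recursive disassembly over a marked region, here is a minimal Python sketch. It is not SOK's implementation. `decode` is a hypothetical toy decoder that knows only a handful of x86-64 encodings, and the byte layout is an assumption: it supposes the assembler emitted a 7-byte `leaq`, a short `jmp` (`eb 48`) over the 72 filter bytes, and that a `ret` sits at `_out`.

```python
import struct

def decode(code: bytes, off: int):
    """Toy decoder: return (length, successor offsets) or None if unknown."""
    op = code[off]
    if code[off:off + 3] == b"\x48\x8d\x05" and off + 7 <= len(code):
        return 7, [off + 7]                      # lea disp32(%rip), %rax
    if op == 0xEB and off + 2 <= len(code):      # jmp rel8: no fall-through
        rel = struct.unpack_from("<b", code, off + 1)[0]
        return 2, [off + 2 + rel]
    if op == 0xE9 and off + 5 <= len(code):      # jmp rel32: no fall-through
        rel = struct.unpack_from("<i", code, off + 1)[0]
        return 5, [off + 5 + rel]
    if op == 0xC3:                               # ret: no successors
        return 1, []
    if op == 0x90:                               # nop
        return 1, [off + 1]
    return None

def recursive_disassemble(code: bytes, entry: int = 0) -> set:
    """Follow control flow from entry; return the offsets covered by code."""
    code_bytes, seen, worklist = set(), set(), [entry]
    while worklist:
        off = worklist.pop()
        if off in seen or not (0 <= off < len(code)):
            continue
        seen.add(off)
        decoded = decode(code, off)
        if decoded is None:
            continue
        length, successors = decoded
        code_bytes.update(range(off, off + length))
        worklist.extend(successors)
    return code_bytes

def data_ranges(code: bytes, code_bytes: set) -> list:
    """Half-open [start, end) ranges that were never reached as code."""
    ranges, start = [], None
    for i in range(len(code) + 1):
        is_data = i < len(code) and i not in code_bytes
        if is_data and start is None:
            start = i
        elif not is_data and start is not None:
            ranges.append((start, i))
            start = None
    return ranges

# The nine 8-byte BPF entries from the .ascii directive above.
FILTER = bytes.fromhex(
    "2000000000000000" "150000050d000000" "2000000010000000"
    "1500040005000000" "150003000a000000" "150002000b000000"
    "1500010004000000" "060000000000ff7f" "0600000000000500"
)
# Assumed layout: lea (7 bytes), jmp +0x48 (2 bytes), filter, ret at _out.
region = b"\x48\x8d\x05\x02\x00\x00\x00" + b"\xeb\x48" + FILTER + b"\xc3"

print(data_ranges(region, recursive_disassemble(region)))  # [(9, 81)]
```

Because the traversal only follows control-flow successors, the `leaq` reference to `_filter` does not make the filter bytes reachable, so they are reported as data.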


ZhangZhuoSJTU commented Aug 3, 2021

Hi @bin2415 , thanks for your prompt reply. I am kind of curious why we need to use recursive disassembly to distinguish the code and data? Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte. Would it be easier to leverage such labels to identify the data/code regions? Please kindly correct me if I am wrong.

I do agree that we need recursive disassembly to get the basic block information, by the way 😆

Collaborator

bin2415 commented Aug 3, 2021

Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte

Hi @ZhangZhuoSJTU, that is a good observation, and most cases follow this rule. But as far as I know, there are some corner cases that do not.

For example, here (link1, link2) are cases where .byte directives encode specific instruction(s). Similar cases also exist in glibc.


ZhangZhuoSJTU commented Aug 3, 2021

I see. I guess it means if we follow the rule, we would get a sound result for data identification (i.e., w/o false negative but w/ false positive).

So I am wondering whether we can first follow the rule to get a superset of such inline-assembly data (i.e., the regions following .byte/.ascii/... and between .bbInfo_INLINEB and .bbInfo_INLINEE), and then use linear disassembly to rule out possible instructions (i.e., only a valid basic block occupying the whole data region can be regarded as instructions, and maybe more strong heuristics can be used here like only padding or ud2 is accepted).

I prefer linear disassembly to recursive disassembly here. My observation is that the specific instruction(s) encoded with .byte should be simple and should not contain control-flow transfers (otherwise it would be unreasonable to hardcode them as bytes).
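
A minimal Python sketch of this two-step idea follows, under stated assumptions. It swaps a real linear disassembler for a small whitelist of encodings (`ud2`, `nop`, and `rep ret`), only inspects `.byte` lines, and treats every other data directive as data. The names `WHITELIST`, `parse_byte_directive`, `tiles_with_whitelist`, and `classify_inline_region` are hypothetical and are not part of SOK.

```python
import re

DATA_DIRECTIVES = (".byte", ".ascii", ".asciz", ".string",
                   ".word", ".long", ".quad")

# Encodings accepted as "instructions hardcoded as bytes" in this sketch.
WHITELIST = {
    b"\x0f\x0b": "ud2",
    b"\xf3\xc3": "rep ret",
    b"\x90": "nop",
}

def parse_byte_directive(line: str):
    """Return the bytes of a `.byte a, b, ...` line, or None otherwise."""
    match = re.match(r"\s*\.byte\s+(.+)$", line.split("#", 1)[0].rstrip())
    if not match:
        return None
    try:
        return bytes(int(tok.strip(), 0) for tok in match.group(1).split(","))
    except ValueError:      # symbolic operand or value out of byte range
        return None

def tiles_with_whitelist(blob: bytes) -> bool:
    """True if blob is exactly a sequence of whitelisted encodings."""
    i = 0
    while i < len(blob):
        for encoding in WHITELIST:
            if blob.startswith(encoding, i):
                i += len(encoding)
                break
        else:
            return False
    return len(blob) > 0

def classify_inline_region(asm_lines):
    """Label each data directive between the markers as 'code' or 'data'."""
    in_region, result = False, []
    for line in asm_lines:
        stripped = line.strip()
        if stripped == ".bbInfo_INLINEB":
            in_region = True
        elif stripped == ".bbInfo_INLINEE":
            in_region = False
        elif in_region and stripped.startswith(DATA_DIRECTIVES):
            blob = parse_byte_directive(stripped)
            is_code = blob is not None and tiles_with_whitelist(blob)
            result.append((stripped, "code" if is_code else "data"))
    return result

example = [
    ".bbInfo_INLINEB",
    "  jmp _out",
    "  .byte 0xf3, 0xc3",
    '  .ascii "\\040\\000"',
    "  .byte 0x20, 0x00",
    ".bbInfo_INLINEE",
    "  .byte 0x90",          # outside the markers, so ignored
]
for directive, kind in classify_inline_region(example):
    print(kind, directive)
```

Starting from "everything after a data directive is data" and only promoting regions that match the whitelist keeps the result conservative: anything the whitelist does not fully cover stays classified as data.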

Collaborator

bin2415 commented Aug 3, 2021

I see. I guess it means if we follow the rule, we would get a sound result for data identification (i.e., w/o false negative but w/ false positive).

I agree with that.

only a valid basic block occupying the whole data region can be regarded as instructions, and maybe more strong heuristics can be used here like only padding or ud2 is accepted

This should work. By the way, rep ret is often written as .byte xxxxxxx in some programs.
