   08 February, 2019

Python for Reverse Engineering

Building your own disassembly tooling for -- that's right -- fun and profit

   While solving complex reversing challenges, we often use established
   tools like radare2 or IDA for disassembling and debugging. But there
   are times when you need to dig in a little deeper and understand how
   things work under the hood.

   Rolling your own disassembly scripts can be immensely helpful when it
   comes to automating certain processes, and eventually building your own
   homebrew reversing toolchain of sorts. At least, that's what I'm
   attempting anyway.

Setup

   As the title suggests, you're going to need a Python 3 interpreter
   before anything else. Once you've confirmed beyond reasonable doubt
   that you do, in fact, have a Python 3 interpreter installed on your
   system, run
$ pip install capstone pyelftools

   where capstone is the disassembly engine we'll be scripting with, and
   pyelftools helps us parse ELF files.

   With that out of the way, let's start with an example of a basic
   reversing challenge.
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
   char *pw = malloc(10);       // 9 characters + terminating '\0'
   pw[0] = 'a';
   for(int i = 1; i <= 8; i++){
       pw[i] = pw[i - 1] + 1;
   }
   pw[9] = '\0';
   char *in = malloc(10);
   printf("password: ");
   fgets(in, 10, stdin);        // 'abcdefghi'
   if(strcmp(in, pw) == 0) {
       printf("haha yes!\n");
   }
   else {
       printf("nah dude\n");
   }
}

   Compile it with GCC/Clang:
$ gcc chall.c -o chall.elf

Scripting

   For starters, let's look at the different sections present in the
   binary.
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)

   This script iterates through all the sections and prints the address
   each one is loaded at. This will be pretty useful later. Running it
   gives us
> python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab

   Most of these aren't relevant to us, but a few sections here are to be
   noted. The .text section contains the instructions (opcodes) that we're
   after. The .rodata section holds string literals and other read-only
   constants, while .data holds initialized, writable globals. Finally,
   there's the .plt, the Procedure Linkage Table, and the .got, the Global
   Offset Table. If you're unsure about what these mean, read up on the
   ELF format and its internals.

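   Since the list is sorted by address, it doubles as a crude map for
   answering "which section does this address live in?". Here's a minimal
   sketch of that lookup. The table is copied from the output above, the
   helper name section_for is made up for illustration, and it assumes
   each section runs right up to the start of the next one (the script
   didn't print sizes, so take the answer as a hint, not a guarantee).

```python
from bisect import bisect_right

# (start address, name) pairs copied from the sections.py output.
# Sections with address 0x0 are not loaded, so they are left out.
SECTIONS = [
    (0x238, '.interp'), (0x254, '.note.ABI-tag'),
    (0x274, '.note.gnu.build-id'), (0x298, '.gnu.hash'),
    (0x2c0, '.dynsym'), (0x3e0, '.dynstr'), (0x484, '.gnu.version'),
    (0x4a0, '.gnu.version_r'), (0x4c0, '.rela.dyn'), (0x598, '.rela.plt'),
    (0x610, '.init'), (0x630, '.plt'), (0x690, '.plt.got'),
    (0x6a0, '.text'), (0x8f4, '.fini'), (0x900, '.rodata'),
    (0x924, '.eh_frame_hdr'), (0x960, '.eh_frame'),
    (0x200d98, '.init_array'), (0x200da0, '.fini_array'),
    (0x200da8, '.dynamic'), (0x200f98, '.got'),
    (0x201000, '.data'), (0x201010, '.bss'),
]
STARTS = [start for start, _ in SECTIONS]

def section_for(addr):
    """Return the name of the last section starting at or before addr."""
    idx = bisect_right(STARTS, addr) - 1
    return SECTIONS[idx][1] if idx >= 0 else None

for addr in (0x6c4, 0x910, 0x200fe0):
    print(hex(addr), section_for(addr))
```

   With pyelftools you could build the same table from section['sh_addr']
   and section['sh_size'] and drop the contiguity assumption entirely.
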
   Since we know that the .text section has the opcodes, let's disassemble
   the binary starting at that address.
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import *

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()               # raw bytes of the section
    addr = code['sh_addr']          # address of the first byte
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')

   The code is fairly straightforward (I think). Running it, we should
   see this:
> python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
**0x6c4: call qword ptr [rip + 0x200916]**
0x6ca: hlt
... snip ...

   The line in bold is fairly interesting to us. rip always holds the
   address of the next instruction, so [rip + 0x200916] is equivalent to
   [0x6ca + 0x200916], which in turn evaluates to [0x200fe0]. The first
   call being made is to a function whose address is stored at 0x200fe0?
   What could this function be?

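   That bit of arithmetic is easy to script. Below is a rough sketch, not
   something from the original scripts: the helper name
   rip_relative_target is hypothetical, and the instruction bytes are
   written out by hand from the usual encoding of this call (ff 15
   followed by a 32-bit displacement) rather than dumped from chall.elf.
   It also assumes the displacement is the last four bytes of the
   instruction, which holds when there's no trailing immediate.

```python
import struct

def rip_relative_target(insn_addr, insn_bytes):
    """Resolve [rip + disp32] for an instruction ending in its disp32."""
    disp = struct.unpack('<i', insn_bytes[-4:])[0]  # signed, little-endian
    next_insn = insn_addr + len(insn_bytes)         # rip = next instruction
    return next_insn + disp

# call qword ptr [rip + 0x200916], located at 0x6c4 (hand-written bytes)
call = bytes.fromhex('ff1516092000')
print(hex(rip_relative_target(0x6c4, call)))
```

   In a capstone loop you wouldn't need the raw bytes at all: i.address +
   i.size gives you the address of the next instruction.
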
   For this, we will have to look at relocations. Quoting [1]linuxbase.org

     Relocation is the process of connecting symbolic references with
     symbolic definitions. For example, when a program calls a function,
     the associated call instruction must transfer control to the proper
     destination address at execution. Relocatable files must have
     "relocation entries" which are necessary because they contain
     information that describes how to modify their section contents,
     thus allowing executable and shared object files to hold the right
     information for a process's program image.

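   To try and find these relocation entries, we write a third script.
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table this section uses
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
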
170            symbol_table = e.get_section(section['sh_link'])
171            for relocation in section.iter_relocations():
172                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
173                addr = hex(relocation['r_offset'])
174                print(f'{symbol.name} {addr}')
175
176   Let's run through this code real quick. We first loop through the
177   sections, and check if it's of the type RelocationSection. We then
178   iterate through the relocations from the symbol table for each section.
179   Finally, running this gives us
180> python relocations.py
181.rela.dyn:
182 0x200d98
183 0x200da0
184 0x201008
185_ITM_deregisterTMCloneTable 0x200fd8
186**__libc_start_main 0x200fe0**
187__gmon_start__ 0x200fe8
188_ITM_registerTMCloneTable 0x200ff0
189__cxa_finalize 0x200ff8
190stdin 0x201010
191.rela.plt:
192puts 0x200fb0
193printf 0x200fb8
194fgets 0x200fc0
195strcmp 0x200fc8
196malloc 0x200fd0
197
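   This output is effectively a lookup table from relocated address to
   symbol name, which is exactly what we need to put names on indirect
   calls. A small sketch of that idea follows. The addresses are copied
   from the output above, while the names RELOC_TARGETS and symbol_at are
   made up for this example.

```python
# Relocated address -> symbol name, copied from the relocations.py output.
RELOC_TARGETS = {
    0x200fb0: 'puts',
    0x200fb8: 'printf',
    0x200fc0: 'fgets',
    0x200fc8: 'strcmp',
    0x200fd0: 'malloc',
    0x200fd8: '_ITM_deregisterTMCloneTable',
    0x200fe0: '__libc_start_main',
    0x200fe8: '__gmon_start__',
    0x200ff0: '_ITM_registerTMCloneTable',
    0x200ff8: '__cxa_finalize',
    0x201010: 'stdin',
}

def symbol_at(addr):
    """Name the symbol relocated at addr, or fall back to the raw address."""
    return RELOC_TARGETS.get(addr, hex(addr))

# The indirect call at 0x6c4 reads its target from 0x6ca + 0x200916.
slot = 0x6ca + 0x200916
print(f'call at 0x6c4 -> {symbol_at(slot)}')
```

   In a real tool you'd fill the dictionary from iter_relocations()
   instead of typing it in by hand.
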
   Remember the function call at 0x200fe0 from earlier? Yep, so that was a
   call to the well known __libc_start_main. Again, according to
   [2]linuxbase.org

     The __libc_start_main() function shall perform any necessary
     initialization of the execution environment, call the main function
     with appropriate arguments, and handle the return from main(). If
     the main() function returns, the return value shall be passed to the
     exit() function.

   And its definition is like so
int __libc_start_main(int *(main) (int, char * *, char * *),
                      int argc, char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void (* stack_end));

   Looking back at our disassembly
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
**0x6bd: lea rdi, [rip + 0xe6]**
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...

   but this time, at the lea or Load Effective Address instruction, which
   loads the address rip + 0xe6 into the rdi register. With rip pointing
   at the next instruction (0x6c4), that evaluates to 0x7aa, which happens
   to be the address of our main() function! How do I know that? Because
   rdi carries the first argument, and the first parameter of
   __libc_start_main() is main. After doing whatever it does, it
   eventually calls the function it was handed in rdi.

   To see the disassembly of main, seek to 0x7aa in the output of the
   script we'd written earlier (disas1.py).

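   The same trick works for the other two lea instructions. On x86-64
   Linux (the System V AMD64 calling convention), the first six integer
   or pointer arguments travel in rdi, rsi, rdx, rcx, r8 and r9, in that
   order. Here's a rough sketch that lines those registers up with the
   parameter names from the definition quoted earlier and resolves the
   three lea targets. The table names are mine, and the "next
   instruction" addresses are read off the disassembly by hand.

```python
# System V AMD64: the first six integer/pointer arguments, in order.
ARG_REGS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']
# Parameter names from the __libc_start_main definition.
# The seventh parameter, stack_end, is passed on the stack instead.
PARAMS = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini']

# register -> (address of the instruction after the lea, displacement)
LEAS = {'r8': (0x6b6, 0x23a), 'rcx': (0x6bd, 0x1c3), 'rdi': (0x6c4, 0xe6)}
resolved = {reg: nxt + disp for reg, (nxt, disp) in LEAS.items()}

for reg, param in zip(ARG_REGS, PARAMS):
    if reg in resolved:
        print(f'{reg:>3} = {param:<9} -> {hex(resolved[reg])}')
    else:
        print(f'{reg:>3} = {param}')
```

   So besides main in rdi, the init and fini callbacks in rcx and r8 are
   just two more addresses inside .text that you can seek to.
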
   From what we discovered earlier, each call instruction points to some
   function, which we can identify from the relocation entries. The calls
   in main land on stubs in the .plt, and each stub jumps through the GOT
   slot listed under .rela.plt. So following each call into its
   relocation gives us this
puts   0x640
printf 0x650
fgets  0x660
strcmp 0x670
malloc 0x680

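   If you'd rather not chase every stub by hand, the addresses can be
   predicted. This is a sketch under an assumption, not a general rule:
   it relies on the classic x86-64 PLT layout, where the first 16-byte
   entry is the resolver stub and every entry after it corresponds to one
   .rela.plt relocation, in order. Toolchains that emit a separate
   .plt.sec lay things out differently. The constant names are mine.

```python
PLT_BASE = 0x630       # .plt address from the sections.py output
PLT_ENTRY_SIZE = 16    # ASSUMPTION: classic 16-byte PLT entries

# .rela.plt symbols in the order relocations.py printed them
RELA_PLT = ['puts', 'printf', 'fgets', 'strcmp', 'malloc']

# Entry 0 is the resolver stub, so the first real stub is one entry in.
plt_stubs = {
    PLT_BASE + PLT_ENTRY_SIZE * (index + 1): name
    for index, name in enumerate(RELA_PLT)
}

for addr, name in plt_stubs.items():
    print(f'{name:<6} {hex(addr)}')
```

   For this binary the prediction agrees with the addresses found by
   following the calls.
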
   Putting all this together, things start falling into place. Let me
   highlight the key sections of the disassembly here. It's pretty
   self-explanatory.
0x7b2: mov edi, 0xa  ; 10
0x7b7: call 0x680    ; malloc

   The loop to populate the *pw string
0x7d0:  mov     eax, dword ptr [rbp - 0x14]
0x7d3:  cdqe
0x7d5:  lea     rdx, [rax - 1]
0x7d9:  mov     rax, qword ptr [rbp - 0x10]
0x7dd:  add     rax, rdx
0x7e0:  movzx   eax, byte ptr [rax]
0x7e3:  lea     ecx, [rax + 1]
0x7e6:  mov     eax, dword ptr [rbp - 0x14]
0x7e9:  movsxd  rdx, eax
0x7ec:  mov     rax, qword ptr [rbp - 0x10]
0x7f0:  add     rax, rdx
0x7f3:  mov     edx, ecx
0x7f5:  mov     byte ptr [rax], dl
0x7f7:  add     dword ptr [rbp - 0x14], 1
0x7fb:  cmp     dword ptr [rbp - 0x14], 8
0x7ff:  jle     0x7d0

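   Once you can read the loop, you can replay it and recover the password
   without ever running the binary. Here's a minimal Python mirror of
   those instructions. The function name build_password is mine, and the
   comments map each line back to the disassembly as I read it.

```python
def build_password():
    """Replay the loop at 0x7d0: pw[i] = pw[i - 1] + 1, starting at 'a'."""
    pw = bytearray(9)           # the nine characters the loop fills in
    pw[0] = ord('a')            # pw[0] = 'a' from chall.c
    i = 1                       # counter stored at [rbp - 0x14]
    while i <= 8:               # cmp dword ptr [rbp - 0x14], 8 ; jle 0x7d0
        pw[i] = pw[i - 1] + 1   # movzx eax, [pw + i - 1] ; lea ecx, [rax + 1]
        i += 1                  # add dword ptr [rbp - 0x14], 1
    return pw.decode('ascii')

print(build_password())
```

   That's the string to feed to the password prompt.
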
   And this looks like our strcmp()
0x843:  mov     rdx, qword ptr [rbp - 0x10] ; *pw
0x847:  mov     rax, qword ptr [rbp - 8]    ; *in
0x84b:  mov     rsi, rdx                    ; 2nd argument: pw
0x84e:  mov     rdi, rax                    ; 1st argument: in
0x851:  call    0x670                       ; strcmp
0x856:  test    eax, eax                    ; is = 0?
0x858:  jne     0x868                       ; no? jump to 0x868
0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!"
0x861:  call    0x640                       ; puts
0x866:  jmp     0x874
0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
0x86f:  call    0x640                       ; puts

   I'm not sure why it uses puts here? I might be missing something;
   perhaps printf calls puts. I could be wrong. I also confirmed with
   radare2 that those locations are actually the strings "haha yes!" and
   "nah dude".

   Update: It's because of compiler optimization. A printf() whose format
   string has no format specifiers and ends in a newline is a bit
   overkill, so the compiler simplifies it to a puts() and drops the
   trailing newline from the string.

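   You don't strictly need radare2 to check those string addresses
   either. The two lea targets work out to 0x861 + 0xae and 0x86f + 0xaa,
   both of which land inside .rodata (0x900). The sketch below reads a
   NUL-terminated string at a given address. Big caveat: the RODATA bytes
   here are an illustrative stand-in that I laid out by hand so the
   strings sit where the disassembly says they should. They are not
   dumped from chall.elf, and read_cstring is a made-up helper name.

```python
RODATA_ADDR = 0x900   # .rodata address from the sections.py output

# ASSUMPTION: hand-built stand-in for the section contents, arranged to
# match the addresses in the disassembly. Not read from chall.elf.
RODATA = (b'\x01\x00\x02\x00'
          b'password: \x00'
          b'haha yes!\x00'
          b'nah dude\x00')

def read_cstring(addr):
    """Read a NUL-terminated ASCII string at a virtual address in RODATA."""
    start = addr - RODATA_ADDR
    end = RODATA.index(b'\x00', start)
    return RODATA[start:end].decode('ascii')

print(read_cstring(0x861 + 0xae))
print(read_cstring(0x86f + 0xaa))
```

   Against the real binary, you'd swap the stand-in for
   elf.get_section_by_name('.rodata').data() and take the base address
   from that section's sh_addr.
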
Conclusion

   Wew, that took quite some time. But we're done. If you're a beginner,
   you might have found this extremely confusing, or not understood what
   was going on at all. And that's okay. Building an intuition for
   reading and grokking disassembly comes with practice. I'm no good at
   it either.

   All the code used in this post is here:
   [3]https://github.com/icyphox/asdf/tree/master/reversing-elf

   Ciao for now, and I'll see ya in #2 of this series -- PE binaries.
   Whenever that is.

References

   1. http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html
   2. http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib%20--%20libc-start-main-.html
   3. https://github.com/icyphox/asdf/tree/master/reversing-elf