---
date: '2019-02-08'
subtitle: 'Building your own disassembly tooling for --- that''s
  right --- fun and profit'
template: text.html
title: Python for Reverse Engineering
url: 'python-for-re-1'
---

While solving complex reversing challenges, we often use established
tools like radare2 or IDA for disassembling and debugging. But there are
times when you need to dig in a little deeper and understand how things
work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it
comes to automating certain processes, and they can eventually grow into
your own homebrew reversing toolchain of sorts. At least, that's what
I'm attempting anyway.

Setup
-----

As the title suggests, you're going to need a Python 3 interpreter
before anything else. Once you've confirmed beyond reasonable doubt that
you do, in fact, have a Python 3 interpreter installed on your system,
run

``` {.console}
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we'll be scripting with and
`pyelftools` is the library we'll use to parse ELF files.

With that out of the way, let's start with an example of a basic
reversing challenge.

``` {.c}
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
   char *pw = malloc(10);       // 9 characters plus the terminating '\0'
   pw[0] = 'a';
   for(int i = 1; i <= 8; i++){
       pw[i] = pw[i - 1] + 1;
   }
   pw[9] = '\0';
   char *in = malloc(10);
   printf("password: ");
   fgets(in, 10, stdin);        // 'abcdefghi'
   if(strcmp(in, pw) == 0) {
       printf("haha yes!\n");
   }
   else {
       printf("nah dude\n");
   }
}
```

Compile it with GCC/Clang:

``` {.console}
$ gcc chall.c -o chall.elf
```

Scripting
---------

For starters, let's look at the different sections present in the
binary.

``` {.python}
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and shows us the address
each one is loaded at. This will be pretty useful later. Running it
gives us

``` {.console}
$ python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

Most of these aren't relevant to us, but a few sections are worth
noting. The `.text` section contains the instructions (opcodes) that
we're after. The `.rodata` section holds read-only data such as string
literals, while `.data` holds initialized, writable globals. Finally,
there's the `.plt`, the Procedure Linkage Table, and the `.got`, the
Global Offset Table. If you're unsure about what these mean, read up on
the ELF format and its internals.

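Later on it helps to know which section a given address belongs to.
Here's a minimal sketch of that lookup, assuming we collect
`(name, start, size)` tuples from each section's `sh_addr` and
`sh_size`. The helper names `load_sections` and `section_containing`
are my own for illustration and aren't part of pyelftools.

```python
# which_section.py (hypothetical helper names, not part of pyelftools)

def load_sections(path):
    """Collect (name, start, size) for every section with a load address."""
    # Imported here so the lookup below stays usable without pyelftools.
    from elftools.elf.elffile import ELFFile

    with open(path, 'rb') as f:
        elf = ELFFile(f)
        return [
            (s.name, s['sh_addr'], s['sh_size'])
            for s in elf.iter_sections()
            if s['sh_addr'] != 0      # assumption: address 0 means "not loaded"
        ]


def section_containing(addr, sections):
    """Return the name of the section whose address range covers addr."""
    for name, start, size in sections:
        if start <= addr < start + size:
            return name
    return None
```

Calling `section_containing(addr, load_sections('./chall.elf'))` returns
a section name, or `None` when the address isn't inside any loaded
section.
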
Since we know that the `.text` section has the opcodes, let's
disassemble the binary starting at that address.

``` {.python}
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()            # raw bytes of the section
    addr = code['sh_addr']       # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think). Running it, we should see

``` {.console}
$ python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

The `call` at `0x6c4` is fairly interesting to us. While an instruction
executes, `rip` holds the address of the next one, so
`[rip + 0x200916]` is equivalent to `[0x6ca + 0x200916]`, which in turn
evaluates to `[0x200fe0]`. So the first `call` goes through a function
pointer stored at `0x200fe0`. What could this function be?

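The arithmetic above can be sketched as a tiny helper. The function
name is mine; the only assumption is that we know the instruction's
address and its length in bytes.

```python
def rip_relative_target(insn_addr, insn_size, disp):
    """Resolve [rip + disp] for the instruction at insn_addr.

    rip points at the *next* instruction while the current one executes,
    so the base is insn_addr + insn_size, not insn_addr.
    """
    return insn_addr + insn_size + disp


# The call at 0x6c4 is 6 bytes long (the next instruction, hlt, is at 0x6ca).
print(hex(rip_relative_target(0x6c4, 6, 0x200916)))   # 0x200fe0
```

Inside the `disas1.py` loop, capstone exposes the same two numbers as
`i.address` and `i.size`.
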
For this, we will have to look at **relocations**. Quoting
[linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):

> Relocation is the process of connecting symbolic references with
> symbolic definitions. For example, when a program calls a function, the
> associated call instruction must transfer control to the proper
> destination address at execution. Relocatable files must have
> "relocation entries" which are necessary because they contain
> information that describes how to modify their section contents, thus
> allowing executable and shared object files to hold the right
> information for a process's program image.

To try and find these relocation entries, we write a third script.

``` {.python}
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table this section uses
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let's run through this code real quick. We loop through the sections
and pick out the ones of type `RelocationSection`. For each of those, we
fetch the symbol table it links to (via `sh_link`), then iterate over
its relocations and look up the symbol each one refers to. Running this
gives us

``` {.console}
$ python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
__libc_start_main 0x200fe0
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```

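The same walk can build a lookup table instead of printing, so that an
address can be turned straight into a symbol name. This is a sketch
along the lines of `relocations.py`; `load_relocations` and
`describe_slot` are names I made up for it.

```python
def load_relocations(path):
    """Map each relocation's r_offset to the name of the symbol it refers to."""
    # Imported here so describe_slot() can be used without pyelftools.
    from elftools.elf.elffile import ELFFile
    from elftools.elf.relocation import RelocationSection

    relocs = {}
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if not isinstance(section, RelocationSection):
                continue
            symtab = elf.get_section(section['sh_link'])
            for rel in section.iter_relocations():
                name = symtab.get_symbol(rel['r_info_sym']).name
                if name:                  # skip relocations without a symbol name
                    relocs[rel['r_offset']] = name
    return relocs


def describe_slot(addr, relocs):
    """Return 'name (0xaddr)' for a relocated slot, else just the hex address."""
    name = relocs.get(addr)
    return f'{name} ({addr:#x})' if name else f'{addr:#x}'
```
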
Remember the call through the pointer at `0x200fe0` from earlier? Yep,
so that was a call to the well known `__libc_start_main`. Again,
according to
[linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):

> The `__libc_start_main()` function shall perform any necessary
> initialization of the execution environment, call the *main* function
> with appropriate arguments, and handle the return from `main()`. If the
> `main()` function returns, the return value shall be passed to the
> `exit()` function.

And its definition is like so

``` {.c}
int __libc_start_main(int (*main) (int, char **, char **),
                      int argc, char **ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void *stack_end);
```

Let's look back at our disassembly

    0x6a0: xor ebp, ebp
    0x6a2: mov r9, rdx
    0x6a5: pop rsi
    0x6a6: mov rdx, rsp
    0x6a9: and rsp, 0xfffffffffffffff0
    0x6ad: push rax
    0x6ae: push rsp
    0x6af: lea r8, [rip + 0x23a]
    0x6b6: lea rcx, [rip + 0x1c3]
    0x6bd: lea rdi, [rip + 0xe6]
    0x6c4: call qword ptr [rip + 0x200916]
    0x6ca: hlt
    ... snip ...

but this time at the `lea` (Load Effective Address) instruction at
`0x6bd`, which loads the address `rip + 0xe6` into the `rdi` register.
With `rip` pointing at the next instruction (`0x6c4`), that evaluates to
`0x7aa`, which happens to be the address of our `main()` function! How
do I know that? Because on x86-64 Linux `rdi` carries the first argument
of a call, and the first argument of `__libc_start_main()` is `main`.
After doing whatever it does, `__libc_start_main()` eventually calls the
function it was handed in `rdi`. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)

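To see what the disassembler is doing for us on that one instruction,
here's a sketch that decodes `lea rdi, [rip + disp32]` by hand with
nothing but `struct`. It's a byte-pattern heuristic rather than real
disassembly, and `find_main_candidate` is a name I made up.

```python
import struct

# 48 = REX.W prefix, 8d = LEA opcode, 3d = ModRM byte for rdi, [rip + disp32]
LEA_RDI_RIP = b'\x48\x8d\x3d'
LEA_LENGTH = 7                        # 3 opcode bytes + 4 displacement bytes


def find_main_candidate(code, base):
    """Return the address loaded by the first `lea rdi, [rip + disp32]`.

    code is a bytes object loaded at address base. The pattern can also
    match in the middle of another instruction, so treat the result as a
    guess to confirm against the disassembly.
    """
    offset = code.find(LEA_RDI_RIP)
    if offset == -1 or offset + LEA_LENGTH > len(code):
        return None
    # The signed 32-bit little-endian displacement follows the opcode bytes.
    (disp,) = struct.unpack_from('<i', code, offset + 3)
    next_insn = base + offset + LEA_LENGTH
    return next_insn + disp
```

Fed the `ops` and `addr` values from `disas1.py`, this should land on
the same `0x7aa`, assuming the startup code sits at the very beginning
of `.text` as it does in the listing above.
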
To see the disassembly of `main`, seek to `0x7aa` in the output of the
script we'd written earlier (`disas1.py`).

Each `call` in `main` targets a small stub in the `.plt`, and each stub
jumps through one of the slots listed under `.rela.plt` in the
relocation entries. So following each `call` through its stub to its
relocation gives us this

    puts   0x640
    printf 0x650
    fgets  0x660
    strcmp 0x670
    malloc 0x680

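Those stub addresses follow a pattern, which the sketch below spells
out. It assumes the classic x86-64 PLT layout (16-byte entries, with
entry 0 reserved for the resolver), which matches this binary; binaries
built with a `.plt.sec` section or without a PLT are laid out
differently. The function name is my own.

```python
PLT_ENTRY_SIZE = 16   # assumption: classic x86-64 PLT with 16-byte entries


def plt_stub_addresses(plt_base, names):
    """Pair each imported name with the address of its PLT stub.

    names must be in .rela.plt order. Entry 0 of the PLT is the resolver
    stub, so the first imported function lives one entry past plt_base.
    """
    return {
        name: plt_base + PLT_ENTRY_SIZE * (index + 1)
        for index, name in enumerate(names)
    }


# .plt address and .rela.plt order taken from the script output earlier.
stubs = plt_stub_addresses(0x630, ['puts', 'printf', 'fgets', 'strcmp', 'malloc'])
for name, addr in stubs.items():
    print(f'{name:<6} {addr:#x}')
```
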
Putting all this together, things start falling into place. Let me
highlight the key sections of the disassembly here. It's pretty
self-explanatory.

    0x7b2: mov edi, 0xa  ; 10
    0x7b7: call 0x680    ; malloc

The loop to populate the `*pw` string

    0x7d0:  mov     eax, dword ptr [rbp - 0x14]
    0x7d3:  cdqe
    0x7d5:  lea     rdx, [rax - 1]
    0x7d9:  mov     rax, qword ptr [rbp - 0x10]
    0x7dd:  add     rax, rdx
    0x7e0:  movzx   eax, byte ptr [rax]
    0x7e3:  lea     ecx, [rax + 1]
    0x7e6:  mov     eax, dword ptr [rbp - 0x14]
    0x7e9:  movsxd  rdx, eax
    0x7ec:  mov     rax, qword ptr [rbp - 0x10]
    0x7f0:  add     rax, rdx
    0x7f3:  mov     edx, ecx
    0x7f5:  mov     byte ptr [rax], dl
    0x7f7:  add     dword ptr [rbp - 0x14], 1
    0x7fb:  cmp     dword ptr [rbp - 0x14], 8
    0x7ff:  jle     0x7d0

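Translated back into Python, the loop above does roughly the following.
This is my own sketch of its effect, not something recovered from the
binary, and it starts from the `'a'` that `chall.c` stores in `pw[0]`.

```python
def build_password(length=9):
    """Mirror the loop: start at 'a', each byte is the previous byte plus one."""
    pw = bytearray(length)
    pw[0] = ord('a')
    for i in range(1, length):        # i = 1..8, matching `cmp ..., 8` / `jle`
        pw[i] = pw[i - 1] + 1
    return pw.decode('ascii')


print(build_password())   # abcdefghi
```
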
And this looks like our `strcmp()`

    0x843:  mov     rdx, qword ptr [rbp - 0x10] ; pw
    0x847:  mov     rax, qword ptr [rbp - 8]    ; in
    0x84b:  mov     rsi, rdx                    ; second argument
    0x84e:  mov     rdi, rax                    ; first argument
    0x851:  call    0x670                       ; strcmp
    0x856:  test    eax, eax                    ; is = 0?
    0x858:  jne     0x868                       ; no? jump to 0x868
    0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!"
    0x861:  call    0x640                       ; puts
    0x866:  jmp     0x874
    0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
    0x86f:  call    0x640                       ; puts

I wasn't sure at first why it uses `puts` here; I thought I might be
missing something, or that perhaps `printf` calls `puts`. I also
confirmed with radare2 that those locations are actually the strings
"haha yes!" and "nah dude".

**Update**: It's because of compiler optimization. A `printf()` of a
constant string that has no format specifiers and ends in a newline is a
bit overkill, and hence gets simplified to a `puts()`.

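The string check doesn't strictly need radare2. Here's a sketch of
reading a C string at a given address out of a section's bytes;
`read_cstring` is a name I made up, and it assumes the string is plain
ASCII.

```python
def read_cstring(data, base, addr):
    """Read a NUL-terminated string at virtual address addr.

    data is the raw content of a section and base is the address that
    section is loaded at (its sh_addr).
    """
    start = addr - base
    if not 0 <= start < len(data):
        raise ValueError(f'{addr:#x} is outside this section')
    end = data.find(b'\x00', start)
    if end == -1:                     # no terminator found: read to the end
        end = len(data)
    return data[start:end].decode('ascii', errors='replace')
```

With pyelftools, `data` and `base` would come from
`elf.get_section_by_name('.rodata')`, via `.data()` and `['sh_addr']`.
The two `lea rdi` operands resolve to `0x861 + 0xae = 0x90f` and
`0x86f + 0xaa = 0x919`, so those are the addresses to pass in.
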
Conclusion
----------

Wew, that took quite some time. But we're done. If you're a beginner,
you might find this extremely confusing, or might not have understood
what was going on at all. And that's okay. Building an intuition for
reading and grokking disassembly comes with practice. I'm no good at it
either.

All the code used in this post is here:
<https://github.com/icyphox/asdf/tree/master/reversing-elf>

Ciao for now, and I'll see ya in \#2 of this series, on PE binaries.
Whenever that is.