Hello!
Version
stable-x86_64-unknown-linux-gnu (default)
rustc 1.66.0 (69f9c33d7 2022-12-12)
Problem
I ran into an infinite-loop bug when trying to unzip an ASCII file larger than 4 GB.
The following code is an example of my usage:
use std::fs;
use std::io::Read;

pub fn from_zip_file(file_path: String) -> Trace {
    let zip_file =
        fs::File::open(&file_path).expect(&format!("Could not open file {}", &file_path));
    let mut zip_archive = zip::ZipArchive::new(zip_file)
        .expect(&format!("Could not open archive {}", &file_path));
    println!("Trace::from_zip_file: unzipping {}", &file_path);
    let mut trace_file = zip_archive.by_index(0).unwrap();
    let trace_file_path = trace_file.sanitized_name().to_str().unwrap().to_string();
    println!("Trace::from_zip_file: unzipped {}", &file_path);
    let mut trace_content = String::new();
    trace_file
        .read_to_string(&mut trace_content)
        .expect(&format!("Could not read unzipped file {}", trace_file_path));
    println!("Trace::from_zip_file: read_to_string {}", &file_path);
    // ... (construction of the returned Trace elided)
}
My program got stuck at trace_file.read_to_string().
To investigate the problem, I analyzed the code carefully and found the root cause in the bzip2 Rust library.
Root Cause
The function read_to_string() indirectly invokes std::io::default_read_to_end(). The following is the source of std::io::default_read_to_end():
// This uses an adaptive system to extend the vector when it fills. We want to
// avoid paying to allocate and zero a huge chunk of memory if the reader only
// has 4 bytes while still making large reads if the reader does have a ton
// of data to return. Simply tacking on an extra DEFAULT_BUF_SIZE space every
// time is 4,500 times (!) slower than a default reservation size of 32 if the
// reader has a very small amount of data to return.
pub(crate) fn default_read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize> {
    let start_len = buf.len();
    let start_cap = buf.capacity();
    let mut initialized = 0; // Extra initialized bytes from previous loop iteration
    loop {
        if buf.len() == buf.capacity() {
            buf.reserve(32); // buf is full, need more space
        }
        let mut read_buf: BorrowedBuf<'_> = buf.spare_capacity_mut().into();
        // SAFETY: These bytes were initialized but not filled in the previous loop
        unsafe {
            read_buf.set_init(initialized);
        }
        let mut cursor = read_buf.unfilled();
        match r.read_buf(cursor.reborrow()) {
            Ok(()) => {}
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
        if cursor.written() == 0 {
            return Ok(buf.len() - start_len);
        }
        // store how much was initialized but not filled
        initialized = cursor.init_ref().len();
        // SAFETY: BorrowedBuf's invariants mean this much memory is initialized.
        unsafe {
            let new_len = read_buf.filled().len() + buf.len();
            buf.set_len(new_len);
        }
        if buf.len() == buf.capacity() && buf.capacity() == start_cap {
            // The buffer might be an exact fit. Let's read into a probe buffer
            // and see if it returns `Ok(0)`. If so, we've avoided an
            // unnecessary doubling of the capacity. But if not, append the
            // probe buffer to the primary buffer and let its capacity grow.
            let mut probe = [0u8; 32];
            loop {
                match r.read(&mut probe) {
                    Ok(0) => return Ok(buf.len() - start_len),
                    Ok(n) => {
                        buf.extend_from_slice(&probe[..n]);
                        break;
                    }
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Err(e),
                }
            }
        }
    }
}
The stdlib function std::io::default_read_to_end() doubles the buffer's capacity every time it fills (buf.reserve(32) on a full Vec grows it geometrically). Since the buffer's initial capacity is 32, the capacity of the Vec buf is always a power of two.
My ASCII file is slightly larger than 4 GB (0x1_0000_0000, i.e. 4294967296 bytes), so the buffer's capacity is eventually extended to 8 GB (0x2_0000_0000, i.e. 8589934592 bytes).
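The power-of-two growth can be observed directly. A minimal sketch that mimics default_read_to_end()'s pattern (the exact capacity values depend on Vec's allocation strategy, which is an implementation detail of the standard library):

```rust
// Start with capacity 32 and call reserve(32) whenever the buffer fills,
// exactly as default_read_to_end() does. Vec grows its capacity
// geometrically, so each round doubles it.
fn capacity_growth(rounds: usize) -> Vec<usize> {
    let mut buf: Vec<u8> = Vec::with_capacity(32);
    let mut caps = Vec::new();
    for _ in 0..rounds {
        while buf.len() < buf.capacity() {
            buf.push(0); // fill the buffer completely
        }
        caps.push(buf.capacity());
        buf.reserve(32); // "buf is full, need more space"
    }
    caps
}

fn main() {
    // On current rustc this prints [32, 64, 128, 256, 512]
    println!("{:?}", capacity_growth(5));
}
```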
In default_read_to_end(), read_buf is a BorrowedBuf over the spare capacity of buf, i.e. it starts at the first unfilled byte of buf.
At some point read_buf's length is exactly 0x1_0000_0000 while buf.capacity() is 0x2_0000_0000: the first half of buf holds the already-unzipped data, and the second half is spare space waiting to be filled with the rest of the unzipped data.
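In numbers (assuming a 64-bit target and the capacities described above):

```rust
fn main() {
    let len: usize = 0x1_0000_0000; // bytes of unzipped data already in buf
    let cap: usize = 0x2_0000_0000; // capacity after the final doubling
    let spare = cap - len;          // length of spare_capacity_mut(), i.e. of read_buf
    assert_eq!(spare, 0x1_0000_0000); // exactly 4 GiB of spare space to fill
    println!("spare = {:#x}", spare);
}
```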
The function call r.read_buf(cursor.reborrow()) indirectly calls Decompress::decompress(), whose source is listed below:
/// Decompress a block of input into a block of output.
pub fn decompress(&mut self, input: &[u8], output: &mut [u8]) -> Result<Status, Error> {
    self.inner.raw.next_in = input.as_ptr() as *mut _;
    self.inner.raw.avail_in = input.len() as c_uint;
    self.inner.raw.next_out = output.as_mut_ptr() as *mut _;
    self.inner.raw.avail_out = output.len() as c_uint;
    unsafe {
        match ffi::BZ2_bzDecompress(&mut *self.inner.raw) {
            ffi::BZ_OK => Ok(Status::Ok),
            ffi::BZ_MEM_ERROR => Ok(Status::MemNeeded),
            ffi::BZ_STREAM_END => Ok(Status::StreamEnd),
            ffi::BZ_PARAM_ERROR => Err(Error::Param),
            ffi::BZ_DATA_ERROR => Err(Error::Data),
            ffi::BZ_DATA_ERROR_MAGIC => Err(Error::DataMagic),
            ffi::BZ_SEQUENCE_ERROR => Err(Error::Sequence),
            c => panic!("wut: {}", c),
        }
    }
}
At this point, the length of output is 0x1_0000_0000.
When output.len() is cast to c_uint (a 32-bit unsigned integer), its high 32 bits are lost, so avail_out is zero when the C decompression routine is invoked.
In that case, no data is extracted and the function returns immediately. However, the caller keeps asking it to decompress the remaining data, so an endless loop results.
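The truncation itself is easy to reproduce in isolation. A minimal sketch (assumes a 64-bit target, where usize is wider than c_uint; u32 stands in for c_uint):

```rust
fn main() {
    let output_len: usize = 0x1_0000_0000; // 4 GiB spare buffer, as above
    let avail_out = output_len as u32;     // same cast as `output.len() as c_uint`
    // The high 32 bits are discarded, so bzip2 is told it has no room at all.
    assert_eq!(avail_out, 0);
    println!("avail_out = {}", avail_out);
}
```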
Patch
The following is my patch. The assignment of self.inner.raw.avail_out is modified: when output.len() exceeds the range of c_uint, avail_out is clamped to the maximum value of c_uint.
Other functions may contain the same problem.
/// Decompress a block of input into a block of output.
pub fn decompress(&mut self, input: &[u8], output: &mut [u8]) -> Result<Status, Error> {
    self.inner.raw.next_in = input.as_ptr() as *mut _;
    self.inner.raw.avail_in = input.len() as c_uint;
    self.inner.raw.next_out = output.as_mut_ptr() as *mut _;
    self.inner.raw.avail_out = {
        let avail_out = output.len();
        if avail_out > c_uint::MAX as usize {
            // Clamp instead of truncating; the caller hands the rest of
            // the output buffer to the library on subsequent calls.
            c_uint::MAX
        } else {
            avail_out as c_uint
        }
    };
    unsafe {
        match ffi::BZ2_bzDecompress(&mut *self.inner.raw) {
            ffi::BZ_OK => Ok(Status::Ok),
            ffi::BZ_MEM_ERROR => Ok(Status::MemNeeded),
            ffi::BZ_STREAM_END => Ok(Status::StreamEnd),
            ffi::BZ_PARAM_ERROR => Err(Error::Param),
            ffi::BZ_DATA_ERROR => Err(Error::Data),
            ffi::BZ_DATA_ERROR_MAGIC => Err(Error::DataMagic),
            ffi::BZ_SEQUENCE_ERROR => Err(Error::Sequence),
            c => panic!("wut: {}", c),
        }
    }
}
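The clamping logic can be factored out and exercised on its own. A sketch of the idea (clamp_to_c_uint is a hypothetical helper, not part of the bzip2 crate; u32 stands in for c_uint):

```rust
/// Clamp a usize length to what fits in a 32-bit c_uint instead of
/// silently truncating it; the remainder of the buffer is offered to
/// the C library on a later call.
fn clamp_to_c_uint(len: usize) -> u32 {
    if len > u32::MAX as usize {
        u32::MAX
    } else {
        len as u32
    }
}

fn main() {
    assert_eq!(clamp_to_c_uint(1024), 1024);
    assert_eq!(clamp_to_c_uint(u32::MAX as usize), u32::MAX);
    // The problematic 4 GiB case no longer becomes zero:
    assert_eq!(clamp_to_c_uint(0x1_0000_0000), u32::MAX);
    println!("ok");
}
```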