Duck a Fourth Way: The File and Text I/O Benchmark Written in Rust

Michael Sydney Balloni

Jan 5, 2023

CPOL

See how Rust stacks up against C-ish, C++, and C#

Introduction

This article describes a Rust program and compares its performance with similar programs written in C#, C, and C++.

I'm embarrassed to admit this, but this is my first Rust program. I've been making my way through the O'Reilly book, Programming Rust, 2nd Edition. Good book. Quite long. Once you get past the initial "what in the...!?!?" of Rust, it comes down to "I know how to do this in C... what are the calls in Rust?" So I just went for it. I'm sure I have some major blind spots. I think we're all still learning Rust...

I know how to write this file and text I/O benchmark: here's the article. To sum up, you've got 90 MB of Unicode CSV in a file, read it into objects (all 100K of them), then write the objects back out to a Unicode CSV file.

Let's see some Rust!

But wait, let's get those performance numbers! The Rust program on my system takes 117 ms to do the read and 195 ms to do the write. So it's right there with the C-like program on the read, which takes 107 ms, but not so great on the write, where C-like took 147 ms and C++ took just 136 ms. Read on for why the Rust write code might be slow.

Rust Program Source

All the source is in one file, main.rs.

Source Header

The program starts with the usual use declarations:

use stopwatch::Stopwatch;

use std::fs::File;

use std::io::Read;
use std::io::Write;

Stopwatch is a third-party type from the stopwatch crate, enough like .NET's Stopwatch to fit the bill here.

"use std::fs::File" makes it possible to refer to the File type by its short name. You could say "use std::fs::*" to bring in the whole module.

std::io::Read and std::io::Write aren't types at all, they're traits (roughly, interfaces). You need to bring them into scope if you want to call their methods, like read_to_end() and write_all(), on a File.
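To make the trait point concrete, here is a minimal sketch. The copy_all and demo_copy names are made up for illustration and are not part of the benchmark. The helper works on anything that implements Read and Write, so an in-memory Cursor and a Vec stand in for files.

```rust
use std::io::{Cursor, Read, Write};

// Hypothetical helper: copy everything from any reader into any writer.
// read_to_end() comes from the Read trait and write_all() from the Write
// trait; without the `use` line above, neither method call would compile.
fn copy_all<R: Read, W: Write>(mut input: R, mut output: W) -> std::io::Result<usize> {
    let mut buffer = Vec::new();
    let count = input.read_to_end(&mut buffer)?;
    output.write_all(&buffer)?;
    Ok(count)
}

// In-memory stand-ins for files, so the sketch runs without touching disk.
fn demo_copy() -> Vec<u8> {
    let input = Cursor::new(b"hello".to_vec());
    let mut output = Vec::new();
    copy_all(input, &mut output).unwrap();
    output
}
```

Calling demo_copy() returns the same five bytes that went in. A File could be passed to copy_all the same way, since File implements both traits.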

The Data Record

This is the type of the object we'll be loading from CSV and writing back out to CSV.

// Our demographics data record
struct LineData<'a> {
    first: &'a str,
    last: &'a str,
    address1: &'a str,
    address2: &'a str,
    city: &'a str,
    state: &'a str,
    zip: &'a str
}

The 'a stuff is a lifetime annotation. It says that the text the fields refer to must live at least as long as the struct itself. This is the lifetime business, some of the "what in the...!?!?" of Rust. Enough said: a LineData object can't outlive the text it points into, and the compiler enforces that. This prevents use-after-free bugs.

The &str is not a String in .NET or a std::wstring in C++. It's a reference to a run of UTF-8 bytes that somebody else owns, a pointer plus a length, essentially a well-groomed const char* (or a std::string_view, for the C++ folks). That's what really makes this program fly. Imagine a struct in C with raw character pointers out to who knows where. That'd be optimal, but scary as hell, right? Well, Rust has pulled it off, it's the real deal for just this sort of thing.
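Here is a small sketch of what the lifetime buys you. The Name struct and the parse_name and demo_name functions are hypothetical, cut down from LineData to two fields.

```rust
// Hypothetical two-field record in the style of LineData.
struct Name<'a> {
    first: &'a str,
    last: &'a str,
}

// Split "first,last" into a Name whose fields point into `line`.
// The shared 'a tells the compiler the result may not outlive `line`.
fn parse_name<'a>(line: &'a str) -> Option<Name<'a>> {
    let (first, last) = line.split_once(',')?;
    Some(Name { first, last })
}

fn demo_name() -> (usize, usize) {
    let text = String::from("Ada,Lovelace");
    let name = parse_name(&text).unwrap();
    // drop(text); // uncommenting this is a compile error: `text` is still borrowed
    (name.first.len(), name.last.len())
}
```

The commented-out drop() is the interesting line: freeing the text while a record still points into it is rejected at compile time rather than blowing up at run time.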

Unicode Bytes To String

We know from our previous attempts at this benchmark that reading the entire file into memory is a good start. Once we have all those bytes, in this attempt, we want to turn that into a String object for later processing.

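// Turn a buffer of UTF-16LE bytes into a String
fn utf16le_buffer_to_string(buffer: &[u8]) -> String {
	let char_len = buffer.len() / 2;
	let mut output = String::with_capacity(char_len);
	for idx in 0..char_len {
		// Assemble one little-endian 16-bit code unit from two bytes
		let unit = u16::from_le_bytes([buffer[idx * 2], buffer[idx * 2 + 1]]);
		// from_u32() returns None for surrogate code units, so unwrap() panics
		// on any character outside the Basic Multilingual Plane
		output.push(char::from_u32(unit as u32).unwrap());
	}
	output
}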

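The buffer input parameter is a &[u8], a slice: a borrowed view of a run of u8s, bytes. Pre-allocating the String is probably a good idea. Then we just loop over the range of indexes from 0 to char_len - 1, doing some Unicode fun: each pair of bytes becomes one 16-bit value, and each 16-bit value becomes one char. That works for this data set, but the unwrap() would panic on a surrogate pair, which is how UTF-16 encodes characters outside the Basic Multilingual Plane. The unadorned "output" after the loop is the return value, weird, I know.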
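If the input might contain characters outside the Basic Multilingual Plane, the standard library's char::decode_utf16 handles surrogate pairs. The following is a sketch under that assumption, with a made-up function name. It was not part of the benchmark and has not been timed, so it may well be slower or faster than the loop above.

```rust
// Sketch: decode UTF-16LE bytes, tolerating surrogate pairs and bad input.
fn utf16le_to_string_lossy(buffer: &[u8]) -> String {
    // Pair up the bytes into little-endian u16 code units;
    // a trailing odd byte is ignored.
    let units = buffer
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    // decode_utf16 combines surrogate pairs into one char;
    // unpaired surrogates are replaced with U+FFFD.
    char::decode_utf16(units)
        .map(|result| result.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

The design choice here is to substitute U+FFFD for bad input instead of panicking. A stricter version could return the decode error to the caller.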

main() Begins

What little benchmark app is complete without a simple straight-through main() function?

fn main() {
    // Deal with inputs
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 3 {
        println!("Usage: {} <input file> <output file>", args[0]);
        std::process::exit(0);
    }
    let input_file_path = &args[1];
    let output_file_path = &args[2];
    println!("{} -> {}", input_file_path, output_file_path);

std::env::args() hands back an iterator of Strings, and collect() turns it into an easier-to-use Vec<String> (think std::vector<std::string>). The &args[x] lets the input_file_path / output_file_path variables refer to the arguments without copying or modifying anything. There are a lot of &s in Rust. It's pretty scary at first, like pointers everywhere, and there are, but it's safe. That's Rust's big gamble, that you'll spackle &s and other incantations like "mut" and then you'll trust the compiler with memory correctness, and everything will be okay. The println! macro is like printf, with {} placeholders in place of %'s, minus the type specifiers.

Stopwatch Timing

	// Timing is fun...look familiar?
	let mut sw = Stopwatch::start_new();
	let mut cur_ms : i64;
	let mut total_read_ms : i64 = 0;

The "mut" in "let mut" says the variable can be modified after it's initialized. No "mut", no modifications are possible, kind of like a const.
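A tiny sketch of the two places "mut" shows up, on a variable and on a borrowed reference. The push_twice and demo_mut names are invented for this example.

```rust
// Hypothetical helper: appends to a vector it has borrowed mutably.
fn push_twice(values: &mut Vec<i32>, value: i32) {
    values.push(value);
    values.push(value);
}

fn demo_mut() -> Vec<i32> {
    // Immutable binding: a later `fixed = 2;` would not compile.
    let fixed = 1;
    // Mutable binding: we are allowed to modify it...
    let mut values = Vec::new();
    // ...or lend it out as a read-write reference.
    push_twice(&mut values, fixed);
    values
}
```

Dropping the "mut" from "let mut values" makes the "&mut values" borrow a compile error, which is the same rule that requires "let mut sw" before calling sw.restart().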

Input File I/O: File -> Buffer

	// Read the input file into a buffer
	sw.restart();
	let mut buffer = Vec::new();
	File::open(input_file_path).unwrap().read_to_end(&mut buffer).unwrap();
	cur_ms = sw.elapsed_ms();
	total_read_ms += cur_ms;
	println!("buffer: {} - {} ms", buffer.len(), cur_ms);

We create a new vector, we don't have to say the element type, the compiler figures it out. In one line, we read the file into the vector of... bytes, it's gotta be bytes. The &mut means that we're passing (they call it borrowing) a read-write reference to read_to_end() so it can modify our vector.

Input Text I/O: Buffer -> String

	// Read the buffer into a string
	let str_val = utf16le_buffer_to_string(&buffer);

Here, we use the function we defined above for this purpose, passing in a reference to our byte vector.

Object Input I/O: String -> Objects

	let mut objects = Vec::new();
	let mut parts: [&str; 7] = ["", "", "", "", "", "", ""];
	let field_len = parts.len();
	let mut idx: usize;
	for line in str_val.lines() { // walk the lines
		if line.is_empty() { // skip blank lines; "".split(',') still yields one empty part
			continue;
		}
		idx = 0;
		for part in line.split(',') { // walk the comma-delimited parts
			assert!(idx < field_len);
			parts[idx] = part;
			idx += 1;
		}
		assert_eq!(idx, field_len);
		objects.push
		(
			LineData {
				first: parts[0],
				last: parts[1],
				address1: parts[2],
				address2: parts[3],
				city: parts[4],
				state: parts[5],
				zip: parts[6]
			}
		);
	}

I had to optimize this code a bit to make it fly. The Vec objects is where we collect our records. The array parts holds seven &str references, one for each field in our record type, a little array of pointer-and-length pairs. In the loop, we walk lines and collect strings and put them into the objects. Picture character pointers making their way out of lines() and split() calls, into the parts array, then into our records. No string copying at all, just shuffling references in and out of data structures. Amazing!
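To check the "no string copying" claim for yourself, here is a self-contained sketch. Both split_seven and points_into are hypothetical helpers written for this illustration, not code from main.rs.

```rust
// Hypothetical helper: split one CSV line into exactly seven borrowed fields.
// Returns None when the line has too few or too many fields.
fn split_seven(line: &str) -> Option<[&str; 7]> {
    let mut parts = [""; 7];
    let mut fields = line.split(',');
    for slot in parts.iter_mut() {
        *slot = fields.next()?;
    }
    if fields.next().is_some() {
        return None; // an eighth field means the line is malformed
    }
    Some(parts)
}

// True when `part` lies inside the memory owned by `whole`,
// which means the field was borrowed and not copied.
fn points_into(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}
```

Every field returned by split_seven passes the points_into check against the original line, so the fields really are just windows into the one big string.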

All Done With Reading

Here is the profile of the reading:

buffer: 90316528 - 29 ms
str_val: 45158266 - 63 ms
objects: 100000 - 25 ms
total read: 117 ms

Smokin'!

Object Output I/O: Objects -> String

		// Compute a big string containing all the records
		let mut big_str: String = String::with_capacity(str_val.len());
		for obj in objects {
			big_str += obj.first;
			big_str += ",";
			big_str += obj.last;
			big_str += ",";
			big_str += obj.address1;
			big_str += ",";
			big_str += obj.address2;
			big_str += ",";
			big_str += obj.city;
			big_str += ",";
			big_str += obj.state;
			big_str += ",";
			big_str += obj.zip;
			big_str += "\n";
		}

This matches the relatively fast C++ benchmark application's output code.

String Output I/O: String -> Buffer

		// Get an iterator of Unicode 16-bit values over the big string
		let big_char_buffer = big_str.encode_utf16(); // lazy iterator: no encoding work happens here
		
		// Encode to 16-bit values and split each one into two little-endian bytes
		let mut big_output_buffer = Vec::<u8>::with_capacity(big_str.len() * 2);
		for c in big_char_buffer {
			big_output_buffer.push(c as u8);
			big_output_buffer.push((c >> 8) as u8);
		}

Writing loops in 2023 seems so 2000s; I could not find anything off the shelf. There should be a way, though.
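One loop-free possibility uses only standard library iterator adapters. This is a sketch with a made-up function name; it has not been timed against the benchmark, so no claim that it is faster.

```rust
// Sketch: UTF-16LE-encode a string without an explicit loop.
fn string_to_utf16le(text: &str) -> Vec<u8> {
    text.encode_utf16()
        // to_le_bytes() turns each u16 into a [u8; 2], low byte first
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}
```

Passing an array to flat_map by value needs Rust 1.53 or later. The result for "Hi" is the four bytes 0x48, 0x00, 0x69, 0x00.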

File Output I/O: Buffer -> File

File::create(output_file_path).unwrap().write_all(&big_output_buffer).unwrap();

Another fun file I/O one-liner.

Dissecting (Bad) Output Performance

That fun Stopwatch code yields these tea leaves about where time is spent in the output side of things:

big_str: 12 ms
big_char_buffer: 0 ms
big_output_buffer: 60 ms
output_file: 123 ms
total write: 195 ms

big_str seems fine. big_char_buffer looks surprisingly thrifty, but only because encode_utf16() returns a lazy iterator: no encoding happens until the loop that fills big_output_buffer pulls values out of it. So the 60 ms charged to big_output_buffer covers both encoding to UTF-16 and splitting each u16 into u8s, and that costs a lot. It's nearly identical to the cost of reading the buffer into the string on the read side. In C/C++, you can take a wchar_t* and just say it's a uint8_t* and then, presto! it's bytes! Rust is not a fan of that sort of cavalier memory voodoo. And in Rust, strings are stored in memory as UTF-8, so there are no 16-bit values sitting around to reinterpret in the first place. UTF-8 may seem like the lingua franca of the internet, but a lot of the world's text is more compact in a 16-bit encoding, so I don't understand that language decision; it seems colorblind and inefficient to me. Hmmm.
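The size trade-off between the two encodings is easy to measure. Here is a small sketch, with a hypothetical function name, that reports how many bytes a piece of text takes each way.

```rust
// Returns (UTF-8 byte count, UTF-16 byte count) for a piece of text.
fn encoded_sizes(text: &str) -> (usize, usize) {
    // str::len() is the UTF-8 byte length; each UTF-16 code unit is 2 bytes.
    (text.len(), text.encode_utf16().count() * 2)
}
```

For ASCII-only text like "Main St", UTF-8 is half the size (7 bytes against 14). For a CJK string like "東京", UTF-16 wins (4 bytes against 6). So which encoding is thriftier depends on the data.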

Conclusion and Points of Interest

I hope you have picked up enough Rust exposure to be interested in learning a lot more about it. The performance benchmark comparison showed Rust flexing its muscles reading data, but seeming a bit tired when it comes to writing data. This benchmark brought out the good and the bad; on balance I think it's good enough to use Rust for my future projects.

I'm interested in your first blush impression of Rust code and your thoughts on how it went with programming and measuring the benchmark and interpreting the results.

History

5th January, 2023: Initial version