Reading parquet from scratch

Parquet has been a part of my data toolkit for the past 8 years, but despite using it extensively, I realized that I didn’t really understand the inner workings of the format.

To fill this gap, I decided to learn about the intricacies of the parquet format and build my own reader from scratch. Let’s dive into the format and build a simple parquet reader, step by step.

I’m going to assume that you have some familiarity with Parquet. If Parquet is new to you, here’s an overview of the motivation behind the format.

A little help from another format

The Parquet format leverages the Thrift format to store metadata. Thrift lets you define a schema for your data and automatically generates efficient serializers and deserializers in multiple programming languages.

Lucky for us, someone has already gone through the trouble of taking the official Apache Parquet metadata specification and generating a Thrift client for Rust called parquet-format-safe. We’ll use this client to deserialize metadata from a Parquet file.

Read the Footer

To start reading data from a Parquet file, we need some information about the data, like the location of the columns, their types, and how many values are in the file. We call this information metadata and, contrary to what you might expect, Parquet places this incredibly valuable information in a footer at the end of the file.

In fact, many file formats place metadata at the end of the file. It allows writers to collect information as they go and write it all in one pass at the end.

In addition to the column information, the footer ends with two 4-byte regions: the footer length, stored as a little-endian integer, followed by the magic number. The magic number validates that the file is a Parquet file, and the footer length tells us how many bytes we need to read for the footer.
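To make the layout concrete, here is a minimal sketch that pulls the footer length out of the tail of a file. The function name `parse_trailer` and the sample bytes are made up for illustration; only the layout (a 4-byte little-endian length followed by "PAR1") comes from the format.

```rust
/// The last 8 bytes of a Parquet file: a 4-byte footer length, then "PAR1".
const MAGIC_NUMBER: [u8; 4] = *b"PAR1";

/// Hypothetical helper: returns the footer length if `tail` ends with a
/// valid trailer, or `None` if it is too short or the magic number is wrong.
fn parse_trailer(tail: &[u8]) -> Option<u32> {
    // We need at least 8 bytes: footer length + magic number.
    let start = tail.len().checked_sub(8)?;
    let trailer = &tail[start..];

    if trailer[4..] != MAGIC_NUMBER {
        return None;
    }

    // The footer length is a little-endian u32.
    Some(u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]))
}

fn main() {
    // A made-up tail: 3 bytes of "metadata", a length of 3, then the magic number.
    let tail = [0xAA, 0xBB, 0xCC, 3, 0, 0, 0, b'P', b'A', b'R', b'1'];
    println!("{:?}", parse_trailer(&tail));
}
```

Returning an `Option` keeps the sketch short; a real reader would probably want a proper error that says which check failed.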

Reading the footer presents us with a bit of a chicken-and-egg problem. We need to know how big the footer is in order to read it, but the footer length is stored … in the footer! It would be wasteful to read the 4 bytes of footer length and then do a second read for the footer, because I/O operations usually read a whole page of bytes even if you’re only interested in a small portion.

To optimize performance and reduce the number of I/O operations, we’re taking an optimistic approach! We’ll start by reading a chunk of bytes that typically includes the footer, then hold our breath as we check the footer length. If it’s shorter than the initially read block, we’ve already captured the footer data! But if the length is longer, we’ll have to go back for more and perform a second read. Fingers crossed for that first try!

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Our optimistic starting point
const STARTING_READ_SIZE: usize = 1024;

// The magic number of a parquet file should be "PAR1"
const MAGIC_NUMBER: [u8; 4] = [b'P', b'A', b'R', b'1'];

let mut reader = File::open(file_path)?;
// The buffer needs a length (not just a capacity) for `read_exact` to fill it.
// This assumes the file is at least `STARTING_READ_SIZE` bytes long.
let mut buffer = vec![0u8; STARTING_READ_SIZE];

// Read the last `STARTING_READ_SIZE` bytes of the file
reader.seek(SeekFrom::End(-(STARTING_READ_SIZE as i64)))?;
reader.read_exact(&mut buffer)?;

// Check if the last four bytes of the buffer match the magic number.
// This validates that we are working with a parquet file.
if buffer[buffer.len() - 4..] != MAGIC_NUMBER {
    return Err(io::Error::new(io::ErrorKind::InvalidData, "Invalid Parquet File").into());
}

// Grab the footer length, it's in the second to last 4 bytes in the buffer.
let start_offset = buffer.len() - 8;
let end_offset = buffer.len() - 4;
let file_metadata_size = u32::from_le_bytes(
    buffer[start_offset..end_offset].try_into()?
) as usize;

let metadata: &[u8];

// The buffer also holds the 8 trailing bytes, so account for them here.
if file_metadata_size + 8 > buffer.len() {
    // Our optimistic guess was off.
    // Resize the buffer to the footer size, seek to the start of the
    // footer and re-read the metadata.
    buffer.resize(file_metadata_size, 0);
    reader.seek(SeekFrom::End(-8 - (file_metadata_size as i64)))?;
    reader.read_exact(&mut buffer)?;
    metadata = &buffer;
} else {
    // Our guess encompasses the footer!
    // We just need to select the slice of bytes that contains the footer,
    // leaving out the trailing 8 bytes (footer length + magic number).
    let footer_end = buffer.len() - 8;
    metadata = &buffer[footer_end - file_metadata_size..footer_end];
}
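If you want to try the optimistic read without a real file, here is a rough, self-contained variant that works on any `Read + Seek` source, including an in-memory `Cursor`. The names `read_footer` and `fake_file` are hypothetical, and the fake file only imitates the outer layout of a Parquet file (magic number, some bytes, footer, length, magic number). Unlike the snippet above, this sketch also copes with files shorter than the starting read size.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

const STARTING_READ_SIZE: u64 = 1024;
const MAGIC_NUMBER: [u8; 4] = *b"PAR1";

/// Hypothetical helper: returns the raw footer bytes from a seekable reader.
fn read_footer<R: Read + Seek>(reader: &mut R) -> io::Result<Vec<u8>> {
    let invalid = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidData, msg);

    let file_len = reader.seek(SeekFrom::End(0))?;
    if file_len < 8 {
        return Err(invalid("file too small"));
    }

    // Optimistic read: never ask for more bytes than the file holds.
    let read_size = STARTING_READ_SIZE.min(file_len);
    let mut buffer = vec![0u8; read_size as usize];
    reader.seek(SeekFrom::Start(file_len - read_size))?;
    reader.read_exact(&mut buffer)?;

    // Split off the trailing 8 bytes: footer length + magic number.
    let (rest, trailer) = buffer.split_at(buffer.len() - 8);
    if trailer[4..] != MAGIC_NUMBER {
        return Err(invalid("Invalid Parquet File"));
    }
    let footer_len =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as u64;
    if footer_len + 8 > file_len {
        return Err(invalid("footer length exceeds file size"));
    }

    if footer_len as usize <= rest.len() {
        // The first read already covers the footer.
        Ok(rest[rest.len() - footer_len as usize..].to_vec())
    } else {
        // The guess was too small, so go back for a second read.
        let mut footer = vec![0u8; footer_len as usize];
        reader.seek(SeekFrom::Start(file_len - 8 - footer_len))?;
        reader.read_exact(&mut footer)?;
        Ok(footer)
    }
}

/// Builds an in-memory stand-in for a file with the given footer bytes.
fn fake_file(footer: &[u8]) -> Cursor<Vec<u8>> {
    let mut bytes = b"PAR1".to_vec();
    bytes.extend_from_slice(&[0u8; 16]); // pretend column data
    bytes.extend_from_slice(footer);
    bytes.extend_from_slice(&(footer.len() as u32).to_le_bytes());
    bytes.extend_from_slice(b"PAR1");
    Cursor::new(bytes)
}

fn main() -> io::Result<()> {
    let small = read_footer(&mut fake_file(&[7u8; 10]))?; // one read is enough
    let large = read_footer(&mut fake_file(&[9u8; 5000]))?; // needs a second read
    println!("{} {}", small.len(), large.len());
    Ok(())
}
```

Being generic over the reader is a design choice for testability: both branches can be exercised with a `Cursor` instead of files on disk.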

Now that we have a buffer containing the bytes of the footer, we can use the Thrift client to deserialize it into a FileMetaData struct.

// It's ok, I don't understand Thrift either.
// `max_size` is a limit of our choosing on how many bytes the decoder may allocate.
let mut protocol = TCompactInputProtocol::new(metadata, max_size);
let file_metadata = FileMetaData::read_from_in_protocol(&mut protocol)
                    .unwrap();
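For the curious, here is a tiny peek under the hood. Thrift’s compact protocol stores integers as zigzag-encoded variable-length integers (varints), which is part of why the footer stays small. The sketch below is not how parquet-format-safe is written; `read_varint` and `zigzag_decode` are illustrative names for the two decoding steps.

```rust
/// Reads an unsigned LEB128 varint from the front of `bytes`.
/// Returns the decoded value and the number of bytes consumed.
fn read_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A 64-bit value never needs more than 10 varint bytes.
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        // The low 7 bits carry data, least significant group first.
        value |= u64::from(byte & 0x7F) << (7 * i);
        // A clear high bit marks the last byte of the varint.
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Undoes zigzag encoding, which maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

fn main() {
    // 0xAC 0x02 is the varint for 300, which zigzag-decodes to 150.
    if let Some((raw, consumed)) = read_varint(&[0xAC, 0x02]) {
        println!("raw={} decoded={} bytes={}", raw, zigzag_decode(raw), consumed);
    }
}
```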

Find the first Page in the Row Group

Parquet organizes data in a hierarchical manner. The largest unit of organization is called a Row Group. A Row Group stores a batch of rows in a columnar layout: within the group, the values of each column are stored together in a unit called a Column Chunk.

Column Chunks are made up of one or more Pages. Each Page is preceded by a Page Header that contains important information such as the page type, the encoding, and the compressed and uncompressed sizes of the data.

To access the data from the first Row Group, we need to navigate through the hierarchical structure of the Parquet file and locate the first Column Chunk. From there, we can access the Page Header and retrieve the necessary information to start reading data.

Here are the steps we will be taking:

Row Group -> Column Chunk -> (Page Header + Data Pages)

// Select the first row group
let row_group = file_metadata.row_groups.get(0);

// Select the first column chunk in the group
let column = match row_group {
    Some(rg) => rg.columns.get(0),
    None => None,
};

// Get the metadata of the column chunk
let meta_data = match column {
    Some(col) => col.meta_data.as_ref(),
    None => None,
};

// Read the offset of the first page of data.
// The trailing `?` unwraps the Result so we are left with a plain u64.
let page_offset = match meta_data {
    Some(meta) => Ok(meta.data_page_offset as u64),
    None => Err(Error::new(ErrorKind::InvalidData, "Page offset not found")),
}?;
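The same walk down the hierarchy can be written as a single chain of `Option` combinators. The sketch below uses simplified stand-in structs so that it runs on its own; they only mirror the handful of fields used here and are not the real generated Thrift types, and `first_page_offset` is a name I made up.

```rust
use std::io::{Error, ErrorKind};

// Simplified stand-ins for the generated Thrift structs (hypothetical:
// the real ones carry many more fields).
struct ColumnMetaData {
    data_page_offset: i64,
}
struct ColumnChunk {
    meta_data: Option<ColumnMetaData>,
}
struct RowGroup {
    columns: Vec<ColumnChunk>,
}
struct FileMetaData {
    row_groups: Vec<RowGroup>,
}

/// Row Group -> Column Chunk -> column metadata -> page offset, in one chain.
fn first_page_offset(file_metadata: &FileMetaData) -> Result<u64, Error> {
    file_metadata
        .row_groups
        .first()
        .and_then(|rg| rg.columns.first())
        .and_then(|col| col.meta_data.as_ref())
        .map(|meta| meta.data_page_offset as u64)
        .ok_or_else(|| Error::new(ErrorKind::InvalidData, "Page offset not found"))
}

fn main() {
    let file_metadata = FileMetaData {
        row_groups: vec![RowGroup {
            columns: vec![ColumnChunk {
                meta_data: Some(ColumnMetaData { data_page_offset: 4 }),
            }],
        }],
    };
    println!("{:?}", first_page_offset(&file_metadata));
}
```

Each `and_then` short-circuits to `None` as soon as a level is missing, so the error is produced in exactly one place.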

The page_offset tells us the location of the Page Header within the file. We can now seek to the page_offset and read the Page Header at that location.

reader.seek(SeekFrom::Start(page_offset))?;
// More Thrift shenanigans.
// Pass `&mut reader` so we can keep using the reader after the header is parsed.
let mut protocol = TCompactInputProtocol::new(&mut reader, max_size);
let page_header = PageHeader::read_from_in_protocol(&mut protocol)
                  .unwrap();

Read a Column from the Row Group

We’re on the final stretch of our journey. The Page Header contains the information we need in order to read data from the Column Chunk. Column data is stored right after the Page Header in a compressed form.

The Page Header tells us how many compressed bytes there are, and the Column Chunk metadata we read earlier records which compression codec was used. Let’s use this information to read the data.

We’ll start by reading the compressed data into a buffer.

let compressed_size = page_header.compressed_page_size as usize;
let mut compressed_buffer = Vec::with_capacity(compressed_size);

// Read the compressed bytes.
// `by_ref` borrows the reader so that `take` doesn't consume it.
let bytes_read = match reader
    .by_ref()
    .take(compressed_size as u64)
    .read_to_end(&mut compressed_buffer)
{
    Ok(bytes) => bytes,
    Err(err) => return Err(Box::new(err)),
};

//Validate that bytes_read matches the expected compressed bytes.
if bytes_read != compressed_size {
    return Err(
        Box::new(std::io::Error::new(std::io::ErrorKind::Other, 
        "Unable to read all compressed bytes in Page"))
    );
}

Next we’ll decompress the column data. For simplicity I’ve chosen the Zstd compression library because I know my parquet file uses Zstd compression.

let uncompressed_size = page_header.uncompressed_page_size as usize;
let mut decompressed_buffer = vec![0u8; uncompressed_size];

// Use the Zstd library to decompress the bytes we read
let mut decoder = zstd::Decoder::new(&*compressed_buffer)?;
decoder.read_exact(&mut decompressed_buffer)?;

We now have a decompressed page of column data. The page starts with the repetition levels, followed by the definition levels, and the encoded values come last. Since we’re reading a flat column, and assuming it has no null values, we won’t be utilizing the levels, so we can skip these sections.
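As an aside, here is a hedged sketch of a more explicit way to skip a levels section. It assumes a version 1 data page for a flat, optional column, where the definition levels are written as a block prefixed with its length as a 4-byte little-endian integer and no repetition levels are stored. `skip_definition_levels` is a hypothetical helper, and the sample bytes are made up.

```rust
/// Hypothetical helper: given a decompressed page that starts with a
/// length-prefixed definition-levels block, returns the bytes after it.
fn skip_definition_levels(page: &[u8]) -> Option<&[u8]> {
    // The first 4 bytes hold the size of the levels block (little endian).
    let prefix = page.get(..4)?;
    let levels_len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;

    // Everything after the prefix and the block is value data.
    let values_start = levels_len.checked_add(4)?;
    page.get(values_start..)
}

fn main() {
    // A made-up page: a 2-byte levels block, then three bytes of "values".
    let page = [2, 0, 0, 0, 0xAA, 0xBB, 1, 2, 3];
    println!("{:?}", skip_definition_levels(&page));
}
```

For a required (non-nullable) flat column there would be no levels block at all, so check the schema before applying something like this.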

The remaining part of the page contains the encoded column data. In this example, I’m reading plain integer data stored in a 64-bit little endian format.

let num_values = page_header.data_page_header.unwrap().num_values as usize;

// Calculate the offset to skip over the repetition and definition levels.
// This relies on every value being present (no nulls), so the values take up
// exactly `num_values * 8` bytes at the end of the page.
let values_start = uncompressed_size - (num_values * 8);
let values_buffer = &decompressed_buffer[values_start..];

// Loop through 8 bytes at a time and decode them from little endian.
// Parquet's INT64 type is signed, so we decode into an i64.
for chunk in values_buffer.chunks_exact(8) {
    let value = i64::from_le_bytes(chunk.try_into()?);
    println!("{:?}", value);
}
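Here is the same decoding step as a small standalone function with bounds checks, so a corrupt `num_values` yields `None` instead of a panic. It keeps the assumption used above (plain-encoded 64-bit values with no nulls, sitting at the end of the page); `decode_plain_i64` is a hypothetical name, and the page bytes in `main` are made up.

```rust
/// Hypothetical helper: decodes `num_values` little-endian i64 values from
/// the end of a decompressed page, or returns `None` if they cannot fit.
fn decode_plain_i64(page: &[u8], num_values: usize) -> Option<Vec<i64>> {
    // Guard against overflow and against pages that are too short.
    let values_len = num_values.checked_mul(8)?;
    let values_start = page.len().checked_sub(values_len)?;

    let values = page[values_start..]
        .chunks_exact(8)
        .map(|chunk| {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(chunk);
            i64::from_le_bytes(bytes)
        })
        .collect();
    Some(values)
}

fn main() {
    // A made-up page: 2 bytes standing in for levels, then the values 1 and -2.
    let mut page = vec![0xFF, 0xFF];
    page.extend_from_slice(&1i64.to_le_bytes());
    page.extend_from_slice(&(-2i64).to_le_bytes());
    println!("{:?}", decode_plain_i64(&page, 2));
}
```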

And there we go! We’ve read a page of column data and printed it out!

Further reading

In order to keep these examples simple, I’ve skipped over some details. The Parquet format is a rich and complex beast. Here are some topics that you might like to follow up on:

The full code for this toy reader lives here https://github.com/damnMeddlingKid/parquet-reader-rs