An intro to nom

parsing made easy for Rustaceans

Rust Dublin Meetup 2024-03-06

Roberto Gambuzzi - Luciano Mammino

loige.link/nom-rs

🚀 Principal Software Engineer here and there

Let's connect! linktr.ee/gambuzzi

Hello, I'm Roberto...

💤 Lazy by nature, Pythonic by choice

🦀 Old enough to be rusted

👋 I'm Luciano (🇮🇹🍕🍝🤌)

👨‍💻 Senior Architect @ fourTheorem

nodejsdp.link

📔 Co-Author of Node.js Design Patterns 👉

Let's connect! linktr.ee/loige

loige.link/nom-rs

What the heck is a parser? 🤓

loige.link/nom-rs

What the heck is a parser? 🤓

A program that can turn text (or bytes) into structured information

loige.link/nom-rs

What the heck is a parser? 🤓

A program that can turn text (or bytes) into structured information

"Point: (22, 17, -11)"

struct Point3D {
  x: i32,
  y: i32,
  z: i32,
}

loige.link/nom-rs

Naive approach

String split like a mad-man 🤪

[label, remainder] = split(input, ": ", 2)

input = "Point: (22, 17, -11)"

label

remainder

Naive approach

String split like a mad-man 🤪

[label, remainder] = split(input, ": ", 2)

input = "Point: (22, 17, -11)"

[, remainder] = split(remainder, "(", 2)

Naive approach

String split like a mad-man 🤪

[label, remainder] = split(input, ": ", 2)

input = "Point: (22, 17, -11)"

[, remainder] = split(remainder, "(", 2)

[numbers,] = split(remainder, ")", 2)

Naive approach

String split like a mad-man 🤪

[label, remainder] = split(input, ": ", 2)

input = "Point: (22, 17, -11)"

[, remainder] = split(remainder, "(", 2)

[numbers,] = split(remainder, ")", 2)

[x, y, z] = split(numbers, ", ", 3)

Problems with this approach

Not always suitable (you need "delimiters")
Simple for simple stuff, increasingly hard for "realistic" use cases
Hard to handle errors consistently

Using a Regex 🧠

Point: (22, 17, -11)

/ /

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: /

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \(/

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((-?\d+)/

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((-?\d+), (-?\d+)/

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((-?\d+), (-?\d+), (-?\d+)/

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Capture groups

Using a Regex 🧠

Point: (22, 17, -11)

/^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$/

Named capture groups

$ cargo add regex

yes, the Rust standard library doesn't have built-in support for regex (yet)! 🤷‍♀️

use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}

use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}

use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}

Problems with this approach

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

— Jamie Zawinski

Problems with this approach

Regex are hard to debug:
- it's either a full match (with captures)
- or nothing!
Learning to write regex takes a lot of practice (possibly years!)
Regex might not be suitable for more advanced parsing cases
- E.g. nested structures

Consuming parser

Tries to "eat and match" the source text (or bytes)

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

Parsed data

Remainder string

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

4. readIntTuple(remainder)

[22,17,-11], ""

Consuming parser

Handling errors: if we fail to match we have to return an error

input = "Hello World"

readNumber(input)

"Cannot match number on 'Hello World'"

$ cargo add nom

fn main() {
    let s = "Point: (22, 17, -11)";
    
    let (_, point) = parse_point(s).unwrap();
    println!("{:?}", point);
}

fn parse_point(input: &str) -> IResult<&str, Point3D> {
    // ... parsing logic goes here

    Ok((input, Point3D { x, y, z }))
}

the value to parse from

A nom Result type

Input type

Output type

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    
    // ...

    Ok((input, Point3D { x, y, z }))
}

nom combinator: exact match

what to match

the value to match against

If it fails to match

we propagate the error

the parsed value

(in this case "Point: ")

the remainder string

Point: (22, 17, -11)

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}

Point: (22, 17, -11)

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}

combinator that parses a sign (+ or -) and sequence of digits and converts them to a i32

first coordinate

Point: (22, 17, -11)

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;
	let (input, _) = tag(", ")(input)?;
    let (input, y) = i32(input)?;
    let (input, _) = tag(", ")(input)?;
    let (input, z) = i32(input)?;
    let (input, _) = tag(")")(input)?;

    Ok((input, Point3D { x, y, z }))
}

Point: (22, 17, -11)

With nom you generally break parsers into smaller and reusable parsers 🤏

/// parses "," followed by 0 or more spaces
fn separator(input: &str) -> IResult<&str, ()> {
    let (input, _) = pair(tag(","), space0)(input)?;
    Ok((input, ()))
}

nom combinator that applies 2 parsers in sequence and returns a tuple with the 2 parsed values

/// parses 3 numbers separated by a separator (e.g. "22, 17, -11")
fn parse_coordinates(input: &str) -> IResult<&str, (i32, i32, i32)> {
    let (input, (x, _, y, _, z)) = tuple((
        i32,
        separator,
        i32,
        separator,
        i32
    ))(input)?;

    Ok((input, (x, y, z)))
}

nom combinator that applies a sequence of parsers in sequence and returns a tuple with the parsed values

Note how we are reusing the parser we just created!

fn parse_point2(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    let (input, (x, y, z)) = delimited(
        tag("("),
        parse_coordinates,
        tag(")")
    )(input)?;
    
    Ok((input, Point3D { x, y, z }))
}

We already learned about these useful nom combinators:

tag: exact string match
i32: parse a number from text as i32
space0: matches 0 or more spaces
pair: apply 2 parsers in sequence
tuple: apply N parsers in sequence
delimited: combines 3 parsers to extract a value wrapped around 2 delimiters

nom combinators official cheatsheet

Let's learn more by trying to parse RESP
the Redis Serialization Protocol 🤠

RESP Primer (spec)

Binary protocol that uses control sequences encoded in standard ASCII
The \r\n (CRLF) is the protocol's terminator, which always separates its parts
The first byte in a payload identifies its type. Subsequent bytes constitute the type's contents.
Different data types (simple, bulk, or aggregate)

RESP Primer (spec)

RESP data type	First byte
Simple strings	+
Simple Errors	-
Integers	:
Bulk strings	$
Arrays	*
Nulls	_
Booleans	#
Doubles	,
Big numbers	(
Bulk errors	!
Verbatim strings	=
Maps	%
Sets	~
Pushes	>

Example messages

+OK\r\n

-ERR unknown command 'asdf'\r\n

:1000\r\n

$5\r\nhello\r\n

*2\r\n$5\r\nhello\r\n$5\r\nworld\r\n

,1.23\r\n

(3492890328409238509324850943850943825024385\r\n

%2\r\n+first\r\n:1\r\n+second\r\n:2\r\n

Simple string

Simple error

Integer

Bulk string

Array

Double

Big number

Map

["hello", "world"]

{"first": 1, "second": 2}

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum Value<'a> {
    SimpleString(&'a str),
    SimpleError(&'a str),
    Integer(i64),
    BulkString(&'a str),
    Array(Vec<Value<'a>>),
    Null,
    Boolean(bool),
    Double(String),
    BigNumber(&'a str),
    BulkError(&'a str),
    VerbatimString(&'a str, &'a str),
    Map(Vec<Value<'a>>, Vec<Value<'a>>),
    Set(BTreeSet<Value<'a>>),
    Pushes(Vec<Value<'a>>),
}

Recursive data type

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    let (input, type_char) = one_of("+-:$*_#,(!=%~>")(input)?;
    let parser = match type_char {
        '+' => parse_simple_string,
        '-' => parse_simple_error,
        ':' => parse_integer,
        '$' => parse_bulk_string,
        '*' => parse_array,
        '_' => parse_null,
        '#' => parse_bool,
        ',' => parse_double,
        '(' => parse_bignumber,
        '!' => parse_bulk_error,
        '=' => parse_verbatim_string,
        '%' => parse_map,
        '~' => parse_set,
        '>' => parse_push,
        _ => unreachable!("Invalid type"),
    };

    parser(input)
}

Alt-ernative approach 🙄

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    alt((
        parse_simple_string,
        parse_simple_error,
        parse_integer,
        parse_bulk_string,
        parse_array,
        parse_null,
        parse_bool,
        parse_double,
        parse_bignumber,
        parse_bulk_error,
        parse_verbatim_string,
        parse_map,
        parse_set,
        parse_pushes,
    ))(input)
}

fn crlf(input: &str) -> IResult<&str, &str> {
    tag("\r\n")(input)
}

fn parse_simple_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("+")(input)?;
    let (input, value) = terminated(
        take_while(|c| c != '\r' && c != '\n'), 
        crlf
    )(input)?;
    Ok((input, Value::SimpleString(value)))
}

+Simple String Example\r\n

fn parse_bulk_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("$")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, value) = terminated(
        take(length as usize),
        crlf
    )(input)?;
    Ok((input, Value::BulkString(value)))
}

$19\r\nBulk String Example\r\n

fn parse_array(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("*")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, values) = count(parse_value, length as usize)(input)?;
    Ok((input, Value::Array(values)))
}

*2\r\n$5\r\nExample\r\n$5\r\nArray\r\n

Other useful nom combinators:

one_of: match one char from a list of allowed ones
terminated: applies a parser and then a second one on the remaining input but returns only the value of the first parser.
eof: checks that the remaining input has length 0
alt: tries a sequence of parsers until one matches
take_while: take characters until a given condition matches
take: takes N characters
count: repeats a parser N times and collects the results in a vector.

loige.link/nom-playlist

Can we parse binary file formats? 🤔

UINT8[80]    – Header                 -     80 bytes
UINT32       – Number of triangles    -      4 bytes
foreach triangle                      - 50 bytes:
    REAL32[3] – Normal vector             - 12 bytes
    REAL32[3] – Vertex 1                  - 12 bytes
    REAL32[3] – Vertex 2                  - 12 bytes
    REAL32[3] – Vertex 3                  - 12 bytes
    UINT16    – Attribute byte count      -  2 bytes
end

STL file format

🤷‍♂️ Did you know there's a monument of this teapot in Dublin?

use std::{fs, io};

use nom::bytes::complete::take;
use nom::multi::count;
use nom::number::complete::{le_f32, le_u16, le_u32};
use nom::sequence::tuple;
use nom::IResult;

#[derive(Debug)]
struct Triangle {
    normal_vector: [f32; 3],   // REAL32[3] - Normal vector
    vertex_1: [f32; 3],        // REAL32[3] - Vertex 1
    vertex_2: [f32; 3],        // REAL32[3] - Vertex 2
    vertex_3: [f32; 3],        // REAL32[3] - Vertex 3
    attribute_byte_count: u16, // UINT16 - Attribute byte count
}

#[derive(Debug)]
pub struct StlFile<'a> {
    header: &'a [u8],         // UINT8[80] - Header
    number_of_triangles: u32, // UINT32 - Number of triangles
                              // little endian
    triangles: Vec<Triangle>, // Variable number of triangles
}

fn parse_triangle(input: &[u8]) -> IResult<&[u8], Triangle> {
    let mut three_floats = tuple((le_f32, le_f32, le_f32));
    let (input, normal_vector) = three_floats(input)?;
    let (input, vertex_1) = three_floats(input)?;
    let (input, vertex_2) = three_floats(input)?;
    let (input, vertex_3) = three_floats(input)?;
    let (input, attribute_byte_count) = le_u16(input)?;
    Ok((
        input,
        Triangle {
            normal_vector: [normal_vector.0, normal_vector.1, 
                            normal_vector.2],
            vertex_1: [vertex_1.0, vertex_1.1, vertex_1.2],
            vertex_2: [vertex_2.0, vertex_2.1, vertex_2.2],
            vertex_3: [vertex_3.0, vertex_3.1, vertex_3.2],
            attribute_byte_count,
        },
    ))
}

pub fn parse_stl(data: &[u8]) -> IResult<&[u8], StlFile> {
    let (data, header) = take(80usize)(data)?;
    let (data, number_of_triangles) = le_u32(data)?;
    let (data, triangles) = count(parse_triangle,
                            number_of_triangles as usize)(data)?;
    Ok((
        data,
        StlFile {
            header,
            number_of_triangles,
            triangles,
        },
    ))
}

fn main() -> io::Result<()> {
    let file_path = "example.stl";
    let data: Vec<u8> = fs::read(file_path)?;
    let bin_data: &[u8] = &data;
    let (_, stl_file) = parse_stl(bin_data).unwrap();
    dbg!("{:?}", stl_file);

    Ok(())
}

Result

[src/main.rs:61:5] stl_file = StlFile {
    header: [
        34,
        74,
        101,
        119,
        101,
    ...
        0,
        0,
        0,
        0,
        0,
        0,
    ],
    number_of_triangles: 2466,
    triangles: [
        Triangle {
            normal_vector: [
                -0.0,
                0.0,
                1.0,
            ],
            vertex_1: [
                146.41861,
                -0.44428897,
                0.0,
            ],
            vertex_2: [
                151.41861,
                -0.44428897,
                0.0,
            ],
            vertex_3: [
                146.56938,
                0.4107614,
                0.0,
            ],
            attribute_byte_count: 0,
        },
        Triangle {
...

It's a wrap! 🌯

If you need to do parsing, string splitting is an option but it gets messy really quick
Regex can be a better approach, but they are hard to learn, debug, and to get right
`nom` offers a more interesting approach through parser combinators
Once you learn how to create a parser function and some of the combinators available you are good to go!
Nom also allows you to parse binary data!

Bonus content 😱

Why is it called nom? 🤔

Bonus content 😱

nom will happily take a byte out of your files :)

Alternative parser combinator crates

winnow: fork of nom. It seems to be updated more frequently and the team claims they are working to provide a better DX.
chumsky: Better error reporting and error recovery. Favors a function chaining style rather than composition.

loige.link/winnow-nom-successor

THANK YOU!

loige.link/nom-rs

Questions?

linktr.ee/gambuzzi

linktr.ee/loige

An intro to `nom`, parsing made easy for Rustaceans - Rust Dublin

By Luciano Mammino

An intro to `nom`, parsing made easy for Rustaceans - Rust Dublin

In this talk we will give you a pragmatic introduction to `nom` , a famous Rust parser combinator crate, which, in other words, is a tool that makes it easy to create parsers even for more advanced use cases. We will start by showcasing common suboptimal ways to do parsing including string splitting and using regexes. Then we will introduce nom and the main concepts it builds on, including its most commonly used functions. Finally, we will present several examples, from a few simple ones to building an entire parser for RESP, the Redis Serialization Protocol. If we do a good job, you'll probably forget about regex :)

2,129

Luciano Mammino PRO

Cloud developer, entrepreneur, fighter, butterfly maker! #nodejs #javascript - Author of https://www.nodejsdesignpatterns.com , Founder of https://fullstackbulletin.com

An intro to nom

What the heck is a parser? 🤓

What the heck is a parser? 🤓

What the heck is a parser? 🤓

Naive approach

Naive approach

Naive approach

Naive approach

Problems with this approach

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Using a Regex 🧠

Problems with this approach

Problems with this approach

Consuming parser

Consuming parser

Consuming parser

Consuming parser

Consuming parser

Consuming parser

With nom you generally break parsers into smaller and reusable parsers 🤏

Can we parse binary file formats? 🤔

Result

It's a wrap! 🌯

Bonus content 😱

Why is it called nom? 🤔

Why is it called nom? 🤔

Bonus content 😱

Alternative parser combinator crates

Links

THANK YOU!

Questions?

An intro to `nom`, parsing made easy for Rustaceans - Rust Dublin

More from Luciano Mammino