Ragel: state machine compiler

Really fast way to build really fast parser.

Intro

Recently I was working on go-syslog1 package making some changes for a private project. This package parses incoming Syslog2 records and returns structured data.

My first questions, when I opened a file https://github.com/influxdata/go-syslog/blob/develop/rfc3164/machine.go were: “What a heck is that?”, “Who is writing in Golang but in C-like style?” and “How can I read it and make any modifications here?”

Before making any modifications I have to get a bit understanding how it works. Everything is completely clear until line:97 where two things start: function machine.Parse and “goto hell”. The function is 8500 lines long and mostly consists of labels, goto statements and “magic numbers”. At that point I had no idea how to deal with it.

That is how it looks like

	{
		var _widec int16
		if (m.p) == (m.pe) {
			goto _testEof
		}
		switch m.cs {
		case 1:
			goto stCase1
		case 0:
			goto stCase0
		case 2:
			goto stCase2
		case 3:
			goto stCase3

//<skipped>

		goto tr37
	st104:
		if (m.p)++; (m.p) == (m.pe) {
			goto _testEof104
		}
	stCase104:
		if (m.data)[(m.p)] == 32 {
			goto tr39
		}
		if 33 <= (m.data)[(m.p)] && (m.data)[(m.p)] <= 126 {
			goto st105
		}
		goto tr37
	st105:
		if (m.p)++; (m.p) == (m.pe) {
			goto _testEof105
		}

// and so on...

Well, someone wrote it, right? That means I can modify it I thought and I’ve started project exploring. I quickly found another file machine.go.rl which also started as a Golang file but soon on line 23 starts some weird (at that point of time) statement

%%{
machine rfc3164;

include common "common.rl";

# unsigned alphabet
alphtype uint8;

action mark {
	m.pb = m.p
}

OK, it’s definitely not a Golang code. We have some directives rounded by %%{ and }%% and lines started with %%. I’ve seen such brackets for the first time.

Let’s try to find any mentions in this project which is using this *.rl file. A-ha, there is a makefile in this project, which refers to the file.

#bla-bla-bla
RAGEL := ragel -I common

#bla-bla-bla

rfc3164/machine.go: rfc3164/machine.go.rl common/common.rl

rfc3164/machine.go:
	$(RAGEL) -Z -G2 -e -o $@ $<
	@sed -i '/^\/\/line/d' $@
	$(MAKE) file=$@ snake2camel

Who? Ragel?

What the hell are you?

Ragel

Actually, Ragel is a really cool tool and a special language. It allows you to quickly define a state machine which converts incoming data into a well-defined structure. It’s almost like regexp but clearer and simpler.

What kind of task is Ragel good for?

It’s a list from Ragel’s homepage3:

  • Writing robust protocol implementations.
  • Parsing data formats.
  • Lexical analysis of programming languages.
  • Validating user input.

I rarely implement any protocols or make lexical analysis, so let’s see how we can parse user input, for example, phone number.

Ragen in action

What we’ll do parse?

First of all we have to define a phone format itself. Say, it will be

+1 (NPA) NXX-XXXX

where:

  • + and 1 is an optional part, i.e. any phone can start with +1, 1, or it can be without international code.
  • NPA is an area code within a range of 200..999.
  • NXX-XXXX is a subscriber number, range for N is [2-9], for X is [0-9].
  • All brackets, spaces and hyphens are optional.

More details about the format on Wikipedia.

With that said, allowable phones are:

+1 (555) 2334567
+1(555)2334567
+1   (555)           2334567
+1 (555) 233-4567
+1   (555)   233-45-67
+1   (555)   233           45-67
+1(555)233-4567
+15552334567
+1-555-233-4567
1 (555) 2334567
1(555)2334567
1   (555)           2334567
1 (555) 233-4567
1   (555)   233-45-67
1   (555)   233           45-67
1(555)233-4567
15552334567
(555) 233-4567
(555)233-4567
5552334567
555-233-4567
    555     233     45    67

For numbers without international code, it will be set as 1 by default.

Phone structure

Next, we need a structure which will store our parsed data. The structure will be filled during the parse process.

type Phone struct{
  IntCode   string
  AreaCode  string
  Number    string
}

I wouldn’t argue how it is better to store phone numbers as int or as a string, it’s no matter in our case.

Unit tests

Working with code generators more and more I can say that unit-tests are a significant part of the development. They validate generated code and produce the result we expect. In most of my cases it is faster to write such tests rather than trying to debug generated code manually. Also, small changes in the generator input can drastically change and output.

We have an input, we know what we want to get as an output, we are good to write tests. Due to our parser does two things, parses the phone itself, validates phone format and returns an error if format is incorrect. We need three groups of tests:

  • positive tests, i.e. tests which check all correct values parsed properly
  • tests which check correctness of area code and validates an error, and
  • tests which check correctness of phone number and validates an error

For unit tests I use Ginkgo library. Table tests looks like this:

//  machine_test.go

DescribeTable("Correct values",
  func() {
    phone := CurrentGinkgoTestDescription().TestText
    p, err := m.Parse([]byte(phone))
    Expect(err).NotTo(HaveOccurred())

    Expect(*p).To(MatchAllFields(Fields{
      "IntCode":  Equal("1"),
      "AreaCode": Equal("555"),
      "Number":   Equal("2334567"),
    }))
  },

  Entry("+1 (555) 2334567"),
  Entry("+1(555)2334567"),
  Entry("+1   (555)           2334567"),
  // skipped
)

DescribeTable("Area error",
  func() {
    phone := CurrentGinkgoTestDescription().TestText
    p, err := m.Parse([]byte(phone))
    Expect(p).To(BeNil())
    Expect(err).To(MatchError("invalid area code, expected 200..999"))
  },

  Entry("+1 (155) 2334567"),
  Entry("11992223344"),
  Entry("134 2223344"),
)

DescribeTable("Invid phone format",
  func() {
    phone := CurrentGinkgoTestDescription().TestText
    p, err := m.Parse([]byte(phone))
    Expect(p).To(BeNil())
    Expect(err).To(MatchError(fmt.Sprintf("invalid phone format: %s", phone)))
  },

  Entry("+1 (555) 1334567"),
  Entry("+1(555)23!4567"),
  Entry("+1(555+2334567"),
  // skipped
)

this and other code samples are in this repo4.

Ok, what’s happening in this test? We have a test table description started with DescribeTable. First parameter is a table name, second one is a function, which will be called for each Entry passed as a last parameter.

Due to each Entry must have a description as a first param, I use it as an actual value, which inside the function can be retrieved with CurrentGinkgoTestDescription().TestText call. Next, I’d expect m.Parse() returns no error for positive tests and expected errors for others. Also, for positive tests it checks actual values in structure fields. All the inputs have to be converted into the same output.

We’re not able to run these test, because we didn’t implement the parser itself and the test run fails with compile error.

Empty state machine aka parser boilerplate.

Our parser will be written in Ragel and Ragel requires a couple of things it uses during the parse process. First, we have to define a machine, then point out how some variables can be found and define places inside Go code where Ragel’s output will be written.

Here is an empty state machine,

package parser
%%{
  machine phone;

  main := any;

  write data;

  access m.;
  variable p m.p;
  variable pe m.pe;
  variable eof m.eof;
  variable data m.data;
}%%

// Machine contains variables used by Ragel auto generated code.
// See Ragel docs for details.
type Machine struct {
  // Data to process.
  data []byte
  // Data pointer.
  p int
  // Data end pointer.
  pe int
  // Curent state.
  cs  int
  // End of file pointer.
  eof int
  // Start of current date block.
  pb  int
  // Current err
  err error
}

// Phone represents parser result and will be filled by
// Raagel actions.
type Phone struct {
  IntCode   string
  AreaCode  string
  Number    string
}

// New initialized new Machine structure.
func New() *Machine {
  return &Machine{}
}

// string returns current parsed variable. Variable m.pb
// should be updated by "mark" action, while m.p is a
// current parser position inside m.data.
func (m *Machine) string() string {
  return string(m.data[m.pb:m.p])
}

// Parse takes a slice of bytes as an input an fills Phone structure in case
// input is a phone number in one of valid formats.
func (m *Machine) Parse(input []byte) (*Phone, error) {
  m.data = input
  m.pe = len(input)
  m.p = 0
  m.pb = 0
  m.err = nil
  m.eof = len(input)

  phone := &Phone{}

  %% write init;
  %% write exec;

  return phone, m.err
}

i.e. a machine which does nothing yet, but if we run Ragel:

% ragel -Z -G2 machine.rl -o machine.go

we will get a machine.go file, which is ready-to-compile, but again, it does nothing. But tests run and return test error.

Ran 30 of 30 Specs in 0.004 seconds
FAIL! -- 0 Passed | 30 Failed | 0 Pending | 0 Skipped
--- FAIL: TestRagel (0.00s)

In manchine.rl we have a mix of Golang and Ragel code. Ragel code, after Ragel compiler call will be replaced with Go code based on machine definition.

Ragel machine definition surrounded by %%{ and }%% will be used by compiler and the result will be written into lines started with write command, where:

  • write data; will be replaced with constants.
  • write init; - initialization machine code.
  • write exec; - machine execution code.

type Machine

Machine is a main structure which is used by auto-generated parser code. It contains all variables required by Ragel. About these variables see Ragel manual5.

In the machine definition there are a number of instruction:

  • access m.; defines that in auto-generated prefix m. will be used for accessing machine data. m also is a method receiver func (m *Machine) Parse()...
  • variable p m.p; and such define how to access specific variables.

Implementation

Our implementation consists of set machines and actions. Machine is defined as <name> = <expression>, where an expression can be another machine. Such a definition is used for building machines from smaller pieces and does not generate any states. Expressions also can include actions.

sp = ' ';

sp machine defines a space, because it’s used in more than one place it’s nice to have such a machine. Later, for instance, we can add here whitespaces and tabs.

hy = (sp* | '-');

hy defines a machine with any number of spaces or a hyphen and is used as a delimiter inside the phone later.

hbl = ( hy | '(' );
hbr = ( hy | ')' );

hbl and hbr define delimiters which surround area code.

int = '+'? '1' >mark %set_intcode;

int defines an international code with values 1 or +1. Question mark sign ? means the expression before is an optional.

int machine also is a first one in our implementation which uses actions: mark and set_intcode. Each action, in turn, has a point of the call, > is an entering action and % is a leaving action.

Actions themselves do something useful. set_intcode, for example, writes parsed international code values into appropriate fields of Machine structure.

action set_intcode{
  phone.IntCode = m.string()
}

if we look at m.string() function:

func (m *Machine) string() string {
  return string(m.data[m.pb:m.p])
}

we see m.string() returns a slice from the m.data array, within a range from m.pb and m.p. We know (you’ve read comments for an empty machine, right?) m.p is a data pointer and updated by Ragel, i.e., m.p is updated when every new symbol is processed.

m.p update

but m.pb is not and that’s why we need >mark action called before an expression. It updates the m.pb pointer with m.p value which points to the place inside the m.data array where the expression begins. Then when we call %set_intcode at then end of the expression, call m.data[m.pb:m.p] gives us an actual value. (Note: the idea of extracting values this way was found in go-syslog1 project mentioned at the beginning). Later we will extract all values this way.

nxx = ('2'..'9'  . digit{2});

nxx defines a range 200..999 which is used as an area code and a first part of the phone number.

area = hbl? nxx >mark %set_area @err(err_area) hbr?;

area defines an area code with separators: (NXX), -NXX- or sp* NXX sp*, hbl/hbr (delimiters) are optional, we extract value the same way as before >mark/%set_area, but here we have an additional call @err(err_area), it’s used for handling errors and err_area action will be called when first incorrect symbol found in the area part.

  number =  (nxx hy? digit{2} hy? digit{2}) >mark %set_number @err(err_phone);

number defines a phone number started with 2, with optional separators.

main := int? sp* area sp* number @set_defaults;

Final main machine links all previously defined machines together. Here instantiation operator := is used which generates a set of states representing an expression.

set_defaults action sets a default international code if one was not found.

With all machines and action specified let’s run Ragel compiler again:

% ragel -Z -G2 machine.rl -o machine.go

this time we see better picture

Running Suite: Parser Suite
===========================
Random Seed: 1611600245
Will run 31 of 31 specs

•••••••••••••••••••••••••••••••
Ran 31 of 31 Specs in 0.000 seconds
SUCCESS! -- 31 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 556.947243ms
Test Suite Passed

Conclusion

Ragel is definitely worth to consider, especially if you build parsers. It improves code long-term maintainability and readability by clean machine definitions, albeit it looks overcomplicated from the first glance.

See Also