Tsonnet #13 - Carets, columns, and clues: adding lexing error tracing
Hercules Lemke Merscher



Publish Date: Mar 28

Welcome to the Tsonnet series!

If you're just joining, you can check out how it all started in the first post of the series.

In the previous post, we added unary operations to Tsonnet.

And today, after a few days of pause, we're back to Tsonnet. It's Friday, and this is post #13 -- it would be funny if it were Friday the 13th, given we're talking about the monsters of programming language usability: dealing with errors.

Until now, Tsonnet has only had rudimentary error reporting and no error tracing. To grow its complexity, we need to start tackling errors deliberately and methodically.

Let's start with the errors encountered during the lexing phase.


Jsonnet lexing errors

Here's the output of Jsonnet for the two errors covered by the cram tests so far:

$ jsonnet samples/errors/malformed_string.jsonnet
samples/errors/malformed_string.jsonnet:1:1 Unterminated String

"oops... no end quote


$ jsonnet samples/comments/unterminated_block.jsonnet
samples/comments/unterminated_block.jsonnet:1:16 Multi-line comment has no terminating */

"this is code" /*



It contains the filename, followed by a colon, the line number, another colon, the column number, and the error message. It also partially shows the file content. The malformed string example has just one line, so it is output in its entirety. In the second example, only the first line is shown, indicating that column 16 is where the multi-line comment started but never terminated.

It could be better, but the relevant bits are there.

Tsonnet is not friendly at all yet:

$ dune exec -- tsonnet samples/errors/malformed_string.jsonnet
String is not terminated

$ dune exec -- tsonnet samples/comments/unterminated_block.jsonnet
Unterminated block comment

But that's why we're here today. This is about to change!

Adding lexing error tracing

The entrypoint of Tsonnet has been reading the entire file and passing the content around. This left the lexer blind to its context: the filename was dropped before the content ever reached it.

We need to change that. Let's pass the filename to the Tsonnet library instead, so we can use this information later:

diff --git a/bin/main.ml b/bin/main.ml
index cab8f16..6d17333 100644
--- a/bin/main.ml
+++ b/bin/main.ml
@@ -4,10 +4,7 @@ let anonymous_fun filename = input_files := filename :: !input_files
 let spec_list = []

 let run_parser filename =
-  let input_channel = open_in filename in
-  let content = really_input_string input_channel (in_channel_length input_channel) in
-  close_in input_channel;
-  match Tsonnet.run content with
+  match Tsonnet.run filename with
   | Ok stringified_json -> print_endline stringified_json
   | Error err -> prerr_endline err; exit 1

The filename can now be passed to the parse function. The lexer will operate on the IO channel rather than a plain string, and we close the channel once we're done with IO. We also need to set the filename on the lexbuf, indicating the currently open file. The new format_error function wraps the error message in a formatted message containing the filename, line, and column where the error was raised:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index defc5f3..b3b77e3 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -4,11 +4,24 @@ open Result
 let (let*) = Result.bind
 let (>>=) = Result.bind

+let format_error err (lexbuf: Lexing.lexbuf) =
+  Printf.sprintf "%s:%d:%d %s"
+    lexbuf.lex_curr_p.pos_fname
+    lexbuf.lex_curr_p.pos_lnum
+    (lexbuf.lex_curr_p.pos_cnum - lexbuf.lex_curr_p.pos_bol)
+    err
+
 (** [parse s] parses [s] into an AST. *)
-let parse (s: string)  =
-  let lexbuf = Lexing.from_string s in
-  try ok (Parser.prog Lexer.read lexbuf)
-  with | Lexer.SyntaxError err_msg -> error err_msg
+let parse (filename: string) : (expr, string) result  =
+  let input = open_in filename in
+  let lexbuf = Lexing.from_channel input in
+  Lexing.set_filename lexbuf filename;
+  let result =
+    try ok (Parser.prog Lexer.read lexbuf)
+    with | Lexer.SyntaxError err -> error (format_error err lexbuf)
+  in
+  close_in input;
+  result

 let interpret_arith_op (op: bin_op) (n1: number) (n2: number) : expr =
   match op, n1, n2 with
@@ -63,5 +76,5 @@ let rec interpret (e: expr) : (expr, string) result =
     | _ -> error "invalid binary operation")
   | UnaryOp (op, expr) -> interpret expr >>= interpret_unary_op op

-let run (s: string) : (string, string) result =
-  parse s >>= interpret >>= Json.expr_to_string
+let run (filename: string) : (string, string) result =
+  parse filename >>= interpret >>= Json.expr_to_string
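As a side note, Lexing.set_filename simply records the name in the lexbuf's current position, which is exactly what format_error reads back later. A minimal sketch (the filename "example.jsonnet" is just a placeholder):

```ocaml
(* set_filename stores the name in lex_curr_p.pos_fname,
   which format_error later reads back. *)
let () =
  let lexbuf = Lexing.from_string "{}" in
  Lexing.set_filename lexbuf "example.jsonnet";
  print_endline lexbuf.lex_curr_p.pos_fname  (* prints example.jsonnet *)
```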

As indicated by the Lexing module documentation, the difference between pos_cnum and pos_bol is the character offset within the line (i.e. the column number, assuming each character is one column wide). On its own, pos_cnum gives us the char position relative to the beginning of the file, not the line.
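To make the arithmetic concrete, here's a tiny sketch with hand-made position values (the numbers are made up purely for illustration):

```ocaml
(* column = pos_cnum - pos_bol: chars since the start of the file
   minus chars up to the start of the current line. *)
let column (p : Lexing.position) = p.pos_cnum - p.pos_bol

let () =
  (* Hypothetical position: 25 chars into the file, on a line
     that began at char 10. *)
  let p = { Lexing.pos_fname = "sample.jsonnet"; pos_lnum = 2;
            pos_bol = 10; pos_cnum = 25 } in
  Printf.printf "column %d\n" (column p)  (* prints: column 15 *)
```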

I found a small bug where the multi-line comments were not accounting for line breaks. Easily fixable by calling new_line before proceeding:

diff --git a/lib/lexer.mll b/lib/lexer.mll
index 8dca713..280e0eb 100644
--- a/lib/lexer.mll
+++ b/lib/lexer.mll
@@ -65,6 +65,6 @@ and read_string buf =
 and block_comment =
   parse
   | "*/" { read lexbuf }
-  | newline { block_comment lexbuf }
+  | newline { new_line lexbuf; block_comment lexbuf }
   | _ { block_comment lexbuf }
   | eof { raise (SyntaxError ("Unterminated block comment")) }
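For context, Lexing does not track line numbers on its own: new_line is what bumps pos_lnum and resets pos_bol to the current offset, keeping the column math above correct. A minimal sketch of its effect:

```ocaml
(* new_line increments pos_lnum and sets pos_bol to pos_cnum,
   so pos_cnum - pos_bol keeps measuring within the new line. *)
let () =
  let lexbuf = Lexing.from_string "a\nb" in
  Printf.printf "before: line %d\n" lexbuf.lex_curr_p.pos_lnum;
  Lexing.new_line lexbuf;
  Printf.printf "after:  line %d\n" lexbuf.lex_curr_p.pos_lnum
```

Without the call, every error in a multi-line comment would be reported on the line where the comment opened.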

After that, we can run dune promote to update the cram tests accordingly:

diff --git a/test/cram/comments.t b/test/cram/comments.t
index 884e99a..424b516 100644
--- a/test/cram/comments.t
+++ b/test/cram/comments.t
@@ -2,5 +2,5 @@
   "this is a string"

   $ tsonnet ../../samples/comments/unterminated_block.jsonnet
-  Unterminated block comment
+  ../../samples/comments/unterminated_block.jsonnet:12:1 Unterminated block comment
   [1]
diff --git a/test/cram/errors.t b/test/cram/errors.t
index 75db079..07e9102 100644
--- a/test/cram/errors.t
+++ b/test/cram/errors.t
@@ -1,3 +1,3 @@
   $ tsonnet ../../samples/errors/malformed_string.jsonnet
-  String is not terminated
+  ../../samples/errors/malformed_string.jsonnet:1:22 String is not terminated
   [1]

And with that, we have the line and column where the lexing error happened.

A nice improvement. Can we do better?

We can do better

We have the filename, the row, and the column number. Together, this information lets us pinpoint exactly where the error is in the source code.

A simple way of doing this is to print the file content and use the row and column to point at the exact spot.

Since we are dealing with lexing errors, the lexer stops as soon as it finds one. We can take advantage of that and plot a caret symbol right after the faulty row. The plot_caret function draws spaces followed by the caret symbol highlighting where the problem is -- we append this line to the end of the file content. The enumerate_file_content function reads the file and numbers each line. Since it performs IO, format_error now needs to return a result instead of a plain string, and we bind the result to error:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index b3b77e3..e8f6304 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -4,12 +4,48 @@ open Result
 let (let*) = Result.bind
 let (>>=) = Result.bind

-let format_error err (lexbuf: Lexing.lexbuf) =
-  Printf.sprintf "%s:%d:%d %s"
+let enumerate_file_content filename =
+  let channel = open_in filename in
+  try
+    let rec read_lines acc line_num =
+      try
+        let line = input_line channel in
+        let numbered_line = Printf.sprintf "%d %s" line_num line in
+        read_lines (numbered_line :: acc) (line_num + 1)
+      with End_of_file -> (List.rev acc, line_num)
+    in
+    let numbered_lines, line_num = read_lines [] 1 in
+    close_in channel;
+    ok (String.concat "\n" numbered_lines, line_num)
+  with e ->
+    close_in_noerr channel;
+    error (Printexc.to_string e)
+
+let plot_caret column_size =
+  if column_size <= 0 then
+    ""
+  else
+    let buffer = Buffer.create column_size in
+    (* Fill with spaces except the last position *)
+    for _ = 1 to column_size - 1 do
+      Buffer.add_char buffer ' '
+    done;
+    (* Add caret at the end *)
+    Buffer.add_char buffer '^';
+    Buffer.contents buffer
+
+let format_error (err: string) (lexbuf: Lexing.lexbuf) : (string, string) result =
+  let* content, n = enumerate_file_content lexbuf.lex_curr_p.pos_fname in
+  let pos_cnum = lexbuf.lex_curr_p.pos_cnum - lexbuf.lex_curr_p.pos_bol in
+  let carot_padding = String.length (string_of_int n) + 1 in
+  ok (Printf.sprintf "%s:%d:%d %s\n\n%s\n %*s"
     lexbuf.lex_curr_p.pos_fname
     lexbuf.lex_curr_p.pos_lnum
-    (lexbuf.lex_curr_p.pos_cnum - lexbuf.lex_curr_p.pos_bol)
+    pos_cnum
     err
+    content
+    carot_padding (plot_caret pos_cnum)
+  )

 (** [parse s] parses [s] into an AST. *)
 let parse (filename: string) : (expr, string) result  =
@@ -18,7 +54,7 @@ let parse (filename: string) : (expr, string) result  =
   Lexing.set_filename lexbuf filename;
   let result =
     try ok (Parser.prog Lexer.read lexbuf)
-    with | Lexer.SyntaxError err -> error (format_error err lexbuf)
+    with | Lexer.SyntaxError err -> (format_error err lexbuf) >>= error
   in
   close_in input;
   result

The implementation is a bit naive and not performant for now. This is a trade-off I'm happy to make, considering the language is far from done.
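For a quick sanity check, here's the same caret logic as a standalone sketch -- using String.make instead of a Buffer, but behaviorally equivalent to the plot_caret in the diff:

```ocaml
(* Spaces up to the column, then a caret; empty for column 0. *)
let plot_caret column_size =
  if column_size <= 0 then ""
  else String.make (column_size - 1) ' ' ^ "^"

let () =
  print_endline "\"oops... no end quote";
  (* The line above is 21 chars; the error is at column 22,
     one past its end. *)
  print_endline (plot_caret 22)
```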

Now, let's run dune promote to update the cram tests:

diff --git a/test/cram/comments.t b/test/cram/comments.t
index 424b516..136fc36 100644
--- a/test/cram/comments.t
+++ b/test/cram/comments.t
@@ -3,4 +3,18 @@

   $ tsonnet ../../samples/comments/unterminated_block.jsonnet
   ../../samples/comments/unterminated_block.jsonnet:12:1 Unterminated block comment
+  
+  1 "this is code" /*
+  2 This is a block comment
+  3 .
+  4 .
+  5 .
+  6 isn't
+  7 going
+  8 to
+  9 end
+  10 ?
+  11 ?
+  12 ?
+     ^
   [1]
diff --git a/test/cram/errors.t b/test/cram/errors.t
index 07e9102..679c504 100644
--- a/test/cram/errors.t
+++ b/test/cram/errors.t
@@ -1,3 +1,6 @@
   $ tsonnet ../../samples/errors/malformed_string.jsonnet
   ../../samples/errors/malformed_string.jsonnet:1:22 String is not terminated
+  
+  1 "oops... no end quote
+                        ^
   [1]

And, ta-da! Error messages for humans.

Concluding

As I mentioned before, the implementation here is simple and naive, but I'm OK with that for the time being. We don't even have imports implemented yet. Also, this is just the tip of the iceberg: it covers only lexing errors. We still have to deal with parsing errors and, eventually, type-checking errors.

I've drawn inspiration from Elm and the blog post Compiler Errors for Humans -- it's nearly a decade old and still an inspiring read.

I want Tsonnet to have error tracing that is as human-friendly as possible. Who's with me?


Want to trace the evolution of Tsonnet? Subscribe to Bit Maybe Wise and we'll point the caret ^ right at the good stuff.
