Custom JSON Validator
lizardkinglk

Publish Date: Jun 12

Introduction

As a developer, I want to build cool apps, learn the insides of systems and understand how they behave. The goal is to become a better dev. That is how I found the Coding Challenges newsletter. It offers many build-your-own challenges with links to reference material, and each one takes a step-by-step approach in which every step describes the goals the solution needs to achieve.

Building your own JSON parser is one of them. So I tried it and managed to achieve the goals.

Below is the link to the challenge so you can check it out.

Build Your Own JSON Parser

The final target of this challenge is validating JSON, which is also known as syntax analysis, or parsing.

Resources

There are many resources provided on the challenge page. For a JSON text to be syntactically valid, it must follow the JSON grammar. The challenge page links to this grammar on the page below.

Introducing JSON

Another resource points to a book known as the 'Dragon Book', which covers compiler design in depth. Parsing is one of the early phases of compiling a language.

The challenge also provides test data, which I call the official test data. You can access it on the challenge page.

Approach

My approach was to take the test data given in each step and reach that step's goal using TDD (Test Driven Development).

I started with a console app and added functionality in each step.

Besides the official test data, each step comes with its own test data, but that was not enough for me, so I added custom test data as well.

How to do it?

To put it simply, figuring out how was an incremental process: my tests failed with each change, and debugging helped a lot. As an example, I first thought I could simply trim the JSON string, split it on commas, and validate the content of each element in the resulting array.

But that raised a problem: a split could cut through a comma that sits inside a string literal. So I had to rethink. This took a couple of iterations, and I finally came up with a solution.

That is, if I start from the first index and read every character up to the last index, I can decide at each position whether the input is still valid JSON. If any character makes the JSON invalid, I stop and return an error message to the user describing the invalid JSON found. Otherwise the JSON is valid and the program ends with success.
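As a rough illustration of why the comma split breaks and why a single left-to-right scan does not, here is a small sketch. It is written in Python and is not taken from the repository; the function name `find_structural_commas` and the sample text are made up for the example.

```python
def find_structural_commas(text):
    """Return the indexes of commas that lie outside string literals."""
    positions = []
    in_string = False   # True while the scan is inside "..."
    escaped = False     # True when the previous character was a backslash
    for index, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False          # this character is part of an escape
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False        # closing quote reached
        elif ch == '"':
            in_string = True             # opening quote reached
        elif ch == ",":
            positions.append(index)
    return positions


sample = '{"greeting": "hello, world", "n": 1}'

# Naive approach: the comma inside "hello, world" cuts the value in two.
naive_parts = sample.strip("{}").split(",")

# Scanning approach: only the comma between the two pairs is reported.
structural = find_structural_commas(sample)
```

The naive split yields three fragments for two key value pairs, while the scan reports a single separating comma because it tracks whether it is inside a string.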

Summary of steps

To complete the challenge, I followed the steps below.

1 Read the user's arguments from the Command Line

The command takes only one argument (so far): the path to the file to
validate, which can be a text file or, preferably, a json file.

2 Validate the path and read all of the content then put it to a Character array

This JSON string acts as a read-only variable throughout the execution.

I put the content into a character array because when I simply used
the json string (which I previously did) I noticed the program
using more memory, as I was passing the string jsonString
argument to many functions in the solution.

My reasoning was that strings act like value types, so each function
call allocates memory for its own copy, whereas a character array is
passed by reference and avoids the overhead of the copy operation.
(Strictly speaking, a string in C#/.NET is also a reference type and is
not copied when it is passed as an argument, so the extra memory may
have come from creating new strings, such as substrings, along the way.)

Another reason to use a character array is the ease of using indexes,
or simply direct memory access.

But there is a caveat here: a character array has a size limit, and a
very large file could exceed it.

I should handle that with a message to the user,
or find a solution such as splitting the content into several arrays and
validating them as subtasks for bigger files.
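Steps 1 and 2 can be sketched roughly as follows. This is a Python illustration, not the code from my repository; `load_json_text`, `main` and the usage message are hypothetical names chosen for the example.

```python
from pathlib import Path


def load_json_text(argv):
    """Return the content of the file named by the single path argument."""
    if len(argv) != 2:
        raise ValueError("usage: validator <path-to-json-file>")
    path = Path(argv[1])
    if not path.is_file():
        raise ValueError(f"file not found: {path}")
    # A Python str is immutable and indexable, so here it plays the role
    # of the read-only character array described above.
    return path.read_text(encoding="utf-8")


def main(argv):
    """Return 0 when the file could be read, 1 otherwise."""
    try:
        chars = load_json_text(argv)
    except ValueError as error:
        print(error)
        return 1
    print(f"read {len(chars)} characters")
    return 0
```

A real entry point would call `main(sys.argv)` and pass the loaded text on to the validation steps that follow.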

3 Validate the starting character

From the start of the content, the first character can be either

curly brace - '{'
square brace - '['
or a run of whitespace characters before one of them.

So my solution has a function named ValidateNextCharacter
that reads from a given index of the array and moves to the
target character, skipping the whitespace characters.

All of the whitespace is ignored until the target character.
Other characters can appear before it, including
backslash - '\'
alphabetical - a-z
symbols - '[!@#$%^*.]' etc.

If any of these non-whitespace characters is found first,
the input is reported as invalid json, because the target character
can only be a curly or square brace.
(VS Code accepts a bare value literal as valid json, and so does the
current JSON standard, RFC 8259, but the challenge does not, so my
solution requires an object or an array at the top level.)
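The behaviour described for ValidateNextCharacter can be sketched like this. It is a Python approximation under my own assumptions: the name `validate_next_character`, the `targets` parameter and the convention of returning -1 on failure are illustrative, not the actual signature in the repository.

```python
WHITESPACE = " \t\n\r"   # the four whitespace characters JSON allows


def validate_next_character(chars, index, targets):
    """Skip whitespace from `index`; return the index of the first
    non-whitespace character if it is one of `targets`, otherwise -1."""
    while index < len(chars) and chars[index] in WHITESPACE:
        index += 1
    if index < len(chars) and chars[index] in targets:
        return index
    return -1
```

For the starting character, `targets` would be `"{["`, so `"  {}"` yields index 2 while `"  x{"` and an all-whitespace input both yield -1.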

4 Validate JSON Object

Let's assume the starting character is for a JSON Object

scenario1
{ "key1": "value1", "key2": "value2" }

As you can see, it contains key value pairs separated by commas
and should end with a closing curly brace.
(Don't forget the whitespace)

If it contains no key value pairs, the json object should look like below.

scenario2
{}

So let's say the starting index is like below,

current starting index being 2
This is for the first scenario with two key value pairs.
The character at this starting index is double quotes - '"'

current starting index being 1
This is for the second scenario with no key values and no whitespace.
The character at this starting index is a close curly brace - '}'

So we need to figure out which scenario path to take.
In my solution there is a function called ValidateFirstOccuringCharacter
that checks which comes first: a character from a given character set
or another given character.

As an example for this scenario we need to call our method like below.

ValidateFirstOccuringCharacter(ref startIndex, [Quotes], CloseCurlyBrace)

Here, the startIndex argument is passed by ref so that it is modified in a
single place in memory, and the second argument is the character set,
an array that contains only quotes.
The last argument is the Close Curly Brace,
which is the alternative character to expect.

What we achieve by doing this is figuring out whether
the JSON Object contains key values
(in that case the function returns 0)
or
contains no key value pairs (in that case it returns 1).

Furthermore,
this skips over whitespace characters
and returns -1 if illegal characters were found.
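A minimal Python sketch of the 0 / 1 / -1 behaviour described above. Python has no `ref` parameters, so this version returns the updated index together with the result code; that return shape and the snake_case name are my assumptions for the illustration.

```python
WHITESPACE = " \t\n\r"


def validate_first_occurring_character(chars, start, first_set, second):
    """Skip whitespace from `start` and report what comes first.

    Returns (code, index) where code is
       0 if a character from `first_set` comes first,
       1 if `second` comes first,
      -1 if anything else (or the end of input) comes first.
    """
    index = start
    while index < len(chars) and chars[index] in WHITESPACE:
        index += 1
    if index == len(chars):
        return -1, index
    if chars[index] in first_set:
        return 0, index
    if chars[index] == second:
        return 1, index
    return -1, index
```

Called right after the opening brace, `validate_first_occurring_character(chars, 1, ['"'], '}')` tells the caller whether to start reading a key or to close an empty object.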

If it returns 1, that is
scenario2 - '{}'

Then we can treat the JSON as valid and hand the result
to the user with the success message.
But what if there was a scenario3 which is an extension to scenario2?
What if the JSON object looked like this?

scenario3 - ' { } '
Not a problem. Skip the whitespace and reach the open curly brace.
Check that the first occurring character is either Quotes - '"' or
Close Curly Brace - '}'

We could close this as valid JSON, but can we actually?
Without validating until the end of the file we can't be certain.
See scenario4 below

scenario4 - ' { } what? '
The word 'what?' makes the json invalid: the JSON Object has already been
closed, and nothing except whitespace may follow it.
So in my solution there is a function named
ValidateLegalTrailing

This checks that the remaining characters contain only whitespace.
So if 'what?' or any other illegal characters are found,
the solution reports the input as invalid.
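The trailing check is small enough to sketch in a few lines. This is a Python approximation, with `validate_legal_trailing` and its `index` parameter (the position just after the closing brace) as assumed names.

```python
WHITESPACE = " \t\n\r"


def validate_legal_trailing(chars, index):
    """Return True when everything from `index` to the end is whitespace."""
    return all(ch in WHITESPACE for ch in chars[index:])
```

With scenario3, `validate_legal_trailing(' { } ', 4)` is True, while scenario4, `validate_legal_trailing(' { } what? ', 4)`, is False.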

We can see that empty arrays - '[]' can be validated in the same way, and
even spacious arrays - ' [ ] ', using the functions already implemented.
Let's look into arrays with values later.

5 Validating Keys

In the previous JSON Object, if keys are found then each one must start
with double quotes. We iterate over the key value pairs until the
end of the object, that is until a Close Curly Brace - '}' is found.

Each key value pair is validated. To validate the key,
the procedure is as follows.
Iterate from the first character of the key content to the last,
looking for illegal characters.
Here the start index is the index of the opening quotes + 1 and the end
index is the index of the closing quotes - 1. See below.

       start index
       v
      "key1"
          ^
          end index

The invalid characters for keys are raw
line breaks - '\n' and tab spaces - '\t'
(the JSON grammar forbids every unescaped control character in a string).

So if any of these is found then the key is invalid.
Also, if a key contains a backslash, it must start a valid escape
sequence, which means the backslash must be followed by one of the
characters listed in the JSON reference provided in the challenge.
See below.

valid example but why put "\n" inside a key?

      "ke\ny1"

(this is interpreted as a backslash - '\' followed by the
letter 'n') I use a simple regex to match each of the
escape sequences, including unicode escapes,
that can exist within a JSON key. See below.

   ^(\\[\\""/bfnrt]{1}|(\\u[0-9a-fA-F]{4}))$

See an invalid example below.

invalid example because of line breaks '\n'

      "ke <-- line break here
       y1"

Let's see another example

invalid example because '\y' is not a valid escape character

      "ke\y1"

(this is interpreted as the '\y' escape sequence. But since '\y' is
not in the set of escape sequences given in
the JSON reference, this is an invalid JSON key.)

The same goes for tab spaces. Here a raw tab takes the place
of the line break. See below.

invalid example because of tab spaces '\t'

      "ke   y1"
         ^
         actual tabspace here

Note that the same rules apply to string literals in
the value validation.
Let's look into that later.
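The key rules from this step can be sketched as one scanning function. This is a Python illustration, not the repository code: `validate_string_content` is a hypothetical name, it rejects every raw control character rather than only line breaks and tabs, and it uses a hex class written as `[0-9a-fA-F]`.

```python
import re

# One escape sequence: a backslash followed by an allowed character,
# or \u followed by exactly four hex digits.
ESCAPE = re.compile(r'\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})')


def validate_string_content(chars, start):
    """`start` is the index just after the opening quote.

    Returns the index of the closing quote, or -1 if the string is invalid.
    """
    index = start
    while index < len(chars):
        ch = chars[index]
        if ch == '"':
            return index                      # closing quote found
        if ch < "\u0020":
            return -1                         # raw line break, tab, etc.
        if ch == "\\":
            match = ESCAPE.match(chars, index)
            if match is None:
                return -1                     # e.g. \y is not a valid escape
            index = match.end()               # jump past the whole escape
        else:
            index += 1
    return -1                                 # input ended before closing quote
```

With this sketch, `"ke\ny1"` written with a backslash and the letter n is accepted, while the same key containing a real line break or tab, or the sequence `\y`, is rejected.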

6 Validating Values

JSON values are used not only by JSON Objects
but also by JSON Arrays. They are validated by a common function that calls
separate, specific child functions in my solution.
A value starts with one of the characters (or tokens) below

   Negative        '-' for number values
   CharZero        '0' for number values
   CharOne         '1' for number values
   CharTwo         '2' for number values
   CharThree       '3' for number values
   CharFour        '4' for number values
   CharFive        '5' for number values
   CharSix         '6' for number values
   CharSeven       '7' for number values
   CharEight       '8' for number values
   CharNine        '9' for number values
   Quotes          '"' for string values
   OpenCurlyBrace  '{' for object values
   OpenSquareBrace '[' for array values
   CharNullStart   'n' for null as value
   CharTrueStart   't' for true as value
   CharFalseStart  'f' for false as value

So our ValidateNextCharacter should expect one of the characters in this array.
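As a small illustration of dispatching on the first character of a value, here is a Python sketch. The table mirrors the list above, but `VALUE_PREFIXES` and `classify_value_start` are names invented for the example, and a dictionary lookup is used only because it is convenient in Python.

```python
# Map each legal first character of a value to the kind of value it starts.
VALUE_PREFIXES = {
    "-": "number",
    '"': "string",
    "{": "object",
    "[": "array",
    "n": "null",
    "t": "true",
    "f": "false",
}
VALUE_PREFIXES.update({digit: "number" for digit in "0123456789"})


def classify_value_start(ch):
    """Return the kind of value that starts with `ch`, or None if illegal."""
    return VALUE_PREFIXES.get(ch)
```

The common validation function can then call the child function that matches the returned kind, and treat `None` as invalid JSON.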

The solution checks for an exact match with null, true or false
using a two pointers method implemented in the function
ValidateJsonValueForNullOrBool
Inside an object, these values must be followed by a Comma - ',' character
or a Close Curly Brace - '}', the latter indicating no more key-value pairs.
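A two pointers comparison of this kind might look like the following Python sketch. The name `validate_literal` and the convention of returning the index just past the literal (or -1) are assumptions for the illustration; checking the termination character is left to the caller.

```python
LITERALS = ("null", "true", "false")


def validate_literal(chars, start):
    """Match null, true or false exactly, beginning at `start`.

    Returns the index just past the literal, or -1 if there is no exact match.
    """
    if start >= len(chars):
        return -1
    for literal in LITERALS:
        if chars[start] != literal[0]:
            continue
        i, j = start, 0                  # i walks the input, j walks the literal
        while j < len(literal):
            if i >= len(chars) or chars[i] != literal[j]:
                return -1
            i += 1
            j += 1
        return i
    return -1
```

For `'{"a": tru}'` the comparison fails at the closing brace, so the value is rejected before any termination character is examined.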

For string values, as mentioned previously, it uses the same function
implemented for key validation. String values must end
with a Quote - '"' character.

JSON Objects and JSON Arrays that appear as values are validated
by separate functions, which use recursion to validate
their children. More on that later.
Just like above,
the termination character is either a Comma - ','
or a Close Curly Brace - '}' for a value found as a JSON Object,
and either a Comma - ','
or a Close Square Brace - ']' for a value found as a JSON Array.

For validating numbers, I read the content up to the termination character,
which is any of ',', '}', ']' or a whitespace character.
So the numeric value itself can be validated regardless of which
termination character follows. (The termination character and whatever
comes after it are validated by the calling parent.)
The collected numeric content is then validated by a regular expression.
The number can be a simple one or in scientific notation. All numbers
must use decimal representation, as the JSON reference does not allow
0b - binary or 0x - hexadecimal values. There is a
caveat here: when the number is very long it might cause a performance
issue, so that should be handled. The regular expression is below.

   ^[-]{0,1}(0|([1-9]{1})([0-9]{0,}))(\.[0-9]{1,}){0,1}([eE][-+]{0,1}[0-9]{1,}){0,1}$

Below are two great sources I use to practice and become
somewhat bearable at regex

RegexOne

RegExr

This regex first matches an optional minus sign; if it is not present
then the number is positive. At least one digit must follow.
i.e.
"-" as value is invalid. empty value (zero-length) for a value is invalid.

Then it checks the integer part, which is made of the digits 0 to 9
i.e.
"0" is valid, and so is "-0"

Then it checks for multiple digits.
"-01" is invalid as an integer part with more than one digit cannot
start with 0.
"-11" is valid. So is "12"

Then it checks for an optional fraction part, followed by an optional
scientific (exponent) part. A fraction part must start with a Dot - '.'
and contain at least one digit.
A scientific part must start with the Letter E - 'e', in either case,
followed by an optional sign and one or more digits.
"0.1" is valid. "12.01" is valid.
"-1.1E" is invalid. "1.12e2" is valid.
"1.12e+1" is valid. "-12.01E-1" is valid.

That concludes value validation logic.
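The number rules can be tried out with the regular expression from this step. The sketch below is in Python, and `read_number` is a hypothetical helper that collects characters up to a termination character before applying the pattern; note that the pattern as written accepts "-0", which the JSON grammar also allows.

```python
import re

WHITESPACE = " \t\n\r"
TERMINATORS = ",}]" + WHITESPACE

# The regular expression from this step, unchanged.
NUMBER = re.compile(
    r"^[-]{0,1}(0|([1-9]{1})([0-9]{0,}))(\.[0-9]{1,}){0,1}"
    r"([eE][-+]{0,1}[0-9]{1,}){0,1}$"
)


def is_valid_number(token):
    """Return True when the whole token is a valid JSON number."""
    return NUMBER.fullmatch(token) is not None


def read_number(chars, start):
    """Collect characters from `start` up to a termination character.

    Returns the index of the termination character (or the end of input)
    when the collected token is a valid number, otherwise -1.
    """
    end = start
    while end < len(chars) and chars[end] not in TERMINATORS:
        end += 1
    return end if is_valid_number(chars[start:end]) else -1
```

For example, `read_number('[1.5e3, 2]', 1)` stops at the comma and accepts the token `1.5e3`, while `read_number('[01]', 1)` rejects `01` because of the leading zero.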

7 Validate JSON Array

Validating a JSON string that is a JSON Array is a bit easier, but not
as performant as validating objects, since each element is checked against
every character in the value prefix array, which takes many
comparisons.

So I start by skipping whitespace and then check
whether an array literal was found (Open Square Brace - '['). If found,
the array validation logic starts.

Simply following the above steps, it validates each and every value
inside the array until a Close Square Brace - ']' is found.

Also, as mentioned above, if an Object or Array literal is found as
a value then a function call is made to validate its children.
To see this with deeper nesting, let's take the example below.

   {
     "key1": {
       "key1a": true,
       "key1b": {
         "key1ba": [
           1,
           null
         ]
       },
       "key1c": [
         true,
         false
       ]
     }
   }

Here "key1" has a JSON Object as its value, and inside it "key1b" also
has a JSON Object as its value.
So "key1b" calls the function to validate its children just as its
parent ("key1") did, and reports the result to the calling parent.
The child object calls the same function again (recursion)
to validate its grandchildren.
Also, the start index continues from the position where the child ended, so
the parent does not need to travel back to the position where it
first called the child, which is good.
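To show how the recursion and the continuing index fit together, here is a compact Python sketch of a whole validator. It is a simplified stand-in and not the code in my repository: all names are invented, strings and numbers are matched with regular expressions instead of character by character, and very deep nesting would hit Python's recursion limit.

```python
import re

WHITESPACE = " \t\n\r"
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")
STRING = re.compile(r'"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))*"')
LITERALS = ("null", "true", "false")


def skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i] in WHITESPACE:
        i += 1
    return i


def validate_value(s, i):
    """Validate one value starting at s[i]; return the index past it or -1."""
    if i >= len(s):
        return -1
    ch = s[i]
    if ch == "{":
        return validate_container(s, i, "}", keyed=True)
    if ch == "[":
        return validate_container(s, i, "]", keyed=False)
    if ch == '"':
        match = STRING.match(s, i)
        return match.end() if match else -1
    if ch == "-" or ch in "0123456789":
        match = NUMBER.match(s, i)
        return match.end() if match else -1
    for literal in LITERALS:
        if s.startswith(literal, i):
            return i + len(literal)
    return -1


def validate_container(s, i, close, keyed):
    """Validate an object (keyed=True) or array whose opener is at s[i]."""
    i = skip_ws(s, i + 1)
    if i < len(s) and s[i] == close:
        return i + 1                          # empty object or array
    while True:
        if keyed:
            match = STRING.match(s, i)        # the key
            if match is None:
                return -1
            i = skip_ws(s, match.end())
            if i >= len(s) or s[i] != ":":
                return -1
            i = skip_ws(s, i + 1)
        i = validate_value(s, i)              # recursion; child returns its end
        if i == -1:
            return -1
        i = skip_ws(s, i)                     # parent continues from child's end
        if i >= len(s):
            return -1
        if s[i] == close:
            return i + 1
        if s[i] != ",":
            return -1
        i = skip_ws(s, i + 1)


def is_valid_json(text):
    """Top level: an object or array, followed only by whitespace."""
    i = skip_ws(text, 0)
    if i >= len(text) or text[i] not in "{[":
        return False
    end = validate_value(text, i)
    return end != -1 and skip_ws(text, end) == len(text)
```

Each call returns the index where it stopped, so the parent simply carries on from there; this plays the same role as passing the start index by ref.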

That concludes the steps this validation process takes.

Testing

I tested all the scenarios mentioned above with XUnit.
These include default (step-wise), custom and official tests.
They are also included in the repository.
No tests were added for utility functions as they were tested manually.
I also had to turn off parallel testing and enable the collection per
assembly option in XUnit, because when executed with vscode it gave me at
least one error due to tests using shared resources (globals) internally.

Benchmarks

It takes under half a second to validate a JSON file
10MB in size, which I downloaded from the source below.

Example File

Besides the areas I still need to cover for scaling up and performance, these
results look good for a simple JSON parser/validator.

But please note that my pc sometimes performs slower under heavy load,
so the results may vary on your device.

Notes

Thanks to the team at Coding Challenges NewsLetter

Repository

jPTool

Thank you for Reading!

Keep Coding ツ
