
Writing a Scanner: Markdown to HTML

Published 25-02-2025 Updated 27-09-2025

Introduction

On the path to becoming a 'better' engineer, I stumbled upon Crafting Interpreters. This book aims to be practical rather than theory-oriented. Since The Dragon Book makes me sleepy, I decided to continue with Crafting Interpreters. In this book, we build two interpreters: one which uses an AST (Abstract Syntax Tree) walker, while the other makes use of a VM (Virtual Machine). An interpreter is a very interesting project because it comprises many different components like a scanner, parser, intermediate representation, code generation, optimisations, etc.

While I am learning about interpreters, I decided to write my own Markdown to HTML generator. There are many code repositories that solve this problem, but I decided to write one for myself. Later on, I can add custom syntax in markdown which can be converted to HTML. Hence, I am going to start with the Scanner, the first part of our generator.

What does it do?

A Scanner takes raw source code and produces something meaningful. In our case, the scanner reads the source code of the program and produces a set of tokens. It should also handle error cases, detecting misplaced or irrelevant characters and presenting them to the user with all the required information.

Goal

My goal is to write a program that converts markdown to HTML code. This would simplify my workflow, which currently goes something like this:

I want to simplify my process by automating the first 3 steps: the program should convert the markdown to HTML and add the relevant CSS. Obsidian offers more advanced features, but those will be added later.

There are many tools like Lex and Bison which would solve my problem, but where is the fun in that? Hence, I am going to write my own scanner. Later, I would like to add my own custom logic which can be specified in Obsidian and converted to the required HTML. This gives me the freedom to write my blogs in any way, which can then be converted to HTML, CSS, and JS.

Let's Begin

I will be using Go for this program. We will start with the obvious, which is reading the file:

			
func getFileData(filename string) (string, error) {
	file, err := os.Open(filename)
	if err != nil {
		return "", err
	}
	defer file.Close()

	content, err := io.ReadAll(file)
	if err != nil {
		return "", err
	}

	return string(content), nil
}
			
		

This is a simple function which opens the file and reads all the data as bytes. The bytes are converted to a string and returned.

Tokens

The contents of the input file can be classified into tokens. A token is a sequence of characters which cannot be broken down further. The scanner takes a series of characters and forms tokens. Since we are working with markdown, we do not have keywords, identifiers, etc. We need to read the content, which can contain any character, along with the special characters that format the markdown.

			
type Token struct {
	TokenType TokenType
	Lexeme    string
	Line      int
}
			
		
Scanner

The Scanner is the most important part. It takes the raw string from the user and produces a set of tokens. These tokens are then parsed and transpiled to HTML. We could use regex in a scanner, but I would like to have more control over the process. The scanner struct contains the source code, the tokens identified so far, the start and current positions used to form tokens, and the line number.

			
type scanner struct {
	source  string
	tokens  []Token
	start   int
	current int
	line    int
}
			
		

We will start with some helper functions which will make building our scanner much easier.

			
func (s *scanner) isAtEnd() bool {
	return s.current >= len(s.source)
}
			
		

isAtEnd checks whether we have reached the end of the file. Since Go strings do not carry an EOF marker, we compare the current position against the length of the source.

			
func (s *scanner) peek() rune {
	// Guard against reading past the end of the source.
	if s.isAtEnd() {
		return 0
	}
	return rune(s.source[s.current])
}
			
		

peek returns the character at the current position of the reader without consuming it.
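One caveat worth knowing about this style of peeking: indexing a Go string yields bytes, not runes, so rune(s.source[s.current]) only round-trips cleanly for ASCII input. A small self-contained demonstration (firstAsRune is a throwaway helper for illustration):

```go
package main

import "fmt"

// Indexing a Go string returns a byte (uint8), so converting a single
// indexed byte to a rune only works for ASCII characters.
func firstAsRune(s string) rune {
	return rune(s[0]) // byte-wise indexing, as in the scanner's peek()
}

func main() {
	fmt.Printf("%q\n", firstAsRune("# Hi"))  // '#'
	fmt.Printf("%q\n", firstAsRune("héllo")) // 'h', ASCII is fine
	// For a string like "étude", s[0] is only the first UTF-8 byte of
	// 'é' (0xC3), so rune(s[0]) yields 'Ã', not 'é'; a fully general
	// scanner would decode with utf8.DecodeRuneInString instead.
}
```

For markdown that is mostly ASCII punctuation this is fine, since content runs are sliced out of the source string as a whole rather than built byte by byte.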

			
func (s *scanner) match(expected rune) bool {
	// The isAtEnd check guards against indexing past the end.
	if s.isAtEnd() || rune(s.source[s.current]) != expected {
		return false
	}

	s.current += 1
	return true
}
			
		

match accepts a rune and checks it against the current rune under the reader. If there is a match, it moves the reader forward and returns true; otherwise it returns false.

			
func (s *scanner) advance() rune {
	defer func() { s.current += 1 }()
	return rune(s.source[s.current])
}
			
		

advance returns the character at the current position, then moves the reader forward. This is an important function which drives the Scanner.

			
func (s *scanner) addToken(tokenType TokenType) {
	s.tokens = append(s.tokens, NewToken(tokenType, "", s.line))
}
			
		

addToken is a helper function which makes it easy to record tokens as they are discovered.

There are three more helper functions which help us classify each character:

			
func isContentChar(c rune) bool {
	// Non-content characters: ( ) * are 40-42, [ \ ] are 91-93,
	// _ and ` are 95-96; a newline ends a content run.
	if (c >= 40 && c <= 42) ||
		(c >= 91 && c <= 93) ||
		(c >= 95 && c <= 96) ||
		c == '\n' {
		return false
	}
	return true
}

func isSpecialCharacter(c rune) bool {
	return c == '*' || c == '_'
}

func isDigit(c rune) bool {
	return c >= '0' && c <= '9'
}
		
		

These functions are used later, and I will go in-depth into them when they come up.
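To make the ASCII ranges concrete, here is a quick sanity check that reuses the isContentChar logic from above:

```go
package main

import "fmt"

// Same ranges as the scanner's isContentChar: ( ) * are 40-42,
// [ \ ] are 91-93, _ and ` are 95-96; plus the newline.
func isContentChar(c rune) bool {
	if (c >= 40 && c <= 42) ||
		(c >= 91 && c <= 93) ||
		(c >= 95 && c <= 96) ||
		c == '\n' {
		return false
	}
	return true
}

func main() {
	// Letters and '#' count as content; formatting characters do not.
	for _, c := range []rune{'a', '#', '*', '_', '`', '[', '\n'} {
		fmt.Printf("%q -> %v\n", c, isContentChar(c))
	}
}
```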

Now that all the auxiliary/helper functions are defined, it's much easier to describe how the scanner works.

Initially, the scanner starts at index 0 of the source code. We read the first character and move the reader to the next character. The first character is analysed and depending on the value, a token is generated.

			
				### Hi
				^
				start = 0, current = 0

				advance()

				### Hi
				 ^
				start = 0, current = 1

			
		

In the above example, the initial position of start and current is 0. Once the advance function is called, the first '#' is returned and current moves to the next character. Since we know headings start with a '#', we call the heading function.

			
func (s *scanner) heading() {
	for !s.isAtEnd() && s.peek() == '#' {
		s.advance()
	}

	if s.match(' ') {
		length := s.current - s.start - 1
		switch length {
		case 1:
			s.addToken(H1)
		case 2:
			s.addToken(H2)
		case 3:
			s.addToken(H3)
		case 4:
			s.addToken(H4)
		case 5:
			s.addToken(H5)
		case 6:
			s.addToken(H6)
		default:
			s.content()
		}
	} else {
		s.content()
	}
}
			
		

We use isAtEnd and peek to ensure we are reading only '#'. Once we finish reading the '#'s, we check whether they are followed by a space. If not, the line is not considered a heading. We also check whether the number of '#' is between 1 and 6; anything above that is not a valid heading. Once we are done reading the heading, a new heading Token is generated and we continue reading. I am treating any character other than the special characters as content. Hence, 'Hi' is content, and we read till '\n'.
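These heading rules can be summarised in a standalone helper. This is only an illustration of the checks, not the scanner's actual code (headingLevel is a hypothetical name):

```go
package main

import "fmt"

// headingLevel returns 1-6 for a valid heading line and 0 otherwise.
// It mirrors the checks in heading(): count the leading '#'s, require
// a following space, and reject counts above six.
func headingLevel(line string) int {
	i := 0
	for i < len(line) && line[i] == '#' {
		i++
	}
	if i == 0 || i > 6 || i >= len(line) || line[i] != ' ' {
		return 0
	}
	return i
}

func main() {
	fmt.Println(headingLevel("### Hi"))     // 3
	fmt.Println(headingLevel("####### Hi")) // 0: too many '#'s
	fmt.Println(headingLevel("#Hi"))        // 0: no space after '#'
}
```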

			
				### Hi
				    ^
				start = 4, current = 4

				advance()

				### Hi
				     ^
				start = 4, current = 5

				advance()

				### Hi\n
				      ^
				start = 4, current = 6

			
		

In markdown, a paragraph can continue on the next line or a new format(heading, code block) can be present. Hence, a newline acts as a delimiter.

			
func (s *scanner) content() {
	for !s.isAtEnd() && isContentChar(s.peek()) {
		s.advance()
	}

	s.tokens = append(s.tokens, NewToken(CONTENT, s.source[s.start:s.current], s.line))
}
			
		

We use the auxiliary function isContentChar to check each character. In a line, * can be used to format a portion of the text, and () together with [] in the form "[]()" can be used to add links. '\' can be used to escape characters. Hence, the function checks the ASCII values of these characters; more escapable characters will be added later on. Once we reach the end of the content, we store the content itself along with the other details of the token. This can be used by our parser and transpiler to produce the required HTML.

The most important function, scanTokens, decides what token to produce for each character it reads. It contains one big switch statement, but I couldn't find any other way. Here you go!

			
func (s *scanner) scanTokens() {
	c := s.advance()
	switch c {
	case ' ':
		s.addToken(SPACE)
	case '\n':
		s.addToken(NEWLINE)
		s.line++
	case '\t':
		s.addToken(TAB)
	case '>':
		s.addToken(GREATER)
	case '[':
		s.addToken(LEFT_BRACKET)
	case ']':
		s.addToken(RIGHT_BRACKET)
	case '(':
		s.addToken(LEFT_PARAN)
	case ')':
		s.addToken(RIGHT_PARAN)
	case '!':
		s.addToken(EXCLAMATION)
	case '\\':
		s.addToken(FORWARD_SLASH)
	case '=':
		s.addToken(EQUAL)
	// Multi character
	case '`':
		s.code()
	case '-':
		if s.match('-') && s.match('-') {
			s.addToken(TRIPLE_DASH)
		} else {
			s.addToken(DASH)
		}
	case '_':
		if s.match('_') {
			if s.match('_') {
				s.addToken(TRIPLE_UNDERSCORE)
			} else {
				s.addToken(DOUBLE_UNDERSCORE)
			}
		} else {
			s.addToken(UNDERSCORE)
		}
	case '*':
		if s.match('*') {
			if s.match('*') {
				s.addToken(TRIPLE_STAR)
			} else {
				s.addToken(DOUBLE_STAR)
			}
		} else {
			s.addToken(STAR)
		}
	case '+':
		s.addToken(PLUS)
	case '#':
		s.heading()
	default:
		if isDigit(c) {
			for !s.isAtEnd() && isDigit(s.peek()) {
				s.advance()
			}

			// Number followed by a period and a space, e.g. "1. Hello"
			if s.peek() == '.' && s.peekNext() == ' ' {
				s.advance() // consume the '.'
				// current-1 excludes the '.', leaving just the digits
				s.tokens = append(s.tokens, NewToken(LIST_NUMBER, s.source[s.start:s.current-1], s.line))
			} else {
				s.content()
			}
		} else {
			s.content()
		}
	}
}
			
		

Yikes!! Pretty big, but that's how we roll. One might wonder why we are capturing so many tokens. Well, what we capture is decided by the grammar of the language. I have decided to capture groups of characters when they are present. For example, if *** is read, it is stored as a single TRIPLE_STAR instead of 3 individual stars. This makes the work of our parser easier, as it need not count the number of stars. Since the number of stars matters, we capture them as STAR, DOUBLE_STAR, and TRIPLE_STAR. The same is done with '-' and '_' as well.
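One thing to note: scanTokens calls s.peekNext(), which isn't shown in this post. A minimal version, guarded so it never reads past the end of the source, might look like this (the struct is trimmed to the fields the method needs):

```go
package main

import "fmt"

// Minimal scanner fields needed to show peekNext; the real struct
// also has tokens, start, and line.
type scanner struct {
	source  string
	current int
}

// peekNext looks one character past the current position, returning 0
// when that would run off the end of the source.
func (s *scanner) peekNext() rune {
	if s.current+1 >= len(s.source) {
		return 0
	}
	return rune(s.source[s.current+1])
}

func main() {
	s := &scanner{source: "1. Hello", current: 1} // current is at '.'
	fmt.Printf("%q\n", s.peekNext())              // ' '
}
```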

Writing the scanner logic for code was a bit tricky. Anything inside a code block is part of the code and not markdown, so the usual formatting syntax doesn't work. There is also the distinction that a single backtick ` or double `` works only on a single line, while triple backticks (```) can contain newlines. Hence, we need to store the count of backticks as well.

			
func (s *scanner) code() {
	noOfBackticks := 1 // one backtick was already consumed in scanTokens
	line := s.line
	for !s.isAtEnd() && s.peek() == '`' {
		noOfBackticks++
		if noOfBackticks > 3 {
			break
		}
		s.advance()
	}

	for !s.isAtEnd() {
		if s.peek() == '`' {
			count := 0
			for !s.isAtEnd() && s.peek() == '`' {
				count++
				s.advance()
			}

			if count == noOfBackticks {
				s.tokens = append(s.tokens, NewToken(CODE, s.source[s.start:s.current], line))
				break
			}
		}

		if !s.isAtEnd() && s.peek() == '\n' {
			s.line++
			// single and double backticks cannot span lines
			if noOfBackticks != 3 {
				s.tokens = append(s.tokens, NewToken(CODE, s.source[s.start:s.current], line))
				break
			}
		}

		if !s.isAtEnd() {
			s.advance()
		}
	}
}
		
		

code handles any number of backticks. First we count the opening backticks; the count starts at 1 because one backtick was already read in scanTokens. Once the number of backticks is known, we continue to read the rest of the body. Since single and double backticks cannot contain a newline, we break out of the loop and form a token when one is seen. At the end of the code, if the same number of backticks is read, we add a token. This is not part of a regular language, since we need memory to store information and decide what should come next. Another important piece of code, which you may have noticed, is the default case. Markdown allows us to specify ordered lists in the '1. ' format, so we check for that case; otherwise it is content.

			
func (s *scanner) ScanTokens() []Token {
	for !s.isAtEnd() {
		s.start = s.current
		s.scanTokens()
	}

	s.tokens = append(s.tokens, NewToken(EOF, "", s.line))
	return s.tokens
}
			
		

This is the function which starts the scanning process. After each token has been read, we update start to the current value. This gives us a lot of information, especially while storing content. Once the whole file is scanned, we append an EOF token at the end for our parser. Markdown is not a regular language; I will talk more about this in the next blog. By storing each formatting character, we make things easier for our parser, but certain assumptions are made (matching opening and closing format strings, etc.). Later on, we will change our code to handle complete markdown.
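To make the driver loop concrete, here is a heavily trimmed, self-contained sketch. It recognises only newlines and runs of content, and the token names are simplified strings, but it follows the same start/current/EOF pattern as the real scanner:

```go
package main

import "fmt"

type Token struct {
	Type   string
	Lexeme string
	Line   int
}

type scanner struct {
	source  string
	tokens  []Token
	start   int
	current int
	line    int
}

func (s *scanner) isAtEnd() bool { return s.current >= len(s.source) }

// scanToken handles just two cases: a newline token, or a run of
// anything else collected as content. The real scanner dispatches on
// many more characters.
func (s *scanner) scanToken() {
	c := s.source[s.current]
	s.current++
	if c == '\n' {
		s.tokens = append(s.tokens, Token{"NEWLINE", "", s.line})
		s.line++
		return
	}
	for !s.isAtEnd() && s.source[s.current] != '\n' {
		s.current++
	}
	s.tokens = append(s.tokens, Token{"CONTENT", s.source[s.start:s.current], s.line})
}

// ScanTokens is the same driver shape as above: reset start, scan one
// token, and finish with an EOF marker.
func (s *scanner) ScanTokens() []Token {
	for !s.isAtEnd() {
		s.start = s.current
		s.scanToken()
	}
	s.tokens = append(s.tokens, Token{"EOF", "", s.line})
	return s.tokens
}

func main() {
	s := &scanner{source: "Hi\nthere", line: 1}
	for _, t := range s.ScanTokens() {
		fmt.Println(t.Type, t.Lexeme)
	}
}
```

Running this on "Hi\nthere" produces CONTENT, NEWLINE, CONTENT, and EOF tokens, which is exactly the shape of stream the parser will consume.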

Demo

You can test out the scanner here. Input the markdown on the left-hand side and click "Run". If successful, the output will be presented on the right-hand side. Note: WebAssembly (wasm) support must be present in your browser.

Closing Remarks

A lot of code has been written, which can produce a set of tokens based on the input. These tokens are then used by the parser to produce an intermediate representation, which is then converted into HTML code.