CS241: Assignment 4

CS241 — Spring 2026

Assignment 4

Out of 9 marks

Assignments for CS241

← A3

Assignment 4

—

Due Friday, June 19th, 5:00 PM

P1 P2 P3 ( P3a P3b )

In this assignment, you will write a tokenizer for WLP4, a simple C-like programming language that we will incrementally write a compiler for in this course. In addition, you will be asked to write some simple programs in WLP4 to better understand the language.

Your WLP4 compiler will eventually translate WLP4 source code into ARM64 assembly language. The tools you write in Assignments 4 through 8 will eventually combine to form a complete compiler from WLP4 to ARM64 assembly. Combined with the tools you created in Assignments 1 through 3 to compile assembly to machine code, you will have a complete pipeline for compiling WLP4 into executable ARM64 programs.

You will get some practice with WLP4 to better understand its various quirks and limitations, and then you will specialize your Simplified Maximal Munch scanner to work with WLP4 programs, completing the first component of the compiler.

Introduction to WLP4

WLP4 is quite similar to a restricted subset of C, but it uses C++ style memory allocation (new/delete instead of malloc/free). The syntax is heavily restricted to make it easier to write a compiler for. For example:

All variable declarations must appear at the start of a procedure. Mid-procedure declarations are not allowed.
All variables must be initialized to a constant. They cannot be initialized to another variable or the result of an expression. They cannot be left uninitialized.
Integer literals cannot be negative. In fact, there is no unary (one-operand) minus operator. To get -1, for example, you must write 0 - 1.
if statements must be written with blocks and must have a matching else, even if the else block is empty. Because of the restriction to blocks, there is no "else if" syntax.
An if statement can only have one boolean condition (no && or || operators).
You can only have one return statement per procedure and it must be at the very end.
Array access must be done using pointer arithmetic. Instead of writing array[index], you must write *(array+index).

More information about WLP4 is below:

How to Compile and Run WLP4 Programs

Brief Introduction to WLP4

Formal WLP4 Specification

Problem 1: WLP4 Practice (filename: `practice.wlp4`) [2 marks of 9: 1 public, 1 release]

Warm-Up Exercise 1 (not submitted)

Write a WLP4 procedure (function) that computes the absolute value of an integer. You do not need to handle the case of -2⁶³ = -9,223,372,036,854,775,808 (this integer has no positive counterpart in 64-bit two's complement).

Warm-Up Exercise 2 (not submitted)

Write a WLP4 procedure called tieBreak that takes two distinct integers as parameters, and does the following:

If the integers have different absolute values, return the integer with the smaller absolute value.
If the integers have the same absolute value, then since they are distinct, one is positive and one is negative. Return the negative integer.

Actual Problem

Write a WLP4 program that takes a non-empty array of integers in the range -241 to 241 (inclusive) as input and returns the integer that occurs most frequently. If there is a tie for the most frequent integer, break the tie using the rule from the tieBreak procedure described in Warm-Up Exercise 2.

Your program should run in linear time (in the size of the input array). This likely means you will need to allocate a sufficiently large block of memory to store the number of times each integer occurs.

Examples

Suppose the input array is [1, 2, 2, 3, 3, 3]. Your program should return 3.

Suppose the input array is [4, 4, 4, 2, 2, 2, -3, -3, -3]. There's a three way tie between 4, 2 and -3. Since 2 has the least absolute value, your program should return 2.

Suppose the input array is [1, -1, 2, -2]. There's a four way tie between 1, -1, 2, and -2. Among these, 1 and -1 tie for the smallest absolute value. To break this tie, you choose the negative number, so you return -1.

Testing

Compile your program with cs241.wlp4c, then run the compiled code with cs241.arm64emu. Since cs241.wlp4c produces machine code directly,you don't need to use cs241.binasm.

Alternatively, on the student Linux system:

      $ cs241.wlp4c < practice.wlp4 > practice.bin
      $ cs241.arm64emu -a practice.bin 4 4 4 2 2 2 -3 -3 -3

Please remember that the student Linux system is unreliable. Do your development and testing on your own system, using the web tools when possible, so that you won't be stuck when (rather than if) the student Linux servers go down.

The wain function in your WLP4 program should take an long* as the first parameter and an long as the second parameter. The first parameter will receive the value that cs241.arm64emu places in x0 (a pointer to the array) and the second parameter will receive the value it places in x1 (the size of the array).

When the program terminates, the return value of the wain function will be stored in x0. Ensure the value in x0 is correct.

Problem 2: WLP4 Radix Printing Integers (filename: radix.wlp4) [2 marks of 9: 1 public, 1 release]

In Assignment 3, you wrote an assembly program to print integers in base 10. WLP4 includes a print statement, written println(x), that directly prints integers in base 10. In this problem, you will write a WLP4 program to print integers in any base. The base of an ordinal number system is also called its radix; hence, the problem name. To do so, you will need to use putchar(x);, which outputs the least significant 8 bits of its integer argument as a character to standard output (same fashion as outputting a character in base ARM64 assembly).

Write a WLP4 program that takes two 64-bit signed integer arguments: the first is the number to print, and the second is the radix to print it in. The number to print must be non-negative. If a negative number is given as the first argument to wain, your program must print nothing and return 1 to indicate an error. The radix must be greater than 1 and less than 37. If a number out of that range is given as the second argument to wain, your program must print nothing and return 2 to indicate an error. If the number and radix are both valid, your program must print the digit(s) of the number in the given radix. After the digits, it must print a newline, and return 0.

If the radix is greater than 10, use the capital English letters for the digits after 9: A, B, C, etc. This is why the radix is restricted to being less than 37: we have 10 standard decimal digits plus 26 letters equals 36 possible digit characters.

Examples

Suppose that the input is 12345, 10. Your program should print "12345" and a newline, and return 0.

Suppose that the input is 342391, 8. Your program should print "1234567" and a newline, and return 0. Note that 1234567 interpreted in base 8 is equal to 342391 interpreted in base 10.

Suppose that the input is 147, 2. Your program should print "10010011" and a newline, and return 0.

Suppose that the input is 683248722, 36. Your program should print "BASE36" and a newline, and return 0.

Suppose that the input is 9172063, 29. Your program should print "CS241" and a newline, and return 0.

Suppose the input is -1, 3. Your program should return 1.

Suppose the input is 15, 1. Your program should return 2.

Suppose the input is -37, 37. Your program should return 1.

Testing

Compile your program with cs241.wlp4c, then run the compiled code with cs241.arm64emu. Since cs241.wlp4c produces machine code directly, you don't need to use cs241.binasm.

Alternatively, on the student Linux system:

      $ cs241.wlp4c < radix.wlp4 > radix.bin
      $ cs241.arm64emu radix.bin 683248722 36

Problem 3: WLP4 Scanner

Write a scanner for WLP4.

You have two options for how to write your WLP4 scanner. One option is to use a general Simplified Maximal Munch tokenizer provided by us, and provide the DFA, and the other is to write your own scanner from scratch (or adapt your scanner code from assignment 2). If you choose to write your scanner from scratch, solve Problem 3a. If you choose to use the general tokenizer provided by us, solve Problem 3b. YOU DO NOT NEED (and are not advised) TO SOLVE BOTH.

If you were able to complete the assembly scanner for A2, we recommend solving 3a. It will be easier to adapt your existing code than to conform to the behaviour of a tool you haven't otherwise encountered. If you were not able to complete the assembly scanner for A2, you may find 3b easier.

Problem 3a: WLP4 Scanner (Scratch) (filenames: wlp4scan.cc OR wlp4scan.rkt) [5 marks of 9: 1 public, 2 secret+release, 2 secret]

IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO SUBMIT A SOLUTION TO PROBLEM 3B. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.

Write a scanner for WLP4. The scanner should read a WLP4 program from standard input and tokenize it, producing a sequence of lines, each representing one token of the program. Each line should be terminated by a line feed (ASCII 0x0A) and should contain the token kind (the all-caps name indicating the type of the token as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token).

Whitespace and comments are not considered tokens and should not be part of the output. However, tokens in the input might be separated by whitespace or comments.

If the input can't be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally.

For programs with a valid tokenization, your output should be identical to that of the wlp4scan tool.

Clarifications

In the earlier Simplified Maximal Munch scanner problem, you were provided a scanning DFA, followed by a string to scan using the DFA. For this problem, the input to your program is simply WLP4 source code. That is, the input will not include a DFA for WLP4 tokens. You will likely have to create such a DFA yourself and embed it in your program, similar to the suggestion for the assembly scanner.
If a NUM token's numeric value is out of range for a 64-bit signed integer, you should report an error. This can be done directly in the DFA, though doing so may result in a fairly cumbersome DFA. If you prefer, an alternative way of doing this is to use std::stol, since that reports a std::out_of_range exception if the number being read is too large. If you're testing your code on Windows, you must use std::stoll instead, since longs are 32-bit on Windows systems.

Example

If the input is this WLP4 program:

      //
      // WLP4 program with two integer parameters, a and b
      //   returns the sum of a and b
      //
      long wain(long a, long b) {
        return a + b;   // unhelpful comment about summing a and b
      }

Then the correct output is:

      LONG long
      WAIN wain
      LPAREN (
      LONG long
      ID a
      COMMA ,
      LONG long
      ID b
      RPAREN )
      LBRACE {
      RETURN return
      ID a
      PLUS +
      ID b
      SEMI ;
      RBRACE }

Testing

Compare your results to cs241.wlp4scan. For this problem, Marmoset again uses diff -Zb, so that small differences in whitespace are ignored (e.g., extra space at the end of the line).

On the student Linux system:

      $ ./wlp4scan < input.wlp4 > output.txt
      $ cs241.wlp4scan < input.wlp4 > expect.txt
      $ diff -Zb output.txt expect.txt

Problem 3b: WLP4 Scanner (Preprocessed) (filenames: `wlp4.dfa` AND `wlp4scan.(cc/cpp/rkt)`) [5 marks of 9: 1 public, 2 secret+release, 2 secret]

IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO SUBMIT A SOLUTION TO PROBLEM 3A. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.

Write a scanner for WLP4, using a generic simplified maximal munch tokenizer as a preprocessing step.

We have provided a tool, cs241.smm, which is a general simplified maximal munch tokenizer. It takes a DFA with no .INPUT segment as an argument, and a program to be tokenized with that DFA from standard input. It outputs one token lexeme per line. It is not capable of ignoring whitespace or comments on its own; in cs241.smm, you must tokenize whitespace like any other token. cs241.smm supports two extensions to the DFA file format:

The alphabet is allowed to include the special symbols .SPACE and .NEWLINE. .SPACE means any whitespace other than a newline, and .NEWLINE, of course, means newline. The distinction needs to be made because comments end with a newline.
Transitions may be on any alphabet character (including .SPACE and .NEWLINE), or the special symbol .OTHER. In a transition, .OTHER is used if no other transition is valid for the given character; i.e., it is the "else" case, used for all symbols not otherwise written in a transition for the given state. This even includes characters outside of the alphabet. This is needed because comments are allowed to contain any character.

Note that .SPACE and .NEWLINE are also included in the lexemes that are output by cs241.smm. All other characters in lexemes are output verbatim.

cs241.smm is capable of breaking its input down into a sequence of tokens, but is not capable of categorizing them into kinds. Thus, in order to use it for a scanner, you must write two components: a DFA for cs241.smm, and a program to take the lexemes output by cs241.smm, discard whitespace and comment tokens, categorize the remaining tokens, and check ranges. The description of both components follows.

First Component: `wlp4.dfa`

Write a DFA for WLP4, following the WLP4 Specification, adequate for use by cs241.smm. Given a WLP4 program and your DFA as input, cs241.smm should write the lexemes of each token, one per line, to standard output. This will also include lexemes for whitespace and comments, which must be discarded in the next step.

If the input does not tokenize by the Simplified Maximal Munch algorithm with the given DFA, cs241.smm will output ".ERROR" as a line to standard output, as well as outputting a descriptive error to standard error and quitting.

Example

If the input is this WLP4 program:

    long wain(long a, long b) {
      return a + b;
    }

Then the correct output is:

    long
    .SPACE
    wain
    (
    long
    .SPACE
    a
    ,
    .SPACE
    long
    .SPACE
    b
    )
    .SPACE
    {
    .NEWLINE
    .SPACE.SPACE
    return
    .SPACE
    a
    .SPACE
    +
    .SPACE
    b
    ;
    .NEWLINE
    }
    .NEWLINE

Note: It's OK if your tokenization of whitespace or comments differs from how it's done above; all that matters is that the final tokenization (made in conjunction with the second component) is correct.

Testing

Run your DFA with cs241.smm.

On the student Linux system:

    $ ./wlp4scan wlp4.dfa < input.wlp4

Note that your DFA will not be tested in isolation. It will only be tested with the next component.

Second Component: `wlp4scan.(cc/rkt)`

Write a scanner for WLP4 that takes one lexeme per line, from cs241.smm, as input.

The scanner should read a sequence of token lexemes from standard input (i.e., the output of cs241.smm) and produce a full tokenization, one token per line. Each line should be terminated by a line feed (ASCII 0x0A) and should contain the token kind (the all-caps name indicating the type of the token, such as ID, NUM, LPAREN, etc. as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token, exactly as output by cs241.smm).

Whitespace and comments are not considered tokens and should be discarded.

If the input cannot be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally. Note that this can be due to an error that was caught by the DFA (and thus a reaction to ".ERROR" from standard input), or due to an error not caught by the DFA which you instead catch in your scanner.

If a NUM token's value is too large to fit in a 64-bit integer, that's an error that you need to catch as well. A good way of doing this is to use std::stol, since that reports a std::out_of_range exception if the number being read is too large. If you're testing your code on Windows, you must use std::stoll instead, since longs are 32-bit on Windows systems.

For programs with a valid tokenization, your output should be identical to that of the cs241.wlp4scan tool.

Example

If the input to cs241.smm is this WLP4 program...

    //
    // WLP4 program with two integer parameters, a and b
    //   returns the sum of a and b
    //
      long wain(long a, long b) {
        return a + b;   // unhelpful comment about summing a and b
      }

...and the input to your program is the output of cs241.smm, then the correct output is

    LONG long
    WAIN wain
    LPAREN (
    LONG long
    ID a
    COMMA ,
    LONG long
    ID b
    RPAREN )
    LBRACE {
    RETURN return
    ID a
    PLUS +
    ID b
    SEMI ;
    RBRACE }

Testing

Compare your results to cs241.wlp4scan. For this problem, Marmoset again uses diff -Zb, so that small differences in whitespace are ignored (e.g., extra space at the end of the line).

On the student Linux system:

    $ cs241.smm wlp4.dfa < input.wlp4 | ./wlp4scan > output.txt
    $ cs241.wlp4scan < input.wlp4 > expect.txt
    $ diff -Zb output.txt expect.txt

Be careful not to confuse cs241.wlp4scan (the course-provided reference implementation of the scanner) with your own program. Perhaps name your own executable something different during testing, like mywlp4scan.