Dorper Website of Paul Carver Harrison

Microcoded Stack Machine

For fun and edification, I created a 32-bit microcoded stack machine in Verilog. It features compact instructions, typically around a single byte, as well as instructions that operate on immediates, minimizing the use of "push" instructions.

ALU

The ALU has 16 operations, including signed and unsigned comparisons, sign extensions, logic operations, arithmetic operations, a single-cycle multiply, and an LSL/ASR barrel shifter.
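As a rough illustration of how a few of these operation classes behave on 32-bit values, here is a small Python model. It is only a sketch, not the Verilog: the operation names (add, ltu, lts, lsl, asr, sxb) and the masking of the shift amount to 5 bits are my assumptions for the example. The zero flag mirrors the o_zero output of the ALU.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


# Hypothetical operation names; the real ALU has 16 operations.
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "and": lambda a, b: a & b,
    "ltu": lambda a, b: int(a < b),                        # unsigned compare
    "lts": lambda a, b: int(to_signed(a) < to_signed(b)),  # signed compare
    "lsl": lambda a, b: a << (b & 31),                     # logical shift left
    "asr": lambda a, b: to_signed(a) >> (b & 31),          # arithmetic shift right
    "sxb": lambda a, b: ((a & 0xFF) ^ 0x80) - 0x80,        # sign-extend a byte
}


def alu(op, a, b=0):
    """Return (result, zero_flag) for one operation on 32-bit operands."""
    result = OPS[op](a & MASK, b & MASK) & MASK
    return result, int(result == 0)
```

The same bit pattern compares differently depending on the operation: alu("ltu", 0xFFFFFFFF, 1) treats the first operand as a large unsigned number, while alu("lts", 0xFFFFFFFF, 1) treats it as -1.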

Assembler

I used the customasm assembler to create a custom assembler for my ISA.

Microcode

Microassembler

The instruction set is completely microcoded. The microcode is assembled with my own custom microassembler, which supports custom fields, mapping ROM generation, enums, and subroutines. I consider it superior to the offerings provided by AMD for their Am2900 line. It is written completely in Python. Once ready for release, the microassembler will be posted here. If you want a preliminary copy, send me an email.

Microcode Sequencer

I have created a microcode sequencer written in Verilog. It is loosely inspired by the AMD Am2910 Microcode Sequencer but removes redundant features and moves some hardware out of the sequencer. For example, the mapping ROM / immediate source selection is done outside the microcode sequencer.

It contains just two registers: PC (Program Counter) and RA (Return Address). The microcode sequencer supports up to 512 words of microcode. The word length I have decided on for my CPU is 36 bits, as GOWIN FPGAs implement BSRAMs in multiples of 9 bits, presumably for parity.

The microcode sequencer supports 4 operations:

Op    PC'   RA'
STEP  PC+1  RA
JUMP  A     RA
CALL  A     PC+1
RETN  RA    0

Clearing the return address on RETN allows for easy nesting of subroutines using CALL. For example, the rot instruction calls swap twice. swap does not JUMP back to fetch @ 000h at the end of the instruction but rather RETNs: if it was called from rot, control returns to rot; if it wasn't, the value of RA will be 0 and fetch will execute.
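The rot/swap behaviour can be traced with a small Python model of the table above. This is a sketch under assumptions: the SWAP and ROT addresses and the one-word swap body are made up for the example, the PC is wrapped to 9 bits because the sequencer holds 512 words, and an instruction is assumed to be entered with RA equal to 0.

```python
FETCH = 0x000  # from the text: fetch lives at 000h
SWAP = 0x040   # hypothetical address
ROT = 0x050    # hypothetical address
PC_MASK = 0x1FF  # 512 words of microcode -> 9-bit PC (assumed wrap)


def next_state(op, pc, ra, a):
    """Return (PC', RA') for one sequencer operation, per the table above."""
    if op == "STEP":
        return (pc + 1) & PC_MASK, ra
    if op == "JUMP":
        return a, ra
    if op == "CALL":
        return a, (pc + 1) & PC_MASK
    if op == "RETN":
        return ra, 0
    raise ValueError(f"unknown op {op!r}")


# Toy microprogram: address -> (op, target). Only control flow is modelled.
MICROCODE = {
    SWAP:     ("STEP", None),  # stand-in for the body of swap
    SWAP + 1: ("RETN", None),  # back to the caller, or to fetch if RA == 0
    ROT:      ("CALL", SWAP),
    ROT + 1:  ("CALL", SWAP),
    ROT + 2:  ("RETN", None),  # RA is 0 here, so this lands on fetch
}


def run(entry):
    """Follow the microprogram from entry until it reaches fetch."""
    pc, ra = entry, 0  # assumption: RA is 0 when an instruction starts
    trace = [pc]
    while pc != FETCH:
        op, a = MICROCODE[pc]
        pc, ra = next_state(op, pc, ra, a)
        trace.append(pc)
    return trace
```

Running run(SWAP) goes straight from the RETN to fetch, while run(ROT) visits the swap body twice and returns into rot each time, all with a single RA register.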

Conditionals

If the halt signal is high, the PC and RA registers do not change. If halt is low and the condition ~cce | (cci ^ cc) is met, PC and RA are updated to their next values as shown in the table above; otherwise, a STEP operation is performed. cce is the condition code enable microcode field. When it is 0, the condition is always met, so the operation is unconditional. cci is the condition code invert microcode field. When set, cc is inverted using XOR. cc is the condition code. It is connected to the o_zero output of the ALU, which is high when all bits of the ALU result are zero.
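The condition logic is easy to tabulate in Python. This is only a sketch of the behaviour described above, not the Verilog; the function name and the labels "HOLD", "OP", and "STEP" are my own for the example.

```python
def sequencer_action(halt, cce, cci, cc):
    """Decide what the sequencer does this cycle (all inputs are 0 or 1).

    "HOLD": PC and RA keep their values (halt is high).
    "OP":   the selected operation from the table is applied.
    "STEP": the condition failed, so the sequencer just steps.
    """
    if halt:
        return "HOLD"
    if (not cce) or (cci ^ cc):  # ~cce | (cci ^ cc)
        return "OP"
    return "STEP"


# Print the full truth table for halt = 0.
for cce in (0, 1):
    for cci in (0, 1):
        for cc in (0, 1):
            print(cce, cci, cc, sequencer_action(0, cce, cci, cc))
```

With cce = 1 and cci = 0 the operation is taken when the ALU result is zero; with cci = 1 it is taken when the result is nonzero.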

Compact Encoding

A byte instruction has its top 5 bits sent through a mapping ROM, whose output determines which instruction to execute. The mapper allows for compact use of microcode without many duplicate entries, while still allowing many different instructions to be microcoded. There are, in fact, 3 different types of ALU microsubroutines (these implement ISA instructions): alu1 (operates only on TOS), alu2 (operates on TOS and NOS), and alui (operates on TOS and an immediate in the next byte). Given that there are 16 ALU operations, each type takes four mapping entries. In total, these make up 12 of 64 mapping entries (108 of 576 bits, or 18.75% of the mapping ROM) and require only 10 words (360 bits) of microcode. Yet these three microsubroutines provide 48 powerful instructions.
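The figures in this section can be re-derived with a few lines of Python. Every input below is a number quoted on this page; only the variable names are mine.

```python
MAP_ENTRIES = 64            # mapping ROM entries
MAP_ROM_BITS = 576          # total mapping ROM size in bits
MICROCODE_WORD_BITS = 36    # microcode word length
ALU_OPS = 16                # ALU operations
ALU_ROUTINES = ("alu1", "alu2", "alui")
ALU_MAP_ENTRIES = 12        # mapping entries used by the ALU routines
ALU_MICROCODE_WORDS = 10    # microcode words used by the ALU routines

bits_per_entry = MAP_ROM_BITS // MAP_ENTRIES                    # 9 bits
entries_per_routine = ALU_MAP_ENTRIES // len(ALU_ROUTINES)      # 4 entries
alu_map_bits = ALU_MAP_ENTRIES * bits_per_entry                 # 108 bits
alu_map_share = ALU_MAP_ENTRIES / MAP_ENTRIES                   # 0.1875
alu_microcode_bits = ALU_MICROCODE_WORDS * MICROCODE_WORD_BITS  # 360 bits
alu_instructions = len(ALU_ROUTINES) * ALU_OPS                  # 48

print(f"{bits_per_entry} bits per mapping entry")
print(f"{alu_map_bits}/{MAP_ROM_BITS} mapping bits ({alu_map_share:.2%})")
print(f"{alu_microcode_bits} microcode bits for {alu_instructions} instructions")
```

The 9 bits per mapping entry also line up with the 512-word microcode store, since a 9-bit value can address exactly 512 words.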

Physical Registers

The CPU has 7 physical registers and one pseudo-register IMM that allows for the microcode immediate field to be used for the value of rs1. The end programmer does not have to worry about these; they are microarchitectural.

Address Registers

There are four 24-bit address registers. They are 24 bits wide because the Tang Nano 20K has only 8 MiB of SDRAM, so more than 16 MiB of address space is not necessary. Keeping the address registers at 24 bits also saves a lot of LUTs and FFs. Address registers can only be used for the value of rs1 and rd, as no use could be found for performing ALU operations on two address registers. Any of these registers can be used for the rma (Read Memory Address) field, which determines the memory address that is read or written and can also be used to select the ALU operation (as in alu1, alu2, and alui).

PCP: Program Counter Pointer
DSP: Data Stack Pointer
RSP: Return Stack Pointer
TMP: TeMporary Pointer

Data Registers

There are three 32-bit data registers. They can be used for rd, rs1, and rs2. In addition, the pseudo-register IMM selects the value from the microcode imm field.

MDR: Memory Data Register. All memory reads and writes use this register.
TOS: Top Of Stack. Single-entry stack cache to allow for fewer clock cycles and memory accesses.
ACC: ACCumulator. General purpose.

Memory

The CPU includes a 16 KiB SRAM at 004000h, where both the return stack (grows down) and the data stack (grows up) reside, as well as a 16 KiB ROM at address 000000h. Memory accesses currently take two cycles, but I hope to get that down to one cycle.
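The memory map described so far can be written down as a small address decoder in Python. This is a sketch only: the function name is hypothetical, and anything outside the ROM and SRAM windows is reported as "other" because this page does not say what lives there.

```python
ADDR_BITS = 24                  # width of the address registers
KIB = 1024

ROM_BASE, ROM_SIZE = 0x000000, 16 * KIB
SRAM_BASE, SRAM_SIZE = 0x004000, 16 * KIB


def region(addr):
    """Classify a 24-bit address as "ROM", "SRAM", or "other"."""
    addr &= (1 << ADDR_BITS) - 1  # address registers hold 24 bits
    if ROM_BASE <= addr < ROM_BASE + ROM_SIZE:
        return "ROM"
    if SRAM_BASE <= addr < SRAM_BASE + SRAM_SIZE:
        return "SRAM"
    return "other"  # not described on this page


# 24 bits cover 16 MiB, twice the 8 MiB of SDRAM on the Tang Nano 20K.
ADDRESS_SPACE_MIB = (1 << ADDR_BITS) // (KIB * KIB)
```

The ROM ends at 003FFFh and the SRAM starts immediately after it at 004000h, so the two 16 KiB windows are contiguous.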

Programs

Fibonacci

A function that finds the n-th Fibonacci number was written in just 13 bytes, utilizing the powerful dtor and repeati instructions that can be used to implement a for (int i = n; i >= 0; i--) loop in just 3 bytes. 13 bytes is much shorter than the equivalent program in ARM or x86.

fib:  dup
      retz
      dtor
      push0
      pushdb 1
fi:   over
      add
      swap
      repeati fi
      drop
      ret

Yes, that's the entire program. Compare to this huge (in comparison) x86 program. Note that I have actually optimized the code compared to the original version I found online:

_fib:
    pushl %ebp
    movl  %esp, %ebp
    pushl %ebx
    pushl %ecx
    # Saved Registers
 
    movl  8(%ebp), %ecx
    xorl  %ebx, %ebx
    movl  $1, %eax
    jmp   fib2
fib1:
    add  %ebx, %eax             # last = last + secondlast
    neg  %ebx
    add  %eax, %ebx             # secondlast = -secondlast + last
    dec  %ecx                   # n = n - 1
fib2:
    or   %ecx, %ecx
    jne  fib1                   # if n != 0 goto fib1
 
    # Restore registers and return
    popl %ecx
    popl %ebx
    popl %ebp 
    ret

That x86 program is 32 bytes!
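To see why the three-instruction loop body of the stack machine version (over, add, swap) is enough, it can be traced in Python. This sketch assumes the usual Forth meanings of over, add, swap, and drop, and 32-bit wraparound on add. It does not model dtor or repeati: the number of loop passes is simply a parameter here.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def over(stack):
    """( a b -- a b a ): copy the next-on-stack value to the top."""
    stack.append(stack[-2])


def add(stack):
    """( a b -- a+b ): pop two values, push their 32-bit sum."""
    b = stack.pop()
    a = stack.pop()
    stack.append((a + b) & MASK)


def swap(stack):
    """( a b -- b a ): exchange the top two values."""
    stack[-1], stack[-2] = stack[-2], stack[-1]


def fib_loop(passes):
    """Run the loop body `passes` times, then drop, as in the listing."""
    stack = [0, 1]            # push0 ; pushdb 1
    for _ in range(passes):   # stand-in for the dtor/repeati loop
        over(stack)
        add(stack)
        swap(stack)
    stack.pop()               # drop
    return stack[-1]
```

Each pass turns the pair (a, b) into (a + b, a), so after k passes the value left under the dropped top is the k-th Fibonacci number; for example, fib_loop(10) gives 55.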

Peripherals

I have designed a color VGA character generator that I hope to integrate into this project.

Simulation

I have a testbench simulator that allows debugging of microcode.

Synthesizability

My CPU is synthesizable targeting a GOWIN GW2AR-18 FPGA, utilizing under 1,200 LUTs.