Computer architecture bottom-up, improvised

What series of components and abstractions comprises the functionality of a basic computer? The question could easily be resolved with a refresher from a digital logic or architecture textbook, or better yet, the NAND to Tetris self-study course, which I would be inclined to explore one day in an effort to reacquire the fascinating in-depth insight that escapes my memory.

Notwithstanding, I would expect any computer user beyond, let’s say, casual, to possess at least an approximate high-level comprehension of the matter. I would even consider this an intriguing question (for an interview, or otherwise): describe how a computer works from first principles, to the best of your ability. Could I handle such a question at this given moment? I’ll make an attempt, and give my word of honor that I haven’t prepared in advance beyond a university architecture course from ~15 years ago and the occasional independent effort to refresh certain knowledge, the last of which dates to three years ago. Feel free, however, to regard the following as a product of fiction.

We’ll start with the transistor as the simplest analog switching unit. The transistor houses a semiconductor substrate that facilitates particular electron flow depending on the charge applied to its terminals. Several different transistor models exist, including the only one I can recall: the FET (Field-Effect Transistor). I’ll embarrass myself in any further attempt at electrical-engineering-oriented deliberation.

We can construct logic gates by means of a handful of transistors and other analog components I conveniently won’t mention (because I don’t know or remember). The logic gates include AND, OR, XOR, NAND, and NOT gates. Some of these gates can be composed from other, simpler gates, but I’m inclined to say that a spartan gate construction is vital to minimizing the circuit’s critical path (propagation delay), cost, energy consumption, and power dissipation.
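To make the gate composition concrete, here is a toy Python sketch (my own illustration, not anything from a textbook circuit diagram) of the classic construction in which NAND alone serves as the universal gate and the others are derived from it:

```python
# Signals modeled as Python ints 0/1; each function is one "gate".
def NAND(a, b):
    return 0 if (a and b) else 1

def NOT(a):
    return NAND(a, a)          # NOT is NAND with both inputs tied together

def AND(a, b):
    return NOT(NAND(a, b))     # invert NAND to recover AND

def OR(a, b):
    return NAND(NOT(a), NOT(b))  # De Morgan: OR = NAND of the inverses

def XOR(a, b):
    n = NAND(a, b)
    return NAND(NAND(a, n), NAND(b, n))  # four-NAND XOR construction
```

Note that the derived gates cost several NAND delays each, which is exactly the propagation-delay concern mentioned above.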

With these gates we can construct some (state-oblivious) combinational circuits: multiplexers, demultiplexers, encoders, decoders, half adders, full adders, carry-look-ahead adders, and others composed of the aforementioned simpler components. Most importantly, we can construct the ALU (Arithmetic Logic Unit), capable of carrying out various arithmetic and logical operations in one circuit.

How are multipliers and dividers constructed in a way that’s efficient? [This is where I sweat.] I recall particular divide-and-conquer multiplier constructions that scale more efficiently [logarithmically in depth?] with respect to input size. We clearly want to abstain from repetitive addition or the grade-school multiplication paradigm. With respect to input length, multipliers, and ALUs in general, are designed to handle fixed-width inputs. Anything shorter, and the unit pads the input with zeros. Anything longer, and we can stack multiple units, or split up the input and reuse the circuit. [Some long-abandoned neurons begin to (perhaps deceptively) shiver in my brain.]
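For reference, the baseline that real hardware improves upon (with carry-save adders and tree-structured reduction, if memory serves) is plain shift-and-add. A toy Python version, with a width parameter of my own choosing:

```python
def shift_add_multiply(x, y, width=8):
    # Walk y's bits from least significant upward; wherever a bit is 1,
    # add a correspondingly shifted copy of x into the accumulator.
    acc = 0
    for i in range(width):
        if (y >> i) & 1:
            acc += x << i
    return acc
```

A hardware multiplier does the same partial-product summation, but the clever circuits sum the partial products in logarithmic depth rather than one at a time.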

Subtraction is really negative addition in disguise. I should mention that all of our arithmetic takes place in binary format, and 2’s and 1’s complement are the two different schemes for representing negative numbers; 2’s complement dominates, since ordinary addition then handles signed values uniformly (I believe).
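To illustrate why subtraction reduces to addition, here is the two’s complement trick (invert the bits, add one) sketched in Python with an 8-bit width of my choosing:

```python
def twos_complement(value, width=8):
    # Negate by inverting all bits and adding one, modulo 2**width.
    return ((~value) + 1) & ((1 << width) - 1)

def to_signed(bits, width=8):
    # Reinterpret an unsigned bit pattern as a signed two's-complement value.
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits
```

So 7 - 5 becomes 7 plus the two’s complement of 5, with the wrap-around carry simply discarded by the fixed width.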

I’m clueless as to how division is handled. Floating-point units bear responsibility for more complex arithmetic, but I don’t recall having studied them in detail.

Using gates and combinational circuits, we can construct sequential (state-aware) circuits. These involve simple memory elements such as latches and flip-flops, both designed to reliably and continuously assert a 1 or 0 digital output depending on the combination of inputs. I don’t fully recall the difference between the two components, but I believe it concerns timing: the latch passes its input through whenever enabled, while the flip-flop captures it only on a clock edge. Synchronous sequential circuits depend on a clock signal as another input, as this enables the ‘sequencing’ to actually take place. How do we construct an accurate clock? I suspect it involves a quartz crystal oscillator and complexity beyond what I regretfully know how to explain.
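The latch/flip-flop timing distinction can be made concrete with a toy behavioral model (my own sketch, abstracting the gate-level feedback loops away entirely):

```python
class DLatch:
    # Level-sensitive: output follows d for as long as enable is high.
    def __init__(self):
        self.q = 0

    def step(self, d, enable):
        if enable:
            self.q = d
        return self.q


class DFlipFlop:
    # Edge-triggered: captures d only on a 0 -> 1 clock transition.
    def __init__(self):
        self.q = 0
        self.prev_clk = 0

    def step(self, d, clk):
        if clk == 1 and self.prev_clk == 0:
            self.q = d
        self.prev_clk = clk
        return self.q
```

The flip-flop’s edge sensitivity is what makes it the usual building block for clocked registers: the output changes at most once per clock cycle.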

Continuing, other sequential circuits include counters of varying complexity (up/down, X digits), composed mostly of latches or flip-flops and gates. We can construct registers, more complex memory units consisting, again, of a caravan of latches/flip-flops and gates. Consequently, we can design RAM/ROM modules by combining these latches/flip-flops and multiplexers/demultiplexers in a visually intriguing grid pattern. The above desperately attempts to describe an SRAM module, the more expensive kind, since each cell is built from a handful of transistors. The cheaper, larger-scale DRAM module relies on capacitors to store charge, along with some extra circuitry to periodically replenish the gradually diminishing charge.
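A toy behavioral sketch of two of the circuits just named, a wrapping up/down counter and a word-addressable memory whose address decoding plays the demultiplexer role (class names and widths are my own invention):

```python
class UpDownCounter:
    def __init__(self, width=4):
        self.width = width
        self.value = 0

    def tick(self, up=True):
        # Count up or down, wrapping at the register width like real hardware.
        step = 1 if up else -1
        self.value = (self.value + step) & ((1 << self.width) - 1)
        return self.value


class TinyRAM:
    # Word-addressable memory: the address selects exactly one cell,
    # which is the job the row/column decoders do in the real grid.
    def __init__(self, words=16):
        self.cells = [0] * words

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]
```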

What now? Ultimately, a datapath defines a particular CPU architecture. Fundamental machine-level instructions include operations such as loads/stores from/to memory to/from a specified register; arithmetic operations (on values stored in registers, with results written back to a register); boolean operations that modify a register appropriately based on input; and conditional operations that function similarly: based on the value of one register, set another.

Most immediate operations take place at the register level, for an astronomical increase in speed compared to suffering the delay of operating on memory. A particular architecture provides a certain number of registers and defines a format for these types of instructions. RISC and CISC come to mind as the two architectural categories defining how instructions are formatted. From what I recall, most architectures beyond Intel’s are RISC-based.

A PC (program counter) is a special register that holds the memory address of the currently executing (or, depending on the architecture, next) instruction. Certain machine-level operations, such as jump or return, modify the PC. In addition, a special register holds the return address, necessary for returning from a function to its immediate origin. To return to arithmetic operations, another that I omitted is the shift operation, serving as a means to perform efficient power-of-two multiplication and division by shifting the bits left or right, respectively.
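The shift-as-arithmetic equivalence is easy to demonstrate; a couple of helper names of my own for illustration (note that Python’s `>>` happens to be an arithmetic shift, so it also behaves for negative values):

```python
def mul_pow2(x, k):
    # Left shift by k multiplies by 2**k.
    return x << k

def div_pow2(x, k):
    # Right shift by k floor-divides by 2**k.
    return x >> k
```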

As I contemplated how one would execute a shift operation, I recalled a family of sequential circuits called parallel/serial load shift registers. Beyond the ordinary register load and store operations, these provide a shift operation that moves the contained bits in the indicated direction, implemented, similarly, by a series of latches or flip-flops and gates, in addition to the clock port to trigger the relevant operations.
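A toy behavioral model of such a shift register (parallel load plus a left shift with a serial input bit; the representation as an LSB-first bit list is my own convention):

```python
class ShiftRegister:
    def __init__(self, width=8):
        self.width = width
        self.bits = [0] * width     # index 0 = least significant bit

    def load(self, value):
        # Parallel load: latch all bits of the value at once.
        self.bits = [(value >> i) & 1 for i in range(self.width)]

    def shift_left(self, serial_in=0):
        # Shift toward the most significant end; a new bit enters at the
        # bottom and the old top bit falls off.
        self.bits = [serial_in] + self.bits[:-1]

    def value(self):
        return sum(b << i for i, b in enumerate(self.bits))
```

One left shift doubling the stored value is exactly the power-of-two multiplication mentioned above.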

As a slight detour, I would also highlight the sensitive nature of timing in sequential circuits. A misaligned clock signal arriving at different circuits can result in a quite undesirable effect known as clock skew.

Anyway, how does a computer work? Architecture-specific encoded machine instructions load into memory (never mind yet how). The datapath stores the instruction operands in specific registers. Some instruction address decoding also takes place by means of further arithmetic operations. Ultimately, the sequential circuits we have constructed so far carry out most of the work: arithmetic, shifting, storing/loading results in registers, manipulating the program counter to dictate which instruction to load from memory next, a multiplexer/demultiplexer to store a value in memory at an indicated, suitably encoded address, and so on.
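The fetch/decode/execute cycle just described can be caricatured in a few lines of Python. The instruction set here is entirely made up for illustration (four registers, tuple-encoded instructions); a real datapath does this in wires rather than dictionary lookups, of course:

```python
def run(program, steps=1000):
    # program: list of (opcode, a, b) tuples -- a hypothetical toy ISA.
    regs = [0] * 4
    pc = 0                               # the program counter
    while pc < len(program) and steps:
        op, a, b = program[pc]           # fetch + decode
        pc += 1                          # default: fall through to next
        if op == "LOADI":
            regs[a] = b                  # load immediate into register a
        elif op == "ADD":
            regs[a] = regs[a] + regs[b]  # register-to-register arithmetic
        elif op == "JNZ":
            pc = b if regs[a] != 0 else pc   # conditional PC manipulation
        elif op == "HALT":
            break
        steps -= 1
    return regs
```

Even this caricature shows the essentials: a PC steered by the instruction stream, registers as the fast operands, and decoding that routes each operation to the right "circuit".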

So far I have managed to omit pipelining, a mechanism to leverage the overlapped execution of multiple instructions by splitting different types of work across the available circuits. Loading, storing, arithmetic, and floating-point operations are all different types of tasks amenable to independent processing by their respective circuits, as a means to exploit more throughput. Pipelining, the foundation of Instruction-Level Parallelism, reflects what we already rely on in the laundry room and kitchen. In the laundry room, for example, as I load the dryer, I am free to leverage the washer with another load. In essence, pipelining breaks instructions into smaller chunks, but to ensure the intermediate results properly propagate from one station to the next, we incorporate more registers into the datapath precisely for storing these intermediate results. The resulting circuit grows in effective processing speed at the expense of increased cost and complexity. Simpler processors tend to feature a small number of pipeline stages, as few as two. More complex, workhorse processors feature ten or more, if I had to guess.
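The laundry arithmetic is worth writing down. Under the (large) simplifying assumption of no stalls or hazards, the ideal cycle counts compare like so:

```python
def total_cycles(n_instructions, n_stages):
    # Non-pipelined: each instruction occupies the whole datapath in turn.
    sequential = n_instructions * n_stages
    # Pipelined: pay the fill cost once, then retire one instruction
    # per cycle (idealized -- no stalls, hazards, or branch bubbles).
    pipelined = n_stages + (n_instructions - 1)
    return sequential, pipelined
```

For a long instruction stream the speedup approaches the number of stages, which is exactly why the stage count grew over the years.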

That’s more or less what takes place once we load the instructions into memory and initialize the program counter. (I omit certain trivialities such as caches, virtual memory, disk/network I/O operations, and interrupts.) However, how do those instructions find themselves in memory to begin with? The OS, itself consisting of a myriad of instructions, upon loading a program from disk (or other secondary storage), places the program’s machine instructions into memory, using the functionality in our datapath. And how did the OS end up in memory? In chunks. Upon boot-up, a small ROM program loads a smaller, kernel portion of the OS into memory, which, in turn, determines what further components to load.

Compilers, by means of a series of clever abstractions (deserving a separate publication that I will not write, in light of remaining compiler-oblivious), transform high-level code into architecture-specific machine code to be loaded into memory for execution.

That concludes what I can conjure in one continuous paper-notebook session, short of a restroom break and a bowl of granola. Naturally, a spoken rendition of this exercise would have to severely compress this prose. I didn’t necessarily recall every detail verbatim, but gradually derived or fabricated the components that seemed logically fitting at each layer. The ability to abstract is crucial in conceptualizing complex systems.