Tuesday, January 22, 2013

Accumulator Machine & Self-modification

A multiprocessor system based on register-less architecture can be pictured as in the image below: tasks run themselves while going around and chasing one another through the processor elements.

smp_image.png

The internal state of each task should be as small and light as possible so that it can move between processors, which calls for an architecture with as few registers as possible.  The most typical such architecture is the accumulator machine.  An accumulator machine executes instructions of the following form, for example.

ADD memory_address

This instruction adds the value stored in memory at memory_address to the accumulator Acc, the processor's only general-purpose register.

Acc <= Acc + (memory value at memory_address)

Although it is well suited to shrinking the circuit, an accumulator machine is hard to make fast because it increases memory access.  To execute “ADD memory_address,” for example, it needs at least two memory accesses: one to read the ADD instruction together with memory_address, and a second to read the value stored at memory_address.
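To make the access count concrete, here is a minimal Python sketch of such a machine.  The class name AccumulatorMachine, the tuple encoding of instructions and the accesses counter are all assumptions made for illustration; they do not describe any real design.

```python
class AccumulatorMachine:
    """Toy accumulator machine that counts its memory reads (illustrative only)."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.acc = 0       # Acc, the only general-purpose register
        self.pc = 0        # address of the next instruction
        self.accesses = 0  # memory reads performed so far

    def read(self, address):
        self.accesses += 1
        return self.memory[address]

    def step(self):
        opcode, operand = self.read(self.pc)  # access 1: ADD and memory_address
        self.pc += 1
        if opcode != "ADD":
            raise ValueError(f"unsupported opcode: {opcode!r}")
        self.acc += self.read(operand)        # access 2: the value at memory_address


# Assumed layout: two instructions at addresses 0-1, data at addresses 2-3.
machine = AccumulatorMachine([("ADD", 2), ("ADD", 3), 10, 32])
machine.step()
machine.step()
print(machine.acc, machine.accesses)  # 42 4
```

Two ADD instructions cost four memory reads in this model, which is the doubling described above.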

The accumulator machine has been all but abandoned as outdated, because current processor development aims at minimizing memory access.  Apart from this weakness, however, it has properties that make it a good building block for a small-scale multiprocessor.  I will therefore introduce a technique for building a small-scale multiprocessor from an improved accumulator machine.

If the processor has registers other than the accumulator, one of the memory accesses can be replaced by a register read, by specifying a register number instead of a memory address:

ADD register_number  {Acc <= Acc + (register value)}

Even so, the register can be read only after the register number is known, so it cannot be read at the same time as the instruction.  In practice, an instruction is executed in at least four stages:

  1. Read instruction and register number
  2. Read register
  3. Do addition
  4. Write result

Most current processors achieve high performance by running these stages simultaneously through pipelining.  In practice, however, making this technique perform well takes a variety of additional measures, which is what makes current processors so complicated.

In the end, this problem is unavoidable, even with registers, unless the parameters needed to execute an instruction (for addition, both the left-hand side and the right-hand side) are already fixed when the instruction is fetched.  The only solution is to limit instruction parameters to fixed values (immediates, or literals).  For the ADD instruction, it looks like this:

ADD immediate  {Acc <= Acc + (immediate value)}

The processor needs to access memory only once if it can read the instruction and its parameter at the same time.  It can then perform the addition immediately and basically has no need for complicated pipeline processing.  Of course, a processor whose programs contain only fixed parameters would be all but useless, but this can be solved with self-modifying code.  More specifically, a PUT instruction replaces the parameter of the instruction at the address given by PUT's own parameter with the current value of the accumulator:

PUT instruction_address  {[instruction_address] <= Acc}

Because the parameter of a PUT instruction can itself be rewritten by another PUT instruction, arrays and other forms of indirect addressing are also available.  In effect, the PUT instruction makes the parameter of every instruction usable as a general-purpose register.
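The following Python sketch shows how ADD immediate and PUT might work together.  It is only an illustration under assumptions: the [opcode, operand] cell encoding, the run function, and the LOAD, HALT and DATA opcodes are hypothetical additions needed to make a runnable example, and only ADD and PUT follow the definitions above.

```python
def run(program, max_steps=1000):
    """Run a list of [opcode, operand] cells and return (acc, memory)."""
    memory = [list(cell) for cell in program]  # copy, so the caller's program is untouched
    acc, pc = 0, 0
    for _ in range(max_steps):
        opcode, operand = memory[pc]  # one read fetches the opcode and its immediate
        pc += 1
        if opcode == "LOAD":      # hypothetical helper: Acc <= immediate
            acc = operand
        elif opcode == "ADD":     # Acc <= Acc + immediate
            acc += operand
        elif opcode == "PUT":     # [instruction_address] <= Acc
            memory[operand][1] = acc
        elif opcode == "HALT":    # hypothetical: stop and report the state
            return acc, memory
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    raise RuntimeError("step limit reached")


# The operand of the ADD at address 3 acts as a variable patched at run time.
variable_program = [
    ["LOAD", 5],
    ["PUT", 3],    # operand of instruction 3 becomes 5
    ["LOAD", 10],
    ["ADD", 0],    # executes as ADD 5
    ["HALT", 0],
]

# A PUT patches the operand of another PUT, giving an indirect store.
indirect_program = [
    ["LOAD", 6],
    ["PUT", 3],    # instruction 3 becomes PUT 6
    ["LOAD", 99],
    ["PUT", 0],    # executes as PUT 6
    ["HALT", 0],
    ["DATA", 0],   # address 5: left untouched
    ["DATA", 0],   # address 6: receives 99
]

print(run(variable_program)[0])        # 15
print(run(indirect_program)[1][5:7])   # [['DATA', 0], ['DATA', 99]]
```

In the second program the target address of the store is computed at run time, which is the kind of indirect addressing an array access would need.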

Self-modifying code is rarely used on current processors except in special cases.  Security and memory protection are issues, but the biggest reason seems to be that self-modification works very poorly with cache memory and instruction pipelines.  Register-less architecture puts the spotlight on this out-of-the-mainstream technique and uses it aggressively.

In other words, register-less architecture revives two non-mainstream techniques, the accumulator machine and self-modifying code, as building blocks of a small-scale multiprocessor.

To be continued…

Friday, January 18, 2013

Multiprocessor & Memory Access

In recent years, multi-core processors have become common in notebook PCs and even in mobile devices, and quad-core smartphones are now on the market.  So why not add more cores?  It is not that easy.

One argument against it is Amdahl’s law: programs cannot always be parallelized, and the part that stays serial limits the overall speedup.  Even if a program could be fully parallelized, insufficient memory access throughput would inevitably hold performance down (the von Neumann bottleneck).  To compute, a CPU has to read programs and data from memory and then write results back to memory.  No matter how fast the CPU itself runs, it cannot do much unless it can reach memory at the same speed.
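As a rough illustration of Amdahl’s law, the sketch below computes the ideal speedup 1 / ((1 - p) + p / n) for a program whose fraction p can be parallelized across n processors.  The function name amdahl_speedup and the sample value p = 0.9 are assumptions chosen for the example, and memory throughput is ignored entirely.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup predicted by Amdahl's law (memory effects ignored)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Assumed example: 90% of the program can run in parallel.
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallelizable, the speedup can never exceed 10 no matter how many cores are added, and that is before the memory bottleneck is taken into account.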

So, how are current multiprocessors addressing this problem?

cache_memory_en.png

Most multiprocessors are configured with distributed cache memory, as shown in the figure above.  In this configuration, each processor copies the parts of the programs and data it needs into its cache memory and works on them there, drastically reducing the need to access shared memory.  Because cache memory and the shared bus usually run much faster than external shared memory, a multi-core processor gains performance as the number of internal processor cores increases.

Since memory access directly determines the performance of a processor, cache memory is the component that consumes the largest area and the most electric power in a processor.  The original role of cache memory is to cover the speed difference between external memory and the inside of the processor, but with cache coherency it has also become indispensable for simultaneous memory access in multi-core processors.  In a processor with on-chip memory, however, cache memory is of course wasteful and unnecessary.

Then how can a small-scale multiprocessor with embedded memory be built?  The first option is brute force.

shared_memory_en.png

Multiport memory theoretically allows the shared memory to be connected directly to the processors.  However, it reduces area efficiency (the required area is said to be proportional to the square of the number of ports) and is unavailable in an FPGA, whose memory usually has at most two ports.

Another choice is to give up shared memory and instead assign local memory to each processor, linked by a communication bus.  This arrangement makes the hardware easier to design and so is relatively common.  The Cell processor of the PS3, for example, has a similar structure.

local_memory_en.png

Although the hardware is easy to design, the software becomes all the harder to design.  In small-scale processors in particular, dividing the already scarce main memory among the processors is out of the question.  The system could also be designed so that a processor can somehow access the local memory of other processors, but that seems inefficient given its complexity.

Another possible structure divides shared memory not per processor but into memory banks, each corresponding to an address area.

crossbar_switch_en.png

A crossbar switch enables simultaneous access as long as each processor targets a different memory bank.  When several processors try to access the same memory bank, however, all but one are blocked.  The problem is that the crossbar mechanism rapidly becomes more complicated as the number of processors increases.  There has long been a great deal of research on how to simplify the mechanism (like this study, for example).

Interesting results appear if memory is divided into banks by the remainder of each address divided by the number of processors.  In the figure above, for example, memory banks 0 through 3 are the memory areas whose addresses leave remainders 0 through 3 when divided by 4.  Suppose that at some point in time four processors have execution addresses 100, 201, 302 and 403.  The remainders of these addresses divided by 4 are 0, 1, 2 and 3, so the processors can fetch their operation codes from their respective memory banks simultaneously.  At the next point in time, the execution addresses become 101, 202, 303 and 404, and the remainders are 1, 2, 3 and 0, again allowing simultaneous access to the respective memory banks.  In this way, the four processors stay simultaneously accessible unless one of them executes a jump instruction.
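The arithmetic above can be checked with a few lines of Python.  The helper names bank_of and is_conflict_free are made up for this sketch; the addresses are the ones used in the example.

```python
def bank_of(address, bank_count):
    """Bank that holds an address when memory is interleaved by remainder."""
    return address % bank_count


def is_conflict_free(addresses, bank_count):
    """True if every address falls into a different bank."""
    banks = [bank_of(address, bank_count) for address in addresses]
    return len(set(banks)) == len(banks)


# Four processors executing straight-line code, one instruction per step.
execution_addresses = [100, 201, 302, 403]
for _ in range(3):
    banks = [bank_of(pc, 4) for pc in execution_addresses]
    print(execution_addresses, banks, is_conflict_free(execution_addresses, 4))
    execution_addresses = [pc + 1 for pc in execution_addresses]
```

If one processor jumped, say from 201 to 200, it would land in bank 0 together with the processor at address 100, and is_conflict_free would report the collision.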

If the group of processors could be shifted around in this situation, there would be no need for a crossbar mechanism at all.  It is of course impossible to physically rotate processors within an IC, so the actual method is to transfer the entire internal state of every processor along a chain of bus connections.

cyclic_smp_en.png

Since a processor’s internal state naturally includes the values of all its registers, as many values as the processor has registers must be sent to the next processor.  That means the fewer registers a processor has, the more efficient the implementation can be.

This is what makes a small-scale multiprocessor based on register-less architecture feasible.
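The state transfer described above might be modeled as follows.  This is a sketch under assumptions: each processor element is taken to be wired only to the memory bank with the same number, a task state is reduced to a hypothetical (pc, acc) pair, and the function name rotate is invented here.  Jumps are not modeled.

```python
N = 4  # processor elements and memory banks, as in the example in the text


def rotate(states):
    """Advance every task by one instruction and pass it to the next element.

    states[i] is the (pc, acc) pair held by processor element i, which is
    assumed to be connected only to memory bank i.
    """
    advanced = []
    for element, (pc, acc) in enumerate(states):
        if pc % N != element:
            raise ValueError(f"task at pc {pc} is not at the element for its bank")
        advanced.append((pc + 1, acc))  # straight-line code only
    # Element i hands its state to element (i + 1) % N along the chain.
    return [advanced[-1]] + advanced[:-1]


states = [(100, 0), (201, 0), (302, 0), (403, 0)]
for _ in range(N):
    states = rotate(states)
print(states)  # [(104, 0), (205, 0), (306, 0), (407, 0)]
```

Each handoff in this model moves just two values per task, the program counter and the accumulator; every extra register would add one more value to every transfer.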

To be continued…

Tuesday, January 15, 2013

What is “Register-less Architecture”?

“Register-less Architecture” is literally a microprocessor architecture that omits registers.  That said, a processor with no registers at all cannot do anything.  The processor in this case therefore has a minimal set of registers and can also use its main memory as general-purpose registers.

Several classic processors that predate RISC used part of main memory in place of registers.
A typical example was the 8-bit 6502 CPU, used in the Apple II and Commodore machines.  To make up for its few internal registers, the 6502 offered a 256-byte area of main memory, addresses 0 through 255, as a zero page that served as virtual general-purpose registers.  This kind of design was possible at the time because, with CPU clock speeds of no more than 1 to 2 MHz, there was little speed difference between internal registers and external main memory.

Current high-performance processors, which run internally at GHz clock rates, depend on large numbers of registers and on cache memory.  The circumstances are different, however, when an FPGA is used as a microcontroller that holds both main memory and a processor.  This configuration needs no external memory, but it has to divide the precious internal memory resources between the register file and main memory.  The more registers there are, the less main memory remains.  And since the registers and main memory use the same internal memory after all, reducing main memory access by using registers has little benefit.

In this context, the primary feature of the “register-less architecture” is that the entire memory embedded in an FPGA is available as main memory.

Another significant feature of the architecture is multi-core availability.
Multi-core is a form of multiprocessor: simply put, an IC packed with several processor cores.  In earlier times, the degree of IC integration was so low that processors were usually packaged separately.  Because processors exchange signals over many lines at high speed, designing the mainboard PCB was all the more complicated and required advanced techniques.  Increasing integration density made multi-core integration possible, which is compact and eases PCB design, and as a result most multiprocessors in use today are multi-core processors.

Designing a mainboard on which multiprocessors send high-speed signals across the PCB may sound like mission impossible to hobby users, but as long as everything stays inside an FPGA it is achievable just by struggling with EDA tools.  Once a multi-core processor and main memory are packed into an FPGA, a hardware description language is all you need to complete the design.  For now, this is probably almost the only way for hobby users to design a multi-core system.  However, this method requires a multi-core processor with a circuit compact enough (usually incredibly so) to fit in a single FPGA.  Register-less architecture also serves as a design methodology for that.

To be continued…