Z8086: Rebuilding the 8086 from Original Microcode
Posted by nand2mario 3 days ago
Comments
Comment by tasty_freeze 2 days ago
Comment by ErroneousBosh 2 days ago
Want a different architecture? Sure, just draw it with a different ROM. Simple (if you've got IBM money to throw around).
Comment by tasty_freeze 2 days ago
The book also had a glossary section in the back and a number of the entries were funny. One I recall was his definition for "methodology", which was something like "A word people use when 99% of the time they mean 'method'."
Comment by slartibardfast0 2 days ago
Comment by ErroneousBosh 2 days ago
Comment by adrian_b 2 days ago
Comment by ErroneousBosh 2 days ago
Comment by jecel 2 days ago
Comment by tom_ 2 days ago
Comment by adrian_b 2 days ago
They assume that instructions have been fetched concurrently without ever causing a stall and that memory accesses are implemented with 0 wait states.
In reality, instruction fetching was frequently a bottleneck and implementing a memory with 0 wait states for 80286 was much more difficult than for MC68000 or MC68010.
With the available DRAM, normally both 80286 and 80386 would have needed a cache memory. Later, after the launch of 80386DX, cache memories became common on 386DX MBs, but I have not seen any 80286 motherboard with cache memory.
They might have existed at an earlier time when the 286 was the highest end, but by the time it coexisted with the 386, the 286 had become the cheap option, so its motherboards never had cache memory. Memory accesses therefore always had wait states, which increased the probability of instruction fetch bottlenecks and resulted in significantly more clock cycles per instruction than in the datasheet.
Comment by LargoLasskhyfv 1 day ago
Not true. I vaguely remember servicing systems with chipsets from OPTI (only 2 large ones) having it. IIRC those were functional (not exact) clones of Chips&Technologies NEAT (4 to 5 large chips), later shrunk to one by SCAT (Single Chip AT).
Also, in times when the 386 ran at 33 MHz, or even at 40 if made by AMD, Compaq introduced 386SX systems with cache, and I remember wondering "why, oh why?". Talk about overengineering...
Comment by raphlinus 2 days ago
Comment by adrian_b 2 days ago
MC68000 and MC68010 had essentially the same addressing modes as the 80286, i.e. indexed addressing with up to 3 components (base register + index register + displacement).
The difference is that the addressing modes of MC68000 could be used in a very regular way. All 8 address registers were equivalent, all 8 data registers were equivalent.
In order to reduce the opcode size, 80286 and 8086 permitted only certain combinations of registers in the addressing modes and they did not allow auto-increment and auto-decrement modes, except in special instructions with dedicated registers (PUSH, POP, MOVS, CMPS, STOS, LODS), resulting in an instruction set where no 2 registers are alike and increasing the cognitive burden of the programmer.
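To make the "only certain combinations of registers" point concrete, here is a minimal sketch of the eight 16-bit r/m encodings of the 8086/80286 ModR/M byte. The table is the standard encoding, not something spelled out in this thread, and the names `RM_TABLE` and `encodable` are made up for illustration.

```python
# r/m field -> registers summed to form a 16-bit effective address
# on the 8086/80286 (an optional displacement may be added on top).
RM_TABLE = {
    0b000: ("BX", "SI"),
    0b001: ("BX", "DI"),
    0b010: ("BP", "SI"),
    0b011: ("BP", "DI"),
    0b100: ("SI",),
    0b101: ("DI",),
    0b110: ("BP",),  # with mod=00 this slot means a bare 16-bit displacement
    0b111: ("BX",),
}


def encodable(*regs):
    """Return True if this register combination has an r/m encoding."""
    wanted = frozenset(regs)
    return any(frozenset(combo) == wanted for combo in RM_TABLE.values())


print(encodable("BX", "SI"))  # a base plus an index register
print(encodable("BX", "BP"))  # two base registers
print(encodable("AX"))        # AX cannot address memory at all
```

Only BX and BP can act as a base and only SI and DI as an index, which is why no two registers end up interchangeable.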
80386 not only added extra addressing modes taken from DEC VAX (i.e. scaled indexed addressing) but it also made the addressing modes much more regular than those of 8086/80286, even if it preserved the restriction of auto-increment and auto-decrement modes to a small set of special instructions.
Comment by retrac 2 days ago
Comment by jecel 2 days ago
Comment by CodeWriter23 2 days ago
Comment by tasty_freeze 2 days ago
LDDR was the same but decremented HL and DE on each iteration instead.
There were versions for doing IN and OUT as well, and there was an instruction for finding a given byte value in a string, but I never used those so I don't recall the details.
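For readers who have not used the Z80 block instructions, a rough model of LDIR and LDDR follows. It assumes the documented behaviour (copy the byte at HL to DE, step HL and DE, decrement BC, repeat until BC is zero); the function name `block_copy` and the `bytearray` standing in for memory are illustrative only, and timing is not modelled.

```python
def block_copy(mem, hl, de, bc, step):
    """Model LDIR (step=+1) or LDDR (step=-1) over a 64 KiB bytearray.

    Returns the final (HL, DE, BC) register values.
    """
    while True:
        mem[de] = mem[hl]            # (DE) <- (HL)
        hl = (hl + step) & 0xFFFF    # 16-bit wraparound
        de = (de + step) & 0xFFFF
        bc = (bc - 1) & 0xFFFF       # BC=0 on entry therefore means 65536 bytes
        if bc == 0:
            return hl, de, bc


mem = bytearray(0x10000)
mem[0x100:0x104] = b"ABCD"

# LDIR: start at the lowest addresses and count upwards.
print(block_copy(mem, 0x100, 0x200, 4, +1))
# LDDR: start at the highest addresses and count downwards.
print(block_copy(mem, 0x103, 0x303, 4, -1))
```

The descending form matters when source and destination overlap and the destination sits higher in memory.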
Comment by CodeWriter23 1 day ago
I was referring to LODSB/W (x86) which is quite useful for processing arrays.
Comment by rasz 2 days ago
https://retrocomputing.stackexchange.com/questions/4744/how-...
Repeat is done by decrementing PC by 2 and re-loading whole instruction in a loop. 21 cycles per byte copied :o
To be fair, Intel shipped the same flawed implementation of REP MOVSB/MOVSW in the 8088/8086, reloading the whole instruction on every iteration. REP MOVSW is ~14 cycles/byte on the 8088 (9+27/rep) and ~9 cycles/byte on the 8086 (9+17/rep), roughly the same cost as the non-REP versions (28 and 18). The NEC V20/V30 improved on that by almost 2x: 8 cycles/byte on the V20 or an unaligned V30 (11+16/rep) and 4 cycles/byte on a fully aligned V30 (11+8/rep), with the non-REP cost being 19 and 11 respectively. The V30 pretty much matched the Intel 80186 at 4 cycles/byte (8+8/rep, 9 non-REP). The 286 was another jump, to 2 cycles/byte (5+4/rep). The 386 ran at the same speed; the 486 was much slower for small rep counts but under a cycle per byte for big REP MOVSD. The Pentium reached up to 0.31 cycles per byte and MMX 0.27 cycles/byte (http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...), then 2009 AVX doing block moves at full L2 cache speed, and so on.
In the 6502 corner there was nothing until the 1986 WDC W65C816 with Move Memory Negative (MVN) and Move Memory Positive (MVP) at 7 cycles/byte. That is slower than unrolled code, and 2x slower than unrolled code using zero page. It is a similarly bad implementation (no loop buffer), re-fetching the whole instruction every iteration.
Then in 1987 came the HuC6280, the HudsonSoft 6502 clone in the NEC TurboGrafx-16/PC Engine, with Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD) and Transfer Increment Increment (TII) at a theoretical 6 cycles/byte (17+6/rep). I saw one post a long time ago claiming block transfer throughput of ~160KB/s on a 7.16 MHz NEC-manufactured TurboGrafx-16 (a hilarious 43 cycles/byte), so I don't know what to think of it, considering that an NEC V20 inside an OG 4.77 MHz IBM XT does >300KB/s.
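The "setup + cycles/rep" figures quoted above turn into cycles per byte with a one-line formula. This is only a sketch using the numbers from this comment; `cycles_per_byte` and `REP_MOVSW` are illustrative names, and wait states and prefetch effects are ignored.

```python
def cycles_per_byte(setup, per_rep, bytes_per_rep, reps):
    """Average cost per byte of a REP string move of `reps` iterations."""
    return (setup + per_rep * reps) / (bytes_per_rep * reps)


# (setup cycles, cycles per repetition) for REP MOVSW, as quoted in the thread.
REP_MOVSW = {
    "8088": (9, 27),
    "8086": (9, 17),
    "V30 aligned": (11, 8),
    "80286": (5, 4),
}

for cpu, (setup, per_rep) in REP_MOVSW.items():
    # MOVSW moves 2 bytes per repetition; 1000 reps amortises the setup cost.
    print(cpu, round(cycles_per_byte(setup, per_rep, 2, 1000), 2))
```

For short copies the setup term dominates: a single repetition on the 8088 costs (9 + 27) / 2 = 18 cycles per byte by the same formula.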
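The cycles/byte implied by a measured throughput is just clock rate divided by bytes per second. A quick check of the two figures above, assuming "KB" means 1024 bytes (the thread does not say) and using a made-up helper name:

```python
KB = 1024  # assumption: the quoted KB/s figures use binary kilobytes


def implied_cycles_per_byte(clock_hz, bytes_per_second):
    """Clock cycles spent per byte at a given sustained transfer rate."""
    return clock_hz / bytes_per_second


# Claimed TurboGrafx-16 block transfer: ~160 KB/s at 7.16 MHz.
print(round(implied_cycles_per_byte(7.16e6, 160 * KB), 1))
# NEC V20 in a 4.77 MHz IBM XT: 300 KB/s taken as the lower bound.
print(round(implied_cycles_per_byte(4.77e6, 300 * KB), 1))
# What the theoretical 6 cycles/byte would give at 7.16 MHz, in KB/s.
print(round(7.16e6 / 6 / KB))
```

The gap between the first figure and the theoretical 6 cycles/byte is what makes the 160KB/s claim look suspicious.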
CPU / Instruction              Cycles per Byte
Z80 LDIR (8-bit)               21
8088 MOVSW (8-bit)             ~14
6502 LDA/STA (8-bit)           ~14
8086 MOVSW                     ~9
NEC V20 MOVBKW (8-bit)         ~8
W65C816 MVN/MVP (8-bit)        ~7 (block move)
HuC6280 T[DIAX]/TIN (8-bit)    ~6 (block transfer instructions)
80186 MOVSW (16-bit)           ~4
NEC V30 MOVSW                  ~4
80286 MOVSW                    ~2
486 MOVSD                      <1
Pentium MOVSD                  ~0.31
Pentium MMX MOVSD              ~0.27
http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DURT1.HTM
Comment by rep_lodsb 2 days ago
CPU                      Cycles per           Theoretical minimum per byte for block move
Z80 instruction fetch    4 byte
Z80 data read/write      3 byte               6
80(1)88, V20             4 byte               8
80(1)86, V30             4 byte/word          4
80286, 80386 SX          2 byte/word          1
80386 DX                 2 byte/word/dword    0.5
LDIR (etc.) are 2 bytes long, so that's 8 extra clocks per iteration. Updating the address and count registers also had some overhead. The microcode loop used by the 8086/8088 also had overhead; this was improved in the following generations. Then it became somewhat neglected since compilers / runtime libraries preferred to use sequences of vector instructions instead.
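Putting the Z80 numbers from this thread together gives a rough budget for one LDIR iteration. This is plain arithmetic on the figures quoted here (4 cycles per opcode byte fetched, 3 per data access, 21 in total); the constant names are illustrative.

```python
FETCH_CYCLES = 4   # per opcode byte fetched
DATA_CYCLES = 3    # per data read or write
LDIR_LENGTH = 2    # LDIR is a 2-byte instruction, re-fetched every iteration
LDIR_TOTAL = 21    # cycles per byte copied, as quoted earlier in the thread

fetch = FETCH_CYCLES * LDIR_LENGTH      # re-fetching the opcode
data = 2 * DATA_CYCLES                  # one read plus one write
overhead = LDIR_TOTAL - fetch - data    # register updates and loop control

print(fetch, data, overhead)
```

On these figures about a third of each iteration goes to neither fetching nor moving data.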
And with modern processors there are a lot of complications due to cache lines and paging, so there's always some unavoidable overhead at the start to align everything properly, even if then the transfer rate is close to optimal.
Comment by adrian_b 2 days ago
Moreover, the cache memories used with 286/386SX/386DX were normally write-through, which means that they shortened only the read cycles, not the write cycles. Such caches were very effective at reducing the performance impact of instruction fetching, but they brought little or no improvement to block transfers. The caches were also very small, so any sizable block transfer would flush the entire cache, and then all transfers would be done at DRAM speed.
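A toy model shows why a write-through cache helps a code loop but not a block copy. It is a sketch under loud assumptions: a direct-mapped cache with one-byte lines, no allocation on write, and sizes picked purely for illustration; it does not describe any particular 286/386 chipset.

```python
class WriteThroughCache:
    """Direct-mapped, write-through, no-write-allocate cache (toy model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # address currently held in each line
        self.hits = 0
        self.dram_accesses = 0

    def read(self, addr):
        index = addr % self.num_lines
        if self.tags[index] == addr:
            self.hits += 1               # served from the cache
        else:
            self.tags[index] = addr      # fill the line from DRAM
            self.dram_accesses += 1

    def write(self, addr):
        # Write-through: every write goes to DRAM whether or not it hits.
        self.dram_accesses += 1


# A 16-byte instruction loop executed 100 times.
loop = WriteThroughCache(1024)
for _ in range(100):
    for addr in range(16):
        loop.read(addr)
print(loop.hits, loop.dram_accesses)

# A 4096-byte block copy through the same size of cache.
copy = WriteThroughCache(1024)
for i in range(4096):
    copy.read(0x10000 + i)
    copy.write(0x20000 + i)
print(copy.hits, copy.dram_accesses)
```

In this model the loop runs almost entirely out of the cache, while every access of the copy goes to DRAM.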
Comment by rasz 1 day ago
"12MHz/0 wait state with 100ns DRAM."
another https://theretroweb.com/chip/documentation/neat-6210302843ed...
"The processor can operate at 16MHz with 0.5-0.7 wait state memory accesses, using 100 nsec DRAMs. This is possible through the Page Interleaved memory scheme."