Saturday, January 28, 2006

Processor Pipeline

This is probably one of the most confusing parts of the processor to understand. The processor pipeline is like a conveyor belt. Imagine that the instructions the processor carries out are items of food. The food moves along the conveyor belt, and when it reaches the other end, that instruction is done. That is the simple picture, but it is not technically complete. What really happens is that an instruction is fetched from the cache, then moves on to the next stage of the pipeline, and so on until it is finished. Pipelines come in many different lengths, depending on which processor you have. This helps explain how AMD manages to compete with Intel, even though Intel's processors run at much higher clock speeds.
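The conveyor-belt idea can be sketched in a few lines of Python. The four stage names here are a simplified textbook pipeline, not any particular processor's; the point is just that several instructions are on the belt at once, each in a different stage:

```python
# A toy four-stage pipeline. Each instruction enters one stage per cycle;
# a new instruction starts every cycle, so the stages overlap.
STAGES = ["fetch", "decode", "execute", "writeback"]

def pipeline_timeline(instructions):
    """Return {cycle: [(instruction, stage), ...]} for an ideal in-order pipeline."""
    timeline = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i reaches stage s on cycle i + s.
            timeline.setdefault(i + s, []).append((instr, stage))
    return timeline

for cycle, work in sorted(pipeline_timeline(["ADD", "SUB", "MUL"]).items()):
    print(cycle, work)
```

Notice that three instructions take six cycles in total, not twelve: while ADD is being decoded, SUB is already being fetched behind it.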

The different stages of the pipeline perform different jobs. Some stages are duplicated, and there can be more than one pipeline, which is why modern processors are said to have a superscalar architecture. The point of duplicating parts of the pipeline is that more instructions can be worked on at once, so more instructions are completed in the same amount of time, improving performance. This is one of the key reasons AMD is able to contend with Intel. AMD's Athlon XP has three x86 decoders, three floating-point pipelines, and three integer pipelines. Compare this with Intel's Pentium 4, which has only one x86 decoder, two floating-point pipelines, and three integer pipelines. This means AMD can decode more instructions at the same time, and can perform floating-point operations quicker than Intel. Overall, an Athlon XP can perform nine operations per clock cycle while the Pentium 4 can only manage six. That may not sound like much, but in a processor every operation counts. This is why I said AMD is more about getting more done per clock cycle in my AMD processor buying guide.

This doesn't mean it's all over for Intel, though. Even though AMD performs more operations per clock cycle, Intel performs each operation quicker, and that comes down to pipeline architecture. AMD's pipeline is only 10 stages long, which means each stage has to do more work, so the pipeline can't be clocked very fast. Intel's processors, by contrast, have a 20-stage pipeline (Prescott-core processors have 31 stages). Because less work is done in each stage, the processor can run at a higher clock speed. This is the reason the Prescott core was released: with 31 stages, even less work is done per stage, which allows even higher clock speeds.
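There is a catch to deep pipelines, though, which a quick back-of-the-envelope calculation shows. The stage counts below match the figures above, but the clock speeds are just illustrative numbers, not exact products:

```python
# A faster clock shortens each cycle, but a longer pipeline needs more
# cycles to fill up again after it has been emptied (e.g. by a flush).
def time_to_refill(stages, clock_ghz):
    """Time in nanoseconds for an instruction to travel the whole pipeline."""
    cycle_ns = 1.0 / clock_ghz
    return stages * cycle_ns

short_pipe = time_to_refill(stages=10, clock_ghz=2.0)  # Athlon-like: 5.0 ns
long_pipe = time_to_refill(stages=20, clock_ghz=3.0)   # Pentium 4-like
print(round(short_pipe, 2), round(long_pipe, 2))
```

Even with a 50% higher clock, the 20-stage pipeline takes longer to refill than the 10-stage one, which matters a great deal once branch mispredictions come into the picture later in this post.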

This can be linked to CISC computing and RISC computing. CISC stands for complex instruction set computing, and RISC stands for reduced instruction set computing — RISC means giving the processor less complex instructions. Here is an example of giving someone commands. With CISC it would look like this:
1. Get food
2. Get fork
3. Eat

But with RISC it would be like this:
1. Go to kitchen
2. Open fridge
3. Get food
4. Close fridge
5. Open drawer
6. Get fork
7. Close drawer
8. Open mouth
9. Put food in mouth
10. Close mouth
11. Chew
12. Swallow

The reason CISC is more complex is that the processor has a lot more to do in one instruction. RISC is more efficient because its instructions are very simple, meaning less "thinking" is needed to perform each one, resulting in faster speeds. Using RISC also means fewer transistors are needed in the processor, reducing cost. This is why all modern processors are RISC at their core. But there is something you should know about x86 (the instruction set the processor's instructions are encoded in): x86 is actually a CISC architecture. This is why the processors need x86 decoders — to convert the CISC instructions into RISC-like instructions.
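The decoding step can be sketched by reusing the food example above. The instruction names and the micro-op table are invented for illustration — real x86 decoders work on machine code, not strings — but the shape of the job is the same: one complex instruction expands into a sequence of simple ones.

```python
# Toy decoder: each complex (CISC-style) instruction expands into a list
# of simple micro-operations (RISC-style), like an x86 decoder does.
MICRO_OPS = {
    "GET_FOOD": ["GO_KITCHEN", "OPEN_FRIDGE", "TAKE_FOOD", "CLOSE_FRIDGE"],
    "GET_FORK": ["OPEN_DRAWER", "TAKE_FORK", "CLOSE_DRAWER"],
    "EAT": ["OPEN_MOUTH", "INSERT_FOOD", "CLOSE_MOUTH", "CHEW", "SWALLOW"],
}

def decode(cisc_program):
    """Expand each complex instruction into its simple micro-ops, in order."""
    return [uop for instr in cisc_program for uop in MICRO_OPS[instr]]

print(decode(["GET_FOOD", "GET_FORK", "EAT"]))
```

The three-instruction CISC program becomes the same twelve simple steps as the RISC list above — the complexity hasn't disappeared, it has just been moved from the instruction set into the decoder.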
Now, even though Intel has the advantage of faster clock speeds, this doesn't work well unless you have a lot of something: cache. Cache is where the instructions the processor is going to work on reside. All the cache's data comes from main memory (RAM). When the processor is about to work on an instruction, it checks the cache to see if the instruction is there. Cache runs at the core speed of the processor, so if you have a 2.6GHz Intel processor, the cache also runs at 2.6GHz. Compare this to main memory, which at the most (without overclocking) runs at 400MHz. If the processor can't find an instruction in the cache, it has to slow right down to main memory speed until it can fetch the next instruction from there. This causes a massive drop in performance, and it is the reason Celeron processors perform so poorly: with less cache, fewer instructions are ready for the processor, increasing the chance that it will have to slow down to main memory speed. The reason Intel's processors are more susceptible to this slowdown lies in the techniques processors use to decide which instruction to work on next.
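The cost of a cache miss can be put into rough numbers. Everything below is made up for illustration — the hit rates and access times are not measurements of any real chip — but the arithmetic shows why a small cache hurts so much:

```python
# Average fetch time = (hit rate x fast cache time) + (miss rate x slow RAM time).
def avg_fetch_time(hit_rate, cache_ns, memory_ns):
    """Average time in nanoseconds to fetch one instruction."""
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns

# Hypothetical numbers: cache at core speed vs. much slower main memory.
big_cache = avg_fetch_time(hit_rate=0.98, cache_ns=0.4, memory_ns=10.0)
small_cache = avg_fetch_time(hit_rate=0.90, cache_ns=0.4, memory_ns=10.0)
print(round(big_cache, 3), round(small_cache, 3))
```

Dropping the hit rate from 98% to 90% — the kind of difference a cut-down cache makes — more than doubles the average fetch time in this sketch, even though the cache itself is just as fast in both cases.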

This is what is called pipeline optimization: the processor always tries to keep the pipeline full. To do this, it uses techniques to guess what will come next. There are three different ways of optimizing the pipeline:

Speculative execution - This is where the CPU reaches a point where the next instruction cannot proceed until the CPU knows the answer to an earlier one. Say there are two possible outcomes, and only one will turn out to be correct. Without speculative execution, the CPU would pick one of the possible paths and send it down the pipeline, which in an Intel CPU takes 20 clock cycles to complete. If the CPU chose the correct path, everything is fine and it can go right on to the next instruction. But what if it chose the wrong one? The CPU has to send the other path down the pipeline, which means 20 clock cycles were wasted on the first. What speculative execution does is send both possible paths down the pipeline, so the CPU processes both and then discards the incorrect one. That way, far less time is wasted.
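Here is a toy sketch of that idea in Python. Real CPUs do this in hardware with in-flight instructions, not functions — the names here are invented — but the pattern is the same: do both pieces of work up front, then throw away the one that turns out to be wrong.

```python
# Toy speculative execution: compute both possible paths before the
# condition is known, then keep the right result and discard the other.
def speculate(condition_result, taken_path, not_taken_path):
    taken = taken_path()          # both paths are executed "in parallel"...
    not_taken = not_taken_path()
    # ...and once the condition resolves, the wrong result is discarded.
    return taken if condition_result else not_taken

result = speculate(True, lambda: "branch A", lambda: "branch B")
print(result)  # branch A
```

The waste is that one path's work is always thrown away — the win is that the pipeline never sat idle waiting for the condition to resolve.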

Branch prediction - This is a tough one, and can mean the difference between running at full speed and having to start again from the beginning of the instruction sequence. It builds on speculative execution. Remember when I said the processor always wants a full pipeline? Having two possible paths doesn't change that. What happens is this: the processor sees the two possible paths and, before executing anything, makes an educated guess from the branch target buffer about which one is correct. So both paths are no longer executed — only the one the CPU predicts will be correct. The branch target buffer is a record of the outcomes of branches the CPU has already resolved, and by looking at this record the CPU can make a good guess about which way a branch will go this time. After sending the predicted path down the pipeline, the instructions that follow the prediction are sent down as well. If the branch prediction was right, a lot of time is saved. If not, the whole pipeline has to be flushed and restarted, because everything depended on the prediction being correct. This is why the Pentium 4 needs more intelligent branch prediction technology: with a long pipeline, it takes a long time for a new set of instructions to reach the end.
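A very stripped-down predictor looks like this in Python. The dictionary plays the role of the branch target buffer, and the "remember the last outcome" rule is the simplest possible prediction scheme — real predictors (especially the Pentium 4's) are far more sophisticated, and the class and field names here are invented:

```python
# Toy last-outcome branch predictor: a dictionary stands in for the
# branch target buffer, remembering what each branch did last time.
class BranchPredictor:
    def __init__(self):
        self.btb = {}  # branch address -> last observed outcome

    def predict(self, addr):
        # Unknown branches are guessed "not taken".
        return self.btb.get(addr, False)

    def update(self, addr, actual):
        self.btb[addr] = actual

bp = BranchPredictor()
history = [True, True, True, True, False]  # a loop branch: taken 4x, then exits
correct = 0
for outcome in history:
    if bp.predict(0x401000) == outcome:
        correct += 1
    bp.update(0x401000, outcome)
print(correct, "of", len(history), "predicted correctly")
```

Even this trivial scheme gets 3 of the 5 branches right, because branches tend to repeat what they did last time — which is exactly the property real branch target buffers exploit.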

Out of order execution - This is where the second instruction cannot be performed yet, because the CPU needs the answer to the first instruction before it can work out the second. Without out of order execution, the CPU would execute the first instruction and leave the rest of the pipeline empty — a massive waste of resources. So what happens is this: the CPU executes the first instruction, then executes other instructions that have no dependency on it. This way, the CPU can work on other instructions while it is waiting for the first one.
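The reordering step can be sketched like this. The instruction tuples and the idea of a "completed" set are invented for illustration — real CPUs track dependencies through registers in hardware — but the rule is the one described above: anything that doesn't depend on a still-pending result may run first.

```python
# Toy out-of-order issue: instructions whose dependencies are already
# resolved go first; instructions waiting on a pending result go last.
def schedule(instructions, completed):
    """instructions: list of (name, depends_on or None). Returns issue order."""
    ready = [name for name, dep in instructions
             if dep is None or dep in completed]
    waiting = [name for name, dep in instructions
               if not (dep is None or dep in completed)]
    return ready + waiting

program = [
    ("ADD r2,r1", "LOAD r1"),  # needs the load's result, so it must wait
    ("MUL r3", None),          # independent: can run right away
    ("SUB r4", None),          # independent: can run right away
]
print(schedule(program, completed=set()))
# MUL and SUB issue while ADD waits for the load to finish
```

In program order the pipeline would stall immediately at ADD; out of order, the two independent instructions fill the gap and no cycles are wasted.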
