Global predictors using address information [Scott McFarling, Report TN-36, 1993]

Global Select (Gselect): The Gselect prediction scheme is an improved version of the global predictor which uses both the global history register and the address of the current branch to index the 2nd-level 2-bit saturating counters. As seen in Fig. 1, the Gselect predictor concatenates the last few bits of the branch address with the global history register to index a particular 2-bit saturating counter in the 2nd-level counter table.
This scheme shows some improvement over the plain global predictor, which lacks information about the branch currently being executed.

Fig. 1- Global Select Predictor

Global Share (Gshare): This is another scheme proposed by Scott McFarling that takes advantage of both the global history and the program counter. Here a larger number of bits (compared to Gselect) of each of the PC and the global history are used to reference the same number of 2nd-level 2-bit saturating counters.

Fig. 2- Global Share Predictor

Fig. 2 shows that m bits of the program counter are XORed with n bits (m = n) of the global history register to index a particular 2-bit saturating counter in the 2nd-level counter table.

Performance Analysis: The SimpleScalar simulator we used for our analysis supports the Gshare algorithm but not Gselect. As we wanted to compare the performance statistics of Gshare and Gselect, we made a few modifications to the code, without changing the working flow of the simulator, to implement the Gselect algorithm. We made these changes in the branch prediction module of the simulator, named bpred. The functions modified were bpred_lookup(), which is used to get the predicted PC, and bpred_update(), which is used to update the history bits. After implementing Gselect, we compared the performance of Gselect and Gshare with the help of two integer benchmarks, bzip and gzip.

Fig. 3- Performance Statistics of Gshare & Gselect Predictors for the gzip Integer benchmark

The graphs above show the statistics of the Gshare and Gselect branch prediction schemes with different configurations.

Fig. 4- Performance Statistics of Gshare & Gselect Predictors for the bzip Integer benchmark
The variable parameter was the number of global history bits, which we varied from 8 to 14 for Gshare, thus referencing 256 to 16384 2nd-level counters. For Gselect the history bits used were 4 to 7, which, concatenated with an equal number of address bits, again referenced 256 to 16384 2nd-level counters. The results were very intuitive: for any number of history bits used, Gshare always outperformed Gselect. This was very much in line with our expectations, as XORing n history bits with n PC bits carries more information than the concatenation of n/2 bits of each. This leads to better prediction accuracy for Gshare than for Gselect.
Overall Performance Comparison: To compare all the branch prediction schemes discussed so far in this paper, we kept the level-2 table size fixed for all the schemes. The size chosen was 2048 2-bit saturating counters, which we found to be optimal by running various configurations of the individual schemes.

Fig. 5- Configurations for different Branch Prediction Schemes

For this fixed 2nd-level table size we configured all the other parameters as shown above in Fig. 5. We used two different configurations each for GAp and PAp to get better insight and a clearer overall picture.
Fig. 6- Branch Prediction accuracies for different Branch Prediction Schemes

The branch target buffer was kept constant for all the schemes, with 512 entries and 4-way set associativity. The same 4 benchmarks were used for the overall comparison. The table above in Fig. 6 shows the percentage prediction accuracies for all the above-mentioned schemes and configurations.

Inferences: By comparing the various schemes we were able to draw the inferences listed below.

Dynamic branch prediction schemes have higher prediction accuracies than static schemes
This can be explained in the following way. Different data sets give a program different dynamic branch behaviors, so a program's branches cannot be predicted well by a simple static scheme. Even for the same data set, the same branch instruction is executed several times during the run of the program and exhibits different branch patterns at different points within that run, so its behavior still cannot be predicted statically. That is, its branch behavior changes dynamically.
So the hardware-based dynamic branch predictors always work better than the software-based static branch predictors do.

The Not-taken scheme is the worst among all
In programs, branch statements are more inclined to be taken than not taken; this holds for the majority of programs. So the Taken branch prediction strategy usually works better than the Not-taken strategy.

The Bimodal scheme shows the best performance among all
This can be explained in the following manner.
As we use the same total number of 2-bit counters for all the schemes, the number of branches tracked simultaneously is largest for the bimodal scheme. Of that total, every scheme other than bimodal devotes some of its 2-bit counters to the states corresponding to the history bits. Thus a single instruction occupies several counters in all the other schemes, whereas the bimodal scheme uses only one counter per instruction, tracking a larger number of branches simultaneously and in turn improving performance.
PAg shows the worst performance among all the dynamic schemes discussed
In the PAg branch predictor, the branch histories of all the branch instructions share the same global two-bit counter table, which leads to interference between branches. Therefore it cannot achieve as high a performance as the other branch predictors.

PAp has the potential of giving the best performance if there were no cost restrictions
PAp has per-address branch history at the 1st level as well as at the 2nd level. This history can track the past records of the various branches on a per-address basis.
This can give the maximum performance, but it requires a very large number of counters in the second-level counter table. Thus, if there is no restriction on cost, PAp can be configured to give the best results.

PAp-2 shows higher prediction accuracy than PAp-1
We can see from the configuration table that PAp-2 uses fewer history bits than PAp-1. At the same space cost, a longer shift register produces too many branch history patterns compared to the number of instructions executed, so each branch history pattern cannot get enough training to reach high performance.
Therefore PAp with a shorter shift register achieves higher prediction accuracy.

GAp-2 shows higher prediction accuracy than GAp-1
We can see from the configuration table that GAp-2 uses fewer history bits than GAp-1. GAp-1 spends more of its 2-bit counters on tracking past history patterns, whereas GAp-2 uses fewer counters per history but tracks more branches simultaneously. This shows that using the 2-bit counters to track more branches is more beneficial than using them for the various states of past histories.
An increase in history bits beyond a certain limit decreases the performance of the predictor
More history bits imply more history patterns. As the number of history patterns increases, each branch pattern cannot get enough training, which increases the number of initial mispredictions. Thus increasing the history bits helps up to a point, after which performance deteriorates.