Performance tuning and scaling on Itanium 2-based servers
( 01 May 2008 )
By Mark Whitener, Software Engineer, Unisys
As data center density increases, optimizing data center resources has become a concern for most IT professionals. To address this challenge, the Itanium Solutions Alliance recently published a white paper, Performance Tuning and Scaling on Itanium 2-based Servers. The paper discusses tuning methodology at the system, application and microarchitecture levels, along with best practices specific to the Intel Itanium 2 processor.
Tune from the top-down
Performance tuning is essential to getting the most from applications running on Itanium 2-based systems. These powerful processors depend more than most on well-optimized code to reach their peak capabilities. There are three levels of performance tuning: system, application and microarchitecture. System-level tuning requires identifying system bottlenecks, such as processors, disks and network configuration. Addressing these bottlenecks gives the highest performance gain. Application-level tuning, the middle level, addresses issues such as locks, heap contention, threading and API calls, which require modifying application semantics or redesigning or re-implementing application components. Microarchitecture tuning, the bottom level, is especially relevant for applications that use 100 percent of the CPU. On Itanium 2-based systems, using a code-optimizing compiler is highly recommended.
Workload and tools
A good workload is measurable, repeatable, static and representative. Measurable because the workload and any performance gains from run to run can be expressed quantitatively; repeatable because the workload is reasonably consistent from one run to the next; static because the ratio of transactions to workload does not change significantly as the workload increases; and representative because the workload matches how an application will be used. The baseline performance measurement should be an average of multiple identical runs.
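As a rough illustration of the baseline advice, the sketch below averages the elapsed times of several identical runs and reports their spread. The names `Baseline` and `summarize_runs` are hypothetical, and reporting a standard deviation is an added assumption rather than something the white paper prescribes; a small spread relative to the mean is one simple sign that the workload is repeatable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of repeated measurements of the same workload (hypothetical type).
struct Baseline {
    double mean;    // average elapsed time over all runs
    double stddev;  // sample standard deviation, 0 for a single run
};

// Averages several identical runs so that one noisy run does not
// become the reference point for later comparisons.
Baseline summarize_runs(const std::vector<double>& seconds) {
    assert(!seconds.empty());
    const std::size_t n = seconds.size();

    double sum = 0.0;
    for (double s : seconds) sum += s;
    const double mean = sum / static_cast<double>(n);

    double squared = 0.0;
    for (double s : seconds) squared += (s - mean) * (s - mean);
    const double stddev =
        n > 1 ? std::sqrt(squared / static_cast<double>(n - 1)) : 0.0;

    return Baseline{mean, stddev};
}
```

A tuning change is then worth keeping only if the new mean differs from the baseline mean by clearly more than the run-to-run spread.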
Stochastic workloads are flexible and lean. They use a probability function to generate customer arrivals and other elements of the workload on the fly. A trace-driven workload replays requests from a real log file or trace. It is more realistic but less flexible, and uses more disk space.
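A stochastic generator can be very small. The sketch below assumes exponentially distributed gaps between arrivals (a Poisson arrival process); the article only says "a probability function", so the choice of distribution, the function name `generate_arrivals` and its parameters are all illustrative assumptions. A fixed seed is used so that the workload stays repeatable from run to run.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Generates arrival timestamps (in seconds) for `count` requests.
// Gaps between arrivals are drawn from an exponential distribution with
// an average of `rate_per_sec` arrivals per second (an assumption).
std::vector<double> generate_arrivals(double rate_per_sec, int count,
                                      unsigned seed) {
    assert(rate_per_sec > 0.0 && count >= 0);
    std::mt19937 rng(seed);  // fixed seed keeps the workload repeatable
    std::exponential_distribution<double> gap(rate_per_sec);

    std::vector<double> arrivals;
    arrivals.reserve(static_cast<std::size_t>(count));
    double now = 0.0;
    for (int i = 0; i < count; ++i) {
        now += gap(rng);          // advance the clock by one random gap
        arrivals.push_back(now);  // timestamps are generated on the fly
    }
    return arrivals;
}
```

Nothing is read from disk here, which is the "lean" property the paragraph describes; a trace-driven workload would instead load the timestamps from a recorded log.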
Windows Performance Monitor (Perfmon), configured to log to a separate disk, is a crucial tool for system-level tuning. Perfmon objects that should be logged in full include system, memory, processor, process, network interface, physical disk, thread and SQL Server.
Another tool, the VTune performance analyzer, collects, organizes and displays system data in multiple views and suggests performance-enhancing measures. For self-instrumentation, high-resolution counters are recommended, such as QueryPerformanceCounter and Intel’s macro IAPerf.h.
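Self-instrumentation usually amounts to reading a high-resolution counter before and after the code of interest. The sketch below uses `std::chrono::steady_clock` as a portable stand-in for the Windows counter named above; it is not the IAPerf.h macro, and the helper name `time_seconds` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Times a callable and returns the elapsed wall-clock time in seconds.
// steady_clock is monotonic, so the result is never negative.
template <typename Fn>
double time_seconds(Fn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // guaranteed by a monotonic clock
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the region in a lambda, for example `time_seconds([&] { process_batch(); })` with a hypothetical `process_batch`, keeps the timing code out of the function being measured.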
Optimization strategies for Intel Itanium-based systems
The use of different compilers and choice of test optimization are specific strategies in optimizing Itanium-based systems.
Reduce branch mispredicts
Branch mispredicts occur when code jumps to a routine that the processor could not predict ahead of time. This can be avoided through instruction predication. The Intel Itanium 2 architecture uses predication and speculation to process branches efficiently and minimize the consequence of mispredicts.
Predication and speculation can be optimized by making jumps using the fewest possible comparisons; using predictable conditional tests and short branches; placing common paths first in multi-level conditionals; encouraging similar data to take similar paths; and using profile-guided optimization (PGO).
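One small, well-known instance of "fewest possible comparisons" is a range check folded into a single unsigned comparison. The sketch below is illustrative only: the function names are hypothetical, and an optimizing compiler will often perform this transformation on its own.

```cpp
#include <cassert>

// Two comparisons: each one is a separate conditional test.
inline bool is_digit_two_compares(char c) {
    return c >= '0' && c <= '9';
}

// One comparison: values below '0' wrap around to large unsigned
// numbers, so a single test covers both ends of the range.
inline bool is_digit_one_compare(char c) {
    return static_cast<unsigned>(c - '0') <= 9u;
}

// Converts a digit character to its numeric value.
inline int digit_value(char c) {
    assert(is_digit_one_compare(c));
    return c - '0';
}
```

Both predicates return the same answer for every possible `char`; the second simply gives the processor one condition to evaluate instead of two.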
Minimize cache misses
The Intel Itanium 2 processor predicts data fetches and preloads the expected data into cache. As it must observe a pattern of memory fetches before doing so, random rather than sequential memory access can create stalls.
When inserting an element into a closed hash table, for instance, tradition has it that randomly rehashing an entry and probing the new hashed-to location is more efficient than a sequential search for an open slot. However, with the Intel Itanium 2 microarchitecture, this proves false. If a random search finds a slot full, it requires a new rehash and memory fetch. With a linear search, the next slot is likely to be in L1 cache, resulting in the fastest possible execution rate – an order of magnitude faster than the rehash.
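The linear search described above can be sketched as a minimal closed hash table with linear probing. The class name `LinearProbeSet` is hypothetical, and the table has a fixed capacity with no deletion or resizing, so it illustrates the access pattern rather than a production container.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Minimal closed (open-addressing) hash set of ints with linear probing.
class LinearProbeSet {
public:
    explicit LinearProbeSet(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Returns true if the key is stored after the call,
    // false if the table is full and the key is not in it.
    bool insert(int key) {
        std::size_t i = start_index(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i].used) {
                slots_[i].key = key;
                slots_[i].used = true;
                return true;
            }
            if (slots_[i].key == key) return true;  // already stored
            i = (i + 1) % slots_.size();  // next slot is adjacent in memory
        }
        return false;
    }

    bool contains(int key) const {
        std::size_t i = start_index(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i].used) return false;  // an empty slot ends the run
            if (slots_[i].key == key) return true;
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    struct Slot {
        int key = 0;
        bool used = false;
    };

    std::size_t start_index(int key) const {
        return std::hash<int>{}(key) % slots_.size();
    }

    std::vector<Slot> slots_;  // contiguous storage, probed sequentially
};
```

After the first hashed lookup, every further probe touches the neighbouring slot, which is the sequential pattern the paragraph credits with staying in cache.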
If cache misses cause a significant number of stall cycles, increase prefetching or localize data use with techniques such as raising the optimization level passed to the compiler; inserting prefetch intrinsics; restructuring the algorithm to do more work on a given piece of data; or restructuring data so that sequential accesses go through sequential addresses.
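The last technique, sequential accesses through sequential addresses, is easiest to see with a matrix stored row by row in one contiguous block. In this sketch (function names are hypothetical) both routines compute the same sum, but only the first walks memory in address order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop visits consecutive addresses, a pattern that is easy
// to prefetch.
long long sum_row_order(const std::vector<int>& m, std::size_t rows,
                        std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps `cols` elements at a time, so
// with long rows each access lands on a different cache line.
long long sum_column_order(const std::vector<int>& m, std::size_t rows,
                           std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Swapping the loop order is often the cheapest restructuring available, because it changes the access pattern without changing the data layout.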
Avoid mixing integer and floating-point data
Keep floating-point data in its own block where possible. On Intel Itanium 2 processors, floating-point numbers are serviced from the L2 cache, while all other items are serviced from the L1 cache. In mixed data, floating-point values are needlessly loaded into L1 cache. If the floating-point data is updated, the L1 cache’s copy of the cache line will be marked ‘dirty’, causing it to be reloaded.
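One common way to keep floating-point data in its own block is to split an array of mixed records into parallel arrays. The sketch below assumes a hypothetical record with one integer and one floating-point field; all type and function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mixed layout: integer and floating-point fields are interleaved,
// so every cache line holds both kinds of data.
struct MixedRecord {
    int id;
    double weight;
};

// Split layout: integers and floating-point values live in separate
// contiguous blocks.
struct SplitRecords {
    std::vector<int> ids;
    std::vector<double> weights;
};

// Converts the mixed layout into the split layout.
SplitRecords split(const std::vector<MixedRecord>& mixed) {
    SplitRecords out;
    out.ids.reserve(mixed.size());
    out.weights.reserve(mixed.size());
    for (const MixedRecord& r : mixed) {
        out.ids.push_back(r.id);
        out.weights.push_back(r.weight);
    }
    return out;
}

// Touches only the floating-point block.
double total_weight(const SplitRecords& r) {
    assert(r.ids.size() == r.weights.size());
    double total = 0.0;
    for (double w : r.weights) total += w;
    return total;
}

// Touches only the integer block.
std::size_t count_ids_above(const SplitRecords& r, int threshold) {
    std::size_t n = 0;
    for (int id : r.ids)
        if (id > threshold) ++n;
    return n;
}
```

With the split layout, an integer-only pass such as `count_ids_above` never pulls floating-point values into cache alongside the data it actually needs.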
Good threading improves cache
Threads keep the processor busy. When one task is waiting for disk I/O, others can use the CPU. Good threading candidates include downloading, GUI events, disk activity, and processing double-ended data structures like queues and stacks. The following are guidelines to improve threading:
• Binding threads to CPUs improves cache coherency and load balancing
• Time-critical functions can be prioritized
• Thread pools are more efficient than creating and deleting threads
• Microsoft SQL Server can use fibers, which are lightweight threads with lower context-switch overhead
• The performance cost of creating and synchronizing threads can outweigh the gains.
Optimize memory allocation via multiple heaps
Use multiple heaps rather than a default heap. With the exception of the process default heap, heaps can be created and destroyed using the Win32 APIs HeapCreate and HeapDestroy. The following are best practices in optimizing memory.
• Reduce fragmentation by grouping similarly-sized objects into separate heaps
• Reduce paging and TLB activity by placing data structures that are accessed together in one heap
• In multi-threaded applications, give each thread its own heap
• Use memory pools to shrink the overhead of allocating and de-allocating memory, especially in C++ applications that frequently create and delete objects
• Multiple heaps modularize data structures and application components, reducing the likelihood of corruption.
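The memory-pool idea in the list above can be sketched as a fixed-size block pool. This is a portable illustration, not the Win32 heap API: the class name `FixedPool` is hypothetical, the pool is not thread-safe, and it fits the "one heap per thread" guideline only if each thread owns its own instance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size block pool: one up-front allocation, then constant-time
// allocate and release with no calls into the general-purpose heap.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        assert(block_size > 0 && block_count > 0);
        free_.reserve(block_count);
        // Every block starts out on the free list.
        for (std::size_t i = 0; i < block_count; ++i)
            free_.push_back(storage_.data() + i * block_size_);
    }

    // Handed-out pointers refer into storage_, so copying is disabled.
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a block, or nullptr when the pool is exhausted.
    void* allocate() {
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Returns a block previously obtained from allocate().
    void release(void* p) {
        assert(p != nullptr);
        free_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t available() const { return free_.size(); }

private:
    // Rounds the block size up so every block keeps default alignment.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
};
```

Because every block has the same size, the pool cannot fragment, which also matches the advice to group similarly sized objects together.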
Prevent false sharing
False sharing occurs when separate threads use individual data elements stored in a single cache line. Data is loaded from RAM into cache in chunks called ‘cache stripes’. On Intel Itanium 2 processors, the stripe is 128 bytes. One stripe can be copied to multiple caches. If one stripe is modified, every cache holding it must refresh its copy. So when separate threads simultaneously update different data elements that happen to sit in the same stripe, the stripe is needlessly refreshed in all caches, even though no data is actually shared.
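A usual remedy is to pad or align per-thread data so that each element occupies its own stripe. The sketch below uses the 128-byte figure quoted above; the type names are hypothetical, and `alignas` with this value relies on the compiler supporting over-aligned types, which current mainstream compilers do.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stripe size quoted in the text for Itanium 2.
constexpr std::size_t kStripeBytes = 128;

// Packed: the two counters are adjacent and will normally share a stripe,
// so threads updating them separately still invalidate each other.
struct PackedCounters {
    std::atomic<long> first{0};
    std::atomic<long> second{0};
};

// Padded: alignas forces each counter onto its own 128-byte boundary
// and rounds the size of the type up to a full stripe.
struct alignas(kStripeBytes) PaddedCounter {
    std::atomic<long> value{0};
};

struct PaddedCounters {
    PaddedCounter first;
    PaddedCounter second;
};

static_assert(sizeof(PaddedCounter) == kStripeBytes,
              "each padded counter should fill exactly one stripe");

// True when the two addresses fall in different stripes.
inline bool on_separate_stripes(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    return reinterpret_cast<std::uintptr_t>(a) / kStripeBytes !=
           reinterpret_cast<std::uintptr_t>(b) / kStripeBytes;
}
```

The cost is memory: two padded counters take 256 bytes instead of a handful, so padding is best reserved for data that threads update frequently.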
Pause and sleep accelerate performance
Time spent spinning in a loop is lost execution time. If a loop is spinning just to check a variable, pause the thread and free the processor to do other tasks until the loop is finished. If the loop must keep spinning, use a pause instruction or sleep. The pause instruction creates a one-clock delay in each spin of the loop, while sleep takes the thread off the processor for a time, long enough for the request to resolve properly. According to tests, pause and sleep accelerate performance.
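A polling loop that follows this advice might look like the sketch below. It spins briefly, then sleeps between polls. `std::this_thread::yield` is used as a portable stand-in for a processor pause hint, which standard C++ does not expose; the function name, the timeout parameter and the default spin limit are illustrative assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Waits until `ready` becomes true or `timeout` expires.
// Returns true if the flag was seen set, false on timeout.
inline bool wait_until_ready(const std::atomic<bool>& ready,
                             std::chrono::milliseconds timeout,
                             unsigned spin_limit = 1000) {
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    unsigned polls = 0;
    while (!ready.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        if (++polls < spin_limit) {
            // Short wait: give way briefly instead of spinning flat out.
            std::this_thread::yield();
        } else {
            // Long wait: get off the processor so other work can run.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
    return true;
}
```

Another thread would publish the result with `ready.store(true, std::memory_order_release)`; where the platform offers one, a blocking primitive such as an event or condition variable avoids polling altogether.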
Conclusion
One can get up to speed on the latest performance tuning techniques through the Itanium Solutions Alliance. Whatever the approach, the golden rule of performance must be kept in mind: keep it simple. Making one change at a time and testing it thoroughly can lead to cutting-edge optimization results.
Click here for Illustrations:
Figure 1, Figure 2, Table 1