ECN Asia
Performance tuning and scaling on Itanium 2-based servers


( 01 May 2008 )

By Mark Whitener, Software Engineer, Unisys

As data center density increases, optimizing data center resources has become a concern for most IT professionals. To address this challenge, the Itanium Solutions Alliance recently published a white paper on performance tuning and scaling on Itanium 2-based servers. The paper discusses tuning methodology across the system, application and microarchitecture levels, and best practices specific to the Intel Itanium 2 processor.

Tune from the top-down

Performance tuning is essential to getting the most from applications running on Itanium 2-based systems. These powerful processors depend more than most on optimized code to reach their peak capabilities. There are three levels of performance tuning: system, application and microarchitecture. System-level tuning requires identifying system bottlenecks in areas such as processors, disks and network configuration. Addressing these bottlenecks gives the highest performance gain. The middle level, application tuning, addresses issues such as locks, heap contention, threading and API calls, which require modifying application semantics or redesigning or re-implementing application components. The bottom level, microarchitecture tuning, is especially relevant for applications that use 100 percent of the CPU. On Itanium 2-based systems, using a code-optimizing compiler is highly recommended.

Workload and tools

A good workload is measurable, repeatable, static and representative. Measurable because the workload and any performance gains from run to run can be expressed quantitatively; repeatable because the workload is reasonably consistent from one run to the next; static because the ratio of transactions to workload does not change significantly as the workload increases; and representative because the workload matches how an application will be used. The baseline performance measurement should be an average of multiple identical runs.
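
As a minimal sketch of the baseline measurement described above, the following C++ helper times several identical runs and averages them. The names `mean`, `time_run_ms` and `baseline_ms` are illustrative and do not come from the white paper, and `std::chrono::steady_clock` is used here only as a portable monotonic timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of the samples; returns 0.0 for an empty set.
inline double mean(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0);
    return sum / static_cast<double>(samples.size());
}

// Elapsed wall-clock time of one run, in milliseconds (monotonic clock).
template <typename Workload>
double time_run_ms(Workload& workload) {
    const auto start = std::chrono::steady_clock::now();
    workload();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Baseline: the average elapsed time over `runs` identical runs.
template <typename Workload>
double baseline_ms(Workload workload, std::size_t runs) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        samples.push_back(time_run_ms(workload));
    }
    return mean(samples);
}
```

Comparing a tuned build against this baseline is only meaningful if the workload passed in is the same for every run.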

Stochastic workloads are flexible and lean. They use a probability function to generate customer arrivals and other elements of the workload on the fly. A trace-driven workload replays requests from a real log file or trace. It is more realistic but less flexible, and uses more disk space.
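
A stochastic workload generator might look like the sketch below. It assumes exponentially distributed gaps between customer arrivals, which is one common choice of probability function and is not specified in the source. The function name and parameters are hypothetical. A fixed seed keeps the generated workload repeatable from run to run.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Generate `count` arrival timestamps (in seconds) on the fly.
// Assumption: gaps between arrivals follow an exponential distribution
// with the given average arrival rate.
std::vector<double> generate_arrivals(std::size_t count,
                                      double rate_per_sec,
                                      unsigned seed) {
    assert(rate_per_sec > 0.0);
    std::mt19937 rng(seed);  // fixed seed => repeatable workload
    std::exponential_distribution<double> gap(rate_per_sec);

    std::vector<double> arrivals;
    arrivals.reserve(count);
    double now = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        now += gap(rng);  // advance the clock by one random gap
        arrivals.push_back(now);
    }
    return arrivals;
}
```

Only the seed and the rate need to be stored, which is why such a workload takes far less disk space than a recorded trace.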

Configured to log to a separate disk, Windows Performance Monitor (Perfmon) is a crucial tool for system-level tuning. Perfmon objects that should be logged in full include System, Memory, Processor, Process, Network Interface, Physical Disk, Thread and SQL Server.

Another tool, the VTune performance analyzer, collects, organizes and displays system data in multiple views and suggests performance-enhancing measures. For self-instrumentation, high-resolution counters such as QueryPerformanceCounter and Intel’s macro IAPerf.h are recommended.

Optimization strategies for Intel Itanium-based systems

Using different compilers and choosing which optimizations to test are specific strategies for optimizing Itanium-based systems.









Reduce branch mispredicts

Branch mispredicts occur when code jumps to a routine that the processor could not predict ahead of time. This can be avoided through instruction predication. The Intel Itanium 2 architecture uses predication and speculation to process branches efficiently and minimize the consequence of mispredicts.

Predication and speculation can be optimized by making jumps using the fewest possible comparisons; using predictable conditional tests and short branches; having common paths occur first in multi-level conditionals; encouraging similar data to take similar paths; and using profile-guided optimization (PGO).
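
Two of these guidelines can be illustrated with a short C++ sketch. The size thresholds, and the assumption that small requests are the common case, are hypothetical. Whether the compiler actually emits predicated instructions for the short conditionals depends on the compiler and optimization level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

enum class SizeClass { Small, Medium, Large };

// Multi-level conditional with the common path first.
// Assumption for illustration: most requests are small, so the first
// comparison usually decides the result.
inline SizeClass classify(std::size_t bytes) {
    if (bytes <= 64) return SizeClass::Small;     // assumed common case
    if (bytes <= 4096) return SizeClass::Medium;  // less common
    return SizeClass::Large;                      // rare
}

// Short, simple conditional logic: two comparisons and no long branch
// bodies, which gives the compiler the chance to avoid jumps entirely.
inline int clamp_to_range(int value, int low, int high) {
    assert(low <= high);
    return std::min(std::max(value, low), high);
}
```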

Minimize cache misses

The Intel Itanium 2 processor predicts data fetches and preloads the expected data into cache. As it must observe a pattern of memory fetches before doing so, random rather than sequential memory access can create stalls.

When inserting an element into a closed hash table, for instance, tradition has it that randomly rehashing an entry and probing the new hashed-to location is more efficient than a sequential search for an open slot. However, with the Intel Itanium 2 microarchitecture, this proves false. If a random search finds a slot full, it requires a new rehash and memory fetch. With a linear search, the next slot is likely to be in L1 cache, resulting in the fastest possible execution rate – an order of magnitude faster than the rehash.
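
The linear search for an open slot can be sketched as follows. This is a minimal closed hash table for `int` keys, written for illustration only. The class name is hypothetical and the table does not support removal or resizing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Closed (open-addressing) hash set that resolves collisions by probing
// the next slot in sequence, so successive probes touch adjacent memory.
class LinearProbeSet {
public:
    explicit LinearProbeSet(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Returns true if the key is stored (or already present),
    // false if every slot is occupied by other keys.
    bool insert(int key) {
        const std::size_t start = std::hash<int>{}(key) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& slot = slots_[(start + i) % slots_.size()];
            if (!slot.used) {
                slot.key = key;
                slot.used = true;
                ++size_;
                return true;
            }
            if (slot.key == key) return true;  // duplicate
        }
        return false;  // table is full
    }

    bool contains(int key) const {
        const std::size_t start = std::hash<int>{}(key) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const Slot& slot = slots_[(start + i) % slots_.size()];
            if (!slot.used) return false;  // an empty slot ends the probe
            if (slot.key == key) return true;
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    struct Slot {
        int key = 0;
        bool used = false;
    };
    std::vector<Slot> slots_;
    std::size_t size_ = 0;
};
```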

If cache misses cause a significant number of stall cycles, increase prefetching or localize data use with techniques such as: raising the optimization level passed to the compiler; inserting prefetch intrinsics; restructuring the algorithm to do more work on a given piece of data; or restructuring data to make sequential accesses through sequential addresses.
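
The last technique, sequential accesses through sequential addresses, can be shown with a matrix stored row by row in one contiguous array. The sketch below is illustrative only. Both functions return the same sum, but the first walks consecutive addresses while the second jumps `cols` elements on every step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop visits consecutive addresses.
long long sum_sequential(const std::vector<int>& m,
                         std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Same result with the loops swapped: each inner step strides by `cols`
// elements. Shown for contrast with the sequential version.
long long sum_strided(const std::vector<int>& m,
                      std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```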

Avoid mixing integer and floating-point data

Keep floating-point data in its own block where possible. On Intel Itanium 2 processors, floating-point numbers are serviced from the L2 cache, while all other items are serviced from the L1 cache. In mixed data, floating-point values are needlessly loaded into L1 cache. If the floating-point data is updated, the L1 cache’s copy of the cache line will be marked ‘dirty’, causing it to be reloaded.
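
One way to keep floating-point data in its own block is to split an interleaved record into separate arrays. The record layout and field names below are hypothetical and serve only to illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout: integer and floating-point fields share cache lines.
struct MixedRecord {
    int id;
    double price;
    int quantity;
};

// Split layout: integer fields and floating-point fields live in
// separate contiguous blocks.
struct SplitRecords {
    std::vector<int> ids;
    std::vector<int> quantities;
    std::vector<double> prices;  // floating-point block
};

SplitRecords split_records(const std::vector<MixedRecord>& records) {
    SplitRecords out;
    out.ids.reserve(records.size());
    out.quantities.reserve(records.size());
    out.prices.reserve(records.size());
    for (const MixedRecord& r : records) {
        out.ids.push_back(r.id);
        out.quantities.push_back(r.quantity);
        out.prices.push_back(r.price);
    }
    return out;
}

// Integer-only pass: touches no floating-point data at all.
long long total_quantity(const SplitRecords& records) {
    assert(records.ids.size() == records.quantities.size());
    long long total = 0;
    for (int q : records.quantities) total += q;
    return total;
}
```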

Good threading improves cache

Threads keep the processor busy. When one task is waiting for disk I/O, others can use the CPU. Good threading candidates include downloading, GUI events, disk activity, and processing double-ended data structures like queues and stacks. The following are guidelines to improve threading:

• Binding threads to CPUs improves cache coherency and load balancing

• Time-critical functions can be prioritized

• Thread pools are more efficient than creating and deleting threads

• Microsoft SQL Server can use fibers – lightweight threads with lower context switch overhead

• The performance cost of creating and synchronizing threads can outweigh the gains.

Optimize memory allocation via multiple heaps

Use multiple heaps rather than a default heap. With the exception of the process default heap, heaps can be created and destroyed using the Win32 APIs HeapCreate and HeapDestroy. The following are best practices in optimizing memory.

• Reduce fragmentation by grouping similarly-sized objects into separate heaps

• Reduce paging and TLB activity by placing data structures that are accessed together in one heap

• In multi-threaded applications, give each thread its own heap

• Use memory pools to shrink the overhead of allocating and de-allocating memory, especially in C++ applications that frequently create and delete objects

• Multiple heaps modularize data structures and application components, reducing the likelihood of corruption.
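
The memory pool idea from the list above can be sketched in portable C++. This is not the Win32 heap API. It is a hypothetical fixed-capacity pool that creates all of its objects up front and hands them out and takes them back without calling the general-purpose allocator. It assumes `T` is default-constructible and that callers reset an object's state themselves before reuse.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity object pool: all storage is allocated once, and
// acquire/release only push and pop pointers on a free list.
template <typename T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity) : storage_(capacity) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i) {
            free_.push_back(&storage_[i]);
        }
    }

    // Copying would leave the free list pointing into another pool.
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a pooled object, or nullptr when the pool is exhausted.
    T* acquire() {
        if (free_.empty()) return nullptr;
        T* object = free_.back();
        free_.pop_back();
        return object;
    }

    // Returns an object to the pool; it must have come from this pool.
    void release(T* object) {
        assert(object >= storage_.data() &&
               object < storage_.data() + storage_.size());
        free_.push_back(object);
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<T> storage_;
    std::vector<T*> free_;
};
```

Giving each thread its own pool instance follows the per-thread heap guideline and avoids locking inside `acquire` and `release`.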

Prevent false sharing

False sharing occurs when separate threads utilize individual data elements stored in a single cache line. Data is loaded from RAM into cache in chunks called ‘cache stripes’. On Intel Itanium 2 processors, the stripe is 128 bytes. One stripe can be copied to multiple caches. If one stripe is modified, every cache will refresh its copy. If separate threads simultaneously use data elements stored in a single cache line, the stripe is needlessly updated in all caches. This is called false sharing.
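
A common way to prevent this is to pad and align per-thread data so that each element occupies its own cache line. The sketch below uses the 128-byte figure given above. The type names and the thread count are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 128;  // line size given in the text
constexpr std::size_t kMaxThreads = 4;        // hypothetical thread count

// Each counter is aligned to, and padded out to, a full cache line, so
// two threads never update data that sits in the same line.
struct alignas(kCacheLineBytes) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLineBytes,
              "each counter should fill exactly one cache line");

struct PerThreadCounters {
    PaddedCounter slots[kMaxThreads];
};

// Each thread increments only its own slot.
inline void bump(PerThreadCounters& counters, std::size_t thread_index) {
    assert(thread_index < kMaxThreads);
    ++counters.slots[thread_index].value;
}

// Combine the per-thread values once the threads have finished.
inline long total(const PerThreadCounters& counters) {
    long sum = 0;
    for (std::size_t i = 0; i < kMaxThreads; ++i) {
        sum += counters.slots[i].value;
    }
    return sum;
}
```

The cost is memory: four counters now occupy 512 bytes instead of a few dozen.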

Pause and sleep accelerate performance

Time spent spinning in a loop is lost execution time. If a loop is spinning just to check a variable, pause the thread and free the processor to do other tasks until the condition is met. Should the spinning continue, use a pause instruction or a sleep. The pause instruction creates a one-clock delay in each spin of the loop, while the sleep forces the thread to spend time off the processor – enough for the request to resolve properly. According to tests, pause and sleep accelerate performance.
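
A spin-then-sleep wait might be sketched as follows. Standard C++ has no portable pause instruction, so `std::this_thread::yield` stands in for it here as an assumption. The function name, the spin limit and the nap length are all hypothetical tuning choices.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Wait until `ready` becomes true. Spin for at most `spin_limit` checks,
// then sleep between checks so the processor is free for other work.
// Returns the number of sleeps that were needed.
std::size_t wait_until_ready(const std::atomic<bool>& ready,
                             std::size_t spin_limit,
                             std::chrono::microseconds nap) {
    assert(nap.count() > 0);

    // Phase 1: short spin, giving way to other threads on every pass.
    for (std::size_t i = 0; i < spin_limit; ++i) {
        if (ready.load(std::memory_order_acquire)) return 0;
        std::this_thread::yield();  // stand-in for a pause instruction
    }

    // Phase 2: the wait is taking long, so leave the processor entirely.
    std::size_t sleeps = 0;
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::sleep_for(nap);
        ++sleeps;
    }
    return sleeps;
}
```

Another thread is expected to set the flag with `ready.store(true, std::memory_order_release)` once the awaited work is done.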

Conclusion

One can get up to speed on the latest performance tuning techniques through the Itanium Solutions Alliance. No matter what the approach, the golden rule of performance must be kept in mind – keep it simple. Making one change at a time and testing it thoroughly can lead to cutting-edge optimization results.

Illustrations: Figure 1, Figure 2, Table 1



 

 
 
 
© 2008 Reed Business Information, a division of Reed Elsevier Inc.