#139 – A BRIEF INTRODUCTION TO HALT – FRED SCHENKELBERG

What Is HALT

Highly accelerated life testing (HALT) is a technique to expose weaknesses or faults with a product. HALT uses individual or combined stresses in a step-stress approach to quickly apply sufficient stress to reveal defects.

HALT is not a specific chamber or a fixed set of test conditions. It is an exploratory process to reveal weaknesses in a design. The product development process naturally includes a check step, to determine whether the expected functions of the product work as expected. Some teams then add a measured amount of stress (temperature, vibration, dust, load, etc.) to the product to explore functionality at elevated stress levels.

When the stress levels continue to increase until the product ceases to function, that could also be called HALT, but a better term might be “discovery.” The idea is to determine what fails.

Action Upon Failure During HALT

For HALT to be effective, the response to failures is critical:

  1. Determine the root cause of the failure.
  2. Improve the design or manufacturing process to eliminate the failure cause.

The symptom that indicates that a product function has failed is only a failure mode. You have to determine whether the failure was due to overcurrent (heating), a logic error, a timing error, or a technological limit. The basic steps in failure analysis apply here.

In practice, we mitigate or patch the fault to continue discovering faults. One of the tenets of HALT is to identify as many faults as possible as quickly as possible. This is a balancing act between the speed of finding failures and gathering sufficient information to later determine the root causes.

The second step is to take action based on the failure root cause information. This may take a number of different paths depending on the nature of the failure. If the failure is due to a technology limit, say a polymer softens at high temperature, then you must decide whether the difference between the expected operating temperatures and the temperature at which the material softens is a sufficient margin. Is there little to no chance that the temperature during use will approach the temperature that causes the failure? For example, if the polymer softens at 70°C and the expected use temperature is 60°C, that may indicate little margin, compared to a polymer with a softening point of 150°C. Given an expected operating temperature of 60°C, the chance of the product experiencing 70°C is higher than the chance of it experiencing 150°C. In general, the greater the margin the better.
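The margin comparison in the polymer example is simple arithmetic, and a minimal sketch makes it explicit. The function name `thermal_margin` and the dictionary keys are illustrative assumptions; the temperatures are the ones from the example above.

```python
def thermal_margin(limit_c: float, expected_use_c: float) -> float:
    """Margin in degrees C between a failure limit and the expected use temperature."""
    return limit_c - expected_use_c


# Values from the polymer example: 60 C expected use temperature,
# softening points of 70 C and 150 C.
margins = {
    "softens_at_70C": thermal_margin(70.0, 60.0),
    "softens_at_150C": thermal_margin(150.0, 60.0),
}

for name, margin in margins.items():
    print(f"{name}: {margin:.0f} C of margin")
```

Whether a given margin is sufficient remains an engineering judgment. The calculation only puts the two candidates on the same scale.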

The Three Phases

Harry W. McLean, in HALT, HASS, and HASA Explained: Accelerated Reliability Techniques, describes the three phases of HALT: pre-HALT, HALT, and post-HALT.

Pre-HALT

Pre-HALT is the planning stage. This is the time when discussions on what is likely to fail take place. You can use an FMEA or risk analysis study to understand specific functions and stresses to explore during HALT. Product development engineers usually have a short list of what is likely to fail first based on their experience, design tradeoffs, and often little more than hunches.

First, document what is expected to fail and plan to be able to detect those specific failures. The ability to check the general functionality of the product during testing should be included. For example, if there is a test mode or simulated operation capability, plan to have the supporting equipment available during the HALT.

Being aware of what could fail and being able to detect failures increases the capability of HALT to find faults, both expected and unexpected.

Next, determine which stresses to apply. In actual operation the product experiences all stresses simultaneously. In the lab, we have the capability to focus on a single stress or a small set of stresses. In practice, start with the product functioning in normal or ambient conditions, then increase the application of one or more stresses to determine the extent of the margin for specific stresses.

In selecting which stresses to employ, include those the risk analysis estimates to have the least margin plus those that commonly cause failures for the technology and materials involved. For electronics, temperature and vibration typically cause a majority of failures, yet voltage and current variation, signal or traffic loading, and other stresses may apply.

Fixtures, cabling, stress application, fault detection, and failure analysis capabilities are all part of HALT planning. HALT should be started once the prototypes are available.

During HALT

During HALT, tests should start at ambient conditions. Turn on the product and check whether the diagnostics or fault detection process is working as expected. A common practice is to apply steps of increasing stress, starting with the stress least likely to cause failures. Establish that the product still works at a reasonably large margin beyond the expected operating conditions, then move to the next stress.

For example, for an electronics product, cold temperatures cause fewer types of failure than high temperature. So, for a product expected to operate as low as 0°C, start there and step down in temperature. If the product still operates at −40°C, then switch to another stress, say high temperature. If the product still operates at 40°C over specifications, move to another stress.
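The cold-then-hot stepping in this example can be sketched as a small loop. Everything below is an illustrative assumption rather than a prescribed procedure: the function `step_stress`, the 10°C step size, the 50°C upper specification, and the `simulated_product` stand-in, which in a real test would be a functional check run in the chamber.

```python
from typing import Callable, Optional, Tuple


def step_stress(
    start: float,
    step: float,
    stop: float,
    is_functional: Callable[[float], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Step one stress from start toward stop, checking function at each level.

    Returns (last_level_passed, first_level_failed). The second value is
    None when the product still works at the stop level.
    """
    if step == 0 or (stop - start) / step < 0:
        raise ValueError("step must be nonzero and point from start toward stop")
    n_steps = int(round((stop - start) / step))
    last_pass = None
    for i in range(n_steps + 1):
        level = start + i * step
        if not is_functional(level):
            return last_pass, level
        last_pass = level
    return last_pass, None


def simulated_product(temp_c: float) -> bool:
    """Hypothetical stand-in: the unit works strictly between -55 C and 85 C."""
    return -55.0 < temp_c < 85.0


# Cold first: from the 0 C lower specification down to -40 C.
cold = step_stress(0.0, -10.0, -40.0, simulated_product)
# Then hot: from an assumed 50 C upper specification to 40 C over it.
hot = step_stress(50.0, 10.0, 90.0, simulated_product)

print("cold:", cold)
print("hot:", hot)
```

In this simulated run the unit survives the full cold sweep, so the tester would move on, while the hot sweep finds a failure between the last passing and first failing step.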

Explore the boundary by probing each stress. Think of this as exploring the size of a room. Move in one direction, then another to define the open space of a room. The walls are defined by failures. Given a limited number of prototypes available for HALT, first explore each stress to determine whether there is some margin. Then revisit each stress and find the point of failure (the wall in the room analogy).

With each failure, gather as much information as possible about the stress conditions, status of various elements of the product, and anything observable about the failure. For example, was there flickering prior to the screen going black?

Return the stress levels to nominal and check whether the functionality returns. If so, the stress level that caused the failure is an operating limit. This often is well beyond the product specification operating limit, which is fine. If the product does not recover, we call that a destruct limit.
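One way to keep the information gathered at each failure together with the recovery check is a small record type. This is a sketch under stated assumptions: the class `FailureRecord`, its field names, and the two sample entries are hypothetical, and only the operating limit versus destruct limit distinction comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    stress: str                 # e.g. "temperature" or "vibration"
    level: float                # stress level at which the failure appeared
    units: str
    observation: str            # anything observable about the failure
    recovered_at_nominal: bool  # did function return at nominal stress?

    @property
    def limit_type(self) -> str:
        """Operating limit if the product recovered at nominal, else destruct limit."""
        return "operating limit" if self.recovered_at_nominal else "destruct limit"


# Hypothetical entries for illustration only.
records = [
    FailureRecord("temperature", 90.0, "C", "screen flickered, then went black", True),
    FailureRecord("vibration", 30.0, "Grms", "connector separated", False),
]

limits = [r.limit_type for r in records]
for r in records:
    print(f"{r.stress} at {r.level} {r.units}: {r.limit_type} ({r.observation})")
```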

Keep testing and exploring. Patch, repair, replace, or isolate elements that cause failures. Continue to explore stresses to find failures. Add combined stresses, such as temperature and vibration, to continue to reveal faults. Many failures manifest themselves only when using combined stresses.

Post-HALT

Post-HALT is the time to complete the root cause analysis and improve the design. You may need to prioritize which failures to remedy based on the nature of the failure (safety issues, for example) or the amount of margin. Improving the design to be robust to the suite of expected stresses will make the product less likely to fail when in use.
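As a rough illustration of that prioritization, the findings can be sorted so that safety issues come first and, among the rest, the smallest margin comes first. The findings, field names, and margin values below are hypothetical, and a real program may weigh other factors such as cost or schedule.

```python
# Hypothetical post-HALT findings; margins are in degrees C above expected use.
findings = [
    {"name": "display blanks", "safety": False, "margin_c": 35.0},
    {"name": "housing softens", "safety": False, "margin_c": 10.0},
    {"name": "battery vents", "safety": True, "margin_c": 50.0},
]

# Sort key: safety issues first (False sorts before True, so negate),
# then ascending margin so the least margin is addressed earlier.
ranked = sorted(findings, key=lambda f: (not f["safety"], f["margin_c"]))
order = [f["name"] for f in ranked]

for position, name in enumerate(order, start=1):
    print(position, name)
```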

More to Study

There are excellent books by Gregg Hobbs and Harry McLean about HALT. There are many articles and papers, too. The basic idea of HALT is to explore the functioning of a product using elevated stresses to quickly reveal faults and then improve the design to create a robust, reliable product.

Bio:

Fred Schenkelberg is an experienced reliability engineering and management consultant with his firm FMS Reliability. His passion is working with teams to create cost-effective reliability programs that solve problems, create durable and reliable products, increase customer satisfaction, and reduce warranty costs. If you enjoyed this article, consider subscribing to the ongoing series at Accendo Reliability.
