In a previous posting (Creating a Reliability Plan – Starting Points), I introduced the two basic philosophies for creating a reliability plan for a new product or system: the build-test-fix approach and the analytical approach. Combining these two styles in a balanced approach leads to the best results. In this posting, I want to get more specific and outline the seven basic steps to follow to create a sound reliability program.
1. GATHER REQUIREMENTS AND SET RELIABILITY GOALS
Reliability goals start with understanding customer needs and the competitive situation surrounding those needs. However, this information must also be viewed through the lens of marketing strategy. Setting a new standard in reliability is a proven way to gain market share. Low warranty costs may also allow a lower price than the competition can offer.
Reliability goals for complex products have at least two dimensions: how frequently random-in-time failures will be tolerated and how long the device must last. The first is usually expressed as a failure rate (e.g., Annual Failure Rate, or AFR, in % per year) when speaking to customers, or as an MTBF (mean time between failures) when speaking to engineers. The second is often expressed as the L1, L10, or L50 life: the point at which 1%, 10%, or 50% of the devices, respectively, have failed in an end-of-life (wear-out) failure mode. This life can be expressed in time (hours or months), number of cycles, takeoffs plus landings, and so on (e.g., L10 = 1000 cycles).
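To make the relationship between the customer-facing and engineering-facing numbers concrete, here is a minimal sketch of converting between AFR and MTBF. It assumes a constant failure rate (an exponential model) and continuous operation of 8760 hours per year; both are assumptions of this example, and the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: the product runs 24/7


def afr_from_mtbf(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Fraction of units expected to fail in one year (constant failure rate assumed)."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


def mtbf_from_afr(afr, hours_per_year=HOURS_PER_YEAR):
    """Inverse of afr_from_mtbf; afr is a fraction between 0 and 1 (exclusive)."""
    return -hours_per_year / math.log(1.0 - afr)


if __name__ == "__main__":
    afr = afr_from_mtbf(100_000)  # hypothetical MTBF of 100,000 hours
    print(f"AFR for a 100,000 h MTBF: {afr:.2%}")
    print(f"MTBF for a 2% AFR: {mtbf_from_afr(0.02):,.0f} h")
```

If the product is used only a few hours per day, pass the actual operating hours per year instead of 8760.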
These two dimensions of reliability are independent. It is possible to have an MTBF of millions of hours, so that random-in-time failures are rare, and yet have most units fail before five years because certain components inevitably wear out.
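The independence of the two dimensions can be illustrated numerically. The sketch below assumes that random failures follow a constant failure rate and that wear-out follows a Weibull distribution; the Weibull model and every parameter value (MTBF, `eta`, `beta`) are assumptions chosen for illustration, not figures from a real product.

```python
import math


def b_life(fraction_failed, eta, beta):
    """Time by which the given fraction has worn out (e.g., 0.10 gives the L10 life)."""
    return eta * (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)


def survival(t_hours, mtbf_hours, eta, beta):
    """Fraction of units surviving both random failures and wear-out at time t."""
    random_part = math.exp(-t_hours / mtbf_hours)           # constant failure rate
    wearout_part = math.exp(-((t_hours / eta) ** beta))     # Weibull wear-out
    return random_part * wearout_part


if __name__ == "__main__":
    mtbf, eta, beta = 2_000_000, 30_000, 4.0  # hypothetical values
    five_years = 5 * 8760
    print(f"L10 wear-out life: {b_life(0.10, eta, beta):,.0f} h")
    print(f"Surviving random failures only: {math.exp(-five_years / mtbf):.1%}")
    print(f"Surviving both at five years:   {survival(five_years, mtbf, eta, beta):.1%}")
```

With these assumed numbers, random failures alone remove only about 2% of units in five years of continuous use, yet only about 1% of units survive once wear-out is included.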
Subassemblies and components that have wear-out failure modes that cannot economically be pushed to last long enough are often called “service items.” Examples are rechargeable batteries and automobile tires. Sometimes these service items can be incorporated with a consumable item or grouped with other service items in logical periodic replacement cartridges. Examples are putting the drum of a laser printer in the cartridge with the powdered toner or selling defibrillator pads with battery packs for periodic replacement.
Indeed, service strategy is one of the major inputs to the product design. Sometimes sterility considerations or shelf life make it logical to break the product into a long-lived portion and a “disposable” portion. An example of this is an electronic thermometer with a disposable sheath for each use. Other considerations from the service plan may be to design a low cost, unserviceable, sealed unit or to perform on-site repairs because of the large system size. If the intention is to use loaner units and bring all units to one centralized location for repair, the device must be rugged enough for multiple shipments.
Risk control measures (mitigations) from risk management (safety) are another rich source of reliability goals. Safety may dictate an architectural change to the product to achieve desired reliability. An example of this is the dual-diagonal braking system on automobiles, which is now standard. Other sources of reliability goals are external standards (e.g., ISO and IEC) and the manufacturing plan. For example, if manufacturing screening is to be utilized, good design margin is needed to ensure that the product’s fatigue life will be only slightly consumed during manufacturing.
2. ALLOCATE RELIABILITY GOALS TO SUBASSEMBLIES AND KEY COMPONENTS
If the product contains redundancy or the ability to partially function in the presence of failures, a reliability block diagram should be constructed showing how the reliability of each individual piece combines to produce the top-level reliability. Based on field experience with previous models, competitive information, and engineering judgment, the top-level goal for random-in-time failures should be allocated to subassemblies and key components. This will typically result in a Pareto of (unequal) individual failure rates.
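As a rough sketch of the two ideas in this step, the code below combines block reliabilities in series and in parallel (redundancy), and splits a top-level failure-rate goal across subassemblies using weights. The block values, subassembly names, and weights are hypothetical; in practice the weights would come from field experience, competitive information, and engineering judgment. The parallel formula assumes independent failures.

```python
from math import prod


def series(*reliabilities):
    """All blocks must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities):
    """At least one block must work (independent failures assumed)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def allocate_failure_rate(top_level_rate, weights):
    """Split a top-level failure rate in proportion to the given weights."""
    total = sum(weights.values())
    return {name: top_level_rate * w / total for name, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical system: controller, two redundant pumps, and a display in series.
    system = series(0.99, parallel(0.95, 0.95), 0.98)
    print(f"System reliability: {system:.4f}")

    # Hypothetical allocation of a 2% per year goal; the unequal weights give a Pareto.
    goals = allocate_failure_rate(2.0, {"pump": 5, "controller": 3, "display": 2})
    for name, rate in goals.items():
        print(f"{name}: {rate:.2f} % per year")
```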
The service strategy (e.g., annual preventive maintenance or service based on mileage) will combine with field experience, competitive data, and engineering judgment to set the end-of-life goals for the “service items.” For example, it may make no sense to increase product cost by pushing out the end of life for one service item, if another service item requires annual maintenance and both can be replaced simultaneously.
3. DEVELOP ANALYTICAL MODELS AND RELIABILITY PREDICTIONS
Depending on the nature of each key component or subassembly, you should develop an estimate of reliability based on finite element analysis, comparison to similar designs, physics of failure, parts count prediction techniques, and supplier test data. Caution, however, is warranted when using supplier data. If your application is more stressful than the conditions of the supplier test, additional testing will be necessary.
Comparing the allocated reliability goal with the various analytical models or reliability predictions for each subassembly or key component reveals where to focus the reliability engineering effort. It may be possible to reallocate the individual goals in light of the analytical results. If the gap between the goal and the prediction is large, an architectural change in the product, such as specifically targeted redundancy, should be considered.
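One simple way to organize the goal-versus-prediction comparison is to rank subassemblies by the ratio of predicted to allocated failure rate. The sketch below uses made-up subassembly names and numbers purely for illustration.

```python
# Hypothetical data: name -> (allocated goal, predicted) failure rate in % per year.
goals_vs_predictions = {
    "power supply": (0.50, 0.40),
    "pump": (0.30, 0.90),
    "display": (0.20, 0.25),
}


def rank_gaps(table):
    """Return (name, ratio) pairs, worst first; a ratio above 1 misses the goal."""
    ratios = {name: predicted / goal for name, (goal, predicted) in table.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in rank_gaps(goals_vs_predictions):
        status = "misses goal" if ratio > 1.0 else "meets goal"
        print(f"{name}: predicted/goal = {ratio:.2f} ({status})")
```

In this made-up data, the pump is predicted at three times its allocation, so it is the natural candidate for focused effort or an architectural change, while the power supply has margin that could be reallocated.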
4. BEGIN LONG-TERM LIFE TESTS
Often called “cycle testing,” long-term life testing involves subjecting components that rotate, flex, or receive repetitive electrical or mechanical stresses to several worst-case lifetimes of wear. Depending on the product’s intended usage pattern, it may be possible to accomplish this quickly by increasing the frequency of cycling. For example, switches, cables, and connectors can usually be thoroughly tested over a weekend. However, if the component is intended to run 24/7, such as a disk drive’s spindle, it may require an accelerated life test, during which the stress is increased, to see the relevant failure modes.
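The arithmetic behind compressing a life test by cycling faster can be sketched as follows. The usage pattern (20 actuations per day over a 10-year service life), the three-lifetime target, and the one-cycle-per-second test rate are all assumptions chosen for illustration.

```python
def cycle_test_hours(cycles_per_day, service_years, lifetimes, test_cycles_per_second):
    """Hours of continuous cycling needed to apply the given number of lifetimes."""
    cycles_per_lifetime = cycles_per_day * 365 * service_years
    total_cycles = cycles_per_lifetime * lifetimes
    return total_cycles / test_cycles_per_second / 3600


if __name__ == "__main__":
    # Hypothetical switch: 20 presses per day for 10 years, tested to 3 lifetimes.
    hours = cycle_test_hours(20, 10, 3, test_cycles_per_second=1.0)
    print(f"Test duration: {hours:.1f} hours")
```

Under these assumptions the test takes about 61 hours, which fits in a weekend. The same calculation for a component that already runs continuously gives no time compression at all, which is why such components need increased stress rather than increased frequency.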
Whether the frequency or the stress or both are increased, the goal is to discover relevant failures. It is possible to miss failure modes because some things only happen over time and their frequency of occurrence does not increase in proportion to the number of cycles. An example of this would be copper migration and subsequent oxidation in a connector. It is also possible to produce “foolish failures” that will not occur in the customer environment. These are test artifacts that result from the accelerating stress. Reliability engineering requires experience and judgment to make these distinctions.
Wear-out failure modes can be reduced by identifying the “reservoir” of material that is being consumed (or transformed) by a “process.” One then increases the reservoir and/or slows down the process to push out the occurrence of failure until satisfactory life is achieved. If this is not possible, preventive maintenance is used to restore the reservoir of material by replacing a service item.
5. TEST TO FAILURE TO DISCOVER DESIGN WEAKNESSES
As physical hardware becomes available, step-stress testing and highly accelerated life testing should be used to discern the weak links in the design. There is no room here for success testing; one must test to failure. The focus should be on those items that are NUD: New to the organization, Unique to this product, and Difficult to design and/or manufacture.
One should start at the lowest level subassembly that is conveniently testable and continue later as higher levels of integration become available. The lower levels require more electrical, mechanical, and software fixturing to test while under stress, but the stress levels can go further. More fully integrated products are more easily tested and require less fixturing, but the stress level is limited by the weakest subassembly.
Each failure should be investigated to understand the root cause, no matter what kind of stress or stress level led to the failure. Until a root-cause understanding exists, one cannot make any estimates of the relevance of the failure mode. Indeed, some things do not have to be fixed, but it is often easier to fix the issue than to establish whether it can be safely ignored. Repeated highly accelerated life tests should be performed with each round of prototypes.
6. USE MANUFACTURING SCREENING TO ENSURE EARLY LIFE SUCCESS
Some components are weakened by anomalies in their manufacturing process or are damaged in shipping, storage, and handling. These defects are latent (hidden): the parts will test well in manufacturing but fail early in the product’s life (e.g., within the first 90 days), typically because they contain stress concentrators.
After corrective actions from highly accelerated life and step-stress testing have established good design margin, manufacturing screening will be able to cause weak components to fail without removing significant fatigue life from the good components. In this way, latent defects can be eliminated before shipping the device. Keep in mind that, without first having a rugged design, manufacturing screening may decrease life and increase warranty costs.
Run-in, burn-in, environmental stress screening, and highly accelerated stress screening are increasingly sophisticated methods of precipitating and detecting these hidden defects. After stress precipitates a defect, detection screens with appropriate testing must be used to locate the (now) patent, or visible, flaw. A proof-of-screen test is run to ensure that the trial regimen is tough enough to precipitate defects, and a safety-of-screen test is done to ensure that enough fatigue life remains afterward.
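A common rule of thumb for interpreting a safety-of-screen test is to run the screen repeatedly on sample units: if a unit survives N passes, a single pass consumes at most 1/N of its fatigue life. The sketch below encodes that reasoning. It assumes damage accumulates linearly from pass to pass, and the 10% acceptance threshold is an illustrative assumption, not a standard value.

```python
def max_life_fraction_per_screen(passes_survived):
    """Upper bound on fatigue life used by one screen pass (linear damage assumed)."""
    if passes_survived < 1:
        raise ValueError("the unit must survive at least one screen pass")
    return 1.0 / passes_survived


def screen_is_safe(passes_survived, max_allowed_fraction=0.10):
    """True if one pass uses no more than the allowed fraction of fatigue life."""
    return max_life_fraction_per_screen(passes_survived) <= max_allowed_fraction


if __name__ == "__main__":
    for passes in (5, 20):
        used = max_life_fraction_per_screen(passes)
        print(f"Survived {passes} passes: one pass uses at most {used:.0%} of life, "
              f"safe = {screen_is_safe(passes)}")
```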
7. VALIDATE THE DESIGN AFTER DESIGN VERIFICATION AND TRANSFER TO MANUFACTURING
Using specimens from the actual manufacturing process, one should subject the product to the suite of required environmental and regulatory tests. These are “success tests,” as the objective is to pass these qualification tests. This ensures that the baseline product as transferred to manufacturing will meet customer needs.
Every situation is different, and no single reliability plan will work for all situations. Create your reliability plan to meet your needs. Focus on the decisions you need to make along the way and select tasks that will help you have the right information at the right time to create a reliable product.
Bio:
Fred Schenkelberg is an experienced reliability engineering and management consultant with his firm FMS Reliability. His passion is working with teams to create cost-effective reliability programs that solve problems, create durable and reliable products, increase customer satisfaction, and reduce warranty costs. If you enjoyed this article, consider subscribing to the ongoing series Musings on Reliability and Maintenance Topics.