Intel® Memory Failure Prediction substantially improves memory
reliability through online machine learning and reduces downtime
Business: Tencent is one of the biggest cloud
solution providers in China with a
presence throughout three continents.
Tencent Seafront Towers
in Shenzhen, China
Challenges • Real-time visibility into memory health
• Effective DIMM replacement strategy
• Predictive insights into server memory uptime and workload transfer
Solution • Intel® Memory Failure Prediction
Tencent, a leading China-based global cloud-solutions provider with operations
in APAC, Europe and North America, set up Intel® Memory Failure Prediction
(Intel® MFP) for a test deployment with thousands of servers based on Intel® Xeon®
Scalable Processors to reduce downtime caused by server memory failures.
Tencent’s IT staff deployed Intel® MFP in their data center and integrated it into
their existing management systems to analyze their server memory failures, predict
potential future failures, reduce downtime, and improve their current Dual Inline
Memory Module (DIMM) replacement and upgrade policies.
The Intel® MFP deployment improved memory reliability through predictions
based on micro-level memory failure information captured from the operating
system’s Error Detection and Correction (EDAC) driver, which stores historical memory
error logs. Intel® MFP also gave Tencent’s IT staff enough information to proactively
address potential memory issues and replace failing DIMMs before they reach a
terminal stage and cause server failures, thus reducing downtime.
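As an illustration of the kind of data the EDAC driver exposes, the following is a minimal sketch that reads the per-memory-controller correctable and uncorrectable error counters that the Linux EDAC subsystem publishes under sysfs. It is not Intel® MFP's collection mechanism, which this case study does not describe; the function name `read_edac_counts` is hypothetical, and the finer-grained (bank, row, column) details mentioned above come from other kernel sources rather than these summary counters.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    `root` is the standard Linux EDAC sysfs directory; it is a parameter
    so the function can be pointed at a copy of the tree for testing.
    """
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or not a Linux host.
        return {}

    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                # Each file holds a single decimal counter.
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this platform.
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A monitoring agent could poll these counters periodically and forward the deltas to a central store, where they become the historical record that a prediction model consumes.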
This initial test deployment indicated a 5X improvement in DIMM-level failure prediction.
If Tencent deployed Intel® MFP across all of its data centers, it would improve
the effectiveness of server reliability-aware workload management and decrease the
percentage of Uncorrectable Errors (UEs), and therefore significantly reduce downtime.
Additionally, Tencent’s operational efficiency would improve and its
spending on unnecessary DIMM purchases would fall.
Background
Memory failures are one of the most critical hardware failures that occur in data
centers today. Intel® MFP is a perfect solution for organizations such as online
and cloud service providers that depend heavily on server reliability, availability
and serviceability (RAS). Intel® MFP predicts memory failure events by analyzing
historical data to prevent potential catastrophic events before they happen.
Intel® MFP is vendor agnostic and works in conjunction with other data center
management solutions including Intel® Data Center Manager (Intel® DCM). Once
deployed, the resulting data can be used to analyze and
predict server memory issues before they happen.
Tencent deployed Intel® MFP in a test environment containing
thousands of servers with Intel® Xeon® Scalable Processors
to gain better insights into their memory health. Intel® MFP
monitored the health of the servers’ Dynamic Random Access
Memory (DRAM) modules and provided administrators with
critical information about them, including a health score
based on their historical data.
Intel® MFP Provides Real-time Memory Health Insights
Intel® MFP uses online machine learning to analyze the
historical data collected on server memory down to the
DIMM, bank, column, row, and cell levels and gives a memory
health score to predict potential future failures.
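To make the idea of an incrementally updated health score concrete, here is a toy sketch. It is not Intel® MFP's model, which this case study does not disclose. The class names, the one-day half-life, and the scoring formula are all assumptions chosen for illustration. Each correctable error updates a per-DIMM state in place (the "online" part), recent errors count more than old ones, and errors spread across many rows and columns lower the score further.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CorrectableError:
    timestamp: float  # seconds since an arbitrary epoch
    dimm: str
    bank: int
    row: int
    column: int


@dataclass
class DimmHealth:
    half_life_s: float = 86400.0        # assumed: error weight halves daily
    decayed_errors: float = 0.0
    last_update: Optional[float] = None
    rows: set = field(default_factory=set)
    columns: set = field(default_factory=set)

    def update(self, err: CorrectableError) -> None:
        """Fold one error into the running state without storing history."""
        if self.last_update is not None:
            dt = max(0.0, err.timestamp - self.last_update)
            self.decayed_errors *= 0.5 ** (dt / self.half_life_s)
        self.decayed_errors += 1.0
        self.last_update = err.timestamp
        self.rows.add((err.bank, err.row))
        self.columns.add((err.bank, err.column))

    def score(self) -> float:
        """Hypothetical score in (0, 100]; 100 means no errors observed."""
        spread = len(self.rows) + len(self.columns)
        risk = self.decayed_errors * (1.0 + math.log1p(spread))
        return 100.0 / (1.0 + risk)


def score_dimms(events):
    """Score every DIMM seen in `events`, assumed to arrive in time order."""
    health = {}
    for err in events:
        health.setdefault(err.dimm, DimmHealth()).update(err)
    return {dimm: h.score() for dimm, h in health.items()}
```

In this sketch a DIMM with a single isolated error keeps a much higher score than one accumulating errors across many rows, which mirrors the intuition that spreading faults are a stronger warning sign than a single weak cell. A production model would be trained on labeled failure data rather than hand-tuned.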
The resulting analysis and health
Intel® Memory Failure
Prediction at Tencent