References

What? / Why?
I created this section to share the references I used throughout the project, as a thank you to the people on whose shoulders my paper stands, together with some of my raw notes and extracts from these works that I used for my own paper. My notes may be a bit rough around the edges, but please know that I mean no disrespect with any of them, and that your works have helped me better understand the effort and techniques used to flag a disk as good or bad, and even to estimate its remaining useful life. The list below is in no particular order of preference; it simply follows the order in which I discovered each paper using the IEEE and ScienceDirect platforms as well as Google. The title is the link to the official article and the body is the abstract as I extracted it from the source.
In current storage systems, to protect data security, disk failure prediction is required. Machine learning proved to be a method to solve the problem of disk failure prediction. However, because the disk-related values are affected by factors such as their use and usage environment, the values of different disks in the event of a failure are not the same. The normal value on one disk may be the value when another disk fails. Some studies have introduced the concept of time windows into disk failure prediction, trying to improve the ability of disk failure prediction by studying the relationship of a disk’s value change over time, and achieved good prediction results. We chose the neural network with time-series model to further validate the prediction of time series affect performance, and would like to be able to further improve the prediction performance by using a neural network. In this paper, we will introduce a disk failure prediction system based on LSTM networks. Considering the individual differences of the disks, we replace the input in the LSTM network with the continuous running records of the disks. The network will learn the disk information over a period of time and predict whether this disk will fail. With the proposed approach we are able to predict a disk will fail in next fifteen days with an average precision of 86.31. By comparing with other algorithms, our method performs well.
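Note: the time-window idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the window length, and the labelling rule (a window is positive when it ends at most `horizon` days before the disk's last record, which is assumed to be the failure day) are my own assumptions for illustration.

```python
from typing import List, Sequence


def make_windows(records: Sequence[Sequence[float]], window: int) -> List[List[Sequence[float]]]:
    """Slice one disk's chronologically ordered daily records into overlapping windows.

    `records` holds one feature vector per day (hypothetical layout).
    Returns an empty list when the disk has fewer than `window` records.
    """
    return [list(records[i:i + window]) for i in range(len(records) - window + 1)]


def label_windows(n_records: int, window: int, failed: bool, horizon: int) -> List[int]:
    """Give each window a binary label.

    Assumption: for a failed disk, the last record is the failure day, and a
    window is labelled 1 when it ends at most `horizon` days before that day.
    Windows of healthy disks are always labelled 0.
    """
    labels = []
    for start in range(n_records - window + 1):
        end_index = start + window - 1
        days_to_failure = (n_records - 1) - end_index
        labels.append(1 if failed and days_to_failure <= horizon else 0)
    return labels


if __name__ == "__main__":
    # Six days of a single made-up attribute for one disk.
    daily = [[float(day)] for day in range(6)]
    print(make_windows(daily, window=3))
    print(label_windows(len(daily), window=3, failed=True, horizon=1))
```

Each window would then be one input sequence for a recurrent model, with the label as its target.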
Hard disk (HDD) failure is the most important reliability issue in the data center. Therefore, the prediction of hard disk failure has become the focus of attention of major data centers. However, most current research work does not notice the fact that the data on the hard disk is mostly unlabeled data. Since the degradation period in HDD is very short, the mixture of health data and erroneous data can cause serious data imbalance. This makes fault prediction a difficult task. In response to the above problems, a multi-instance long-term sequence classification method based on long-short-term memory (LSTM) network is proposed. By dividing the long-term sequence data packet into multiple instances, the relationship between the instance and the sample label is studied to predict HDD failure. Through the analysis of the hard disk data of a communication company and the Backblaze data center, this method can obtain better results than other methods.
The article is devoted to the problem of estimating the remaining useful life (RUL) of hard drives. Approaches to solving the problem posed using machine learning and deep learning algorithms, namely, algorithms for decision trees, random forest, simple recurrent neural network (SimpleRNN), gated recurrent unit (GRU) and long short-term memory (LSTM) are considered. The article also discusses methods for improving forecast accuracy by generating features based on time series, and also provides a comparison of approaches without feature generation and with generation. This study uses open data from BackBlaze on the operation of several tens of thousands of disk drives over several months. A general comparison table for all algorithms and approaches used is presented.
With the rapid growth in the number of disks, disk failures are increasingly becoming a problem for data centers. To improve the reliability and security of the data center, deep learning methods have been widely used by performing the remaining useful life (RUL) prediction of hard disk drives (HDD). However, deep learning methods fail to deal with the long sequence data and extract the crucial degradation information. In this paper, an attention-based bidirectional long short-term memory (LSTM) with differential features method is proposed, in which the differential features are extracted by manual feature engineering, and then apply the attention-based bidirectional LSTM network to assign higher weights to crucial features that contain useful degradation information for RUL prediction. Experiments results on the Backblaze dataset show that the proposed approach outperforms the traditional LSTM methods, and achieves a 97.83% failure detection rate (FDR) to predict RUL of HDDs up to 60 days before failure.
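Note: as I understand it, a "differential feature" is the change of an attribute between two observations. The sketch below shows a plain lagged difference over a disk's daily records. The function name, the `lag` parameter, and the sample values are hypothetical, and the paper may construct its differential features differently.

```python
from typing import List, Sequence


def differential_features(series: Sequence[Sequence[float]], lag: int = 1) -> List[List[float]]:
    """Return, for every day t >= lag, the per-attribute change series[t] - series[t - lag].

    `series` holds one feature vector per day, in chronological order.
    The result is `lag` rows shorter than the input.
    """
    return [
        [current - previous for current, previous in zip(series[t], series[t - lag])]
        for t in range(lag, len(series))
    ]


if __name__ == "__main__":
    # Two made-up attributes over three days.
    smart = [[5.0, 100.0], [5.0, 98.0], [8.0, 95.0]]
    print(differential_features(smart))         # day-over-day change
    print(differential_features(smart, lag=2))  # change over two days
```

The differences can be used on their own or appended to the raw attribute values before they are fed to a sequence model.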
Hard disk drive failures are one of the most common causes of service downtime in data centers. Predictive maintenance techniques have been adopted to extend the Remaining Useful Life (RUL) of these drives, and minimize service shortage and data loss. Several approaches based on machine and deep learning techniques have been proposed to address these issues, mostly exploiting models based on Self-Monitoring analysis and Reporting Technology (SMART) attributes. While these models have proven to be reliable, their performance is affected by the lack of information about the proximity of disk failure in time. Moreover, many of these techniques are sensitive to the highly unbalanced nature of existing data-sets, in terms of good to failed hard disk ratio. In this article, we propose a LSTM (Long Short Term Memory)-based model combining SMART attributes and temporal analysis for estimating a hard drive health status according to its time to failure. Our approach outperforms state-of-the-art methods when evaluated on two data-sets, one containing hourly samples from 23395 disks and the other reporting daily samples from 29878 disks. Experimental results showed that our approach is well suited to data-sets with different sampling periods, being able to predict hard drive health status up to 45 days before failure.
Hard disk failure prediction plays an important role in reducing data center downtime and improving service reliability. In contrast to existing work of modeling the prediction problem as classification tasks, we aim to directly predict the remaining useful life (RUL) of hard disk drives. We experiment with two different types of machine learning methods: random forest and long short-term memory (LSTM) recurrent neural networks. The developed machine learning models are applied to predict RUL for a large number of hard disk drives. Preliminary experimental results indicate that random forest method using only the current snapshot of SMART attributes is comparable to or outperforms LSTM, which models historical temporal patterns of SMART sequences using a more sophisticated architecture.
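Note: treating the problem as regression rather than classification means each daily record needs a numeric RUL target. A minimal sketch of how such targets could be derived for a disk with a known failure date is shown below. The function name and the optional `cap` parameter are my own assumptions and are not taken from the paper.

```python
from datetime import date
from typing import List, Optional, Sequence


def rul_labels(record_dates: Sequence[date], failure_date: date,
               cap: Optional[int] = None) -> List[int]:
    """Remaining useful life in days for each record of a failed disk.

    RUL is the number of days between the record and the failure date.
    `cap` (hypothetical) optionally clips large values, so that records far
    from failure all share the same target.
    """
    labels = []
    for record_date in record_dates:
        remaining = (failure_date - record_date).days
        if cap is not None:
            remaining = min(remaining, cap)
        labels.append(remaining)
    return labels


if __name__ == "__main__":
    days = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]
    print(rul_labels(days, failure_date=date(2020, 1, 3)))
    print(rul_labels(days, failure_date=date(2020, 1, 3), cap=1))
```

A snapshot model such as a random forest would pair each single record with its label, while a sequence model would pair a window of records with the label of the window's last day.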
In this paper we focus on application of data-driven methods for remaining useful life estimation in components where past failure data is not uniform across devices, i.e. there is a high variance in the minimum and maximum value of the key parameters. The system under study is the hard disks used in computing cluster. The data used for analysis is provided by Backblaze as discussed later. In the article, we discuss the architecture of the long short term neural network used and describe the mechanisms to choose the various hyper-parameters. Further, we describe the challenges faced in extracting effective training sets from highly unorganized and class-imbalanced big data and establish methods for online predictions with extensive data pre-processing, feature extraction and validation through online simulation sets with unknown remaining useful lives of the hard disks. Our algorithm performs especially well in predicting RUL near the critical zone of a device approaching failure. With the proposed approach we are able to predict whether a disk is going to fail in next ten days with an average precision of 0.8435. We also show that the architecture trained on a particular model is generalizable and transferable as it can be used to predict RUL for devices in other models from same manufacturer.
Several research has been done to propose early failure detection techniques for hard disk drives in order to improve storage systems availability and avoid data loss. Failure prediction in such circumstances would allow for the reduction of downtime costs through anticipated disk replacements. Many of the techniques proposed so far mainly perform incipient failure detection thus not allowing for proper planning of such maintenance tasks. Others perform well only under a limited prediction horizon. In this work, we present a remaining useful life estimation approach for hard disk drives based on SMART parameters that is capable of predicting failures in both long and short term intervals by leveraging the capabilities of LSTM networks.
Nowadays Hard Disk Drives (HDDs) are essential storage devices in most large-scale storage systems. As a consequence, HDD failures have severe effects that may range from data loss to service unavailability. Considering such scenario, academy and industry have driven its attention to HDDs failure prognostics solutions. In this work, we evaluate two of the most common deep learning architecture in the task of HDD failure prediction. Our tests were conducted on real-world datasets and the proposals were compared to a recurrent neural network. The results of this study showed that deep learning models are a valid alternative to solve this problem since they achieved good results and a significant amount of data is available.
Physical and cloud storage services are well-served by functioning and reliable high-volume storage systems. Recent observations point to hard disk reliability as one of the most pressing reliability issues in data centers containing massive volumes of storage devices such as HDDs. In this regard, early detection of impending failure at the disk level aids in reducing system downtime and reduces operational loss making proactive health monitoring a priority for AIOps in such settings. In this work, we introduce methods of extracting meaningful attributes associated with operational failure and of pre-processing the highly imbalanced health statistics data for subsequent prediction tasks using data-driven approaches. We use a Bidirectional LSTM with a multi-day look back period to learn the temporal progression of health indicators and baseline them against vanilla LSTM and Random Forest models to come up with several key metrics that establish the usefulness of and superiority of our model under some tightly defined operational constraints. For example, using a 15 day look back period, our approach can predict the occurrence of disk failure with an accuracy of 96.4% considering test data 60 days before failure. This helps to alert operations maintenance well in-advance about potential mitigation needs. In addition, our model reports a mean absolute error of 0.12 for predicting failure up to 60 days in advance, placing it among the state-of-the-art in recent literature.