In this experiment I'm trying to find a way to predict the pre-fail state of HDDs using RandomForestClassifier with zero false negatives. The idea is to create a new flag called pre-fail (values 0 or 1) that is set to 1 when a disk is within N days of its actual failure (30 days as a starting point; this run uses 90) and to experiment with different windows until I reach a version that predicts with high enough accuracy but, ideally, zero false negatives (we don't want the model to flag a drive as okay when in reality it is about to fail). I will experiment with a general model (trained on the whole dataset), a specific one (trained on a single drive model), and a hybrid (trained on one drive model and used to predict a related model). This will be the foundation for the first part of my problem: looking at a drive for the first time and flagging it into a category (a "first impression" :) for drives with no historical information, like a new user who has just installed the solution.

In [1]:
#import the relevant libraries 
import os
import pymysql
import pandas as pd
In [2]:
#establish the connection to the mysql database
host = "192.168.88.187"
port = "3306"
user = "backblaze"
password = "Testing.2023"
database = "backblaze_ml"

conn = pymysql.connect(
    host=host,
    port=int(port),
    user=user,
    passwd=password,
    db=database,
    charset='utf8mb4')
In [3]:
#for this experiment I'm going to work on data for all drive models, restricted to drives that eventually failed (so days_to_failure is defined)
sqldf = pd.read_sql_query("select * from drive_stats where date >= '2014-03-01' and serial_number in (select distinct(serial_number) from drive_stats where failure=1 and date >= '2014-03-01')", conn)
sqldf
/tmp/ipykernel_2266785/2026978849.py:2: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  sqldf = pd.read_sql_query("select * from drive_stats where date >= '2014-03-01' and serial_number in (select distinct(serial_number) from drive_stats where failure=1 and date >= '2014-03-01')", conn)
Out[3]:
date serial_number model capacity_bytes days_to_failure failure smart_5_raw smart_187_raw smart_188_raw smart_189_raw smart_196_raw smart_197_raw
0 2014-03-01 MJ1311YNG36USA Hitachi HDS5C3030ALA630 3000592982016 991 0 67.0 NaN NaN NaN 101.0 0.0
1 2014-03-01 MJ1311YNG733NA Hitachi HDS5C3030ALA630 3000592982016 840 0 0.0 NaN NaN NaN 0.0 0.0
2 2014-03-01 W3009AX6 ST4000DM000 4000787030016 54 0 0.0 0.0 0.000000e+00 1.0 NaN 8.0
3 2014-03-01 WD-WCAV5M690585 WDC WD10EADS 1000204886016 409 0 0.0 NaN NaN NaN 0.0 0.0
4 2014-03-01 S1F0CSW2 ST3000DM001 3000592982016 229 0 0.0 0.0 7.301556e+10 0.0 NaN 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
14734298 2023-03-31 ZCH06VE2 ST12000NM0007 12000138625024 0 0 0.0 1.0 0.000000e+00 NaN NaN 0.0
14734299 2023-03-31 X8L0A01BF97G TOSHIBA MG07ACA14TA 14000519643136 0 0 0.0 NaN NaN NaN 0.0 0.0
14734300 2023-03-31 9JG4657T WDC WUH721414ALE6L4 14000519643136 0 0 0.0 NaN NaN NaN 0.0 0.0
14734301 2023-03-31 6090A00RFVKG TOSHIBA MG08ACA16TA 16000900661248 0 0 0.0 NaN NaN NaN 0.0 0.0
14734302 2023-03-31 51R0A2Q8FVGG TOSHIBA MG08ACA16TE 16000900661248 0 0 0.0 NaN NaN NaN 0.0 0.0

14734303 rows × 12 columns
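Side note: the UserWarning above is because pandas only officially supports SQLAlchemy connectables. A minimal sketch of the SQLAlchemy route (assuming the sqlalchemy package is available alongside pymysql, and reusing the connection settings defined earlier):

from sqlalchemy import create_engine

#same query as above, but through a SQLAlchemy engine so pandas stops warning
engine = create_engine(f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}")
query = ("select * from drive_stats where date >= '2014-03-01' and serial_number in "
         "(select distinct(serial_number) from drive_stats where failure=1 and date >= '2014-03-01')")
sqldf = pd.read_sql_query(query, engine)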

In [4]:
#x is a DataFrame row, n is the number of days before the actual failure to flag as pre-fail
def preFailOn(x, n):
  if x.days_to_failure == 0:
    #the day of the actual failure; return None so the row can be dropped later
    return None
  elif x.days_to_failure <= n:
    return 1
  else:
    return 0
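Side note: applying preFailOn row by row (as in the next cell) gets slow at ~14.7 million rows; a vectorized sketch with the same labeling logic, using numpy, once traindf = sqldf.copy() exists:

import numpy as np

#conditions are checked in order: NaN on the failure day itself, 1 within n days of failure, else 0
n = 90
traindf['prefailure'] = np.select(
    [traindf.days_to_failure == 0, traindf.days_to_failure <= n],
    [np.nan, 1.0],
    default=0.0)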
In [5]:
traindf = sqldf.copy()
traindf['prefailure'] = traindf.apply(lambda row: preFailOn(row, 90), axis=1)
#drop the rows where prefailure is NaN (i.e. the rows at days_to_failure = 0)
traindf = traindf.dropna(subset=['prefailure'])
traindf
Out[5]:
date serial_number model capacity_bytes days_to_failure failure smart_5_raw smart_187_raw smart_188_raw smart_189_raw smart_196_raw smart_197_raw prefailure
0 2014-03-01 MJ1311YNG36USA Hitachi HDS5C3030ALA630 3000592982016 991 0 67.0 NaN NaN NaN 101.0 0.0 0.0
1 2014-03-01 MJ1311YNG733NA Hitachi HDS5C3030ALA630 3000592982016 840 0 0.0 NaN NaN NaN 0.0 0.0 0.0
2 2014-03-01 W3009AX6 ST4000DM000 4000787030016 54 0 0.0 0.0 0.000000e+00 1.0 NaN 8.0 1.0
3 2014-03-01 WD-WCAV5M690585 WDC WD10EADS 1000204886016 409 0 0.0 NaN NaN NaN 0.0 0.0 0.0
4 2014-03-01 S1F0CSW2 ST3000DM001 3000592982016 229 0 0.0 0.0 7.301556e+10 0.0 NaN 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14734123 2023-03-30 ZCH06VE2 ST12000NM0007 12000138625024 1 0 0.0 1.0 0.000000e+00 NaN NaN 0.0 1.0
14734124 2023-03-30 X8L0A01BF97G TOSHIBA MG07ACA14TA 14000519643136 1 0 0.0 NaN NaN NaN 0.0 0.0 1.0
14734126 2023-03-30 9JG4657T WDC WUH721414ALE6L4 14000519643136 1 0 0.0 NaN NaN NaN 0.0 0.0 1.0
14734128 2023-03-30 6090A00RFVKG TOSHIBA MG08ACA16TA 16000900661248 1 0 0.0 NaN NaN NaN 0.0 0.0 1.0
14734129 2023-03-30 51R0A2Q8FVGG TOSHIBA MG08ACA16TE 16000900661248 1 0 0.0 NaN NaN NaN 0.0 0.0 1.0

14716816 rows × 13 columns

In [6]:
import seaborn as sns
import numpy as np
np.random.seed(1337)
from IPython.display import Image
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
In [7]:
#display floats without scientific notation and cap table output at 50 rows
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.options.display.max_rows = 50
In [8]:
#replace missing SMART values with 0
traindf.fillna(0, inplace=True)
#drop days_to_failure (it would leak the label) along with the identifier columns; the target for this experiment is prefailure, not failure
traindf = traindf.drop(columns=['days_to_failure','capacity_bytes', 'model', 'serial_number', 'date', 'failure'])
traindf
Out[8]:
smart_5_raw smart_187_raw smart_188_raw smart_189_raw smart_196_raw smart_197_raw prefailure
0 67.000 0.000 0.000 0.000 101.000 0.000 0.000
1 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 0.000 0.000 0.000 1.000 0.000 8.000 1.000
3 0.000 0.000 0.000 0.000 0.000 0.000 0.000
4 0.000 0.000 73015558161.000 0.000 0.000 0.000 0.000
... ... ... ... ... ... ... ...
14734123 0.000 1.000 0.000 0.000 0.000 0.000 1.000
14734124 0.000 0.000 0.000 0.000 0.000 0.000 1.000
14734126 0.000 0.000 0.000 0.000 0.000 0.000 1.000
14734128 0.000 0.000 0.000 0.000 0.000 0.000 1.000
14734129 0.000 0.000 0.000 0.000 0.000 0.000 1.000

14716816 rows × 7 columns
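A quick sanity check that the zero-fill left nothing behind:

#every column should now report zero missing values
traindf.isna().sum()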

In [9]:
traindf.describe().T
Out[9]:
count mean std min 25% 50% 75% max
smart_5_raw 14716816.000 167.991 2089.559 0.000 0.000 0.000 0.000 65528.000
smart_187_raw 14716816.000 5.031 285.034 0.000 0.000 0.000 0.000 65535.000
smart_188_raw 14716816.000 1743647741.141 80449956585.986 0.000 0.000 0.000 0.000 10196408011086.000
smart_189_raw 14716816.000 5.959 523.304 0.000 0.000 0.000 0.000 65535.000
smart_196_raw 14716816.000 3.622 70.742 0.000 0.000 0.000 0.000 9031.000
smart_197_raw 14716816.000 11.524 946.471 0.000 0.000 0.000 0.000 462016.000
prefailure 14716816.000 0.100 0.301 0.000 0.000 0.000 0.000 1.000
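The prefailure mean of 0.100 above already gives the class balance away: only about 10% of rows carry the positive label, which matters when reading the accuracy later. An explicit check:

#share of each class in the label column
traindf['prefailure'].value_counts(normalize=True)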
In [10]:
#confirm that no object (non-numeric) columns remain
obj = traindf.dtypes[traindf.dtypes == object].index
obj
Out[10]:
Index([], dtype='object')
In [11]:
#here we split the dataset 70/30 into train/test, stratified on prefailure so both sides keep the ~10% positive rate
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(traindf[traindf.columns[:-1]], 
                                                  traindf[traindf.columns[-1:]] ,
                                                  stratify=traindf[traindf.columns[-1:]], 
                                                  test_size=0.30)
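To confirm the stratified split preserved that ~10% positive rate on both sides:

#both means should sit close to the overall positive rate of ~0.10
print(Y_train['prefailure'].mean(), Y_test['prefailure'].mean())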
In [12]:
X_train
Out[12]:
smart_5_raw smart_187_raw smart_188_raw smart_189_raw smart_196_raw smart_197_raw
321073 0.000 0.000 0.000 1.000 0.000 0.000
7839948 72.000 0.000 0.000 0.000 0.000 0.000
3250337 0.000 0.000 0.000 19.000 0.000 0.000
10919617 0.000 0.000 0.000 0.000 0.000 0.000
6301574 0.000 0.000 0.000 0.000 0.000 0.000
... ... ... ... ... ... ...
8848819 0.000 0.000 0.000 0.000 0.000 0.000
7464808 0.000 0.000 0.000 0.000 0.000 0.000
11592756 0.000 0.000 0.000 0.000 0.000 0.000
4724364 0.000 0.000 0.000 0.000 0.000 0.000
6032095 0.000 0.000 0.000 0.000 0.000 0.000

10301771 rows × 6 columns

In [13]:
Y_train
Out[13]:
prefailure
321073 0.000
7839948 1.000
3250337 1.000
10919617 0.000
6301574 0.000
... ...
8848819 0.000
7464808 0.000
11592756 0.000
4724364 0.000
6032095 0.000

10301771 rows × 1 columns

In [14]:
X_test
Out[14]:
smart_5_raw smart_187_raw smart_188_raw smart_189_raw smart_196_raw smart_197_raw
10519443 0.000 0.000 1.000 0.000 0.000 0.000
3094842 0.000 0.000 0.000 0.000 0.000 0.000
8753352 0.000 0.000 0.000 0.000 0.000 0.000
10076748 0.000 0.000 0.000 0.000 0.000 0.000
1957194 0.000 0.000 0.000 8.000 0.000 0.000
... ... ... ... ... ... ...
4444418 0.000 0.000 0.000 0.000 0.000 0.000
14127358 0.000 0.000 0.000 0.000 0.000 0.000
3565370 0.000 0.000 0.000 0.000 0.000 0.000
3502590 0.000 0.000 0.000 0.000 0.000 1.000
7426991 0.000 0.000 0.000 0.000 0.000 0.000

4415045 rows × 6 columns

In [15]:
Y_test
Out[15]:
prefailure
10519443 0.000
3094842 0.000
8753352 0.000
10076748 0.000
1957194 0.000
... ...
4444418 0.000
14127358 0.000
3565370 0.000
3502590 1.000
7426991 0.000

4415045 rows × 1 columns

In [16]:
import joblib

#Building the Random Forest Classifier (RANDOM FOREST) 
from sklearn.ensemble import RandomForestClassifier 

# random forest model creation 
rfc = RandomForestClassifier() 
rfc.fit(X_train, Y_train.values.ravel())  #ravel() passes y as a 1d array, avoiding the column-vector DataConversionWarning

#save the model
joblib.dump(rfc, "./dissertation-ml-experiment4-randomforestclassifier-predict-state-90days-before-failure-practical-allhddmodels.joblib")
Out[16]:
['./dissertation-ml-experiment4-randomforestclassifier-predict-state-90days-before-failure-practical-allhddmodels.joblib']
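To reuse the saved model later without retraining (a usage sketch):

#reload the persisted classifier from disk
rfc = joblib.load("./dissertation-ml-experiment4-randomforestclassifier-predict-state-90days-before-failure-practical-allhddmodels.joblib")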
In [17]:
# generate predictions on the held-out test set (the capital 'P' in yPred distinguishes these predictions from other models in this series)
yPred = rfc.predict(X_test) 
In [18]:
results = pd.DataFrame({'Actual':Y_test['prefailure']})
results['Predicted'] = yPred
results
Out[18]:
Actual Predicted
10519443 0.000 0.000
3094842 0.000 0.000
8753352 0.000 0.000
10076748 0.000 0.000
1957194 0.000 0.000
... ... ...
4444418 0.000 0.000
14127358 0.000 0.000
3565370 0.000 0.000
3502590 1.000 0.000
7426991 0.000 0.000

4415045 rows × 2 columns

In [19]:
yp = results['Predicted']
yt = results['Actual']
In [20]:
#Results of our predictions

from sklearn.metrics import classification_report, accuracy_score  
from sklearn.metrics import precision_score, recall_score 
from sklearn.metrics import f1_score, matthews_corrcoef 
from sklearn.metrics import confusion_matrix 

n_errors = (yt != yp).sum()        #here we count the number of cases where predicted and actual are different
print("Model used is: Random Forest classifier") 
  
acc = accuracy_score(yt, yp) 
print("The accuracy is {}".format(acc)) 
  
prec = precision_score(yt, yp) 
print("The precision is {}".format(prec)) 
  
rec = recall_score(yt, yp) 
print("The recall is {}".format(rec)) 
  
f1 = f1_score(yt, yp) 
print("The F1-Score is {}".format(f1)) 
  
MCC = matthews_corrcoef(yt, yp) 
print("The Matthews correlation coefficient is {}".format(MCC)) 
Model used is: Random Forest classifier
The accuracy is 0.9177976668414478
The precision is 0.8513907053797193
The recall is 0.22015696831548784
The F1-Score is 0.3498483570068378
The Matthews correlation coefficient is 0.40795033259749663
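
Since the whole point is zero false negatives, it's worth pulling that number out directly rather than inferring it from recall (a sketch; confusion_matrix is already imported above):

#tn, fp, fn, tp for the binary case; fn counts pre-fail drives predicted as healthy
tn, fp, fn, tp = confusion_matrix(yt, yp).ravel()
print("False negatives: {}".format(fn))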

Remember! None of these drives has failed yet at prediction time; we're predicting the pre-fail state, i.e. within 90 days of the actual failure. The accuracy shown here (>91%) is on a held-out test set drawn from all drive models, but with only ~10% positive labels it flatters the model: a recall of 0.22 means roughly 78% of pre-fail rows are still predicted as healthy, which is exactly the false-negative case this experiment aims to eliminate.

In [21]:
# confusion matrix

LABELS = ['Healthy', 'Pre-fail']  #the positive class is prefailure, not actual failure
conf_matrix = confusion_matrix(Y_test, yPred)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS,
            yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
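
Two directions I could take next to push the false negatives down, sketched here rather than run: lower the decision threshold via predict_proba, and/or reweight the minority class at fit time. Both trade precision for recall, and the 0.2 threshold below is just an illustrative value that would need tuning on a validation set:

#lower the implicit 0.5 decision threshold to catch more pre-fail drives
proba = rfc.predict_proba(X_test)[:, 1]  #probability of the pre-fail class
yPred_lowthr = (proba >= 0.2).astype(int)
print("Recall at threshold 0.2: {}".format(recall_score(yt, yPred_lowthr)))

#penalize missing the minority class during training
rfc_balanced = RandomForestClassifier(class_weight='balanced')
rfc_balanced.fit(X_train, Y_train.values.ravel())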