In this notebook I'm demonstrating what I did with the Backblaze dataset - which is pretty big (in a previous attempt to work with it the size of the database exceeds 200G and the main table had over 330 mil entries in total ) - to get it to a manageable size such that I would not need massive clusters in order to process it.
The plan is as follows:
1. get a combined list of all the drives that failed over time
2. save it to the failed_drive table in database backblaze_ml such that we can use it later on to generate a table with information specific only to these failed drives
#import the relevant libraries
import os
import pymysql
import pandas as pd
#establish the connection to the mysql database
host = "192.168.88.187"
port = "3306"
user = "backblaze"
password = "Testing.2023"
database = "backblaze_ml"
conn = pymysql.connect(
host=host,
port=int(3306),
user=user,
passwd=password,
db=database,
charset='utf8mb4')
#list of databases I'll be working with
df = pd.read_sql_query("show databases like 'backblaze%'", conn)
df.tail(100)
Backblaze provides their SMART database exports in the form of quarterly zip files containing CSV files for each day. Importing these groups of files in their own database makes most sense as it keeps the size of the database relatively small enough to still be able to work with. The actual database I will be working with during the training phase is backblaze_ml where I will havel collected only the relevant information from the raw databases which will have a manageable size (considering it will only contain the relevant information needed) to not need any large scale infrastructure. -> remember: one of the core ideas behind my project is "practical" -> needing to deploy a lot of nodes to handle a lot of data is expensive which makes it less practical if dedicated specifically for this exercise
#demo for backblaze_2013_Q2 (there is no data for Q1)
conn = pymysql.connect(
host=host,
port=int(3306),
user=user,
passwd=password,
db="backblaze_ml",
charset='utf8mb4')
df = pd.read_sql_query("show tables", conn)
df.tail(100)
The failed_drive table is a table I created which only contains the serial number and the model of the drives which I found in the database as having the fail field set to 1
#sample below
df = pd.read_sql_query("select * from failed_drive limit 10", conn)
df.tail(10)
The queries the follow are the ones demonstrating how I extracted the information I need to populate in the failed_drive table across all databases
#massive query that does union across all the databases to retrieve the list of failed drives
df = pd.read_sql_query("select distinct serial_number, model from backblaze_2013_Q2.drive_stats where backblaze_2013_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q3.drive_stats where backblaze_2013_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q4.drive_stats where backblaze_2013_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q1.drive_stats where backblaze_2014_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q2.drive_stats where backblaze_2014_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q3.drive_stats where backblaze_2014_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q4.drive_stats where backblaze_2014_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q1.drive_stats where backblaze_2015_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q2.drive_stats where backblaze_2015_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q3.drive_stats where backblaze_2015_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q4.drive_stats where backblaze_2015_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q1.drive_stats where backblaze_2016_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q2.drive_stats where backblaze_2016_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q3.drive_stats where backblaze_2016_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q4.drive_stats where backblaze_2016_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q1.drive_stats where backblaze_2017_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q2.drive_stats where backblaze_2017_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q3.drive_stats where backblaze_2017_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q4.drive_stats where backblaze_2017_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q1.drive_stats where backblaze_2018_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q2.drive_stats where backblaze_2018_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q3.drive_stats where backblaze_2018_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q4.drive_stats where backblaze_2018_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q1.drive_stats where backblaze_2019_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q2.drive_stats where backblaze_2019_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q3.drive_stats where backblaze_2019_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q4.drive_stats where backblaze_2019_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q1.drive_stats where backblaze_2020_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q2.drive_stats where backblaze_2020_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q3.drive_stats where backblaze_2020_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q4.drive_stats where backblaze_2020_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q1.drive_stats where backblaze_2021_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q2.drive_stats where backblaze_2021_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q3.drive_stats where backblaze_2021_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q4.drive_stats where backblaze_2021_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q1.drive_stats where backblaze_2022_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q2.drive_stats where backblaze_2022_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q3.drive_stats where backblaze_2022_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q4.drive_stats where backblaze_2022_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2023_Q1.drive_stats where backblaze_2023_Q1.drive_stats.failure = 1;", conn)
df.tail(10)
Now comes the fun part
conn.cursor().execute("insert into backblaze_ml.failed_drive select distinct serial_number, model from backblaze_2013_Q2.drive_stats where backblaze_2013_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q3.drive_stats where backblaze_2013_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q4.drive_stats where backblaze_2013_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q1.drive_stats where backblaze_2014_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q2.drive_stats where backblaze_2014_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q3.drive_stats where backblaze_2014_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q4.drive_stats where backblaze_2014_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q1.drive_stats where backblaze_2015_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q2.drive_stats where backblaze_2015_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q3.drive_stats where backblaze_2015_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q4.drive_stats where backblaze_2015_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q1.drive_stats where backblaze_2016_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q2.drive_stats where backblaze_2016_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q3.drive_stats where backblaze_2016_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q4.drive_stats where backblaze_2016_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q1.drive_stats where backblaze_2017_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q2.drive_stats where backblaze_2017_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q3.drive_stats where backblaze_2017_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q4.drive_stats where backblaze_2017_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q1.drive_stats where backblaze_2018_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q2.drive_stats where backblaze_2018_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q3.drive_stats where backblaze_2018_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q4.drive_stats where backblaze_2018_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q1.drive_stats where backblaze_2019_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q2.drive_stats where backblaze_2019_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q3.drive_stats where backblaze_2019_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q4.drive_stats where backblaze_2019_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q1.drive_stats where backblaze_2020_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q2.drive_stats where backblaze_2020_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q3.drive_stats where backblaze_2020_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q4.drive_stats where backblaze_2020_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q1.drive_stats where backblaze_2021_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q2.drive_stats where backblaze_2021_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q3.drive_stats where backblaze_2021_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q4.drive_stats where backblaze_2021_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q1.drive_stats where backblaze_2022_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q2.drive_stats where backblaze_2022_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q3.drive_stats where backblaze_2022_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q4.drive_stats where backblaze_2022_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2023_Q1.drive_stats where backblaze_2023_Q1.drive_stats.failure = 1;")
df = pd.read_sql_query("select * from backblaze_ml.failed_drive", conn)
df.tail(10)
Here we're taking a look at number of failed drives by model on the whole dataset (on the individual quarterly databases)
df = pd.read_sql_query("select count(distinct serial_number) as failed_disks, model from (select distinct serial_number, model from backblaze_2013_Q2.drive_stats where backblaze_2013_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q3.drive_stats where backblaze_2013_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2013_Q4.drive_stats where backblaze_2013_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q1.drive_stats where backblaze_2014_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q2.drive_stats where backblaze_2014_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q3.drive_stats where backblaze_2014_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2014_Q4.drive_stats where backblaze_2014_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q1.drive_stats where backblaze_2015_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q2.drive_stats where backblaze_2015_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q3.drive_stats where backblaze_2015_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2015_Q4.drive_stats where backblaze_2015_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q1.drive_stats where backblaze_2016_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q2.drive_stats where backblaze_2016_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q3.drive_stats where backblaze_2016_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2016_Q4.drive_stats where backblaze_2016_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q1.drive_stats where backblaze_2017_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q2.drive_stats where backblaze_2017_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q3.drive_stats where backblaze_2017_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2017_Q4.drive_stats where backblaze_2017_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q1.drive_stats where backblaze_2018_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q2.drive_stats where backblaze_2018_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q3.drive_stats where backblaze_2018_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2018_Q4.drive_stats where backblaze_2018_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q1.drive_stats where backblaze_2019_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q2.drive_stats where backblaze_2019_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q3.drive_stats where backblaze_2019_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2019_Q4.drive_stats where backblaze_2019_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q1.drive_stats where backblaze_2020_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q2.drive_stats where backblaze_2020_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q3.drive_stats where backblaze_2020_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2020_Q4.drive_stats where backblaze_2020_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q1.drive_stats where backblaze_2021_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q2.drive_stats where backblaze_2021_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q3.drive_stats where backblaze_2021_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2021_Q4.drive_stats where backblaze_2021_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q1.drive_stats where backblaze_2022_Q1.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q2.drive_stats where backblaze_2022_Q2.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q3.drive_stats where backblaze_2022_Q3.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2022_Q4.drive_stats where backblaze_2022_Q4.drive_stats.failure = 1 union select distinct serial_number, model from backblaze_2023_Q1.drive_stats where backblaze_2023_Q1.drive_stats.failure = 1) m group by m.model order by failed_disks;", conn)
df
And here we're doing the same thing only this time on the failed_drive table in the backblaze_ml (to validate that it is accurate)
df = pd.read_sql_query("select count(backblaze_ml.failed_drive.model) as failed_disks, backblaze_ml.failed_drive.model as model from backblaze_ml.failed_drive group by backblaze_ml.failed_drive.model order by failed_disks;", conn)
df