Troubleshooting Storage Devices – CompTIA A+ 220-1201 – 5.2

Our drives provide critical long-term storage for our important documents. In this video, you’ll learn about troubleshooting grinding noises, drive not recognized errors, data loss, corruption, RAID issues, S.M.A.R.T. analysis, IOPS measurements, and more.


If you’re working on your computer and you get a message that says “cannot read from the source disk” or an error message that’s very similar to this, then our problem is that we’re either not able to write to the storage drive or read the information that has already been stored on the storage drive. When this occurs, you’ll often see the drive constantly retrying that particular area to see if it can finally read the information that’s stored either on the platters of the hard drive or the SSD.

This type of failure could affect only a certain area of the storage drive, or it may be an intermittent-type problem. Whenever it’s occurring, you’ll notice that the performance of reading and writing from the drive becomes very slow, because if the drive ever runs into this error, it will constantly retry the read or the write to that particular area of the drive.

That retry takes time as the drive tries over and over to either read that data or write information into that area of the drive. Sometimes this is accompanied by a loud clicking noise, especially if it’s a hard drive that we’re trying to access. Sometimes you’ll hear people refer to this as the click of death, because once your hard drive starts making these clicking noises, it’s very difficult to recover any of that data.

These hard drives have platters inside that are spinning very rapidly, often at 5,400 revolutions per minute and higher. There are also these actuator arms that have heads on the end that are moving back and forth across those platters, and those can also have physical failures. All of these components have very tight tolerances, and if any one of them happens to fail, it often creates a cascade effect with all of the other parts of the spinning drive.

You’ll also notice that a lot of these components are metal, so when you do start to have a failure, you’ll start to hear clicking noises or grinding noises or the sounds of metal on metal. This may result in getting very poor performance as the drive continuously retries reading from or writing to that drive, or you may get an error message on the screen and no access at all.

Once you start having a physical problem with a storage drive, it can be very difficult to recover from that particular problem. This is one of the reasons that we always tell you to make sure that you have a very good and recent backup of everything that’s on your system.

If you do start to hear these noises and you haven’t gotten a recent backup, this is your cue to stop what you’re doing, get a backup system, and back up as much data as possible from that storage drive. From that point, we can start doing a little more troubleshooting. We might want to check for any loose or damaged cables if this happens to be a desktop system. This might be an easy fix where we simply reseat a cable and we’re able to communicate properly to that drive.

We should always be aware of any heat problems inside of our computer case, and it could be that our storage drive is getting too hot to operate properly. You might want to check your monitoring software on your computer to see what type of temperatures we have inside of your case and on the storage drive.

If we’ve recently added new hardware to our computer, we could have a problem with how much power is being provided by the power supply. So you might want to perform an audit of all of the components in your system and calculate how much power they’ll need, and then compare that to the capabilities of your existing power supply.
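As a rough sketch of that power audit, the Python below adds up estimated component wattages and compares the total to the power supply rating. The component names, the wattage figures, the 500-watt supply, and the 20% headroom are all made-up assumptions for illustration; you would substitute the numbers from your own hardware documentation.

```python
# Hypothetical component draws in watts; replace with figures for your own hardware.
components = {
    "cpu": 125,
    "gpu": 220,
    "motherboard": 50,
    "memory": 10,
    "hard_drive": 10,
    "ssd": 5,
    "fans": 10,
}


def power_budget(component_watts, psu_watts, headroom=0.20):
    """Return (total_draw, usable_capacity, within_budget).

    headroom is the fraction of the PSU rating kept in reserve (an assumed margin).
    """
    total_draw = sum(component_watts.values())
    usable_capacity = psu_watts * (1 - headroom)
    return total_draw, usable_capacity, total_draw <= usable_capacity


total, usable, ok = power_budget(components, psu_watts=500)
print(f"Estimated draw: {total} W, usable capacity: {usable:.0f} W, within budget: {ok}")
```

With these example numbers the estimated draw is higher than the usable capacity, which is the kind of result that would point you toward a larger power supply.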

If the drive is still working, you may want to go to the manufacturer’s website and download a set of their recommended diagnostics. This is a test that you would commonly run overnight, and you would try reading and writing to every single sector on that particular drive. This would give you an idea of whether the problem is isolated to a small area of the drive or if it is a much more widespread problem with that particular storage device.
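To show the idea behind that kind of full-surface test, here is a minimal read-only sketch in Python. It is not a manufacturer diagnostic: the drive is simulated by a hypothetical `fake_read` function, and the 1% cutoff between "isolated" and "widespread" is an assumption chosen only for the example.

```python
def scan_blocks(read_block, total_blocks):
    """Try to read every block once and return the block numbers that failed."""
    bad_blocks = []
    for block in range(total_blocks):
        try:
            read_block(block)
        except OSError:
            bad_blocks.append(block)
    return bad_blocks


def summarize(bad_blocks, total_blocks):
    """Describe whether failures look isolated or widespread (1% cutoff is an assumption)."""
    if not bad_blocks:
        return "no unreadable blocks found"
    fraction = len(bad_blocks) / total_blocks
    spread = "isolated" if fraction < 0.01 else "widespread"
    return f"{len(bad_blocks)} unreadable blocks ({fraction:.2%}), {spread}"


def fake_read(block):
    """Simulated drive for demonstration: blocks 500 through 502 are unreadable."""
    if 500 <= block <= 502:
        raise OSError(f"cannot read block {block}")
    return b"\x00" * 512


bad = scan_blocks(fake_read, total_blocks=1000)
print(summarize(bad, 1000))
```

A real diagnostic also tests writes and works at the device level, so treat this only as a picture of what "checking every sector" means.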

When you boot your system with a bad drive, you may see messages like “drive not recognized” or “boot device not found.” Sometimes there will be lights showing you when access to the drive is being attempted. Or if you’re seeing no lights on your system, it could be that the drive is completely unresponsive. These error messages will often be accompanied by beeps to let you know that an error has occurred. And if you are getting video on your screen, you may also have an error message that has more detail.

If you get a message that an “operating system was not found,” that means that the drive is available and the system is able to access that drive, but it’s looked through the file system on that drive and is not able to find any available bootable operating system. If our message is that we’re not seeing any drive at all and we’re not seeing any access lights on the front of our computer, then it could be that we have a bad cable connection. You’ll want to try reseating the power and the data connectors for your storage drive and see if it’s able to see the drive after a reboot.

Also check the boot configurations in your BIOS. You want to be sure that the boot sequence is what you would expect, and you want to be sure that your BIOS is not configured to look at any removable storage devices to be able to boot from those. If you have a USB flash drive plugged into an interface, your system may be trying to boot from the USB drive instead of your SSD.
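The effect of the boot sequence can be sketched in a few lines of Python. This is a simplified model, not how any particular BIOS is implemented; the device names and the `has_bootable_os` table are hypothetical.

```python
def pick_boot_device(boot_order, has_bootable_os):
    """Return the first device in boot_order that has a bootable OS, or None."""
    for device in boot_order:
        if has_bootable_os.get(device, False):
            return device
    return None


# Hypothetical devices: both the USB flash drive and the SSD contain something bootable.
has_bootable_os = {"usb_flash": True, "ssd": True}

# USB is listed first, so the system boots from it instead of the SSD.
print(pick_boot_device(["usb_flash", "ssd"], has_bootable_os))

# Moving the SSD to the front of the sequence fixes the problem.
print(pick_boot_device(["ssd", "usb_flash"], has_bootable_os))
```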

And if this is a brand new storage drive that you’re using for the very first time, you may want to double check your cables, make sure it’s getting the proper amount of power, and that you’re accessing that drive properly from your system BIOS. If you have any known good cables that you could swap out for the cables you’re currently using, it would be a good way to easily check to see if your problem is simply related to the cable connection instead of the storage drive itself. And if you think the problem is with your motherboard or the SATA interfaces on your system, you might want to remove that drive, move it to a different computer, and see if you see the same problem occurring on that separate system.

As we mentioned earlier, these hard drives are mechanical devices. And because of that, the question is not if your hard drive is going to fail, but when is your hard drive going to fail? If your drive has failed and you need data off of that drive, sending it to a drive recovery service is expensive and time-consuming. If this is an SSD that has failed, you may find that you’re not able to write to that SSD, but you are often able to read the data from that SSD.

In all of these cases, the data that’s on the storage device could be corrupted, or it may simply be unavailable and inaccessible going forward. This might even be difficult, if not impossible, to recover, depending on the type of problem. This is why we always tell you, have a backup of your data. Having a backup can solve all of these problems for you, and you may be able to recover the data in a relatively short period of time.

If you’re working with a server, you probably have more than one drive inside of that server. And very often, we’ve configured those drives as a RAID array. That is, a Redundant Array of Inexpensive Disks. Any one of the drives that happened to be in that array could potentially fail. It could be that the drive itself has failed. Maybe the drive is not receiving enough power, and that’s the reason we can’t communicate to it. Or it may not be connected properly or might have a bad cable, and that might cause a communications issue.

Most RAID arrays will give you detailed information about what’s happening with that array, especially if there’s an error. So check any error messages on the screen or any email notifications that might have been sent from your RAID controller. And very often, there will be audible alarms to let you know that something about these drives is not working as expected.

One of the challenges when troubleshooting these arrays is that there are often many drives to choose from. So when you’re troubleshooting problems with a RAID array, make sure that you are focusing on the correct physical drive that is experiencing these issues.

There’s often extensive information you can gather directly from the RAID controller. It will tell you which volumes are healthy, which volumes may have a drive that has failed. This one says that one or more storage pool SSD caches are degraded. We recommend replacing the failing drives with healthy ones.

Here’s a RAID array with 12 physical drives, and almost all of these drives are identical to each other. So if you are replacing one of the drives in this array, you want to be sure that you really are replacing the bad drive and not accidentally replacing a good drive.

When you’re working with a RAID array that has failed, it’s important to know what type of RAID you’re using. This RAID type will help you understand how to reconstruct this array once you replace the bad drives. If you’re running RAID 0 or striping, you need at least two drives. Since this is RAID 0, there is zero redundancy, so if there is a single drive failure, the entire array is broken and you have lost data. The only way to recover that data is to restore from a backup once you’ve replaced the bad drive.

RAID 1 or mirroring also requires a minimum of two drives. And as long as only one drive has failed, the array will continue to work normally. Your end users probably won’t even know there’s a problem. You simply need to replace the bad drive, and the RAID array will rebuild itself with the information that still exists.

RAID 5 is striping with the equivalent of one drive being parity. To be able to use RAID 5, you need at least three physical storage drives, and all but one of those drives need to be operational. When you lose one drive from a RAID 5 array, your array will still continue to operate. The users will still have access to their data. And when you replace that drive, you can re-synchronize the array to get 100% back up and running.
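If you are wondering how an array can rebuild a missing drive from parity, here is a minimal sketch. It assumes XOR parity, which is the usual approach for single-parity RAID, and it uses two tiny made-up data blocks rather than real stripes. The function name `xor_blocks` is hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks in one stripe, plus the parity block computed from them.
data_a = b"DISK"
data_b = b"DATA"
parity = xor_blocks(data_a, data_b)

# Pretend the drive holding data_b has failed: rebuild it from what is left.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b == data_b)
```

The same XOR works no matter which single block is missing, which is why the array keeps running with one failed drive but not with two.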

RAID 6 is also striping, but it includes the equivalent of two separate parity drives. That means that we need at least four drives to be able to have a RAID 6 array. And since we have the equivalent of two parity drives, we can lose two individual drives in that array and still be up and running. Just as we did with RAID 5, we can replace the bad drives, re-synchronize the array, and have everything back the way it was prior to the failure.

With RAID 10 or RAID 1 plus 0, we are striping and mirroring at the same time. That means that we need at least four drives to be able to use RAID 10. And if we lose one of those striped drives, we’re still up and running because we are also mirroring those striped drives. That means that we could potentially lose one drive from each of those mirrors and still be up and running. So RAID 10 keeps running as long as at least one drive in each mirrored set is still working.
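The RAID levels described here can be summarized with a small Python sketch. It assumes equal-size drives, treats RAID 1 as a single mirror across all drives, and treats RAID 10 as two-way mirrors; the function name `raid_summary` and the 2 TB drive size are hypothetical. The failure count is the worst case, so RAID 10 reports one drive even though it can survive more when the failures land in different mirrors.

```python
def raid_summary(level, drives, size_tb):
    """Return (usable_tb, guaranteed_failures) for equal-size drives.

    guaranteed_failures is how many drives can fail in the worst case
    without losing the array.
    """
    minimums = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}
    if level not in minimums:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimums[level]:
        raise ValueError(f"RAID {level} needs at least {minimums[level]} drives")
    if level == "0":
        return drives * size_tb, 0          # striping only, no redundancy
    if level == "1":
        return size_tb, drives - 1          # every drive holds the same data
    if level == "5":
        return (drives - 1) * size_tb, 1    # one drive's worth of parity
    if level == "6":
        return (drives - 2) * size_tb, 2    # two drives' worth of parity
    if drives % 2:
        raise ValueError("RAID 10 with two-way mirrors needs an even number of drives")
    return (drives // 2) * size_tb, 1       # half the capacity goes to the mirrors


for level in ("0", "1", "5", "6", "10"):
    usable, failures = raid_summary(level, drives=4, size_tb=2)
    print(f"RAID {level}: {usable} TB usable, survives {failures} failed drive(s)")
```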

Most of the drives we use these days have technology inside that keeps track of many different statistics regarding the performance of that drive. We refer to these statistics as SMART. This is the Self-Monitoring, Analysis, and Reporting Technology. You’ll often use third-party utilities to grab these SMART statistics and provide you with details of how that drive is performing.

You could use a third-party tool that provides you with a breakdown of the statistics, similar to what you see on the screen here. This is the raw data directly from the SMART statistics on the drive. You would need to know which of these statistics to look at, such as Power_On_Hours, Power_Cycle_Count, Temperature_Celsius, and other metrics.
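As a sketch of pulling specific values out of that raw data, the Python below parses a small attribute table. It assumes the text is laid out like the ATA attribute table printed by smartmontools (`smartctl -A`), which is one common tool and not necessarily the one on screen. The sample values are made up for illustration, and `parse_smart_table` is a hypothetical helper name.

```python
# Made-up sample in the column layout of an ATA attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21345
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1203
194 Temperature_Celsius     0x0022   036   052   000    Old_age   Always       -       36
"""


def parse_smart_table(text):
    """Return {attribute_name: raw_value} for rows that start with a numeric ID."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attributes[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attributes


stats = parse_smart_table(SAMPLE)
for name in ("Power_On_Hours", "Power_Cycle_Count", "Temperature_Celsius"):
    print(name, stats[name])
```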

You might instead want to use third-party software that analyzes all of this data and then provides you with an analysis of whether the drive is performing well or if there are problems. This third-party software may be able to provide this analysis over time so it can perform daily checks, weekly checks, or monthly checks to give you an idea of how this drive is performing over time.

And if you start to see certain statistics degrading over time, you might be able to replace this drive before it fails completely. Using these SMART statistics can give you a good idea of the warning signs and allow you to prevent the drive from having any type of catastrophic failure.
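Watching a statistic degrade over time can be sketched like this. The history below is invented for the example, and the choice of Reallocated_Sector_Ct and Current_Pending_Sector as the counters to watch is an assumption; counters such as Power_On_Hours always rise and would not belong in this check.

```python
def rising_counters(history):
    """Return the names of counters whose latest reading is higher than the first.

    history maps an attribute name to a list of raw readings, oldest first.
    """
    return sorted(
        name
        for name, readings in history.items()
        if len(readings) >= 2 and readings[-1] > readings[0]
    )


# Invented weekly readings of two error counters.
history = {
    "Reallocated_Sector_Ct": [0, 0, 8, 24],
    "Current_Pending_Sector": [0, 0, 0, 0],
}

for name in rising_counters(history):
    print(f"{name} is increasing: consider a backup and a proactive replacement")
```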

Very often, a RAID array will have this functionality built in, but you can also have third-party software on your laptop or portable system that only has a single drive inside of it. You can receive messages on your screen that tell you how these drives are performing. And many RAID arrays can send email messages or text messages so that you can get this information even if you’re not in the console or looking at the console in the data center. This might be your warning to take an extra backup of the data that’s in that array, and you might also want to proactively replace any drives that are showing any type of errors.

Our storage drives are a significant component inside of our computer, and they have a dramatic impact on the overall performance of our system. That’s because these drives are sending and receiving information from memory. There’s communication across the bus to be able to communicate to that device. If it’s a hard drive, there’s the spinning drive access that adds additional delay. And of course, you could be reading or writing large amounts of data to this storage drive, or only a little bit of data.

All of these different processes combine to create additional slowdowns or delay in accessing your data, and any problem with any of these steps could create significant delay. There might also be times when we’d like to compare the performance of one storage drive versus another, so you might want to perform some performance checks to see if you could tell which drive would really be the better choice.

One of the metrics that we use to provide that type of feedback is the number of Input/Output Operations Per Second, or IOPS. This is a very broad view of performance, and it gives you an idea of the overall capabilities of any particular storage device.

To give you an idea of just how important these values are, let’s compare a hard drive with an SSD. The number of input/output operations per second of a hard drive is approximately 200. That’s the maximum number of operations you would see for one of these spinning drives. But if you replace that hard drive with an SSD, a Solid State Drive, your IOPS goes up to approximately 1 million. That is a significant difference compared with the spinning hard drive. And very often, we tell people to simply replace their hard drive with an SSD, and they’ve immediately upgraded their system.
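To put those two numbers side by side, here is a short arithmetic sketch in Python. The IOPS figures are the approximate values mentioned above; the workload of one million operations and the 4 KiB block size are assumptions chosen for the example.

```python
def seconds_for(operations, iops):
    """Time needed to complete a number of I/O operations at a given IOPS rate."""
    return operations / iops


def throughput_mib_per_s(iops, block_bytes=4096):
    """Approximate throughput if every operation moves one block (4 KiB assumed)."""
    return iops * block_bytes / (1024 * 1024)


HDD_IOPS = 200
SSD_IOPS = 1_000_000

for name, iops in (("hard drive", HDD_IOPS), ("SSD", SSD_IOPS)):
    print(
        f"{name}: {seconds_for(1_000_000, iops):,.0f} s for one million operations, "
        f"about {throughput_mib_per_s(iops):,.2f} MiB/s at 4 KiB per operation"
    )
```

Under these assumptions, a workload that finishes in about a second on the SSD takes well over an hour on the hard drive.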

If you boot your system and you notice that certain drives are no longer accessible from your operating system, it may be that those have been disabled in your BIOS or there was some type of error with your BIOS. You may want to check your BIOS logs or the BIOS configuration to see if your system is really seeing those drives as being accessible.

If this is an internal drive, it may just be a loose cable or something that may have been disconnected, so simply reseating those cables will usually solve that problem. If it’s an external drive, then you want to check all the cables going to that particular drive, and you may want to check the drive is receiving power through the power connection.

In some cases, a user may boot their computer, but the drive that’s missing is not a physical drive on their computer, but a drive that they would normally map across the network. This is done usually during a login process or a login script, or it may be something the user can do manually. You might want to look at your operating system and see what the status is for mapped drives across the network.

This is the map network drive for Windows where you can select what drive letter you would like to use, the folder, which in this case is pointing to a server called gate room, and the share on that server is called mission reports. And you might even want to make sure that the check mark says that it will Reconnect at sign-in so the next time this person boots their machine, this drive letter will automatically connect across the network to the gate room server and the mission reports share.
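The same mapping can also be made from the Windows command line with `net use`. The Python sketch below only builds the command and does not run it. The drive letter `M:` and the spellings `gateroom` and `mission_reports` are guesses made for illustration, since the exact names on screen are not given here.

```python
def unc_path(server, share):
    """Build a UNC path of the form \\\\server\\share."""
    return "\\\\{}\\{}".format(server, share)


def net_use_args(drive_letter, server, share, reconnect=True):
    """Build the argument list for the Windows 'net use' command (not executed here)."""
    persistent = "/persistent:yes" if reconnect else "/persistent:no"
    return ["net", "use", drive_letter, unc_path(server, share), persistent]


# Hypothetical drive letter and name spellings.
args = net_use_args("M:", "gateroom", "mission_reports")
print(" ".join(args))
```

The `/persistent:yes` switch plays the same role as the Reconnect at sign-in check box.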

We’ve been talking during this video about the drives that go bad, but we could have cases where the drive controller is going bad. We often see this if we’re using an external drive controller, which we often do when we’re configuring a RAID array. When you boot your system, you might see messages like this one. This one’s booting up and telling you what different function keys you can press. It tells you the type of processor you’re running, how much memory is in the system. There is a boot agent for the Ethernet device, because you could boot across the network.

And then we come to the section that tells us about the drive controller. It tells us that we have a Dell PERC H200 6-gigabit SAS HBA BIOS, and that is related to the drive controller in this Dell server. This says there’s an integrated RAID exception detected, and that a volume is currently in the state of INACTIVE and OPTIMAL. So you would need to go into the configuration utility to provide more investigation about why that controller is having a problem or why the drive connected to the controller is having a problem.