An experience … Upgrading Storwize v7000 firmware from 6.1.0.7 to 6.2.0.2
As part of a fully ‘destructive’ maintenance weekend we recently undertook, we had to upgrade a pair of fully populated IBM Storwize v7000s (IBM2076). That’s 20 enclosures with 880 mdisks.
Their current firmware was at 6.1.0.7 and we were upgrading to 6.2.0.2.
Before performing any tasks we took down the entire vSphere infrastructure and any physical hardware attached to the v7000s. This was just to ensure that no I/O was hitting the storage.
The first part of the upgrade consists of running an IBM ‘upgrade test utility’ to ensure that the upgrade can go ahead. Full instructions here (IBM portal logon required).
The tests were run as instructed and there were no errors or warnings encountered. So we then proceeded to upgrade the firmware as instructed by the wizard.
The file uploads to the first node and the upgrade starts. You may get a brief disconnect if you are logged on to the node it wants to upgrade, but it is only short and you will be moved to the other node. The firmware upgrade then starts in anger. Rough estimates are 20 minutes for the upgrade and 30 minutes for stabilization.
Some twenty-odd minutes later the first node completed and was now at revision 6.2.0.2. The second node then started cooking the firmware update.
Now, normally this goes through exactly as it did on the first node. We have run a full upgrade on another pair of identical Storwize v7000s and all went as planned. In this instance, it went nowhere. And it went there very, very slowly!
We basically waited around for several hours and eventually the interface returned an “upgrade stalled” message and told us we would need to revert to the previous revision. At this point we agreed IBM support would need to give us some input. The turnaround for the call was extremely fast, I’m pleased to say, and after various attempts at troubleshooting and log generation we managed to get the v7000 back to its original 7am state. Better known as square one. This was some 4 hours after we originally started the upgrade, I may add.
So anyway, you’d probably like to know what happened to the second node and what we did to get ourselves back to sq.1.
Well in all honesty we don’t know what happened to the second node.
- Possible software corruption – the v7000 service page (reachable by going to http:///service) showed the second node coming and going from the GUI and the node ports and speeds showing erroneous information.
- Possible hardware failure – constant 835 errors (PCIe) on the first node, but this could be just the fact that the second node kept coming and going.
To cut a long story short, we eventually hot replaced the second node canister and introduced a new canister. At this point support asked us to visit the service page and see what was happening.
The newly introduced canister showed up in the service GUI, and the following took place:
- The new node appeared with a new generic name
- Its status was set to “Candidate”
- We were then advised to insert a v7000 setup USB stick into the newly replaced node, bearing in mind there were no svctasks on the stick.
- This then prompted the second node to rebuild itself to firmware revision 6.1.0.7. This took some major time (40 mins+).
- Then the first node downgraded from 6.2.0.2 to 6.1.0.7, again, taking some time to complete.
We were quite surprised to see how the control canister managed to do this. But apparently there is an SSD inside the control canister, so we assume the firmware images and other metadata are kept on there.
Once the firmware had reverted we were able to get back into the GUI, and all our Pools, VDisks and mappings were intact. If it proved nothing else, it proved how resilient the v7000 code can be in getting you back up and operating again. We were then advised to re-run the upgrade process. We decided at this point to upgrade both v7000s at the same time.
An hour later both v7000s were successfully upgraded to 6.2.0.2. Not bad for a 10-hour day!
NOTE: After the firmware upgrade the control enclosure fans may run at 100%. This is a known bug, and the controller can be shut down via the GUI and power cycled to clear it. This known bug isn’t documented, I’d like to add.
Looking forward now to the next upgrade, which will see us upgrade some older SVC hardware (in place) to the latest code revision.
Useful Storwize troubleshooting commands (used at an SSH shell connected to the v7000 cluster IP address):
svcinfo lssoftwareupgradestatus - Useful for checking what the status of a software upgrade is.
More to come..
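If you would rather not keep retyping that status command while a node is cooking, it can be wrapped in a small polling loop. What follows is only a rough sketch and not something we ran on the day: the function names (parse_status, watch_upgrade) and the admin@cluster-ip placeholder are made up for illustration, and it assumes the command prints a "status" header line followed by a single status word such as inactive, upgrading or stalled. Check the output on your own code level before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read the command output on stdin and print the
# first word of the last non-empty line. Assumes a "status" header line
# followed by the status value.
parse_status() {
    awk 'NF { last = $1 } END { print last }'
}

# Hypothetical polling loop.
#   $1 = SSH target for the cluster (placeholder, e.g. admin@cluster-ip)
#   $2 = seconds between checks (defaults to 60)
# The status words matched below are assumptions about the 6.x output.
watch_upgrade() {
    cluster=$1
    interval=${2:-60}
    while :; do
        status=$(ssh "$cluster" svcinfo lssoftwareupgradestatus | parse_status)
        echo "$(date '+%H:%M:%S') ${status:-no-response}"
        case $status in
            inactive) return 0 ;;   # nothing in progress any more
            stalled*) return 1 ;;   # time to ring support
            "")       return 2 ;;   # SSH failed or returned nothing
        esac
        sleep "$interval"
    done
}
```

Source the file and run something like `watch_upgrade admin@cluster-ip 120` from a workstation with key-based SSH access to the cluster. A non-zero return means the upgrade stalled or the cluster stopped answering.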
Useful links (IBM portal logon required):
Storwize v7000 6.2.0.2 firmware release notes
Storwize v7000 6.2.0.2 upgrade test utility (29 MB)
Storwize v7000 6.2.0.2 firmware (346 MB)
Troubleshooting: Getting node canister and cluster information using a USB key