Multiple-Point Hardware and Software Failures in Two Separate, Unconnected Computers at the Same Time...

A few days ago, I was restoring a roughly 100-year-old calculus book on one of my Linux-based computers. Meanwhile, my other computer, running Microsoft Windows Vista, was serving as a broadcast TV receiver with its USB HDTV tuner. It was afternoon, and the weather in Los Angeles was summer-like for November, with clear skies and temperatures of 90 degrees Fahrenheit. All of a sudden, my Linux-based computer halted in the middle of processing it had performed hundreds of times before on hotter days. It would not restart, and the entire boot block of the disk seemed to have been garbled. This did not seem feasible at all, so I decided to turn its power off for a while. It came back up after a while, and everything looked normal. Then it did the same thing again. I decided to open its cover and check its multiple fans, as nothing else could have gone wrong.

I then noticed that the computer running Microsoft Windows Vista, which had been receiving the broadcast TV, was displaying a freshly booted log-in screen. It had "blue-screened" while I was working on the other computer across the large room. This again did not seem at all feasible, as there was no connection whatsoever between the two computers; even their AC power circuits were different. Furthermore, this computer had the most extensive air-cooling system I had ever built, designed to keep it working at indoor temperatures of 107 degrees Fahrenheit. Anyway, I logged back in and restarted the broadcast TV reception. Sure enough, after a while it "blue-screened" once more...

I went back to the Linux-based computer and found all of its fans operating, but its disk drives were somewhat hotter. The puzzle was that on hotter days, the same computer had cooler disk drives, with nothing else different. I concluded that the 80 mm fan mounted at the front of the case was starting up fine. Its blade side was clamped against the perforated part of the steel case, which served as the fan grill. As time passed, however, the spring-loaded rotating hub was slowly drawn toward the perforated steel case by two mechanisms. The first was the partial vacuum formed by the suction of the fan blades. The second was the magnetic attraction between the rotating hub, with its electromagnets, and the partially magnetized perforated steel case. The first effect was always there, so it was not the real cause, but once something else came along, it greatly helped the second. As magnetic attraction slowly drew the hub toward the perforated steel case, the holes in the case induced a huge air drag on the hub blades. There was no bypass to supply the extra air needed to relieve the partial vacuum. In addition, the rotating hub with its electromagnets was now very close to the conducting metal surface. The eddy currents induced in the metal by the moving electromagnets added even more drag, bringing the hub nearly to a halt. The disk drive electronics were heating up and causing DMA access faults, which in turn caused the Linux kernel to panic and halt.

Well, this was nearly unbelievable, but true... I had not brought any magnets into the room, and I still do not know how the computer case got magnetized; it has been working at the same location for years. The solution was to move the fan slightly away from the perforated steel case so that some air could come in through the gaps around the sides of the fan (providing a bypass), reducing the partial vacuum in front of the fan. This kept the rotating hub far enough away to prevent the massive induced eddy-current drag from slowing the fan to a halt. The computer now works perfectly with the very same fan, as it has for years.

The real solution is to saw away the perforated part of the steel case in front of the fan and replace it with a better fan grill. The best fan grill material I have found is the finely perforated, thin, black aluminum sheet usually used for car audio speaker grills. In fact, I use it in my Microsoft Windows Vista based computer. The fans are quieter, with more airflow. It also keeps dust out, and the collected dust brushes off easily.

The next problem was the computer running Microsoft Windows Vista halting with a blue screen. Its fans could not be the cause: it already had the best improvements I could make, including externally powered fans that did not load the computer's power supply, and all of the fans were working well. In the meantime, Microsoft's November 2014 updates for Windows Vista came out, and as usual I told the computer to download and install them. Sure enough, the computer "blue-screened" again in the middle of the update procedure.

That was somewhat too much, but there was nothing else I could do other than debug it. I had not changed anything in the computer, and its power supply, which I had completely rebuilt internally a few years ago, was working perfectly. Whatever was causing the problem was not in the hardware. It was not in the November 2014 software updates either, as the computer had "blue-screened" before those were announced. I brought the computer back up after several disk and other software checks. After the updates completed, I gingerly turned the network modem on. I then used the Windows Vista problem reporting system to send Microsoft the reports on the six failures (three "blue-screen" failures and three "Anti-malware Executable" failures), with all of the requested details. Within minutes, Microsoft came back with a diagnosis that the USB driver code in the system had a serious bug. I had not changed this code in years. The diagnosis suggested that I use the "Microsoft Fix-It" for this problem and pointed to a download link. I downloaded and ran it, and the "blue-screen" problem just went away, as if it had never been there...

-- Yekta

November 12th, 2014 7:36pm

The computer running Microsoft Windows Vista stayed operational for about 12 hours, then went completely berserk. It jumped to the "blue screen" at every opportunity, citing several different reasons. During those 12 hours, I managed to back up the boot disk at high compression (very compute-intensive) using Acronis True Image. However, the software stopped at the "verify" step, claiming that the generated archive was inconsistent. The machine then jumped to the "blue screen" for yet another reason. After a few attempts to stabilize the machine, I was able to boot back up and mount the backup image, which seemed to be fine. Then the machine jumped to the "blue screen" again. I booted it back up and immediately shut it down using the regular shutdown procedure, to prevent damage to the disk.
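A backup program's "verify" step generally re-reads the archive it has just written and checks it against something recorded during the write. How Acronis True Image does this internally is not described here. The following is only a minimal sketch of the general idea, assuming a SHA-256 digest computed during the copy; the function names are hypothetical. On a machine with failing hardware, a bit flipped on the way to the disk would make the re-read digest disagree, which would appear as an inconsistency of the kind reported above.

```python
import hashlib

CHUNK = 1 << 20  # read in 1 MiB chunks so large disk images need not fit in memory


def copy_and_hash(src_path, dst_path):
    """Copy src to dst and return the SHA-256 of the bytes as they were read from src."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK), b""):
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


def verify(dst_path, expected_hex):
    """Re-read the written file and compare it with the digest recorded during the copy."""
    digest = hashlib.sha256()
    with open(dst_path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex
```

For a healthy copy, `verify(dst, copy_and_hash(src, dst))` returns `True`. If even one byte of the written file changes, it returns `False`.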

I then took the machine apart to see what was happening. I removed all of the PCI and AGP boards, since the motherboard has an on-board backup display chip, backup network interface chip, and backup USB interface chip. I also temporarily replaced the power supply with an older but known-good test power supply. I brought the machine back up and, sure enough, it jumped to the "blue screen" again after a few minutes. I booted it back into the BIOS and disabled the USB interface. The machine came up, but it jumped to the "blue screen" once more, and this time the USB interface could not be blamed.

I sat down in front of the machine, looking at its motherboard stripped bare in its case. Right below the CPU are two electrolytic capacitors that stabilize the 3.3 V supply voltage right at the CPU. They did not look right. The other electrolytic capacitors have very flat tops, with indentations at 120 degrees dividing each top into three segments. These two had visibly curved tops, bulging up and out. When I shone light on them, it reflected from only one of the three segments at a time, indicating that the segments were at different angles. The other capacitors reflected light from all segments at once. The two pictures below show these capacitors and their surroundings. In the pictures, the actual down direction is toward the left:

It was obvious that these capacitors had lost their electrolyte through evaporation and had effectively become "open-circuit" devices, with almost no capacitance left. The factory deliberately makes the indentations on top to prevent an explosive rupture of the casing when the electrolyte evaporates. In most cases, as shown, the tops simply deform along the indentations.

Whether these capacitors were the cause of what I was experiencing was now another question.

To test this, I booted the "bare" motherboard, with only the memory and the disks, into the BIOS and slowed down everything I could (the CPU, the memory access, etc.). A lower CPU clock and memory access rate disturb the 3.3 V supply at the CPU chip less, so the machine should stay up longer this way. The machine came up and stayed up. I was able to go out on the network, read my e-mail, and browse the news with no problems. A few more tests were needed. I reinserted the boards I had pulled out one at a time, booting the machine after each addition to see what it did. This should make the machine less and less reliable as boards were added, and it did. The fully populated machine stayed up for only 15 minutes or so, even with the CPU and memory slowed down. Finally, I put the original power supply back in. This did not change anything, indicating that the power supply had been fine all along. I verified this while the machine was running: all voltages were right at their specified levels with the machine fully loaded.
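The power supply check in the last step amounts to comparing each measured rail against its nominal value and tolerance. Here is a minimal sketch, assuming the ATX12V design-guide tolerances: ±5% on +3.3 V, +5 V, +12 V, and +5 V standby, and ±10% on -12 V. The readings used in the usage note are hypothetical, not the ones taken above.

```python
# Nominal voltage and fractional tolerance per rail (assumed ATX12V design-guide values).
RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
    "+5VSB": (5.0, 0.05),
}


def check_rails(measured):
    """Return {rail: (volts, within_tolerance)} for each measured rail."""
    report = {}
    for rail, volts in measured.items():
        nominal, tol = RAILS[rail]
        # The allowed deviation is a fraction of the nominal magnitude, in either direction.
        report[rail] = (volts, abs(volts - nominal) <= abs(nominal) * tol)
    return report
```

For example, `check_rails({"+3.3V": 3.31, "+12V": 11.2})` passes the 3.3 V reading but flags the 12 V one, which is 0.8 V low against a 0.6 V allowance.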

The capacitors are rated at 2,200 microfarads, 6.3 V maximum, and 105 degrees Celsius, and they cost about $1 each. I have to order them, though...

My other, Linux-based machine, whose magnetically trapped fan I fixed, works fine as usual.

-- Yekta

November 14th, 2014 11:44pm

I ordered the capacitors on Friday, and they arrived on Monday, November 17, 2014. To remove the motherboard from the machine, I first removed all of the PCI and AGP boards, the drive and fan connectors, and the computer power supply. The motherboard then simply unbolted from the case and came out with the CPU fan assembly still attached.

I wrapped the solder side of the motherboard in aluminum foil and set up a workplace with aluminum foil under the motherboard, keeping myself electrically well grounded. Here came another surprise: there were four more capacitors of the same kind just behind the CPU fan assembly. Their tops were also deformed, and one of them was leaking electrolyte from the top. Luckily, I had ordered more than two capacitors to get the quantity discount and the lower shipping rate, since I use them in other circuits I occasionally build.

Technically, all one needed to do was unsolder the six old capacitors from the motherboard and solder six new ones in their place with the correct polarity. However, these capacitors span the 3.3 V power plane and the ground plane of the multi-layer motherboard, so it is nearly impossible to unsolder them with regular fine-tip soldering irons. The thick copper of the power and ground planes carries the soldering iron's heat away very fast, preventing the solder from melting quickly. Continuing to apply heat at this point will simply burn the internal insulating epoxy layers and cause shorts inside the motherboard, which are impossible to fix in any reasonable amount of time.

The only reasonable way to remove these capacitors was to dismantle them from the top, leaving their already soldered leads in place. The new capacitors were then tack-soldered to these stubs using lead-free hard solder. However, the CPU fan assembly and the CPU itself had to be removed from the board first to reach these capacitors.

To dismantle the capacitors from the top, I first drilled small holes in their tops, at the intersections of the indentations, using the tip of a hobbyist's knife. I then used needle-nose pliers to peel the triangular sections of aluminum back from the center of each top to its outer edge. Next, I removed the plastic sleeves covering the outside of the capacitors. I first scored each sleeve from bottom to top with the tip of the hobbyist's knife, then peeled it off starting at the cut. The following step was to cut the aluminum cans from top to bottom, using the hobbyist's knife like a can opener. A saw-like tool could not be used here, as saws generate very fine metal chips that are very hard to remove and are certain to cause shorts on a densely populated motherboard. The cans were then peeled off the rest of the capacitors with needle-nose pliers, starting from the top at the cuts. This revealed the spiral-wound metal and paper layers inside.

The wound layers were peeled off one by one by cutting into them from top to bottom, leaving only the two aluminum electrodes crimped and soldered to the capacitor leads. The picture below shows the six capacitors, with one of them dismantled (left) and with all of them dismantled (right):

The black disks below the aluminum electrodes are the rubber plugs sealing the bottoms of the capacitors. The rubber plugs were cut in half with the hobbyist's knife and removed with the needle-nose pliers. It was not possible to solder to the aluminum electrodes, so these were trimmed where they were crimped onto the capacitor leads. This left only the stubs of the leads soldered to the motherboard.

The new capacitors, with suitably trimmed leads, were then soldered to these stubs with the correct polarity using lead-free hard solder. The capacitors were lightly bonded together with a flexible glue to keep them from moving. The picture below shows the new capacitors as installed on the motherboard:

I then put everything back together and turned the computer on. On the boot screen, the BIOS complained that the CPU had been out of its socket and needed to be reset. I set the BIOS parameters back to their original values. The computer came up and worked without any problems. I typed this message on my newly repaired computer running Microsoft Windows Vista.

By the way, the manufacturing date on the motherboard is 09/12/2002 and the CPU is a Socket-478, 2.4 GHz, Intel Pentium-4.

-- Yekta

November 19th, 2014 10:05am

The two computers mentioned above have been working very well with no faults. I regularly back up the operating system and all other files to external, USB-connected disk drives that are not left running on the computer. These drives are powered up only while the backup or restore programs are running.

A few days ago, I decided to back up the operating system and all other files on my Microsoft Windows Vista computer. As I mentioned above, the previous backup of this computer had failed with a verify error while the computer was running with bad capacitors on its motherboard. Now that the capacitors were fixed, it was time to back up the system.

I connected the same Seagate ST31000524AS (date code 12063), a 1 TB SATA disk drive, to the computer through an external SATA-to-USB interface. The computer took a long time to bring up the drive's directory. I was able to delete an old backup to make space for a new one. Then the drive became inaccessible to the backup software. I disconnected the Seagate drive and its USB cable from the computer, connected another USB-interfaced disk drive, and performed the backup on it. The backup ran flawlessly. I then disconnected the drive holding the good backup from the computer in the normal way.

I took the Seagate drive and its cable to another computer running another version of Microsoft Windows Vista. It also took a long time to bring up the drive's directory, and none of the files could be read from it.

I first examined the high-speed USB A/B cable. The middle pins on the A connector seemed to be recessed a bit too far, and the same was true on the B side. The picture below shows these connectors:

I decided not to use this cable anymore, so I cut the connectors off and saved the cable itself. I had another, shorter cable with better-looking connectors. The new cable made no difference: it was still not possible to read any of the files from the disk under Microsoft Windows Vista.

I took the Seagate drive to my Linux computer. It mounted the drive with no difficulty, and I was able to transfer the files I needed off the drive, leaving only some old backup files. Encouraged by this, I took the Seagate drive back to the Windows Vista computer. It again took some time to bring up the drive's directory, but it responded instantly to the "Format" command with the "Quick format" option and quick-formatted the drive. The drive directory then came up, and the empty directory could be browsed with ease.

I then decided to reformat the drive without the "Quick format" option, in case it had bad sectors that needed to be remapped. The machine accepted the command, but again took its time. I could hear the disk heads moving as it presumably formatted the drive. It was late at night, so I went to sleep and let the machine work on the drive...

Some 18 hours later, there was still no sign that the format operation would finish in any reasonable time. After several tries, I managed to convince the machine to cancel the format operation. It did, but it left behind an "uninitialized" disk. Every attempt to initialize the disk failed with a "Drive not accessible" error.

I took the drive back to the Linux machine. It recognized the SATA-to-USB interface board correctly, but as far as the mounting software was concerned, the disk attached to the board had completely disappeared.

I took the drive enclosure apart and checked all of the cables and power connections; they were all fine. I separated the drive from the USB interface board and installed the Seagate SATA disk in the Linux computer on one of its internal SATA ports, powered from the computer's internal supply. The computer's BIOS came up very sluggishly and managed to read the correct disk drive parameters from the Seagate drive, but that was it: the machine would not boot with this disk drive in it.

At this point, the disk drive seems to be damaged, but it is not clear how that happened. I am looking into it...

This event clearly shows why two separate external 1/2 TB drives that are never used at the same time are much better than a single external 1 TB drive: should one drive become inaccessible, the other is readily available...
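The two-drive scheme only helps if backups alternate between the drives, so that the previous backup is always on the drive not currently in use. Here is a minimal sketch of such a rotation, assuming monthly backups (the thread compares each backup with the previous month's) and two hypothetical drive labels and start date.

```python
from datetime import date

# Hypothetical labels for the two external 1/2 TB drives.
DRIVES = ("Backup-A", "Backup-B")


def drive_for(backup_date, first_backup=date(2014, 11, 1)):
    """Alternate drives by calendar month, so last month's backup is always on the other drive."""
    months = (backup_date.year - first_backup.year) * 12 + (backup_date.month - first_backup.month)
    return DRIVES[months % 2]
```

With these assumptions, a November 2014 backup goes to "Backup-A", December to "Backup-B", and January 2015 back to "Backup-A". If the drive in use dies mid-backup, as happened above, last month's copy remains untouched on the other drive.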

-- Yekta

November 23rd, 2014 6:58am

I worked on the Seagate ST3 1 TB hard disk drive some more. I mounted the bare drive in other machines running Microsoft Windows Vista. As far as one could tell from the noises it made, the drive spun up and initialized itself. However, it never became "ready". In every case, the machine's BIOS failed to find the drive and booted up without it.

I downloaded the product manual from the Seagate web page. The manual describes the disk and its power and SATA interfaces in detail, but it does not mention the extra four-pin connector right next to the SATA data and power connectors on the drive. Nor does it list any failure modes to aid in debugging.

I searched the network for problems associated with this particular drive and found many. These drives appear to have a large number of failure modes. One of them is corruption of the drive's internal firmware (microcode). When this happens, the disk powers up and initializes the head positions, but, depending on the level of corruption, it never becomes ready. There is no easy way to load new microcode into the drive unless the machine's BIOS finds the drive. Seagate's "SeaTools" software can load new microcode only when the drive has been identified by the BIOS and mounted; otherwise, it simply does not see the drive.

Now I have a disk drive whose hardware almost certainly works, but whose on-board microcode no longer knows how to run it. Seagate offers data recovery services, but that is exactly the opposite of what I need. I do not care about the data, as I have already saved what I needed; I need a working drive, and it is locked up by its own software. Obviously, it is possible to debug the drive, but Seagate does not make the necessary information available. This is somewhat ironic: the drive locks up with one's data intact on it. To get the data back, one then has to send the drive out, since there is no way to unlock it by repairing the microcode.

It seems more work is needed to get the drive back. Seagate does offer a DOS version of SeaTools that boots from a CD, and I have not tried that one yet.

While testing the Seagate drive, I noticed that my other Microsoft Windows Vista computer had somehow lost one of its DVD drives as well. The drive is hard-mounted in the computer with a completely tied-down internal wiring harness, and it seemed to be connected properly. However, its SATA data cable looked a bit odd, hanging off the drive at a slight angle. I thought it had come loose, but it had not: the plug on the cable would no longer go into the connector on the drive. That seemed impossible. The drive had been working, and I had mounted the Seagate drive at the bottom of the case, on a spare port with its own loose data cable, as far from the DVD drive as possible.

I looked at the data cable and saw that its plug had broken off the keyed plastic tab inside the SATA data connector on the DVD drive. This SATA cable was extremely rigid, and it became even more rigid as air temperatures dropped during the fall and winter. It pulled on the keyed tab until the tab broke off. The tab was stuck inside the cable's plug.

I pulled the tab out of the plug and removed the very rigid cable. I bonded the tab back in place with cyanoacrylate glue (super glue), after treating it with the primer this glue needs to bond plastic. I put in another, softer SATA data cable and tied its end to the DVD drive case before plugging it into the drive connector. It worked. The picture below shows the connector before and after the repair, along with the very rigid SATA cable I no longer use...

-- Yekta


  • Edited by Yekta_Gursel Wednesday, November 26, 2014 2:19 AM Fixed spelling.
November 26th, 2014 2:13am

I worked on the seemingly broken Seagate disk drive some more. I made a bootable SeaTools version 2.32 CD-ROM and connected the disk to one of my computers' power supply and to an unused SATA port. I set that computer's BIOS to boot from the SeaTools CD-ROM. I also attached the microphone of a voice recorder to one of the earpieces of an engine stethoscope with a long metal probe. I then pressed the tip of the probe into a mounting screw hole on the circuit-board side of the disk drive. To locate this hole, look down at the drive's circuit board with the power and SATA connectors facing away from you; the hole is near the middle of the left edge. I listened through headphones plugged into the voice recorder's earphone output.

I started the recorder and then powered up the computer. After a long time, the BIOS located the drive and read its parameters. The drive motor spin-up and head initialization were clearly audible, as were the subsequent head movements while the BIOS tried to get the drive parameters. After another long period, the computer booted from the SeaTools CD-ROM and executed the SATA "InitDisk" command on the disks it had found. "InitDisk" got stuck on the Seagate drive and printed "Error in reading the partition table on 'The Seagate Drive Channel' sector 0" four times. It then gave up on the drive and came up with only the other disks recognized.
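The SeaTools message refers to the partition table in sector 0. To illustrate what normally lives there, here is a minimal sketch that parses a classic MBR (not GPT): four 16-byte partition entries starting at byte 446, followed by the 0x55 0xAA boot signature at bytes 510 and 511. This is not a reproduction of what SeaTools does internally, and the function name is hypothetical.

```python
import struct

SECTOR_SIZE = 512


def describe_sector0(sector):
    """Parse a classic MBR; return [(slot, type, first_lba, num_sectors)] or None if unsigned."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("short read: sector 0 could not be read in full")
    if sector[510:512] != b"\x55\xAA":
        return None  # no valid MBR boot signature
    partitions = []
    for i in range(4):
        entry = sector[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = entry[4]  # partition type byte; 0 means the slot is unused
        # Little-endian 32-bit starting LBA and sector count at entry offsets 8 and 12.
        first_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0:
            partitions.append((i + 1, ptype, first_lba, num_sectors))
    return partitions
```

On Linux, the input could come from reading the first 512 bytes of the raw device as root (the device name would be machine-specific). A drive in the state described above would presumably fail at that read, before any parsing happens.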

The entire process took about 18 minutes and 47 seconds, from power-on to the moment the SeaTools window came up with only the other disks recognized. The recording file is 4.39 megabytes. Unfortunately, I know of no way to post it on this web page, due to the file type and size limitations.

As is clearly audible on the recording, the disk heads and motor seemed to be functioning normally. The drive had somehow lost its sector 0 reference and kept trying to find it. I do not know whether this is due to a microcode problem or to damage on the track that carries sector 0. I will work on it some more...

-- Yekta

December 8th, 2014 11:08pm

I have temporarily suspended work on the broken Seagate 1 TB disk drive. The drive electronics are too intermittent to debug with normal computer hardware: it takes an excessive number of reboots for the computer's BIOS even to recognize the drive. Special hardware is needed to look into the fault in the drive electronics.

In the meantime, my computer running Microsoft Windows Vista Home Premium did something unusual today, all by itself. This is the same computer from the messages above, whose electrolytic capacitors on the motherboard's 3.3 V supply I replaced. Right after installing the latest Microsoft Security Essentials definitions update, it freed up 34.1 GB on the system disk in less than a second. I had achieved this only once before, and then only by defragmenting the drive and disk-checking it several times. Before the update, the installed definitions had been only 2 days behind the latest version.

The system then became unstable. It kept spinning up CD/DVD drives that were not being accessed. I was able to back up the main drive with its two partitions. However, the backup program (Acronis True Image WD) got stuck while verifying the backup, announcing a verification completion time of 49,710 days. The program would not stop or exit, and the system refused to let me disconnect the USB drive.
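The figure of 49,710 days is itself a clue: 2^32 seconds is about 49,710.3 days. That is what one would see if a small negative remaining-time estimate were stored in an unsigned 32-bit seconds counter and wrapped around. Whether Acronis True Image WD actually computes its estimate this way is an assumption. The sketch below only checks the arithmetic, using a hypothetical helper to mimic the 32-bit wraparound.

```python
SECONDS_PER_DAY = 86_400


def as_uint32(value):
    """Reinterpret an int the way a 32-bit unsigned counter would store it (wrap modulo 2**32)."""
    return value % 2**32


# A remaining-time estimate of -1 second, forced into an unsigned 32-bit field:
wrapped = as_uint32(-1)
print(wrapped)                     # 4294967295
print(wrapped // SECONDS_PER_DAY)  # 49710
```
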

After verifying that no drive access was actively taking place, I managed to turn the USB drive off and disconnect it from the system. I shut the computer down using the "Shutdown" option on the screen that appears after the CTRL-ALT-DEL key sequence. It took a very long time to shut down. I rebooted the computer, and it got stuck during the reboot, spinning the CD/DVD drives on the second IDE channel. I interrupted the reboot with the hardware reset switch and brought the system up in Windows Safe Mode. The system was still unstable in this mode; I could not even close the Help window that opens automatically in Safe Mode. I was able to shut the system down using the "Shutdown" option from the menu, and I pulled all of the CDs and DVDs out of the drives. I then brought the system back up normally.

It came back up. I then scheduled disk checks on the two partitions of its boot drive and shut the system down again. When I rebooted it, it checked the disks and came back up several hours later. The system was now stable. I ran Acronis True Image WD to complete the backup verification, and it finished without error. The system now had 34.1 GB of extra free space with no loss of user files, and it was running much, much faster. At the same compression level, the compressed backup file was 20 GB smaller than last month's backup, even though more user files had been added in the meantime.

I had noticed in the past that Microsoft Windows Vista Home Premium allocated and held on to large amounts of storage (more than 27 GB) during network transactions. I did not know why it was doing so, and I still do not know for sure...

-- Yekta

May 9th, 2015 1:57am

