Maui Forums
[Solved] - Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - Printable Version



RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - rocky7x - 20th February 2017

Well I've googled a bit as well for you and found a few possibilities:
1. it can be a power supply issue: your power supply may not be giving the CPU enough voltage, so it might be that your power supply is near end of life or you have too many peripherals attached for the power supply to handle
2. your motherboard may not be giving the CPU enough voltage, so your motherboard may be near end of life
3. since I think you have an Intel CPU, have you installed a package called intel-microcode? If not, please try it; it will not do any harm, and it lets the system handle Intel CPUs properly
4. since Ubuntu 16.04, many people have started experiencing the same issue as you, there is an active bug opened with Ubuntu, last post was 22 hours ago - look here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1530405
5. it seems that people with Nvidia cards are having this issue, so even though you have switched to the Nouveau driver, I would experiment again with reverting to the proprietary Nvidia driver and checking for the lockups
6. there is a parameter that you can set that may help to avoid the lockups - try to set it like this:
Code:
sudo echo 120 > /proc/sys/kernel/watchdog_thresh

By default that parameter is set to 10 seconds. If you prolong it to let's say 120 seconds, it will maybe help. In any case, that behavior is definitely NOT normal and something IS going on there. And in any case, this is DEFINITELY not a Maui thing, I think that you would have similar issues with any other Ubuntu 16.04 derivative. You have to understand that Maui is ONLY the Plasma desktop environment, nothing else - everything else, below that, like graphics, kernels, hardware support etc. is purely related to the Ubuntu and stuff provided by Ubuntu, so whether it's Maui, Neon, Mint or anything else - all is equally impacted. So when you search, please search for stuff related to Ubuntu 16.04.
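
A note on point 3: before installing, it is easy to check whether intel-microcode is already present. The following is only a sketch; the helper name pkg_installed is made up for illustration, and it assumes a Debian/Ubuntu system where dpkg-query is available.

```shell
# pkg_installed is a hypothetical helper name, not part of any package.
# Returns 0 only if dpkg reports the package as fully installed.
pkg_installed() {
    st=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$st" = "install ok installed" ]
}
```

Usage would be something like `pkg_installed intel-microcode || sudo apt-get install intel-microcode`. After the next reboot, `dmesg | grep -i microcode` should show whether the kernel picked up a microcode update.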


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

(20th February 2017, 14:59)leszek Wrote: It seems to me to be a bug in your hardware and the kernel has no workaround for it.

Hmmm, well i still cannot grasp how this is [only] MY problem, given i posted those links showing MANY other people have been having apparently the same pain [& Rocky also confirmed with his research]. 


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

Rocky - thanks. I still intend to reply to your latest details, but must first post this latest unpleasant update.

Another total system freeze: 10th Hard Reset needed since my 31/12/16 Maui clean reinstallation, 6th since changing from NVidia to Nouveau GPU driver, & 2nd with Kernel 4.9.9.

Unlike most of these freezes, today's one again was not during Resume from Suspend, but instead was during a normal working session [& not running a backup, no USB sticks in USB ports], just multiple pgms open [but still less than 1/3 of my 32 GB RAM used]. 

Hilariously ironic though; wrt https://forums.mauilinux.org/showthread.php?tid=24285&pid=41774#pid41774, i had one of my Maui VMs open, & had run the HWE command in Konsole in it, which was maybe 75% complete when the Tower freeze struck. Looking at the corresponding syslog file, this time there were NO instances of that "NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s!" error / failure. 

Instead, at 15:22 this happened; 

Code:
Feb 21 15:22:42 Z97-HD3 kernel: [113067.118718] general protection fault: 0000 [#1] SMP
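
To compare freezes across sessions, it can help to pull both kinds of kernel error out of the log in one go. A minimal sketch, assuming Ubuntu's default /var/log/syslog location; scan_kernel_faults is a hypothetical helper name, and the two patterns are simply the messages quoted in this thread.

```shell
# scan_kernel_faults is a hypothetical helper; the default path assumes
# Ubuntu's /var/log/syslog. Pass another file (e.g. syslog.1) as $1.
scan_kernel_faults() {
    log="${1:-/var/log/syslog}"
    # Match the two error signatures seen so far in this thread.
    grep -E 'soft lockup|general protection fault' "$log"
}
```

The lines that follow a "general protection fault" header (the register dump and call trace) are usually the informative part, so running `grep -A 30 'general protection fault' /var/log/syslog` afterwards may be worthwhile.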


 


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

(20th February 2017, 16:22)rocky7x Wrote: Well I've googled a bit as well for you and found a few possibilities:
1. it can be a power supply issue: your power supply may not be giving the CPU enough voltage, so it might be that your power supply is near end of life or you have too many peripherals attached for the power supply to handle
2. your motherboard may not be giving the CPU enough voltage, so your motherboard may be near end of life

Thank you. Yes, as i wrote in my earlier post [my Point #5], PSU & MoB were mentioned by various posters per my links, but in my interpretation, these seemed to be Users taking educated guesses rather than categorically proving root cause. As for the possibility of either being "near end of life", given my Tower hails from only May 2015, i would be mighty annoyed if true.


(20th February 2017, 16:22)rocky7x Wrote: 3. since I think you have an Intel CPU, have you installed a package called intel-microcode? If not, please try it; it will not do any harm, and it lets the system handle Intel CPUs properly

It is an Intel [Quad core Intel Core i7-4790 (-HT-MCP-) cache: 8192 KB]. Following your advice i have right now installed intel-microcode via Synaptic.


(20th February 2017, 16:22)rocky7x Wrote: 4. since Ubuntu 16.04, many people have started experiencing the same issue as you, there is an active bug opened with Ubuntu, last post was 22 hours ago - look here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1530405

Um, yes, thanks, i know... that is my 1st link i provided in my earlier post.


(20th February 2017, 16:22)rocky7x Wrote: 5. it seems that people with Nvidia cards are having this issue, so even though you have switched to the Nouveau driver, I would experiment again with reverting to the proprietary Nvidia driver and checking for the lockups

I did also notice that Nvidia was mentioned many times, but i was confused. Some posters seemed to think that their Nvidia drivers were the root cause [hence they advocated changing to Nouveau], whereas diametrically-opposite some posters seemed to feel it was Nouveau at fault, recommended changing to Nvidia drivers, & actually a few times mentioned they had "blacklisted nouveau" [what does that mean? is it sensible? how to do it?].
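
On the "blacklisted nouveau" question: blacklisting means telling modprobe not to load the nouveau kernel module automatically, usually so that the proprietary Nvidia module can claim the card instead. It only makes sense when the proprietary driver is the one in use. A minimal sketch of how it is commonly done on Ubuntu follows; the file name blacklist-nouveau.conf and the helper name are assumptions chosen for illustration, and the proprietary driver packages often install an equivalent file on their own.

```shell
# write_nouveau_blacklist is a hypothetical helper; the file name is an
# assumption. modprobe reads any *.conf file under /etc/modprobe.d/.
write_nouveau_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    # First line stops automatic loading; second disables nouveau modesetting.
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf"
}
```

Writing under /etc needs a root shell (for example via `sudo -i`). The change normally takes effect only after `sudo update-initramfs -u` and a reboot, and deleting the file and rebuilding the initramfs undoes it.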

I do note your suggestion to try my Nvidia driver again [which i assume you are only making specifically in awareness of my kernel upgrade to 4.9.9, otherwise i cannot see the point, given that the explicit reason i previously switched from Nvidia to Nouveau [per advice from leszek, Pliny & you, if i recall] with the various older kernels was the plethora of freezes]. However, given today's new change already made [the intel-microcode installation, as above], i shall hold off on the GPU driver switch-back until/unless there's another freeze [ie, so i can properly judge the effectiveness or not of intel-microcode].


(20th February 2017, 16:22)rocky7x Wrote: 6. there is a parameter that you can set that may help to avoid the lockups - try to set it like this:
Code:
sudo echo 120 > /proc/sys/kernel/watchdog_thresh

By default that parameter is set to 10 seconds. If you prolong it to let's say 120 seconds, it will maybe help. In any case, that behavior is definitely NOT normal and something IS going on there. And in any case, this is DEFINITELY not a Maui thing, I think that you would have similar issues with any other Ubuntu 16.04 derivative. You have to understand that Maui is ONLY the Plasma desktop environment, nothing else - everything else, below that, like graphics, kernels, hardware support etc. is purely related to the Ubuntu and stuff provided by Ubuntu, so whether it's Maui, Neon, Mint or anything else - all is equally impacted. So when you search, please search for stuff related to Ubuntu 16.04.

Golly that's an interesting idea, thanks. I have now done it [which is a bit naughty of me as that's now TWO variables i've changed today, against scientific-method best practice]. Note however that my attempts to do this in Konsole failed:
Code:
Z97-HD3:/$ sudo echo 120 > /proc/sys/kernel/watchdog_thresh
bash: /proc/sys/kernel/watchdog_thresh: Permission denied
So instead i opened the file in Kate as Root & edited its "10" to "120" [should i be worried at Konsole's permission rejection?]. Sigh, that also failed:
Code:
The document could not be saved, as it was not possible to write to /proc/sys/kernel/watchdog_thresh.

Check that you have write access to this file or that enough disk space is available.
There's 125 GB free, so that's not the issue. I'll reboot after posting, & try again.


EDIT: Post-reboot, still "Permission denied". Oh dear, why do these things keep happening to me?
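
The "Permission denied" here is most likely not a fault on the Tower. In `sudo echo 120 > file`, the `> file` redirection is set up by the ordinary user's shell before sudo even starts, so only echo runs as root while the write itself is attempted without privileges. Kate probably fails for a related reason: files under /proc are virtual kernel interfaces, and editors do not generally save to them in a way the kernel accepts. Below is a sketch of a form that performs the write with root rights; set_watchdog_thresh is a hypothetical helper name. One assumption worth verifying: on the kernels I am aware of, watchdog_thresh only accepts values from 0 to 60, so 120 may be rejected with "Invalid argument" even as root, which is why the sketch refuses larger values.

```shell
# set_watchdog_thresh is a hypothetical helper name.
# Assumption: the kernel accepts 0..60 for this parameter; verify on your kernel.
set_watchdog_thresh() {
    value="$1"
    target="${2:-/proc/sys/kernel/watchdog_thresh}"
    case "$value" in
        ''|*[!0-9]*) echo "not a number: $value" >&2; return 2 ;;
    esac
    if [ "$value" -gt 60 ]; then
        echo "value above assumed maximum of 60: $value" >&2
        return 2
    fi
    if [ -w "$target" ]; then
        # Already privileged (or a scratch file): write directly.
        printf '%s\n' "$value" > "$target"
    else
        # tee runs under sudo, so the write itself has root rights.
        printf '%s\n' "$value" | sudo tee "$target" > /dev/null
    fi
}
```

Called as `set_watchdog_thresh 60`, this does the same job as `echo 60 | sudo tee /proc/sys/kernel/watchdog_thresh` or `sudo sysctl -w kernel.watchdog_thresh=60`. The setting does not survive a reboot; a `kernel.watchdog_thresh = 60` line in /etc/sysctl.conf would make it persistent.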


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - rocky7x - 21st February 2017

OK, so you have an Intel Haswell i7-4790, which has an integrated Intel HD 4600 GPU, which is more powerful than the Nvidia GT 610 ;) I'm quite puzzled why on earth you are even using that Nvidia card at all. From all that has been written up until now, my recommendation would be:
1. either physically take out the Nvidia card from the Tower and use only integrated Intel card
2. or if you cannot take the Nvidia GPU out, switch back to proprietary Nvidia driver and via it, switch to the Intel integrated card, either via the Nvidia GUI or via command

Code:
sudo prime-select intel

3. now that we would have the Nvidia crap sorted out, we go back to the kernel stuff - if I were you, I would remove the 4.9.9 kernel (because it's not tested thoroughly and it can be quite unstable) and install the one from HWE, so 4.8, which is much more stable and thoroughly tested.
4. to set the threshold parameter properly you would first need to switch to root, so follow the commands:

Code:
sudo su    # then enter your sudo password
echo 120 > /proc/sys/kernel/watchdog_thresh

That should do it. But before changing that parameter, try the first 3 steps, I'm quite positive that your troubles might be solved by that.


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

Greetings once more.

Quote:OK, so you have Intel Haswell i7-4790, which has an Intel integrated HD 4600 GPU, which is more powerful than the Nvidia GT 610

Oh golly, i had no idea! :blush:


Quote:I'm quite puzzled why on earth are you even using that Nvidia card at all?

Ah, well that's easy for me to answer, but you'd best sit down as i need to lay some heavy-duty tech stuff on ya. Reason = ...that's what came in my Tower :D


Quote:From all this that was written up until now, my recommendation would be:

1. either physically take out the Nvidia card from the Tower and use only integrated Intel card
2. or if you cannot take the Nvidia GPU out, switch back to proprietary Nvidia driver and via it, switch to the Intel integrated card, either via the Nvidia GUI or via command

As a lazy cow, i shall try for #2 before having to resort to #1. Will try it after posting this.


Quote:remove the 4.9.9 kernel (because it's not tested thoroughly and it can be quite unstable) and install the one from HWE, so 4.8, which is much more stable and thoroughly tested

Certainly i'm happy & willing to do this too, but just to double-check, to "get" the 4.8 HWE kernel into my Tower, i assume that means running the:
Code:
sudo apt-get install --install-recommends xserver-xorg-hwe-16.04
...per your https://forums.mauilinux.org/showthread.php?tid=24285&pid=41774#pid41774 [ie, giving me not only the kernel, but all the other stuff you explained therein]?


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - rocky7x - 21st February 2017

Well, the command will give you the whole HWE, but apart from the kernel it's the graphics stack, so new X server and drivers, which in this case is safe to install. Since you do not use Bumblebee, nothing additional for you to do.
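
One point worth double-checking against the linked post: according to Ubuntu's LTS enablement stack documentation, the 4.8 kernel comes from the linux-generic-hwe-16.04 metapackage, while xserver-xorg-hwe-16.04 on its own covers the graphics stack. The documented command for 16.04 desktops names both packages: `sudo apt-get install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04`. These package names come from that documentation rather than from this thread. After the install and a reboot, a small check like the following sketch can confirm which kernel series is actually running; kernel_series is a hypothetical helper name.

```shell
# kernel_series is a hypothetical helper: it keeps only the first two
# dot-separated fields of a kernel release string.
kernel_series() {
    printf '%s\n' "$1" | cut -d. -f1,2
}
```

Running `kernel_series "$(uname -r)"` should print 4.8 once the HWE kernel has booted, and 4.9 while the mainline 4.9.9 kernel is still the one in use.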


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

Stupid me. I am writing this post on my Lappy now, as i managed to break my Tower :(  Sequence to destruction:
1. In Maui Settings i changed back from Nouveau to the latest Nvidia driver, then rebooted.

2. i ran this in Konsole, then rebooted:
Code:
sudo prime-select intel

3. Tower would now only present a black screen after the Maui splash screen.

4. I toggled to tty, accessed grub with nano, & inserted "nomodeset" before "quiet splash", rebooted, & was again able to reach the login screen & login.

5. I was unable to understand if i was now truly using the Intel GPU, so i captured some Settings pics, intending to upload them here for assessment. However prior to that, i made the following bad decision...

6. I dumbly thought i would now remove kernel 4.9.9 [ok], but using  Synaptic rather than following the Ubuntu instructions https://wiki.ubuntu.com/Kernel/MainlineBuilds [not ok; dumb dumb dumb] -- i only realised my mistake a bit later, when it was too late.

7. So that's what i did, & was "surprised" that it only removed ONE file, "linux-image-4.9.9-040909-generic" [as best i can now recall, a couple of anguished hours later], not also "linux-headers-4.9.9-040909" & "linux-headers-4.9.9-040909-generic".

8. I rebooted, thus sealing my fate.

9. Tower now repeatedly hung at the 2nd splash-screen.

10. Trying to boot any of the other older installed kernels, including the recovery mode, still ended in splash-screen hang.

11. Now i assumed this misbehaviour was coz i crippled 4.9.9, by removing 1/3 of it but leaving 2/3 of it behind.

12. Therefore i decided to try, in tty, following the proper Ubuntu removal procedure "Uninstalling upstream kernels" at the link i pasted above. That is:
Code:
sudo apt-get remove linux-headers-4.9.9-040909 linux-headers-4.9.9-040909-generic linux-image-4.9.9-040909-generic
This has now removed "linux-headers-4.9.9-040909" & "linux-headers-4.9.9-040909-generic" [grep no longer finds them].

13. However i could not also remove "linux-image-4.9.9-040909-generic", due to the message it's "not installed, so not removed"... doubtless coz of my Synaptic idiocy.

14. Subsequent reboot attempts continue to hang at the 2nd splash-screen.

15. I returned to tty & repeated my... 
Code:
dpkg -l | grep "linux\-[a-z]*\-4.9.9"
...which unhappily discovered that "linux-image-4.9.9-040909-generic" is still found [the other two are not, anymore]. Yet all attempts to remove via...
Code:
sudo apt-get remove linux-image-4.9.9-040909-generic
... continue to fail, with "not installed, so not removed".

16. Catch-22.

Other than toss myself off the roof, can anyone offer some recovery steps please?
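
A likely explanation for the Catch-22 in steps 13 to 15: in `dpkg -l` output the first column is a two-letter state, where `ii` means installed and `rc` means removed with configuration files left behind. A package removed (rather than purged) in Synaptic typically ends up as `rc`, so grep still finds it while apt-get remove correctly reports it as not installed. The sketch below, with the made-up helper name residual_packages, lists such leftovers; it illustrates the state codes and is not a confirmed diagnosis of this Tower.

```shell
# residual_packages is a hypothetical helper.
# Reads `dpkg -l` output on stdin and prints the names of packages in
# state "rc" (removed, configuration files remaining).
residual_packages() {
    awk '$1 == "rc" { print $2 }'
}
```

Usage: `dpkg -l | residual_packages`. If linux-image-4.9.9-040909-generic shows up there, `sudo apt-get purge linux-image-4.9.9-040909-generic` should clear the leftover entry. As far as I know, a package in state `rc` has no effect on booting.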


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - leszek - 21st February 2017

The kernel removal is not the problem. It does not matter if you remove all the packages at once or one at a time. It should not leave you with that result.
The kernel itself is just in the big image package. The big headers package only carries header code for other modules or apps to compile against. So it does not contain anything necessary for running the kernel itself.

As for the error itself: does second splash mean that you see a login screen (or, when autologin is enabled, your mouse cursor) and it hangs after logging in?
Maybe it is loading incompatible OpenGL compositor settings now that you have switched to using the Intel driver.
When it hangs after the splash, can you press shift+alt+f12 to deactivate the effects (compositor) temporarily?
Does your desktop show up then?

If it does not or I am mistaken about the second splash try changing back your command
Code:
sudo prime-select nvidia
and see if it makes it work again.

That's the only thing that comes to my mind reading your procedure that could break it.
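
To tell which GPU driver is actually in use after a prime-select change, two checks usually help: `prime-select query` prints the currently selected profile, and `lspci -k` shows a "Kernel driver in use" line for each device. The helper below (gpu_drivers, a hypothetical name) pulls those lines out of `lspci -k` output. It assumes the usual lspci layout, where each device line starts with the bus address and its detail lines are indented beneath it.

```shell
# gpu_drivers is a hypothetical helper.
# Reads `lspci -k` output on stdin and prints the kernel driver bound to
# each VGA or 3D controller (for example i915, nouveau or nvidia).
gpu_drivers() {
    awk '
        # Device lines start with a hex bus address; remember if it is a GPU.
        /^[0-9a-fA-F]/ { gpu = ($0 ~ /VGA compatible controller|3D controller/) }
        # Indented detail line belonging to the most recent GPU device.
        gpu && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage: `lspci -k | gpu_drivers`. With nomodeset on the kernel command line the result may be misleading or empty, since kernel modesetting drivers may not bind properly in that mode.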


RE: Tower's 1st [no, 3rd] Hard-Reset since clean-reinstall. - kdemeoz - 21st February 2017

PS -- As i typed the preceding log [all from confused tired memory, i hope it's all correct], i briefly wondered if maybe i should, still in tty, attempt to repeat the 4.9.9 downloads & installs per that linked page, then try again to fully remove, the proper way. However, i then remembered the Ubuntu page cautions that:
Quote:Preparing to install an upstream kernel

First, if one is using select proprietary or out-of-tree modules (ex. virtualbox, nvidia, fglrx, bcmwl, etc.) unless there is an extra package available for the version you are testing, you will need to uninstall the module first, in order to test the mainline kernel. If you do not uninstall these modules first, then the upstream kernel more than likely will not boot.

...& curses; as per my step #1, i changed back from Nouveau to the latest Nvidia driver [& still can't tell if the attempt to use the Intel driver did work or not]. Back when i did first install 4.9.9 on 14Feb, i was at that stage already using Nouveau, so there was no conflict with 4.9.9. Now in hindsight, i think tonight i first should have removed 4.9.9 [the proper way] whilst i was still using the Nouveau driver, & only then attempted the gpu driver change... after which i could have gone for 4.8 HWE [that now looks to be a looooooooooooooooooong way away].

It feels like i am snookered.