Wrong temperature reading #25

Open
warped-rudi opened this issue Dec 20, 2014 · 22 comments
Comments

@warped-rudi
Contributor

On one of my CuBox-i4p units (early model), the CPU temperature exposed via hwmon is bogus (i.e. lower than ambient). I'd guess that it's off by about 20°C. This does not happen on a second CuBox-i4p (newer model), nor on an HB-i2ex or a C1-solo.

@linux4kix
Owner

What temperatures is it reporting? Does it move up and down
properly when in use? Please use something like cpuburn to increase the
temperature.

@warped-rudi
Contributor Author

Yes, it does move up and down. I simulated the same load situation on both CuBoxes and the difference is pretty much exactly 20 K.

@warped-rudi
Contributor Author

This seems to be a problem with this particular box. I reverted the kernel to a version from before the thermal changes and the behavior is the same. I probably didn't pay attention to this in the past. I even got negative temperature readings right after a cold boot. So either the sensor calibration data is bad, the sensor hardware is broken, or the clocks are wrong...

@susisstrolch

Same problem here, with OE5RC3...
Old CBi4pro (Feb '14) - approx. 33° when viewing HD content
New CBi4pro (Dec '14) - approx. 64° with the same content
Seems we should chase Rabeeh about this...

@linux4kix
Owner

Are the older CBis dev units or purchased units?

@susisstrolch

Both CBis are purchased ones...

Just did some testing:
Idling OE5RC5, HDMI, 1920x1080, 50Hz
old: core: 44°, case: 34°, clock 396000
new: core: 64°, case: 42°, clock 396000

case measured with PT100 probe.

dtb's are identical, kernel is 3.14.25

@warped-rudi
Contributor Author

@linux4kix: Mine is probably a dev unit. At least it was shipped as such in Jan'14.

@susisstrolch: Is there really such a difference in the case temperature, or is that a typo?

@susisstrolch: Can you run 'devmem 0x21bc4e0 32'? That should show the sensor calibration data. I don't have mine handy right now, but I'd like to compare. If they are the same, there is a chance that they are wrong...

@linux4kix
Owner

My guess is that the 25°C calibration fuse is wrong.

@susisstrolch

Yes, that's the real temp diff. Checked twice...

Calibration data is identical on both boxes:

CuBox-i4pro:~ # devmem 0x21bc4e0 32
0x5624D869

Cubox-I4:~ # ./devmem 0x21bc4e0 32
0x5694D969

@warped-rudi
Contributor Author

@linux4kix I think the problem is with commit 2b1601b. To me it looks like the new 'universal formula' should only be applied to parts that were calibrated according to it. Given the time of the commit, I suspect this applies to chips manufactured after Feb'14. I did a manual calculation with the calibration data of my 'bad' unit and got exactly the difference I observed. Now, the question is how to auto-detect that. The commit message says: 'there will be no hot point calibration data in fuse map from now on'. However, the 'good' units do have hot point data as well! I have not (yet) checked whether they are valid and whether the old formula still works.

@susisstrolch 0x5624D869 != 0x5694D969. I'm still puzzled about the temperature difference. Also, 64°C looks a bit high for an idle device.
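
To see exactly where the two calibration words differ, they can be XORed and split into fields. This is only a minimal sketch and was not posted in the thread: the field layout (bits [31:20] count at 25°C, bits [19:8] count at the hot point, bits [7:0] hot-point temperature) is an assumption based on my reading of the mainline imx_thermal driver, and `split_calibration` is a hypothetical helper name.

```python
# Calibration words as posted above (read with `devmem 0x21bc4e0 32`).
WORD_A = 0x5624D869
WORD_B = 0x5694D969


def split_calibration(word):
    """Split a 32-bit calibration word into (n1, n2, t2).

    ASSUMED layout (not confirmed in this thread):
      [31:20] sensor count at 25 C
      [19:8]  sensor count at the hot point
      [7:0]   hot-point temperature in degrees C
    """
    n1 = (word >> 20) & 0xFFF
    n2 = (word >> 8) & 0xFFF
    t2 = word & 0xFF
    return n1, n2, t2


# Bits that are set in exactly one of the two words.
diff = WORD_A ^ WORD_B
print(f"differing bits: 0x{diff:08X}")

for name, word in (("A", WORD_A), ("B", WORD_B)):
    n1, n2, t2 = split_calibration(word)
    print(f"word {name}: n1={n1} n2={n2} t2={t2}")
```

Under that assumed layout, the two words differ by a few counts in both calibration points and share the same hot-point temperature field.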

@linux4kix
Owner

Rudi, if you revert that patch does temperature reading work properly for
both units? Perhaps we need to test if there is hot point calibration data
and then only use the new formula if there isn't. We should also report
this to fsl as this is a patch pushed to upstream as well and could damage
older chips if pushed hard enough.

64C looks hot for idle, but it may be XBMC running idle which is far from
idle. In those circumstances 64C looks just about right to me.
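
The fallback suggested above (keep the old two-point formula when hot-point calibration data is present, use the new one otherwise) could be sketched roughly as follows. This is an illustration, not the driver code: the fuse layout, the two-point interpolation, and the 'universal' constants 0.4297157 and 0.0015976 are assumptions based on my reading of the mainline imx_thermal driver, and all function names are hypothetical.

```python
T1 = 25.0  # room-temperature calibration point in degrees C (assumed)


def split_calibration(word):
    """ASSUMED layout: [31:20] count at 25 C, [19:8] hot count, [7:0] hot temp."""
    return (word >> 20) & 0xFFF, (word >> 8) & 0xFFF, word & 0xFF


def two_point_temp(word, n_meas):
    """Old formula: linear interpolation between the 25 C and hot points."""
    n1, n2, t2 = split_calibration(word)
    return t2 + (n_meas - n2) * (T1 - t2) / (n1 - n2)


def universal_temp(word, n_meas):
    """New formula: slope derived from the 25 C count only (constants assumed)."""
    n1, _, _ = split_calibration(word)
    slope = 0.4297157 - 0.0015976 * n1
    return T1 + (n_meas - n1) / slope


def pick_formula(word):
    """Use the two-point formula only when hot-point data looks programmed."""
    n1, n2, t2 = split_calibration(word)
    has_hot_point = n2 != 0 and t2 != 0 and n2 != n1
    return two_point_temp if has_hot_point else universal_temp


word = 0x5624D869  # one of the calibration words posted in this thread
n1, _, _ = split_calibration(word)
convert = pick_formula(word)
# At the 25 C calibration count, both formulas should return 25.0.
print(convert.__name__, convert(word, n1), universal_temp(word, n1))
```

As pointed out earlier in the thread, units that read correctly also carry hot-point data, so the mere presence of that field may not be a reliable way to tell the two calibration schemes apart.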

@warped-rudi
Contributor Author

Will test when I'm at home this evening.

@linux4kix
Owner

Great, thanks.

@susisstrolch

Ooops - sloppy pattern matching w/o glasses...

With KODI stopped I get the following values:

CBi Old: 25,00°C @396MHz, i.MX6Q, silicon rev 1.2
CBi New: 40,68°C @396MHz, i.MX6Q, silicon rev 1.5

@linux4kix
Owner

Okay those numbers look better. If reverting the patch fixes the older
silicon rev then we may be able to use that as the identifier for the
algorithm used.

@susisstrolch

Uuups...
strolch@strolch:~/Development/OpenELEC/tools/mkpg/linux-imx_3.14.x.git> git revert 2b1601b
error: could not revert 2b1601b... thermal: imx: update formula for thermal sensor
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'

Are you talking about the linux-linaro-lsk-v3.14-mx6 branch?
How can I read the raw data (devmem xxx) of sensor values?

@linux4kix
Owner

It probably doesn't revert cleanly. I can write a quick patch when I get
home for you to test with.

@susisstrolch

Not necessary - fixed - only comments are affected...

@susisstrolch

Same as before with slightly different values...
Old: 25,0° / 34,6°C (no KODI / KODI sitting in OSD)
New: 40,6° / 50,8°C
So it would be really interesting to see the raw value from the thermal sensor...

@warped-rudi
Contributor Author

Obviously my manual calculation was flawed. My test showed the same result as @susisstrolch experienced. There is only a small difference between the two formulas. Also both of my CuBoxes are of 'silicon revision 1.2'. So the question remains why the temperature readout of the older one is so low.

@susisstrolch: the raw data is at 0x20c8180, bits [19:8], i.e. ((val >> 8) & 0xfff)

@susisstrolch

Here are the raw ones:S
Old: 44°C - 0x4e954106 -> 65
New: 60°C - 0x4ef52906 -> 41

HBi2ex: 33° - 0x51657806 -> 120

Sure about the 0x20c8180?

@warped-rudi
Contributor Author

The data field is 12 bits wide, not just 8.
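
Applied to the register dumps posted above, the difference between an 8-bit and a 12-bit mask looks like this. A minimal sketch: `raw_count` is a hypothetical helper name, and the register values are copied from the earlier comment.

```python
# Dumps of the register at 0x20c8180 as posted above.
DUMPS = {
    "Old": 0x4E954106,
    "New": 0x4EF52906,
    "HBi2ex": 0x51657806,
}


def raw_count(val, width=12):
    """Extract the sensor count that starts at bit 8 and is `width` bits wide."""
    return (val >> 8) & ((1 << width) - 1)


for name, val in DUMPS.items():
    print(f"{name}: 8-bit={raw_count(val, 8)} 12-bit={raw_count(val)}")
```

The 8-bit results (65, 41, 120) match the numbers posted above; the full 12-bit counts are 1345, 1321 and 1400.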

moonman pushed a commit to moonman/linux-imx6-3.14 that referenced this issue Mar 8, 2015
commit 6ffa30d3f734d4f6b478081dfc09592021028f90 upstream.

Bruce reported seeing this warning pop when mounting using v4.1:

     ------------[ cut here ]------------
     WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
    CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ linux4kix#25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
     0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
     0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
     ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
    Call Trace:
     [<ffffffff8186ac78>] dump_stack+0x4c/0x65
     [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0
     [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0
     [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430
     [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130
     [<ffffffff810d941e>] groups_alloc+0x3e/0x130
     [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc]
     [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc]
     [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc]
     [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc]
     [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
     [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0
     [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
     [<ffffffff810d45bf>] kthread+0x11f/0x140
     [<ffffffff810ea815>] ? local_clock+0x15/0x30
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
     [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
    ---[ end trace 675220a11e30f4f2 ]---

nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
which is just wrong. Fix that by finishing the wait immediately if we've
found that the list has something on it.

Also, we don't expect this kthread to accept signals, so we should be
using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up
hung task warnings from the watchdog, so have the schedule_timeout
wake up every 60s if there's no callback activity.

Reported-by: "J. Bruce Fields" <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>