Dear all,
I have run into a problem when running some code implemented in the
Bioconductor panp-package (applied to my own expression data), whereby gene
expression values of known true negative probesets (x) are interpolated onto
present/absent p-values (y) between 0 and 1 using the *approxfun -
function*{stats}; when I have used R version 2.8, everything had
worked fine,
however, after updating to R 2.11.1., I got unexpected output (explained
below).
Please correct me here, but as far as I understand, the yleft and yright
arguments set the extreme values of the interpolated y-values in case the
input x-values (on whose approxfun is applied) fall outside range(x). So if
I run approxfun with yleft=1 and yright=0 with y-values between 0 and 1,
then I should never get any values higher than 1. However, this is not the
case, as this code-example illustrates:
are 2000 expression values ranging from ~ 3 to 7:
happen if input x-values lie outside range(x):
Warning message:
In approxfun(xNeg, yNeg, yleft = 1, yright = 0, rule = 2) :
collapsing to unique 'x' values
and can therefore lie outside range(xNeg):
[1] 0.000 6208.932
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 0.000e+00 2.774e-03 1.299e+00 3.164e-01 6.209e+03
So the resulting output PV object contains data ranging from 0 to 6208, the
latter of which lies outside yleft and is not anywhere close to extreme
y-values that were used to set up the interp-function. This seems wrong to
me, and from what I understand, yleft and yright are simply ignored?
I have attached a few histograms that visualize the data distributions of
the objects I xNeg, yNeg, AllExprs[,1] (== input x-values) and PV (the
output), so that it is easier to make sense of the data structures...
Does anyone have an explanation for this or can tell me how to fix the
problem?
Thanks a million for any help, best, Sam
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0
locale:
[1] en_IE.UTF-8/en_IE.UTF-8/C/C/en_IE.UTF-8/en_IE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] panp_1.18.0 affy_1.26.1 Biobase_2.8.0
loaded via a namespace (and not attached):
[1] affyio_1.16.0 preprocessCore_1.10.0
--
-----------------------------------------------------
Samuel Wuest
Smurfit Institute of Genetics
Trinity College Dublin
Dublin 2, Ireland
Phone: +353-1-896 2444
Web: http://www.tcd.ie/Genetics/wellmer-2/index.html
Email: wuests*******
------------------------------------------------------
The plots did not come through, see the posting guide for which attachments are allowed. It will be easier for us to help if you can send reproducible code (we can copy and paste to run, then examine, edit, etc.). Try finding a subset of your data for which the problem still occurs, then send the data if possible, or similar simulated data if you cannot send original data.
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow*******
801.408.8111
It looks like you have found a bug, I can confirm that with your data on m computer that I am getting nonsense results for some cases. I even foundthat when calling the function on element 164 of the input vector that I dn't even get consistent results, I ran it several times and many times gotdifferent results.
I looked at the code for the function and I don't think that the problem i in the R code portion, but is in the compiled C portion. You should sendthis in as a bug report.
One note on your code, you don't need sapply, you could just say interp(aprox.data$input) and it would give the vector of interpolations.
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow*******
801.408.8111
OK, I think that I figured out what is going on. You have some of your x alues that are very close to each other in value, but not exactly the same If we look at how many unique x values you have we get:
[1] 901
But inside the approxfun function the tapply function is used to collapse he x values and tapply coerces your x variable to a factor (which first corces to character, but not to the same precision as unique uses):
[1] 893
And this is number of values that the y vector is reduced to, so when R pases your x and y to the compiled C function it gets a y vector of length 83, but expects one of length 901, so there are 8 positions beyond the end f the vector that it is trying to use, but no real data is in those positins, what is is apparently somewhat random, so when you try to predict nearthe right most extreme of your data (but no beyond), the results are not pedictable.
The work around is to pre clean your data to have unique x values so that does not run into this problem. Another option would be to round the x vaiables to fewer decimal places so that the conversion to factor matches th unique values.
This should still be reported as a bug, but the R code could be fixed rathr than searching the compiled C code.
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow*******
801.408.8111
<p><br><br><br></p>
[[alternative HTML version deleted]]
SW> Hi Greg,
SW> thanks for the suggestion:
SW> I have attached some small dataset that can be used to reproduce the
SW> odd behavior of the approxfun-function.
SW> If it gets stripped off my email, it can also be downloaded at:
SW> http://bioinf.gen.tcd.ie/approx.data.Rdata
SW> Strangely, the problem seems specific to the data structure in my
SW> expression set, when I use simulated data, everything worked fine.
SW> Here is some code that I run and resulted in the strange output that I
SW> have described in my initial post:
>> ### load the data: a list called approx.data
>> load(file="approx.data.Rdata")
>> ### contains the slots "x", "y", "input"
>> names(approx.data)
SW> [1] "x" "y" "input"
>> ### with y ranging between 0 and 1
>> range(approx.data$y)
SW> [1] 0 1
>> ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
>> range(approx.data$x)
SW> [1] 3.098444 7.268812
>> range(approx.data$input)
SW> [1] 3.329408 13.026700
>>
>>
>> ### generate the interpolation function (warning message benign)
>> interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
SW> Warning message:
SW> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
SW> collapsing to unique 'x' values
>>
>> ### apply to input-values
>> y.out <- sapply(approx.data$input, interp)
>>
>> ### still I find output values >1, even though yleft=1:
>> range(y.out)
SW> [1] 0.000000 7.207233
I get completely different (and correct) results,
by the way the *same* you have in the bug report you've
submitted
( https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14377 )
and which does *not* show any bug:
[1] 0.0000000 0.9816907
Of course, I do believe that you've seen the above problems,
-- on 64-bit Mac ? as you report in sessionInfo() ? --
but I cannot reproduce them.
And also, you seem yourself to be able to get different results
for the same data... what are the circumstances?
Regards,
Martin Maechler, ETH Zurich
I think Greg Snow's analysis (linked from the bug report) is right.
length(unique(x)) can be different than length(levels(as.factor(x)))
(which is what tapply uses to determine the number of levels).
I've fixed it by replacing
y <- as.vector(tapply(y,x,ties))
with
y <- as.vector(tapply(y,match(x,x),ties))
since match() uses the same rules as unique.
An argument could be made that tapply should be changed instead, but it
is behaving as documented, so I didn't want to do that.
Duncan Murdoch
#-------------- on another 64 bit Mac -------------
[1] "x" "y" "input"
[1] 0 1
[1] 3.098444 7.268812
yright=0, rule=2)
Warning message:
In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
collapsing to unique 'x' values
[1] -5.143958e+284 9.816907e-01
function (v)
.C("R_approxfun", as.double(x), as.double(y), as.integer(n),
xout = as.double(v), as.integer(length(v)), as.integer(method),
as.double(yleft), as.double(yright), as.double(f), NAOK = TRUE,
PACKAGE = "stats")$xout
<environment: 0x1598c91b8>
# attempt to offer yright and yledft in the sapply call failed
Error in FUN(c(6.99984458535897, 4.85139079147721, 7.58922165833,
10.8135863246057, :
unused argument(s) (yright = 0, yleft = 1)
#Create yright and yleft in the calling environment:
[1] 0.0000000 0.9816907
Now it "works" as expected.
It seems to me that the yleft and yright values may not be properly
incorporated into the constructed function.
R version 2.11.1 Patched (2010-06-14 r52281)
x86_64-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines stats graphics grDevices
[6] utils datasets methods base
other attached packages:
[1] gplots_2.8.0 caTools_1.10 bitops_1.0-4.1
[4] gdata_2.8.0 gtools_2.6.2 sos_1.2-9
[7] brew_0.1-1 qpcR_1.3-2 minqa_1.1.9
[10] Rcpp_0.8.5 nlme_3.1-96 cluster_1.12.3
[13] rgl_0.91 minpack.lm_1.1-4 MASS_7.3-6
[16] chron_2.3-35 maps_2.1-4 gmodels_2.15.0
[19] rbenchmark_0.3 data.table_1.5 rms_3.0-0
[22] Hmisc_3.8-1 survival_2.35-8 plyr_1.1
[25] lattice_0.18-8
loaded via a namespace (and not attached):
[1] tools_2.11.1
--
David Winsemius
David Winsemius, MD
West Hartford, CT
SW> Hi Greg,
SW> thanks for the suggestion:
SW> I have attached some small dataset that can be used to reproduce the
SW> odd behavior of the approxfun-function.
SW> If it gets stripped off my email, it can also be downloaded at:
SW> http://bioinf.gen.tcd.ie/approx.data.Rdata
SW> Strangely, the problem seems specific to the data structure in my
SW> expression set, when I use simulated data, everything worked fine.
SW> Here is some code that I run and resulted in the strange output that I
SW> have described in my initial post:
>>> ### load the data: a list called approx.data
>>> load(file="approx.data.Rdata")
>>> ### contains the slots "x", "y", "input"
>>> names(approx.data)
SW> [1] "x" "y" "input"
>>> ### with y ranging between 0 and 1
>>> range(approx.data$y)
SW> [1] 0 1
>>> ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
>>> range(approx.data$x)
SW> [1] 3.098444 7.268812
>>> range(approx.data$input)
SW> [1] 3.329408 13.026700
>>>
>>>
>>> ### generate the interpolation function (warning message benign)
>>> interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
SW> Warning message:
SW> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
SW> collapsing to unique 'x' values
>>>
>>> ### apply to input-values
>>> y.out <- sapply(approx.data$input, interp)
>>>
>>> ### still I find output values >1, even though yleft=1:
>>> range(y.out)
SW> [1] 0.000000 7.207233
MM> I get completely different (and correct) results,
MM> by the way the *same* you have in the bug report you've
MM> submitted
MM> ( https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14377 )
MM> and which does *not* show any bug:
>> range(y.out)
MM> [1] 0.0000000 0.9816907
MM> Of course, I do believe that you've seen the above problems,
MM> -- on 64-bit Mac ? as you report in sessionInfo() ? --
MM> but I cannot reproduce them.
MM> And also, you seem yourself to be able to get different results
MM> for the same data... what are the circumstances?
I now see that you *did* mention the fact that you
see *different* results when you *RE*run the same code
on this data.
The subject (" ...yleft and yright ignored") is misleading in
any case. These are not at all ignored...,
but indeed (as Duncan Murdoch has noted on the bug website),
there *is* a bug,
so you are principally right in reporting -- thank you!
....
and I could also confirm that -- as you mentioned in your first
post -- this bug does not appear when using R 2.8.1 (at least on
my platform).
Martin
I see this too -- and it appears to be because as.factor() finds a
different number of levels in R-devel than it did in 2.8.1. In 2.8.1,
the level names are not unique, but in R-devel, they are, so there are
fewer of them.
In 2.8.1, I see
sort(levels(as.factor(approx.data$x)))[285:287]
[1] "3.97124402436476" "3.97124402436476" "3.97129741844245"
(notice the first two are identical), whereas in R-devel, we get
[1] "3.97124402436476" "3.97129741844245" "3.97408959567848"
from the same expression.
Duncan Murdoch
Duncan Murdoch
>>>>>>> "SW" == Samuel Wuest <wuests*******>
>>
SW> Hi Greg,
SW> thanks for the suggestion:
>>
SW> I have attached some small dataset that can be used to reproduce the
SW> odd behavior of the approxfun-function.
>>
SW> If it gets stripped off my email, it can also be downloaded at:
SW> http://bioinf.gen.tcd.ie/approx.data.Rdata
>>
SW> Strangely, the problem seems specific to the data structure in my
SW> expression set, when I use simulated data, everything worked fine.
>>
SW> Here is some code that I run and resulted in the strange output that I
SW> have described in my initial post:
>>
>> >> ### load the data: a list called approx.data
>> >> load(file="approx.data.Rdata")
>> >> ### contains the slots "x", "y", "input"
>> >> names(approx.data)
SW> [1] "x" "y" "input"
>> >> ### with y ranging between 0 and 1
>> >> range(approx.data$y)
SW> [1] 0 1
>> >> ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
>> >> range(approx.data$x)
SW> [1] 3.098444 7.268812
>> >> range(approx.data$input)
SW> [1] 3.329408 13.026700
>> >>
>> >>
>> >> ### generate the interpolation function (warning message benign)
>> >> interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
SW> Warning message:
SW> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
SW> collapsing to unique 'x' values
>> >>
>> >> ### apply to input-values
>> >> y.out <- sapply(approx.data$input, interp)
>> >>
>> >> ### still I find output values >1, even though yleft=1:
>> >> range(y.out)
SW> [1] 0.000000 7.207233
>>
>>
>> I get completely different (and correct) results,
>> by the way the *same* you have in the bug report you've
>> submitted
>> ( https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14377 )
>> and which does *not* show any bug:
>>
>>> range(y.out)
>> [1] 0.0000000 0.9816907
>>
>> Of course, I do believe that you've seen the above problems,
>> -- on 64-bit Mac ? as you report in sessionInfo() ? --
>> but I cannot reproduce them.
>>
>> And also, you seem yourself to be able to get different results
>> for the same data... what are the circumstances?
> I think Greg Snow's analysis (linked from the bug report) is right.
> length(unique(x)) can be different than length(levels(as.factor(x)))
> (which is what tapply uses to determine the number of levels).
> I've fixed it by replacing
> y <- as.vector(tapply(y,x,ties))
> with
> y <- as.vector(tapply(y,match(x,x),ties))
> since match() uses the same rules as unique.
> An argument could be made that tapply should be changed instead, but it
> is behaving as documented, so I didn't want to do that.
I think you've worked around the problem (and I had seen Greg's
analysis), thank you. BTW, the problem wasn't there in R 2.8.x,
as we since had improved factor() ....
{.. and btw, the same changes should happen to
approx(), spline(), splinefun() ..}
*However*
why on earth can the results become *random*, i.e.,
differ from one call to the other, with identical data in the
same R session?
I currently guess that sometimes the operands (of the linear
interpolation rational arithmetic) are kept in high-precision
registers and sometimes not, I must say I am puzzled that this
happens dynamically rather than determined by the exact compiler
flags at (R's) build time
??
(and yes, this thread should have been on R-devel rather than R-help
from the start ...)
Martin
I think it's because the y vector ends up shorter than x, after they've
been assumed to be set to the same length. So at some point the
function is looking at garbage in memory beyond the end of y, and that
will vary from run to run.
I didn't trace into the C code to check this, but that's what the
symptoms look like.
Duncan Murdoch
>>>>>>> Duncan Murdoch <murdoch.duncan*******>
>>
>> >>>>>>> "SW" == Samuel Wuest <wuests*******>
>> >>
SW> Hi Greg,
SW> thanks for the suggestion:
>> >>
SW> I have attached some small dataset that can be used to reproduce the
SW> odd behavior of the approxfun-function.
>> >>
SW> If it gets stripped off my email, it can also be downloaded at:
SW> http://bioinf.gen.tcd.ie/approx.data.Rdata
>> >>
SW> Strangely, the problem seems specific to the data structure in my
SW> expression set, when I use simulated data, everything worked fine.
>> >>
SW> Here is some code that I run and resulted in the strange output that I
SW> have described in my initial post:
>> >>
>> >> >> ### load the data: a list called approx.data
>> >> >> load(file="approx.data.Rdata")
>> >> >> ### contains the slots "x", "y", "input"
>> >> >> names(approx.data)
SW> [1] "x" "y" "input"
>> >> >> ### with y ranging between 0 and 1
>> >> >> range(approx.data$y)
SW> [1] 0 1
>> >> >> ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
>> >> >> range(approx.data$x)
SW> [1] 3.098444 7.268812
>> >> >> range(approx.data$input)
SW> [1] 3.329408 13.026700
>> >> >>
>> >> >>
>> >> >> ### generate the interpolation function (warning message benign)
>> >> >> interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
SW> Warning message:
SW> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
SW> collapsing to unique 'x' values
>> >> >>
>> >> >> ### apply to input-values
>> >> >> y.out <- sapply(approx.data$input, interp)
>> >> >>
>> >> >> ### still I find output values >1, even though yleft=1:
>> >> >> range(y.out)
SW> [1] 0.000000 7.207233
>> >>
>> >>
>> >> I get completely different (and correct) results,
>> >> by the way the *same* you have in the bug report you've
>> >> submitted
>> >> ( https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14377 )
>> >> and which does *not* show any bug:
>> >>
>> >>> range(y.out)
>> >> [1] 0.0000000 0.9816907
>> >>
>> >> Of course, I do believe that you've seen the above problems,
>> >> -- on 64-bit Mac ? as you report in sessionInfo() ? --
>> >> but I cannot reproduce them.
>> >>
>> >> And also, you seem yourself to be able to get different results
>> >> for the same data... what are the circumstances?
>>
>> > I think Greg Snow's analysis (linked from the bug report) is right.
>> > length(unique(x)) can be different than length(levels(as.factor(x)))
>> > (which is what tapply uses to determine the number of levels).
>>
>> > I've fixed it by replacing
>>
>> > y <- as.vector(tapply(y,x,ties))
>>
>> > with
>>
>> > y <- as.vector(tapply(y,match(x,x),ties))
>>
>> > since match() uses the same rules as unique.
>>
>> > An argument could be made that tapply should be changed instead, but it
>> > is behaving as documented, so I didn't want to do that.
>>
>> I think you've worked around the problem (and I had seen Greg's
>> analysis), thank you. BTW, the problem wasn't there in R 2.8.x,
>> as we since had improved factor() ....
>>
>> {.. and btw, the same changes should happen to
>> approx(), spline(), splinefun() ..}
>>
>> *However*
>> why on earth can the results become *random*, i.e.,
>> differ from one call to the other, with identical data in the
>> same R session?
>> I currently guess that sometimes the operands (of the linear
>> interpolation rational arithmetic) are kept in high-precision
>> registers and sometimes not, I must say I am puzzled that this
>> happens dynamically rather than determined by the exact compiler
>> flags at (R's) build time
> I think it's because the y vector ends up shorter than x, after they've
> been assumed to be set to the same length. So at some point the
> function is looking at garbage in memory beyond the end of y, and that
> will vary from run to run.
> I didn't trace into the C code to check this, but that's what the
> symptoms look like.
aah, of course! (it seem I'm enjoying too much the warm late summer
sun here at home in Zurich ..)
Now, I'm happy it's "the usual" thing..
Martin
Hi Martin,
indeed, as mentioned in the bug-report, the results are inconsistent,
and each time I rerun, I get different results... Sometimes, the range
is correct even on my machine, but mostly I get values >1 back:
Here is another run:
[1] "x" "y" "input"
Warning message:
In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0, :
collapsing to unique 'x' values
[1] -1.360612e 81 2.151192e 249
....
This script was run on the same machine, same R-version ect.
this demonstrates the issue again, and when running the script again
before the bug-report, I had overlooked, that the range was indeed
correct...
I guess Greg Snow had suggested a fix for the bug correctly and the
issue should be resolved as posted by Duncan Murdoch. Thanks a
million!!
Best, Sam
On 11 September 2010 15:53, Martin Maechler <maechler*******> wote:
--
-----------------------------------------------------
Samuel Wuest
Smurfit Institute of Genetics
Trinity College Dublin
Dublin 2, Ireland
Phone: 353-1-896 2444
Web: http://www.tcd.ie/Genetics/wellmer-2/index.html
Email: wuests*******