Hello:

       How can one match names containing non-English characters that
appear differently in different but related data files?  For example, I
have data on Raúl Grijalva, who represents the third district of Arizona
in the US House of Representatives.  This first name appears as "Raúl"
in data read from one file and "Raul" from another.

       The ideal would convert both "Raúl" and "Raul" to "Raul".  A
reasonable alternative would identify the non-English characters and
match on everything else ("^Ra" and "l$" in this case).  The files all
contain state and district, so "AZ-3" could be part of the solution.
However, the file also contains data on Grijalva's predecessor in that
office, Ben Quayle, so "AZ-3" is not enough.
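
       A minimal sketch of the first option, assuming the accented
letters can be listed in advance.  The names accented, plain and
stripAccents are illustrative, not existing functions, and the lookup
covers only a handful of characters.  iconv(x, to = "ASCII//TRANSLIT")
is another route, but its output depends on the platform.

```r
# Hypothetical lookup of accented letters and their ASCII stand-ins
# (a, e, i, o, u with acute accent, then n with tilde); extend as needed.
accented <- "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1"
plain    <- "aeioun"

# stripAccents is an illustrative helper, not part of base R or Ecdat
stripAccents <- function(x) chartr(accented, plain, x)

# Made-up stand-ins for the names read from the two files
file1 <- c("Ra\u00fal Grijalva", "Tony C\u00e1rdenas", "Ben Quayle")
file2 <- c("Ben Quayle", "Raul Grijalva", "Tony Cardenas")

# Position in file2 of each name in file1, compared after normalization
idx <- match(stripAccents(file1), stripAccents(file2))
```

       Matching on the normalized name together with state and district
would still keep Grijalva and Quayle apart, since their names differ.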

       Thanks,
       Spencer

p.s.  My current data contains other similar cases, e.g.:

     Recipient     District
Raúl Grijalva   AZ House 3
Tony Cárdenas   CA House 29
Linda Sánchez   CA House 38
Raúl Labrador   ID House 1
André Carson    IN House 7
Bob Menéndez    NJ Senate
Ben Ray Luján   NM House 3
José Serrano    NY House 15
Nydia Velázquez NY House 7
Rubén Hinojosa  TX House 15

       These names all appear differently in another file I have. I've
written an ugly function that can identify "nonstandard characters".
I'm confident I can solve this problem.  However, I'm adding things like
this to the Ecdat package, and it would be more useful for others if I
made better use of other capabilities in R.

Hello:

       Do you have suggestions for how to aggregate a data.frame using
different functions on different columns?

       Consider the following example:

df2aggregate <- data.frame(id=rep(letters[1:4], each=2),
                            x =c(1:6, NA, NA),
                            y =c(NA, 1:6, NA),
                            a =c(NA, NA, LETTERS[1:6]),
                            stringsAsFactors=FALSE)

# Desired output:

ag1.2 <- data.frame(id=letters[1:4],
                     x =c(3, 7, 11, NA),
                     y =c(NA, 2.5, 4.5, NA),
                     a =c(NA, 'A', 'C', 'E'),
                     stringsAsFactors=FALSE)

       I'm thinking of writing a function Aggregate(x, by, FUN, ...),
where x = data.frame, by = vector of names of columns of x, and FUN =
function that would accept as input a data.frame subset of x and would
return a data.frame FUNout, which would be combined using cbind(x[, by],
FUNout), then rbind over all such subset data.frames.  However, before I
write this, I'd like to make sure it doesn't already exist.  My current
plan is to add it to the Ecdat package.
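
       A rough sketch of what such a function might look like, using
only split(), lapply(), cbind() and rbind() from base R.  Aggregate and
sumMeanFirst are hypothetical names for illustration, and the choice of
sum for x, mean for y and first value for a is inferred from the desired
output above.

```r
df2aggregate <- data.frame(id = rep(letters[1:4], each = 2),
                           x  = c(1:6, NA, NA),
                           y  = c(NA, 1:6, NA),
                           a  = c(NA, NA, LETTERS[1:6]),
                           stringsAsFactors = FALSE)

# Hypothetical sketch of the proposed Aggregate(); not an existing function
Aggregate <- function(x, by, FUN, ...) {
  # one data.frame per combination of the 'by' columns
  groups <- split(x, x[by], drop = TRUE)
  pieces <- lapply(groups, function(xi) {
    # key columns from the first row, then whatever FUN returns
    cbind(xi[1, by, drop = FALSE], FUN(xi, ...))
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Illustrative FUN: a different summary for each column
sumMeanFirst <- function(xi) {
  data.frame(x = sum(xi$x), y = mean(xi$y), a = xi$a[1],
             stringsAsFactors = FALSE)
}

ag <- Aggregate(df2aggregate, "id", sumMeanFirst)
```

       Here x comes back as integer rather than double, so a comparison
with ag1.2 would need all.equal() rather than identical().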

       Suggestions?  Should I study "plyr"?  fortune(298) ;-)

       Thanks,
       Spencer

p.s.  library(sos); findFn('aggregate.data.frame') returned 4 matches,
none of which seemed to solve this problem. findFn('aggregate
data.frame') returned 133 matches in 71 packages. findFn('aggregate')
returned 734 matches in 282 packages.  I failed to find anything useful
in the latter two and with other attempts using RSiteSearch, except for
a reference to plyr.

--
Spencer Graves, PE, PhD
President and Chief Technology Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph:  408-655-4567
web:  www.structuremonitoring.com


       You can try all the different "method" options for "maxLik", pick
the best answer, and then iterate once more from that answer to make sure
you can't get something better.  If you want a hessian and don't get one
from the best answer, you can feed the best answer to the "hessian"
function to get that.
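
       The same workflow can be sketched with base R's optim() and
optimHess() standing in for maxLik, so this is an analogy and not maxLik
code.  The simulated data, the normal model and the names obs, negLogLik,
fits, best and refit are all assumptions made for illustration.

```r
# Simulated data, purely for illustration
set.seed(1)
obs <- rnorm(200, mean = 2, sd = 1.5)

# Negative log-likelihood of a normal sample; theta = c(mean, log(sd))
negLogLik <- function(theta) {
  -sum(dnorm(obs, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}

start   <- c(0, 0)
methods <- c("Nelder-Mead", "BFGS", "CG")

# Fit once per method and keep the fit with the smallest objective
fits <- lapply(methods, function(m) optim(start, negLogLik, method = m))
best <- fits[[which.min(sapply(fits, function(f) f$value))]]

# Iterate once more from the best answer to check for further improvement
refit <- optim(best$par, negLogLik, method = "BFGS")

# Numerical Hessian of the objective at the final estimate
H <- optimHess(refit$par, negLogLik)
```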

       Also, if you get a singular hessian, that says that you cannot
estimate as many parameters as you have.  If you have p parameters and
the hessian is of rank p0 < p, that says you can only estimate p0
parameters independently.  You can use the "fnSubset" function to fix
(p-p0) parameters at specific values and see what you get from varying
the others.
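
       A small base R illustration of that situation, assuming a
deliberately over-parameterized model in which only the sum of the two
parameters is identified.  The post refers to maxLik's fnSubset for
fixing parameters; this sketch uses a plain closure instead, and the
names obs, negLogLik, p0 and fixed2 are made up for the example.

```r
set.seed(2)
obs <- rnorm(100, mean = 3)

# Only theta[1] + theta[2] affects the likelihood, so p = 2 but p0 = 1
negLogLik <- function(theta) {
  -sum(dnorm(obs, mean = theta[1] + theta[2], log = TRUE))
}

fit <- optim(c(0, 0), negLogLik, method = "BFGS")
H   <- optimHess(fit$par, negLogLik)

# Numerical rank: count eigenvalues that are not negligibly small
ev <- eigen(H, symmetric = TRUE)$values
p0 <- sum(abs(ev) > 1e-6 * max(abs(ev)))

# Fix the second parameter at a chosen value and vary only the first
fixed2 <- 0
fit1 <- optim(0, function(t1) negLogLik(c(t1, fixed2)), method = "BFGS")
```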

       Estimating hessians can be error-prone whether you use numeric or
analytic derivatives.  The "compareDerivatives" function can help you
check analytic derivatives for software bugs.
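
       The idea behind such a check can be sketched in base R by
comparing an analytic gradient with central finite differences.  This is
not the compareDerivatives code; numGrad, logLikFn and gradFn are
hypothetical names, and the normal model is an assumption for the
example.

```r
set.seed(3)
obs <- rnorm(50, mean = 1, sd = 2)

# Log-likelihood of a normal sample; theta = c(mean, log(sd))
logLikFn <- function(theta) {
  sum(dnorm(obs, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}

# Analytic gradient of logLikFn with respect to theta
gradFn <- function(theta) {
  s <- exp(theta[2])
  c(sum(obs - theta[1]) / s^2,
    -length(obs) + sum((obs - theta[1])^2) / s^2)
}

# Central-difference approximation to the gradient of f at theta
numGrad <- function(f, theta, eps = 1e-6) {
  sapply(seq_along(theta), function(i) {
    h <- replace(numeric(length(theta)), i, eps)
    (f(theta + h) - f(theta - h)) / (2 * eps)
  })
}

theta0     <- c(0.5, 0.3)
maxAbsDiff <- max(abs(gradFn(theta0) - numGrad(logLikFn, theta0)))
```

       A large maxAbsDiff relative to the size of the gradient would
point to a bug in the analytic derivative.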

       Hope this helps.
       Spencer
