R manipulates and analyzes data structures readily

One of the main things I noticed when I first started using R was how much better it was than other languages at running data through a series of functions or processing steps. It was clearly designed for this, which you would hope for in a language oriented toward data and statistics. It also inspired me to learn how to set up more functional, data-oriented structures in other languages when I need them. It helped me figure out how to use Bash better, for example: Bash Coding Style.

Basic format of a function

Functions are used in various ways throughout this tutorial, so a brief overview is in order. Functions in R do not require return statements; if return() is never called, the value of the last expression evaluated in the function body is returned.
Typical way to define a named function:

fcnName <- function(){ 
   ## content  
}
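
To illustrate the implicit return described above, here is a minimal sketch. The names add_two and add_two_explicit are made up for this example; the two functions should behave identically.

```r
# Hypothetical example: implicit vs. explicit return
add_two <- function(x) {
  x + 2            # last expression evaluated, so this is the return value
}

add_two_explicit <- function(x) {
  return(x + 2)    # same result, written with an explicit return()
}

add_two(3)           # 5
add_two_explicit(3)  # 5
```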

An anonymous function

I make use of these frequently when processing data with various map/apply function constructs, so it is useful to be aware of this syntax. Usually this is used within an apply/map call so there is no reason to name it.

function(x, y) {  
  c(x, y)  
}
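
As a rough sketch of how an anonymous function usually appears in practice, the example below passes one straight into lapply. The list xs and the result name ranges are assumptions made for this illustration.

```r
# Hypothetical data: a small named list of numeric vectors
xs <- list(a = 1:3, b = c(4, 9, 6))

# The function is defined inline and never given a name
ranges <- lapply(xs, function(x) max(x) - min(x))

ranges$a  # 2
ranges$b  # 5
```

R 4.1 and later also accept \(x) as shorthand for function(x), which can make short inline functions easier to scan.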

Closures

A closure is a function with data attached from the enclosing environment.
I do not use these often in R, but they are still good to be aware of in general, and in case I use one elsewhere in this write-up.

power <- function(exponent) {  
  function(x) {  
    x ^ exponent  
  }
}

square <- power(2)
square(4)  # returns 16
cube <- power(3)
cube(4)    # returns 64

Memory in closures

You can give a function persistent data by closing over a variable whose purpose is to store that data. The function then ‘remembers’ the value, or any changes to the data structure, between calls. The example here is a simple counter that starts from 0.

new_counter <- function() {  
  i <- 0  
  function() {  
    i <<- i + 1 # <<- assigns to the i in the enclosing function
    i  
  }
}
count1 <- new_counter()

Each time count1() is called, it increments and remembers the current count. This construct allows each closure produced by new_counter() to track its own internal variables independently.
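
To see the independent tracking in action, here is a small sketch that repeats the counter definition so it runs on its own and adds a second counter, count2, which is a name introduced only for this illustration.

```r
new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # update the i that belongs to this particular closure
    i
  }
}

count1 <- new_counter()
count2 <- new_counter()   # hypothetical second counter

count1()  # 1
count1()  # 2
count2()  # 1: count2 has its own i, untouched by count1
```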

Apply functions

These are very useful to me because they apply a function to every member of a data structure. This allows the data provided to vary greatly while the same operations are still carried out. It also helps avoid the mistakes that can occur when walking through an unfamiliar data structure by hand.
The important thing to remember when using the apply functions is that each one has a different way of consuming the input data structure and building the output.

Ways to use functions with lapply

lapply applies a function to each element of a list and returns a list of the results. You can use an anonymous function or a defined one, but the format differs for the two cases.

# Use an anonymous function; lapply passes each list item to it as x.
lapply(compute_mean, function(x) { x + 5 })
# Alternate form using a defined function. lapply supplies x itself,
# so only the function name is passed.
call_fun <- function(x) { x + 5 }
lapply(compute_mean, call_fun)
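
lapply also forwards any further arguments to the function on every call, which is often shorter than wrapping it. A minimal sketch, assuming a made-up list named scores:

```r
# Hypothetical data with a missing value
scores <- list(math = c(90, 85, NA), art = c(70, 80, 75))

# Arguments after the function are passed on to every call
means <- lapply(scores, mean, na.rm = TRUE)

# Equivalent form with an anonymous function
means2 <- lapply(scores, function(x) mean(x, na.rm = TRUE))

means$math  # 87.5
means$art   # 75
```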

More ways to use lapply

You can get the value, index, or name of the current element of the list and operate on those instead of the values, so there are lots of different ways to use lapply.

# General format: lapply(data, fun, options)
lapply(xs, function(x) {})             # pass the value
lapply(seq_along(xs), function(i) {})  # pass the index
lapply(names(xs), function(nm) {})     # pass the name
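
Here is a small sketch of the three patterns with concrete function bodies. The list xs and the result names are assumptions made for this illustration.

```r
# Hypothetical named list
xs <- list(a = 10, b = 20, c = 30)

by_value <- lapply(xs, function(x) x * 2)                        # works on each value
by_index <- lapply(seq_along(xs), function(i) xs[[i]] + i)       # works on each position
by_name  <- lapply(names(xs), function(nm) paste(nm, xs[[nm]]))  # works on each name

by_value$b     # 40
by_index[[3]]  # 33
by_name[[1]]   # "a 10"
```

One thing to watch for: only the value form keeps the names of xs on the result. The index and name forms return unnamed lists.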

Multiple input functionals: Map and mapply

Use Map/mapply when there are two or more lists/data frames to process in parallel. The concept with both is the same: you want to use input from data structures with the same shape, but you cannot process each one individually because input is needed from both/all of them at once.
Map with some generated data:

# Create some fake data for demo purposes
xs <- replicate(5, runif(10), simplify = FALSE)  
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)  
# run both data sets through a function, taking first, second, third,  
# items of each at the same time.  
Map(weighted.mean, xs, ws)

For Map the function is passed as the first argument, unlike lapply where it is the second argument.
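
Map and mapply differ mainly in the shape of what they return: Map always gives back a list, while mapply tries to simplify the result to a vector or matrix. A minimal sketch with small made-up inputs, chosen so the results are easy to check by hand:

```r
# Hypothetical values and weights
xs <- list(c(1, 2, 3), c(4, 5, 6))
ws <- list(c(1, 1, 1), c(0, 0, 1))

as_list   <- Map(weighted.mean, xs, ws)     # list of two results
as_vector <- mapply(weighted.mean, xs, ws)  # simplified to a numeric vector

as_list[[2]]  # 6
as_vector     # 2 6
```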

Map with some fixed and some variable function inputs

If some of the arguments need to be fixed, use an anonymous function:

Map(function(x, w) weighted.mean(x, w, na.rm = TRUE), xs, ws)  

So in this case I am still using weighted mean, but I wrapped an anonymous function around it so I could give it arguments besides just the values from the two lists. I do this oftentimes when I want to pass in a single configuring value that will remain the same.

More Map and mapply examples

# these use the anonymous function pattern I showed above so that  
# they can configure fixed arguments to the function they are calling.  
curFilePaths <- Map(function(file) paste(path,file,sep=""),curFiles)
curData <<- Map(function (datapath) readBin(datapath,double(),size=configs$dataByte,n=configs$readNum,endian = "little"), curFilePaths)
invisible(Map(function(intercept) abline(v=intercept,lty=2,lwd=2.0),vertlocations))
# In this case I needed to pass an outside var (curTraces) to each call,  
# not map over it, so I also use the anonymous fcn pattern.  
groupindx<-Map(function(grouppat) grep(grouppat, curTraces), groupuse)

# The unname fcn wrapping the whole thing is just to format the mapply output
labels=unname(mapply(function(tracepath) sub(".dat","",sub(".*-","",tracepath)),names(curData)))

Map example with some test data
Above I showed how the format would work, but with these you can see it in action with some test data.

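# diffinv() returns one more value than its input, so rnorm(999) yields 1000 points
data1 <- (sin(seq(-20, 20, length.out = 1000)) * 10) + diffinv(rnorm(999))
data2 <- (sin(seq(-20, 20, length.out = 1000)) * 10) + diffinv(rnorm(999))
signalDiff <- Map(function(obs1, obs2) obs1 - obs2, data1, data2)
# Map returns a list, so flatten it before plotting
plot(seq_along(signalDiff), unlist(signalDiff), type = "l")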

Map with nested calls

The outer Map passes a single set of values to the inner Map. The inner Map then runs through its full set of mappings before the outer function grabs the next data. This is useful when the data in one set (the inner) needs to be applied to each part of some other data (the outer).
An example of how this works:
Set up some data; this will be used for the nested calls.

persistence <- sample(1:15,10,replace=TRUE)  
epochBounds <- matrix(c(seq(1,901,100),seq(100,1000,100)),ncol=2)  
data1 <- Map(function(persist) (sin(seq(-20, 20, length=1000)) *10) + (filter(rnorm(1000),   
filter=rep(1,persist), circular=TRUE)*2),persistence)  
data2 <- Map(function(persist) (sin(seq(-20, 20, length=1000)) *10) + (filter(rnorm(1000),   
filter=rep(1,persist), circular=TRUE)*2),persistence)  

Perform the nested Map operation

The outer Map passes the first pair of data lists, then the second, and so on. The inner Map uses epochBounds to calculate means for different sections of the data lists, which are then returned. This way the means can be calculated over the same sections in each of a bunch of arrays.

meanResult <- Map(function(trace1, trace2)
  Map(function(minindx, maxindx)
    c(mean(trace1[minindx:maxindx]), mean(trace2[minindx:maxindx])),
    epochBounds[, 1], epochBounds[, 2]),
  data1, data2)

This can be tough to read

It is hard to read/figure out what is going on, but if you can remember that the inner map is passed each argument from the outer and then maps all its input for each one, it helps.
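
One way to make the nesting easier to follow is to give the inner step a name and try it on tiny inputs first. The sketch below is a simplified single-trace variant, not the code above; traces, starts, ends and epoch_means are all hypothetical names.

```r
# Hypothetical, tiny stand-ins for the traces and epoch bounds
traces <- list(t1 = 1:6, t2 = 11:16)
starts <- c(1, 4)
ends   <- c(3, 6)

# Inner step: mean of one trace over every (start, end) section
epoch_means <- function(trace) {
  Map(function(lo, hi) mean(trace[lo:hi]), starts, ends)
}

# Outer step: run the inner step once per trace
result <- Map(epoch_means, traces)

result$t1[[1]]  # 2, the mean of 1:3
result$t2[[2]]  # 15, the mean of 14:16
```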

Matrix and array summary techniques:

apply() - use with matrices and arrays. Collapses a chosen dimension, reducing
each row or column to a single value with a summary function

a <- matrix(1:20, nrow = 5)  
apply(a, 1, mean)  # summarize rows
apply(a, 2, mean)  # summarize columns
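
To make the margin argument concrete, here is the same matrix with the results worked out. The names row_means and col_means are introduced only for this sketch.

```r
a <- matrix(1:20, nrow = 5)   # filled column by column: 1:5, 6:10, 11:15, 16:20

row_means <- apply(a, 1, mean)   # one value per row: 8.5 9.5 10.5 11.5 12.5
col_means <- apply(a, 2, mean)   # one value per column: 3 8 13 18
```

For this particular summary, rowMeans() and colMeans() are built-in shortcuts that give the same numbers.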

sweep() -
sweeps a summary statistic out of an array, for example by subtracting it
from or dividing it into each row, so it is often used with apply for fast standardization.

x <- matrix(rnorm(20, 0, 10), nrow = 4)  
x1 <- sweep(x, 1, apply(x, 1, min), `-`)  
x2 <- sweep(x1, 1, apply(x1, 1, max), `/`)  
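
A quick way to convince yourself that the two sweeps rescale every row to the 0 to 1 range is to check the row minimums and maximums afterwards. A sketch, with an arbitrary seed added only so it is repeatable:

```r
set.seed(1)  # arbitrary seed, an assumption for repeatability
x  <- matrix(rnorm(20, 0, 10), nrow = 4)
x1 <- sweep(x, 1, apply(x, 1, min), `-`)    # subtract each row's minimum
x2 <- sweep(x1, 1, apply(x1, 1, max), `/`)  # divide by each row's new maximum

row_min <- apply(x2, 1, min)   # every entry should be 0
row_max <- apply(x2, 1, max)   # every entry should be 1
```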

outer() -
runs an input function over every combination of the two inputs.
outer(1:3, 1:10, "*")
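
The result of outer is a matrix with one row per element of the first input and one column per element of the second. A small sketch, where times_table and dist_table are names made up for the example:

```r
times_table <- outer(1:3, 1:4, "*")
dim(times_table)    # 3 4
times_table[2, 3]   # 6, which is 2 * 3

# Any vectorized two-argument function can be used (hypothetical example)
dist_table <- outer(1:3, 1:4, function(a, b) abs(a - b))
dist_table[3, 1]    # 2
```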

plyr package

Seeks to create a standardized set of functions that don't have all the little differences and variations of the apply family. It covers the possible R data structures more comprehensively and in a neat fashion. The first letter of each function name gives the input type and the second letter gives the output type. Here is how they cover the options:

input \ output   list      data frame   array
list             llply()   ldply()      laply()
data frame       dlply()   ddply()      daply()
array            alply()   adply()      aaply()

Manipulation of lists

R has the list manipulation functions that you would expect from a language that focuses on list/functional processing.
Map() - Apply a function to the members of lists.
Reduce() - Reduce a vector to a single value. Takes any function that
accepts two values and applies it successively across the entire vector.
Reduce(sum, 1:15)
Filter() - Select the elements of a list or vector for which a test function
returns TRUE.
Find(), Position() - Very similar to Filter(), but Find() returns the
first matching element and Position() returns the location of the first match.
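
Here is a short sketch of these functions side by side on one made-up vector. The vector xs and the test function is_big are assumptions made for illustration.

```r
xs <- c(3, 8, 1, 12, 7)                 # hypothetical input
is_big <- function(x) x > 5             # hypothetical test function

total   <- Reduce(`+`, xs)                     # 31
running <- Reduce(`+`, xs, accumulate = TRUE)  # 3 11 12 24 31
kept    <- Filter(is_big, xs)                  # 8 12 7
first   <- Find(is_big, xs)                    # 8
where   <- Position(is_big, xs)                # 2
```

Passing accumulate = TRUE to Reduce keeps every intermediate result, which is handy for running totals.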