Quickly Generating Lots of Realistic Random Data in R


In this brief post, I show a trick for quickly assembling arbitrarily large samples of real-world data by sampling from all of the data sets included in R packages.

First we need to install and load packages that come bundled with lots of data sets. I’ve selected 18 here. The ipak function handles installing and loading many packages at a time.

packages = c("UScensus2000tract","MPV","MindOnStats","PASWR","NetData",

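ipak <- function(pkg){
    #Install any requested packages that are not installed yet
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    #Attach every package; returns TRUE/FALSE for each one
    sapply(pkg, require, character.only = TRUE)
}

ipak(packages)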
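Before installing anything, it can help to see which packages are actually missing. The following is a minimal sketch rather than part of the original workflow; `missing_packages` is a hypothetical helper name, and the second package name in the example is a made-up placeholder.

```r
# Hypothetical helper (not in the original post): report which of the
# requested packages are not installed yet, without installing anything.
missing_packages <- function(pkg) {
    installed <- rownames(installed.packages())
    setdiff(pkg, installed)
}

# "stats" ships with R; the second name is a made-up placeholder.
missing_packages(c("stats", "notARealPackage12345"))
```

Running this first gives a dry run of what ipak would try to install.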

The data() command returns a listing of all the data sets in the packages currently attached to the session. Its results component is a matrix with one row per data set.

datasets_names <- data()$results #Matrix describing every data set in the attached packages
datasets_list = list()
for(q in datasets_names[,"Item"]){
    #Items listed as "name (file)" or objects that fail to load raise an error and are skipped
    try({
        data(list = q)
        datasets_list[[q]] <- get(q)
    }, silent = TRUE)
}
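To get a feel for what data() hands back, here is a small sketch restricted to the built-in datasets package so that it runs on any R installation. Loading into a scratch environment instead of the global workspace is my own assumption, not something the original code does, and the names `info`, `scratch` and `datasets_demo` are purely illustrative.

```r
# Look only at the built-in "datasets" package so the example runs anywhere.
info <- data(package = "datasets")$results
colnames(info)            # includes "Package", "Item" and "Title"
head(info[, "Item"])      # names of the first few data sets

# Assumption: load into a scratch environment rather than the global workspace.
scratch <- new.env()
data(list = "mtcars", package = "datasets", envir = scratch)
datasets_demo <- list(mtcars = get("mtcars", envir = scratch))
dim(datasets_demo$mtcars)
```

Keeping the loaded objects in their own environment avoids overwriting anything you already have in your workspace.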

Finally, I sample from all of the assembled data sets. Some data sets have more columns than others, so I first pick a data set at random, giving each an equal chance, and then randomly sample a column from that data set.

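n=1000 #Number of desired rows
k=1000 #Number of desired columns

data_generated = data.frame(x=1:n) #Get the dataframe started
count = 0
while(count < k) {
    count = count + 1 #Every attempt counts, so failed draws leave fewer than k columns
    #Data sets that are not data frames (e.g. lists, time series) raise an error here and are skipped
    try({
        i=sample.int(length(datasets_list), size=1)
        y1=sample.int(ncol(datasets_list[[i]]), size=1)
        data_generated[,ncol(data_generated)+1] <- sample(datasets_list[[i]][,y1], size=n, replace = TRUE)
        names(data_generated)[ncol(data_generated)] <- colnames(datasets_list[[i]])[y1] #Keep the original name
    }, silent = TRUE)
}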

dim(data_generated) #Check the final dimensions
head(data_generated) #Check the columns
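To see why the draw happens in two stages (data set first, column second), here is a small illustration of the selection probabilities. It uses a toy stand-in for datasets_list built from two data sets that ship with R; `toy_list`, `two_stage` and `pooled` are names I made up for the sketch.

```r
# Toy stand-in for datasets_list: a wide data set and a narrow one.
toy_list <- list(mtcars = mtcars, women = women)
n_cols <- sapply(toy_list, ncol)              # 11 and 2 columns

# Two-stage draw: pick a data set (1/2 each), then one of its columns.
two_stage <- rep(1 / length(toy_list) / n_cols, times = n_cols)

# Pooled alternative: every column of every data set is equally likely.
pooled <- rep(1 / sum(n_cols), sum(n_cols))

# Total probability that a drawn column comes from each data set.
source_set <- rep(names(toy_list), times = n_cols)
tapply(two_stage, source_set, sum)            # each data set gets 1/2
tapply(pooled, source_set, sum)               # mtcars gets 11/13, women 2/13
```

With pooling, wide data sets would dominate the result. The two-stage draw gives every data set the same weight, at the cost of over-representing individual columns from narrow data sets.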

The final product is a data frame with n rows and at most k sampled columns, carrying the original variables’ names. It may have fewer than the requested k columns because some packages include data not stored as a data frame (e.g. lists or time series objects), which are caught and excluded by the try statement. Sampling within a column is random, so even if two variables from the same data set are included, their order is shuffled, breaking any joint distribution between them. A nice (or, for your application, potentially problematic) property of sampling data this way is that the result looks like real-world data: it can include missing values, nonfinite values, text, factors, unit IDs, and variables from many different distributions. With a little work, a function wrapper could let you include or exclude specific types based on your needs.
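As a rough idea of what such a wrapper might look like, here is a minimal sketch. It is not the code used above: `sample_columns`, `keep_classes` and `max_tries` are hypothetical names, and it assumes the input is a list like datasets_list. Unlike the loop above, it keeps drawing until it has k columns (or runs out of tries) and appends a suffix so repeated names stay distinct.

```r
# Hypothetical wrapper (a sketch): draw k columns of length n, keeping only
# columns whose class is listed in keep_classes.
sample_columns <- function(datasets, n, k,
                           keep_classes = c("numeric", "integer"),
                           max_tries = 100 * k) {
    out <- data.frame(x = seq_len(n))
    tries <- 0
    while (ncol(out) - 1 < k && tries < max_tries) {
        tries <- tries + 1
        d <- datasets[[sample.int(length(datasets), size = 1)]]
        # Skip anything that is not a non-empty data frame
        if (!is.data.frame(d) || ncol(d) == 0 || nrow(d) == 0) next
        j <- sample.int(ncol(d), size = 1)
        if (!inherits(d[[j]], keep_classes)) next
        # Index by row position so a one-row numeric column is not read as 1:x
        rows <- sample.int(nrow(d), size = n, replace = TRUE)
        out[[paste0(names(d)[j], "_", ncol(out))]] <- d[[j]][rows]
    }
    out
}

set.seed(1)
demo <- sample_columns(list(iris = iris, mtcars = mtcars), n = 50, k = 5)
dim(demo)
sapply(demo[-1], class)
```

Passing keep_classes = "factor" or "character" instead would restrict the draw to categorical or text columns.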

