In this first post, I will show you a simple script that performs a frequency analysis of R packages. With this analysis you will find out which R packages are used or called most often in a collection of R files (.R and .Rmd files). I will use my own R files as an example.

At the end of the post, I will show you the results that I got using the R files that I developed from 2012 until the middle of 2018. I will also show you the R code that I used, in case you want to perform the same analysis using your own R files.

We will cover these easy steps:

1) How to read the content of all the R script files, and how to look for the words: library() or require()
2) How to extract each package xxxxx that has been called with library(xxxxx) or require(xxxxx) and store it in a table
3) Then we will aggregate the packages
4) And finally, we will plot the results

Step 1: a listing of all files

First of all, you need to gather all the R and Rmd files that you want to use for the package frequency analysis. Copy all the files into a folder; it can contain subfolders if you want. In this example (for OS X), this folder will be: ~/choose/your/working/directory/.

Then open RStudio and set your working directory to that folder.
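As a minimal sketch of this step, assuming the example folder from above (replace it with your own path), the working directory can be set like this. The dir.exists() guard is my addition so the snippet does not fail when the placeholder folder is missing:

```r
# Placeholder folder from the example; replace it with your own path.
wd <- "~/choose/your/working/directory/"

# Only change directory if the folder really exists.
if (dir.exists(wd)) {
  setwd(wd)
}

# Show where we are working now.
getwd()
```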

Next, list all the files with extension “.R” or “.Rmd” inside ~/choose/your/working/directory/. The paths to these files will be stored in the vector FILES.
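One way to build FILES is with list.files(). This is a sketch, not necessarily the exact call used for the results in this post. It assumes the working directory is already set, and ignore.case = TRUE is my own choice so that files ending in ".r" or ".rmd" are also picked up:

```r
# Collect every file ending in .R or .Rmd under the working directory,
# including subfolders. FILES is the vector name used in this post.
FILES <- list.files(
  path = ".",
  pattern = "\\.(R|Rmd)$",
  recursive = TRUE,    # also look inside subfolders
  full.names = TRUE,   # keep the path, not only the file name
  ignore.case = TRUE   # assumption: also accept .r and .rmd
)

length(FILES)  # how many files were found
head(FILES)    # a first look at the paths
```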

Step 2: reading the R scripts

Next, a for loop reads the contents of each script file and extracts the names of the packages that have been called using library(xxx) or require(xxx). The result is stored in the data frame WDfreq.

You will get a data frame with these columns:

  • pk: R package
  • rep: number of package calls in each file
  • file: R or Rmd file
  • size: size of the file in bytes
  • mtime: last file modification date-time
  • year: last file modification year
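The loop described above could look roughly like the following sketch. The regular expression call_rx and the helper extract_packages() are hypothetical names that I introduce here for illustration, so treat them as assumptions. Note that this simple pattern also counts calls that appear inside comments.

```r
# Sketch only: call_rx and extract_packages() are illustrative assumptions.
FILES <- list.files(".", pattern = "\\.(R|Rmd)$", recursive = TRUE,
                    full.names = TRUE, ignore.case = TRUE)

# Matches library(pkg), require(pkg), library("pkg") and require('pkg')
call_rx <- "\\b(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"

# Return the package names called in a vector of code lines
extract_packages <- function(lines) {
  hits <- unlist(regmatches(lines, gregexpr(call_rx, lines, perl = TRUE)))
  sub(call_rx, "\\2", hits, perl = TRUE)
}

rows <- list()
for (f in FILES) {
  pkgs <- extract_packages(readLines(f, warn = FALSE))
  if (length(pkgs) == 0) next        # no library()/require() in this file
  counts <- table(pkgs)              # calls per package in this file
  info <- file.info(f)
  rows[[length(rows) + 1]] <- data.frame(
    pk    = names(counts),
    rep   = as.integer(counts),
    file  = f,
    size  = info$size,
    mtime = info$mtime,
    year  = as.integer(format(info$mtime, "%Y")),
    stringsAsFactors = FALSE
  )
}

# One row per package and file; NULL if no calls were found at all
WDfreq <- do.call(rbind, rows)
```

Collecting the per-file data frames in a list and binding them once at the end avoids growing WDfreq inside the loop, which gets slow with many files.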

Step 3: aggregating package names

Next, I will aggregate the packages with the same name (adding up the rep column) to get a summary of the packages I used most frequently. This can be done with the ddply function from the plyr package.
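Here is a small sketch of that aggregation. The tiny WDfreq below is made up so the snippet runs on its own, and the names PKfreq and total are my own choices for illustration:

```r
library(plyr)

# Made-up WDfreq, only to make the sketch runnable on its own
WDfreq <- data.frame(
  pk   = c("ggplot2", "plyr", "ggplot2"),
  rep  = c(2L, 1L, 1L),
  file = c("a.R", "a.R", "b.Rmd"),
  stringsAsFactors = FALSE
)

# Add up the rep column for each package name
PKfreq <- ddply(WDfreq, "pk", summarise, total = sum(rep))

# Most used packages first
PKfreq <- PKfreq[order(-PKfreq$total), ]
PKfreq
```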

The result is a tidy table with one row per package and its total number of calls.

Step 4: plotting results

Now we are in the fun part! Our tidy table is ready, so we can do some plots. I will use ggplot2 to plot my top 50 most used packages:
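A plot of this kind can be sketched as a horizontal bar chart. The counts below are made up for illustration, and the names PKfreq, total and n_top are assumptions, not part of the original analysis:

```r
library(ggplot2)

# Made-up counts, only for illustration
PKfreq <- data.frame(
  pk    = c("ggplot2", "plyr", "knitr"),
  total = c(3, 1, 2)
)

# Keep the n_top most used packages
n_top <- 50
top <- head(PKfreq[order(-PKfreq$total), ], n_top)

# Horizontal bars, ordered by the number of calls
p <- ggplot(top, aes(x = reorder(pk, total), y = total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Package", y = "Number of calls",
       title = "Most used R packages")
print(p)
```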

[Figure: plot of the top 50 most used packages]

Finally, we can get really beautiful results by creating a circle plot. I adapted the code from basic-circle-packing-with-one-level and hide-first-level-in-circle-packing. You may need to install the package packcircles. The result is shown below.
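The idea can be sketched as follows, using circleProgressiveLayout() and circleLayoutVertices() from packcircles. The data are made up again, and the sizes, colours and object names are my own assumptions rather than the exact settings behind the figure:

```r
library(packcircles)
library(ggplot2)

# Made-up counts, only for illustration
PKfreq <- data.frame(
  pk    = c("ggplot2", "plyr", "knitr", "dplyr"),
  total = c(30, 10, 20, 15)
)

# One circle per package, with area proportional to its count
packing <- circleProgressiveLayout(PKfreq$total, sizetype = "area")
circles <- cbind(PKfreq, packing)   # adds the columns x, y and radius

# Points along the outline of each circle, ready for geom_polygon()
vertices <- circleLayoutVertices(packing, npoints = 50)

p_circ <- ggplot() +
  geom_polygon(data = vertices,
               aes(x, y, group = id, fill = factor(id)),
               colour = "black", alpha = 0.6) +
  geom_text(data = circles, aes(x, y, label = pk, size = total)) +
  scale_size_continuous(range = c(2, 5)) +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none")
print(p_circ)
```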

[Figure: circle packing plot of the most used packages]

Export plot

If you want to output the plot you can do:
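For a ggplot2 plot, one option is ggsave(). In this sketch the file name, size and resolution are placeholders that you should adapt, and the small plot p only stands in for your own plot:

```r
library(ggplot2)

# Stand-in plot with made-up counts; use your own plot object instead
p <- ggplot(data.frame(pk = c("ggplot2", "plyr"), total = c(3, 1)),
            aes(x = pk, y = total)) +
  geom_col()

# Placeholder output path; here it points to a temporary folder
out_file <- file.path(tempdir(), "package_frequency.png")

# Width and height are in inches by default
ggsave(out_file, plot = p, width = 8, height = 10, dpi = 300)
```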

And that’s all, I hope you enjoyed it!!


Session Info:

Appendix, all the code:
