R - my way through R hell
14 May 2018
Being newbie or experienced on R, we all face many recurring issues with R syntax, R interpreter and R pitfalls. Patrick BURNS book is worth reading and allows to become really aware of many commons pitfalls. Nevertheless, I really believe that this ends up to be necessary but still insufficient. That’s why, I would like to share my way through R hell and describe here most important practices I stick to, and the reasons to do so.
My goals when creating a software, small or large
My goals may vary greatly from one need to another. Nevertheless, I produce code only when I want to experiment a new idea, or creating a new package.
To do so, I stick rigorously to following strategy
- keep it as simple as possible
- code the tests first, and automate the full test suite, to get higher reliability quickly
- implement the code in a neat way as soon as possible
- track computing performance
- document results and limits
Keep it as simple as possible
I often use the divide and conquer approach to split up the whole need, to a set of functionalities, autonomous or intertwined, depending of the case. Many well known tactics from software engineering are useful here, and not so commonly used by R programmers. I use following ones, and frankly, they bring great results in a really short period of time
-
File naming Convention
I use Camel capitalized filenames for source code related to a s3, s4 or R6 class, thus mimetic Java convention, 1 file per class, 1 class per file. For example, if I need to manage some file system operations, I will create a class namedFileSystem.R
that will hold all the methods that relate to file system handling. This result matches all the needs, about file system, short or long ones, and abstract the underlying operating system.
Along this file, I will also create another file namedFileSystem_UT.R
which holds the unit tests related to the class. Here I try to cover the full scope of methods declared in the former class, and to cover the full test coverage in order to manage effectively cyclomatic complexity. -
R Class patterns
I personally create onlyenvironment
orR6
Advanced Programming classes and completely avoid other types of classes. -
Do not mix R approaches
Whatever the problem I am trying to tackle, I determine from the start a strategy and decide of the packages to be used. For example, if I need to create diagrams, I could opt forR base
orlattice
orggplot2
orvery specialized R
diagrams. I tend to avoid any mix of these ones, as it turned out to be a pure brain torture to remind what are the options, good practices to apply, and pitfalls to avoid. That’s why I generally 85% of times useggplot2
. If you use it and create dedicated methods to produce diagrams, and applied recommendations from step 1, then it magically become much easier to reuse. You may restart from an existing implementation and just improve or modify it according to your need, thus gaining a lot of time. -
R functions R function have to be kept short. That’s mandatory to master their complexity and to get reusable and extensible results. My personal upper limit is 10 lines when coding under functional programming style. The limit comes more from understandability than from the number of lines. I tend to distinguish very clearly function parameters, which I often suffixes with
_<i|d[e|r6|f|fa|b|...>
to express their expected type. This allows several gains, at a very low cost. First, local variables and parameters can not anymore be confused. Second, as the pattern expressed a type, then it becomes easier to detect type mismatches. Third, it also sheds light to function signatures that really require type polymorphism.
Code the test first
There are many sites debating about the value of test/behavior driven test development. My way is much simpler. The test is the use case for your class. So, work on it, up to the time where syntax is a simple as possible YAGNI. That way, you will avoid great loss of times, due to coding of useless functionalities, and due to bug tracking and resolution. Moreover, this approach oblige you to express test data and to think to it, from the start. That’s a good point, generally under estimated in time and effort.
If you have a full battery of tests, then organize them, so you can fire and replay all|any|part of them easily. This helps a lot in non regression testing.
Implement the code in a neat way as soon as possible
Code should be easy to read, in particular when coming back on it several days, weeks or month later. Define and apply your style. Personally, I privilege constructs that shorten the code, in regards to
- functional rectitude
- usability
- performance
- security
That’s why I stick to {
on the same line as the function name, and uses return statement
only if not at the last line of the function. I prefer to rely on standard code behavior, wherever possible. This simplified a lot the coding process. I tend also to create a lot of 1-line functions.
Track computing performance
If you do not know where your code spend time, then you’re facing a huge issue. You may use functional programming techniques to ease that task. If you do not know, what costs the most in term of performance, then how could you optimize it ?
Document results and limits
You should be able to retrieve/reread results of previously ran tests, without having to rerun them. That’s a great loss of time to rerun a tests several days, weeks, or month later. In particular, if the test context (operating system, R version, package version) have changed. How messy t will be. That’s why, save results in rda format
or text format
. A simple result.txt
, capturing the R console output is generally sufficient. Think of it, and do it.