---
title: Profiling and benchmarking
layout: default
---
```{r, echo = FALSE}
options(digits = 3)
```
# Profiling and performance optimisation {#profiling}
> "We should forget about small efficiencies, say about 97% of the time:
> premature optimization is the root of all evil" --- Donald Knuth.
Your code should be correct, maintainable and fast. Notice that speed comes last: if your function is incorrect or unmaintainable (i.e. it will eventually become incorrect), it doesn't matter how fast it is. As computers get faster and R is optimised, your code will get faster all by itself. Your code is never going to automatically become correct or elegant if it is not already.
Like JavaScript, the vast majority of R code is poorly written and slow. This sounds bad, but it's actually a positive! There's no point in optimising code until it's actually a bottleneck: most R code should be incredibly inefficient, because even inefficient code is usually fast enough. If most R code were efficient, it would be a strong signal that R programmers were prematurely optimising, spending time making their code faster instead of solving real problems. Additionally, most people writing R code are not programmers. Many don't have any formal training in programming or computer science, but are using R because it helps them solve their data analysis problems.
This means that the vast majority of R code can be re-written in R to be more efficient. This often means vectorising code, or avoiding some of the most obvious traps discussed in the [R inferno](http://www.burns-stat.com/documents/books/the-r-inferno/). There are also other strategies, like caching/memoisation, that trade space for time. Beyond that, a basic knowledge of data structures and algorithms can help you come up with alternative strategies.
This applies not only to packages, but also to base R code. The focus of R's development has been on making a useful tool, not a blazingly fast programming language. There is huge room for improvement, and base R is only going to get faster over time.
That said, there are times when you need to make your code faster: spending several hours of your day might save days of computing time for others. The aim of this chapter is to give you the skills to figure out why your code is slow, what you can do to improve it, and how to ensure that you don't accidentally make it slow again in the future. You may already be familiar with `system.time()`, which tells you how long a block of code takes to run. This is a useful building block, but it's a crude tool.
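To see the kind of information `system.time()` gives, here's a minimal example (the timings you get will vary by machine):

```{r, eval = FALSE}
# Time generating and sorting ten million random numbers.
# system.time() reports user, system, and elapsed times in seconds.
x <- runif(1e7)
system.time(sort(x))
```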
Making fast code is a four-part process:

1. Profiling helps you discover which parts of your code are taking up the most time.
2. Microbenchmarking lets you experiment with small parts of your code to find faster approaches (see the sketch after this list).
3. Timing helps you check that the micro-optimisations have a macro effect, and helps you experiment with larger changes (like totally rethinking your approach).
4. A performance testing tool makes sure your code stays fast in the future (e.g. [Vbench](http://wesmckinney.com/blog/?p=373)).
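As a taste of step 2, here's a minimal sketch using the `microbenchmark` package (assuming it's installed) to compare two ways of computing column means:

```{r, eval = FALSE}
library(microbenchmark)
m <- matrix(runif(1e4), ncol = 10)
# Each expression is run many times and the distribution of timings reported.
microbenchmark(
  apply    = apply(m, 2, mean),
  colMeans = colMeans(m)
)
```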
Along the way, you'll also learn about the most common causes of poor performance in R, and how to address them. Sometimes there's no way to improve performance within R, and you'll need to use C++, the topic of [Rcpp](#rcpp).
Having a good test suite is important when tuning the performance of your code: you don't want to make your code fast at the expense of making it incorrect. We won't discuss testing any further in this chapter, but we strongly recommend having a good set of test cases written before you begin optimisation.
Good exploration from Winston: http://rpubs.com/wch/3797
Find out what is slow. Then make it fast.
[Mature optimisation](http://carlos.bueno.org/optimization/mature-optimization.pdf) (PDF)
A recurring theme throughout this part of the book is the importance of differentiating between absolute and relative speed, and between fast and fast enough. First, whenever you compare the speed of two approaches to a problem, be very wary of just looking at relative differences. One approach may be 10x faster than another, but if that difference is between 1 ms and 10 ms, it's unlikely to have any real impact. You also need to think about the costs of modifying your code. For example, if it takes you an hour to implement a change that makes your code 10x faster, saving 9 seconds per run, then you'll need to run it at least 400 times before you see a net benefit. At the end of the day, what you want is code that's fast enough to not be a bottleneck, not code that is fast in any absolute sense. Be careful that you don't spend hours to save seconds.
## Performance profiling
R provides a built-in tool for profiling: `Rprof()`. When active, it records the current call stack to disk every `interval` seconds. This provides a fine-grained report showing how long each function takes. The function `summaryRprof()` provides a way to turn this list of call stacks into useful information, but I don't think it's terribly useful, because it makes it hard to see the entire structure of the program at once. Instead, we'll use the `profr` package, which turns the call stack into a data frame that is easier to manipulate and visualise.
Example showing how to use profr.
Sample pictures.
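Until those are fleshed out, here's a rough sketch of both approaches (the expression being profiled is just a stand-in):

```{r, eval = FALSE}
# Built-in profiling: sample the call stack to a file, then summarise it.
Rprof("profile.out", interval = 0.02)
x <- runif(1e6)
out <- sapply(1:100, function(i) mean(x + i))
Rprof(NULL)  # turn profiling off
summaryRprof("profile.out")

# profr wraps the same mechanism but returns a data frame, one row per
# function call, which is easier to manipulate and visualise.
library(profr)
p <- profr(sapply(1:100, function(i) mean(x + i)))
head(p)
plot(p)
```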
## Timing
## Performance testing
## Caching
`readRDS`, `saveRDS`, `load`, `save`
Caching packages
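For example, a simple disk cache: compute an expensive result once, save it with `saveRDS()`, and reload it with `readRDS()` on later runs. This is a minimal sketch; the file name is hypothetical and `lm()` stands in for a slow computation.

```{r, eval = FALSE}
cache_file <- "expensive_result.rds"     # hypothetical cache location
if (file.exists(cache_file)) {
  result <- readRDS(cache_file)          # cheap: load the saved result
} else {
  result <- lm(mpg ~ ., data = mtcars)   # stand-in for a slow computation
  saveRDS(result, cache_file)            # save it for next time
}
```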
### Memoisation
A special case of caching is memoisation, where a function caches its own results so that repeated calls with the same arguments return the stored value instead of recomputing it.
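A minimal hand-rolled sketch (the `memoise` package on CRAN offers a general-purpose version of this pattern):

```{r, eval = FALSE}
# Store results in an environment keyed by the argument, so repeated
# calls with the same input return the cached value immediately.
memo_f <- local({
  cache <- new.env(parent = emptyenv())
  function(n) {
    key <- as.character(n)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      cache[[key]] <- sum(sqrt(seq_len(n)))  # the "expensive" computation
    }
    cache[[key]]
  }
})
memo_f(1e7)  # slow the first time
memo_f(1e7)  # instant thereafter
```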
## Byte code compilation
R 2.13 introduced a new byte code compiler which can increase the speed of certain types of code 4-5 fold. This improvement is likely to get better in the future as the compiler implements more optimisations - this is an active area of research.
Using the compiler is a cheap way to look for speed-ups: it takes little effort to try, so if it doesn't work well for your function you haven't invested much time, and you haven't lost much.
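A minimal sketch: byte-compile a deliberately loop-heavy function with `compiler::cmpfun()` and compare (the speed-up you see will depend on your R version and on the function):

```{r, eval = FALSE}
library(compiler)
f <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]  # scalar loop: a good target
  total
}
cf <- cmpfun(f)  # byte-compiled version with identical behaviour
x <- runif(1e6)
system.time(f(x))
system.time(cf(x))
```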
## Other people's code
One of the easiest ways to speed up your code is to find someone who's already done the work: it's a good idea to search CRAN for packages that solve your problem. For numerical work, RcppGSL, RcppEigen, and RcppArmadillo connect R to highly optimised C++ libraries. Stack Overflow can also be a useful place to ask.
### Important vectorised functions
Not all base functions are fast, but many are, and if you can find the one that best matches your problem you may get big improvements:
* `cumsum()`, `diff()`
* `rowSums()`, `colSums()`, `rowMeans()`, `colMeans()`
* `rle()`
* `match()`
* `duplicated()`
Read the source code - implementation in C is usually correlated with high performance.
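For example, `rowSums()` is a special-purpose replacement for the general-purpose `apply()` (timings are illustrative):

```{r, eval = FALSE}
m <- matrix(runif(1e6), ncol = 100)
system.time(apply(m, 1, sum))  # general purpose: loops at the R level
system.time(rowSums(m))        # special purpose: implemented in C
```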
## Rewrite in a lower-level language
R makes it easy to connect to C, C++ and Fortran. C++ is the easiest, is recommended, and is described in the following chapter.
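As a preview, here's a sketch of how little ceremony Rcpp needs (assuming Rcpp and a working C++ toolchain are installed):

```{r, eval = FALSE}
library(Rcpp)
# cppFunction() compiles the C++ source and exposes it as an R function.
cppFunction("
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) total += x[i];
    return total;
  }
")
sumC(runif(10))
```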
## Brainstorming
The most important step is to brainstorm as many alternative approaches as possible. You can often replace slow general-purpose functions with fast special-purpose ones, and you can easily access faster languages. It's good to have a variety of approaches to call upon.
* Read blogs
* Algorithm/data structure courses (https://www.coursera.org/course/algs4partI)
* Book
* Read R code
We introduce a few at a high level in the Rcpp chapter.