# (PART) Common Tasks and Patterns {-}
# Input and Output
## General Input Considerations
So far, this text has favored creating small pieces of data *within our scripts.* Reading in data from external files has been avoided primarily for pedagogical purposes. In general, one might have
- data read in from a plain text file (e.g. `"my_data.csv"` or `"log_file.txt"` ),
- data read in from a database (e.g. MySQL, PostgreSQL, etc.), or
- data created in a script (either deterministically or randomly).
When discussing reading in data, this text mostly focuses on the first category, for the following reasons:
1. text files are more readily available to students than databases,
2. teaching the second category requires teaching SQL, and that would introduce conceptual overlap, and
3. the third category is programmatically self-explanatory.
::: {.rmd-details data-latex=""}
The third reason does not imply data created by code is unimportant. For example, it is the most common approach to create data used in **simulation studies.** Authors writing statistical papers need to demonstrate that their techniques work on "nice" data: data simulated from a *known* data-generating process. In a simulation study, unlike in the "real-world," you have access to the parameters generating your data, and you can examine data that might otherwise be unobserved or hidden. Further, with data from the real-world, there is no guarantee your model correctly matches the true model.
Can your code/technique/algorithm, at the very least, obtain parameter estimates that are "in line" with the parameters your code is using to simulate data? Are forecasts or predictions obtained by your method accurate? These kinds of questions can often only be answered by simulating fake data. Programmatically, simulating data like this largely involves calling functions that we have seen before (e.g. `rnorm()` in R or `np.random.choice()` in Python). This may or may not involve first setting a pseudorandom number seed for reproducibility (a small sketch follows this callout).
Also, *benchmark data sets* are often readily available through specialized function calls.
:::
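Here is a minimal sketch of that idea in Python; the seed, sample size, and parameter values below are arbitrary choices for illustration, not anything this text prescribes.

```{python, eval = FALSE}
import numpy as np

rng = np.random.default_rng(1912)  # set a seed for reproducibility
true_mean = 3.0                    # a parameter of the *known* data-generating process
fake_data = rng.normal(loc=true_mean, scale=1.0, size=500)
# is the estimate "in line" with the parameter that generated the data?
print(true_mean, fake_data.mean())
```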
Even though this chapter is written to teach you how to read files into R and Python, you should not expect that you will know how to read in *all* data sets after finishing it. For both R and Python, there is an enormous number of functions, different functions have different return types, different functions are suited to different file types, many functions are spread across a plethora of third-party libraries, and many of these functions have an enormous number of arguments. You will probably not be able to memorize everything. In my very humble opinion, I doubt you should want to.
Instead, **focus on developing your ability to identify and diagnose data input problems.** Reading in a data set correctly is often a process of trial-and-error. After attempting to read in a data set, always check the following items. Many of these points were previously mentioned in section \@ref(data-frames-in-r). Some apply to reading in text data more than reading in structured data from a database, and vice versa.
1. Check that **the correct column *separator* was used, or the correct "fixed-width format" was expected.** If mistakes are made, data frame columns are going to be combined or split apart in weird ways, and often the wrong types are going to be used for pieces of data (e.g. `"2,3"` instead of `2` and `3`). Also, watch out for when separators are found inside data elements or column names. For example, when people's names appear in the "last, first" format, it can be unclear whether they should be stored in one column or two. Finally, whitespace is a common separator, and text data might surprise you with unexpected spaces or tabs.
2. Check that **the column names were parsed and stored correctly.** Column names should not be stored as data in R/Python. Functions that read in data should not expect column names when they don't exist in the actual file.
3. Check that **empty space and metadata were ignored correctly.** Data descriptions are sometimes stored in the same file as the data itself, and they should be skipped over when the data is read in. Empty space between column names and data shouldn't be stored, either. These issues can occur at the beginning of the file, and even at the end.
4. **Check that type choice and recognition of special characters are performed correctly.** Are letters stored as strings or as something else such as an R `factor`? Are dates and times stored as a special date/time type, or as strings? Is missing data correctly identified? Sometimes data providers use outrageous numbers like $-9999$ to represent missing data--don't store that as a float or integer!
5. **Be ready to prompt R or Python to recognize a specific character encoding if you are reading in text data written in another language.** All text data has a character encoding, which is a mapping of numbers to characters. Any specific encoding dictates which characters a program will recognize. If you try to read in data written in another language, the function you are using will likely complain about unrecognized characters. Fortunately, these errors and warnings are usually easy to fix by specifying a nondefault argument such as `encoding=` or `fileEncoding=`. A short Pandas sketch follows this list.
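To make the last two items concrete, here is a minimal Pandas sketch. The file name, separator, sentinel value, and encoding are all hypothetical, but `sep=`, `na_values=`, and `encoding=` are standard arguments of `pd.read_csv()`.

```{python, eval = FALSE}
import pandas as pd

# hypothetical file: semicolon separators, -9999 as a missing-data
# sentinel, and text encoded as Latin-1 rather than UTF-8
d = pd.read_csv("some_data.csv",
                sep=";",
                na_values=["-9999"],  # don't store sentinels as numbers
                encoding="latin-1")   # declare a nondefault encoding
print(d.dtypes)                       # always check the resulting types
```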
This is no small task. To make matters worse:
- you can't (or shouldn't) edit the raw data to make it easier to read in. You have to work with what you are given. Even if you are able to edit, say, a text file you downloaded onto your own machine, you shouldn't--it will lead to code that doesn't run anywhere else. Abusing write privileges on your company's database, for example, could be even more dangerous.
- Data sets are often quite large, so manually checking each element is usually impossible. In this situation you will have to resign yourself to checking the top and bottom of a data set, or perhaps anticipating a specific place where problems are likely to appear (see the sketch after this list).
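In Pandas, for instance, these spot checks are one-liners; the sketch below assumes `d` is a `DataFrame` you have just read in.

```{python, eval = FALSE}
d.head(10)   # inspect the first ten rows
d.tail(10)   # inspect the last ten rows
d.sample(10) # inspect ten randomly-chosen rows
```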
## Reading in Text Files with R
You've seen examples of `read.csv()` used earlier in the book, so it should not surprise you that this is one of the most common ways to read in data in R. Another important function is `read.table()`.
If you look at the source code for `read.csv()` (type the name of the function without parentheses into the console and press the `<Enter>` key), you will see it calls `read.table()`. The primary difference between these functions is their default arguments. **Mind the default arguments.** Do not be completely averse to writing a long line of code to read in a data set correctly. Or do, and choose the function with the best default arguments.
Consider the ["Challenger USA Space Shuttle O-Ring Data Set"](https://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring) from [@uci_data]. The first few rows of the raw text file^[Open raw text files with text editor programs, not with programs that perform any kind of processing. For instance, if you open it with Microsoft Excel, the appearance of the data will change, and important information helping you to read your data into R or Python will not be available to you.] looks like this.
```{bash, echo=FALSE, comment = NA}
head -n3 data/o-ring-erosion-only.data
```
It does not use commas as separators, and there is no header information, so `read.csv()` used with its default arguments will produce an incorrect result. It will miss the first row by counting it as a column name, and store everything in one column with the wrong type.
```{r, collapse = TRUE}
d <- read.csv("data/o-ring-erosion-only.data")
dim(d) # one row short, only 1 col
typeof(d[,1])
```
Specifying `header=FALSE` fixes the column name issue, but `sep = " "` does not fix the separator issue.
```{r, collapse = TRUE}
d <- read.csv("data/o-ring-erosion-only.data",
              header=FALSE, sep = " ")
str(d)
```
With `sep = " "`, one space is strictly one space. Some rows, though, separate values with two spaces, which creates two extra columns filled with `NA`s.
After digging into the documentation a bit further, you will notice that `sep = ""` works for "one or more spaces, tabs, newlines or carriage returns." This is why `read.table()`, with its default arguments, works well.
```{r, collapse = TRUE}
d <- read.table("data/o-ring-erosion-only.data")
str(d)
```
This data set has columns whose widths are "fixed", too. It is in "fixed width format" because, within any given column, every element takes up a constant number of characters. The third column has integers with two or three digits, but no matter what, each row has the same number of characters.
You may choose to exploit this and use a specialized function that reads in data in a fixed width format (e.g. `read.fwf()`). The frustrating thing about this approach, though, is that you have to specify what those widths are. This can be quite tedious, particularly if your data set has many columns and/or many rows. The upside, though, is that the files can be a little bit smaller, because the data provider does not have to waste characters on separators.
In the example below, we specify widths that include blank spaces to the left of the digits. On the other hand, if we specified `widths=c(2,2,4,4,1)`, which includes spaces to the *right* of digits, then columns would have been recognized as `character`s.
```{r, collapse = TRUE}
d <- read.fwf("data/o-ring-erosion-only.data",
              widths = c(1,2,3,4,3)) # or try c(2,2,4,4,1)
str(d)
```
If you need to read in some text data that does not possess a tabular structure, then you may need `readLines()`. This function reads in all of the text, stores each line as an element of a `character` `vector`, and makes no attempt to parse lines into columns. Further processing can be accomplished using the techniques from section \@ref(an-introduction-to-regular-expressions).
```{r, collapse=TRUE}
html_data <- readLines("data/Google.html", warn = FALSE)
head(html_data, 1)
```
Some of you may have had difficulty reading in the above data. This can happen if your machine's default character encoding is different from mine. For instance, if your character encoding is ["GBK"](https://en.wikipedia.org/wiki/GBK_(character_encoding)), then you might get a warning message like "invalid input found on input connection." This message means that your machine didn't recognize some of the characters in the data set.
```{r, echo=FALSE}
options(encoding = "GBK") # change my encoding to GBK
```
These errors are easy to fix, though, so don't worry. Just specify an encoding argument in your function that reads in data.
```{r, collapse=TRUE, error=TRUE}
tmp <- read.table("data/Google.html", sep = "~",
fileEncoding = "UTF-8") # makes errors disappear
```
```{r, echo=FALSE}
options(encoding = "UTF-8")
```
## Reading in Text Files with Pandas
A [wide variety of file formats can be read in with Pandas.](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) I will only mention a few functions here.
Recall R has `read.table()` and `read.csv()`, and that they are very similar. In Pandas\index{Pandas}, [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [`pd.read_table()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) have a lot in common, too; their primary difference is, again, the default column separator.
Recall the O-Ring data from above. The columns are *not* separated by commas, so if we treat it as a comma-separated file, the resulting Pandas `DataFrame` is going to be missing all but one of its columns.
```{python, collapse = TRUE}
import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data")
d.shape # one column and missing a row
d.columns # column labels are data
```
By default, `pd.read_csv()` is expecting column labels, which is also a problem. Unlike in R, though, the `header=` argument is not expected to be a Boolean; you will need to pass `None` instead. The separator needs to be just right, too.
```{python, collapse = TRUE}
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = " ").head(2) # 1 space: no
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = "\t").head(2) # tabs: no
pd.read_table("data/o-ring-erosion-only.data",
header=None).head(2) # default sep is tabs, so no
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = "\s+").head(2) # 1 or more spaces: yes
```
Reading in fixed width files can be done in a way that is nearly identical to the way we did it in R. Here is an example.
```{python, collapse = TRUE}
d = pd.read_fwf("data/o-ring-erosion-only.data",
                widths = [1,2,3,4,3], header=None) # try [2,2,4,4,1]
d.info()
```
If you had chosen `widths=[2,2,4,4,1]` instead, then the trailing whitespace would cause Pandas to recognize a `dtype` of `object` for those columns. They are not automatically given a dedicated string type, because strings can be of different lengths, so Pandas falls back on the catch-all `object` type. If you would rather have an explicit string type (which makes your intent clearer and can bring some speed advantages), you must convert the columns yourself. In the below example, we use [`d.astype()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) to convert two columns' types to [`pd.StringDtype`](https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas.StringDtype).
```{python, collapse = TRUE}
d = pd.read_fwf("data/o-ring-erosion-only.data",
                widths = [2,2,4,4,1], header=None)
list(d.dtypes)[:4]
d = d.astype({2:'string', 3:'string'})
list(d.dtypes)[:4]
```
Just like in R, you may run into an encoding issue with a file. For instance, the file below contains Chinese characters in a non-UTF-8 encoding, so if your machine expects UTF-8 files, you will receive a `UnicodeDecodeError` when you run the following code.
```{python, collapse = TRUE, error = TRUE, eval = FALSE}
pd.read_csv("data/message.txt")
```
However, the error messages disappear when you specify `encoding="gbk"`.^[A list of the encodings built into Python is available [here.](https://docs.python.org/3/library/codecs.html#standard-encodings)]
```{python, collapse = TRUE, error = TRUE}
pd.read_csv("data/message.txt", encoding = "gbk")
```
You may also read in unstructured, nontabular data with Python. Use the built-in [`open()`](https://docs.python.org/3/library/functions.html#open) function to open a file in read mode, and then call the file object's `.readlines()` method to get back a `list` of strings. Remember to close the file when you are finished with it.
```{python, collapse=TRUE}
f = open("data/Google.html", "r")
d = f.readlines()
f.close() # close the file once we are done with it
d[:1]
print(type(d), type(d[0]))
```
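A common alternative, worth knowing even though this example does not require it, is a `with` block, which closes the file automatically, even if an error occurs partway through.

```{python, eval = FALSE}
with open("data/Google.html", "r") as f:
    d = f.readlines() # the file is closed as soon as this block ends
```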
## Saving Data in R
Storing data is important for saving your progress. For example, running a script that performs data cleaning can take a very long time; saving the cleaned result might free you from having to run that script many times.
In R, there are many options for storing data. I will mention two: writing data out to a plain text file, and saving a serialized object.
### Writing Out Tabular Plain Text Data in R
If you want to write out tabular data to a text file, use `write.table()` or `write.csv()`. There are two arguments that you must specify, at a minimum: the first argument is your R object (typically a `matrix` or `data.frame`), and the second argument is the file path on your hard drive.
Here is an example of writing out `d` to a file called `"oring_out.csv"`. I choose to include column names, but not row names. I also use commas to separate columns.
```{r, collapse = TRUE}
write.table(d, file = "data/oring_out.csv",
            col.names = TRUE, row.names = FALSE, sep = ",")
```
The above will not print anything to the R console, but we can use a text editor to take a look at the raw text file on our hard drive. Here are the first three rows.
```{bash, echo=FALSE, comment = NA}
head -n3 data/oring_out.csv
```
### Serialization in R
Alternatively, you may choose to store your data in a **serialized** form. With this approach, you are still saving your data in a more permanent way to your hard drive, but it is stored in a format that is usually more memory efficient.
::: {.rmd-details data-latex=""}
Recall that a common reason for writing out data is to save your progress. When you want to save your progress, it is important to ask yourself: "is it better to save my progress as a serialized object, or as a raw text file?"
When making this decision, consider *versatility.* On the one hand, raw text files are more versatile and can be used in more places. On the other hand, that versatility is often bug prone: a plain text file does not store the types you so carefully chose, so they must be recovered every time the file is read back in.
For example, suppose you want to save a cleaned up `data.frame`. Are you sure you will remember to store that column of strings as `character` and not a `factor`? Does any code that uses this `data.frame` require that this column be in this format?
:::
For instance, let's save the object `d` in a file called `oring.rds`.
```{r, collapse = TRUE}
saveRDS(d, file = "data/oring.rds")
rm(d)
exists("d")
```
After it is saved with `saveRDS()`, we are free to delete the variable with `rm()`, because it can be read back in later on. To do this, call `readRDS()`. This file has a special format that is recognized by R, so you will not need to worry about any of the usual struggles that occur when reading in data from a plain text file. Additionally, `.rds` files are typically smaller--`"oring.rds"` is only 248 bytes, while `"oring_out.csv"` is 332 bytes.
```{r, collapse = TRUE}
d2 <- readRDS(file = "data/oring.rds")
head(d2, 3)
```
You can serialize multiple objects at once, too! Convention dictates that these files end with the `.RData` suffix. Save your entire global environment with `save()` or `save.image()`, and bring it back with `load()` or `attach()`.
```{r, collapse = TRUE}
rm(list=ls()) # remove everything
a <- 1
b <- 2
save.image(file = "data/my-current-workspace.RData")
rm(list=ls())
load("data/my-current-workspace.RData")
ls() # print all objects in your workspace
```
## Saving Data in Python
### Writing Out Tabular Plain Text Data in Python
You can write out tabular data with a variety of [`DataFrame` methods that are named `to_*()`](https://pandas.pydata.org/pandas-docs/stable/reference/io.html#input-output). [`pd.DataFrame.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv) has a lot in common with `write.csv()` in R. Below we write out `d` to a file called `"oring_out2.csv"`.
```{python, collapse = TRUE}
import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data",
                header=None, sep = r"\s+")
d.to_csv("data/oring_out2.csv",
         header=True, index=False, sep = ",")
```
Here is how the first few rows of that file look in a text editor.
```{bash, echo=FALSE, comment = NA}
head -n3 data/oring_out2.csv
```
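`to_csv()` is only one of the `to_*()` methods. For a taste of the variety, both of the following are real `DataFrame` methods; the output file names are arbitrary.

```{python, eval = FALSE}
d.to_json("data/oring.json") # write d as JSON
d.to_html("data/oring.html") # write d as an HTML table
```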
### Serialization in Python
Serialization functionality is readily available in Python, just like it is in R. The built-in [`pickle`](https://docs.python.org/3/library/pickle.html) module is probably the most commonly used tool (in Python 2 there was also a faster `cPickle`, which has since been folded into `pickle`). Serializing objects this way is known as *pickling* an object.\index{pickling in Python}
Pandas has a [`.to_pickle()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html) wrapper method attached to every `DataFrame`. Once the pickled object is saved, the file can be read back into Python with [`pd.read_pickle()`](https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html#pandas.read_pickle). These functions are extremely convenient, because they call all the required `pickle` code and hide a decent amount of complexity.
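For a sense of what those wrappers hide, here is a minimal sketch that uses `pickle` directly. The file name is made up, but `pickle.dump()` and `pickle.load()` are the module's core functions.

```{python, eval = FALSE}
import pickle

# pickled files are written and read in *binary* mode ("wb"/"rb")
with open("data/some_object.pickle", "wb") as f:
    pickle.dump(d, f)        # serialize d to disk

with open("data/some_object.pickle", "rb") as f:
    d_again = pickle.load(f) # deserialize it back into memory
```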
Here is an example of writing out `d` and then reading the pickled object back in. In Python 3, the file suffix for pickled objects is usually `.pickle`, but there are many other choices.
```{python, collapse = TRUE}
d.to_pickle("data/oring.pickle")
del d
d_is_back = pd.read_pickle("data/oring.pickle")
d_is_back.head(2)
```
::: {.rmd-caution data-latex=""}
Unfortunately, `"oring.pickle"` is much larger (1,676 bytes) than the original text file `"o-ring-erosion-only.data"` (322 bytes). This is for two reasons. First, the original data set is small, so the overhead of pickling this object is relatively pronounced, and second, we are not taking advantage of any compression. If you use something like `d_is_back.to_pickle("data/oring.zip")` it will become smaller.
:::
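If you would like to verify size comparisons like this one on your own machine, the standard library can report file sizes; the paths below assume the files created earlier in this chapter.

```{python, eval = FALSE}
import os

os.path.getsize("data/o-ring-erosion-only.data") # size in bytes
os.path.getsize("data/oring.pickle")
```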
In Python, unlike in R, it is more difficult to serialize all of the objects you currently have in memory. It is possible, but it will likely require the use of a third-party library.
Speaking of third-party code, there are many libraries that provide alternative serialization solutions in both R and Python. I do not discuss any in this text. However, I will mention that some of them may provide combinations of the following: an increase in read and write speed, a decrease in required memory, improved security^[The [documentation for `pickle`](https://docs.python.org/3/library/pickle.html) mentions that the library is "not secure against erroneous or maliciously constructed data" and recommends that you "[n]ever unpickle data received from an untrusted or unauthenticated source."], improved human readability, and interoperability between multiple programming languages. If any of these sound potentially beneficial, I encourage you to conduct further research.
## Exercises
### R Questions
1.
Consider again the data set called `"gspc.csv"`, which contains daily open, high, low and close values for the S\&P500 Index.
a) Read in this data set as a `data.frame`, and call it `myData`. Do not include the code that achieves this in your assignment submission.
b) Write out this object as `myData.rds`. After you are finished, remove `myData` from memory. Do not include the code that achieves this in your assignment submission.
c) Read in `myData.rds`, and store the variable as `financialData`. *Do* include the code that achieves this in your project submission. Make sure this code assumes that `myData.rds` is in the same folder as the code file `io_lab.R`.
### Python Questions
1.
We will use the `"Google.html"` data set mentioned in the chapter.
a) Use `open()` to open the `"Google.html"` file. Store the output of the function as `my_file`.
a) Use the `.readlines()` method of the file to store the contents of the file as a `list` called `html_data`.
c) Coerce the `list` to a `DataFrame` with one column called `html`.
d) Create a `Series` called `nchars_ineach` that stores the number of characters in each line of text. Hint: the [`Series.str` attribute has a lot of helpful methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str).
e) Create an `int`-like variable called `num_div_tags` that holds the total number of times the phrase "`<div>`" appears in the file.
2.
Consider the data set called `"gspc.csv"`, which contains daily open, high, low and close values for the S\&P500 Index.
a) Read in this data set as a `DataFrame`, and call it `my_data`. Do not include the code that achieves this in your assignment submission.
b) Write out this object as `"my_data.pickle"`. After you are finished, remove `my_data` from memory. Do not include the code that achieves this in your assignment submission.
c) Read in `"my_data.pickle"`, and store the variable as `financial_data`. *Do* include the code that achieves this in your project submission. Make sure this code assumes that `"my_data.pickle"` is in the same folder as the code file `io_lab.py`.