---
title: "Operational Synthesis"
---
## Learning Objectives
After completing this module, you will be able to:
- **Identify** characteristics of reproducible coding / project organization
- **Explain** benefits of reproducibility (to your team and beyond)
- **Summarize** the advantages of creating a defined contribution workflow
- **Explain** how synthesis teams can use GitHub to collaborate more efficiently and reproducibly
- **Understand** best practices for preparing and analyzing data to be used in synthesis projects
## Introduction
Here are a few serviceable definitions of what research is:
- "Creative and systematic work undertaken in order to increase the stock of knowledge" *from the [2015 Frascati Manual](https://doi.org/10.1787/9789264239012-en)*
- "Studious inquiry or examination," and especially "investigation or experimentation aimed at the discovery and interpretation of facts" *from the [Merriam-Webster dictionary](https://www.merriam-webster.com)*
To these basic definitions of research, **our definition of synthesis research adds
_collaborative work_, and the _integration and analysis of a wide range of data sources_**, to achieve a more complete, generalizable, or useful research result. In Module 1 we discussed many of the collaborative considerations for synthesis research, including creating a diverse and inclusive team, asking synthesis-ready scientific questions (often broad in scope or spatial scale), and finding suitable information (or data) from a wide variety of sources to answer those questions. Once the synthesis team moves into the operational phase of research, which includes the integration and analysis of data, there are some key activities that must happen:
- **Cleaning and harmonizing data** to make it usable
- **Analyzing data** to answer questions
- **Interpreting** the results of your analysis
- **Writing** the papers or **creating** other research products
We've already seen that creating a collaborative, inclusive team can set the stage for successful synthesis research. Each of the operational activities above will also benefit from this mindset, and in this module we highlight some of the most important considerations and practices for a *team science* approach to the nuts-and-bolts of synthesis research.
## Reproducibility Practices
<p align="center">
<img src="images/mod2_repro.png" alt="PhD comics strip where a professor reassures a graduate student that they won't have to start code on a project from scratch but then goes on to describe the many reproducibility issues with the code to the graduate student's increasing dismay" width="90%"/>
</p>
Making one's work "reproducible"--particularly in code contexts--has become increasingly popular but is not always clearly defined. For the purposes of this short course, we believe that reproducible work:
- Uses scripted workflows for all interactions with data
- Contains sufficient documentation for those outside of the project team to navigate the project's contents
- Contains detailed metadata for all data products
- Allows anyone to recreate the entire workflow from start to finish
- Leads to modular, extensible research projects. Adding data from a new site, or a new analysis, should be relatively easy in a reproducible workflow.
### {{< fa file-pen >}} Contributing
- Create a formal plan for collaborating _with which your whole team agrees_
- Quarantine external inputs
- Plan for "future you"
- Communicate to your collaborators whenever you're working on a specific script to avoid conflicting edits
### {{< fa file-lines >}} Documenting
::::{.columns}
:::{.column width="55%"}
- One folder per project
- Further organize content via sub-folders
- Make file names informative and intuitive
- Avoid spaces and special characters in file names
- Follow a consistent naming convention throughout
- Good names should be machine readable, human readable, and sorted in a useful way
- Use READMEs to record organization rules / explanation
- Keep a log of where source data came from.
    - Where did you search?
    - What search terms did you use?
    - List the dataset identifiers you downloaded/used
- Ideally, include downloading data as part of your scripted workflow
:::
:::{.column width="5%"}
:::
:::{.column width="40%"}
:::{style="font-size:0.9em;"}
_Example project structure:_
{{< fa folder >}} project_new
|-- {{< fa file-lines >}} README.txt
|-- {{< fa folder >}} 01_grant_management
|-- {{< fa folder >}} 02_project_coordination
|-- {{< fa folder >}} 03_documentation
|-- {{< fa folder >}} 04_participant_tracking
|-- {{< fa folder >}} 05_data
| |-- {{< fa file-lines >}} README.txt
| |-- {{< fa folder >}} hydrology
| L {{< fa folder >}} water_chemistry
|-- {{< fa folder >}} 06_src
| |-- {{< fa file-lines >}} README.txt
| |-- {{< fa folder >}} data_aggregation
| |-- {{< fa folder >}} data_harmonization
| L {{< fa folder >}} modeling
L {{< fa folder >}} 07_publications
L {{< fa folder >}} biogeochemistry
:::
:::
::::
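The file-naming guidelines above can be checked automatically. Below is a minimal Python sketch of such a check. The pattern it enforces (letters, digits, underscores, and hyphens, followed by one extension) and the function names `is_tidy_name` and `untidy_names` are assumptions made for this illustration, not a standard your team must adopt.

```python
import re

# Hypothetical convention: letters, digits, underscores, and hyphens,
# followed by a single file extension. Adjust to match your team's rules.
TIDY_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")


def is_tidy_name(filename: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return bool(TIDY_NAME.match(filename))


def untidy_names(filenames: list[str]) -> list[str]:
    """Return the names that break the convention, in their original order."""
    return [name for name in filenames if not is_tidy_name(name)]


if __name__ == "__main__":
    # Made-up file names for illustration only.
    examples = ["01_harmonize_biomass.py", "final version (2).R", "README.txt"]
    print(untidy_names(examples))
```

A check like this could be run before each commit so that naming problems are caught early, though a short written rule in the README may be all a small team needs.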
### {{< fa laptop-code >}} Coding
- Use a [version control](https://lter.github.io/eco-data-synth-primer/module2.html#version-control) system
- Load libraries/packages explicitly
- Track (and document) software versions
- Namespace functions (if not already required by your coding language)
    - *E.g.,* `dplyr::mutate(mtcars, hp_disp = hp / disp)`
- Use relative file paths that are operating system-agnostic
- Balance how descriptive object names are with striving for concise names
- _Use comments_ in the code!
- Consider custom functions
- For scripts that need to be run in order, consider adding step numbers to the file name
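Two of the coding practices above, operating system-agnostic relative paths and tracking software versions, can be sketched in a few lines of Python. This is only an illustration: the file name `discharge_raw.csv` and the helper names `package_version` and `session_info` are hypothetical, and the folder names are borrowed from the example project structure shown earlier.

```python
import platform
from importlib import metadata
from pathlib import Path

# Build paths relative to the project root with pathlib, which handles
# path separators correctly on Windows, macOS, and Linux.
# "discharge_raw.csv" is a hypothetical file name.
data_dir = Path("05_data") / "hydrology"
raw_file = data_dir / "discharge_raw.csv"


def package_version(name: str) -> str:
    """Return the installed version of a package, or a note if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def session_info(packages: list[str]) -> dict[str, str]:
    """Collect version information to save alongside analysis outputs."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        info[name] = package_version(name)
    return info


if __name__ == "__main__":
    print(raw_file.as_posix())
    for key, value in session_info(["pandas", "numpy"]).items():
        print(f"{key}: {value}")
```

Writing the output of something like `session_info()` to a text file next to your results is one simple way to document which software versions produced them.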
### {{< fa people-group >}} Synthesizing
How do reproducibility considerations in synthesis work differ from those in individual / non-synthesis applications?
- Judgement calls need to be made / agreed to as a group
    - But "defer to the doers"
- Increased emphasis on contribution guidelines / planning being formalized
- More communication needs
- Must ensure that every team member has sufficient access to the project files
- It's best to keep track of who contributed what, so that everyone gets credit. This can be challenging in practice.
::: {.panel-tabset}
### **Reproducibility Questions**
In groups of 3-5, discuss the following questions:
- What elements of reproducibility **from our list** have you used/are interested in using?
- Which feel unreasonable or confusing?
- What activities do you do in your own work to ensure reproducibility that **our list is missing**?
:::
## Version Control
In all scientific research, the data work (cleaning, harmonizing, analyzing) and the writing are iterative processes. The process and products change over time and usually require a series of revisions. In *synthesis* research, the process can become even more complex because the team is usually large and multiple people are contributing data, analysis, writing, revisions, and more. Using **version control** helps manage this complexity by recording changes, tracking individual contributions, and ensuring that things can be rolled back to an earlier state if needed.
<p align="center">
<img src="images/mod2_final.png" alt="PhD comics strip of a graduate student prematurely titling a document 'final.doc' and then undergoing a series of revisions with progressively more complicated and less informative final names" width="80%"/>
</p>
As the comic above shows, you might go through several drafts of a paper before reaching the final version. With a version control system, the revisions in each draft are saved. Version control systems provide a framework for preserving these changes without cluttering your computer with all of the files that precede the final version.[^1]
Using version control enhances your workflow by allowing you to:
- maintain a descriptive history of your research project’s development while keeping a clean workspace
    - no more cryptic file names or commented-out lines of code to track your progress
- collaborate with team members and merge everyone's edits together
- explore bugs or new features without disrupting your team members’ work.[^2]
[^1]: Lyon, N. J., Chen, A., Brun, J. (2023). Collaborative Coding with GitHub. LNO Scientific Computing Team. https://nceas.github.io/scicomp-workshop-collaborative-coding/.
[^2]: Poulsen, C. V. & Chen, A. (2024). NCEAS coreR for Delta Science Program. NCEAS Learning Hub. https://learning.nceas.ucsb.edu/2024-06-delta.
### Vocabulary
<p align="center">
<img src="images/git_local.png" alt="Image showing how Git is used from a person's local computer" width="50%"/>
<img src="images/github_cloud.png" alt="Image showing how GitHub is an online platform that hosts Git repositories" width="46%"/>
</p>
Here are some brief definitions for a selection of fundamental version control vocabulary terms.
- **Version control system**: software that tracks iterative changes to your code and other files
- **Repository**: the specific folder/directory that is being tracked by a version control system
- **Git**: a popular open-source distributed version control system
- **GitHub**: a website that allows users to store their Git repositories online and share them with others
### GitHub
While this section of the module focuses on [GitHub](https://github.com/), there are several other viable alternatives for working with Git individually or as part of a larger team (e.g., [GitLab](https://about.gitlab.com/), [GitKraken](https://www.gitkraken.com/)). Any of these may be a viable option for your team; we focus on GitHub here only to ensure a standard backdrop for the case studies we'll discuss shortly.
There are a _lot_ of GitHub tutorials that exist already so, rather than add our own variant to the list, we'll work through part of one created by the Scientific Computing team of the [National Center for Ecological Analysis and Synthesis](https://www.nceas.ucsb.edu/) (NCEAS).
**See the workshop materials [here](https://nceas.github.io/scicomp-workshop-collaborative-coding/github.html).**
Given the time restrictions for this short course, we'll only cover how you engage with GitHub directly through the GitHub website. However, your chosen software for writing code will _certainly_ have a method of connecting to GitHub/etc., so if this topic is of interest it will be beneficial for you to search out the relevant tutorial.
## Data Preparation
The scientific questions being asked in synthesis projects are usually broad in scope, and it is therefore common to bring together many datasets from different sources for analysis. The datasets selected for analysis (source data) may have been collected by different people, in different places, using different methods, as part of different projects... or all of the above. Typically, some amount of data **cleaning** - filtering or removing unwanted observations - and data **harmonization** - putting data together in common structures, file formats, and units of measurement - is necessary before analysis can begin. This process can be easy or difficult depending on the quality of the source data, the differences between source data, and how much **metadata** (see callout below) is available to understand them.
::: {.panel-tabset}
### **Data Preparation Group Questions**
- How many of you work directly with data in your day-to-day?
- What percentage of the time that you spend working on data is spent on data cleaning?
- How much on metadata creation?
- How much on data preparation?
:::
:::{.callout-tip collapse="true"}
### More about Metadata
Metadata is "data about the data," or information that describes **who** collected the data, **what** was observed or measured, **when** the data were collected, **where** the data were collected, **how** the observations or measurements were made, and **why** they were collected. Metadata provide important contextual information about the origin of the data and how they can be analyzed or used. They are most useful when attached or linked to the data being described, and data and related metadata together are commonly referred to as a *dataset*.
Metadata for ecological research data are well described in Michener et al (1997),[^3] but there are many other kinds of metadata with different purposes.[^4] If you are publishing a research dataset and have questions about metadata, ask a data manager for your project, or staff at the repository you are working with, for help. Either can typically provide guidance on creating metadata that will describe your data and be useful to the community (here is [one example](https://edirepository.org/resources/creating-metadata-for-publication)). We'll return to the subject of metadata in Module 3.
[^3]: Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. (1997), NONGEOSPATIAL METADATA FOR THE ECOLOGICAL SCIENCES. Ecological Applications, 7: 330-342. https://doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2
[^4]: Mayernik, M.S. and Acker, A. (2018), Tracing the traces: The critical role of metadata within networked communications. Journal of the Association for Information Science and Technology, 69: 177-180. https://doi.org/10.1002/asi.23927
:::
### Cleaning Data
When assembling large datasets from diverse sources, as in synthesis research, not all the source data will be useful. This may be because there are real or suspected errors, missing values, or simply because they are not needed to answer the scientific question being asked (wrong variable, different ecosystem, etc.). Data that are not useful are usually excluded from analysis or removed altogether. Data cleaning tends to be a stepwise, iterative process that follows a different path for every dataset and research project. There are some standard techniques and algorithms for cleaning and filtering data, but they are beyond the scope of this course. Below are a few guidelines to remember, and more in-depth resources for data cleaning are found at the end of this section.
1. **Always preserve the raw data**. Chances are you'll want to go back and check the original source data at least once.
2. **Use a scripted workflow to clean and filter the raw data**, and follow the usual rules about reproducibility (comments, version control, functionalization).
3. **Consider using the concept of data processing "levels,"** meaning that defined sets of data flagging, removal, or transformation operations are applied consistently to the data in stepwise fashion. For example, incoming raw data would be labeled "level 0" data, and "level 1" data is reached after the first set of processing steps is applied.
4. **Spread the data cleaning workload around!** Data cleaning typically demands a HUGE fraction of the total time devoted to working with data,[^5][^6][^7] and it can be tedious work. Make sure the team shares this workload equitably.
[^5]: Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
[^6]: [New York Times, 2014](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html)
[^7]: [Anaconda State of Data Science Report, 2022](https://www.anaconda.com/resources/whitepapers/state-of-data-science-report-2022/)
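To make guidelines 1 through 3 concrete, here is a minimal sketch of a scripted step that takes "level 0" (raw) records to "level 1" by flagging missing values while leaving the raw records untouched. The records, the column name `biomass_g`, the missing-data codes, and the function name `to_level1` are all hypothetical. In practice you would likely read the raw data from a file and use a library such as `pandas`; plain Python is used here only to keep the example dependency-free.

```python
import copy

# Hypothetical "level 0" (raw) records; the values are made up for illustration.
LEVEL0 = [
    {"plot": "A1", "biomass_g": "212.5"},
    {"plot": "A2", "biomass_g": ""},       # empty cell
    {"plot": "A3", "biomass_g": "-9999"},  # an assumed missing-data code
    {"plot": "A4", "biomass_g": "187.0"},
]


def to_level1(level0_rows, missing_codes=("", "NA", "-9999")):
    """Return flagged copies of the raw rows; the input is never modified."""
    level1 = []
    for row in copy.deepcopy(level0_rows):
        raw_value = row["biomass_g"].strip()
        if raw_value in missing_codes:
            # Record the problem in a flag column rather than dropping the row.
            row["biomass_g"] = None
            row["flag"] = "missing"
        else:
            row["biomass_g"] = float(raw_value)
            row["flag"] = "ok"
        level1.append(row)
    return level1


if __name__ == "__main__":
    for row in to_level1(LEVEL0):
        print(row)
```

Flagging rather than deleting keeps each decision visible and reversible, which makes it easier for the whole team to review how the data were cleaned.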
### Data Harmonization
Data harmonization is the process of bringing different datasets into a common format for analysis. The harmonized data format chosen for a synthesis project depends on the source data, analysis plans, and overall project goals. It is best to make a plan for harmonizing data BEFORE analysis begins, which means discussing this with the team in the early stages of a synthesis project. As a general rule, it is also wise to use a scripted workflow that is as reproducible as possible to accomplish the harmonization you need for your project. Following this guidance lets others understand, check, and adapt your work, and will also make it much, much easier to bring new data and analysis methods into the project.
Data harmonization is hard work that sometimes requires trial and error to arrive at a useful end product. At the end of this section are some additional data harmonization resources to help you get started. Looking at a simple example might also help.
::: {.callout-note appearance="simple" icon="false"}
### Example: Harmonizing grassland biomass data
In the figure below, two datasets from different LTER sites have been harmonized into one file for analysis. We don't have all the metadata here, but based on the column naming we can assume that the file on the left (`Konza_harvestplots.txt`, yellow) contains columns for the sampling date, a plot identifier, a treatment categorical value, and measured biomass values. The filename suggests that the data come from the Konza Prairie LTER site. The file on the right (`2022_clips.csv`, blue) has a column denoting the LTER site, in this case Sevilleta LTER (abbreviated as SEV), as well as similar date, plot identifier, treatment, and measured biomass columns.
There are some similarities and some differences in these two source files. A harmonized file (`lter_grass_biomass_harmonized.txt`) appears below.
<p align="center">
<img src="images/data_harmonization_example.png" alt="Image showing two small data tables with different column names and data structure being modified to be able to combine them into a single, larger table" width="90%"/>
</p>
Take a minute to look at the harmonized file and consider **how** these data were harmonized. Then, answer the question below.
:::
::: {.callout-caution collapse="true" icon="false"}
### What changes were made to data structure, variable formatting, or units in these data?
Even though the source data files were similar, several important changes were made to create the harmonized file. Among them:
- The **site** column was preserved and contains the "SEV" and "KNZ" categorical values denoting which LTER site is observed in each row.
- The dates in the **date** column were converted to a standard format (*YYYY-MM-DD*). Incidentally, this matches the International Organization for Standardization (ISO) recommendation for date/time data (ISO 8601), making it generally a good choice for dates.
- A new **rep** column was created to identify the replicate measurements at each site. This is essentially a standardized version of the plot identifier columns (**PlotID** and **plot**) in the original data files. Also note that the original values are preserved in the new **plot_orig** column.
- The treatment columns from the original data files (**Treatment** and **trt**) were standardized to one column with "C" and "F" categorical values, for control and fertilized treatments, respectively.
- The biomass values from Konza were converted to units of grams per square meter (g/m^2^) because the original Konza measurements were for total biomass in 2x2 meter plots. *Note that this conversion is for illustration purposes only* - we can't be sure if this conversion is correct without it being spelled out in the metadata, or asking the data provider directly.
:::
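The kinds of changes described in this example can be scripted. The sketch below is one possible approach in plain Python, not the workflow used to produce the figure. Everything in it is an assumption made for illustration: the rows are invented, the source date formats are guesses, we assume the Konza file uses `PlotID` and `Treatment` while the Sevilleta file uses `plot` and `trt`, and the output column names `treatment` and `biomass_g_m2` are hypothetical.

```python
from datetime import datetime

# Invented rows standing in for the two source files.
konza_rows = [
    {"Date": "7/15/2022", "PlotID": "K-01", "Treatment": "control", "Biomass": "840.0"},
    {"Date": "7/15/2022", "PlotID": "K-02", "Treatment": "fertilized", "Biomass": "1260.0"},
]
sev_rows = [
    {"site": "SEV", "date": "2022-09-30", "plot": "3", "trt": "C", "biomass": "95.5"},
    {"site": "SEV", "date": "2022-09-30", "plot": "4", "trt": "F", "biomass": "130.25"},
]

# Map every treatment label we expect onto the shared "C"/"F" codes.
TREATMENT_CODES = {"control": "C", "fertilized": "F", "C": "C", "F": "F"}
KONZA_PLOT_AREA_M2 = 4.0  # 2 x 2 meter plots, as described in the example


def harmonize_konza(rows):
    """Convert Konza rows to the shared structure (site, ISO date, g/m^2)."""
    out = []
    for rep, row in enumerate(rows, start=1):
        out.append({
            "site": "KNZ",
            "date": datetime.strptime(row["Date"], "%m/%d/%Y").date().isoformat(),
            "rep": rep,
            "plot_orig": row["PlotID"],
            "treatment": TREATMENT_CODES[row["Treatment"]],
            # Total grams per plot divided by plot area gives grams per square meter.
            "biomass_g_m2": float(row["Biomass"]) / KONZA_PLOT_AREA_M2,
        })
    return out


def harmonize_sev(rows):
    """Convert Sevilleta rows to the shared structure (already in g/m^2)."""
    out = []
    for rep, row in enumerate(rows, start=1):
        out.append({
            "site": row["site"],
            "date": datetime.strptime(row["date"], "%Y-%m-%d").date().isoformat(),
            "rep": rep,
            "plot_orig": row["plot"],
            "treatment": TREATMENT_CODES[row["trt"]],
            "biomass_g_m2": float(row["biomass"]),
        })
    return out


harmonized = harmonize_konza(konza_rows) + harmonize_sev(sev_rows)

if __name__ == "__main__":
    for row in harmonized:
        print(row)
```

Keeping one small function per source dataset means that adding a third site later only requires writing one more function and appending its output.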
### A Word about Harmonized Data Formats
Above, we have discussed several aspects of selecting a **data format**. There are at least three related, but not exactly equivalent, concepts to consider when formatting data. First, formats describe the way data are structured, organized, and related within a data file. For example, in a tabular data file about biomass, the measured biomass values might appear in one column, or in multiple columns. Second, the values of any variable can be represented in more than one format. The same date, for example, could be formatted using text as "July 2, 1974" or "1974-07-02." Third, format may refer to the *file format* used to hold data on a disk or other storage medium. File formats like comma-separated value text files (CSV), Excel files (.xlsx), and JPEG images are commonly used for research data, and each has particular strengths for certain kinds of data.
A few guidelines apply:
1. For formatting a tabular dataset, **err towards simpler data structures**, which are usually easier to clean, filter, and analyze. Long-format tables, or tidy data,[^5] are one common recommendation for this.
2. When choosing a file format, **err towards open, non-proprietary file formats** that more people know and have access to. Delimited text files, such as CSV files, are a good choice for tabular data.
3. **Use existing community standards** for formatting variables and files as long as they suit your project methods and scientific goals. Using ISO standards for date-time variables, or species identifiers from a taxonomic authority, are good examples of this practice.
4. **There is no perfect data format!** Harmonizing data always involves some judgement calls and tradeoffs.
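As a small illustration of guideline 3, the sketch below converts dates written in a few different styles to the ISO 8601 *YYYY-MM-DD* form using Python's standard library. The list of accepted input formats and the function name `to_iso_date` are assumptions for this example. Note that a format like `7/2/1974` is ambiguous (month first or day first), so the choice made here should be confirmed against the metadata for real data.

```python
from datetime import datetime

# Assumed input formats, tried in order. "%m/%d/%Y" treats the first number
# as the month; confirm this against the source metadata before relying on it.
# "%B" matches English month names under Python's default locale.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def to_iso_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {text!r}")


if __name__ == "__main__":
    print(to_iso_date("July 2, 1974"))  # 1974-07-02
```

Raising an error for unrecognized formats, rather than guessing, forces the team to make an explicit decision about each new date style that appears in the source data.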
When choosing a destination format for the harmonized data for a synthesis project, the **audience** and **future uses** of the data are also an important consideration. Consider how your synthesis team will analyze the data, as well as how the world outside that team will use and interact with the data once it is published. Again, there is no one answer, but below are a few examples of harmonized destination formats to consider.
:::{.panel-tabset}
### Long (Tidy)
Here our grassland biomass data is in long format, often referred to as "tidy" data. Data in this format is generally easy to understand and use. There are three rules for tidy data:
1. Each column is one variable.
2. Each row is one observation.
3. Each cell contains a single value.
![visual representation of the tidy data structure](images/tidy-data.png){fig-alt="Image of a table of data in 'tidy' format (i.e., where each column is one variable, each row is one observation, and each cell contains a single value)"}
**Advantages**: clear meaning of rows and columns; ease in filtering/cleaning/appending
**Disadvantages**: not as human-friendly so it can be difficult to assess the data visually
**Possible file formats**: Delimited text (tab delimited shown here), spreadsheets, database tables
![The harmonized grassland data, in long format.](images/long_data_example.png){width="75%" fig-alt="An example of tidy data in long format"}
### Wide (Untidy)
In this dataset, our grassland data has been restructured into wide format, often referred to (sometimes unfairly) as "messy" or "untidy" data. Note that the biomass variable has been split into two columns, one for control plots and one for fertilized plots.
**Advantages**: easier for some statistical analyses (ANOVA, for example); easier to assess the data visually
**Disadvantages**: may be more difficult to clean/filter/append; multiple observations per row; more likely to contain empty (NULL) cells
**Possible file formats**: Delimited text (tab delimited shown here), spreadsheets, database tables
![The harmonized grassland data, restructured into wide format with biomass values in control and fertilized columns.](images/wide_data_example.png){width="75%" fig-alt="An example of data in wide format"}
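Restructuring from long to wide format can be sketched in plain Python as follows. The rows, the column names (`biomass_control`, `biomass_fertilized`), and the choice to pair control and fertilized values by site, date, and replicate are all assumptions made for this illustration; they may not match the file shown above. In practice, `tidyr::pivot_wider()` in R or `DataFrame.pivot()` in `pandas` would usually do this work.

```python
# Invented long-format rows: one biomass observation per row.
long_rows = [
    {"site": "KNZ", "date": "2022-07-15", "rep": 1, "treatment": "C", "biomass_g_m2": 210.0},
    {"site": "KNZ", "date": "2022-07-15", "rep": 1, "treatment": "F", "biomass_g_m2": 315.0},
    {"site": "SEV", "date": "2022-09-30", "rep": 1, "treatment": "C", "biomass_g_m2": 95.5},
]

# Hypothetical wide-format column for each treatment code.
WIDE_COLUMNS = {"C": "biomass_control", "F": "biomass_fertilized"}


def long_to_wide(rows):
    """Spread the biomass variable into one column per treatment."""
    wide = {}
    for row in rows:
        key = (row["site"], row["date"], row["rep"])
        # Start each wide row with empty (None) biomass cells.
        record = wide.setdefault(key, {
            "site": row["site"],
            "date": row["date"],
            "rep": row["rep"],
            "biomass_control": None,
            "biomass_fertilized": None,
        })
        record[WIDE_COLUMNS[row["treatment"]]] = row["biomass_g_m2"]
    return list(wide.values())


if __name__ == "__main__":
    for row in long_to_wide(long_rows):
        print(row)
```

The Sevilleta row in this made-up example has no fertilized measurement, so its `biomass_fertilized` cell stays empty, which illustrates the empty-cell disadvantage listed above.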
### Relational Database
Below is an example of how we might structure our grassland data in a relational database. The schema consists of three tables that house information about sampling events (when and where data were collected), the plots from which the samples are collected, and the biomass values for each collection. The schema allows us to define the data types (e.g., text, integer), add constraints (e.g., values cannot be missing), and describe relationships between tables (keys). Relational formats are [normalized](https://en.wikipedia.org/wiki/Database_normalization) to reduce data redundancy and increase data integrity, which can help us to manage complex data[^13].
![example grassland database schema](images/grassland_schema.drawio.png){fig-alt="An example of relational data where several tables are linked by a shared column even though they have differing variables and numbers of observations"}
**Advantages**: reduced redundancy, greater integrity; community standard; powerful extensions (e.g., store and process spatial data); many different database flavors to meet specific needs
**Disadvantages**: significant metadata needed to describe and use; more complex to publish; learning curve
**Possible file formats**: Database stores; tables can also be represented as delimited text (CSV)
A richer example is a schematic of the related tables that comprise the [ecocomDP](https://ediorg.github.io/ecocomDP/)[^8] harmonized data format for biodiversity data. Eight tables are defined, along with a set of relationships between tables (keys), and constraints on the allowable values in each table.
![The ecocomDP schema. Each table has a name (top cell) and a list of columns. Shaded column names are primary keys, hashed columns have constraints, and arrows represent relations between keys/constraints in different tables.](images/ecocomDP_schema.jpg){width="75%" fig-alt="Another example of a relational set of tables implicitly connected by shared index columns. Curved lines indicate which tables are connected to one another and which columns define that relationship"}
[^8]: O'Brien, Margaret, et al. "ecocomDP: a flexible data design pattern for ecological community survey data." Ecological Informatics 64 (2021): 101374. https://doi.org/10.1016/j.ecoinf.2021.101374
[^13]: Zimmerman, N. 2016. [Hand-crafted relational databases for fun and science](https://carpentries.org/blog/2016/12/hand-crafted-databases/)
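Returning to the three-table grassland schema described at the top of this tab, the sketch below shows how data types, constraints, and keys might be declared using SQLite through Python's built-in `sqlite3` module. The table names, column names, and inserted values are hypothetical and are not taken from the schema figure.

```python
import sqlite3

# Hypothetical three-table schema: sampling events, plots, and biomass values.
SCHEMA = """
CREATE TABLE sampling_event (
    event_id   INTEGER PRIMARY KEY,
    site       TEXT NOT NULL,
    event_date TEXT NOT NULL              -- stored as YYYY-MM-DD
);
CREATE TABLE plot (
    plot_id    INTEGER PRIMARY KEY,
    plot_label TEXT NOT NULL,
    treatment  TEXT NOT NULL CHECK (treatment IN ('C', 'F'))
);
CREATE TABLE biomass (
    biomass_id   INTEGER PRIMARY KEY,
    event_id     INTEGER NOT NULL REFERENCES sampling_event (event_id),
    plot_id      INTEGER NOT NULL REFERENCES plot (plot_id),
    biomass_g_m2 REAL NOT NULL
);
"""

JOIN_QUERY = """
SELECT e.site, e.event_date, p.treatment, b.biomass_g_m2
FROM biomass AS b
JOIN sampling_event AS e ON e.event_id = b.event_id
JOIN plot AS p ON p.plot_id = b.plot_id
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory database and load one made-up observation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce keys by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sampling_event VALUES (1, 'KNZ', '2022-07-15')")
    conn.execute("INSERT INTO plot VALUES (1, 'K-01', 'C')")
    conn.execute("INSERT INTO biomass VALUES (1, 1, 1, 210.0)")
    conn.commit()
    return conn


if __name__ == "__main__":
    connection = build_database()
    print(connection.execute(JOIN_QUERY).fetchall())
    try:
        # Plot 99 does not exist, so the foreign key constraint rejects this row.
        connection.execute("INSERT INTO biomass VALUES (2, 1, 99, 50.0)")
    except sqlite3.IntegrityError as error:
        print("Rejected:", error)
```

The rejected insert shows the integrity benefit in action: the database itself refuses a biomass value that points to a plot that was never defined.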
### Cloud-native
There are many possibilities to make large synthesis datasets available and useful in the cloud. These require specialized knowledge and tooling, and reliable access to cloud platforms.
**Advantages**: easier access to big (high volume) data, can integrate with web apps
**Disadvantages**: less familiar/accessible to many scientists, few best practices to follow, costs can be higher
**Possible file formats**: Parquet files, object storage, distributed/cloud databases
![A few of the cloud-native technologies that might be useful for synthesis research products.](images/cloud_native.png){width="75%" fig-alt="A stylized cloud containing the logos for Parquet, S3, PostgreSQL, and Apache Arrow"}
### Other...
There are many, many other possible harmonized data formats. Here are a few possible examples:
* DarwinCore archives for biodiversity data
* Organismal trait databases
* Archives of cropped, labeled images for training machine or deep learning models
* Libraries of standardized raster imagery in Google Earth Engine
:::
## Data Analysis
Once the team has found sufficient source data, then cleaned, filtered, and harmonized countless datasets, and documented and described everything with quality metadata, it is finally time to analyze the data! Great! Load up R or Python and get started, and then tell us how it goes. We simply don't have enough time to cover all the ins and outs of data analysis in a three-hour course. However, we have put a few helpful resources below to get you started, and many of the best practices we have talked about, or will talk about, apply:
1. **Document your analysis steps and comment your code**, and generally try to make everything reproducible.
2. **Use version control** as you analyze data.
3. **Give everyone a chance!** Analyzing data is challenging, exciting, and a great learning opportunity. Having more eyes on the analysis process also helps catch interesting results or subtle errors.
## Synthesis Group Case Studies
**Estimated time: 10 min**
To make some of these concepts more tangible, let's consider some case studies. The following tabs contain GitHub repositories for real teams that have engaged in synthesis research and chosen to preserve and maintain their scripts in GitHub. Each has different strengths and you may find that facets of each feel most appropriate for your group to adopt. There is no single "right" way of tackling this but hopefully parts of these exemplars inspire you.
:::{.panel-tabset}
### Example 1
LTER SPARC Group: [Soil Phosphorus Control of Carbon and Nitrogen](https://lternet.edu/working-groups/soil-p-control-of-c-and-n/)
Stored their code here: {{< fa brands github >}} [lter / **lter-sparc-soil-p**](https://github.com/lter/lter-sparc-soil-p)
**Highlights**
- Straightforward & transparent numbering of workflow scripts
- File names also reasonably informative even without numbering
- Simple README in each folder written in human-readable language
- Custom `.gitignore` safety net
    - Controls which files are "ignored" by Git (prevents accidentally sharing data/private information)
### Example 2
LTER Full Synthesis Working Group: [The Flux Gradient Project](https://lternet.edu/working-groups/the-flux-gradient-project/)
Stored their code here: {{< fa brands github >}} [lter / **lterwg-flux-gradient**](https://github.com/lter/lterwg-flux-gradient)
**Highlights**
- _Extremely_ consistent file naming conventions
- Strong use of sub-folders for within-project organization
- Top-level README includes robust description of naming convention, folder structure, and order of scripts in workflow
- Active contribution to code base by nearly all group members
    - Facilitated by strong internal documentation and consensus-building prior to choosing this structure
### Example 3
LTER Full Synthesis Working Group: [From Poles to Tropics: A Multi-Biome Synthesis Investigating the Controls on River Si Exports](https://lternet.edu/working-groups/from-poles-to-tropics-a-multi-biome-synthesis-investigating-the-controls-on-river-si-exports/)
Stored their code here: {{< fa brands github >}} [lter / **lterwg-silica-spatial**](https://github.com/lter/lterwg-silica-spatial)
**Highlights**
- Files performing similar functions share a prefix in their file name
- Use of GitHub "Release" feature to get a persistent DOI for their codebase
- Separate repositories for each manuscript
- Nice use of README as pseudo-bookmarks for later reference to other repositories
:::
For more information about LTER synthesis working groups and how you can get involved in one, click [here](https://lternet.edu/synthesis/).
## Additional Resources
### Data Preparation
**Data cleaning and filtering resources**
* Data cleaning is complicated and varied, and entire books have been written on the subject.[^9][^10] For some general considerations on cleaning data, see EDI's "[Cleaning Data and Quality Control](https://edirepository.org/resources/cleaning-data-and-quality-control)" resource.
* [OpenRefine](https://openrefine.org/) is an open-source, cross-platform tool for iterative, scripted data cleaning.
* In the R language, the `tidyverse` libraries (particularly `tidyr` and `dplyr`) are often used for data cleaning, as are additional libraries like [`janitor`](https://sfirke.github.io/janitor/).
* In Python, `pandas` and `numpy` libraries provide useful data cleaning features. There are also some stand-alone cleaning tools like [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) (started as a re-implementation of the R version) and [`cleanlab`](https://docs.cleanlab.ai/stable/index.html) (geared towards machine learning applications).
* Both the R and Python data science ecosystems have excellent documentation resources that thoroughly cover data cleaning. For R, consider starting with Hadley Wickham's *R for Data Science* book chapter on [data tidying](https://r4ds.hadley.nz/data-tidy),[^11] and for Python, see Wes McKinney's *Python for Data Analysis* book chapter on [data cleaning and preparation](https://wesmckinney.com/book/data-cleaning).[^12]
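
As a rough illustration of the kind of cleaning these libraries support, here is a minimal `pandas` sketch. The data frame, its column names, and the `clean()` helper are all hypothetical and invented for this example; real cleaning steps depend on your data and should be agreed on by your team.

```python
import pandas as pd

# Hypothetical raw data with common problems: messy column names,
# stray whitespace, inconsistent case, text in a numeric column,
# a duplicated row, and a missing site label
raw = pd.DataFrame({
    " Site Name ": ["A", "A", "b ", None],
    "Temp (C)": ["12.5", "12.5", "n/a", "14.0"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of `df` (illustrative steps only)."""
    out = df.copy()

    # Standardize column names: trim, lowercase, and replace runs of
    # non-alphanumeric characters with a single underscore
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )

    # Tidy text values so that "b " and "B" are treated as the same site
    out["site_name"] = out["site_name"].str.strip().str.upper()

    # Convert to numeric; entries that cannot be parsed become NaN
    out["temp_c"] = pd.to_numeric(out["temp_c"], errors="coerce")

    # Drop exact duplicate rows, then rows with no site label
    out = out.drop_duplicates().dropna(subset=["site_name"])
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Keeping every step in a script like this, rather than editing the raw file by hand, means the cleaning can be re-run and reviewed by collaborators.
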
[^9]: Osborne, Jason W. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage publications, 2012.
[^10]: Van der Loo, Mark, and Edwin De Jonge. Statistical data cleaning with applications in R. John Wiley & Sons, 2018. https://doi.org/10.1002/9781118897126
[^11]: Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for data science. O'Reilly Media, Inc., 2023. https://r4ds.hadley.nz/
[^12]: McKinney, Wes. Python for data analysis. O'Reilly Media, Inc., 2022. https://wesmckinney.com/book
**Data harmonization resources**
* For R and Python users, there are, again, excellent documentation resources that thoroughly cover data harmonization techniques like data filtering, reformatting, joins, and standardization. In Hadley Wickham's *R for Data Science* book, the chapters on [data transforms](https://r4ds.hadley.nz/data-transform) and [data tidying](https://r4ds.hadley.nz/data-tidy) are a good place to start. In Wes McKinney's *Python for Data Analysis* book, the chapter on [data wrangling](https://wesmckinney.com/book/data-wrangling) is helpful.
* A nice [article in "The Analysis Factor"](https://www.theanalysisfactor.com/wide-and-long-data/) describes wide vs. long data formats and when to choose which (TL;DR: it depends on your statistical analysis plan).
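
To make the wide vs. long distinction concrete, the sketch below reshapes a small, made-up table with `pandas`. The site labels, years, and the `biomass` column name are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per site, one column per year
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2021": [1.2, 3.4],
    "2022": [1.5, 3.1],
})

# Wide -> long: one row per site-year observation
long_df = wide.melt(id_vars="site", var_name="year", value_name="biomass")
long_df["year"] = long_df["year"].astype(int)  # years were column names (text)

# Long -> wide: spread the years back out into columns
wide_again = (
    long_df.pivot(index="site", columns="year", values="biomass")
    .reset_index()
)

print(long_df)
print(wide_again)
```

Long format tends to suit grouped summaries and many modeling functions, while wide format is often easier to read and is what some multivariate methods expect, so it can be useful to know how to move in both directions.
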
### Data Analysis
* Harrer, M. _et al._ [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). **2023**. _GitHub_
* Once again, for R and Python users, the same two books mentioned above provide excellent beginning guidance on data analysis techniques (exploratory analysis, summary statistics, visualization, model fitting, etc.). In Wickham's *R for Data Science* book, the chapter on [exploratory data analysis](https://r4ds.hadley.nz/eda) will help. In McKinney's *Python for Data Analysis* book, try the chapters on [plotting and visualization](https://wesmckinney.com/book/plotting-and-visualization) and [the introduction to modeling](https://wesmckinney.com/book/modeling).
### Courses, Workshops, and Tutorials
- [Synthesis Skills for Early Career Researchers](https://lter.github.io/ssecr/) (SSECR) course. **2024**. LTER Network Office
- [Reproducible Approaches to Arctic Research Using R](https://learning.nceas.ucsb.edu/2024-02-arctic/) workshop. **2024**. Arctic Data Center & NCEAS Learning Hub
- [Collaborative Coding with GitHub](https://nceas.github.io/scicomp-workshop-collaborative-coding/) workshop. **2024**. NCEAS Scientific Computing team
- [Coding in the Tidyverse](https://nceas.github.io/scicomp-workshop-tidyverse/) workshop. **2023**. NCEAS Scientific Computing team
- [Shiny Apps for Sharing Science](https://njlyon0.github.io/asm-2022_shiny-workshop/) workshop. **2022**. Lyon, N.J. _et al._
- [Ten Commandments for Good Data Management](https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/). **2016**. McGill, B.
### Literature
- Todd-Brown, K.E.O., _et al._ [Reviews and Syntheses: The Promise of Big Diverse Soil Data, Moving Current Practices Towards Future Potential](https://bg.copernicus.org/articles/19/3505/2022/). **2022**. _Biogeosciences_
- Borer, E.T. _et al._ [Some Simple Guidelines for Effective Data Management](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/0012-9623-90.2.205). **2009**. _Ecological Society of America Bulletin_
### Other
- Better commit messages with [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)