Evidence for hierarchical representations of written and spoken words from an open-science human neuroimaging dataset
Code, results, and manuscript drafts for Banerjee et al.
Tables and text files
dutch_celex_database_updatedv2.csv
Contains phonetic pronunciations of Dutch words in the CELEX database. For more information, see Sun & Poeppel 2023.
SUBTLEX-NL with pos and Zipf.xlsx
Contains word frequency measures of Dutch words in the SUBTLEX database, the most up-to-date version of the CELEX database. Zipf contains log-transformed values of FREQCOUNT, the number of word occurrences in the corpus. For more information, visit OSF.
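As a rough illustration of how a Zipf value relates to a raw occurrence count, the sketch below uses the commonly cited definition of the Zipf scale: log10 of the frequency per million words, plus 3. The function name and the corpus size argument are assumptions made for illustration, and the exact procedure used to build the spreadsheet (for example, any smoothing of low counts) may differ.

```python
import math


def zipf_from_count(freq_count, corpus_size_tokens):
    """Approximate Zipf value: log10(frequency per million words) + 3.

    Both arguments are hypothetical inputs; this is not necessarily the
    exact formula used to produce the Zipf column in the spreadsheet.
    """
    if freq_count <= 0 or corpus_size_tokens <= 0:
        raise ValueError("counts and corpus size must be positive")
    per_million = freq_count / (corpus_size_tokens / 1_000_000)
    return math.log10(per_million) + 3.0
```

Under this definition, a word that occurs once per million words has a Zipf value of 3, and each tenfold increase in frequency adds 1.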
MOUS_audio_onset_offsets.xlsx
Onset times of words in each audio file played in the speech listening part of the experiment.
subtlex_phonetics.xlsx
The intersection of the CELEX and SUBTLEX databases. Contains phonetics and occurrence counts of most words in Dutch.
MOUS_word_syllable_frequencies
Contains the occurrence counts of each syllable in every word presented in the MOUS experiment.
stimuli.txt
The sentences and word lists used in both the reading and speech listening experiments of the MOUS study.
bigram_counts.csv
Cumulative bigram occurrences (per million) in the SUBTLEX text corpus.
syllable_counts.csv
Cumulative syllable occurrences (per million) in the SUBTLEX text corpus.
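One plausible way to arrive at cumulative counts like those in bigram_counts.csv is to add each word's per-million frequency to every letter bigram it contains. The sketch below is a minimal illustration under that assumption, not the repository's actual code. The function names are hypothetical, and counting a bigram once per occurrence within a word (rather than once per word) is a choice made here for simplicity. The same pattern would apply to syllables given a syllable segmentation for each word.

```python
from collections import Counter


def letter_bigrams(word):
    """Return all adjacent letter pairs in a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def cumulative_bigram_counts(word_freq_per_million):
    """Sum word frequencies over every bigram occurrence.

    word_freq_per_million: dict mapping word -> frequency per million
    (hypothetical input format).
    """
    totals = Counter()
    for word, freq in word_freq_per_million.items():
        for bigram in letter_bigrams(word.lower()):
            totals[bigram] += freq
    return totals
```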
Code
master_table.ipynb
Generates bigram, syllable, and word frequency statistics for every word presented in the MOUS experiments.
Auditory
source_auditory_trancription.py
Takes in an auditory subject's events.tsv file and an output filename and tabulates the onset times and words played during that subject's scan. Generates transcription files that are saved in each subject's source subdirectory, e.g. sub-A2002_transcription.csv.
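The tabulation step described above might look roughly like the following sketch. It assumes the events file is tab-separated with an `onset` column (as in BIDS) and that the word text lives in a column named `value`. The column name, the `n/a` filter, and both function names are assumptions for illustration, and the real script may select word events differently.

```python
import csv
import io


def tabulate_word_onsets(events_tsv_text, word_column="value"):
    """Extract (onset, word) pairs from the text of an events.tsv file.

    Rows with an empty or 'n/a' entry in word_column are skipped.
    The column name is an assumption, not confirmed by the source.
    """
    reader = csv.DictReader(io.StringIO(events_tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        word = (row.get(word_column) or "").strip()
        if not word or word == "n/a":
            continue
        rows.append((float(row["onset"]), word))
    return rows


def to_csv_text(rows):
    """Serialize (onset, word) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["onset", "word"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Working on text rather than file paths keeps the sketch easy to test. Reading the input with `open(path, newline="")` and writing the result to the output filename would wrap these two functions.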
source_auditory_transcription_loop.ipynb
Runs the above over all auditory subjects.
calculate_syllable_frequencies.m
Takes in a 'transcription' generated by the above script and returns the syllable frequencies. Creates two .csv files, e.g., sub-A2002_transcription_syllables_raw.csv, which contains all onset times (including words for which syllable segmentations couldn't be sourced), and sub-A2002_transcription_syllables_processed, which only preserves the onset times and frequencies for words with available syllable segmentations.
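The difference between the raw and processed outputs can be illustrated with a short Python sketch (the actual script is written in MATLAB). Here `syllable_freqs` is a hypothetical lookup from word to a syllable frequency value, and a missing entry stands in for a word whose syllable segmentation could not be sourced. The names and data layout are assumptions.

```python
def split_raw_and_processed(onsets_words, syllable_freqs):
    """Build 'raw' and 'processed' row lists from (onset, word) pairs.

    raw keeps every row, with None where no frequency is available;
    processed keeps only rows that have a frequency value.
    """
    raw, processed = [], []
    for onset, word in onsets_words:
        freq = syllable_freqs.get(word)  # None if no segmentation found
        raw.append((onset, word, freq))
        if freq is not None:
            processed.append((onset, word, freq))
    return raw, processed
```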
SPM_auditory_word_frequency_1st_level.m
Runs SPM12 first-level analysis for word frequency across all auditory subjects. For a primer on this technique, see Andy's Brain Book.
SPM_auditory_word_frequency_2nd_level.m
Runs SPM12 group-level analysis for word frequency.
SPM_auditory_syllable_frequency_1st_level.m
Runs SPM12 first-level analysis for syllable frequency across all auditory subjects.
SPM_auditory_syllable_frequency_2nd_lvel.m
Runs SPM12 group-level analysis for syllable frequency.
Visual
source_visual_transcription.m
Converts an events.tsv file to a cleaned CSV containing the onset time and word presented.
source_visual_transcription_loop.ipynb
Runs the above function in a loop over all visual subjects.
calculate_word_frequencies_visual.ipynb
Generates CSV files containing both word frequency and minimum bigram frequency information for all words in the study.
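The minimum bigram frequency of a word can be read as the frequency of its rarest letter bigram. The sketch below shows that idea under stated assumptions: `bigram_freq` is a hypothetical dict of bigram frequencies (for example, loaded from a table like bigram_counts.csv), bigrams missing from the table are treated as having frequency 0, and single-letter words return None. The notebook may handle these edge cases differently.

```python
def min_bigram_frequency(word, bigram_freq):
    """Return the lowest frequency among a word's letter bigrams.

    Returns None for words shorter than two letters. Bigrams absent
    from bigram_freq count as 0.0 (an assumption of this sketch).
    """
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    if not bigrams:
        return None
    return min(bigram_freq.get(bigram, 0.0) for bigram in bigrams)
```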
SPM_visual_word_frequency_1st_level.m
Runs SPM12 first-level analysis for word frequency across all visual subjects.
SPM_visual_word_frequency_2nd_level.m
Runs SPM12 group-level analysis for word frequency.
SPM_visual_bigram_frequency_1st_level.m
Runs SPM12 first-level analysis for bigram frequency across all visual subjects.
SPM_visual_bigram_frequency_2nd_level.m
Runs SPM12 group-level analysis for bigram frequency.