Final Assignment
Data
For this assignment you will be using the Human Connectome Project. It comes in 3 different files and has a corresponding data dictionary (where it explains what each variable means)
- Download hcp-demos.csv
- Download hcp-personality.csv
- Download hcp-cognitive.csv
- Download hcp-data-dictionary.csv
About the HCP Dataset
The Human Connectome Project (HCP) was a landmark study for cognitive neuroscience, headed by Wash U and the University of Minnesota. It is a large-scale multi-center project that enabled the study of individual differences in neural functioning that subserve higher-order cognitive functions (Van Essen et al., 2013). The HCP collected a ton of data on 1200 participants, including genetic, physiological, self-report, behavioral, and neuroimaging (structural MRI, resting state functional MRI, and task activation functional MRI). This makes it among the largest and richest publicly available cognitive neuroscience datasets in existence.
For the purposes of this class and this final project, I am providing you with an extremely pared down version of this dataset. You should use the data dictionary for more information on each column.
You can find the grading rubric on Canvas. This assignment is worth 25 points total
- Each of the 4 sections are worth 6.25 points each in total
- 4 of the points will be for accuracy (including code and interpretations)
- 2.25 points will be for pretty code (is it easy to read, do your naming conventions make sense, do you go to new lines when appropriate, is it easy to reproduce your code etc.).
- You MUST use a RNotebook/RMarkdown for this assignment. You will submit 2 files:
- The knitted .html file (do not turn in a
nb.html
file) - Your working .Rmd file
- The knitted .html file (do not turn in a
- You should write your code following the principles of open science and reproducibility. That is, you need to write your code in a way that other people can understand and reproduce your code. Think carefully about what is needed for someone else to reproduce your code!
I strongly suggest that you read through the entire assignment before starting. Good luck!
Part 1 – Data Wrangling
- Load in whatever packages you’ll be using for the rest of the assignment (I do not want to see loading packages anywhere else in your final document except Part 1)
- Import the 3 datasets
- You must tidy them up so that they are merged into a single data.frame that looks like the image below. It must look exactly like this! No extra columns, no weird names etc. The column names must match what you see in the image. The format of your columns must also match (especially
Subject
andAge
columns). Please see below for hints on how to do this. - Call your data.frame
hcp
Here’s what it needs to look like:
Hints for Part 1
Demographic Dataset
- This one contains socioeconomic status values. You are not concerned with those values. You only want
Subject
,Gender
, andAge
columns. - At some point, you might need to use the following line in your code. You should copy/paste this directly. Where you put it is up to you. It replaces dots with dashes.
mutate_at(vars("Age"), ~str_replace(string = ., pattern = "\\.", replacement = "-"))
(please note that you don’t have to use this, but depending on the approach you take, you might find it helpful) - One of the age groups should be
36+
. You might need to recode a variable in order to do this.
Cognitive Dataset
- All of the cognitive tasks used a form of counterbalancing. Participants were assigned to be in either group A or group B. Each participant was only assigned to 1 counterbalance group (that is, if subject 1 was assigned to counterbalance group A, they were in that same group for all of the cognitive tasks).
- You don’t care about the counterbalancing and do not want to keep it in your final dataset.
Personality Dataset
- Nothing extra here. Normal tidying.
General Hints
- If you are going to get rid of
NA
values, be extremely mindful of when you do this tidyverse
functions can have suffixes like_at
and_if
- You must merge each of these 3 datasets into a single data.frame. We didn’t go over this explicitly, but part of this class is being able to use Google effectively. My tip is that if you are using one framework (like
tidyverse
vs.base R
), stay in that framework while merging.
Part 2 – Describing Your Data
Pick 5 of the variables from the dataset that you find particularly interesting for the following section. Your job is to explore each of these 5 variables individually (that is, you are not looking at the relationship between any of these variables).
- Make a figure showing if the variable is normally distributed. This should be a single figure with 5 subplots (do not overlay onto the same figure and do not make 5 individual plots; yes, we went over how to do this)
- Write some formatted text describing your results (aka text to accompany the figure, like you would in a manuscript)
- Report at least 1 measure of central tendency and 1 measure of variability. Put these in the figure itself (top or bottom corner).
- To accomplish this goal, you must use
ggplot2
- HINT: There’s a very easy way to do this, but you need to think about the layout/format of your data.
Part 3 – Statistical Modeling
Pick 3 independent variables and 1 dependent variable and run a multiple regression. This is the same as a simple (univariate) regression, but now you have more predictors. Do NOT run any interaction terms. Pick your variables based on a hypothesis you have about these data.
- In formatted text, describe the hypothesis. What are you testing?
- Run the analysis in a code chunk and print out the results
- In formatted text, interpret the findings. Note when you have more than 1
in your regression – excluding the intercept – we add the phrase “holding all other predictors constant”.- Interpret each regression coefficient
- Interpret the omnibus test
- Interpret the model fit
- One of the assumptions of regression is that the fitted values (meaning the predicted values) should NOT correlate at all with the residual values. If you plot them, there should be a very flat horizontal line at 0. That’s because the fitted (aka predicted) values include the exact same residuals in their predictions. However, if you see that there is a correlation between the fitted and residual values, that indicates that your regression has some issues (could be a number of different things, including a missing predictor). Plot the fitted (X-axis) vs. residuals (Y-axis) and describe what you see. Do this using
ggplot2
notbase R
. HINT: you will need to extract both fitted and residuals values. Yes, we went over ways to get these.
Part 4 – Simulating Power
In Psychology and Neuroscience, we aim for 80% statistical power. You are going to simulate the power of the OMNIBUS test of the regression you ran above.
- You’ll need to simulate data for all IVs and DVs. Pick the simulation function based on the original data. Think carefully about the output of
rnorm()
vs.rpois()
vs. any other function you are considering. When simulating the data, you want the simulated variables to have the same properties as the original variables. That is, you should get different individual values, but if you looked at the summary statistics of the simulated data and the original data, they should be very similar. - You will run your regression with your simulated data 5000 times such that you get an output vector of 5000 values. You are NOT simulating anything else from the regression.
- HINT: go back to the peer assignments and look for clues
- In formatted text, describe:
- What is statistical power? Why is it important?
- What was the power of this regression you ran in the section above? Interpret it.
CONGRATULATIONS!
Once this is submitted, you will have officially completed the class. Thank you for a wonderful semester! You’ve all worked really hard, and hopefully you feel like you’ve gained a new, useful skill. You can always email me if you have R
or stats related questions. Keep in touch and best of luck!