Assignment 1

Due by 11:30am on Monday, September 16, 2024

Assignment

Introduction: Data Cleaning & Management

The following are important steps to take for any code you might run. You want to make sure things look the way they ought to look, and that you (and whoever is reading your code) knows what is happening.

Here’s your to-do list:

Import your data. This needs to eventually be a line of code at the top of your script.
In 1-2 sentences, describe your dataset. No need for a ton of detail, just generally, what’s in it.
Using code, find the object class of the first 5 column vectors of your data set. Then find the object class of the entire data set. Describe if these make sense, or are there any that were unexpected? If they were unexpected, why?
Check out your variables that are text strings or characters. Do you want them to remain this way, or do you want to convert them into categorical variables, called factors? If they should be treated as categorical variables, please do so using the factor() function.

Here’s a HINT for the last item:

new_column <- factor(data$column) makes a completely new object outside of the dataset.
data$new_column <- factor(data$column) makes a new column that lives within the existing dataset.
data$column <- factor(data$column) overwrites the original column.

Results: Summary Statistics – Part 1

One of the first items in the “Results” section of an academic paper is to look at summary statistics of the variables of interest. Here’s your shot!

Here’s your to-do list:

For 2 of your numeric/integer variables, get the mean, standard deviation, and range. If you don’t know the function names, Google it!
For your categorical variable, get very quick counts of each level (e.g., how many observations are there in group 1 vs. group 2).

Here are some HINTS:

In your functions, pay attention to missing data!
The easiest function to use for your second to-do item was shown in the slides under “commonly used functions”.

Results: Summary Statistics – Part 2

Another thing we check in the Results section of a paper is whether our variables of interest might differ based on something we don’t want. For example, if all of our male participants were notably older than all of our female participants, it might screw up the interpretation of our results.

Here’s your to-do list:

Make a smaller datasets that only contain items from each level of your categorical variable. For example, let’s say the factor “treatment_group” had 2 levels: treatment and control. Make a new data.frame that is something like data_treatment and another new data.frame that is data_controls. Do this via INDEXING and LOGICAL OPERATORS. See hints below for an example what is meant by a smaller dataset.
One thing you might notice is that when you make your new mini data.frames, if you look at the factor variable, it still has the original number of levels from the bigger data.frame – even though the other levels are not in there. Overwrite that variable to be a factor again (same thing as in the Data Cleaning section above). After you should have only 1 level (ex: all treatment or all controls).
Now repeat the same summary statistics from above (for the numeric/integers only) but on each of your new mini data.frames. Do these numbers seem dramatically different or are they similar? What does that mean for your analysis?

Here is a HINT:

Logical operators result in TRUE or FALSE. You can use this within indexing so that anything that evaluates to TRUE will be either the rows or the columns that are kept.
If this is your original dataset…

Then the smaller datasets might look like these:

Congrats, you’ve finished!

Submit both your code and your data file to Canvas.