Project

Time to put everything you learned in this class into action. In an exploratory data analysis (EDA) you are exploring the data to learn about it and about possible relationships between variables. This is not a formal statistical analysis, so you cannot make any claims about groups being statistically different. It is purely descriptive. You are allowed and encouraged to hypothesize why you observe certain relationships or data characteristics, but be sure not to draw any firm conclusions from the data.

See prior examples

This is not a research class, and so we suggest that you take a look at some projects from prior students to get an idea of what is expected and what a final project should look like.

Instructions

Using your data set of choice, pose a brief research question that explores the relationship between 2-3 variables.

Use markdown headers to make the following sections:

Introduction: A short introduction/description of the data.
- Specifically mention the 2-3 variables you are going to explore.
- What is your research question? What are you interested in finding out more about?
Univariate Exploration: Describe each of the variables under consideration.
- This means calculating summary statistics appropriate for your data type (N (%) for a categorical variable, mean (sd) for a numeric variable) and making a graphic.
- Describe what you see in the graphic, and what the numbers say in a sentence or two.
Bivariate Exploration: Comparison between two variables of interest.
- Calculate grouped summary statistics as appropriate. This is the most often forgotten part.
- Create an appropriate bivariate visualization
- You can go further and explore more than two variables at a time using paneling, but be sure to explain what you learn from each graph.
- Describe the patterns you are noticing, with summary numbers, in a sentence or two.
Conclusion: What did you find? If you had a prior hypothesis, does the data seem to support it? Remember this is NOT a statistical analysis.
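As a rough illustration of the univariate step above, here is a minimal sketch in base R. It uses the built-in `mtcars` data purely as a stand-in for your own data set, and the names `cars` and `transmission` are made up for this example. Base R is used only so the sketch runs without extra packages; use whichever tools you prefer for your own summaries and graphics.

```r
# Illustrative only: mtcars ships with base R. `am` and `mpg` are stand-ins
# for your own categorical and numeric variables.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Categorical variable: N (%)
n_by_level   <- table(cars$transmission)
pct_by_level <- round(100 * prop.table(n_by_level), 1)
n_by_level
pct_by_level

# Numeric variable: mean (sd)
mpg_mean <- mean(cars$mpg, na.rm = TRUE)
mpg_sd   <- sd(cars$mpg, na.rm = TRUE)
round(c(mean = mpg_mean, sd = mpg_sd), 2)

# One graphic per variable
barplot(n_by_level, ylab = "Number of cars", main = "Transmission type")
hist(cars$mpg, xlab = "Miles per gallon", main = "Distribution of mpg")
```

Each summary and graphic should then be followed in your report by a sentence or two describing what it shows.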

This is a very vague set of instructions for a reason. I want you to explore and choose a pair of variables that you find interesting. Create tables, graphics, and grouped summary statistics (for example, the mean of the continuous variable across levels of the categorical variable), or whatever else you need to understand the relationship between these two measures.
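For instance, grouped summary statistics for a numeric variable across the levels of a categorical variable could be sketched as follows. This again uses the built-in `mtcars` data as a placeholder, with illustrative names (`cars`, `transmission`, `grouped`) that are not part of the assignment. It is one possible approach, not the required one.

```r
# Illustrative only: substitute your own data set and variables.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Split the numeric variable by the levels of the categorical variable
mpg_by_trans <- split(cars$mpg, cars$transmission)

# Grouped summary: n, mean and sd of mpg within each transmission type
grouped <- data.frame(
  n    = sapply(mpg_by_trans, length),
  mean = sapply(mpg_by_trans, mean),
  sd   = sapply(mpg_by_trans, sd)
)
round(grouped, 2)

# A matching bivariate graphic: numeric variable by categorical variable
boxplot(mpg ~ transmission, data = cars,
        xlab = "Transmission type", ylab = "Miles per gallon")
```

Reporting the grouped means alongside the plot gives you the summary numbers to cite when you describe the pattern.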

Use the grading rubric at the end of this document for guidance on what you should present, in what order, and at what level of detail.

Data

You have a choice here. If you are currently working on some data that you would like to explore, talk with your instructor to get your data set approved. As long as it has more than 4 variables and at least 30 observations, it should be fine.
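If you want a quick check of the size guideline before asking for approval, something like the sketch below would do. The helper `meets_size_rule()` is a hypothetical name invented for this example, and `mtcars` stands in for your own data. Passing the check does not replace instructor approval.

```r
# Hypothetical helper: TRUE when a data frame has more than 4 variables
# (that is, at least 5 columns) and at least 30 observations (rows).
meets_size_rule <- function(dat, min_vars = 5, min_obs = 30) {
  ncol(dat) >= min_vars && nrow(dat) >= min_obs
}

dim(mtcars)               # rows (observations), then columns (variables)
meets_size_rule(mtcars)
```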

If you do not have your own data, you can choose from one of the following data sets, all of which can be downloaded from the Data page of Dr. D’s teaching course website.

Note that you can’t use the following data sets because we’ve used them too much already: diamonds, NCbirths, email, or penguins.

Guidelines

Render early and often, as often as every time you add a new R code chunk.
Spell check your report prior to submission using RStudio.
Re-read your report and edit it for clarity and to remove duplicated information.
Remove superfluous code and output (e.g., printing a data set to the screen).
This is to be independent work. Papers that are too similar will receive no credit.
Look at the grading rubric to help you decide the level of detail required.

Peer Review

After the submission deadline, your analysis projects will be randomly assigned to two other people to peer review and score. This means you will also score and provide feedback on 2 reports. Your instructor will also score all projects for your class section.

PR Instructions for Spring 26

How to conduct Peer Reviews

On the project due date, Canvas will automatically assign peer reviews to everyone who has submitted the project on time.

See this help page on how to find and complete peer reviews in Canvas.

Use the rubric to assign a score in each area.
Using the commenting feature, provide 4 comments for each project.
- Two positive: What specific features did they include that you liked or found helpful?
- Two improvements: What can they do differently or better next time? Did you find a bug in their code?

Late submissions

If you submit your EDA late, I will have to manually assign you projects to review. This may take a day or two. Please feel free to ask and remind me if you are waiting too long.

Grading

The criteria below are what you will be graded on. Below each criterion is an example of the points awarded for each level of competency. Use these criteria when you score your peers' reports.

1. Data Description

Provide a description of the data set and the variables of interest.

(Novice) There is no description or the description is a copy of the help file.
(Competent) There is a minor description of the data but not enough to understand what is being measured or compared.
(Proficient) The data description is clear and concise. It is clear to me what data is being analyzed and where it was obtained.

2. Univariate Description

Fully describe the distribution of each variable by itself.

(Novice) There are no numerical or graphical summaries provided.
(Competent) Only numeric or only graphical summaries were created, but no textual description.
(Proficient) The variable was fully described using both numeric and graphical summary methods.

3. Bivariate Comparison

Describe the relationship between the two chosen variables.

(Novice) No comparison was made, or the variables were compared, but inappropriate graphics or summary statistics were created.
(Competent) The variables were compared using appropriate graphical methods and grouped summary statistics were created, but nothing was discussed.
(Proficient) The variables were compared using appropriate graphical methods and grouped summary statistics, and a short textual explanation of what the summaries showed was provided.

4. Organization / Grammar

How well does the report read? How well organized is it? Was it checked for grammar and spelling mistakes?

(Novice) Only R code and output are present. There is no discussion of results. Tons of extra R code that is not relevant to the discussion is present. Markdown headers were not used.
(Competent) An attempt was made to discuss the results, but the explanations are not in a report format or there are some large grammar and/or spelling problems. Some R code that is not relevant to the analysis question at hand is being displayed. Markdown headers were used to create sections.
(Proficient) The report was written in well-edited, full English sentences and was spell checked prior to submission. The report flowed well and followed the required order of discussion topics, with markdown headers used successfully.