Putting it all together

Time to put everything you learned in this class into action. In an exploratory data analysis (EDA) you are just looking at (exploring) the data and learning about the data and possible relationships between variables. This is not a formal statistical analysis, you cannot make any claims about groups being statistically different. This is just descriptive. You are allowed and encouraged to hypothesize why you observe certain relationships or data characteristics, just be sure not to draw any conclusions from the data.

Instructions

Using your data set of choice, pose a brief research question that explores the relationship between 2-3 variables.

Use markdown headers to make the following sections

  1. Introduction: A short introduction/description of the data.
    • Specifically mention the 2-3 variables you are going to explore.
    • What is your research question? What are you interested in finding out more about?
  2. Univariate Exploration: Describe each of the variables under consideration.
    • This means calculate some summary statistics (N(%) or mean(sd)) and make a graphic
  3. Bivariate Exploration: Comparison between two variables of interest.
    • Calculate grouped summary statistics as appropriate. This is often the most often forgotten part
    • You can go further and explore more than two variables at a time using paneling, but be sure to explain what you learn from each graph.
  4. Conclusion: What did you find? If you had a prior hypothesis, does the data seem to support it? Remember this is NOT a statistical analysis.

All descriptions (univariate and bivariate) must be done using graphics, summary statistics, and words.

This is a very vague set of instructions for a reason. I want you to explore and choose a pair of variables that you find interesting. Create tables, graphics, grouped summary statistics (mean of the continuous variable across levels of the categorical variable). Whatever you need to do to understand the relationship between these two measures.

Use the grading rubric at the end of this document for guidance as to what you should present, in what order, and level of detail you need to present.

Data

You have a choice here. If you are currently working on some data that you would like to explore, talk with your instructor to get your data set approved. As long as it has more than a few variables in it, and at least 30 observations it should be fine.

If you do not have your own data, you can choose from one of the following data sets, all of which can be downloaded from the Data page of Dr. D’s teaching course website. Here are some viable choices:

  • Email Spam: Characteristics of emails used to predict if the email is spam or not.
  • HIV: Data on adolescent children living with HIV positive parents.
  • Depression: Level of depression (cesd), health care, and demographic characteristics.
  • High School and Beyond: Educational, vocational, and personal development of elementary and high school students.
  • Police Shootings: Characteristics of individuals killed by police in 2015.

Any other data sets require instructor approval. Data sets such as the plants or arm strength data sets are insufficiently complex for this project. Also you can’t use the dsmall, diamonds, or NCbirths data sets because we’ve used them too much already.

Guidelines

  • Render early and often. As often as every time you include a new R code chunk.
  • Spell check your report prior to submission using RStudio.
  • Re-read your report and edit for clarification and removing duplicated information.
  • Remove superfluous code and output (i.e. printing a data set to the screen).
  • This is to be independent work. Papers that are too similar will receive no credit.
  • Look at the grading rubric to help you decide the level of detail required.

Scoring Rubric

review these with others. we’re not really teaching how to interpret.

The criteria below is what you will be graded on.

Criteria Novice Competent Proficient
Data Description: Provide a description of the data set and the variables of interest. There is no description or the description is a copy of the help file. There is a minor description of the data but not enough to understand what is being measured or compared. The data description is clear and concise, it is clear to me what data is being analyzed and where it was obtained.
Univariate Description: Fully describe the distribution of each variable by itself There are no numerical or graphical summaries provided. Only numeric or only graphical summaries were created, but no textual description. The variable was fully described using both numeric and graphical summary methods. This information was summarized below the output in a paragraph form.
Bivariate Comparison: Describe the relationship between the two chosen variables. No comparison was made, or the variables were compared, but inappropriate graphics or summary statistics were created. The variables were compared using appropriate graphical methods and grouped summary statistics were created, but nothing was discussed. The variables were compared using appropriate graphical methods and a short textual explanation of what the summaries showed.
Organization / Grammar: How well does the report read? How well organized is it? Was it checked for grammar and spelling mistakes? Only R code, output is present. There is no discussion of results. Tons of extra R code that is not relevant to the discussion is present. Markdown headers were not used. An attempt was made to discuss the results, but the explanations are not in a report format or there are some large grammar and/or spelling problems. Some R code that is not relevant to the analysis question at hand is being displayed. Markdown headers were used to create sections. The report was spell written in well edited, full English sentences, and spell checked prior to submission. The report flowed well and followed the required order of discussion topics with markdown headers used successfully.