stat480

Project two

Data

The United States Bureau of Justice Statistics publishes crime data at various levels of aggregation in its Data Online program. We have picked part of the data (state-level homicide numbers from 1976 to 2005) for you to investigate more closely.

In this project, you will perform a complete organisation, selection and summary of the data set. You will practice the R skills you learned in class.

We have provided a file (homicides.csv) that contains the number of homicide victims by state and year for 1976 to 2005. Additional data on victims includes age (in several categories), race, and gender. A second file (crime.csv) contains crime numbers and rates for each state between 1960 and 2005. You are encouraged to draw in other data; the US Bureau of Justice Statistics or the US Census Bureau might be good places to look. If you would like to investigate some other aspect of the data, you can, but you'll need to tidy up the data yourself. This is a good idea if you want to get really good marks!

Deliverables

There are two parts to this project. In the first, you'll work out which data you will use and how. There is much more data available online, so you will first need to do some research into the direction your analysis might take. Then decide on the questions you aim to answer, and download and incorporate the data necessary to answer them. In the second part, you'll try to answer your questions using the reshaping and graphics techniques discussed in class.
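As a rough illustration of the kind of reshaping and summarising involved, here is a minimal base R sketch. It uses a small made-up data frame rather than homicides.csv, and the column names (state, year, sex, victims) are assumptions, not the real layout of the file. Check the actual column names before adapting it.

```r
# Toy stand-in for homicides.csv. The column names and values are made up
# for illustration; inspect the real file with names() and head() first.
homicides <- data.frame(
  state   = c("Iowa", "Iowa", "Iowa", "Iowa", "Ohio", "Ohio", "Ohio", "Ohio"),
  year    = c(1976, 1976, 1977, 1977, 1976, 1976, 1977, 1977),
  sex     = c("Female", "Male", "Female", "Male",
              "Female", "Male", "Female", "Male"),
  victims = c(10, 30, 12, 28, 100, 300, 110, 290)
)

# Summarise: total victims per state and year, collapsing over sex
totals <- aggregate(victims ~ state + year, data = homicides, FUN = sum)

# Reshape long to wide: one row per state, one column per year
wide <- reshape(totals, idvar = "state", timevar = "year", direction = "wide")

print(totals)
print(wide)
```

The long form (totals) is usually the easier one to plot, while the wide form (wide) is often the easier one to read as a table in a report.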

Here are some questions to think about:

Deadlines

Grading rubric

Overall grade breakdown:

The grading rubric that I'll use is available as a pdf, and is described in more detail below.

Introduction

The purpose of the introduction is to introduce the data set, provide some context, and guide me as to what to expect from the rest of the report. You may find it easiest to write last, after the rest of the report. It should be about a page in length.

Questions and findings

You should have approximately four or five main questions and associated findings, each of which may be broken down further into more specific minor questions. Some of these questions will occur to you immediately upon looking at the data, and some will require considerable exploration before they occur to you. To get to the four or five questions that you report on, I'd expect you to have had 20 or more questions. A lot of the time you will run into a dead end, or the answer to your question will turn out to be uninteresting or obvious. It is always disappointing not to report on something that you spent time working on, but it does make for a better report. You might want to briefly mention some of the dead ends you went down to demonstrate that you've done more than just the obvious.

A good way to present this material is to have one plot or table on a page, along with an accompanying description of what the plot tells you. Don't forget to use headings to break up the sections. You may need multiple plots and tables for each question.

As with your homework, I will assess the questions and findings based on the three criteria of curiosity, scepticism and organisation.

In all real data sets you will need to spend a lot of time cleaning up the data - fixing incorrect values, dealing with missing values etc. Don't forget to give a brief description of what you did - that could count as one of your 4-5 questions/findings.
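A minimal sketch of what such a clean-up step might look like in base R is given here. The data frame is made up, the column names are assumptions, and treating negative counts as missing is just one possible choice; whatever you decide, say so in your report.

```r
# Made-up data with one missing value and one impossible value.
# Column names are assumptions, not the real layout of homicides.csv.
raw <- data.frame(
  state   = c("Iowa", "Iowa", "Ohio", "Ohio", "Ohio"),
  year    = c(1976, 1977, 1976, 1977, 1978),
  victims = c(40, NA, 400, -1, 410)
)

# Count the missing values in each column
n_missing <- colSums(is.na(raw))

# Recode impossible (negative) counts as missing instead of keeping them
clean <- raw
clean$victims[!is.na(clean$victims) & clean$victims < 0] <- NA

# Keep a record of the affected rows so they can be described in the report
problems <- clean[is.na(clean$victims), c("state", "year")]

# Any later summary has to state how the missing values are handled
mean_victims <- tapply(clean$victims, clean$state, mean, na.rm = TRUE)

print(n_missing)
print(problems)
print(mean_victims)
```

Working on a copy (clean) and leaving raw untouched makes it easy to show exactly what was changed.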

Conclusion

The conclusion should summarise your findings. Rather than just repeating what you've already said, try and weave your findings together into a consistent story. You should also reflect a little on other questions that the exploration raised, and what you would do next. Do you need to collect more data? Or collect data in a different way?

Presentation

I'll also mark the general presentation of the project. This is divided into three parts: text, tables and graphics. Tables and graphs should follow the guidelines we have discussed in class. If you're struggling with the writing, the writing and media help centre can help. You can also read over your past assignments to find the things that we do and don't like.

Reproducibility

Last, but not least, your report should include an (electronic) appendix which allows the reader to reproduce your findings. For this project, this would be the csv files containing the (additional) data and a commented(!) R script to reproduce your findings.
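One possible shape for such a script is sketched below. The function names (load_homicides, victims_by_year) and the column names are hypothetical, and a temporary file stands in for homicides.csv so that the sketch runs on its own.

```r
# analysis.R: a hypothetical skeleton. File, function and column names
# are assumptions made for illustration.

# 1. Load the data. In the real script, pass a path relative to the
#    script (e.g. "homicides.csv"), not an absolute path on your machine.
load_homicides <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# 2. Put each finding in its own commented function so it can be rerun alone.
#    Finding: total victims per year, summed over all states.
victims_by_year <- function(homicides) {
  aggregate(victims ~ year, data = homicides, FUN = sum)
}

# Demonstration: write a small made-up data set to a temporary csv file
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    state   = c("Iowa", "Ohio", "Iowa", "Ohio"),
    year    = c(1976, 1976, 1977, 1977),
    victims = c(40, 400, 42, 390)
  ),
  demo_path,
  row.names = FALSE
)

by_year <- victims_by_year(load_homicides(demo_path))
print(by_year)

# 3. Record the R version and loaded packages alongside the results
print(sessionInfo())
```

A reader should be able to run the whole script from top to bottom in a fresh R session and get the same tables and plots that appear in the report.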

Some good examples from 2007

To give you some idea of what a great report looks like, here are some examples from previous classes.