Project Overview

This project uses the “Behavioral Risk Factors Surveillance System for Selected Metropolitan Area Risk Trends (SMART) for 2002-2010” data, available as the brfss_smart2010 dataset in the p8105.datasets package, to explore self-rated overall health among the US population, with a particular focus on New York State residents. The analysis was carried out in R version 4.3.1 with the packages dplyr, ggplot2, tidyverse, and p8105.datasets.

After cleaning, the BRFSS dataset has 10,625 observations and 23 variables. The variables are: year, location_abbr, location_desc, class, topic, question, response, sample_size, data_value, confidence_limit_low, confidence_limit_high, display_order, data_value_unit, data_value_type, data_value_footnote_symbol, data_value_footnote, data_source, class_id, topic_id, location_id, question_id, resp_id, geo_location. The data value type in this dataset is Crude Prevalence, and its unit is %. The topic of focus is “Overall Health”, the question is “How is your general health?”, and the responses are “Excellent”, “Very good”, “Good”, “Fair”, and “Poor”.
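The cleaning code itself is not shown here, so the following is only a minimal sketch of how a subset like this could be built with dplyr. It assumes the column names already match the variable list above, and it uses a small made-up table (`raw`) in place of brfss_smart2010; the name `overall_health` and the ordering of the response levels are illustrative choices, not the author's confirmed method.

```r
library(dplyr)

# Toy stand-in for the raw data. All values are made up for illustration.
raw <- tibble(
  year          = c(2002, 2002, 2002, 2010),
  location_abbr = c("NY", "NY", "NY", "MA"),
  topic         = c("Overall Health", "Overall Health", "Asthma", "Overall Health"),
  response      = c("Good", "Excellent", "Yes", "Poor"),
  data_value    = c(30.1, 22.4, 8.0, 3.2)
)

# Assumed ordering of the five responses, from worst to best.
response_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

overall_health <- raw %>%
  # Keep only the topic of interest and the five expected responses.
  filter(topic == "Overall Health", response %in% response_levels) %>%
  # Store the response as an ordered factor so plots and tables sort sensibly.
  mutate(response = factor(response, levels = response_levels, ordered = TRUE))
```

Storing the response as an ordered factor keeps “Poor” to “Excellent” in a meaningful order in later summaries and plots instead of alphabetical order.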

Visualizations and EDA

In 2002, three states (MA, NJ, and PA) were observed at 7 or more locations. In 2010, eleven states (CA, FL, MA, MD, NC, NE, NJ, NY, OH, TX, and WA) were observed at 7 or more locations. From 2002 to 2010, the number of states with 7 or more observation locations therefore rose markedly, from 3 to 11.
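One way to obtain such a list is to count distinct locations per state and year and keep the states at or above the threshold. The sketch below assumes that each location is identified by location_desc; the toy table and the state codes "AA" and "BB" are made up for illustration and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical toy rows: state "AA" has 7 locations in 2002, state "BB" has 2.
toy <- tibble(
  year          = 2002,
  location_abbr = c(rep("AA", 7), rep("BB", 2)),
  location_desc = c(paste("AA - County", 1:7), paste("BB - County", 1:2))
)

states_7_plus <- toy %>%
  group_by(year, location_abbr) %>%
  # Count each location once, even if it contributes several rows.
  summarize(n_locations = n_distinct(location_desc), .groups = "drop") %>%
  filter(n_locations >= 7)
```

Using n_distinct() rather than n() matters on the real data, where each location appears once per response category.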

Spaghetti Plot of State Average Prevalence over Time, limited to responses of “Excellent”

Within each state, the average crude prevalence of “Excellent” responses across locations fluctuates from year to year but remains fairly stable overall. There is no distinguishable upward or downward trend across years.
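A spaghetti plot of this kind can be sketched as follows: average data_value over locations within each state and year, then draw one line per state. The toy table, the state codes "AA" and "BB", and the names `state_avg` and `spaghetti` are assumptions for illustration; `na.rm = TRUE` is included only in case some data values are missing.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows. All values are made up for illustration.
toy <- tibble(
  year          = c(2002, 2002, 2003, 2003, 2002, 2003),
  location_abbr = c("AA", "AA", "AA", "AA", "BB", "BB"),
  response      = "Excellent",
  data_value    = c(20, 24, 21, NA, 18, 19)
)

state_avg <- toy %>%
  filter(response == "Excellent") %>%
  group_by(year, location_abbr) %>%
  # Average across locations within a state, ignoring missing values.
  summarize(mean_value = mean(data_value, na.rm = TRUE), .groups = "drop")

# One line per state: the group aesthetic keeps the states separate.
spaghetti <- ggplot(state_avg, aes(x = year, y = mean_value, group = location_abbr)) +
  geom_line(alpha = 0.6) +
  labs(x = "Year", y = "Mean crude prevalence (%)")
```

Printing `spaghetti` draws the plot. A low alpha keeps overlapping lines readable when there are dozens of states.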

Across all available state-year combinations, UT in 2002 has the highest average crude prevalence, and WV in 2005 has the lowest.
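Given a table of state-level averages, the extremes can be picked out with slice_max() and slice_min() from dplyr. The sketch below starts from a made-up `state_avg` table with hypothetical state codes "AA" and "BB"; it does not reproduce the UT and WV figures above.

```r
library(dplyr)

# Hypothetical state-year averages. All values are made up for illustration.
state_avg <- tibble(
  year          = c(2002, 2002, 2005, 2005),
  location_abbr = c("AA", "BB", "AA", "BB"),
  mean_value    = c(25.0, 21.5, 17.0, 22.0)
)

# Row with the largest and the smallest average, respectively.
highest <- state_avg %>% slice_max(mean_value, n = 1)
lowest  <- state_avg %>% slice_min(mean_value, n = 1)
```

Both functions keep ties by default, so more than one row can come back if several state-year pairs share the extreme value.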

Plot of prevalence distribution for responses “Poor” to “Excellent” among locations in NY State, 2006 and 2010

The plots show that Poor responses had the lowest crude prevalence across all locations in NY State in both 2006 and 2010, and that crude prevalence rises as the response improves from Poor through Very good. For Excellent responses, however, the crude prevalence distribution is lower than for Good and Very good.

In 2006, Good and Very good responses have comparable crude prevalence distributions, although the median for Very good is slightly higher. In 2010, the crude prevalence for Very good is visibly higher than for Good. Fair and Poor responses also show a wider spread in 2010 than in 2006.
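The plot type is not stated here. Since the discussion refers to medians and spread, the sketch below assumes a box plot per response, with one panel per year. The toy table and all of its values are made up for illustration, and the names `ny`, `ny_plot`, and `ny_medians` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Assumed ordering of the five responses, from worst to best.
response_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

# Hypothetical toy rows: two locations per response and year, made-up values.
toy <- tibble(
  year          = rep(c(2006, 2010), each = 10),
  location_abbr = "NY",
  response      = rep(rep(response_levels, each = 2), times = 2),
  data_value    = c(3, 4, 10, 12, 28, 30, 33, 35, 20, 22,
                    2, 5, 9, 14, 27, 29, 35, 38, 21, 23)
)

ny <- toy %>%
  filter(location_abbr == "NY", year %in% c(2006, 2010)) %>%
  # Fix the response order so the x-axis runs from Poor to Excellent.
  mutate(response = factor(response, levels = response_levels))

# One box per response, one panel per year.
ny_plot <- ggplot(ny, aes(x = response, y = data_value)) +
  geom_boxplot() +
  facet_wrap(~ year) +
  labs(x = "Response", y = "Crude prevalence (%)")

# Medians per year and response, to back up a visual comparison with numbers.
ny_medians <- ny %>%
  group_by(year, response) %>%
  summarize(median_value = median(data_value), .groups = "drop")
```

A table like `ny_medians` makes comparisons such as Good versus Very good easier to check than reading them off the plot alone.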