
Project Overview

This project examines ZIP code-level population changes in NYC using USPS Change of Address (COA) data. Raw data are imported from HERE. Because COA records may not accurately reflect NYC’s counties, boroughs, and neighborhoods, a supplementary ZIP Codes (ZIP) dataset is imported from HERE. I cleaned and merged the two datasets, then performed exploratory analysis to identify population shifts, investigate data quality, and reveal demographic trends while addressing dataset limitations. The raw COA dataset contains 5 variables and 11845 observations, while the raw ZIP Codes dataset contains 7 variables and 324 observations. For this project, I used R version 4.3.1 and the R packages tidyr, readxl, dplyr, janitor, lubridate, and ggplot2.

Data Wrangling

Data preparation involved importing the files and combining the sheets in COA. Column names were cleaned, and new variables were created (year and net_change in COA, borough in ZIP). Several join methods were compared for data consistency. A left_join merged COA and ZIP and produced a warning about many-to-many relationships. Both COA and ZIP were then checked for duplicates. Borough names were updated, and duplicate entries for specific ZIP codes were resolved.
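To illustrate why a left join can warn about many-to-many relationships, here is a minimal sketch in plain Python (the report itself used dplyr in R). The miniature tables, their values, and the helper names `left_join` and `duplicated_keys` are hypothetical. The point is that a key repeated in both tables multiplies rows in the merged result.

```python
from collections import Counter

# Hypothetical miniature versions of the two tables (values invented for illustration).
coa = [
    {"zip_code": "10001", "month": "2018-01", "total_perm_in": 10, "total_perm_out": 12},
    {"zip_code": "10001", "month": "2018-02", "total_perm_in": 8, "total_perm_out": 9},
    {"zip_code": "11201", "month": "2018-01", "total_perm_in": 5, "total_perm_out": 7},
]
zips = [
    {"zip_code": "10001", "borough": "Manhattan"},
    {"zip_code": "11201", "borough": "Brooklyn"},
    {"zip_code": "11201", "borough": "Manhattan"},  # duplicated key in the lookup table
]


def left_join(left, right, key):
    """Keep every left row; emit one output row per matching right row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    out = []
    for row in left:
        # An unmatched left row is kept once, without the right-hand columns.
        for match in index.get(row[key]) or [{}]:
            out.append({**row, **match})
    return out


def duplicated_keys(rows, key):
    """Return the key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


merged = left_join(coa, zips, "zip_code")
print(len(coa), len(merged))               # 3 4
print(duplicated_keys(zips, "zip_code"))   # ['11201']
```

The merged table has more rows than COA because ZIP code 11201 matches two lookup rows, which is why the duplicates in the lookup table need to be resolved before or after the join.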

The final tidy dataset has 10 columns and 11845 rows.

Variables and examples:

  • year (numeric): 2018, 2019, 2020, 2021, 2022.
  • month (POSIXct, POSIXt): 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2018-05-01, 2018-06-01.
  • city (character): NEW YORK, STATEN ISLAND, BRONX, GLEN OAKS, FLORAL PARK, LONG ISLAND CITY. There are 78 unique entries.
  • borough (character): Manhattan, Staten Island, Bronx, Queens, Brooklyn. There are 5 unique entries.
  • county_name (character): New York, Richmond, Bronx, Queens, Kings. There are 5 unique entries.
  • neighborhood (character): Chelsea and Clinton, Lower East Side, Lower Manhattan, NA, Gramercy Park and Murray Hill, Greenwich Village and Soho. There are 43 unique entries. Additionally, there are 1288 missing entries in this variable.
  • total_perm_out (numeric): range 0 to 1772, mean = 267.6287041.
  • total_perm_in (numeric): range 0 to 1187, mean = 215.7765302.
  • net_change (numeric): range -983 to 744, mean = -51.8521739.
  • zip_code (character): 10001, 10002, 10003, 10004, 10005, 10006. There are 237 unique entries.
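The two derived variables can be sketched as follows. The report does not state the formula for net_change; the definition below (move-ins minus move-outs) is an assumption inferred from the reported means, which satisfy mean(total_perm_in) - mean(total_perm_out) = mean(net_change). The helper `add_derived` and the example row are hypothetical, and the sketch is in plain Python rather than the R used for the analysis.

```python
from datetime import date


def add_derived(row):
    """Return a copy of row with year and net_change added."""
    return {
        **row,
        "year": row["month"].year,
        # Assumed definition: permanent move-ins minus permanent move-outs.
        "net_change": row["total_perm_in"] - row["total_perm_out"],
    }


# Hypothetical row; the counts are invented for illustration.
row = {
    "zip_code": "10001",
    "month": date(2020, 5, 1),
    "total_perm_in": 120,
    "total_perm_out": 400,
}
tidy = add_derived(row)
print(tidy["year"], tidy["net_change"])  # 2020 -280

# Consistency check against the means reported in the variable list.
mean_in, mean_out, mean_net = 215.7765302, 267.6287041, -51.8521739
print(abs((mean_in - mean_out) - mean_net) < 1e-6)  # True
```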

Data Quality Assessment

Table Comparing ‘city’ to ‘borough’

borough city n
Bronx BRONX 1500
Brooklyn BROAD CHANNEL 5
Brooklyn BROOKLYN 2369
Brooklyn BROOKLYN HEIGHTS 1
Brooklyn FAR ROCKAWAY 6
Brooklyn ROCKAWAY BEACH 49
Manhattan BOWLING GREEN 1
Manhattan CANAL STREET 4
Manhattan NEW YORK 3477
Manhattan NYC 1
Manhattan ROOSEVELT ISL 4
Manhattan ROOSEVELT ISLAND 4
Manhattan SECHEDATY 1
Manhattan WALL STREET 1
Queens FAR ROCKAWAY 96
Queens ARVERNE 56
Queens ASTORIA 230
Queens AUBURNDALE 2
Queens BAYSIDE 135
Queens BAYSIDE HILLS 6
Queens BEECHHURST 4
Queens BELLE HARBOR 11
Queens BELLEROSE 60
Queens BELLEROSE MANOR 2
Queens BREEZY POINT 47
Queens BRIARWOOD 16
Queens CALVERTON 1
Queens CAMBRIA HEIGHTS 58
Queens CAMBRIA HTS 1
Queens COLLEGE POINT 60
Queens CORONA 60
Queens DOUGLASTON 39
Queens EAST ELMHURST 117
Queens ELMHURST 63
Queens FLORAL PARK 68
Queens FLUSHING 309
Queens FOREST HILLS 59
Queens FRESH MEADOWS 107
Queens GLEN OAKS 52
Queens GLENDALE 11
Queens HOLLIS 59
Queens HOWARD BEACH 60
Queens JACKSON HEIGHTS 56
Queens JACKSON HTS 1
Queens JAMAICA 372
Queens KEW GARDENS 62
Queens KEW GARDENS HILLS 4
Queens LAURELTON 28
Queens LITTLE NECK 79
Queens LONG IS CITY 19
Queens LONG ISLAND CITY 120
Queens MASPETH 58
Queens MIDDLE VILLAGE 58
Queens MIDDLE VLG 1
Queens NEPONSIT 4
Queens NEW YORK CITY 1
Queens OAKLAND GARDENS 38
Queens OZONE PARK 116
Queens QUEENS VILLAGE 165
Queens QUEENS VLG 8
Queens REGO PARK 60
Queens RICHMOND HILL 54
Queens RIDGEWOOD 70
Queens ROCKAWAY PARK 40
Queens ROCKAWAY POINT 13
Queens ROSEDALE 59
Queens S OZONE PARK 8
Queens S RICHMOND HL 9
Queens SAINT ALBANS 57
Queens SOUTH OZONE PARK 49
Queens SOUTH RICHMOND HILL 47
Queens SPRINGFIELD GARDENS 30
Queens SPRNGFLD GDNS 2
Queens ST ALBANS 1
Queens SUNNYSIDE 51
Queens WHITESTONE 55
Queens WOODHAVEN 59
Queens WOODSIDE 59
Staten Island STATEN ISLAND 720

City and borough combinations not listed have no observations (NA in the cross-tabulation).

The table above, which cross-tabulates ‘city’ against the five boroughs, reveals several data quality issues. Some city values are matched to more than one borough or to an unexpected one: ‘FAR ROCKAWAY’ appears under both ‘Brooklyn’ and ‘Queens,’ and ‘NEW YORK CITY’ appears once under ‘Queens,’ indicating inconsistent city-to-borough relationships. The same place is also spelled in more than one way, such as ‘ROOSEVELT ISLAND’ and ‘ROOSEVELT ISL’ in Manhattan or ‘QUEENS VILLAGE’ and ‘QUEENS VLG’ in Queens, which points to city name inconsistencies. Values such as ‘BOWLING GREEN,’ ‘CANAL STREET,’ and ‘SECHEDATY’ in the ‘city’ variable may signal data entry errors.
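One way to handle the city name inconsistencies is to map each abbreviated spelling to its longer form before counting. The sketch below is in plain Python and is not the report's method. The alias pairs are spellings that both appear in the city-by-borough table, but treating each pair as the same place is an assumption, and `CITY_ALIASES`, `normalize_city`, and `merge_counts` are hypothetical names.

```python
# Abbreviated spellings seen in the city-by-borough table, mapped to the longer
# spelling that also appears there. Treating each pair as one place is an assumption.
CITY_ALIASES = {
    "ROOSEVELT ISL": "ROOSEVELT ISLAND",
    "CAMBRIA HTS": "CAMBRIA HEIGHTS",
    "JACKSON HTS": "JACKSON HEIGHTS",
    "LONG IS CITY": "LONG ISLAND CITY",
    "MIDDLE VLG": "MIDDLE VILLAGE",
    "QUEENS VLG": "QUEENS VILLAGE",
    "S OZONE PARK": "SOUTH OZONE PARK",
    "S RICHMOND HL": "SOUTH RICHMOND HILL",
    "SPRNGFLD GDNS": "SPRINGFIELD GARDENS",
    "ST ALBANS": "SAINT ALBANS",
}


def normalize_city(city):
    """Upper-case, collapse whitespace, then apply the alias table."""
    cleaned = " ".join(city.upper().split())
    return CITY_ALIASES.get(cleaned, cleaned)


def merge_counts(counts):
    """Combine observation counts for spellings that normalize to the same city."""
    merged = {}
    for city, n in counts.items():
        key = normalize_city(city)
        merged[key] = merged.get(key, 0) + n
    return merged


# Counts taken from the Manhattan rows of the table.
manhattan = {"NEW YORK": 3477, "ROOSEVELT ISL": 4, "ROOSEVELT ISLAND": 4}
print(merge_counts(manhattan))  # {'NEW YORK': 3477, 'ROOSEVELT ISLAND': 8}
```

An explicit alias table is used here instead of fuzzy matching so that every merge is visible and can be reviewed by hand.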

Top 5 most common cities in Manhattan

city n
NEW YORK 3477
CANAL STREET 4
ROOSEVELT ISL 4
ROOSEVELT ISLAND 4
BOWLING GREEN 1

Top 5 most common cities in Queens

city n
JAMAICA 372
FLUSHING 309
ASTORIA 230
QUEENS VILLAGE 165
BAYSIDE 135

The tables above reaffirm the data quality issues noted earlier.

There are 58 ZIP codes with fewer than 60 observations (a complete monthly series for 2018 through 2022 would have 60), and 1288 observations are missing neighborhood data. Both point to potential data gaps arising from factors such as sparse data collection, inconsistent reporting, demographic variation, and data source limitations.
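The two completeness checks described above can be sketched in plain Python as follows. The threshold of 60 assumes one row per ZIP code per month over five years; the sample rows and the helper names `incomplete_zips` and `count_missing` are hypothetical.

```python
from collections import Counter

# Assumption: one row per ZIP code per month for the five years 2018 through 2022.
EXPECTED_MONTHS = 5 * 12


def incomplete_zips(rows, expected=EXPECTED_MONTHS):
    """Return {zip_code: row count} for ZIP codes with fewer rows than expected."""
    counts = Counter(row["zip_code"] for row in rows)
    return {z: n for z, n in counts.items() if n < expected}


def count_missing(rows, field):
    """Count rows where the field is absent or None."""
    return sum(1 for row in rows if row.get(field) is None)


# Hypothetical data: one complete ZIP code and one sparse ZIP code without a neighborhood.
rows = (
    [{"zip_code": "10001", "neighborhood": "Chelsea and Clinton"}] * 60
    + [{"zip_code": "10004", "neighborhood": None}] * 7
)
print(incomplete_zips(rows))                # {'10004': 7}
print(count_missing(rows, "neighborhood"))  # 7
```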

EDA and Visualization

Average Net Change Table, by year and borough

borough 2018 2019 2020 2021 2022
Bronx -46.303333 -48.01667 -72.65333 -66.10000 -53.19000
Brooklyn -46.184265 -51.68323 -110.67206 -76.83811 -55.37759
Manhattan -41.967422 -52.78477 -126.43461 -38.97550 -46.58806
Queens -26.640479 -29.29128 -48.28437 -45.37178 -30.77854
Staten Island -9.846154 -9.12500 -10.54483 -22.54861 -16.29861

The table shows average net change by borough and year. Every borough averages a net loss in every year. The losses are largest in 2020 for the Bronx, Brooklyn, Manhattan, and Queens, possibly influenced by the COVID-19 pandemic, while Staten Island’s largest loss comes in 2021. No single borough has the largest loss throughout: the Bronx leads narrowly in 2018, Manhattan in 2019 and 2020, and Brooklyn in 2021 and 2022. Staten Island shows the smallest average loss in every year and is comparatively stable. These trends may stem from demographic shifts and housing patterns.
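A table of this shape comes from grouping observations by borough and year and averaging net_change within each group. The following is a minimal sketch in plain Python rather than the dplyr pipeline presumably used in the report; the rows are invented for illustration and `mean_net_change` is a hypothetical helper.

```python
from collections import defaultdict


def mean_net_change(rows):
    """Average net_change for each (borough, year) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, row count]
    for row in rows:
        key = (row["borough"], row["year"])
        totals[key][0] += row["net_change"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical observations (not taken from the COA data).
rows = [
    {"borough": "Manhattan", "year": 2020, "net_change": -100},
    {"borough": "Manhattan", "year": 2020, "net_change": -150},
    {"borough": "Queens", "year": 2020, "net_change": -40},
    {"borough": "Queens", "year": 2021, "net_change": -50},
]
for (borough, year), avg in sorted(mean_net_change(rows).items()):
    print(borough, year, avg)
# Manhattan 2020 -125.0
# Queens 2020 -40.0
# Queens 2021 -50.0
```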

Five lowest values of net change across all observed data

zip_code neighborhood year month net_change
10022 Gramercy Park and Murray Hill 2020 5 -983
10009 Lower East Side 2020 7 -919
10016 Gramercy Park and Murray Hill 2020 6 -907
10016 Gramercy Park and Murray Hill 2020 7 -855
10009 Lower East Side 2020 6 -804

Five highest values of net change before 2020

zip_code neighborhood year month net_change
11101 Northwest Queens 2018 4 360
11101 Northwest Queens 2018 6 344
11101 Northwest Queens 2018 5 300
10001 Chelsea and Clinton 2018 7 225
11201 Northwest Brooklyn 2018 4 217
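The two rankings above amount to sorting by net_change, with a year filter for the second one. A minimal sketch in plain Python, using a handful of rows copied from the two tables; the helper names `lowest` and `highest_before` are hypothetical.

```python
import heapq


def lowest(rows, n=5):
    """The n rows with the smallest net_change, most negative first."""
    return heapq.nsmallest(n, rows, key=lambda r: r["net_change"])


def highest_before(rows, year, n=5):
    """The n rows with the largest net_change among years strictly before `year`."""
    earlier = [r for r in rows if r["year"] < year]
    return heapq.nlargest(n, earlier, key=lambda r: r["net_change"])


# A subset of rows taken from the two tables above.
rows = [
    {"zip_code": "10022", "year": 2020, "month": 5, "net_change": -983},
    {"zip_code": "10009", "year": 2020, "month": 7, "net_change": -919},
    {"zip_code": "11101", "year": 2018, "month": 4, "net_change": 360},
    {"zip_code": "11101", "year": 2018, "month": 6, "net_change": 344},
    {"zip_code": "10001", "year": 2018, "month": 7, "net_change": 225},
]
print([r["net_change"] for r in lowest(rows, 2)])                # [-983, -919]
print([r["net_change"] for r in highest_before(rows, 2020, 2)])  # [360, 344]
```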

Trend plot of neighborhood-level average net change by month

Neighborhood-level average net change is mostly below zero across all boroughs over the five years, and the overall mean net change is -51.8521739. Overall, Manhattan has the lowest average net change and Staten Island has the highest. Average net change is stable in 2018, 2019, and 2022; however, a substantial drop occurred between 2020 and 2021, especially in Manhattan.

Limitations

The dataset has many limitations, including data quality issues, borough mismatches, many-to-many relationships, limited time range, local influences, and potential seasonal patterns. It lacks demographic and socioeconomic data. Thorough preprocessing and data integration are needed for a holistic view of population changes.

Method koRpus stringi
Word count 556 547
Character count 3989 3989
Sentence count 62 Not available
Reading time 2.8 minutes 2.7 minutes