Data Collection
Python was used to scrape listing data from Redfin for the cities of Bolingbrook and Naperville, using the
requests-html library. Because Redfin is a modern website that renders its pages with JavaScript, a scraping library
capable of rendering pages was needed: each page had to be rendered first, and only then could the complete HTML be
extracted and parsed. First, an array of links was established that held all the pages of listings
on the website. The URLs to these pages were formatted with a base URL and a page number, so individual listings
had to be found through each page in the array. Using the web scraper, the html data was parsed to identify each link
to the properties displayed on the page, which were again collected and stored. Given the large volume of links to
render and scrape, the code was written to run asynchronously. On each page, data collection was relatively
simple. All Redfin listings share a similar format, so the HTML and CSS patterns identified on one Redfin page
could be applied to the other pages. However, homeowners sometimes leave data off their listings, so
some data points are empty. Additional steps required to process this data are discussed in the next section. The
attributes that were scraped consist of: square footage, property type, year built, price, number of car spaces, address,
high school, beds, baths (half and full), heating, cooling, number of carpet rooms, number of hardwood rooms,
basement, basement square footage, basement description, and tax annual amount. Attributes missing from a listing
were recorded as null. The entries were stored in CSV files organized by city and name.
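The link-collection steps described above can be sketched as follows. This is a minimal illustration, not the study's actual scraper: it uses only the Python standard library instead of requests-html, it does not render JavaScript, and the base URL, the `page-N` URL format, the `/home/` link marker, and the placeholder `fetch_rendered_html` coroutine are all assumptions made for the example.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical base URL; the real URL format is not given in the text.
BASE_URL = "https://www.example.com/city/Naperville"


def build_page_urls(base_url, num_pages):
    """Return one results-page URL per page number."""
    return [f"{base_url}/page-{n}" for n in range(1, num_pages + 1)]


class ListingLinkParser(HTMLParser):
    """Collect hrefs of anchor tags that look like property links."""

    def __init__(self, page_url, marker="/home/"):
        super().__init__()
        self.page_url = page_url
        self.marker = marker  # assumed substring identifying a listing link
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.page_url, href)
            if absolute not in self.links:  # skip duplicates, keep order
                self.links.append(absolute)


def extract_listing_links(html, page_url):
    parser = ListingLinkParser(page_url)
    parser.feed(html)
    return parser.links


# Stand-in for the HTML obtained after rendering a results page.
SAMPLE_HTML = """
<div><a href="/IL/Naperville/1-Main-St/home/111">1 Main St</a>
<a href="/IL/Naperville/1-Main-St/home/111">photo</a>
<a href="/about">About</a>
<a href="/IL/Naperville/2-Oak-Ave/home/222">2 Oak Ave</a></div>
"""


async def fetch_rendered_html(url):
    """Placeholder for rendering a page; returns canned HTML, no network."""
    await asyncio.sleep(0)
    return SAMPLE_HTML


async def collect_links(urls):
    # Fetch all results pages concurrently, then parse each one.
    pages = await asyncio.gather(*(fetch_rendered_html(u) for u in urls))
    links = []
    for url, html in zip(urls, pages):
        for link in extract_listing_links(html, url):
            if link not in links:
                links.append(link)
    return links


page_urls = build_page_urls(BASE_URL, 3)
listing_links = asyncio.run(collect_links(page_urls))
```

In a real run, `fetch_rendered_html` would be replaced by a call that renders the page before returning its HTML, and the collected links would then be scraped for the individual attributes.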
Data Preprocessing
The CSV data had to be cleaned for further analysis. The variables collected were square footage, property type, year
built, price, number of car spaces, address, high school, beds, baths (half and full), heating, cooling, number of carpet
rooms, number of hardwood rooms, basement, basement square footage, basement description, and tax annual amount.
Square footage is an important attribute of a property. The only transformation applied was removing the
dollar sign and converting the value to an integer. For property type, there were three categories: townhouse,
condo/co-op, and single residential. To make analysis easier, categorical variables such as property type were
expanded into one indicator column per category. Each column holds a 0 or 1 indicating whether a property is of that
type. For example, a condo would have a 1 in the condo column and a 0 in both the single residential and townhouse
columns. Two other important metrics were bedrooms and bathrooms. Bathrooms were separated into two categories:
half and full. Heating, cooling, and basement description were also treated as categorical variables and encoded in
the same way as property type. However, these variables are multinomial rather than binary, and therefore required
additional processing.
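The 0/1 indicator columns described above for property type can be sketched in a few lines of Python. This is an illustrative sketch only; the field names (`address`, `property_type`) and the sample rows are assumptions, not the study's actual data or code.

```python
# Category labels taken from the text; column names are illustrative.
PROPERTY_TYPES = ["townhouse", "condo/co-op", "single residential"]


def one_hot(value, categories):
    """Return one 0/1 indicator per category for a single value."""
    return {category: int(value == category) for category in categories}


# Made-up rows standing in for the scraped CSV records.
rows = [
    {"address": "A", "property_type": "condo/co-op"},
    {"address": "B", "property_type": "townhouse"},
]

# Append the indicator columns to each row.
encoded = [{**row, **one_hot(row["property_type"], PROPERTY_TYPES)} for row in rows]
```

When the data is held in a pandas DataFrame, `pandas.get_dummies` produces the same kind of indicator columns.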
Regular expressions (regex), a pattern-matching technique for parsing text, were used to group the raw descriptions
into broader categories. For example, heating had a variety of descriptions and combinations with common terms, like
“natural gas.” Therefore, three main categories were created instead: natural gas, baseboard, and an “other”
category. For cooling, the columns were zoned, central air, and again “other.” The basement variable had many more
distinct values, but the main categories were none, full, partial, English, and walk-out. The numbers of carpet and
hardwood rooms were also useful in this analysis. They were combined into a single number-of-rooms column to avoid
bias, as a house with no carpeted rooms could still hold high value if it had hardwood floors instead. With these
transformed variables, some outliers could be removed to improve the data. For example, properties priced at
$2,500,000 or greater were removed. Properties were also capped at ten thousand square feet. Finally, any homes
without a listing price were removed, as they would interfere with the modeling process. The table below shows the
averages of the housing attributes, organized by year and variable.
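As an illustration of the regex grouping, room-count combination, and outlier filters described in this section, a minimal Python sketch follows. The patterns, field names, sample listings, and the choices to keep rows with missing square footage and to flag “other” only when no named category matches are all assumptions made for the example; the study's actual expressions and rules are not given in the text.

```python
import re

# Assumed patterns for the heating categories named in the text.
HEATING_PATTERNS = [
    ("natural_gas", re.compile(r"natural\s+gas", re.IGNORECASE)),
    ("baseboard", re.compile(r"baseboard", re.IGNORECASE)),
]

MAX_PRICE = 2_500_000  # listings at or above this price are dropped
MAX_SQFT = 10_000      # listings above this square footage are dropped


def heating_flags(description):
    """Map a free-text heating description to 0/1 category flags."""
    text = description or ""
    flags = {name: int(bool(p.search(text))) for name, p in HEATING_PATTERNS}
    # Assumption: "other" applies only to non-empty text with no match.
    flags["other"] = int(bool(text.strip()) and not any(flags.values()))
    return flags


def clean_listings(listings):
    cleaned = []
    for row in listings:
        price = row.get("price")
        if price is None or price >= MAX_PRICE:
            continue  # no listing price, or price outlier
        sqft = row.get("sqft")
        if sqft is not None and sqft > MAX_SQFT:
            continue  # square-footage outlier
        # Combine carpet and hardwood counts; missing counts count as 0.
        rooms = (row.get("carpet_rooms") or 0) + (row.get("hardwood_rooms") or 0)
        heating = {f"heating_{k}": v for k, v in heating_flags(row.get("heating")).items()}
        cleaned.append({**row, "rooms": rooms, **heating})
    return cleaned


# Made-up listings for demonstration only.
listings = [
    {"price": 350000, "sqft": 1800, "carpet_rooms": 0, "hardwood_rooms": 5,
     "heating": "Natural Gas, Forced Air"},
    {"price": 2500000, "sqft": 6000, "carpet_rooms": 3, "hardwood_rooms": 4,
     "heating": "Baseboard"},
    {"price": None, "sqft": 1500, "carpet_rooms": 2, "hardwood_rooms": 2,
     "heating": "Electric"},
    {"price": 420000, "sqft": 12000, "carpet_rooms": 4, "hardwood_rooms": 6,
     "heating": "Natural Gas"},
    {"price": 280000, "sqft": None, "carpet_rooms": 2, "hardwood_rooms": None,
     "heating": "Steam radiator"},
]
cleaned = clean_listings(listings)
```

The same pattern-list approach extends to the cooling and basement categories by swapping in a different list of labels and expressions.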
Data Analysis
Table 1 shows that over the past four years, the averages of most variables remained essentially unchanged.
The one exception is the property price, which indicates a hot real estate market in
Naperville. The largest jump in price occurred in the past year (2021-2022). This sharp change in price
was also accompanied by slightly more volatility, as described below. A correlation function was used to identify
the most important features. Tax annual amount and square footage emerged as two strong predictors of price. Single
family residential homes were associated with higher prices, while condos showed a strong negative correlation. Bathrooms