Predicting housing prices and analyzing real estate markets in the Chicago suburbs using machine learning

Kevin Xu1, Hieu Nguyen2

1Neuqua Valley High School, Naperville, Illinois

2Mentor, University of Connecticut, Storrs, CT

ABSTRACT

The pricing of housing properties is determined by a variety of factors. However, post-pandemic markets have

experienced volatility in the Chicago suburbs, which has affected house prices greatly. In this study, analysis

was done on the Naperville/Bolingbrook real estate market to predict property prices based on housing attributes

through machine learning models, and to evaluate the effectiveness of such models in a volatile market space.

Sales data from 2018 through the summer of 2022 were collected from Redfin, a real estate website, for this

research. Analyzing sales over this period also lets us examine the state of the housing

market and identify trends in price. For modeling the data, the models used were linear regression, support vector

regression, decision tree regression, random forest regression, and XGBoost regression. To analyze results,

comparison was made on the MAE, RMSE, and R-squared values for each model. It was found that the XGBoost

model performs the best in predicting house prices despite the additional volatility introduced by post-pandemic

conditions. After modeling, Shapley Values (SHAP) were used to evaluate the weights of the variables in constructing

models. The code and data files can be found at https://github.com/GeometricBison/HousePriceML.

Introduction

In real estate markets, appraisals of home value are essential for conducting business. To estimate the price of a

property, many details about the property must be considered. Attributes like square footage or number of rooms can

easily sway the price of a house up or down. The goal of this project is to construct a machine learning model that can

estimate the price of a property in suburban Chicago through evaluating its attributes. For example, in most cases, an

increase in square footage would increase the price of a property. However, diving deeper, not all variables

are created equal. The model would have to consider the different weights of the variables and consider whether one

variable would be more influential to the price than another. Another goal of this project is to analyze the real estate

market by using models to identify trends. To accomplish this, models were constructed for different years from 2018

up until July of 2022. For example, in a volatile housing market year like 2022, prices rose sharply as

demand for homes kept growing. Comparing such a year with the more stable markets before the pandemic can

reveal insights on market volatility and whether prices are high or low. By comparing models, trends can be analyzed

over the years for the real estate market.

Three major steps were involved in this project. First, data had to be extracted from

Redfin.com, a real estate website; the data then had to be transformed and analyzed; and finally a model had

to be applied to the transformed data to estimate the price. A script was used to load pages from Redfin, parse

their contents to find relevant data, and write the results to an Excel file that held all the data in an organized

manner. This was possible thanks to the requests-html library, a Python package that specializes in rendering

JavaScript-based websites. Data was then categorized by sale date and city. Analysis of

the data included assessing how important each attribute was for the housing price and transforming categorical data

points into a numerical format that could be interpreted by the model.

Machine learning has already proved itself very useful in this market sphere, as shown by

tools built on these techniques, such as Zillow’s Zestimate and Redfin’s home appraisal. In this study, analysis

was done with models like support vector regression, XGBoost models, and random forest regression. Comparisons

of model effectiveness are detailed in the results section. The organization of the paper is as follows: section

2 will go over data collection, section 3 will discuss data preprocessing, section 4 will describe data analysis, section

5 will cover modeling, section 6 will detail the results of the study, and section 7 will describe variable importance.
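
The three metrics used to compare models (MAE, RMSE, and R-squared) can be illustrated with a minimal sketch. This is not the code used in the study; the function name and the sample prices below are made up for illustration, and only the standard definitions of the metrics are assumed.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R-squared) for paired actual and predicted prices."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # share of variance explained
    return mae, rmse, r2

# Made-up sale prices and predictions, for illustration only.
actual = [300000, 450000, 520000, 610000]
predicted = [310000, 440000, 500000, 640000]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.0f}  RMSE={rmse:.0f}  R2={r2:.3f}")
```

MAE and RMSE are in dollars, so lower is better, while R-squared is unitless and closer to 1 is better. RMSE penalizes large misses more heavily than MAE, which matters in a volatile market where a few sales can be far off.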

Data Collection

Python was used to scrape the data off Redfin from listings in the cities of Bolingbrook and Naperville. The

requests-html library was used specifically for this website. Since Redfin is a modern website and uses JavaScript to

render its pages, a more powerful web scraper library had to be used to parse the data. This required rendering the

website first and then extracting the complete HTML code. First, an array of links was established that held all the pages of listings

on the website. The URLs to these pages were formatted with a base URL and a page number, so individual listings

had to be found through each page in the array. Using the web scraper, the HTML data was parsed to identify each link

to the properties displayed on the page, which were again collected and stored. Given the large volume of links to

render and scrape, the code was constructed to run asynchronously. On each page, data collection was relatively

simple. All Redfin listings share a similar format, so the patterns found in the HTML and CSS of one Redfin

page could be applied to other pages. However, sometimes homeowners do not put all data on their listings, so

some data points are empty. Additional steps required to process this data will be discussed in the next section. The

attributes that were scraped consist of: square footage, property type, year built, price, number of car spaces, address,

high school, beds, baths (half and full), heating, cooling, number of carpet rooms, number of hardwood rooms,

basement, basement square footage, basement description, and tax annual amount. Any data point missing from a

listing was simply recorded as null. The entries are stored in CSV files categorized by city and name.
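
The link-collection steps described above can be sketched as follows. This is a hedged illustration, not the study's scraper: it uses only the Python standard library instead of requests-html, the base URL and the `/home/` link marker are hypothetical, and `fetch_rendered_html` is a stand-in that returns canned HTML rather than rendering a real Redfin page.

```python
import asyncio
from html.parser import HTMLParser

# Hypothetical values: the real base URL, page count, and link pattern
# are not given in the text.
BASE_URL = "https://example.com/city/listings/page-"
LISTING_MARKER = "/home/"

def build_page_urls(base_url, num_pages):
    """Form one results-page URL per page from a base URL and a page number."""
    return [f"{base_url}{page}" for page in range(1, num_pages + 1)]

class ListingLinkParser(HTMLParser):
    """Collect href values of anchor tags that look like property listings."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and LISTING_MARKER in href and href not in self.links:
            self.links.append(href)

def extract_listing_links(html):
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.links

async def fetch_rendered_html(url):
    """Stand-in for rendering a page; returns canned HTML, no network access."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    page = url.rsplit("-", 1)[-1]
    return (f'<a href="/home/{page}01">A</a>'
            f'<a href="/about">About</a>'
            f'<a href="/home/{page}02">B</a>')

async def collect_listing_links(page_urls):
    """Fetch all results pages concurrently and gather their listing links."""
    pages = await asyncio.gather(*(fetch_rendered_html(u) for u in page_urls))
    links = []
    for html in pages:
        links.extend(extract_listing_links(html))
    return links

page_urls = build_page_urls(BASE_URL, 3)
listing_links = asyncio.run(collect_listing_links(page_urls))
print(listing_links)
```

Running the page fetches through `asyncio.gather` lets slow page loads overlap instead of happening one after another, which is the motivation given above for writing the scraper asynchronously.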

Data Preprocessing

The CSV data had to be cleaned for further analysis. The variables collected were square footage, property type, year

built, price, number of car spaces, address, high school, beds, baths (half and full), heating, cooling, number of carpet

rooms, number of hardwood rooms, basement, basement square footage, basement description, and tax annual amount.

The variable square footage is very important for a property. Data transformation was done by only removing the

dollar sign and turning it into an integer. For property type, there were three categories: townhouse, condo/co-op, and

single residential. To make analysis easier, additional columns were made for categorical variables like home type.

Each column had a 0 or 1 indicating whether a property was a certain type. For example, a condo would have a 1 in

the condo column and a zero in both the single residential and townhouse columns. Another two important metrics

were bedrooms and bathrooms. Bathrooms specifically were separated into two categories: half and full. Heating,

cooling, and basement description were all treated as categorical variables, and the operation applied to

the property type variable was applied to them as well. However, these categories are not binomial

categories but rather multinomial features, thus requiring additional processing.
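
The 0/1 column scheme for property type described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the study's code; the field names (`property_type`, `beds`, and so on) and the sample listing are assumptions made for the example.

```python
PROPERTY_TYPES = ["townhouse", "condo/co-op", "single residential"]

def one_hot(value, categories):
    """Return a dict of 0/1 indicator columns, one per category."""
    return {category: int(value == category) for category in categories}

def encode_property(row):
    """Replace the 'property_type' field of a listing with indicator columns."""
    encoded = {k: v for k, v in row.items() if k != "property_type"}
    encoded.update(one_hot(row["property_type"], PROPERTY_TYPES))
    return encoded

# Made-up listing for illustration.
listing = {"beds": 2, "full_baths": 1, "half_baths": 1,
           "property_type": "condo/co-op"}
encoded_listing = encode_property(listing)
print(encoded_listing)
```

A condo ends up with a 1 in the condo column and a 0 in the other two, matching the example in the text. Libraries such as pandas offer the same transformation in one call, but the logic is the same.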

Using regex, a library that specializes in parsing text, the descriptions were grouped into broader categories. For example,

heating had a variety of descriptions and combinations with common terms, like “natural gas.” Therefore, three main

categories were created instead: natural gas, baseboard, and an “other” category. For cooling, the columns were

zoned, central air, and “other” once again. For the basement variable, there were many more descriptions, but the main

ones were none, full, partial, English, and walk-out. Number of carpet and number of hardwood rooms were also

useful in this analysis. They were combined into a single number-of-rooms column to avoid bias, as a zero-carpet house

could still hold high value if it had more hardwood floors instead. With these transformed variables, some outliers

could be removed to improve the data. For example, properties with a value of $2,500,000 or greater were removed.

Properties were also capped at ten thousand square feet. Finally, any homes without a listing price were

removed, as they would interfere with the modeling process. The table below depicts the averages of the housing attributes

as organized by year and variable.
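
The grouping and filtering steps described in this section can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it uses Python's built-in `re` module, the regular expressions are guesses at how listings word their heating descriptions, "other" is set only when no named category matches, and a listing with a missing square footage is kept. The field names `price` and `sqft` are hypothetical.

```python
import re

# Patterns are illustrative guesses at how the descriptions might be worded.
HEATING_PATTERNS = {
    "natural gas": re.compile(r"natural\s+gas", re.IGNORECASE),
    "baseboard": re.compile(r"baseboard", re.IGNORECASE),
}

def heating_flags(description):
    """Map a free-text heating description to 0/1 columns with an 'other' fallback."""
    text = description or ""
    flags = {name: int(bool(p.search(text))) for name, p in HEATING_PATTERNS.items()}
    flags["other"] = int(not any(flags.values()))  # assumption: only when nothing matched
    return flags

def total_rooms(carpet_rooms, hardwood_rooms):
    """Combine carpet and hardwood room counts, treating missing values as zero."""
    return (carpet_rooms or 0) + (hardwood_rooms or 0)

MAX_PRICE = 2_500_000  # listings at or above this price are dropped
MAX_SQFT = 10_000      # listings above this size are dropped

def keep_listing(row):
    """Apply the price and square-footage limits; drop rows with no price."""
    price, sqft = row.get("price"), row.get("sqft")
    if price is None or price >= MAX_PRICE:
        return False
    return sqft is None or sqft <= MAX_SQFT  # assumption: keep rows missing sqft

# Made-up rows for illustration.
rows = [
    {"price": 400000, "sqft": 2200},
    {"price": 2500000, "sqft": 3000},
    {"price": None, "sqft": 1500},
    {"price": 900000, "sqft": 12000},
]
kept = [row for row in rows if keep_listing(row)]
print(heating_flags("Natural Gas, Forced Air"), len(kept))
```

Because a single description can mention several heating types, each category gets its own flag rather than one mutually exclusive label, which is why these variables need more processing than property type.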

Data Analysis

Table 1 shows that over the past four years, the averages of most variables remained roughly the

same. The one exception is the property price, which indicates a hot real estate market in

Naperville. The largest jump in price occurred in the past year (2021-2022). This drastic increase in price variation

also resulted in slightly more volatility as described below. Using a correlation function, the most important features

were analyzed. Tax annual amount and square footage appeared as two strong variables in price prediction.

Single-family residential homes were correlated with higher house prices, while condos showed a strong correlation with lower prices. Bathrooms
