Predicting housing prices and analyzing real estate markets in the Chicago suburbs using machine learning

Kevin Xu1, Hieu Nguyen2
1Neuqua Valley High School, Naperville, Illinois
2Mentor, University of Connecticut, Storrs, CT
ABSTRACT
The pricing of housing properties is determined by a variety of factors. However, post-pandemic markets in the
Chicago suburbs have experienced volatility, which has affected house prices greatly. In this study, the
Naperville/Bolingbrook real estate market was analyzed to predict property prices from housing attributes
through machine learning models, and to evaluate the effectiveness of such models in a volatile market.
Sales data from 2018 through the summer of 2022 were collected from Redfin, a real estate website. Analyzing
sales over this period also makes it possible to examine the state of the housing market and identify price
trends. The models used were linear regression, support vector regression, decision tree regression, random
forest regression, and XGBoost regression. Results were compared using the MAE, RMSE, and R-squared values of
each model. The XGBoost model was found to perform best in predicting house prices despite the additional
volatility introduced by post-pandemic conditions. After modeling, Shapley values (SHAP) were used to evaluate
the weights of the variables in constructing the models. The code and data files can be found at
https://github.com/GeometricBison/HousePriceML.
Introduction
In real estate markets, appraisals of home value are essential for conducting business. To estimate the price of a
property, many details about the property must be considered. Attributes like square footage or number of rooms can
easily sway the price of a house up or down. The goal of this project is to construct a machine learning model that can
estimate the price of a property in suburban Chicago by evaluating its attributes. For example, in most cases, an
increase in square footage would increase the price of a property. However, not all variables carry equal weight.
The model would have to weigh the variables differently and determine whether one variable is more influential
on the price than another. Another goal of this project is to analyze the real estate market by using models to
identify trends. To accomplish this, models were constructed for different years from 2018 through July of 2022.
For example, in a volatile housing market year like 2022, prices rose sharply as demand for homes kept climbing.
Comparison with the more stable markets before the pandemic can reveal insights into market volatility and whether
prices are high or low. By comparing models across years, trends in the real estate market can be analyzed.
This project involved three major steps. First, data had to be extracted from Redfin.com, a realtor database
website. Second, the data had to be transformed and analyzed. Finally, a model had to be applied to the
transformed data to estimate the price. A script was used to load pages from Redfin, parse their contents to find
relevant data, and write the results to an Excel file that held all the data in an organized manner. This was made
possible by the requests-html library, a Python package that can render JavaScript-based websites before
extracting data. Data was then categorized by sale date and city. Analysis of the data included assessing how
important each attribute was for the housing price and transforming categorical data points into a numerical
format that could be interpreted by the model.
Currently, machine learning has proved to be very useful in this market, as shown by companies that employ these
techniques in products such as Zillow’s Zestimate and Redfin’s home appraisal. In this study, analysis was done
with models such as support vector regression, XGBoost, and random forest regression. Comparisons of model
effectiveness are detailed in the results section. The paper is organized as follows: section 2 covers data
collection, section 3 discusses data preprocessing, section 4 describes data analysis, section 5 covers modeling,
section 6 details the results of the study, and section 7 describes variable importance.
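As a point of reference for the model comparisons, the three error metrics named in the abstract (MAE, RMSE, and R-squared) can be computed as in the minimal sketch below. The function name and the sample prices are illustrative assumptions; they are not taken from the study's code or data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R-squared) for paired actual and predicted prices."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # coefficient of determination
    return mae, rmse, r2

# Made-up prices for illustration only (not from the study's data).
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [310_000, 440_000, 500_000, 640_000]
mae, rmse, r2 = regression_metrics(actual, predicted)
```

MAE and RMSE are in dollars, so lower is better, while an R-squared closer to 1 indicates that the model explains more of the variance in price.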
Data Collection
Python was used to scrape data from Redfin listings in the cities of Bolingbrook and Naperville. The requests-html
library was used specifically for this website. Since Redfin is a modern website that uses JavaScript to render its
pages, a more powerful web scraping library was needed to parse the data: each page had to be rendered first, and
then the complete HTML could be extracted. First, an array of links was established that held all the pages of
listings on the website. The URLs of these pages were formed from a base URL and a page number, so individual
listings had to be found through each page in the array. Using the web scraper, the HTML was parsed to identify
each link to the properties displayed on the page, and these links were in turn collected and stored. Given the
large volume of links to render and scrape, the code was constructed to run asynchronously. On each listing page,
data collection was relatively simple. All Redfin listings share a similar format, so the patterns found in the
HTML and CSS of one Redfin page could be applied to other pages. However, homeowners sometimes do not put all data
on their listings, so some data points are empty. Additional steps required to process this data are discussed in
the next section. The attributes scraped were: square footage, property type, year built, price, number of car
spaces, address, high school, beds, baths (half and full), heating, cooling, number of carpet rooms, number of
hardwood rooms, basement, basement square footage, basement description, and tax annual amount. Entries missing a
given data point were simply recorded as null. These data entries are stored in CSV files categorized by city and
name.
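The link-collection steps described above can be sketched roughly as follows. This is not the authors' script: to stay self-contained it uses only the Python standard library, and the JavaScript rendering that requests-html performs is replaced by a stub that returns canned HTML. The base URL, the `/page-N` URL pattern, the `/home/` link marker, and all function names are assumptions made for illustration.

```python
import asyncio
from html.parser import HTMLParser

BASE_URL = "https://example.com/city/Naperville"  # hypothetical placeholder

def listing_page_urls(base_url, n_pages):
    """Build one URL per results page from a base URL and a page number."""
    return [f"{base_url}/page-{i}" for i in range(1, n_pages + 1)]

class ListingLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like property pages."""
    def __init__(self, marker="/home/"):  # marker is an assumed URL pattern
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href and href not in self.links:
            self.links.append(href)

def extract_listing_links(html):
    """Return the property links found in one rendered results page."""
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.links

async def fetch_rendered_html(url):
    """Stub for the render-then-fetch step; returns canned HTML, no network."""
    await asyncio.sleep(0)
    page = url.rsplit("-", 1)[-1]
    return (f'<a href="/IL/Naperville/home/{page}01">A</a>'
            f'<a href="/about">About</a>'
            f'<a href="/IL/Naperville/home/{page}02">B</a>')

async def collect_links(urls):
    """Fetch all results pages concurrently, then gather their property links."""
    pages = await asyncio.gather(*(fetch_rendered_html(u) for u in urls))
    links = []
    for html in pages:
        links.extend(extract_listing_links(html))
    return links

links = asyncio.run(collect_links(listing_page_urls(BASE_URL, 2)))
```

In a real run, the stub would be replaced by a call that renders the page before returning its HTML; `asyncio.gather` keeps results in the order of the input URLs, so links stay grouped by results page.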
Data Preprocessing
The CSV data had to be cleaned for further analysis. The variables collected were square footage, property type, year
built, price, number of car spaces, address, high school, beds, baths (half and full), heating, cooling, number of carpet
rooms, number of hardwood rooms, basement, basement square footage, basement description, and tax annual amount.
The variable square footage is very important for a property. Its only transformation was removing the dollar sign
and converting the value to an integer. For property type, there were three categories: townhouse, condo/co-op, and
single residential. To make analysis easier, a separate column was created for each value of categorical variables
like property type. Each column held a 0 or 1 indicating whether a property was of that type. For example, a condo
would have a 1 in the condo column and a 0 in both the single residential and townhouse columns. Two other
important metrics were bedrooms and bathrooms. Bathrooms specifically were separated into two categories: half and
full. Heating, cooling, and basement description were all treated as categorical variables, and the operation
applied to the property type variable was applied to them as well. However, these variables are not binomial but
multinomial features, thus requiring additional processing.
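The two transformations described above, cleaning a dollar-formatted value and expanding a categorical variable into 0/1 columns, can be sketched as follows. The function names are hypothetical, and stripping thousands separators is an added assumption (the text mentions only the dollar sign).

```python
# Property types named in the text.
PROPERTY_TYPES = ["townhouse", "condo/co-op", "single residential"]

def parse_dollar_amount(text):
    """Strip the dollar sign (and, by assumption, commas) and convert to int.

    Returns None for a missing or empty value so nulls are preserved.
    """
    if text is None or not text.strip():
        return None
    return int(text.replace("$", "").replace(",", "").strip())

def one_hot(value, categories):
    """Return a {category: 0 or 1} dict with a 1 only in the matching column."""
    return {c: int(value == c) for c in categories}

condo_columns = one_hot("condo/co-op", PROPERTY_TYPES)
```

Here `condo_columns` holds a 1 for the condo column and a 0 for the other two, matching the example in the text.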
Using regular expressions (regex), a pattern-matching technique for parsing text, the free-text descriptions were
grouped into broader categories. For example, heating had a variety of descriptions and combinations with common
terms, like “natural gas.” Therefore, three main categories were created instead: natural gas, baseboard, and an
“other” category. For cooling, the columns were zoned, central air, and “other” once again. For the basement
variable there were many more descriptions, but the main ones were none, full, partial, English, and walk-out.
Number of carpet rooms and number of hardwood rooms were also useful in this analysis. They were combined into a
single number-of-rooms column to avoid bias, as a house with no carpeted rooms could still hold high value if it
had more hardwood floors instead. With these transformed variables, some outliers could be removed to improve the
data. For example, properties with a value of $2,500,000 or greater were removed. Properties were also limited to
at most ten thousand square feet. Finally, any homes without a listing price were removed, as they would interfere
with the modeling process. The table below depicts the averages of the housing attributes as organized by year and
variable.
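The categorization and filtering steps in this paragraph can be sketched as below, using Python's built-in `re` module. The patterns, the dictionary keys (`price`, `sqft`, `carpet_rooms`, `hardwood_rooms`), and the function names are illustrative assumptions. Two behaviors are also assumed rather than taken from the source: a description matching both heating terms is assigned to the first match, and a listing with missing square footage is kept.

```python
import re

# Ordered (category, pattern) pairs; anything unmatched falls into "other".
HEATING_PATTERNS = [
    ("natural gas", re.compile(r"natural\s+gas", re.IGNORECASE)),
    ("baseboard", re.compile(r"baseboard", re.IGNORECASE)),
]

def heating_category(description):
    """Map a free-text heating description to one of three buckets."""
    for name, pattern in HEATING_PATTERNS:
        if description and pattern.search(description):
            return name
    return "other"

def total_rooms(row):
    """Combine carpet and hardwood room counts; missing counts are treated as 0."""
    return (row.get("carpet_rooms") or 0) + (row.get("hardwood_rooms") or 0)

def keep_listing(row):
    """Keep priced listings under $2,500,000 and at most 10,000 sq ft."""
    price, sqft = row.get("price"), row.get("sqft")
    if price is None or price >= 2_500_000:
        return False
    if sqft is not None and sqft > 10_000:
        return False
    return True

rows = [
    {"price": 400_000, "sqft": 2_200, "carpet_rooms": 0, "hardwood_rooms": 6},
    {"price": 3_000_000, "sqft": 8_000, "carpet_rooms": 4, "hardwood_rooms": 5},
    {"price": None, "sqft": 1_500, "carpet_rooms": 3, "hardwood_rooms": 2},
]
cleaned = [r for r in rows if keep_listing(r)]
```

Only the first sample row survives the filter; the second exceeds the price cap and the third has no listing price.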
Data Analysis
From Table 1, it can be seen that over the past four years the averages of most variables remained about the same.
The one exception is the property price, indicating a hot real estate market in Naperville. The largest jump in
price occurred in the past year (2021-2022). This drastic increase in price variation also resulted in slightly
more volatility, as described below. Using a correlation function, the most important features were analyzed. Tax
annual amount and square footage appeared as two strong variables in price prediction. Single-family residential
homes correlated with higher house prices, while condos showed a strong negative correlation. Bathrooms