
House price prediction for Pakistan’s real estate market using PySpark. This project applies linear regression to analyze factors like location, size, and amenities, supporting informed decision-making.


evitanegaraputri4/House-Price-Prediction-Using-Spark-ML

Business Understanding

In a rapidly growing real estate market, accurate house price predictions are crucial. Pricing a house appropriately requires analyzing factors such as location, size, amenities, and market trends. Reliable price predictions give investors, buyers, and real estate professionals the insight they need to make informed decisions and improve their investment strategies.

Project Objective

  • The goal of this project is to predict house prices using the Pakistan House 2023 dataset.

  • The dataset contains eight columns: house price is the dependent variable, and the other seven columns are the independent variables (features). Access the dataset here.

  • A linear regression model was implemented using PySpark's machine learning library to predict house prices.

  • Three evaluation metrics, mean absolute error (MAE), root mean squared error (RMSE), and mean squared error (MSE), were used to assess the model's performance on both the training and test datasets.
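
As a rough illustration of what these three metrics measure, the sketch below computes them in plain Python. It is not the project's code: the function name `regression_errors` and the sample prices are hypothetical, chosen only to show the arithmetic. In PySpark itself, `RegressionEvaluator` exposes the same metrics through its `metricName` parameter (`"mae"`, `"mse"`, `"rmse"`).

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE, and RMSE for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # average absolute error
    mse = sum(r * r for r in residuals) / n    # average squared error
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Hypothetical prices, not taken from the project's dataset.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
errors = regression_errors(actual, predicted)
print(errors)
```

MAE and RMSE are expressed in the same unit as the price, while MSE is in squared units. RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off, which makes the pair useful for spotting large individual errors.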

Project Overview

This project involved several key steps to predict house prices using the Pakistan House 2023 dataset:

  • Data Import and Initialization: Set up Spark for local execution and imported necessary libraries such as NumPy, Matplotlib, and PySpark.
  • Data Pre-Processing
    • Data Cleaning: Removed unnecessary characters and converted string columns into appropriate integer types.
    • Handling Missing and Inconsistent Data: Applied the dropna method to remove rows with missing values and ensured consistent data across relevant columns.
    • Feature Encoding: Transformed categorical variables (e.g., property type, city, and purpose) into numerical form using StringIndexer and OneHotEncoder.
  • Handling Outliers: Applied the Interquartile Range (IQR) method to detect and remove outliers in continuous variables.
  • Standard Scaler: Scaled the feature set with StandardScaler so that features measured on different scales contribute comparably to model training.
  • Train-Test Split: Divided the dataset into 70% for training and 30% for testing to evaluate model performance on unseen data.
  • Building Model and Evaluation
    • Built a linear regression model using PySpark.
    • Evaluated performance using metrics such as MAE, RMSE, and MSE for both training and test data.
    • Visualized predicted vs. actual prices, showing a linear relationship with some prediction errors.
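
The arithmetic behind three of the steps above (IQR outlier removal, standard scaling, and the 70/30 split) can be sketched without Spark. This is a minimal plain-Python illustration under stated assumptions, not the project's implementation: the helper names, the fence multiplier `k=1.5`, the seed, and the sample values are all hypothetical. In PySpark, the corresponding building blocks would be `DataFrame.approxQuantile`, `StandardScaler`, and `DataFrame.randomSplit`.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it at the training fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical property sizes; 500 is an obvious outlier.
areas = [5, 7, 8, 10, 10, 12, 14, 15, 20, 500]
kept = remove_outliers(areas)
scaled = standardize(kept)
train, test = train_test_split(kept)
print(kept, len(train), len(test))
```

Two differences from Spark are worth keeping in mind. PySpark's `StandardScaler` scales to unit standard deviation but does not center the data unless `withMean=True` is set, and `randomSplit` samples rows randomly, so the resulting split sizes are approximately, rather than exactly, 70% and 30%.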

Project Result

  • The linear regression model exhibited a positive correlation between actual and predicted house prices for both training and test datasets.
  • The evaluation metrics (MAE, RMSE, and MSE) showed similar error values on the training and test sets, which indicates consistent performance on seen and unseen data. The error magnitudes, however, indicate significant deviations from actual prices.
  • Although predicted prices follow actual prices linearly, the relatively high error values point to limitations of the model, possibly because the available features are insufficient or because of the model's complexity.

Conclusion

The project successfully applied PySpark to predict house prices. Although predicted and actual prices show a clear linear relationship, the high error values suggest that further improvements are needed, such as refining feature selection and adjusting model complexity, to achieve better accuracy in future predictions.
