k-NN Regressor in Excel: A Hands-On Introduction to Machine Learning with Real-World Challenges and Insights
The Machine Learning “Advent Calendar” Day 1: k-NN Regressor in Excel

Today’s entry kicks off a series exploring machine learning and deep learning concepts using Excel. We begin with the k-Nearest Neighbors (k-NN) regressor, one of the simplest yet most insightful models. It’s a perfect starting point for those new to machine learning, and even experienced practitioners can find subtle lessons here.

The core idea of k-NN is intuitive: to predict a value for a new data point, find the k most similar training examples and average their target values. Think of it as asking your neighbors what a similar apartment costs; this everyday logic is exactly what k-NN implements.

We’ll use the California Housing Dataset, which contains block group-level data from California; the target variable is the median house value (MedHouseVal). Features include median income, house age, average rooms, average bedrooms, population, average occupants, and geographic coordinates.

Starting with a single feature, we can predict the house value for a new observation, say one with median income = 10 (in units of $10,000). The process is straightforward: compute the absolute difference in income, rank the training data by this distance, select the 3 closest rows (k = 3), and average their target values. In Excel, this is done with the RANK, IF, and SUMPRODUCT functions. The result is a simple, transparent prediction.

When we extend to two features, such as income and house age, we use Euclidean distance. But here a critical insight emerges: feature scale matters. If one feature (e.g., population) is in the thousands and another (e.g., income) is in the tens, the larger-scale feature dominates the distance calculation, which can lead to misleading results.

Even more problematic is the case of mixed data types. The California dataset includes both continuous variables (like income) and geographic coordinates.
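Before moving on, the single-feature procedure above (distance, rank, average of the k closest) can also be sketched outside Excel. Here is a minimal Python version; the toy income/value numbers are illustrative, not taken from the article’s worksheet:

```python
# Hypothetical single-feature k-NN regression, mirroring the Excel steps:
# distance -> rank -> take the k closest -> average their targets.
def knn_predict(x_train, y_train, x_new, k=3):
    # The absolute income difference plays the role of the distance column.
    ranked = sorted(zip((abs(x - x_new) for x in x_train), y_train))
    # Average the target values of the k nearest neighbors.
    return sum(y for _, y in ranked[:k]) / k

# Toy training data: median income (in $10,000s) and median house value.
incomes = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
values = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

# Query: median income = 10 -> mean target of the 3 closest rows.
prediction = knn_predict(incomes, values, x_new=10.0, k=3)
```

The `sorted(zip(...))` pair does what RANK does in the spreadsheet: it orders rows by distance, so slicing the first k rows selects the neighbors.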
Using standard Euclidean distance on raw latitude and longitude is not meaningful, because the ground distance spanned by one degree of longitude depends on latitude. A better approach is to use actual geographic distance, which can be computed with the haversine formula. This makes the model both more accurate and more interpretable.

The same issue arises with categorical features. In the diamond price dataset, clarity (e.g., VVS1, SI2) is categorical. If we use one-hot encoding, all categories are treated as equally distant from each other, which distorts the true hierarchy of diamond grading and leads to poor neighbor selection. A better approach is ordinal encoding, where we assign numerical values based on expert knowledge, for example IF = 10, VVS1 = 9, VVS2 = 8, and so on. This preserves the natural order and ensures that distances reflect real-world quality differences.

A key takeaway: k-NN is not a model in the traditional sense. It has no training phase; the “model” is the entire dataset. The only hyperparameter is k, and performance depends entirely on the choice of distance metric and feature scaling.

The real challenge lies in feature scaling. k-NN has no built-in way to normalize or weight features. It cannot “understand” that one feature is in dollars and another in yen, or that a 1-unit change in income is more meaningful than a 1-degree change in latitude. The user must handle this manually. This is why k-NN, while simple and interpretable, is rarely used in production without preprocessing. It is a great teaching tool, but in practice it highlights the need for more robust models that handle feature scaling, distance definition, and data heterogeneity automatically.

In the end, k-NN teaches us a powerful lesson: the quality of a model lies not just in its algorithm, but in how we prepare the data and define similarity. The next steps in this series will build on this foundation, showing how more advanced models address these limitations.
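As a closing sketch for the geographic-distance idea above, the haversine formula can be written in a few lines. This is the standard formulation using the mean Earth radius of 6,371 km; the city coordinates in the example are approximate and purely illustrative:

```python
import math

# Great-circle distance between two (lat, lon) points given in
# decimal degrees, via the haversine formula.
def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)   # difference in latitude
    dlam = math.radians(lon2 - lon1)   # difference in longitude
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Example: roughly the Los Angeles -> San Francisco distance,
# around 560 km, versus a naive Euclidean distance of ~5.6 "degrees".
d = haversine_km(34.05, -118.24, 37.77, -122.42)
```

Replacing the raw latitude/longitude columns with this distance (query point to each training row) is one way to make the neighbor ranking reflect real geography.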
