Predicting Urban Walking Risk with Spatial-Temporal Machine Learning Using H3 Grids and Tweedie Regression
The project, called StreetSense, aims to enhance urban walking safety by using spatial-temporal machine learning to predict risk along walking routes in San Francisco. The core idea emerged from a personal experience: after dinner in downtown San Francisco, the author wanted to walk home but found that Google Maps offered no way to prioritize safety over speed. While the app could show the fastest route, it couldn’t help answer a more nuanced question: how to choose a path that feels safer at night, especially in unfamiliar areas.
The problem was framed as follows: given a start point, end point, day of the week, and time of day, what is the expected risk along a walking route? The goal was to move beyond simple distance or time metrics and instead incorporate how risk varies across different parts of the city and at different times.
To build this, the author used San Francisco’s publicly available police incident reports from the Open Data Portal, spanning from 2018 to the present. These records included incident type, location, time, and description. A key challenge was that not all incidents carry the same level of risk. To address this, the author used a large language model to assign severity scores, on a 0 to 10 scale, across three dimensions: threat level, physical harm potential, and emotional distress. These scores were combined into an overall severity signal for each incident type.
To handle spatial data, the author adopted Uber’s H3 geospatial indexing system, which divides the globe into hexagonal cells. This approach provides uniform neighbor distances and enables consistent aggregation of risk at the block and neighborhood level. Time was encoded using sine and cosine transformations to capture cyclical patterns, such as the transition from midnight to 1 a.m., and incidents were grouped into 3-hour time windows to balance granularity and data reliability.
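The cyclical time encoding and 3-hour windows described above can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the choice to encode the window index (rather than the raw hour), and the sine/cosine treatment of the day of week are all assumptions made for the example.

```python
import math

WINDOW_HOURS = 3                       # window size stated in the write-up
WINDOWS_PER_DAY = 24 // WINDOW_HOURS   # 8 windows per day


def time_window(hour):
    """Map an hour of day (0-23) to its 3-hour window index (0-7)."""
    return hour // WINDOW_HOURS


def cyclical(value, period):
    """Project a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_time(day_of_week, hour):
    """Hypothetical feature layout: window and weekday, each as sin/cos.

    day_of_week is 0-6; encoding it cyclically is an assumption here.
    """
    win_sin, win_cos = cyclical(time_window(hour), WINDOWS_PER_DAY)
    dow_sin, dow_cos = cyclical(day_of_week, 7)
    return [win_sin, win_cos, dow_sin, dow_cos]


# The last window of the day (21:00-24:00) and the first (00:00-03:00)
# end up as close together as any other pair of adjacent windows.
late_to_early = math.dist(encode_time(1, 23), encode_time(1, 0))
early_to_next = math.dist(encode_time(1, 0), encode_time(1, 3))
```

With a plain integer hour feature, 23 and 0 would sit at opposite ends of the scale; on the unit circle they are neighbors, which is the property the encoding is meant to preserve.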
The final model used XGBoost, chosen for its ability to handle non-linear patterns in tabular data, deliver fast inference, and manage complex interactions between spatial and temporal features. Because the target variable, expected risk, was zero-inflated and right-skewed (many cells had no incidents, but a few had high-risk events), the author used Tweedie regression. A Tweedie distribution models the sum of a random number of random-sized events, which suits crime data where both frequency and severity matter.
The output is not a binary “safe/unsafe” label but an expected risk score. This allows the system to reflect both rare high-severity events and sustained patterns of lower-severity incidents. For example, a quiet neighborhood with a single high-risk incident may have a low expected risk, while a busy area with frequent low-severity incidents can have a higher score.
The model was deployed as a web application using the Google Maps API. Routes are color-coded based on risk percentiles: green (safe), yellow (moderately safe), orange (moderately risky), and red (risky). The app can also suggest an alternative route if the detour adds no more than 15% to the original duration, helping users avoid high-risk areas without adding excessive time.
The system was tested on a real-world route from Chinatown to Market & Van Ness. At 9 a.m. on a Tuesday, the route showed mostly green and yellow segments. At 11 p.m. on a Saturday, the model detected higher risk in certain blocks and rerouted the user through safer streets, demonstrating the system’s ability to adapt to time-of-day context.
The model is based on historical data and reflects past patterns, not future outcomes, but it still provides useful context for decision-making. It’s not meant to replace personal judgment but to support it. Future improvements could include incorporating real-time data, pedestrian traffic patterns, lighting conditions, or even user feedback to refine predictions.
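The "sum of a random number of random-sized events" behind the Tweedie target can be illustrated with a small compound Poisson-gamma simulation. This is a sketch under stated assumptions rather than the project's code: the incident rates, mean severities, and gamma shape below are made-up values chosen to mirror the quiet-versus-busy example, and the helper names are hypothetical.

```python
import math
import random


def sample_poisson(rng, rate):
    """Draw a Poisson count with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_cell_risk(rng, incident_rate, mean_severity, shape=2.0):
    """Total severity in one cell and time window.

    A Poisson number of incidents, each with a gamma-distributed severity.
    The gamma mean is shape * scale, so scale is set to hit mean_severity.
    """
    n_incidents = sample_poisson(rng, incident_rate)
    scale = mean_severity / shape
    return sum(rng.gammavariate(shape, scale) for _ in range(n_incidents))


def simulate(incident_rate, mean_severity, n=50_000, seed=0):
    """Return (average risk, share of windows with zero incidents)."""
    rng = random.Random(seed)
    draws = [sample_cell_risk(rng, incident_rate, mean_severity) for _ in range(n)]
    zero_share = sum(d == 0 for d in draws) / n
    return sum(draws) / n, zero_share


# Hypothetical cells: rare but severe incidents vs. frequent but minor ones.
quiet_mean, quiet_zero_share = simulate(incident_rate=0.02, mean_severity=8.0)
busy_mean, busy_zero_share = simulate(incident_rate=0.5, mean_severity=2.0)
```

In expectation, the risk of a cell is its incident rate times its mean severity, so the quiet cell averages about 0.02 × 8 = 0.16 while the busy cell averages about 0.5 × 2 = 1.0, even though most windows in both cells are exactly zero. XGBoost exposes this family through its `reg:tweedie` objective, with `tweedie_variance_power` between 1 and 2 controlling the variance power; the write-up does not say which value the author used.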
Ultimately, StreetSense aims to help people navigate cities more confidently, not by labeling places as safe or dangerous, but by offering a nuanced, data-driven understanding of risk that evolves with time and place.
