Unsupervised Trading
Taking my Portfolio out of Daycare
I didn’t want to develop a thesis for this trading strategy; way too much thinking. Instead, I let the data organize itself.
Unsupervised Trading is a portfolio algorithm that clusters S&P 500 stocks into momentum regimes using K-Means, then constructs optimized portfolios from the highest-momentum cluster each month. Instead of hard-coding rules like “buy stocks with RSI above 70,” the model learns which stocks behave similarly across dozens of features and lets those groupings drive allocation. This post walks through the full pipeline.
Read the disclaimer at the foot of the page; I am not responding to lawsuits.
The Idea
Traditional stock screening is like sorting books by genre or page count. You pick one dimension and draw a line. Stocks really aren’t one-dimensional, which is part of why they’re so hard to predict. A stock’s behavior is the product of its volatility, momentum, liquidity, factor exposures, and a dozen other characteristics simultaneously.
Clustering does something different. It looks at all of those dimensions at once and says: “these 30 stocks are behaving similarly right now.” Not because someone defined the rules, but because the data naturally groups that way. Each month, the clusters reform. Stocks move between groups as market conditions shift.
The pipeline looks like this:
S&P 500 Universe (8 years of daily data)
│
Feature Engineering
(6 technical indicators + 6 return horizons + 5 factor betas)
│
Monthly Aggregation
(Top 150 most liquid stocks)
│
K-Means Clustering
(4 clusters, RSI-anchored centroids)
│
Regime Selection
(High-momentum cluster)
│
Portfolio Optimization
(Max Sharpe ratio via Efficient Frontier)
│
Monthly Rebalance → Compare to SPY
Every component feeds the next. No step is optional.
Data Acquisition
The strategy starts by pulling every current S&P 500 constituent from Wikipedia, then downloading eight years of daily OHLCV data from Yahoo Finance:
from io import StringIO

import pandas as pd
import requests
import yfinance as yf

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
response = requests.get(url, headers=headers)
tables = pd.read_html(StringIO(response.text))
sp500 = tables[0]
symbols_list = sp500['Symbol'].unique().tolist()

end_date = '2026-03-29'
start_date = pd.to_datetime(end_date) - pd.DateOffset(365 * 8)

df = yf.download(tickers=symbols_list, start=start_date, end=end_date)
This gives us roughly 500 stocks with daily open, high, low, close, and volume data going back to early 2018. Feature engineering will shape it into something a clustering algorithm can work with.
Feature Engineering
Raw prices don’t tell a clustering algorithm much. A $500 stock and a $50 stock look completely different in price space even if they’re exhibiting the same momentum behavior. Features normalize and transform the data into a representation where “similar behavior” can actually be measured by distance.
Kinda like describing people. Height and weight alone are crude descriptors. But add resting heart rate, VO2 max, flexibility, reaction time, sleep patterns, and whatever else humans have and you start to see natural clusters. The richer the feature set, the more meaningful the clusters.
Garman-Klass Volatility
Standard volatility measures use only closing prices. Garman-Klass is smarter: it uses the full OHLC bar, extracting more information from the same trading day:
df['garman_klass_vol'] = (
    (np.log(df['high']) - np.log(df['low']))**2 / 2
    - (2 * np.log(2) - 1) * (np.log(df['close']) - np.log(df['open']))**2
)
The first term captures the day’s trading range (high-to-low). The second adjusts for the open-to-close move. Together, they produce a volatility estimate that’s up to eight times more efficient than close-to-close volatility. Fewer observations to get the same precision. For a monthly model that only sees ~21 data points per stock per month, this is massive.
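To see what the estimator rewards, here is a minimal standalone sketch of the same formula; the OHLC values are made up for illustration:

```python
import numpy as np

def garman_klass(open_, high, low, close):
    """Garman-Klass variance estimate for a single OHLC bar."""
    hl = np.log(high) - np.log(low)      # intraday range term
    co = np.log(close) - np.log(open_)   # open-to-close drift term
    return hl**2 / 2 - (2 * np.log(2) - 1) * co**2

# A quiet day: tight range, small net move
quiet = garman_klass(100.0, 100.5, 99.5, 100.2)
# A wild day: wide range, same net move
wild = garman_klass(100.0, 104.0, 96.0, 100.2)
print(quiet, wild)  # the wide-range bar scores far higher
```

Close-to-close volatility would treat both days as nearly identical, since the net move is the same; the range term is what recovers the difference.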
RSI (Relative Strength Index)
RSI measures how much of a stock’s recent price movement has been upward versus downward, normalized to a 0-100 scale:
df['rsi'] = df.groupby(level=1)['close'].transform(
    lambda x: pandas_ta.rsi(close=x, length=20)
)
An RSI of 70+ signals strong upward momentum: the stock has been mostly going up over the last 20 days. Below 30 signals the opposite. RSI becomes the primary axis for clustering.
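For intuition, here is a simplified RSI built from plain rolling means. Note this is not the Wilder-smoothed version pandas_ta actually computes; it only illustrates the mechanics:

```python
import pandas as pd
import numpy as np

def simple_rsi(close: pd.Series, length: int = 20) -> pd.Series:
    """Simplified RSI: ratio of average gains to average losses,
    mapped onto a 0-100 scale. pandas_ta uses Wilder smoothing instead."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(length).mean()
    loss = (-delta.clip(upper=0)).rolling(length).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# A steadily rising series has zero down days, so RSI pins at 100
up = pd.Series(np.linspace(100, 120, 40))
print(simple_rsi(up).iloc[-1])  # 100.0
```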
Bollinger Bands
Bollinger Bands wrap a moving average with bands at two standard deviations above and below. A stock near its upper band is stretched relative to its recent history; near the lower band, it’s compressed:
df['bb_low'] = df.groupby(level=1)['close'].transform(
    lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 0]
)
df['bb_mid'] = df.groupby(level=1)['close'].transform(
    lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 1]
)
df['bb_high'] = df.groupby(level=1)['close'].transform(
    lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 2]
)
Using log-transformed prices (np.log1p) before computing the bands ensures the width is proportional rather than absolute. A 2% move looks the same whether the stock is at $50 or $500.
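A quick sketch of the effect with toy prices: a 2% move is a $1 difference at $50 and a $10 difference at $500, but in log space the two are nearly the same number:

```python
import numpy as np

# A 2% move at two very different price levels
low_px = np.log1p(51.0) - np.log1p(50.0)    # $50 -> $51
high_px = np.log1p(510.0) - np.log1p(500.0) # $500 -> $510
print(low_px, high_px)  # nearly identical in log space
```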
ATR and MACD
ATR (Average True Range) captures how much a stock moves in absolute terms each day, accounting for gaps between sessions. MACD measures the convergence and divergence of two moving averages, signaling trend direction and strength. Both are computed per stock using pandas_ta and added as features.
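The notebook computes both via pandas_ta; hand-rolled equivalents sketch what those calls do. Note two assumptions here: this ATR uses a plain rolling mean where pandas_ta uses Wilder smoothing, and all inputs are synthetic:

```python
import pandas as pd
import numpy as np

def atr(high, low, close, length=14):
    """Average True Range: rolling mean of the true range, where the true
    range accounts for gaps by comparing against the prior close."""
    prev_close = close.shift(1)
    tr = pd.concat([high - low,
                    (high - prev_close).abs(),
                    (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(length).mean()

def macd(close, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA of the close."""
    return close.ewm(span=fast, adjust=False).mean() - \
           close.ewm(span=slow, adjust=False).mean()

close = pd.Series(np.linspace(100, 130, 60))  # a steady uptrend
high, low = close + 1, close - 1
print(atr(high, low, close, 14).iloc[-1])  # 2.0: the daily range dominates here
print(macd(close).iloc[-1] > 0)            # uptrend -> fast EMA above slow EMA
```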
Dollar Volume
Dollar volume, price times shares traded, drives the liquidity filter: only the 150 most liquid stocks each month enter the model. The ranking uses a rolling 5-year average rather than the latest month, to avoid recency bias:
data['dollar_volume'] = (data.loc[:, 'dollar_volume']
                         .unstack('ticker')
                         .rolling(5 * 12, min_periods=12)
                         .mean()
                         .stack()
                         .sort_index())
data['dollar_vol_rank'] = data.groupby('date')['dollar_volume'].rank(ascending=False)
data = data[data['dollar_vol_rank'] < 150]
Filtering on a long-term average means a stock doesn’t drop out of the universe just because it had one quiet month.
Multi-Horizon Momentum Returns
A single return lookback captures one slice of a stock’s trajectory. A stock that’s up 20% over 12 months but down 5% this month tells a different story than one that’s up 20% over 12 months and up 5% this month.
To give the clustering algorithm this kind of depth perception, returns are computed over six horizons:
def calculate_returns(df):
    outlier_cutoff = 0.005
    lags = [1, 2, 3, 6, 9, 12]
    for lag in lags:
        df[f'return_{lag}m'] = (df['close']
                                .pct_change(lag)
                                .pipe(lambda x: x.clip(
                                    lower=x.quantile(outlier_cutoff),
                                    upper=x.quantile(1 - outlier_cutoff)))
                                .add(1)
                                .pow(1 / lag)
                                .sub(1))
    return df
Each return is winsorized at the 0.5th and 99.5th percentiles, clipping extreme outliers that would distort the clustering. The .pow(1/lag) converts each cumulative return into a geometric monthly average, so a 12-month return is comparable in scale to a 1-month return. Without this normalization, longer-horizon features would dominate shorter ones simply by having larger values.
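The .pow(1/lag) rescaling is easy to verify by hand; the 26.8% figure below is just an example:

```python
import numpy as np

# A 12-month cumulative return of 26.8% ...
r12 = 0.268
# ... corresponds to a geometric average of roughly 2% per month,
# the same scale as a typical 1-month return
monthly = (1 + r12) ** (1 / 12) - 1
print(round(monthly, 4))  # ~0.02
```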
Fama-French Factor Betas
Technical indicators describe what a stock is doing. Factor betas describe why, or at least which systematic risk factors explain its returns.
The Fama-French five-factor model decomposes stock returns into exposures to five well-documented drivers:
| Factor | What It Captures | Analogy |
|---|---|---|
| Mkt-RF | Market beta — how much the stock moves with the market | The tide that lifts (or sinks) all boats |
| SMB | Size — small caps vs. large caps | Speedboat vs. cargo ship |
| HML | Value — cheap stocks vs. expensive stocks | Buying the fixer-upper vs. the new build |
| RMW | Profitability — robust earnings vs. weak earnings | The business that prints cash vs. the one burning it |
| CMA | Investment — conservative reinvestment vs. aggressive expansion | The company saving for a rainy day vs. the one building a new factory |
Factor data is pulled directly from Kenneth French’s data library and merged with individual stock returns. Then, rolling 24-month OLS regressions estimate each stock’s exposure to each factor:
betas = (factor_data.groupby(level=1, group_keys=False)
         .apply(lambda x: RollingOLS(
             endog=x['return_1m'],
             exog=sm.add_constant(x.drop('return_1m', axis=1)),
             window=min(24, x.shape[0]),
             min_nobs=len(x.columns) + 1)
             .fit(params_only=True)
             .params
             .drop('const', axis=1)))
RollingOLS from statsmodels runs a separate linear regression for each 24-month window, producing a time series of betas for each stock. A stock with a rising market beta is becoming more sensitive to broad market moves. A stock whose SMB beta flips from positive to negative is transitioning from small-cap behavior to large-cap behavior.
These betas are lagged by one month before joining the feature set. The model uses last month’s estimated exposures to predict next month’s behavior, avoiding look-ahead bias.
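The five-factor RollingOLS is shown above; for intuition, the one-factor case reduces to rolling covariance divided by rolling variance. A sketch on synthetic data with a known beta (the series and the 1.5 figure are invented for the demo):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
mkt = pd.Series(rng.normal(0.01, 0.04, 48))  # 48 months of market excess returns
stock = 1.5 * mkt                            # a stock with a true market beta of 1.5

# One-factor rolling beta: cov(stock, market) / var(market) over 24-month windows
beta = stock.rolling(24).cov(mkt) / mkt.rolling(24).var()
print(beta.dropna().iloc[-1])  # recovers the true beta of 1.5
```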
After all feature engineering, each stock-month observation is described by 18 features: Garman-Klass volatility, RSI, three Bollinger Band levels, ATR, MACD, six return horizons, and five factor betas.
K-Means Clustering
With 150 stocks and 18 features each month, the question becomes: which stocks are behaving alike right now?
K-Means answers this by partitioning the stocks into k groups such that each stock belongs to the cluster whose center (centroid) it’s closest to in 18-dimensional feature space. The algorithm iterates between assigning stocks to their nearest centroid and updating the centroids to the mean of their assigned stocks, until the assignments stabilize.
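The assign-and-update loop is small enough to sketch directly. This is a toy 2-D version in numpy, not the 18-feature, RSI-anchored fit the strategy runs:

```python
import numpy as np

def kmeans(X, centroids, iters=10):
    """Bare-bones K-Means: alternate assignment and centroid update."""
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

# Two obvious 2-D blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.1],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, cents = kmeans(X, centroids=np.array([[0.0, 0.0], [1.0, 1.0]]))
print(labels)  # first three points in one cluster, last three in the other
```

scikit-learn's KMeans does the same thing with smarter initialization and convergence checks; the strategy swaps the initialization for the RSI-anchored centroids described below.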
Why Not Just Sort by RSI?
You could skip all of this and just buy the top 30 stocks by RSI each month. That would be a momentum strategy. But it would ignore everything else. A stock with RSI 75 and collapsing volume is a very different animal from one with RSI 75 and surging volume. Clustering considers all 18 features simultaneously, finding groups of stocks that are similar across every dimension, not just one.
It’s the difference between picking basketball players by height alone versus evaluating height, wingspan, vertical leap, speed, and shooting percentage together. You’d get very different rosters.
Imagine if Curry went undrafted cause he was small, the French would’ve run away with gold in 2024
RSI-Anchored Centroids
Standard K-Means uses random or k-means++ initialization, letting the algorithm find whatever groupings minimize total distance. This works, but produces clusters that are statistically optimal without being financially meaningful.
Instead, the centroids are pre-defined along the RSI axis:
target_rsi_values = [30, 45, 55, 70]
initial_centroids = np.zeros((len(target_rsi_values), 18))
initial_centroids[:, 1] = target_rsi_values # RSI is feature index 1
This seeds four clusters at RSI 30 (oversold), 45 (weak), 55 (neutral), and 70 (strong momentum). All other feature dimensions start at zero. K-Means then refines these centroids during fitting, but the initialization biases the algorithm toward financially interpretable groupings.
The clustering is refit independently each month:
def get_clusters(df):
    df['cluster'] = KMeans(n_clusters=4,
                           random_state=0,
                           init=initial_centroids).fit(df).labels_
    return df

data = data.dropna().groupby('date', group_keys=False).apply(get_clusters)
Each month, the composition of every cluster changes. A stock in Cluster 3 (high momentum) in January might slide into Cluster 2 (neutral) by March if its momentum fades. The clusters reflect the current state of the market.
Portfolio Construction
Clustering tells us which stocks belong together. Portfolio optimization tells us how much to bet on each one.
Selecting the Momentum Cluster
The strategy selects Cluster 3, the group seeded around RSI 70, each month.
The hypothesis: stocks clustered around strong momentum characteristics will continue outperforming in the following month. This is momentum persistence. Stocks that have been going up tend to keep going up, at least over the next one to twelve months.
filtered_df = data[data['cluster'] == 3].copy()
filtered_df = filtered_df.reset_index(level=1)
filtered_df.index = filtered_df.index + pd.DateOffset(1)
The DateOffset(1) is subtle but critical: it adds one calendar day to each month-end timestamp, rolling it into the start of the following month. If we cluster stocks based on January data, we trade them in February. This prevents look-ahead bias: the model never acts on information it wouldn’t have had at the time of the trade.
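Worth spelling out, since it trips people up: pd.DateOffset(1) defaults to one calendar day, and on a month-end stamp that lands in the next month:

```python
import pandas as pd

# Monthly features are stamped with month-end dates; adding one calendar day
# rolls each stamp into the first day of the following month.
jan = pd.Timestamp('2024-01-31')
print(jan + pd.DateOffset(1))  # 2024-02-01: January's signal trades in February
```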
Maximum Sharpe Ratio Optimization
Given the selected stocks, the next question is allocation. Equal weighting is the naive approach. If one stock is far more volatile than the others, equal weighting concentrates risk without concentrating return.
The Efficient Frontier finds the portfolio weights that maximize the Sharpe ratio: the ratio of excess return to risk:
def optimize_weights(prices, lower_bound=0):
    returns = expected_returns.mean_historical_return(prices=prices, frequency=252)
    cov = risk_models.sample_cov(prices=prices, frequency=252)
    ef = EfficientFrontier(expected_returns=returns,
                           cov_matrix=cov,
                           weight_bounds=(lower_bound, .1),
                           solver='SCS')
    weights = ef.max_sharpe()
    return ef.clean_weights()
The weight_bounds=(lower_bound, .1) constraint caps any single stock at 10% of the portfolio. This prevents the optimizer from going all-in on whichever stock has the highest estimated return. Diversification is enforced.
When optimization fails, which happens when the selected stocks are too correlated or too few, the strategy falls back to equal weights:
try:
    weights = optimize_weights(optimization_prices,
                               lower_bound=round(1 / (len(cols) * 2), 3))
except Exception:
    # Fall back to equal weights if the optimizer can't converge
    weights = {col: 1 / len(cols) for col in cols}
Calculating Portfolio Returns
Each month, the strategy downloads fresh daily prices for the selected stocks, computes daily log returns, and applies the optimized weights:
returns_dataframe = np.log(new_df['Close']).diff()
for start_date in fixed_dates.keys():
    # ... compute weights for this month's selected stocks
    portfolio_df = pd.concat([portfolio_df, returns_df], axis=0)
The use of log returns (np.log().diff()) rather than simple returns (pct_change()) is deliberate. Log returns are additive across time: you can sum daily log returns to get a cumulative return without compounding artifacts. This makes the final performance visualization cleaner and more mathematically sound.
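A quick demonstration with toy prices: summing log returns recovers the same total as compounding simple returns, while summing simple returns would not:

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 105.0])
log_rets = np.diff(np.log(prices))
simple_rets = prices[1:] / prices[:-1] - 1

# Log returns sum to the cumulative log return...
total_log = log_rets.sum()
# ...while simple returns must be compounded, not summed
total_simple = np.prod(1 + simple_rets) - 1
print(np.exp(total_log) - 1, total_simple)  # both 0.05: 100 -> 105 overall
```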
Benchmarking Against SPY
A strategy is only interesting relative to the alternative. The benchmark is SPY, representing a passive buy-and-hold of the same universe the strategy selects from:
spy = yf.download(tickers='SPY', start='2018-01-01', end=dt.date.today())
spy_ret = np.log(spy[['Close']]).diff().dropna().rename({'Close': 'SPY Buy&Hold'}, axis=1)
portfolio_df = portfolio_df.merge(spy_ret, left_index=True, right_index=True)
Cumulative returns are computed and plotted side by side:
portfolio_cumulative_return = np.exp(portfolio_df.cumsum()) - 1  # returns are already log returns
portfolio_cumulative_return.plot(figsize=(16, 6))
plt.title('Unsupervised Learning Trading Strategy Returns Over Time')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
plt.ylabel('Return')
This is the final output. A single chart showing whether the monthly clustering and optimization machinery actually adds value over doing nothing.
To spoil it for you: it traded above the S&P through the COVID period, then posted weak results from 2023 onward.
Why Unsupervised?
Why would we even go in the direction of unsupervised learning here? I was joking at the start about being too lazy to hypothesize; there really is a fair reason.
Regime detection versus point prediction. Supervised models try to predict exact outcomes: “this stock will return 3.2% next month.” That’s an incredibly hard problem with noisy data. And when they’re wrong, you’re allocating capital based on a confident but incorrect prediction.
Unsupervised learning side-steps this entirely. It instead says: “these 30 stocks are currently behaving like high-momentum assets across every dimension I can measure.” You’re betting that the regime persists, not that a specific number is correct.
It’s like weather forecasting: you don’t call tomorrow’s exact rainfall, you recognize that it’s storm season and plan accordingly.
Clustering identifies the pattern; the portfolio optimization makes the trade.
Disclaimer
This project is for educational and research purposes only. It is not financial advice. Past performance of any backtested strategy does not guarantee future results. Always conduct your own due diligence before making investment decisions.
Thanks for reading this far, despite how smart I may sound (lol) this algo didn’t yield any results worth writing home about. Once I make one that does yall certainly won’t be finding it here.
This project is open source at unsupervised-trading-v1.