
2024

Analysing Every Y Combinator Batch Ever

TL;DR

Y Combinator (YC) is betting on more founders than ever. New YC startups are showing signs of efficiency, requiring smaller teams to run. Source Code

Why Scrape YC?

Y Combinator's company directory is a gold mine of data.

With 4,933 startups and counting, there is an opportunity to uncover valuable insights and trends. The growing list of companies allows us to extract crucial patterns in technical-founder-driven venture capital (VC).

Each company has a unique story.

But they all have a beginning, involving the initial plot, characters, and conflict. For YC startups, it's where purpose-driven founders receive the resources and mentorship to rapidly solve customer problems.

Then comes the journey.

Founders venture into the unknown, not knowing the road ahead. No one knows how it will end. And yet, with the uncertainty, all stories end in one of three ways: Acquired, Active or Inactive.

The YC Directory holds the story of every startup they have ever funded. Each has a unique set of moments, creating one of the most beautiful movements of purpose-driven technical founders.

That's why I want to find out how these stories are shaping over time.

The Method

The idea for this project came from (YC W24) Gumloop's Scrape YC Directory example. I tried to run it, but the process felt too inefficient given the excessive number of tokens it required for even a small YC batch.

Using Python only, I wanted to extract the information of every startup in YC's public directory.

Python Stack

The following tools made this possible:

  • Language Model - Useful for data extraction. GPT-4o mini works best.
  • Selenium - Web driver for extracting raw text body and links.
  • Instructor/Pydantic - Structured outputs powered by language models.

Scraping Tactic

  1. For each YC batch, load the whole company directory by scrolling to the page end. Extract all Company YC URLs to a CSV file YC_URLs.csv. Exclude irrelevant URLs.

  2. For each YC company URL, scrape the page content and links.

  3. Using Instructor and Pydantic, parse the data into the defined Pydantic Founders and YC_Company models.

  4. Save the scraped data to a CSV file YC_Directory.csv.
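Step 1 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `scroll_to_end`, `is_company_url` and `filter_company_urls` are hypothetical names, and the URL pattern `https://www.ycombinator.com/companies/<slug>` is an assumption about what counts as a relevant link. The `driver` argument is expected to behave like a Selenium WebDriver, of which only `execute_script` is used.

```python
import time
from urllib.parse import urlparse


def scroll_to_end(driver, pause=1.0, max_rounds=200):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded company cards time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so we reached the end
        last_height = new_height
    return max_rounds


def is_company_url(url):
    """Assumed pattern for a company page: /companies/<slug> on ycombinator.com."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    return (
        parsed.netloc in ("www.ycombinator.com", "ycombinator.com")
        and len(parts) == 2
        and parts[0] == "companies"
    )


def filter_company_urls(urls):
    """Drop irrelevant links and duplicates, preserving first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        if is_company_url(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

The filtered list is what would then be written to YC_URLs.csv, for example with Python's built-in `csv` module.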

The Collected Data

As a start, I gathered every YC company's high-level information...

  • Name
  • YC Batch
  • Status
  • Team Size
  • URL

...and used it to discover secrets behind technical-founder-driven VC.

I can easily expand this to cover founder count, HQ location, industry and so on. But that's for the future.
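For illustration, the five fields above could be captured in a Pydantic model along these lines. The field names, the types and the optional team size are my assumptions here, not the project's exact schema; the four status values are the ones that appear in the dataset.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class YC_Company(BaseModel):
    name: str
    batch: str  # e.g. "W21" or "S22"
    status: Literal["Active", "Inactive", "Acquired", "Public"]
    team_size: Optional[int] = None  # assumption: some pages may not list one
    url: str


# A made-up record, only to show validation in action.
company = YC_Company(
    name="Example Co",
    batch="W24",
    status="Active",
    team_size=12,
    url="https://www.ycombinator.com/companies/example-co",
)
print(company.name, company.batch, company.team_size)

# An unknown status is rejected rather than silently stored.
try:
    YC_Company(name="Example Co", batch="W24", status="Zombie", url="n/a")
except ValidationError:
    print("invalid status rejected")
```

Constraining the status to a fixed set of values means a language model that returns something unexpected fails loudly instead of polluting the CSV.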

The Results

The dataset contains 4,933 companies broken down into four status categories:

  • Active: 3,537 companies (71.7%)
  • Inactive: 815 companies (16.5%)
  • Acquired: 564 companies (11.4%)
  • Public: 17 companies (0.3%)
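The percentages above follow directly from the counts. A short sketch of the arithmetic, where `status_shares` is just an illustrative helper name:

```python
status_counts = {"Active": 3537, "Inactive": 815, "Acquired": 564, "Public": 17}


def status_shares(counts):
    """Return each status as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.items()}


shares = status_shares(status_counts)
for status, pct in shares.items():
    print(f"{status}: {status_counts[status]:,} companies ({pct}%)")
```

Because each share is rounded independently, the four percentages add up to 99.9% rather than exactly 100%.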

An observation...

Peter Thiel was right. Over a long period of time, startups follow a Power Law distribution. VC is all about making a lot of small bets. But how can founders do the same thing for themselves?

YC Batch Size Trend

The increasing batch size shows that YC is succeeding in achieving its mission: helping startups grow.

2021 was the largest cohort with 728 companies in Batches W21 and S21. But YC then limited batch sizes to around 250 companies. This could be the optimal size for startup accelerators.

Survival Rate by Batch

The survival rate over time shows the natural attrition in the startup ecosystem, where only the most viable companies persist over time.

This can be called the Law of Company Progression.
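One way to compute a survival rate per batch is sketched below. It assumes that "survived" means any status other than Inactive, which is my reading rather than a definition taken from the dataset, and the sample rows are made up.

```python
from collections import defaultdict


def survival_rate_by_batch(companies):
    """Share of companies per batch whose status is not 'Inactive'."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for company in companies:
        totals[company["batch"]] += 1
        if company["status"] != "Inactive":
            survivors[company["batch"]] += 1
    return {batch: survivors[batch] / totals[batch] for batch in totals}


# Hypothetical rows, not real YC data.
sample = [
    {"batch": "W21", "status": "Active"},
    {"batch": "W21", "status": "Inactive"},
    {"batch": "W21", "status": "Acquired"},
    {"batch": "W21", "status": "Active"},
    {"batch": "S22", "status": "Active"},
    {"batch": "S22", "status": "Public"},
]
rates = survival_rate_by_batch(sample)
print(rates)
```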

Total Team Size by Batch

Despite YC funding more startups than ever, total batch team size has plateaued, indicating a trend towards leaner, more efficient, technically driven startups. This is evident from Batch S22 onwards. Future datasets should monitor this trend closely.
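A sketch of how total team size per batch could be aggregated and put in chronological order. It assumes batch codes of the form season letter plus two-digit year (W21, S22), with winter preceding summer; the helper names and the sample rows are hypothetical.

```python
from collections import defaultdict

SEASON_ORDER = {"W": 0, "S": 1}  # winter batches come before summer batches


def batch_sort_key(batch):
    """Turn a code such as 'S22' into a sortable (year, season) tuple."""
    season, year = batch[0], int(batch[1:])
    # Any other season letter sorts after W and S within its year (assumption).
    return (2000 + year, SEASON_ORDER.get(season, len(SEASON_ORDER)))


def total_team_size_by_batch(companies):
    """Sum team sizes per batch, skipping companies with no listed size."""
    totals = defaultdict(int)
    for company in companies:
        if company["team_size"] is not None:
            totals[company["batch"]] += company["team_size"]
    return sorted(totals.items(), key=lambda item: batch_sort_key(item[0]))


# Hypothetical rows, not real YC data.
sample_sizes = [
    {"batch": "S22", "team_size": 4},
    {"batch": "W21", "team_size": 10},
    {"batch": "S21", "team_size": 7},
    {"batch": "W21", "team_size": None},
    {"batch": "W21", "team_size": 3},
]
print(total_team_size_by_batch(sample_sizes))
```

Sorting on the parsed key matters because a plain string sort would place every summer batch before every winter batch.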

Team Size Percentage by Status

Publicly listed YC companies account for the majority of team size in their batches. Inactive startups tend to die small. This could suggest that startups that stay small die small.

Future Work

I believe this data is only scraping the surface of YC's company directory. Looking at this data over time will allow us to view the progression of batches naturally.

The project is open-source, so please feel free to make changes to the directory.

On Definite Optimism

Human beings have the ability to distinctly remember inflection points in their lives.

I remember a specific moment during my second year of college. I loved numbers, so I opted into studying finance, specifically capital markets.

I sat down with myself to discuss my future: What can I do after I graduate? What value can I provide to society? What can I do to make my family proud?

I had nothing tangible.

I guarantee that I wasn't the only student who felt like this. Most colleges foster indefinite students, and I feel like I was one of them. Yet, it is not our fault. We go to college in search of optimism and purpose, but many struggle to find either.

That's when I put myself on a path to learn something valuable: programming.

That moment changed my life completely. It turned me from an indefinite observer into a definite doer. I could build my own dreams.

Peter Thiel's Zero to One is a book that keeps coming back to me. Whenever I read it again, I unlock a new perspective. It's like knowledge is its own language. It takes time to master. The more I learn, the more I realise how much I don't know.

Definite optimists make the world better through certainty. They know where to go, how and why. Our world needs more of them. We need to give indefinite people the skills to become definite optimists and build the future of tomorrow, today.

Pyzam

I love listening to mixtapes. You get to find new songs that would be perfect for your playlist.

A problem that I had was not being able to instantly find songs in whatever playlist I was listening to on my PC.

Who is Pyzam for?

For DJs looking to integrate Shazam into their set. For Daft Punk fans looking to find new samples.

Secrets discovered

Shazam can identify a song from a file that is up to 12 seconds long. You can make up to 20 Shazam requests per minute.
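Those two limits suggest a simple plan for longer recordings: cut the audio into windows of at most 12 seconds and leave at least 3 seconds between requests. The sketch below only does the arithmetic. It is not pyzam's implementation, and the function names are made up for illustration.

```python
MAX_SEGMENT_SECONDS = 12      # longest clip Shazam accepts, per the note above
MAX_REQUESTS_PER_MINUTE = 20  # request limit, per the note above


def segment_windows(duration, segment=MAX_SEGMENT_SECONDS):
    """Split a recording into consecutive (start, end) windows in seconds."""
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + segment, duration)
        windows.append((start, end))
        start = end
    return windows


def min_request_interval(per_minute=MAX_REQUESTS_PER_MINUTE):
    """Smallest gap in seconds between requests that respects the limit."""
    return 60.0 / per_minute


windows = segment_windows(30)
print(windows)
print(min_request_interval())
```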

If you have ffmpeg installed, try pyzam now:

pip install pyzam

pyzam --microphone