Snowflake ID

Ever wondered how Twitter/X creates IDs for the millions of tweets that are posted every day?

Sounds like a simple problem: just increment the ID as new tweets arrive, right? It is simple enough until you have multiple servers generating IDs at the same time. Now you either need a central authority (which is slow) or you have to deal with collisions. X came up with a genius solution for this problem: Snowflake IDs. Let's look at them and see how they work.

The problem

We need IDs that uniquely identify our posts, and we should be able to generate them across multiple machines with no collisions.

Some Existing Solutions

Auto Increment

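If you have ever worked on distributed systems, you know this is a headache (where head breaks into two). Each machine keeps its own counter, so two machines will hand out the same ID unless they coordinate, and a counter shared across machines is hard to keep both ordered and fast.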

UUID (Universally Unique Identifier)

On paper, they sound amazing: unique, work across machines, everything that we were looking for.

But here are the main issues with UUIDs:

They are huge (16 bytes), which means index keys are 2x larger than with a 64-bit integer ID.
They are unordered. Version 4 UUIDs in particular are basically random, which means inserts land at random places in the index and there is no natural time ordering.
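To make the two points above concrete, here is a small Python sketch using the standard uuid module. It is only an illustration of the size and ordering claims, and the variable names are made up for this example.

```python
import uuid

# Generate a handful of version 4 (random) UUIDs, in creation order.
ids = [uuid.uuid4() for _ in range(5)]

# Every UUID is 128 bits, i.e. 16 bytes, twice the size of a 64-bit integer.
uuid_size_bytes = len(ids[0].bytes)
print("size in bytes:", uuid_size_bytes)

# Creation order and sorted order are unrelated, because v4 UUIDs are random.
for position, value in enumerate(ids):
    print(position, value)
print("sorted order matches creation order:", sorted(ids) == ids)
```

Run it a few times: the last line will almost always print False, which is exactly the "insert goes to a random place" problem.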

Centralized ID System

In this system, you have one machine that manages and issues IDs. Again, sounds good on paper, but at scale, you have multiple machines sending thousands of requests, all waiting for an ID. Not to mention the network overhead of a round trip for every ID.

Now every single request depends on one service, which is a complete no-go for large-scale distributed systems.

The Solution

One solution to all the problems mentioned above is the Snowflake ID. Let's look at what a Snowflake ID is and how it solves our problems.

A Snowflake ID is a 64-bit integer that consists of three parts (the one remaining top bit is an unused sign bit):

- The timestamp (41 bits)
- The Worker ID (10 bits)
- The Sequence Number (12 bits)

Snowflake Header Image

* Image stolen from Wiki

Let's break each field down.

The timestamp

As the name suggests, it is the time in milliseconds when the ID was generated. The timestamp can be counted from the UNIX epoch, or from a custom epoch. For example, Twitter uses 1288834974657 (a UNIX timestamp in milliseconds, which falls on 4 November 2010) as its epoch.
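A quick way to see why a custom epoch is worth the trouble is to work out how long the timestamp field lasts. The sketch below assumes the standard 41-bit timestamp field and the Twitter epoch quoted above; the constant and variable names are just for this example.

```python
from datetime import datetime, timezone

TIMESTAMP_BITS = 41               # assumed width of the timestamp field
TWITTER_EPOCH_MS = 1288834974657  # custom epoch, in UNIX milliseconds

# Largest millisecond offset that fits in the timestamp field.
max_timestamp_ms = (1 << TIMESTAMP_BITS) - 1

# Convert that range to years to see how long the scheme lasts.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25
lifetime_years = max_timestamp_ms / MS_PER_YEAR

# The calendar dates where the range starts and where it overflows.
epoch_start = datetime.fromtimestamp(TWITTER_EPOCH_MS / 1000, tz=timezone.utc)
overflow_at = datetime.fromtimestamp(
    (TWITTER_EPOCH_MS + max_timestamp_ms) / 1000, tz=timezone.utc
)

print(f"epoch starts: {epoch_start:%Y-%m-%d}")
print(f"lifetime: {lifetime_years:.1f} years")
print(f"overflows in: {overflow_at.year}")
```

Starting the clock at a recent custom epoch instead of 1970 means none of those roughly 69 years are spent on dates before the system existed.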

Worker ID / Machine ID

Again, as the name suggests, the worker ID is the ID of the worker (machine or process) that generated the Snowflake ID.

Sequence Number

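Basically, it's a counter for the IDs generated within the same millisecond, i.e. if two IDs are generated by the same worker in the same millisecond, the counter is incremented for the second one. It starts again from 0 when the millisecond changes.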

How to make one

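import time

EPOCH_MS = 1288834974657  # custom epoch in UNIX milliseconds (Twitter's, as an example)
TIMESTAMP_SHIFT = 22      # worker ID bits (10) + sequence bits (12)
WORKER_ID_SHIFT = 12      # sequence bits

ts = time.time_ns() // 1_000_000 - EPOCH_MS  # milliseconds since the custom epoch
worker_id = 12
sequence_number = 0

# Pack the three fields into one 64-bit integer.
snow_flake = (
    (ts << TIMESTAMP_SHIFT)
    | (worker_id << WORKER_ID_SHIFT)
    | sequence_number
)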

Since the sequence_number is only 12 bits, a worker can generate at most 4096 (2^12) IDs per millisecond. If the limit is reached, the worker waits until the next millisecond before generating more IDs.
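Here is a minimal sketch of how the sequence counter and the "wait for the next millisecond" rule could fit together. The SnowflakeGenerator class and its method names are hypothetical, written only for this post, and refusing to run when the clock moves backwards is an assumption of this sketch rather than a requirement of the format.

```python
import threading
import time

EPOCH_MS = 1288834974657  # custom epoch in UNIX milliseconds
WORKER_ID_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095
WORKER_ID_SHIFT = SEQUENCE_BITS
TIMESTAMP_SHIFT = WORKER_ID_BITS + SEQUENCE_BITS


class SnowflakeGenerator:
    """Hypothetical generator, for illustration only."""

    def __init__(self, worker_id: int) -> None:
        if not 0 <= worker_id < (1 << WORKER_ID_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        # Milliseconds elapsed since the custom epoch.
        return time.time_ns() // 1_000_000 - EPOCH_MS

    def next_id(self) -> int:
        with self.lock:
            now = self._now_ms()
            if now < self.last_ms:
                # Assumption of this sketch: fail rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                # Same millisecond as the previous ID: bump the counter.
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # All 4096 values used up: spin until the next millisecond.
                    while now <= self.last_ms:
                        now = self._now_ms()
            else:
                # New millisecond: the counter starts over.
                self.sequence = 0
            self.last_ms = now
            return (
                (now << TIMESTAMP_SHIFT)
                | (self.worker_id << WORKER_ID_SHIFT)
                | self.sequence
            )


generator = SnowflakeGenerator(worker_id=12)
ids = [generator.next_id() for _ in range(10_000)]
print(ids[0], ids[-1])
```

The lock matters because the counter and the last-seen millisecond have to be updated together; without it, two threads on the same worker could hand out the same ID.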

Since the most significant bits of the ID are the timestamp, the IDs have a natural ordering by creation time, which lets you sort them easily.

And each machine/worker has its own ID, so IDs generated on different machines never collide.
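Both properties are easy to check by packing a few IDs by hand and unpacking them again. This is a sketch using the 10-bit worker ID and 12-bit sequence layout described above; compose and decompose are hypothetical helper names, and the timestamps are small made-up values.

```python
TIMESTAMP_SHIFT = 22               # worker ID bits (10) + sequence bits (12)
WORKER_ID_SHIFT = 12               # sequence bits
WORKER_ID_MASK = (1 << 10) - 1
SEQUENCE_MASK = (1 << 12) - 1


def compose(ts_ms: int, worker_id: int, sequence: int) -> int:
    """Pack the three fields into a single integer."""
    return (ts_ms << TIMESTAMP_SHIFT) | (worker_id << WORKER_ID_SHIFT) | sequence


def decompose(snowflake: int) -> tuple[int, int, int]:
    """Unpack an ID back into (timestamp, worker ID, sequence)."""
    ts_ms = snowflake >> TIMESTAMP_SHIFT
    worker_id = (snowflake >> WORKER_ID_SHIFT) & WORKER_ID_MASK
    sequence = snowflake & SEQUENCE_MASK
    return ts_ms, worker_id, sequence


older = compose(1000, 12, 5)
newer_other_worker = compose(1001, 3, 0)
same_ms_other_worker = compose(1000, 13, 5)

# A later timestamp always sorts higher, whatever the worker or sequence is.
print(older < newer_other_worker)

# Same millisecond and sequence on two workers still gives two different IDs.
print(older != same_ms_other_worker)

print(decompose(older))
```

Being able to pull the timestamp back out of an ID is a nice side effect: you can tell when something was created without storing a separate column.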

Znow

Znow is a Zig library that I made to generate Snowflake IDs. You can check out the code to see how a Snowflake generator is implemented.

See you in the next post.