La Technique

Optimizing random row selection in SQL


One simple way to improve the discoverability of content in an application is to provide a random browsing feature. The unpredictability it affords can spark serendipity, letting users make creative connections between pieces of content. It is a way to make apps more engaging, although in practice, depending on the characteristics and sources of the content, the results can be too random to interest users. This is why content platforms usually employ more sophisticated recommendation algorithms. Nevertheless, the simple-to-understand and unbiased method of surfacing content by pure chance remains appealing.

Here is the basic query for retrieving one random database record with SQL (works in SQLite and PostgreSQL):

SELECT * FROM table ORDER BY RANDOM() LIMIT 1;

This query should suffice for many applications. Adding a few WHERE clauses to constrain the results through some taxonomy (e.g. categories, tags, authorship, timestamp range) will also go a long way towards alleviating the aforementioned issue with low-value results.
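
For example, a hypothetical constrained version might look like this (the category and created columns are made-up names, and the interval syntax is PostgreSQL's):

SELECT * FROM table
WHERE category = 'essays'
  AND created >= NOW() - INTERVAL '1 year'
ORDER BY RANDOM() LIMIT 1;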

At scale, however, this query is often slow, because the database engine has to assign a random value to every row and sort the entire table just to retrieve a single one. With wide tables containing millions of records, it can take a few seconds to process—a rather unacceptable latency for many applications.

In this post I take a look at some optimization solutions for the case of selecting a single row. Note that the case of randomly selecting a set of rows is a subtly distinct problem.

Some approaches

As Tobias Petry discusses in a blog post tackling the multiple-rows case, some of the common approaches include the use of random offsets and random ranges.

Here is the random offset technique for selecting a single row, assuming an integer-type primary key:

SELECT * FROM table WHERE id = (
    SELECT FLOOR(RANDOM() * (COUNT(*)) + 1) FROM table
);

One problem with this is that it does not tolerate primary key gaps: you cannot add WHERE clauses or allow many records to be deleted (leaving non-existent keys), because the query will then frequently return empty results.

Another problem, which Petry does not note, is that in PostgreSQL this is in fact not much of an optimization! [Update: Petry reviewed this and pointed out important caveats in my performance tests; see the “Feedback” section below.] In my 1M-row test database (detailed later in the section on the Django implementation), the basic query (with the simple ORDER BY RANDOM()) takes ~285 ms, while the random offset approach takes ~175 ms. And that figure already reflects a small improvement: the original sample SQL uses WHERE id IN instead of WHERE id =, and the IN version takes ~230 ms. The complete multiple-row version is even worse; it takes longer than the basic query, at ~350 ms.
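
For reference, per-query timings like these can be compared in psql with EXPLAIN ANALYZE, which executes the statement and reports its planning and execution times; exact numbers will of course vary with hardware and data:

EXPLAIN ANALYZE SELECT * FROM table ORDER BY RANDOM() LIMIT 1;

EXPLAIN ANALYZE SELECT * FROM table WHERE id = (
    SELECT FLOOR(RANDOM() * (COUNT(*)) + 1) FROM table
);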

Using range search partly solves the random offset’s gap problem:

SELECT * FROM table WHERE id >= (
    SELECT FLOOR(RANDOM() * (COUNT(*)) + 1) FROM table
) ORDER BY id LIMIT 1;

The performance of this query varies widely on my test database, ranging from 71 to 221 ms (130.37 ms mean, 43.5 ms std. dev.)—still only a modest speed-up over the basic query. The gap problem also persists, although instead of empty results the impact is statistical bias: significant gaps in the existing primary keys cause the rows immediately following them to be selected more often than others. Adding WHERE clauses aggravates the bias further.

(Also note that the ORDER BY id clause is not present in Petry’s sample snippet. In PostgreSQL, without this clause, the ordering is indeterminate; the query runs faster but produces bad results, with successive executions often returning the same rows, likely due to PostgreSQL’s storage and caching mechanisms.)

Petry ultimately recommends a creative solution that uses PostgreSQL’s spatial extensions, but there is a similar yet simpler approach that is more suitable to the single-row case.

How MediaWiki does it

MediaWiki, the software that serves Wikipedia’s millions of articles, has a “Random article” feature. Its implementation uses an additional, indexed column named page_random that is populated with random values from the half-open interval [0, 1). To retrieve a page, MediaWiki runs a query resembling the following:

SELECT * FROM page
WHERE page_random >= RANDOM()
ORDER BY page_random LIMIT 1;

(Instead of using RANDOM() in SQL, MediaWiki generates a value in PHP using the same function that populates page_random.)
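
In other words, the application generates a value in [0, 1) and binds it as a query parameter, so the statement that actually reaches the database looks something like this ($1 stands for the bound, application-generated value):

SELECT * FROM page
WHERE page_random >= $1
ORDER BY page_random LIMIT 1;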

This query is blazing fast. In my test database, an analogous query runs in 3 ms. This method also suffers much less from bias even with gaps and filters, producing more uniformly random results.

The use of the >= operator does result in an edge case: if the query runs with a very high random value (close to 1.0), higher than the maximum page_random value for existing rows, the query produces an empty result. However, the probability of this edge case approaches zero as the number of rows in the table increases, so this is a minor issue. (Nevertheless, I address it in my sample Django implementation below.)

Implementing in Django

Here is a demonstration of how to implement the MediaWiki approach in Django. The code here is tested with Django 4.2.

The critical parts are in this sample models.py, containing a model definition and custom QuerySet/Manager:

import random

from django.db import models


def rand():
    return random.random()


class SomethingQuerySet(models.QuerySet):
    def get_random(self):
        # Naive version: ORDER BY RANDOM() LIMIT 1.
        return self.order_by("?").first()

    def get_optimized_random(self):
        # MediaWiki-style version: pick a random threshold in [0, 1) and
        # return the first row whose randomizer is at or above it.
        rand_val = rand()
        selection = self.order_by("randomizer").filter(
            randomizer__gte=rand_val).first()
        if selection is None:
            # Edge case: rand_val was higher than every stored randomizer.
            # Try the other direction.
            selection = self.order_by("-randomizer").filter(
                randomizer__lt=rand_val).first()
        return selection


class Something(models.Model):
    text = models.TextField()
    randomizer = models.FloatField(editable=False, default=rand)
    created = models.DateTimeField(auto_now_add=True)

    objects = SomethingQuerySet.as_manager()

    class Meta:
        indexes = [models.Index(fields=["randomizer"])]

The get_random() method produces the basic random-row SQL and is included here only as a demonstration. get_optimized_random() is the version that produces the MediaWiki-like query above, with additional logic to handle the aforementioned edge case where the query runs with a very high random value. In that case the method runs a second query with the range filter reversed, ensuring a result is returned. As the database grows, this happens less often, until the extra query is almost never executed.
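
For reference, the primary query that get_optimized_random() generates is essentially the MediaWiki-style query from earlier, with the threshold supplied from Python. The table and column names below follow Django's defaults for this app, the literal stands in for the bound parameter, and the real generated SQL spells out every column:

SELECT * FROM somethings_something
WHERE randomizer >= 0.4271
ORDER BY randomizer ASC
LIMIT 1;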

The Python standard library’s random.random() is wrapped in a small function here to work around a quirk of Django database migrations. If you try to use the random function directly as the default value for the randomizer field, you will get a “Cannot serialize function… No module” error upon running makemigrations. random() does, of course, belong to the standard library’s random module, but Django’s migration serializer cannot handle the fact that the function is implemented as a bound method of a hidden Random instance.
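
For illustration, this is the kind of field definition that trips up the serializer, as opposed to the wrapper-based one in the model above:

import random

from django.db import models


class Something(models.Model):
    # Fails at makemigrations time: random.random is a bound method of a
    # hidden random.Random instance, which migrations cannot serialize.
    randomizer = models.FloatField(editable=False, default=random.random)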

A further consequence of how Django migrations handle defaults is the need to write a data migration (a RunPython operation): when the AddField operation backfills existing rows, it evaluates the callable default only once, so every pre-existing row would end up with the same randomizer value. This is addressed in the following sample migration file:

from django.db import migrations, models
import somethings.models


def randomize_defaults(apps, schema_editor):
    """
    Data migration to actually randomize Something.randomizer defaults for all
    rows. Without this, all existing objects will have the same value. Not
    actually necessary if we can be certain the table is empty before this
    migration.
    """
    Something = apps.get_model("somethings", "Something")
    for s in Something.objects.all():
        s.randomizer = somethings.models.rand()
        s.save(update_fields=["randomizer"])


class Migration(migrations.Migration):
    dependencies = [
        ("somethings", "0001_initial"),
    ]

    operations = [
        migrations.AddField(
            model_name="something",
            name="randomizer",
            field=models.FloatField(default=somethings.models.rand, editable=False),
        ),
        migrations.RunPython(
            randomize_defaults, reverse_code=migrations.RunPython.noop
        ),
        migrations.AddIndex(
            model_name="something",
            index=models.Index(
                fields=["randomizer"], name="somethings__randomi_7f33ee_idx"
            ),
        ),
    ]

Lastly, a simple demo view, which can be trivially hooked up to a URLconf (a sketch of the hookup follows the view).

from django.views.generic import TemplateView

from .models import Something


class RandomSomething(TemplateView):
    template_name = "somethings/something.html"

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # ?optimize=true switches to the indexed randomizer-column query.
        if self.request.GET.get("optimize") == "true":
            context["something"] = Something.objects.get_optimized_random()
        else:
            context["something"] = Something.objects.get_random()
        return context
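
Hooking the view up might look like this (the URL pattern, name, and module layout here are merely illustrative):

# somethings/urls.py
from django.urls import path

from .views import RandomSomething

urlpatterns = [
    path("random/", RandomSomething.as_view(), name="random-something"),
]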

I set up a test database on PostgreSQL 14.9 and generated 1,000,000 rows through the Django shell, filling the text column with random data as well (average octet_length(text) across all rows: 194.03; min. 59, max. 371).
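
The rows were generated with something along these lines in the shell (a sketch rather than the exact script; the text-generation details in particular are only meant to approximate the data described above):

# Populate the test table in batches; each Something instance gets its own
# randomizer value from the field's callable default.
import random
import string

from somethings.models import Something


def random_text():
    # Random lowercase letters and spaces, 59-371 characters long.
    length = random.randint(59, 371)
    return "".join(random.choices(string.ascii_lowercase + " ", k=length))


batch = []
for _ in range(1_000_000):
    batch.append(Something(text=random_text()))
    if len(batch) == 10_000:
        Something.objects.bulk_create(batch)
        batch = []
if batch:
    Something.objects.bulk_create(batch)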

With the Django Debug Toolbar, I verified that the optimization works. The unoptimized Django view takes ~295 ms, similar to the basic SQL query, with the small difference presumably being Django overhead. The optimized version runs in only ~2 ms.

Feedback

Tobias Petry graciously checked this article and pointed out serious deficiencies in my test setup. He wrote to me:

The performance evaluation you did is complicated. I mean, performance evaluations are always complicated and easy to do wrong. Simple things like creating a table with not many populated columns will make a big difference as the size of the index and table will be quite equal and a table scan and index scan will be the same speed. But this will be different on real datasets. And when you create a test table and fill it, it will be different to a real table: PostgreSQL won’t have time yet to analyze data distribution (ANALYZE) or vacuum it and freeze tuples so it can work from the index-only and doesn’t have to touch the rows anymore. There are quite a lot more issues.

The differences between the range-scan and equality queries’ run times were also unexpected, and, after looking at my query plans, he commented: “The subquery is clearly triggering PostgreSQL to do weird things. You could try running that on its own and using the calculated value in the query.”
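
In concrete terms, that suggestion amounts to something like the following, with the literal in the second statement standing in for whatever the first one returns:

-- Step 1: compute the random offset on its own.
SELECT FLOOR(RANDOM() * (COUNT(*)) + 1) FROM table;

-- Step 2: use the computed value directly, without the subquery.
SELECT * FROM table WHERE id = 424242;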

I did suspect that the artificiality of my test database could be a flaw, and Petry’s response confirms it. I had been thinking that the test table is not much simpler than an actual table in a small app I’m building as a hobby; but since that app will probably never reach a million rows, it doesn’t really need this optimization anyway. In fact, I’m simply using the basic ORDER BY RANDOM() query in it! As I said at the beginning, that should suffice for many cases.

The MediaWiki-style optimization I presented and recommended here should still be valid, however, especially for the single-row case, if the system’s size justifies it. Petry’s spatial nearest-neighbor solution should be attractive for the multiple-row case.

It bears repeating what is often said about optimization: don’t do it prematurely.
