Question

How do I select a(or some) random row(s) from a table using SQLAlchemy?

Was it helpful?

Solution

This is very much a database-specific issue.

I know that PostgreSQL, SQLite, MySQL, and Oracle have the ability to order by a random function, so you can use this in SQLAlchemy:

from  sqlalchemy.sql.expression import func, select

select.order_by(func.random()) # for PostgreSQL, SQLite

select.order_by(func.rand()) # for MySQL

select.order_by('dbms_random.value') # For Oracle

Next, you need to limit the query by the number of records you need (for example using .limit()).

Bear in mind that at least in PostgreSQL, selecting random record has severe perfomance issues; here is good article about it.

OTHER TIPS

If you are using the orm and the table is not big (or you have its amount of rows cached) and you want it to be database independent the really simple approach is.

import random
rand = random.randrange(0, session.query(Table).count()) 
row = session.query(Table)[rand]

This is cheating slightly but thats why you use an orm.

There is a simple way to pull a random row that IS database independent. Just use .offset() . No need to pull all rows:

import random
query = DBSession.query(Table)
rowCount = int(query.count())
randomRow = query.offset(int(rowCount*random.random())).first()

Where Table is your table (or you could put any query there). If you want a few rows, then you can just run this multiple times, and make sure that each row is not identical to the previous.

Here's four different variations, ordered from slowest to fastest. timeit results at the bottom:

from sqlalchemy.sql import func
from sqlalchemy.orm import load_only

def simple_random():
    return random.choice(model_name.query.all())

def load_only_random():
    return random.choice(model_name.query.options(load_only('id')).all())

def order_by_random():
    return model_name.query.order_by(func.random()).first()

def optimized_random():
    return model_name.query.options(load_only('id')).offset(
            func.floor(
                func.random() *
                db.session.query(func.count(model_name.id))
            )
        ).limit(1).all()

timeit results for 10,000 runs on my Macbook against a PostgreSQL table with 300 rows:

simple_random(): 
    90.09954111799925
load_only_random():
    65.94714171699889
order_by_random():
    23.17819356000109
optimized_random():
    19.87806927999918

You can easily see that using func.random() is far faster than returning all results to Python's random.choice().

Additionally, as the size of the table increases, the performance of order_by_random() will degrade significantly because an ORDER BY requires a full table scan versus the COUNT in optimized_random() can use an index.

Some SQL DBMS, namely Microsoft SQL Server, DB2, and PostgreSQL have implemented the SQL:2003 TABLESAMPLE clause. Support was added to SQLAlchemy in version 1.1. It allows returning a sample of a table using different sampling methods – the standard requires SYSTEM and BERNOULLI, which return a desired approximate percentage of a table.

In SQLAlchemy FromClause.tablesample() and tablesample() are used to produce a TableSample construct:

# Approx. 1%, using SYSTEM method
sample1 = mytable.tablesample(1)

# Approx. 1%, using BERNOULLI method
sample2 = mytable.tablesample(func.bernoulli(1))

There's a slight gotcha when used with mapped classes: the produced TableSample object must be aliased in order to be used to query model objects:

sample = aliased(MyModel, tablesample(MyModel, 1))
res = session.query(sample).all()

Since many of the answers contain performance benchmarks, I'll include some simple tests here as well. Using a simple table in PostgreSQL with about a million rows and a single integer column, select (approx.) 1% sample:

In [24]: %%timeit
    ...: foo.select().\
    ...:     order_by(func.random()).\
    ...:     limit(select([func.round(func.count() * 0.01)]).
    ...:           select_from(foo).
    ...:           as_scalar()).\
    ...:     execute().\
    ...:     fetchall()
    ...: 
307 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]: %timeit foo.tablesample(1).select().execute().fetchall()
6.36 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [26]: %timeit foo.tablesample(func.bernoulli(1)).select().execute().fetchall()
19.8 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Before rushing to use SYSTEM sampling method one should know that it samples pages, not individual tuples, so it might not be suitable for small tables, for example.

This is the solution I use:

from random import randint

rows_query = session.query(Table)                # get all rows
if rows_query.count() > 0:                       # make sure there's at least 1 row
    rand_index = randint(0,rows_query.count()-1) # get random index to rows 
    rand_row   = rows_query.all()[rand_index]    # use random index to get random row

This is my function to select random row(s) of a table:

from sqlalchemy.sql.expression import func

def random_find_rows(sample_num):
    if not sample_num:
        return []

    session = DBSession()
    return session.query(Table).order_by(func.random()).limit(sample_num).all()

this solution will select a single random row

This solution requires that the primary key is named id, it should be if its not already:

import random
max_model_id = YourModel.query.order_by(YourModel.id.desc())[0].id
random_id = random.randrange(0,max_model_id)
random_row = YourModel.query.get(random_id)
print random_row

Theres a couple of ways through SQL, depending on which data base is being used.

(I think SQLAlchemy can use all these anyways)

mysql:

SELECT colum FROM table
ORDER BY RAND()
LIMIT 1

PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

MSSQL:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

IBM DB2:

SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY

Oracle:

SELECT column FROM
(SELECT column FROM table
ORDER BY dbms_random.value)
WHERE rownum = 1

However I don't know of any standard way

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top