Benchmark near duplication with PostgreSQL
Benchmark near duplication with PostgreSQL
Setup
In this script above, we are setting up a PostgreSQL database to benchmark string similarity search using the Levenshtein distance metric. Here’s a brief explanation of what we’re doing exactly:
- We create a table strings to store 10 million random strings of length 32, consisting of only lowercase letters from ‘a’ to ‘z’.
- We define a function
generate_random_string
to generate random strings of a specified length. - We insert 10 million random strings into the strings table using the
generate_random_string
function. - We create an index on the str column of the strings table using the
pg_trgm
extension, which provides efficient trigram-based string similarity search. - We enable the
fuzzystrmatch
extension, which provides thelevenshtein
function for calculating the Levenshtein distance between strings. - We create a function
find_similar_strings
that takes an input string (input_str) and a maximum Levenshtein distance (max_distance
) as parameters, and returns a table of up to 100 strings from the strings table that are within the specified Levenshtein distance from the input string, sorted by distance.
-- Create a table to store strings
CREATE TABLE strings
(
id SERIAL PRIMARY KEY,
str TEXT NOT NULL
);
-- Create a function to generate random strings
CREATE OR REPLACE FUNCTION generate_random_string(length INTEGER)
RETURNS TEXT AS
$$
DECLARE
chars TEXT[] := '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result TEXT := '';
BEGIN
FOR i IN 1..length
LOOP
result := result || chars[1 + random() * (array_length(chars, 1) - 1)];
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql;
-- Insert 10 million random strings into the table
INSERT INTO strings (str)
SELECT generate_random_string(32)
FROM generate_series(1, 10000000);
-- Create an index on the str column for faster similarity search
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX strings_trgm_idx ON strings USING GIN (str gin_trgm_ops);
-- Create an extension to enable the Levenshtein distance function
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- Create a function to find similar strings
CREATE OR REPLACE FUNCTION find_similar_strings(input_str TEXT, max_distance INTEGER)
RETURNS TABLE
(
id INTEGER,
str TEXT,
distance INTEGER
)
AS
$$
BEGIN
RETURN QUERY
SELECT s.id, s.str, levenshtein(input_str, s.str) AS distance
FROM strings s
WHERE levenshtein(input_str, s.str) <= max_distance
ORDER BY distance
LIMIT 100;
END;
$$ LANGUAGE plpgsql;
Usage
To use this setup for benchmarking string similarity search, you can follow these steps:
- First, we generate a random input string of length 32 using the
generate_random_string
function. - Then, we call the
find_similar_strings
function with the generated input string and a maximum Levenshtein distance of 3. - This function will return up to 10 strings from the strings table that have a Levenshtein distance of 30 or less from the input string, along with their respective distances.
-- Generate a random input string
SELECT generate_random_string(32) AS input_str;
-- Output: 'abcdefghijklmnopqrstuvwxyz012345'
-- Measure the time to find similar strings
SELECT str, distance
FROM find_similar_strings('abcdefghijklmnopqrstuvwxyz012345', 30)
ORDER BY distance
LIMIT 10;
By measuring the execution time of the find_similar_strings
function with different input strings and maximum distances,
you can benchmark the performance of string similarity search in PostgreSQL.