Use Rust to reduce the size of your SQLite database

0

Check out sqlite-zstd, a Rust library that compresses your database multiple times, providing significant size savings while keeping its search capabilities intact.


As stated in “In Praise Of SQLite”, SQLite is not a toy database:

Despite its compact size and lack of the client-server model, SQLite is an RDBMS with all the functionality that does something relational – i.e. tables, indexes, constraints, triggers, transactions , etc. However, there are few built-in functions compared to PostgreSQL or Oracle.

SQLite has no compression functionality. This has changed with the Rust-based library, sqlite-zstd which promises to:

provides dictionary-based transparent row-level compression that compresses the entries of an sqlite database almost as well as compressing the entire DB file, while still maintaining random access.

Suffice it to say, we are all aware of the benefits of compressing data, be it a PDF document, a humble ZIP file, or in this case a database. As a proof of concept and presentation of the tool, I will be working with an example database used by the note-taking application Joplin. Work well with Windows as it provides a simpler experience.

The sample database.sqlite is 2.6 GB in size. Some would say ‘what kind of notes do you have that take up so much space?’. It is true that this figure seems extravagant. It’s like that because Joplin lets you scrape any web page you come across and store it as a markdown in the database. Since I am a person with varied interests, when I find something interesting, I keep a copy. So the number of pages I dump in Joplin accumulates and the size of the database increases. So sqlite-zstd was a godsend. Of course, your use case may be different; The main online library use case example demonstrates compressing a database with 7 million book entries in JSON format, and does so by reducing its size from 2.2 GB to 550 KB!

But to get there, the first hurdle was to find the 64-bit version of the sqlite CLI, sqlite3, since the official version is only offered in 32-bit, and since zstd is a 64-bit library, you needed the corresponding version. Although you can build it manually, why give you this trouble when someone has already done it for you? Jump to the SQLite shell generator Github repo and download the latest 64-bit version for Windows (Ubuntu, MacOS as well).

After getting your hands on the CLI, it’s time to run it to load both the library and the sample database. After that, we’ll enable transparent compression at the row level of the ‘body’ column, the column that keeps most of the text in the ‘notes’ table. That said, you can invoke the transparent row-level compression function multiple times on the same table with different columns to compress.

You call it like this:

SELECT
zstd_enable_transparent('{"table": "objects", "column": "data1", "compression_level": 19, "dict_chooser": "''a''"}'),

and as a result of this call, the table will be renamed ‘_table_name_zstd’, while ‘table_name’ will become a view that can be queried normally, including SELECT, INSERT, UPDATE and DELETE queries. Keep in mind that this function will not compress any data by itself, you must then call zstd_incremental_maintenance.

The following differences apply when compression is active:

  • The compressed column can only contain blob or text data, depending on the declared data type affinity (e.g. VARCHAR(10) is fine, but int is not).
  • The primary key must not be null for any row, otherwise the update may not work as expected
    sqlite3_changes() will return 0 for changing queries (see here).
  • The SQLite streaming blob API will be somewhat useless since the entire blob is copied into memory anyway.
  • Attaching a database containing compressed tables using ATTACH ‘foo.db’ is not supported.
  • DDL statements (such as ALTER TABLE and CREATE INDEX) are only partially supported

Also, ‘dict_chooser’ is an SQL expression that decides how to partition the data. Examples of partition keys:

I went with the simplest case of ‘a’ which means all lines are compressed with the same dictionary.

And now in code:

$ sqlite3

sqlite>
#opening database
.open database.sqlite

#loading extension
.load sqlite_zstd

#enabling enabling transparent row-level compression
select zstd_enable_transparent('{"table": "notes", "column": "body", "compression_level": 19, "dict_chooser": "''a''"}');

Unfortunately I got an error:

Runtime error: Could not create insert trigger

Caused by:
0: near "order": syntax error
1: Error code

Enabling the library debug log reveals what’s going on behind the covers (many columns removed for brevity):

[2022-08-22T16:57:46Z INFO sqlite_zstd::create_extension] [sqlite-zstd] initialized
sqlite> select zstd_enable_transparent('{"table": "notes", "column": "body", "compression_level": 19, "dict_chooser": "''a''"}');

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] cols=[ColumnInfo { name: "id", coltype: "TEXT", is_primary_key: true, to_compress: false, is_dict_id: false }, ColumnInfo { name: "parent_id", coltype: "TEXT", is_primary_key: false, to_compress: false, is_dict_id: false }, ColumnInfo { name: "title", coltype: "TEXT", is_primary_key: false, to_compress: false, is_dict_id: false }, ColumnInfo { name: "*body", coltype: "TEXT", is_primary_key: false, to_compress: true, is_dict_id: false }, ColumnInfo { name: "order", }]

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run] alter table `notes` rename to `_notes_zstd`

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run]
create view `notes` as
select `id`, `parent_id`, `title`, zstd_decompress_col(`body`, 1, `_body_dict`, true) as `body`, `created_time`, `updated_time`, `is_conflict`, `latitude`, `longitude`, `order`, `master_key_id`
from `_notes_zstd`

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run]
create trigger `notes_insert_trigger`
instead of insert on `notes`
for each row
begin
insert into `_notes_zstd`(id, parent_id, title, body, _body_dict, created_time, order, user_created_time, user_updated_time) select new.`id`,
new.`parent_id`,
new.`title`,
new.`body` as `body`, null as `_body_dict`,
new.`created_time`,`,
new.`application_data`,
new.`order`,
new.`user_created_time`,
end;

After some research, I understood that the uncited order the input was seen as a reserved SQL word and hence the error. Changing order at ‘order’ passed the SQL Create Trigger statement. This, of course, was a borderline case due to the special column name used by the Joplin app and if it was any other app I might not even have noticed. But in any case, the citation mechanism should be fixed on the library side, so I opened an issue on its GitHub repository.

Until it is fixed and a new version is released, what can you do? I renamed the column just to make it pass:

alter table notes rename 'order' to 'test'

Then run sqlite_zstd::transparent again, which now runs to completion.

As already said, sqlite_zstd::transparent just enables compression. The real work is done by:

select zstd_incremental_maintenance(null, 1);

followed by:

vacuum;

to recover free space.

After the few minutes it takes for the trades to complete, let’s see the gains. From 2,663,996 KB original size to 1,912,900 KB. And that’s without changing the settings and using the default dictionary selector. Not bad!

Let’s see if the search also works:

select * from notes where body looks like ‘python%’ boundary 1;

It works like a charm!

Compression saves you storage space by reducing the size of the database, but it has a performance impact because the data must be compressed and decompressed while being accessed. As such, the library must be loaded each time the database is accessed. For example, let’s see how Joplin behaves when trying to load the compressed database (assuming the command column renamed and ‘notes’ became a view instead of a table) – this has no impact on its functionality.

Yes, there is no such function. I can still run SQL queries through the CLI.

Bonus material

Instead of messing with the sqlite3 CLI, you can do the same easily and GUI-based using the Db Browser for SQLite. Just load the database and extension through the GUI and then run your SQL on it.

The big takeaway here is that any application capable of loading this library can reduce its database size by 50-95% without affecting its core functionality. Sure, the performance hit is there, but since most operations are still running at over 50,000 per second, you’ll likely have other bottlenecks.

There are further optimizations to be done, but the same method should work for other databases, with almost no modification required for, say, PostgreSQL.

Is this the humble beginning of a greater impact on the state of database technology?

Use of rust continues to increase

More information

sqlite-zstd

Related Articles

Praise of SQLite

Take the Beginner Series to Rust

To be informed of new articles on I Programmer, subscribe to our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook Where LinkedIn.

Banner


pythondata



comments

or send your comment to: [email protected]

Share.

Comments are closed.