PostgreSQL Setup for a Massive Import - 160,000,000 Rows in Minutes

The Brazilian government publishes CNPJ (National Register of Legal Entities) data on a monthly basis. For November 2024, there are 35 compressed files in `.zip` format (download link at the end of the text).

One of our clients approached us asking to improve their application, which keeps a local database updated with CNPJ data.

Our client's database serves other internal applications and must remain on Microsoft SQL Server. We kept their code, made some improvements, and reduced the data-update time, but the process still took a few hours - not really a problem for data that is refreshed every four or five weeks.

Even though we had delivered the solution to the client, the difficulty of the import - both the size of the files and the time it took - kept coming up in discussions within the Necto team. So we decided to run further tests.

Necto has been using PostgreSQL as its database for years, and it's our first choice whenever possible. It's a robust, open-source database that doesn't fall short of other database management systems (DBMSs) on the market. That's why we decided to do some tests with PostgreSQL to import the CNPJ data.

The PostgreSQL `COPY` command

CNPJ data is made available in CSV files compressed into ZIP format. Today (2024), there are 35 ZIP files containing around 170 million records, of which around 60 million are unique CNPJs.
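
Before anything touches the database, the archives need to be extracted. Below is a minimal sketch of that step using Python's `zipfile` module; the directory names are illustrative, not the ones used in the repository.

import zipfile
from pathlib import Path

# Extract every downloaded CNPJ archive into a working directory.
# "downloads" and "data_files" are illustrative paths.
downloads = Path("downloads")
out_dir = Path("data_files")
out_dir.mkdir(exist_ok=True)

for zip_path in sorted(downloads.glob("*.zip")):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)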

With the CSV files in hand, we use PostgreSQL's `COPY` command.

COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
source: https://www.postgresql.org/docs/16/sql-copy.html
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]

COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
TO { 'filename' | PROGRAM 'command' | STDOUT }
[ [ WITH ] ( option [, ...] ) ]
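
For illustration, here is a minimal sketch of how one of the extracted files could be loaded through `COPY ... FROM STDIN` using Python and psycopg2. The table and column names (`cnae`, `codigo`, `descricao`), the file name, and the semicolon delimiter are assumptions about the published CNPJ layout, not the exact code from the repository.

import psycopg2

# Connection details are an example; adjust to your environment.
conn = psycopg2.connect("dbname=cnpj user=postgres")
cur = conn.cursor()

# Stream the CSV straight into the table; the published files are assumed
# to be latin-1 encoded and semicolon-separated.
copy_sql = """
    COPY cnae (codigo, descricao)
    FROM STDIN
    WITH (FORMAT csv, DELIMITER ';', QUOTE '"')
"""
with open("data_files/cnaes.csv", encoding="latin-1") as f:
    cur.copy_expert(copy_sql, f)

conn.commit()

`COPY ... FROM 'filename'` reads a file on the database server, while `COPY ... FROM STDIN` streams the data from the client. Since the docker-compose file shown later mounts the data-files volume inside the container, either approach would work here; `STDIN` simply keeps the whole load driven from the Python side.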

Although `COPY` is already efficient at moving data, you can improve its performance further by changing some PostgreSQL parameters.

PostgreSQL offers many configuration options; some must be set at server start-up, while others can be changed at runtime.

Here's the complete list of PostgreSQL's run-time configuration parameters: https://www.postgresql.org/docs/16/runtime-config.html.

For massive imports, pay special attention to the WAL (Write-Ahead Log) configuration: https://www.postgresql.org/docs/16/wal-configuration.html.

PostgreSQL Config Settings

autovacuum.......................off
checkpoint_completion_target............0.9
checkpoint_timeout...............900 s
constraint_exclusion.......partition
default_statistics_target........100
effective_cache_size..........524288 kB
effective_io_concurrency...........1
fsync............................off
full_page_writes.................off
huge_pages.......................try
maintenance_work_mem...........65536 kB
max_wal_size....................5120 MB
min_wal_size....................1024 MB
random_page_cost...................4
shared_buffers.................98304 kB
synchronous_commit................on
wal_buffers.....................2048 kB
work_mem........................4096 kB
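
For reference, a listing like the one above can be read back from the `pg_settings` view. A small sketch, assuming a psycopg2 cursor named `cur` like the one used in the Python snippets further down:

# Print the current value of each parameter we care about.
params = [
    "autovacuum", "checkpoint_completion_target", "checkpoint_timeout",
    "fsync", "full_page_writes", "max_wal_size", "min_wal_size",
    "shared_buffers", "synchronous_commit", "wal_buffers", "work_mem",
]
cur.execute(
    "SELECT name, setting, unit FROM pg_settings WHERE name = ANY(%s) ORDER BY name;",
    (params,),
)
for name, setting, unit in cur.fetchall():
    print(f"{name:.<33}{setting} {unit or ''}")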

These parameters can be set in the PostgreSQL configuration file. Since we run PostgreSQL from a Docker image, we pass them as command-line options in the `command` entry of the docker-compose file.

# docker compose file
# PostgreSQL config settings before start.

postgis16:
    image: postgis/postgis:16-3.5-alpine
    mem_limit: 2.5GB  # limit the memory used by PostgreSQL container
    mem_reservation: 1.5GB
    shm_size: '1gb'   # it's needed to build indexes and primary keys.
    ... # other
    volumes:
      - data-files:/usr/src/data_files/
    container_name: postgis16

    command:
      - "postgres"
      - "-c"
      - "autovacuum=OFF"
      - "-c"
      - "min_wal_size=1GB"
      - "-c"
      - "max_wal_size=5GB"
      - "-c"
      - "shared_buffers=768MB"
      - "-c"
      - "fsync=OFF"
      - "-c"
      - "full_page_writes=off"
      - "-c"
      - "checkpoint_timeout=15min" # min must be lowercase

Other parameters are changed at runtime...

# Python code, run before starting the import.
# PostgreSQL config settings changed at runtime.

import psycopg2  # assuming psycopg2; adjust to the driver actually used

conn = psycopg2.connect("dbname=cnpj")  # connection details are illustrative
conn.autocommit = True  # apply each statement immediately
cur = conn.cursor()

# ALTER DATABASE ... SET changes the per-database defaults;
# they take effect for sessions opened after this point.
cur.execute("ALTER DATABASE cnpj SET constraint_exclusion='OFF';")
cur.execute("ALTER DATABASE cnpj SET maintenance_work_mem='512MB';")
cur.execute("ALTER DATABASE cnpj SET synchronous_commit='OFF';")
cur.execute("ALTER DATABASE cnpj SET work_mem='4GB';")
cur.execute("SELECT pg_reload_conf();")

and some are applied just before executing SQL commands such as `INSERT INTO`:

ALTER TABLE cnae DISABLE TRIGGER ALL;
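
In context, the pattern is: disable the triggers, bulk-load, then re-enable them. A sketch of that from the Python side, assuming the same `conn`/`cur` as in the runtime-settings snippet above (the loading step itself is elided):

# Disable triggers (including FK checks) so the bulk load is not slowed down.
cur.execute("ALTER TABLE cnae DISABLE TRIGGER ALL;")

# ... bulk load with COPY goes here ...

# Re-enable the triggers once the load has finished.
cur.execute("ALTER TABLE cnae ENABLE TRIGGER ALL;")
conn.commit()

Note that `DISABLE TRIGGER ALL` also turns off the internal triggers that enforce foreign keys, so it requires superuser privileges, and the loaded data is not validated against those constraints during the import.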

Results

Below are the resulting record counts and the import times on two different machines; the machine specifications are listed in the Setup table.

Database Created

Table                Number of Records
cnae                             1,358
motivo                              60
municipio                        5,570
pais                               254
natureza_juridica                   89
qualificacao                        67
simples                     41,093,712
empresa                     60,294,472
estabelecimento             63,333,636
socio                       24,933,508

Setup

CPU                                Memory                  Time (min)
3.1 GHz Dual-Core Intel Core i7    16 GB 1867 MHz DDR3     39
2.60 GHz Intel i5 * 12             32 GB 3200 MHz DDR4     21

Python Code and Data Corrections

We had problems creating the indexes and needed to increase Docker's shared memory (`shm_size`) to 1 GB. See the `docker-compose` file in the repository.

Python was used mainly as an orchestrator, but it also makes small corrections to the data and to the file names. The first problem is a duplicated CNPJ record, which prevented the primary key from being created. The other is that some of the original file names have a `$` character in the middle.
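
A hedged sketch of that kind of clean-up, assuming the same `conn`/`cur` as above; the directory, the `empresa` column name `cnpj_basico`, and the constraint details are assumptions, and the actual corrections in the repository may differ.

import os

# Rename extracted files whose names contain a '$' character.
data_dir = "data_files"
for name in os.listdir(data_dir):
    if "$" in name:
        os.rename(os.path.join(data_dir, name),
                  os.path.join(data_dir, name.replace("$", "_")))

# Drop the duplicated CNPJ row before creating the primary key,
# keeping an arbitrary one of the duplicates (ctid identifies the physical row).
cur.execute("""
    DELETE FROM empresa a
    USING empresa b
    WHERE a.ctid < b.ctid
      AND a.cnpj_basico = b.cnpj_basico;
""")

# Now the primary key can be created.
cur.execute("ALTER TABLE empresa ADD PRIMARY KEY (cnpj_basico);")
conn.commit()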

There is also an encoding problem in one of the CSV files (stray NUL bytes), and it persists across published versions. So Python runs a Perl one-liner at runtime to fix it:

perl -i -pe 's/\x00//g' {fle}
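
From the Python orchestrator, that call could look like the sketch below; the file path is illustrative, and the exact invocation in the repository may differ.

import subprocess

fle = "data_files/problem_file.csv"  # illustrative path of the affected CSV

# Strip NUL bytes from the file in place before running COPY.
subprocess.run(["perl", "-i", "-pe", r"s/\x00//g", fle], check=True)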

Writing in Python gave us greater flexibility to control the whole process. However, it would be possible to do everything using psql alone.

Repository: https://github.com/znc-sistemas/cnpj_import

For more details, check out: https://www.postgresql.org/docs/16/runtime-config-resource.html.