The Brazilian government publishes CNPJ (National Register of Legal Entities) data monthly. For November 2024, there are 35 compressed files in `.zip` format (download link at the end of the text).
One of our clients approached us asking us to improve their application, which keeps a local database updated with CNPJ data.
The client's database serves other internal applications and must remain on Microsoft SQL Server. We kept their code, made some improvements, and reduced the data-update time, but the update still took a few hours - not exactly a problem for data that is refreshed every four or five weeks.
Even though we had delivered the solution to the client, the difficulty of the import (both the size of the files and the time it takes) remained a topic of discussion within the Necto team, so we decided to run further tests.
Necto has been using PostgreSQL as its database for years, and it's our first choice whenever possible. It's a robust, open-source database that doesn't fall short of other database management systems (DBMSs) on the market. That's why we decided to do some tests with PostgreSQL to import the CNPJ data.
The PostgreSQL "COPY" command
CNPJ data is made available in CSV files compressed into ZIP format. Today (2024), there are 35 ZIP files containing around 170 million records, of which around 60 million are unique CNPJs.
With the CSV files in hand, we use PostgreSQL's `COPY` command.
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
source: https://www.postgresql.org/docs/16/sql-copy.html
```
COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | PROGRAM 'command' | STDIN }
    [ [ WITH ] ( option [, ...] ) ]
    [ WHERE condition ]

COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
    TO { 'filename' | PROGRAM 'command' | STDOUT }
    [ [ WITH ] ( option [, ...] ) ]
```
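For illustration, here is a minimal sketch of loading one extracted CSV with `COPY` from Python using psycopg2. The file path, table name, `;` delimiter and LATIN1 encoding are assumptions made for this example; check the repository for the exact options used.

```python
# Minimal sketch: bulk-load one extracted CNPJ CSV with COPY via psycopg2.
# The file path, delimiter and encoding below are assumptions for illustration.
import psycopg2

conn = psycopg2.connect("dbname=cnpj user=postgres host=localhost")
with conn, conn.cursor() as cur:
    with open("data_files/empresas0.csv", "rb") as f:  # hypothetical path
        cur.copy_expert(
            "COPY empresa FROM STDIN "
            "WITH (FORMAT csv, DELIMITER ';', ENCODING 'LATIN1')",
            f,
        )
```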
Although the `COPY` command is efficient for copying data, you can improve its performance by changing some PostgreSQL parameters.
There are many ways to configure PostgreSQL: some parameters must be set at server start-up, while others can be changed at runtime.
The complete list of run-time configuration parameters is at https://www.postgresql.org/docs/16/runtime-config.html.
For massive imports, pay particular attention to the WAL (Write-Ahead Log) configuration: https://www.postgresql.org/docs/16/wal-configuration.html.
PostgreSQL Config Settings

Parameter | Value |
---|---|
autovacuum | off |
checkpoint_completion_target | 0.9 |
checkpoint_timeout | 900 s |
constraint_exclusion | partition |
default_statistics_target | 100 |
effective_cache_size | 524288 kB |
effective_io_concurrency | 1 |
fsync | off |
full_page_writes | off |
huge_pages | try |
maintenance_work_mem | 65536 kB |
max_wal_size | 5120 MB |
min_wal_size | 1024 MB |
random_page_cost | 4 |
shared_buffers | 98304 kB |
synchronous_commit | on |
wal_buffers | 2048 kB |
work_mem | 4096 kB |
These parameters can be defined in the PostgreSQL configuration file. As we run PostgreSQL from a Docker image, we pass them as `-c` options in the `command` of the docker-compose file.
```yaml
# docker compose file
# PostgreSQL config settings before start.
postgis16:
  image: postgis/postgis:16-3.5-alpine
  mem_limit: 2.5GB        # limit the memory used by the PostgreSQL container
  mem_reservation: 1.5GB
  shm_size: '1gb'         # needed to build indexes and primary keys
  # ... other settings
  volumes:
    - data-files:/usr/src/data_files/
  container_name: postgis16
  command:
    - "postgres"
    - "-c"
    - "autovacuum=OFF"
    - "-c"
    - "min_wal_size=1GB"
    - "-c"
    - "max_wal_size=5GB"
    - "-c"
    - "shared_buffers=768MB"
    - "-c"
    - "fsync=OFF"
    - "-c"
    - "full_page_writes=off"
    - "-c"
    - "checkpoint_timeout=15min"   # "min" must be lowercase
```
Other parameters are changed at runtime...
```python
# python code, run before starting the import
# PostgreSQL config settings at runtime
cur.execute("ALTER DATABASE cnpj SET constraint_exclusion='OFF';")
cur.execute("ALTER DATABASE cnpj SET maintenance_work_mem='512MB';")
cur.execute("ALTER DATABASE cnpj SET synchronous_commit='OFF';")
cur.execute("ALTER DATABASE cnpj SET work_mem='4GB';")
cur.execute("SELECT pg_reload_conf();")
```
and some just before executing SQL commands such as `INSERT INTO`.
```sql
ALTER TABLE cnae DISABLE TRIGGER ALL;
```
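As a hedged sketch (not a copy of the repository's code), the flow around the load of one table could look like the following; the `ANALYZE` step is an assumption added here to compensate for `autovacuum` being off.

```python
# Hypothetical per-table sequence around the bulk load.
cur.execute("ALTER TABLE cnae DISABLE TRIGGER ALL;")
# ... COPY the CSV files for this table here (see the COPY sketch above) ...
cur.execute("ALTER TABLE cnae ENABLE TRIGGER ALL;")
# With autovacuum off, refresh the planner statistics manually after the load.
cur.execute("ANALYZE cnae;")
```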
Results
Below are the number of records per table and the import times on two machines with different specifications.
Table | Number of Records |
---|---|
cnae | 1,358 |
motivo | 60 |
municipio | 5,570 |
pais | 254 |
natureza_juridica | 89 |
qualificacao | 67 |
simples | 41,093,712 |
empresa | 60,294,472 |
estabelecimento | 63,333,636 |
socio | 24,933,508 |
CPU | Memory | Time (min) |
---|---|---|
3.1 GHz Dual-Core Intel Core i7 | 16 GB 1867 MHz DDR3 | 39 |
2.60 GHz Intel i5 * 12 | 32 GB 3200 MHz DDR4 | 21 |
Python Code and Data Corrections
We had problems creating the indexes and needed to increase Docker's shared memory (`shm_size`) to 1 GB. See the `docker-compose` file in the repository.
Python is used mainly as an orchestrator, but it also makes small corrections to the data and to the file names. The first problem is a duplicate CNPJ record, which prevented the primary key from being created. The other is that some of the original file names have a `$` character in the middle.
There is also an encoding problem in one of the CSV files, a problem that persists across published versions, so Python runs a Perl one-liner at runtime to fix it:

```
perl -i -pe 's/\\x00//g' {fle}
```
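The `$` in the file names can be handled with a small rename step like the one below; this is only an illustration of the idea, and the repository's actual logic may differ.

```python
import os

# Hypothetical cleanup: strip the stray '$' from the extracted file names.
data_dir = "data_files"
for name in os.listdir(data_dir):
    if "$" in name:
        os.rename(os.path.join(data_dir, name),
                  os.path.join(data_dir, name.replace("$", "")))
```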
Writing the orchestration in Python gave us greater flexibility to control the whole process. However, it would be possible to do everything with `psql` alone.
Repository: https://github.com/znc-sistemas/cnpj_import
For more details, check out: https://www.postgresql.org/docs/16/runtime-config-resource.html.