Importing large data volumes into PostgreSQL looks simple — until you try to import 170 million records and discover that the default approach takes hours, or fails before it finishes.

Necto Systems documented the full import of Brazil’s public CNPJ dataset (the National Registry of Legal Entities) — 35 compressed CSV files, roughly 170 million records — landing at 21 minutes on mid-range hardware. This article breaks down the decisions that made it possible.


The Problem with Conventional Approaches

An INSERT statement per record is the most intuitive way to populate a database. It’s also the slowest for large volumes: each INSERT creates a transaction, validates constraints, updates indexes, and writes to the WAL (Write Ahead Log). At 170 million records, that’s a non-starter.

The solution: PostgreSQL’s COPY command, which moves data directly between files and tables, bypassing most of the transactional overhead.


The Solution Architecture

Components

  • PostgreSQL running in Docker with parameters tuned for batch loading
  • Python as the orchestrator — responsible for processing files, fixing inconsistencies, and triggering COPY
  • Perl for one specific encoding case in a corrupted CSV file

Why Python as the Orchestrator

Pure SQL lacks the flexibility to handle exceptions in real time: duplicates, filenames with special characters, incorrect encoding. Python lets you detect and correct these problems without halting the process — moving on to the next file when an issue is identified.


Configuring PostgreSQL for Batch Loading

The optimizations fall into two groups:

Startup parameters (via Docker):

autovacuum = off
fsync = off
full_page_writes = off
shared_buffers = 768MB
checkpoint_timeout = 15min

fsync = off and full_page_writes = off are risky in normal production — if the server crashes mid-load, the data can be corrupted. For a public-data import that can simply be re-run, the risk is acceptable given the performance gain.

Runtime settings:

SET constraint_exclusion = off;
SET maintenance_work_mem = '512MB';
SET synchronous_commit = off;
SET work_mem = '4GB';

work_mem = 4GB is critical for index creation after the import — indexes built with insufficient memory are dramatically slower.


Import Steps

  1. Disable triggers on the target tables before starting
  2. Raise Docker’s shared memory to 1GB (required for index creation)
  3. Process each file with Python: decompress, fix encoding, normalize column names
  4. Run COPY for each processed file
  5. Re-enable triggers and create indexes after the full load

Creating indexes after the load is faster than keeping them updated during — especially with elevated work_mem.


Results

HardwareTotal time
Intel Core i7 Dual-Core 3.1 GHz, 16GB DDR339 minutes
Intel i5 12-core 2.60 GHz, 32GB DDR421 minutes

Challenges Encountered

Duplicates in the CNPJ data: the public dataset contained duplicate records that blocked primary-key creation. Fix: identify and remove duplicates before creating the constraint.

Filenames with special characters: some files in the package had non-ASCII characters in their names. Python (with pathlib) handles this more cleanly than a shell script.

Corrupted encoding: one CSV file had inconsistent encoding. A one-line Perl script was enough to convert it to UTF-8 before processing.


When This Approach Applies

One-off or periodic imports of large volumes — government data, legacy-system migration, initial data-warehouse load — benefit from this configuration. For continuous incremental loads in production, the optimization approach is different (replication, partitioning, COPY with upsert). In every case, source data quality determines what reaches the destination — ETL problems are frequently quality problems in disguise.

Necto Systems applies data engineering to projects involving large volumes, multiple sources, and integration with legacy systems. Database performance is often the unidentified bottleneck in projects that look like they’re about software development.

Talk to a specialist if your company is hitting data-volume bottlenecks.


Frequently Asked Questions

What is PostgreSQL’s COPY command, and why is it faster than INSERT? COPY is PostgreSQL’s native command for moving data in bulk between files and tables. It’s faster than INSERT because it processes many records in a single operation, reduces transaction overhead, and can be combined with settings that temporarily disable safety mechanisms unnecessary for controlled imports (such as fsync).

What is the WAL (Write Ahead Log) in PostgreSQL? The WAL is the mechanism that guarantees durability in PostgreSQL: every change is recorded in the log before it’s applied to the database. This guarantees recovery in case of failure, but adds overhead on write-intensive operations. For batch imports of data that can be re-run, disabling or reducing the WAL is a valid optimization.

How do you configure PostgreSQL for high-volume imports? The main settings are: fsync=off and full_page_writes=off (to reduce disk operations), elevated shared_buffers (to cache data in memory), elevated maintenance_work_mem (for index creation), and synchronous_commit=off (to reduce commit latency). These settings should be reverted after the import for normal operation.

Why create indexes after the import, not before? Keeping indexes updated during the import means every inserted record must update all of the table’s indexes. For 170 million records, that multiplies the work. Creating the indexes once, after the full load, is significantly faster — especially with elevated work_mem, which lets the operation use memory instead of disk.

How do you handle CSV files with incorrect encoding in Python? The most robust approach is to detect the encoding with the chardet library and convert to UTF-8 before processing. For simple cases, specifying the encoding directly in open() or Pandas read_csv() is enough. For badly corrupted files, a Perl script or iconv on the command line is often more efficient.

What is ETL, and how does it relate to large-volume data import? ETL (Extract, Transform, Load) is the process of moving data from one or more sources to a destination, applying transformations along the way. Large-volume import is the Load phase of ETL — and it’s where performance problems show up most often. A well-designed ETL pipeline separates each stage, makes each one testable, and makes it easier to pinpoint where the bottleneck is.

How does Necto Systems handle projects with large data volumes? Necto designs data pipelines that scale with the operation’s volume — from one-off legacy-data imports to continuous ingestion pipelines. The work includes diagnosing performance bottlenecks, choosing the load strategy suited to the use case, and configuring the database environment for the expected volume. We serve companies in agribusiness, the public sector, and environmental sectors with data volumes that demand this attention.