Import multiple CSV files into the Postgres through Java/Scala code.

Reading Time: 2 minutes

It’s pretty simple to ingest data in the Postgres using the insert query, but in the big data world, we have a lot of data that we can not insert using insert queries. We get the data in CSV files that we want to import directly to the Postgres. It will take a lot of effort and time if we will try to import these large CSVs using insert query or in batch way.

We can do the import using the COPY command in a very simple way as follows:

psql -h 127.0.0.1 -d testDB -U postgres -c”copy employee from ‘/home/rishi/files/employee.csv’ WITH DELIMITER ‘~’ CSV”

  • -h: It represents the host address
  • -d: Database name
  • -U: Username of the Postgres instance.
  • -c: The command

When you will run the above command, then it will ask for the password where you need to enter the Postgres password.

It’s very simple, isn’t it?

(Not going deep in the COPY command here. You can get more details about it here)

But the real pain comes when we get a lot of CSV files. Running the COPY command one by one multiple time (for ex. 50 or 100) is the real pain. So for that, we would like to write a code that will help us in reducing the manual effort.

Suppose, we have multiple tables data at some location. Each table data is having multiple CSVs, so in short, we have multiple tables * multiple CSV files.

To import these CSVs files, let’s write the code. (I am writing it in Scala here, but it can easily be converted to Java). I am using the same COPY command in the code.

First, add the below dependency in your pom.xml:

<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>42.2.12</version>
</dependency>

then, create a simple DbHandler for creating a connection with Postgres.

object DbHandler {
  def getConnection: Connection = {
    Class.forName("org.postgresql.Driver")
    DriverManager.getConnection("jdbc:postgresql://localhost:5432/testDB?user=postgres&password=postgres")
  }
}

then create a main class, where we will write the main logic:

object Application {

  def main(args: Array[String]): Unit = {
    val conn = DbHandler.getConnection
    val files = new File("/home/rishi/files")
      .listFiles((_, name: String) => name.toLowerCase.endsWith(".csv")).toList

    files.foreach { file =>
      val name = file.getName
      println("file name : " + name)
      val tableName = name.split('.')(0)
      val rowsInserted = new CopyManager(conn.asInstanceOf[BaseConnection])
        .copyIn(s"COPY $tableName FROM STDIN (DELIMITER '~',FORMAT csv)",
          new BufferedReader(new FileReader(file.getPath)))

      println(s"$rowsInserted row(s) inserted for file $file")
    }
  }
}

This code is also a very simple one. We have used CopyManager here which takes the Postgres connection and runs the COPY command.

In the above code, we are assuming that the CSV name is the same as the table name. This way we would not need to provide the table name explicitly and the code will handle on its own.

I hope, this will blog will help you in some way to reduce the manual effort.

blog-footer

Written by 

Rishi is a tech enthusiast with having around 10 years of experience who loves to solve complex problems with pure quality. He is a functional programmer and loves to learn new trending technologies. His leadership skill is well prooven and has delivered multiple distributed applications with high scalability and availability by keeping the Reactive principles in mind. He is well versed with Scala, Akka, Akka HTTP, Akka Streams, Java8, Reactive principles, Microservice architecture, Async programming, functional programming, distributed systems, AWS, docker.