It’s pretty simple to ingest data in the Postgres using the insert query, but in the big data world, we have a lot of data that we can not insert using insert queries. We get the data in CSV files that we want to import directly to the Postgres. It will take a lot of effort and time if we will try to import these large CSVs using insert query or in batch way.
We can do the import using the COPY command in a very simple way as follows:
psql -h 127.0.0.1 -d testDB -U postgres -c”copy employee from ‘/home/rishi/files/employee.csv’ WITH DELIMITER ‘~’ CSV”
- -h: It represents the host address
- -d: Database name
- -U: Username of the Postgres instance.
- -c: The command
When you will run the above command, then it will ask for the password where you need to enter the Postgres password.
It’s very simple, isn’t it?
(Not going deep in the COPY command here. You can get more details about it here)
But the real pain comes when we get a lot of CSV files. Running the COPY command one by one multiple time (for ex. 50 or 100) is the real pain. So for that, we would like to write a code that will help us in reducing the manual effort.
Suppose, we have multiple tables data at some location. Each table data is having multiple CSVs, so in short, we have multiple tables * multiple CSV files.
To import these CSVs files, let’s write the code. (I am writing it in Scala here, but it can easily be converted to Java). I am using the same COPY command in the code.
First, add the below dependency in your pom.xml:
<dependency> <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId> <version>42.2.12</version> </dependency>
then, create a simple DbHandler for creating a connection with Postgres.
object DbHandler { def getConnection: Connection = { Class.forName("org.postgresql.Driver") DriverManager.getConnection("jdbc:postgresql://localhost:5432/testDB?user=postgres&password=postgres") } }
then create a main class, where we will write the main logic:
object Application { def main(args: Array[String]): Unit = { val conn = DbHandler.getConnection val files = new File("/home/rishi/files") .listFiles((_, name: String) => name.toLowerCase.endsWith(".csv")).toList files.foreach { file => val name = file.getName println("file name : " + name) val tableName = name.split('.')(0) val rowsInserted = new CopyManager(conn.asInstanceOf[BaseConnection]) .copyIn(s"COPY $tableName FROM STDIN (DELIMITER '~',FORMAT csv)", new BufferedReader(new FileReader(file.getPath))) println(s"$rowsInserted row(s) inserted for file $file") } } }
This code is also a very simple one. We have used CopyManager here which takes the Postgres connection and runs the COPY command.
In the above code, we are assuming that the CSV name is the same as the table name. This way we would not need to provide the table name explicitly and the code will handle on its own.
I hope, this will blog will help you in some way to reduce the manual effort.