SQL

Defining your workflow: Why Not Airflow?

Reading Time: 4 minutes What is Apache Airflow? Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. These functions are achieved with Directed Acyclic Graphs (DAGs) of tasks. It is open source and still in the incubator stage. It was started in 2014 at Airbnb, and since then it has earned an excellent reputation, with approximately 800 contributors and 13,000 stars on GitHub. Continue Reading

Using Vertica with Spark-Kafka: Reading

Reading Time: 4 minutes We live in a world of big data, where even small results require working through very large volumes of data. This is the result of the rapid increase in data collection in the modern world. Data at this scale calls for tools that can work on such large chunks of it. I am pretty sure that you guys Continue Reading

KSQL: Streams and Tables

Reading Time: 3 minutes By now you must be familiar with KSQL and how to get started with it. If not, check out Part 1 of this series, KSQL: Getting started with Streaming SQL for Apache Kafka. In this blog, we’ll move one step forward and look at the dual streaming model to see what abstractions KSQL uses to process data. All the data that we Continue Reading

Streaming data from PostgreSQL using Akka Streams and Slick in Play Framework

Reading Time: 4 minutes In this blog post I’ll try to explain how you can stream data directly from a PostgreSQL database using Slick (Scala’s database access/query library) and Akka Streams (an implementation of the Reactive Streams specification on top of the Akka toolkit) in Play Framework. The process is going to be pretty straightforward in terms of implementation where data is read from one Continue Reading

KnolX: Understanding Spark Structured Streaming

Reading Time: 1 minute Hello everyone, Knoldus organized a session on 5th January 2018. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides and video of the session. Slides:

Knolx: Getting started with Presto

Reading Time: 1 minute Hi all, Knoldus organized a 1-hour session on 8th September 2017. The topic was “Getting started with Presto”. Many people joined and enjoyed the session. I am going to share the slides here. Please let me know if you have any questions related to the linked slides or video. The slides of the Knolx are here: And here’s the video of the session: For any Continue Reading

SQL made easy and secure with Slick

Reading Time: 5 minutes Slick stands for Scala Language-Integrated Connection Kit. It is a Functional Relational Mapping (FRM) library for Scala that makes it easy to work with relational databases. Slick can be seen as a replacement for writing SQL queries as strings, offering a nicer API for handling connections, fetching results, and using a query language that is integrated more cleanly into Scala. You can write your database queries Continue Reading

Connecting To Presto server using JDBC

Reading Time: 2 minutes Hi guys, in this blog we’ll be discussing how to make a connection to a Presto server using JDBC, but before we get started let’s discuss what Presto is. What is Presto? Presto is an open source distributed SQL query engine for running interactive analytic queries against different data sources, whose sizes may range from gigabytes to petabytes. It runs on a cluster Continue Reading

Best Practices for Using Slick on Production

Reading Time: 5 minutes Slick is one of the most popular libraries for relational database access in the Scala ecosystem. When we use Slick in production, some questions arise, such as where the mapping tables should be defined, how to join with other tables, and how to write unit tests. Apart from this, there is a lack of clarity on the design guidelines. In this blog post, I am Continue Reading

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. Also, the DataFrame API came with many under-the-hood optimizations Continue Reading

Easiest Way To Insert Scala Collection into PostgreSQL using Slick

Reading Time: 2 minutes A few days ago, I had a scenario in which I had to insert a Scala collection into PostgreSQL using Slick. My PostgreSQL table has some columns with data types such as arrays, hstore, etc. I tried to do this using Slick, but did not succeed. After struggling with it for a whole day, I found a solution: a Slick extension, slick-pg, which supports the following PostgreSQL types: Continue Reading
