BigData


Concept of UDF in Spark: User-Defined Function

Reading Time: 3 minutes As we all know, Spark provides a whole variety of built-in functions with which you can do almost any sort of transformation on your data frame and achieve your desired output, but sometimes you may find that they don’t meet your requirements. Then what? In that case, you can define your own functions, known as UDFs (User-Defined Functions), which make it possible to write your own Continue Reading
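
To make the idea concrete, here is a minimal sketch of defining and applying a Spark UDF in Scala; the data frame, the "name" column, and the capitalising logic are invented purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfExample extends App {
  val spark = SparkSession.builder().appName("udf-example").master("local[*]").getOrCreate()
  import spark.implicits._

  // A hypothetical data frame; the "name" column exists only for this example.
  val people = Seq("alice", "bob").toDF("name")

  // Wrap an ordinary Scala function as a UDF and apply it to a column,
  // just as you would apply a built-in function.
  val capitalize = udf((s: String) => s.capitalize)
  people.withColumn("capitalized", capitalize($"name")).show()
}
```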

Spark Broadcast Variables Simplified With Example

Reading Time: 3 minutes Welcome back, everyone. Today we will learn about a new yet important concept in Apache Spark called broadcast variables. For new learners, I recommend starting with a Spark introduction blog. What is a Broadcast Variable? Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Imagine you want to make some information, Continue Reading
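
As a hedged sketch of how this looks in code (the lookup map and RDD contents below are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample extends App {
  val spark = SparkSession.builder().appName("broadcast-example").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  // A hypothetical read-only lookup table that every task needs.
  val countryNames = Map("IN" -> "India", "US" -> "United States")
  val broadcastNames = sc.broadcast(countryNames)

  // Each executor keeps one cached copy of the map instead of receiving
  // a fresh copy with every task.
  val resolved = sc.parallelize(Seq("IN", "US", "IN"))
    .map(code => broadcastNames.value.getOrElse(code, "Unknown"))

  resolved.collect().foreach(println)
}
```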

Snowflake integration with Power BI tool

Reading Time: 4 minutes Snowflake is a popular cloud data warehouse (DWH) solution, and in this blog we will discuss how to get insights from your data using Power BI as the reporting and analytics tool.

MachineX: Boosting performance with XGBoost

Reading Time: 5 minutes In this blog, we are going to see how XGBoost works and some of its important features with the help of an example. Many of us have heard about tree models and boosting techniques. Let’s put these concepts together and talk about XGBoost, one of the most powerful machine learning algorithms out there. XGBoost stands for eXtreme Gradient Boosted trees. The name XGBoost, though, actually Continue Reading
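
As a rough sketch of what training with XGBoost can look like on Spark, assuming the xgboost4j-spark library is available and a data frame with hypothetical `features` and `label` columns exists (the hyper-parameter values are illustrative, not tuned):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

object XgboostSketch {
  // Illustrative hyper-parameters only; nothing here is tuned.
  val params = Map(
    "eta" -> 0.1,                      // learning rate
    "max_depth" -> 6,                  // depth of each boosted tree
    "objective" -> "binary:logistic",  // binary classification
    "num_round" -> 100                 // number of boosting rounds
  )

  val classifier = new XGBoostClassifier(params)
    .setFeaturesCol("features")   // hypothetical vector column
    .setLabelCol("label")         // hypothetical label column

  // Given a training DataFrame with those columns:
  // val model = classifier.fit(trainingDF)
  // model.transform(testDF).show()
}
```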

Time Travel: Data versioning in Delta Lake

Reading Time: 3 minutes In today’s Big Data world, we process large amounts of data continuously and store the resulting data in a data lake. This keeps changing the state of the data lake. But sometimes we would like to access a historical version of our data, which requires versioning of the data. Such data management simplifies our data pipeline by making it easy for professionals or organizations to Continue Reading
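
A minimal sketch of time travel in Scala, assuming the delta-core library is on the classpath and a Delta table already exists at a made-up path:

```scala
import org.apache.spark.sql.SparkSession

object DeltaTimeTravel extends App {
  val spark = SparkSession.builder().appName("delta-time-travel").master("local[*]").getOrCreate()

  // Placeholder path to an existing Delta table.
  val tablePath = "/tmp/delta/events"

  // The table as it is right now...
  val current = spark.read.format("delta").load(tablePath)

  // ...or as it was at an earlier version or point in time.
  val firstVersion = spark.read.format("delta").option("versionAsOf", 0).load(tablePath)
  val asOfDate = spark.read.format("delta").option("timestampAsOf", "2019-01-01").load(tablePath)

  current.show()
  firstVersion.show()
}
```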

Big Data Landscape explained

Reading Time: 5 minutes Big Data has now evolved into a buzzword, and it seems everyone is either working on it or wants to work on it. However, most people associate Big Data with some of the popular tool sets like Hadoop, Spark, and NoSQL databases like Hive, Cassandra, HBase, etc. HDFS made Big Data popular as it gave us an option to distribute the data Continue Reading

HDFS: A Conceptual View

Reading Time: 5 minutes There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over the network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information Continue Reading
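
As a small illustration of "accessing remote data as if it were local", here is a hedged sketch using the Hadoop FileSystem API from Scala; the NameNode URI and file path are placeholders:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsReadExample extends App {
  // Placeholder NameNode URI; point this at your own cluster.
  val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

  // Placeholder file path inside HDFS.
  val in = fs.open(new Path("/data/example.txt"))
  try {
    // Read the remote file exactly as we would a local one.
    Source.fromInputStream(in).getLines().take(5).foreach(println)
  } finally {
    in.close()
  }
}
```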

Getting Introduced with Presto

Reading Time: 3 minutes Hi folks! In today’s blog I will be introducing you to a new open-source distributed SQL query engine: Presto. It is designed for running SQL queries over Big Data (petabytes of data) and was designed by the people at Facebook. Introduction Quoting its formal definition, “Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of Continue Reading
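
As a hedged sketch, Presto can be queried from Scala through its JDBC driver; the coordinator address, user, catalog, schema, and table below are placeholders for illustration:

```scala
import java.sql.DriverManager
import java.util.Properties

object PrestoQuery extends App {
  // Placeholder coordinator host, catalog ("hive"), and schema ("default").
  val props = new Properties()
  props.setProperty("user", "analyst")

  val conn = DriverManager.getConnection("jdbc:presto://localhost:8080/hive/default", props)
  val stmt = conn.createStatement()

  // Placeholder table; any ANSI SQL query against the configured catalog works here.
  val rs = stmt.executeQuery("SELECT count(*) FROM orders")
  while (rs.next()) println(rs.getLong(1))

  rs.close(); stmt.close(); conn.close()
}
```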

Business Intelligence-Data Visualization: Tableau

Reading Time: 3 minutes Spark, Big Data, NoSQL, and Hadoop are some of the most widely used, chart-topping technologies that we frequently work with at Knoldus. When these terms are used, one thing that comes into the picture is ‘huge data, millions/billions of records’. Knoldus developers use these terms frequently, and managing (and managing here means storing data, rectifying data, normalizing it, cleaning it, and much more) such an amount of data is really Continue Reading

Solr Relevance Search Using SolrJ In Scala

Reading Time: 3 minutes In this blog we will see how we can perform relevance (or relevant) search in Solr using the SolrJ HTTP API in Scala. To briefly describe what relevance search is: a developer working on search relevancy focuses on the following areas as the “first line of defense”: Text Analysis: the act of “normalizing” text from both a search query and a search result to Continue Reading
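
A minimal sketch of issuing a relevance-ranked query through SolrJ from Scala; the Solr URL, collection, query string, and field names are assumptions made for illustration:

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object SolrRelevanceSearch extends App {
  // Placeholder Solr core URL; adjust to your own setup.
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()

  val query = new SolrQuery("title:spark")   // full-text query, ranked by relevance
  query.setRows(10)                          // top 10 documents by score
  query.setFields("id", "title", "score")

  val results = client.query(query).getResults
  for (i <- 0 until results.size()) {
    val doc = results.get(i)
    println(s"${doc.getFieldValue("title")} -> score ${doc.getFieldValue("score")}")
  }
  client.close()
}
```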

Apache Spark + Cassandra: Basic steps to install and configure Cassandra and use it with Apache Spark, with an example

Reading Time: 3 minutes To build an application using Apache Spark and Cassandra, you can use the DataStax spark-cassandra-connector to communicate with Spark. Before we communicate with Spark using the connector, we should know how to configure Cassandra, so the following are the prerequisites to run the example smoothly. Follow these steps to install and configure Cassandra. If you are new to Cassandra, first we need to install Cassandra on our Continue Reading
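
A minimal sketch of what reading a Cassandra table from Spark can look like once the connector is on the classpath; the contact point, keyspace, and table names are placeholders:

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

object SparkCassandraExample extends App {
  // Placeholder Cassandra contact point.
  val spark = SparkSession.builder()
    .appName("spark-cassandra-example")
    .master("local[*]")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()

  // RDD API from the connector: each row comes back as a CassandraRow.
  val rows = spark.sparkContext.cassandraTable("test_keyspace", "users")
  rows.take(10).foreach(println)

  // The DataFrame API works as well.
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "test_keyspace", "table" -> "users"))
    .load()
  df.show()
}
```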

Handling Large Data File Using Scala and Akka

Reading Time: 6 minutes We needed to handle large data files, reaching gigabytes in size, in a Scala-based Akka application of ours. We are interested in reading data from a file and then operating on it. In our application, a single line in a file forms a unit of data to be worked upon. That means that we can only operate on lines in our big data file. These are Continue Reading
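
One way to process such a file line by line without loading it into memory is Akka Streams, shown below as a hedged sketch assuming Akka 2.6+; the original post may take a different, actor-based approach, and the file path is a placeholder:

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString

object BigFileLineCount extends App {
  // Assumes Akka 2.6+, where the ActorSystem provides the stream materializer.
  implicit val system: ActorSystem = ActorSystem("big-file-reader")
  import system.dispatcher

  // Placeholder path to a multi-gigabyte input file.
  val done = FileIO.fromPath(Paths.get("/tmp/big-data-file.txt"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 8192, allowTruncation = true))
    .map(_.utf8String)                               // one element per line of the file
    .runWith(Sink.fold(0L)((count, _) => count + 1)) // count lines without buffering them

  done.foreach { lines =>
    println(s"Processed $lines lines")
    system.terminate()
  }
}
```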