Big Data and Fast Data

Top 7 Data Analytics and Management Trends for 2020

Reading Time: 5 minutes We live in an era of data, which lies at the heart of digital transformation. Datasets are no longer as simple as they once were: they have grown in volume, velocity, and complexity and, above all, now come from multiple sources. Tech giants like Google, Netflix, and Amazon crunch massive amounts of data daily to give you a personalized experience. Continue Reading

Amazon EMR

Reading Time: 3 minutes Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor. Common problems of on-premises big data environments include a lack of agility, excessive costs, Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly being adopted in the real world, and companies like Uber are using it in production. Spark is gaining popularity because it also supports developing streaming applications and doing machine learning, which helps companies get better results in production along with proper analysis. Although companies are using Spark in Continue Reading

Spark: ACID Transaction with Delta Lake

Reading Time: 3 minutes Spark lacks some of the most essential features of a reliable data processing system, such as atomic APIs and ACID transactions, as discussed in the blog Spark: ACID compliant or not. Delta Lake offers a solution to this problem. It acts as an intermediary between Apache Spark and the storage system. Instead of directly interacting with the storage layer, Continue Reading

Time Travel: Data versioning in Delta Lake

Reading Time: 3 minutes In today’s Big Data world, we continuously process large amounts of data and store the results in a data lake, which keeps changing the state of the lake. Sometimes, however, we would like to access a historical version of our data, and that requires data versioning. This kind of data management simplifies our data pipeline by making it easy for professionals or organizations to Continue Reading

Data Lake – Build it in Phases

Reading Time: 3 minutes How to build a data lake, and the phases involved in doing so.

Apache Spark: Read Data from S3 Bucket

Reading Time: < 1 minute Accessing an Amazon S3 bucket through Spark: edit the spark-defaults.conf file and add three lines containing your S3 access key, secret key, and file system.

Scale Out with Cluster Sharding

Reading Time: 3 minutes If your actors are distributed across several nodes in the cluster, Cluster Sharding allows you to interact with them using only their logical identifier, without worrying about their physical location. Even if an actor relocates to a new node, Akka will take care of locating it for you. You just need to send a message to it as if it were located on your local node.
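
To make the idea of location transparency concrete, here is a minimal pure-Python sketch. It is not Akka code and does not use Akka's API: the names `ShardRegion`, `shard_for`, `tell`, `rebalance`, and `NUM_SHARDS` are hypothetical, and the hash-modulo mapping from identifier to shard is an assumption made for illustration. The point is only that the sender supplies a logical identifier and the routing layer works out which node currently hosts it.

```python
import zlib
from typing import Dict, List

NUM_SHARDS = 10  # hypothetical shard count, chosen only for this sketch


def shard_for(entity_id: str) -> int:
    """Map a logical entity identifier to a shard with a stable hash."""
    return zlib.crc32(entity_id.encode("utf-8")) % NUM_SHARDS


class ShardRegion:
    """Toy router: callers address entities by identifier, never by node."""

    def __init__(self, nodes: List[str]) -> None:
        # Spread the shards across the available nodes up front.
        self.shard_home: Dict[int, str] = {
            shard: nodes[shard % len(nodes)] for shard in range(NUM_SHARDS)
        }
        self.mailboxes: Dict[str, List[str]] = {}

    def tell(self, entity_id: str, message: str) -> str:
        """Deliver a message and return the node that currently hosts the entity."""
        node = self.shard_home[shard_for(entity_id)]
        self.mailboxes.setdefault(entity_id, []).append(message)
        return node

    def rebalance(self, shard: int, new_node: str) -> None:
        """Move a whole shard (and every entity in it) to another node."""
        self.shard_home[shard] = new_node


region = ShardRegion(["node-a", "node-b"])
first_host = region.tell("user-42", "hello")
region.rebalance(shard_for("user-42"), "node-c")
second_host = region.tell("user-42", "again")  # same call, new physical location
```

The caller's code is identical before and after the move: only the routing table inside the region changes, which is the property the excerpt describes.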

IoT Application Development: Tips & Tricks for success

Reading Time: 6 minutes The Internet of Things (IoT) is everywhere. From smart homes and smart cities to fitness trackers and connected cars, we have seen it all, and there is more to come. As we gear up for 2020, studies suggest that IoT will comprise 30 billion connected devices, and that number may go up to 500 billion in another 10 years. IoT is changing the trajectory Continue Reading

Akka Cluster Formation Fundamentals

Reading Time: 3 minutes Every actor in Akka has an address. The actor may be local or remote; remote actors require communication over the network. Each actor system in a cluster is called a member or node. A node is addressed by a combination of hostname, port, and UUID (regenerated when the actor system is restarted). An actor can join the cluster with this combination to Continue Reading

Apache Spark: Repartitioning v/s Coalesce

Reading Time: 3 minutes Does partitioning help you increase or decrease job performance? Spark splits data into partitions and performs computation on each partition in parallel. It is very important to understand how data is partitioned and when you need to modify the partitioning manually to run Spark applications efficiently. Now, on to our main topic: repartitioning vs. coalesce. What is coalesce? The coalesce method reduces the number Continue Reading
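
As a rough illustration of the contrast the excerpt introduces, the following pure-Python sketch models partitions as plain lists. It is not Spark code: the functions `coalesce` and `repartition` here are simplified stand-ins written for this example, and the round-robin redistribution is an assumption about how a full shuffle might spread records. The sketch shows that merging whole partitions moves less data but can leave the result unbalanced, while redistributing every record evens out the sizes.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole existing partitions into n groups without splitting any of them."""
    if n >= len(partitions):
        # In this model, coalesce can only reduce the partition count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are folded into the same target group.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every record across n new partitions (a full shuffle)."""
    shuffled: List[Partition] = [[] for _ in range(n)]
    records = [record for part in partitions for record in part]
    for i, record in enumerate(records):
        shuffled[i % n].append(record)  # round-robin placement
    return shuffled


# Four skewed partitions: the first holds most of the data.
data = [[1, 2, 3, 4, 5, 6], [7], [8, 9], [10]]

coalesced = coalesce(data, 2)      # sizes 7 and 3: cheap, but still skewed
rebalanced = repartition(data, 2)  # sizes 5 and 5: balanced, every record moved
```

Under these assumptions, the trade-off is visible in the partition sizes: `coalesced` keeps the skew of the input, whereas `rebalanced` is even at the cost of touching every record.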