data analysis

MachineX: Sentiment analysis with NLTK and Machine Learning

Reading Time: 9 minutes In this blog, we are going to see how we can use the NLP library NLTK for sentiment analysis. Sentiment analysis is a common NLP task nowadays, one that every data scientist or person working in data science needs to perform. Introduction to NLP: Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language. NLP enables machines to derive Continue Reading

Apache Spark: Tricks to Increase Job Performance

Reading Time: 2 minutes Apache Spark is quickly being adopted in the real world, and many companies, such as Uber, use it in production. Spark is gaining popularity because it also supports developing streaming applications and doing machine learning, which helps companies get better results in production along with proper analysis. Although companies are using Spark in Continue Reading

Apache Spark: Read Data from S3 Bucket

Reading Time: < 1 minute Amazon S3: Accessing an S3 Bucket through Spark. Edit the spark-defaults.conf file: you need to add the 3 lines below, which consist of your S3 access key, secret key, and file system

Apache Spark: Repartitioning v/s Coalesce

Reading Time: 3 minutes Does partitioning help you increase or decrease job performance? Spark splits data into partitions, and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run Spark applications efficiently. Now, diving into our main topic, i.e., Repartitioning v/s Coalesce. What is Coalesce? The coalesce method reduces the number Continue Reading
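The difference between the two operations named in this excerpt can be sketched in plain Python. This is a toy model, not Spark's API: the functions `coalesce` and `repartition` below are hypothetical stand-ins that work on a list of lists, and the round-robin redistribution is an assumption made for illustration. The sketch relies only on the general behaviour that coalescing merges existing partitions without a full shuffle, while repartitioning redistributes every record.

```python
from typing import List

Partitions = List[List[int]]  # toy model: each inner list is one partition


def coalesce(partitions: Partitions, n: int) -> Partitions:
    """Merge whole partitions into at most n groups (no per-record shuffle)."""
    if n >= len(partitions):
        # Without a shuffle the partition count cannot grow; return a copy.
        return [list(p) for p in partitions]
    merged: Partitions = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are appended to the same target as a unit.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: Partitions, n: int) -> Partitions:
    """Redistribute every record across n new partitions (a full shuffle)."""
    records = [r for part in partitions for r in part]
    out: Partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        out[i % n].append(record)  # round-robin is an illustrative choice
    return out


parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))     # whole partitions merged together
print(repartition(parts, 3))  # records spread evenly over 3 partitions
```

In this model, coalescing is cheaper because records only ever move together with their whole partition, whereas repartitioning touches every record but yields evenly sized partitions and can also increase the partition count.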

Apache Spark

Deep Dive into Apache Spark Transformations and Action

Reading Time: 4 minutes In our previous blog on Apache Spark, we briefly discussed what Transformations and Actions are. Now we will go deeper into the topic and understand what they actually are and how they play a vital role in working with Apache Spark. What is Spark RDD? Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects Continue Reading
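The split between transformations and actions mentioned in this excerpt can be illustrated with a small pure-Python model. `ToyRDD` is a hypothetical class written for this sketch and is not part of Spark; it only mimics two ideas: transformations return a new immutable object that records what should be done, and nothing is computed until an action such as `collect` or `count` is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, data: Iterable[Any], ops: Tuple = ()) -> None:
        self._data = tuple(data)  # the source data is never modified
        self._ops = tuple(ops)    # recorded transformations, in order

    # Transformations: record the step and return a NEW ToyRDD (lazy).
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(self._data, self._ops + (("filter", pred),))

    # Actions: run every recorded step and return a concrete result.
    def collect(self) -> List[Any]:
        out = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def count(self) -> int:
        return len(self.collect())


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # lets us observe when the function really runs
    return x * 2


rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(calls)          # still empty: no transformation has executed yet
print(rdd.collect())  # the action triggers the recorded steps
```

Because each transformation returns a new object instead of changing the old one, the recorded list of steps is enough to recompute a result from the source data, which is the intuition behind the fault tolerance the excerpt mentions.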

Big Data Evolution: Migrating on-premise database to Hadoop

Reading Time: 4 minutes We are now generating massive volumes of data at an accelerated rate. To meet business needs, address changing market dynamics as well as improve decision-making, sophisticated analysis of this data from disparate sources is required. The challenge is how to capture, store and model these massive pools of data effectively in relational databases. Big data is not a fad. We are just at the beginning Continue Reading

Data Analysis using Python: Pandas

Reading Time: 3 minutes In this blog, I am going to explain pandas, an open-source library for data manipulation, analysis, and cleaning. Pandas is a high-level data manipulation tool developed by Wes McKinney. The name Pandas is derived from "panel data", an econometrics term for multidimensional data. Pandas is built on top of NumPy. Five typical steps in the processing and analysis of Continue Reading