Upala Corporation

September 21, 2023

Modernizing Data Lakes

Introduction Hadoop was created in 2006 by engineers working at Yahoo! (most notably Doug Cutting) to store and process data at internet scale, with a focus on the 3 V’s (volume, velocity, and variety) that define the dimensions of big data. Volume refers to the amount of data, velocity refers to the speed of data processing, and variety refers to the number of
Read more
August 25, 2023

AWS Cloud Migration case study

Summary A leading player in the internet publishing industry embarked on a strategic initiative to migrate its on-premises infrastructure to Amazon Web Services (AWS) to greatly improve scalability and elasticity (it needed burst capacity to serve traffic spikes during certain events happening around the country), improve performance, and optimize costs. This case study
Read more
August 20, 2023

OAuth Integration for Kafka

Summary We recently pushed OAuth2 authentication for Kafka into production for a Fortune 500 client. KIP-255 allows one to implement OAuth2 authentication from Java clients to brokers and for inter-broker authentication. In addition to these, we also implemented OAuth2 authentication for REST Proxy clients. In this post, we give enough details for anyone interested to
Read more
November 7, 2019

Kafka+Kerberos

Summary Here we document how to secure a Kafka cluster with Kerberos. Kerberos is one of the most widely used security protocols in corporate networks, thanks largely to the widespread adoption of Microsoft Active Directory in corporations for directory-based identity-related services. A fair number of Kafka installations live alongside Hadoop in a big data ecosystem. Kerberos is
Read more
November 9, 2017

Kafka Security using SSL

Summary There are a few posts on the internet that talk about Kafka security, such as this one. However, none of them covers the topic from end to end. This article is an attempt to bridge that gap for folks who are interested in securing their clusters from end to end. We will discuss securing
Read more
March 4, 2017

Setting up Tez on CDH Cluster

Summary Cloudera has no official support for the Tez execution engine. They push their customers to use Impala instead (or Hive on Spark nowadays). This article describes how we set up the Tez engine on a CDH cluster, including the Tez UI. Install Java/Maven Follow the official instructions on how to install Java. Ensure the version is the same as the
Read more
October 5, 2015

Building a rack-aware mirroring scheme in Greenplum

Summary Customers using software-only installations of Greenplum have the option to configure mirrors that fit their availability and performance needs. This post describes an approach that maximizes availability and performance when a Greenplum cluster uses servers on 2 or more racks. Rack/Server setup In a typical data center, customers usually have
Read more
July 20, 2015

Performing FULL VACUUM in Greenplum efficiently

Summary Greenplum is an MPP warehouse platform based on the PostgreSQL database. We discuss one of the most important and most common maintenance tasks that need to be executed on a periodic basis on the platform. What causes bloat? The Greenplum platform is ACID compliant. The isolation property ensures that the concurrent execution of transactions results in a
Read more
February 12, 2015

Design Pattern: Avoid HAWQ creating too many small files on HDFS for tables

Summary This post deals with a little-known problem that HAWQ has on Hadoop clusters: it creates too many small files on HDFS even when there’s a negligible amount of data in the tables. This post talks about our encounter with this problem and how we overcame it. Bear in mind that the solution may not work for
Read more
January 24, 2015

Authorization in Hadoop using Apache Ranger

Summary Over the past couple of years, Apache Hadoop has made great progress in the area of security. Security for any computing system is divided into two categories: Authentication is the process of ascertaining that somebody really is who they claim to be. In Hadoop, this is achieved via Kerberization. This post will not cover any
Read more