• Modernizing Data Lakes

    Introduction Hadoop was created in 2006 by engineers working at Yahoo! (most notably, Doug Cutting) with the idea of storing and processing data at internet scale, focused on the three V’s (volume, velocity, and variety), the defining dimensions of big data. Volume refers to the amount of data, Velocity refers to the speed of data processing, and Variety refers to the number of

    Read more

  • Summary A leading player in the internet publishing industry embarked on a strategic initiative to migrate its on-premises infrastructure to Amazon Web Services (AWS) to greatly improve scalability and elasticity (it needed burst capacity to serve traffic spikes during certain events happening around the country), improve performance, and optimize costs. This case study

    Read more

  • Summary We recently pushed OAuth2 authentication for Kafka into production for a Fortune 500 client. KIP-255 allows one to implement OAuth2 authentication from Java clients to brokers and for inter-broker authentication. In addition to these, we also implemented OAuth2 authentication for REST Proxy clients. In this post, we give enough details for anyone interested to

    Read more

  • Kafka+Kerberos

    Summary Here we document how to secure a Kafka cluster with Kerberos. Kerberos is one of the most widely used security protocols in corporate networks, thanks largely to the widespread adoption of Microsoft Active Directory in corporations for directory-based identity-related services. A fair number of Kafka installations live alongside Hadoop in a big data ecosystem. Kerberos is

    Read more

  • Kafka Security using SSL

    Summary There are a few posts on the internet that talk about Kafka security, such as this one. However, none of them covers the topic from end to end. This article is an attempt to bridge that gap for folks who are interested in securing their clusters from end to end. We will discuss securing

    Read more

  • Summary Cloudera has no official support for the Tez execution engine. They push their customers to use Impala instead (or Hive on Spark nowadays). This article describes how we set up the Tez engine on a CDH cluster, including the Tez UI. Install Java/Maven Follow the official instructions on how to install Java. Ensure the version is the same as the

    Read more

  • Summary Customers using software-only installations of Greenplum have the option to configure mirrors that fit their availability and performance needs. This post describes an approach that can be used to maximize availability and performance when a Greenplum cluster uses servers on 2 or more racks. Rack/Server setup In a typical data center, customers usually have

    Read more

  • Summary Greenplum is an MPP data warehouse platform based on the PostgreSQL database. We discuss one of the most important and most common maintenance tasks that needs to be executed periodically on the platform. What causes bloat? The Greenplum platform is ACID compliant. The isolation property ensures that the concurrent execution of transactions results in a

    Read more

  • Summary This post deals with a little-known problem that HAWQ has on Hadoop clusters, specifically creating too many small files on HDFS even when there’s a negligible amount of data in tables. This post talks about our encounter with this problem and how we overcame it. Bear in mind that the solution may not work for

    Read more

  • Summary Over the past couple of years, Apache Hadoop has made great progress in the area of security. Security for any computing system is divided into two categories: Authentication is the process of ascertaining that somebody really is who they claim to be. In Hadoop, this is achieved via Kerberization. This post will not cover any

    Read more