Design Pattern: Avoid HAWQ creating too many small files on HDFS for tables

Summary

This post deals with a rather little-known problem that HAWQ has on Hadoop clusters: it can create far too many small files on HDFS even when the tables hold a negligible amount of data. Below is our encounter with this problem and how we overcame it. Bear in mind that the solution may not work for your use case, so you may have to get a little creative to work around the problem.

Problem

I am sure you have heard that for large, read-only tables, and especially ones with lots of columns, column orientation is a good idea. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically access only a small subset of columns in your queries; it is one of the standard strategies for reducing disk I/O. At our install site, HAWQ is used as an archival platform for ERP databases, which tend to have lots and lots of tables with a negligible amount of data in each. The data is rarely queried, and when it is, the queries tend to be simple (minimal joins). Going by the recommendation above, the architects had asked all internal customers to use column orientation as the default storage format.

After a few months, the name node started crashing because its JVM kept running out of memory. When we looked into the problem, the largest database, with 34,145 tables (all column-oriented), had 5,323,424 files on HDFS (the cluster has 116 segments). Overall, the number of files on HDFS across all databases in the cluster had reached 120 million, at which point the name node JVM kept running out of memory, making the cluster highly unstable.

The problem, in a nutshell, arises because HAWQ, like the Greenplum Database, implements columnar storage with one file per column. As you may already know, Hadoop does not do well with lots and lots of really small files. Let's look at the technical details:

[gpadmin@hdm2 migration]$ psql manoj
psql (8.2.15)
Type "help" for help.
-- CREATE TABLE
manoj=# create table testhdfs (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2)) 
manoj-# with (appendonly=true, orientation=column) distributed randomly;
CREATE TABLE

-- INSERT A ROW
manoj=# insert into testhdfs values (1, 2, 'v1', 'v2');
INSERT 0 1

Now, let's look at how many files this table created. HAWQ stores the files for a table on HDFS under /hawq_data/<segment>/<tablespace ID>/<database ID>.
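
If you want to map those path components back to database objects yourself, the IDs can be looked up in the catalog. This is a minimal sketch, run inside the target database, assuming the standard PostgreSQL catalogs that HAWQ inherits:

-- tablespace and database OIDs that appear in the HDFS path
select oid, spcname from pg_tablespace;
select oid, datname from pg_database where datname = current_database();
-- the data files for a table are named after its relfilenode
select relfilenode from pg_class where relname = 'testhdfs';

In the listing below, 16385 is the tablespace, 5699341 is the database, and the 5699432 files are the ones backing testhdfs.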

[gpadmin@hdm2 migration]$ hadoop fs -ls /hawq_data/gpseg0/16385/5699341
Found 7 items
-rw------- 3 postgres gpadmin 0 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/5699342
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.1
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.129
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.257
-rw------- 3 postgres gpadmin 0 2015-02-12 18:58 /hawq_data/gpseg0/16385/5699341/5699432.385
-rw------- 3 postgres gpadmin 4 2015-02-12 18:35 /hawq_data/gpseg0/16385/5699341/PG_VERSION

As you can see, it created one file per column per segment (even for segments that hold no data for this table). For a four-column table on a 116-segment cluster, that is over 450 HDFS files for a single row, and that is a HUGE problem.
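
If you suspect you have the same problem, a quick way to gauge your exposure is to inventory the append-only, column-oriented tables in a database and see how wide they are. The sketch below assumes the pg_appendonly catalog that HAWQ inherits from Greenplum 4.2:

-- append-only, column-oriented tables, widest first; each of these fans out
-- into roughly one HDFS file per column per segment
select n.nspname as table_schema, c.relname as table_name, c.relnatts as num_columns
from pg_appendonly ao
join pg_class c on c.oid = ao.relid
join pg_namespace n on n.oid = c.relnamespace
where ao.columnstore
order by c.relnatts desc;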

Solution

There are a couple of ways to deal with this. One option is to use the Parquet format to store the data:

with (appendonly=true, orientation=parquet);
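
For example, the test table from above could be recreated with Parquet storage as follows (same columns and distribution as before; the new table name is just for illustration):

create table testhdfs_parquet (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2))
with (appendonly=true, orientation=parquet) distributed randomly;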

This creates one data file per segment, regardless of how many columns the table has. You can use a nifty tool called "gpextract" to see where these files are stored on HDFS; in the output below, note that the table has exactly one file per segment, and only gpseg5 actually holds the row we inserted:

gpextract -o manoj.yaml -W testhdfs -dmanoj

[gpadmin@hdm2 migration]$ cat manoj.yaml 
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build
 8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
 23 2014 16:12:32
DFS_URL: hdfs://hdm1.gphd.local:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
 Checksum: false
 CompressionLevel: 0
 CompressionType: null
 EnableDictionary: false
 Files:
 - path: /hawq_data/gpseg0/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg1/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg2/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg3/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg4/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg5/16385/5699341/5699460.1
   size: 1371
 - path: /hawq_data/gpseg6/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg7/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg8/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg9/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg10/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg11/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg12/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg13/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg14/16385/5699341/5699460.1
   size: 0
 - path: /hawq_data/gpseg15/16385/5699341/5699460.1
   size: 0
 PageSize: 1048576
 RowGroupSize: 8388608
TableName: public.testhdfs
Version: 1.0.0

Another way is to simply copy the data onto HDFS in a flat-file format and create external tables on top of it. We used this approach, since it creates just one file per table in the whole cluster. We did not need the read performance, as this is an archival database, so it was an easy choice.
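
As a rough illustration of that approach, an external table over a comma-delimited file on HDFS could look something like the sketch below. This is not taken from our migration: the PXF URL, host, port, and path are placeholders, so adjust them (or use whichever HDFS protocol your HAWQ version supports) for your environment.

-- hypothetical example: the data copied to HDFS as a single CSV file,
-- exposed to HAWQ as a read-only external table via PXF
create external table testhdfs_ext (mynum1 integer, mynum2 integer, mytext1 char(2), mytext2 char(2))
location ('pxf://hdm1.gphd.local:50070/archive/testhdfs.csv?profile=HdfsTextSimple')
format 'TEXT' (delimiter ',');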

Conclusion

HAWQ provides a nice SQL layer on HDFS; use the tool in the right way to solve your business problem. In addition, the Parquet format is easy to use in HAWQ and doesn't lock you into a Pivotal HD/HAWQ-only solution: it is easy to use other tools like Pig or MapReduce to read the Parquet files in your Hadoop cluster.
