Required by my research project, I need to setup a distributed file system which can store hundreds terabytes of music mp3 files while still remains highly robust and efficient.

Since the number of mp3 files is large, even though I have already installed the Hadoop on our cluster, but HDFS is not a good choice for those “lots of small” files, I need to re-architect the cluster. MogileFS which is developed by Danga Interactive (This company is the creator of LiveJournal) and successfully utilized by Last.FM, can satisfy the requirements very well.

Before I can install MogileFS onto the cluster, I need first to prepare some computer nodes for it. Our cluster has more than 70 nodes, and I’ll use 6 nodes to setup the MogileFS, so, first I need to recommission the 6 nodes from Hadoop.

According to one thread from ServerFault, First, we need to add the following lines to conf/mapred-site.xml, 

where in the value part, we put the name of the file containing the 6 nodes. After this, usehadoop dfsadmin -refreshNodes to notify HDFS to transfer the data from the 6 nodes and delete the 6 nodes from the system. We can monitor this process from the webpage http://hadoop_master_data_node/dfsnodelist.jsp?whatNodes=LIVE

Our cluster is deployed on Ubuntu, according to the incomplete document of MogileFS, installing MogileFS on Ubuntu is not that complicated.