How to achieve 1 GByte/sec I/O throughput with commodity IDE disks
Jens Mache,
Joshua Bower-Cooley,
Jason Guchereau,
Paul Thomas, and
Matthew Wilkinson
Lewis & Clark College
Portland, OR 97219
The Problem
In order to compete with custom-made systems, PC clusters have to provide
not only fast computation and communication, but also high-performance
disk access. I/O performance can play a critical role in the completion
times of many applications that transfer large amounts of data to and from
secondary storage, for example simulations, computer graphics, file serving,
data mining or visualization.
An I/O throughput of 1 GByte/sec was first achieved on ASCI Red with I/O
hardware costing over one million dollars. We set out to achieve similar
I/O performance on our PC cluster by harnessing the power of commodity IDE
disks on remote nodes.
The Approach
Our goal was an I/O throughput of 1 GByte/sec on a PC cluster that
(1) has as few as 32 nodes and (2) uses less than ten thousand dollars'
worth of I/O hardware. With the load spread evenly over 32 nodes, each node
must be able to access data at a rate of at least 32 MBytes/sec.
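The per-node requirement follows from dividing the aggregate target by the node count. A minimal sketch of that arithmetic, assuming binary units (1 GByte = 1024 MBytes) and a perfectly even load across nodes; the function name is hypothetical:

```python
# Back-of-the-envelope check of the per-node requirement.
# Assumption: 1 GByte = 1024 MBytes and the load is spread evenly.
TARGET_MBYTES_PER_SEC = 1024  # aggregate goal of 1 GByte/sec
NODES = 32                    # cluster size stated in the text


def required_per_node_rate(target_mbytes_per_sec, nodes):
    """Return the rate each node must sustain to reach the aggregate target."""
    return target_mbytes_per_sec / nodes


per_node = required_per_node_rate(TARGET_MBYTES_PER_SEC, NODES)
print(f"Each node must sustain {per_node} MBytes/sec")
```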
The novelty of our approach is (A) to use two commodity IDE disks (not SCSI
disks) on each node in a software RAID configuration and (B) to configure
the parallel file system such that each node acts as both I/O node and
compute node.
In our first experiment, we measured the local read and write performance of
our two IDE drives (IBM 20GB ATA100 7200rpm costing $112 each), configured as
a software RAID 0. Using the Bonnie disk benchmark, we measured up to 68.23
MBytes/sec.
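RAID 0 gains throughput by striping consecutive chunks of the logical volume across the member disks, so large sequential transfers keep both drives busy. The sketch below illustrates that mapping for two disks. The chunk size of 16 blocks is a hypothetical value chosen for illustration, not the setting used in the experiment.

```python
def raid0_locate(logical_block, num_disks=2, chunk_blocks=16):
    """Map a logical block number to (disk index, block number on that disk).

    Assumes plain round-robin striping; chunk_blocks is an illustrative value.
    """
    chunk = logical_block // chunk_blocks   # which chunk the block falls in
    offset = logical_block % chunk_blocks   # position inside that chunk
    disk = chunk % num_disks                # chunks alternate between disks
    chunk_on_disk = chunk // num_disks      # how many chunks precede it on that disk
    return disk, chunk_on_disk * chunk_blocks + offset


# Consecutive chunks land on alternating disks.
for block in (0, 15, 16, 31, 32):
    print(block, raid0_locate(block))
```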
In our second experiment, we measured the performance of a concurrent read/
write test program that sits on top of PVFS, an open-source parallel file
system. Parallel file systems allow transparent access to disks on remote
nodes. We configured each machine as both an I/O and a compute node to best
make use of our limited number of nodes. Using MPI and the native PVFS API,
I/O throughputs were well above 1 GByte/sec. We achieved up to 2007.199
MBytes/sec read throughput and 1698.896 MBytes/sec write throughput (with
appropriate file view and stripe size such that most disk accesses were
local).
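To see why file view and stripe size matter when every node is both I/O node and compute node, consider a simplified model in which a file is striped round-robin across the nodes, starting at node 0. This is a sketch of the idea only, not the PVFS API; all function names and the 64 KByte stripe size are assumptions made for illustration.

```python
def io_node_for_offset(offset, stripe_size, num_nodes):
    """Node that stores the byte at `offset` under round-robin striping."""
    return (offset // stripe_size) % num_nodes


def local_fraction(rank, regions, stripe_size, num_nodes):
    """Fraction of bytes in `regions` (offset, length pairs) stored on `rank`."""
    local = total = 0
    for offset, length in regions:
        pos, end = offset, offset + length
        while pos < end:
            # Walk the region one stripe unit at a time.
            stripe_end = (pos // stripe_size + 1) * stripe_size
            n = min(end, stripe_end) - pos
            if io_node_for_offset(pos, stripe_size, num_nodes) == rank:
                local += n
            total += n
            pos += n
    return local / total if total else 0.0


def strided_view(rank, stripe_size, num_nodes, rounds):
    """Each rank accesses exactly the stripe units stored on its own disks."""
    return [((k * num_nodes + rank) * stripe_size, stripe_size)
            for k in range(rounds)]


def contiguous_view(rank, stripe_size, num_nodes, rounds):
    """Each rank accesses one contiguous block of the file."""
    return [(rank * rounds * stripe_size, rounds * stripe_size)]


NODES, STRIPE, ROUNDS = 32, 64 * 1024, 32
print(local_fraction(5, strided_view(5, STRIPE, NODES, ROUNDS), STRIPE, NODES))
print(local_fraction(5, contiguous_view(5, STRIPE, NODES, ROUNDS), STRIPE, NODES))
```

In this model, a view aligned with the striping keeps every access on the local disks, while a contiguous partition sends all but 1/32 of the traffic over the network.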
In additional experiments, we measured the I/O performance of a ray tracing
application and studied how sensitive I/O performance is to configuration
and programming choices.
Our conclusions are as follows:
- High-performance I/O is now possible on PC clusters with commodity
IDE disks.
- Compared to ASCI Red, price/performance for I/O improved by over a
factor of 100. (To achieve 1 GByte/sec, we used 64 IDE drives costing $112
each, whereas ASCI Red used 18 SYMBIOS RAIDs costing $60,000 each.)
- In contrast to ASCI Red, I/O nodes in our cluster have a
higher throughput than the interconnect. (Using ttcp, we measured 38 to 46
MBytes/sec network throughput for our copper Gigabit Ethernet Foundry
switch and Intel cards. ASCI Red's SYMBIOS RAID can write data at 70
MBytes/sec, while the custom-made network can transfer data at 380
MBytes/sec.)
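The price/performance claim can be checked from the figures quoted above. The sketch below counts only the disk and RAID hardware named in the text and assumes both systems are compared at the same 1 GByte/sec throughput level; the variable and function names are illustrative.

```python
# Hardware counts and unit prices as quoted in the text.
IDE_DRIVES, IDE_PRICE = 64, 112
ASCI_RAIDS, ASCI_RAID_PRICE = 18, 60_000


def io_hardware_cost(units, unit_price):
    """Total cost in dollars of `units` devices at `unit_price` each."""
    return units * unit_price


ide_cost = io_hardware_cost(IDE_DRIVES, IDE_PRICE)
asci_cost = io_hardware_cost(ASCI_RAIDS, ASCI_RAID_PRICE)

# At equal throughput, the price/performance gain equals the cost ratio.
cost_ratio = asci_cost / ide_cost
print(f"IDE disks: ${ide_cost}, ASCI Red RAIDs: ${asci_cost}, "
      f"ratio: {cost_ratio:.1f}")
```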
Impact, Importance, Interest, Audience
Interest in cluster computing is at an all-time high.
While there is no I/O category in the top500 ranking (nor for SC awards) yet,
I/O performance is getting more and more attention ("the I/O bottleneck").
- The impact of our work is
- (A) showing how commodity IDE disks on remote nodes can be harnessed,
- (B) reporting I/O performance sensitivities, and
- (C) reporting extremely good price/performance (a factor of 100
better).
- Thus, parallel I/O now seems affordable, even for small businesses and
colleges.
- Our sensitivity results are valuable
- (1) as a basis for performance recommendations for application development,
- (2) as a guide to I/O benchmarking (which will play an important role
in compiling the new "clusters @ top500" ranking), and
- (3) as a guide to further improvement of parallel file systems.
Visual Presentation
First, we'll have a traditional color poster display (32"x40"), describing
the problem, our approach (IDE disks in RAID configuration, PVFS with
overlapped nodes), our experiments (graphs and tables) and our conclusions.
Second, we plan to show the performance of application and benchmark runs
"on demand". (It only takes a laptop and an Internet connection for us
to start programs on our cluster from Denver and get the performance
results back.)