Distance learning and changing majors are both easy tasks for students at City University of New York, thanks to two Web-based software applications. Keeping the Linux servers that powered those applications running wasn't easy, however, due to constant server failures and the need for hands-on fixes.
The need for manual repairs for frequent Linux server crashes "translated into wasted time and money and, in some cases, downtime for important applications," said Arty Ecock, manager of VM enterprise systems for CUNY Computing and Information Systems (CIS).
Downtime for these core applications is just not acceptable, Ecock said. DegreeWorks, a degree auditing tool from SunGard Bi-Tech Inc., and Blackboard, a distance learning program from Blackboard Inc., are used by about 100,000 students, roughly one-quarter of CUNY's population. Blackboard is an e-learning software application that provides online teaching and learning tools. DegreeWorks lets students compare their credits and courses to degree requirements.
CUNY chose Red Hat Linux running on a single chassis of IBM blade servers to support these applications. Unfortunately, the servers had "laptop-quality IDE drives installed on each blade," said Ecock. "They would fail frequently." To fix each failure, CUNY's IT folks had to replace hard drives and provision the blades manually.
They should not have had to do this task manually. CUNY owned IBM's provisioning server, CSM (cluster systems management). "CSM
Rather than CSM, Ecock decided to use Red Hat's Kickstart installation software for installations and provisioning. With Kickstart, one can create a single file containing the answers to questions normally asked during a Red Hat Linux installation.
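To make the idea of a single answer file concrete, here is a minimal sketch in Python that renders one. It is an illustration only, not CUNY's configuration: the function name `render_kickstart`, the host name, the password hash and the package list are all hypothetical, and the exact set of directives accepted varies between Red Hat releases (newer releases, for example, also expect a closing `%end` after the package section).

```python
from textwrap import dedent


def render_kickstart(hostname: str, root_pw_hash: str, packages: list[str]) -> str:
    """Return the text of a minimal Kickstart answer file.

    Every value passed in is a placeholder chosen for illustration.
    """
    # Each directive answers a question the installer would otherwise
    # ask interactively (language, keyboard, partitioning, and so on).
    header = dedent(f"""\
        # Minimal Kickstart answer file (illustrative sketch)
        install
        lang en_US.UTF-8
        keyboard us
        timezone America/New_York
        rootpw --iscrypted {root_pw_hash}
        network --bootproto=dhcp --hostname={hostname}
        bootloader --location=mbr
        clearpart --all --initlabel
        autopart
        reboot
        """)
    # The %packages section lists package groups (@name) and packages to install.
    package_section = "%packages\n" + "\n".join(packages) + "\n"
    return header + "\n" + package_section


if __name__ == "__main__":
    # Hypothetical blade name and a dummy hash, not real credentials.
    print(render_kickstart("blade01", "$1$examplehash", ["@base", "httpd"]))
```

The same function can be called once per blade to produce one answer file per machine, which is the kind of repeatable installation the tool is designed for.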
"Unfortunately, we weren't as adept at using Kickstart as we'd hoped," Ecock said.
One hitch was that the documentation for the RAID adapter was difficult for Ecock to obtain. "The drivers were not in the Linux distribution base, and that was a problem," he said. "They were available, but they weren't in the distro base, so we had to provision using a floppy or CD."
Ecock hit another snag with the on-board RAID adapters. "The on-board RAID adapters make the on-board IDE drives appear as SCSI drives to the applications running on the OS, but we weren't fluent enough in Kickstart to make this happen," he said.
Once again, drivers caused the snafu. "These RAID cards needed special drivers, and for the life of us, we couldn't get the driver RPMs loaded into Kickstart," Ecock said. So, when he kickstarted a blade, he had to walk over to the console of the blade at the appropriate time and load the drivers for the hard drive. "This was a pain," he said.
Servers didn't fail just once, either. "The hard drives in our blades were frequently burning out," Ecock said.
Just to complicate matters, not all hard drive failures were alike. Depending on RAID's mirroring status, either the blades stayed up, or they were severely impaired and needed to be brought down.
"Luckily, we were pooling our servers, so one machine going down didn't always impact the applications," Ecock said. "But we did have an occasion where the application was hosted on a single machine and both went down for a day."
Things got really bad in one six-week period, when one blade hard drive died each week. Each server failure cost CUNY about eight hours of one person's labor, a high toll for the six-person CIS group.
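The labor toll of that stretch can be worked out from the figures above. This is a back-of-the-envelope sketch: it assumes exactly six failures at exactly eight hours each, whereas the article gives "about eight hours" as an approximation.

```python
# Figures taken from the article; the exact values are approximations.
WEEKS = 6                # the six-week period
FAILURES_PER_WEEK = 1    # one blade hard drive died each week
HOURS_PER_FAILURE = 8    # about eight hours of one person's labor

# Total hands-on repair time over the whole period.
total_hours = WEEKS * FAILURES_PER_WEEK * HOURS_PER_FAILURE

# Average repair time lost per week.
hours_per_week = total_hours / WEEKS

print(f"Total repair labor: {total_hours} person-hours")
print(f"Average per week: {hours_per_week:.0f} person-hours")
```

Under those assumptions, the period cost 48 person-hours, or roughly one full working day each week for one member of the six-person group.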
After two months of server crashes, Ecock began evaluating server provisioning solutions, including looking at advancements in Kickstart and IBM CSM. He chose the Intrepid Linux Management Appliance from Levanta Inc. in San Mateo, Calif.
Intrepid had several winning qualities, including its easy-to-use appliance model and diskless approach to provisioning. "As we deploy a blade for Levanta Intrepid, we remove the hard drives from the blades, so they become diskless," said Ecock.
Kickstart is no longer necessary. When a blade server needs to be provisioned, a template is created on the Intrepid. "As blades fail, we can simply use existing templates to re-provision the blade," he said.
Currently Blackboard and DegreeWorks run on 56 front-end servers running Red Hat Linux, Apache and Tomcat, all managed by the Intrepid appliance.
Ecock unhappily remembers the high costs of work days lost to those two months of server failures and hands-on repairs. "It was a big chunk of change that was far more than the cost of the Levanta appliance," he said.
Today, when a blade server fails, it can be re-provisioned in about 10 minutes.