WebDwarf No Me Gusta
I blew up my old index page experimenting with WebDwarf for a customer. I officially don't like that HTML editor. I am not sure what Todd sees in it. It's probably for the best though as I have been playing with some new Emacs-based planning software that I think is going to be better for this sort of thing anyhow.
Emacs makes me very happy.
At long last it is almost time for our long-anticipated trip to Disneyland. I can't hardly wait. I would be even more excited if it weren't for the fact that we had some issues at 0catch.com over the weekend.
I am not entirely sure what caused the problem, but the stats box wasn't serving up files, and that basically killed the whole site. I really should come up with a way to back that particular box up. Perhaps something as easy as rsyncing the files somewhere else.
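The rsync idea could be as small as the sketch below. It is only an illustration: the function names are made up, the paths and the backup host in the usage note are placeholders, and it assumes rsync is installed on both ends. The only rsync options used are `-a` (archive mode) and the optional `--delete`.

```python
import subprocess


def build_rsync_command(source_dir, destination, delete=False):
    """Return the rsync argument list for mirroring source_dir to destination."""
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    command = ["rsync", "-a"]
    if delete:
        # Remove files on the destination that no longer exist on the source.
        command.append("--delete")
    command += [source, destination]
    return command


def run_backup(source_dir, destination, delete=False):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, destination, delete), check=True)
```

Calling `run_backup("/var/stats", "backuphost:/srv/stats-backup")` from cron would be one way to use it; both of those names are hypothetical.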
I also lost a drive on Catchusers456. Fortunately the array had a unit that wasn't being used that I could borrow a drive from.
KaeLynn Continues to be Sick
Today is my third day working from home. Things have gone fairly well, but the added worry of having to watch out for KaeLynn is starting to wear on me. On the plus side I have gotten most of the Bullysports.com stuff finished. Unfortunately, I haven't done quite as well on the 0catch.com front. It's not that I haven't been trying, but I am a little stuck. Hopefully the tactic I am looking into today will make the difference.
On the bright side the hardware at 0catch.com continues to work as expected. I haven't had a fire in a while now. I might still be having some problems with the banner server. We are serving up impressions, but not as many as our customers would hope. It is possible that we are simply backed up from our downtime and have reached the limit of our ability to serve up banners. popup_mainsite.js was skipping the banner 40% of the time, so I have changed it to skip the banner only 5% of the time. I am hoping that this will help. I am going to look at the total number of banners served to see if we are serving up as many banners as we should be.
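The effect of going from 40% to 5% is easy to estimate with a little arithmetic. The sketch below is not the logic in popup_mainsite.js (that file is JavaScript and isn't shown here). It is a hypothetical Python model that assumes each page view independently skips the banner with a fixed probability.

```python
import random


def should_serve_banner(skip_rate, rng=random.random):
    """Return True when this page view should get a banner.

    skip_rate is the fraction of page views that get no banner (0.05 = 5%).
    rng is injectable so the decision can be tested deterministically.
    """
    return rng() >= skip_rate


def expected_impressions(page_views, skip_rate):
    """Estimate how many banners are served for a number of page views."""
    return round(page_views * (1 - skip_rate))
```

Under that assumption, 1,000 page views yield about 600 impressions at a 40% skip rate and about 950 at 5%, so the change alone should raise impressions by a bit more than half if nothing else is limiting them.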
Somebody Turned Out The Lights
I am not exactly sure what happened, but a whole pile of servers were turned off and didn't come back on at 0catch.com tonight at about 10:00 pm. When I drove in to turn them on I could hear the generator going, so it is possible that had something to do with it. We probably need to clean up how our power is set up.
It turns out that my sentiment about "cleaning up the power" was a good one. At 1:30 am the same machines decided to reboot again. This time I paid better attention to what was wrong and I noticed that several of our power strips were plugged into sockets that weren't connected to the UPS. So I unplugged everything (again), and fixed that. Then I got into the BIOS on the machines that didn't reboot automatically and I fixed that as well. If I had taken care of that at 11:00 pm I could have saved myself a trip and a night's sleep.
On the bright side our power situation is much cleaner now. Unfortunately it is 3:00 am and I haven't really been to sleep yet.
The Winds of Change
This last weekend actually worked out really well. I didn't go near my computer for two whole days. What's more, Catchusers3 decided to blow up during the week. I can't hardly believe it. Catchusers3 failed catastrophically today at around 4:00 pm MDT. I am not surprised that it failed. In fact, I have been dreading its eventual failure for some time now. What was surprising is that it failed A) during business hours, and B) while I was busy working on it. This guaranteed that I not only knew about the failure, but that I was in a position to do something about it immediately.
At first, doing something about it consisted of trying to find out why it was rebooting every few minutes. So I dutifully replaced fans and looked at the logs trying to find something that seemed out of the ordinary. My guess is that the 3ware card is completely hosed, but I only say that because the 3ware card in that machine has been nothing but trouble.
After some time it became clear that Catchusers3 was not going to be easily revived, and so I began to look at what it would take to replace Catchusers3 with the backup box. It turns out that it didn't take much. I had a very recent backup and so I simply set up the backup box to take Catchusers3's place. It turned out that things went amazingly well. I can't believe how happy I am to finally have Catchusers3 out of the picture.
More to the Story
Of course, while I was writing the first little bit my pager went off. It would appear that my new NFS server has some performance issues. The backup box serves up the files with far fewer spindles than Catchusers3 had at its disposal. It would appear that with a bit of ext3 tuning things are going to be just fine.
I Don't Like Weekends
Catchusers3 decided that it was tired again this weekend. The new drive that we put in it failed. That would be fine actually, we have plenty of redundant drives in that array. The real problem is that every time that stupid 3ware card hiccups the NFS server on the box stops working. This makes all of the web servers back up waiting for files on Catchusers3, and everything grinds to a halt. So I spent most of Saturday (and a good portion of Sunday) trying to keep Catchusers3 online. At one point I nearly decided to switch to the backup server. Between that and the problems I am having with the baseball launch at Bullysports.com I am getting really tired of computers.
Longest Weekend Ever
This weekend's tragic story of woe and misery actually started on Friday. Or at least I think it started on Friday, it's all a little blurry. It could have first started on Thursday. Anyway, sometime before the weekend the machine that ran the front page extensions and our banner ad server decided that it was time for a rest. It went from running just fine to not running at all. One of the drives in the mirror decided it didn't have any file system at all, and the other drive's file system was very corrupted. The 3ware card in the box wouldn't even boot until I pulled the worse of the two drives out completely. I was very glad to see the box come up, as I hadn't ever tested the backups. It turned out that my worries were well founded. The backup scripts hadn't really ever worked. I was able to get a backup of everything that wasn't completely mangled by the file system however, and that backup proved to be very important. More on that later.
So, anyway, I got the box back up (mostly), and at first I set about keeping it up. The fact that the box came up not only allowed me to get a more recent backup of most of the files, it also allowed me to go out on a date with my wife on Friday evening. I'm thankful for that small miracle.
I originally figured that this was simply the bad luck of nearly losing both drives in a mirror at the same time. So I got a fresh drive from work (despite the fact that the SMART stuff on both of the drives said they checked out), and I tried to rebuild the mirror while I tried to get the banner server back working. The cgi scripts that make up this application had been corrupted badly and I was forced to restore from a fairly old backup. Danny Ashworth called quite a bit and asked how things were going. They never managed to go well.
By Saturday at about noon I had been forced to reboot the frontpage/banner server any number of times, and I had come to the conclusion that something was very wrong, probably with the 3ware card. So I decided to take one of our new servers and use it to replace our existing box. So I went to fetch the backup off of our brand new, very fancy, backup server and I was surprised to find that I could not ssh into it. It turns out that I had accidentally unplugged the box while fiddling with the cords to the frontpage box. Worse, when I did try and fire it back up two of the four arrays (including the one with the operating system) wouldn't come back up. By this time I was so frustrated I could spit. Fortunately the array that had the information I needed was still fine. Unfortunately, I was going to have to install a new operating system on the box before I could get the information off of it.
As an added bonus there was no way to pull the backup box out of the rack to put a CD rom drive in it so that I could do a quick install off of my trusty Debian CD. I have been wanting to set up a netboot server for a while so that I could automate installs. All of a sudden I was stuck until I could get something like this set up.
I decided to use my laptop as a temporary DHCP and netboot server and it is a good thing that I did. It turned out that the FAI automated install project had a new easy-install CD. I thought that it was something that could be used as a live CD, the sort of thing that you put into your machine and it boots off of the CD without harming the contents of your hard drive. Instead it turns out that it automatically wipes out your existing installation and installs an FAI server. I had been meaning to try out the new beta of Ubuntu. Now my laptop is not functional until I reinstall. Fortunately I didn't test on a machine that had important information on it.
So now I had the frontpage/banner server down, the brand new backup server was down, and I had just nuked my laptop as well. Things were not going well for me.
The FAI software did teach me quite a bit about PXE netbooting, and so I was able to set up a simple DHCP, TFTP, PXE netboot system that booted into the Debian install process. It wasn't quite what FAI promised, but for bringing up one machine it was perfect. Chances are good that I won't end up setting up FAI at all in the long run. A combination of saved lists of Debian packages and our own code from subversion should be more than good enough.
By Saturday night I had the backup server up enough that I was able to get the files I needed. So I started work on restoring the frontpage server. Debian has libapache-mod-frontpage-mirfak in contrib and so I decided to start from there instead of simply trying to install my own version of apache from scratch in /usr/local. In retrospect this may have cost me some time. Although I doubt that I could have simply copied over the files from the crusty old Red Hat machine to the new Debian one and had them work, and there was no way that I was going to try and install some crusty version of Red Hat unless I absolutely had to. If I was going to go through all of this work I at least wanted to end up with something I could maintain.
It took me most of Sunday to finish up, and I learned a lot about strace, but in the end I had the front page extensions working again. In the end the magic incantation wasn't even that difficult. I simply had to change the www-data user and group so that it used the uid/gid of 99 instead of Debian's default of 33. This is so that it would match the 99 (nobody) used on the NFS servers. I also had to replace the fpexec that came with mirfak with a small wrapper that basically calls Mark's perl script. I modified the existing httpd.conf so that it matched the new Debian setup, and I also had to make sure that permissions were kosher everywhere. Finally I had to create a conf directory in /etc/apache with symlinks to some of apache's config files. Frontpage looks for some files in odd places. I probably should have just created a symlink called conf in /etc/apache that pointed to /etc/apache.
Of course, finding out what needed to be done took the better part of a day. The debugging output for frontpage is horrible.
I was very glad that Mark wrote the frontpage stuff and not Matt.
Unfortunately I still have to make the banner server go, and that is all Matt's handiwork. So today should prove to be interesting.
Now it's Monday and I've been up since 6:00 running the backups (manually), preparing to get the banner server up and serving, and writing about my adventures. I hopefully should be able to get the banner server stuff in place by the end of today. If not, I am going to try and replace it instead.
Of course, that still leaves what I should have been working on this weekend. I'm very close on the new pay type, but I am not going to be able to work on it until the crisis is well and truly over.
I Almost Forgot: Update!
Along with the hardware issues that I had over the weekend, I also had to deal with two denial of service attacks this weekend. The first was an attack against our CGI server that caused everyone running CGIs serious problems. I stopped that with a bit of Apache configuration. The second attack on Saturday was more serious. Basically someone was attacking our sign up process and adding so many bogus accounts that real customers couldn't get in. I tried fixing the problem with an Apache configuration, but in the end I simply dropped packets from various ranges of IP addresses with iptables. In both of these cases I probably should have used some sort of configuration on the router, but I am not very familiar with it.
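For the record, dropping packets from a handful of address ranges comes down to one iptables rule per range. The sketch below only builds the commands and does not run them. The helper name is made up, the ranges in the usage note are reserved documentation addresses rather than the ones actually blocked, and it assumes the rules belong in the INPUT chain.

```python
import ipaddress


def build_drop_rules(cidr_ranges):
    """Return one iptables command (as an argument list) per address range.

    Each range is validated and normalised first, so a typo raises
    ValueError instead of producing a malformed firewall rule.
    """
    rules = []
    for cidr in cidr_ranges:
        # strict=False lets "198.51.100.7/24" stand in for "198.51.100.0/24".
        network = ipaddress.ip_network(cidr, strict=False)
        rules.append(["iptables", "-A", "INPUT", "-s", str(network), "-j", "DROP"])
    return rules
```

For example, `build_drop_rules(["192.0.2.0/24", "198.51.100.7/24"])` returns two commands that could be reviewed and then run as root, for instance with `subprocess.run(rule, check=True)`.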
Nothing But Work
BullySports.com actually got sold, and that means that I am now busier than ever. I have been under so much pressure that today, after making sure that the NCAA tournament stuff would run I have gone home to work. I really need to get the new pay type finished, and it was simply impossible so close to my phone.
Working From Home
I've worked from home the last few days, and things really went well. There are just too many distractions at work. People know where to find me, and I get asked a million questions. Worse, Joe tracks me down and asks me to do BullySports.com stuff. External monitoring with Nagios is mostly set up, and I should now start getting email messages if things like mail, frontpage, or CGI go down. The internal setup is likely to be a lot more complicated, but I've got a good start and I am feeling pretty positive about things.
Working at home has also afforded me some time to experiment with improvements for our mail system. Unfortunately there is no one particular way to do what needs to be done. Theoretically I should probably stick with qmail, but I just am not very happy with our current setup. Perhaps if it was a solid setup based on a Debian default install I would be happier. Currently what we have is an amalgam of baling wire and chewing gum. Heck, even duct tape would be an improvement.
All Day on the Phone
Todd had a sick kid and so I was on the phone all day. I didn't get anything else done.
More Time on the Phone
Today worked out fairly well, but I spent a lot of time on the phone. Generally speaking that's not good for business, but today might have been an exception. I learned some interesting stuff, and I maybe helped make some deals.
I also got in a fat pile of kettlebell snatches today. I am going to be huge.
We had email problems over the weekend, and I was out of town doing my best to not check my email. So I did not notice there was an issue. That basically guaranteed that today was not a sweet day. The first work day after a three day weekend is never good. The first work day after a three day weekend in which email stopped working is a guaranteed nightmare.
On the bright side, Nagios got moved way up the list of things to do, ASAP.
That's basically what I accomplished today. I wiped the old helpdesk box. I installed Debian and Nagios and tomorrow I am going to start knocking out hosts. Some of the changes I make will also apply to the external monitoring I do from my home server.
I haven't written in a while, and that is a good thing. Things at 0catch.com are pretty much going according to schedule. I am currently working on a signup process that I will use in a number of 0catch.com sponsored products. It has taken me a little longer than I expected because I wanted to take a look at some newer technologies, but I am confident that the time is going to end up being well spent.
I continue to spend a good portion of each day on the phone with customers. In a way I sort of like it, but it does make it much harder to get things done. Things are likely to slow down even more as I am forced to spend more time on BullySports.com. The good news is that at least I won't have to write too much Java.
I finally was able to pick up my new glasses today, and I think that they are going to make a big difference in my life. Things are still a bit fuzzy as my brain gets used to the new stimulus, but apparently that is part of the condition that I have. My eyes are really good at overcompensating. The doctor told me the name of the condition, but it was long, and I didn't really care that much about the technical bits.
Uneventful Weekend Typical Monday
0catch.com mostly just worked over the weekend. manager2 ended up with such large log files that it killed apache, but that was easy enough to fix. I made a little progress with the new banner server software, and I helped Dr. Fimio a bit more.
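A cheap guard against a repeat would be a periodic check for oversized log files. The sketch below is one hypothetical way to do that. The function name is made up, and the directory and size limit in the usage note are assumptions rather than manager2's real values.

```python
import os


def find_large_files(directory, limit_bytes):
    """Return (path, size) pairs for files larger than limit_bytes, biggest first."""
    large = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable (log rotation, for example).
                continue
            if size > limit_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Something like `find_large_files("/var/log/apache", 1024 ** 3)` would list any log over 1 GiB, which could then be mailed out or fed into a Nagios check.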
More Fun With Fimio
www11 was dead (it responded to pings, but it was not serving up files or answering ssh requests) when I checked in on the servers this morning and so I spent a bit of time at the data center resurrecting it. The second I got into the office I got a call from Dr. Fimio with more frontpage problems. I am running some tests right now. Hopefully I can either sort out a fix or figure out how to get Frontpage 2003 to use FTP.
Further NVU Musings
Inspired by my failure to bend Frontpage to my will today I wrote several more paragraphs in my NVU Tutorial when I got home from work.
Frontpage 110 Jason 0
Frontpage is officially the crappiest piece of software ever. I just spent an hour on the phone trying to get the stupid thing to work and it just isn't giving me any joy.
It is possible that I have solved Dr. Fimio's longstanding issue with Frontpage. That makes me very happy as he is quite possibly the most patient person on the face of the planet. He not only called us on several occasions (and was always polite), but he called Microsoft's help desk in India. If you are looking for a classical Latin mass at a cathedral near you take a look at his site.
www20 had problems for most of yesterday. Apparently it ran itself out of memory and the oom-killer killed cron and sshd before stopping apache. That basically guaranteed that I had to actually go hard boot the machine. I really need to set up the remote power units, but that will probably remain a project for another day.
Too Much Time On The Phone
I spent too much time on the phone today. I have all of these things that I need to do, but it is impossible to concentrate when I have to teach people how to use Frontpage or how to set up Outlook to get mail. I am spending some more time tonight writing documentation and I am going to create a centralized FAQ documentation page so that I can simply point people in the right direction.
Today has been a pretty good day. bluehost.com got hacked and so I had to scramble to remove accounts from our servers, but other than that things went according to plan.
Much More Like It
For the first time in a very long time things went well at 0catch.com today. We've had no mysterious outages, no dire emergencies, and no huge problems. I even managed to make some progress on some of my actual tasks. I sorted out (I believe) the banner ad issue for onecoolhost.com, and I have a query into Adlandpro.com's DNS provider about wild card entries. I even spent some time on a Clickbank page.
I almost hate to have some hope again, but it would appear that 0catch.com is finally back on an even keel. Catchusers3 has behaved itself well, and we haven't had any serious network issues. There is a blip on our MRTG data for today, but I think that it was a problem with MRTG and not with the actual network. Certainly Nagios showed that we lost some packets, but we didn't go down for more than a minute or two.
I spent a good portion of today on the phone. Todd had something come up and so I logged into the phone system. Every time I put the phone down it rang again.
I had a lot to do today, and I didn't get that much of it done.
Matt apparently wants to make more changes to the network. I don't know what he wants to do, but it is not likely to be good.
Good Start This Morning
Everything was up when I woke up this morning. Even cooler, it appears that my script that restarts NFS services on catchusers3 actually worked. I think that it ran once and kept everything running. It's hard to tell though because the clock on catchusers3 is still completely wrong.
The graphs on the system.0catch.com do something. Hooray for me.
It would appear that replacing the drive did not sort out the problems with the array in catchusers3. I can't even begin to explain how distressing that is to me. I wouldn't even care that the array keeps having problems (as it rebuilds itself just fine in very little time), but every time it hiccups the NFS service shuts off. That is very annoying. Before I go to bed I am going to see if I can't whip up a cron script that checks to see if NFS is up and if not restarts it. That's hardly ideal, but I am not sure what else to do.
So I restarted NFS on catchusers3, but not all of the webservers saw that it had returned. This means that I got to drive out to the data center and restart machines. Hooray for technology.
I whipped up a quick script in python that runs on catchusers3 every minute and checks to make sure that NFS is up and running. If it isn't running then it restarts NFS. I just tested it and it appears to work. If it does work then I might be able to sleep again some time soon.
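The original script isn't reproduced here, but a watchdog along those lines could look like the sketch below. It is a reconstruction under assumptions, not the script that runs on catchusers3. It treats NFS as "up" when `rpcinfo -p localhost` lists an nfs service, and the restart command is passed in by the caller because the right init script depends on the distribution.

```python
import subprocess


def nfs_is_registered(rpcinfo_output):
    """Return True if `rpcinfo -p` style output lists an nfs service."""
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # The service name is the last column of each rpcinfo line.
        if fields and fields[-1] == "nfs":
            return True
    return False


def check_and_restart(restart_command, run=subprocess.run):
    """Restart NFS if it is not registered; return True when a restart was issued.

    run is injectable so the logic can be exercised without touching a server.
    """
    result = run(["rpcinfo", "-p", "localhost"], capture_output=True, text=True)
    if result.returncode == 0 and nfs_is_registered(result.stdout):
        return False
    run(restart_command, check=True)
    return True
```

Run from cron every minute, a call such as `check_and_restart(["/etc/init.d/nfs-kernel-server", "restart"])` would do the job on a Debian box. That path is an assumption, and catchusers3 may well use a different init script.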
The other thing that has to happen is that I have to sort out the issues with catchusers3's clock. It isn't even close to being correct, and even ntp doesn't seem to keep it in check.
Perhaps I just need to come up with a plan to replace catchusers3. If the raid card is dead then trying to fix it is probably just a fool's errand. The real problem is that I need to find a way to migrate customer data off of this machine and to some other machine. It doesn't help that catchusers3 apparently does some magic for the cgi users (like routing their mail).
I love Saturdays.
Keeping My Fingers Crossed
I spent almost all of today shuttling back and forth between the data center and my office trying to finally get everything squared away. Unfortunately, I was only moderately successful. The good news is that catchusers3 has been solid since the rebuild last night. The bad news is that we had another outage today. Bluehost got DDOSed and one of the things that they did to alleviate the pain was to turn off ICMP on the core router. It would appear that the Alteon thinks it is dead if it can't ping its gateway. Since it is dead, it doesn't bother routing packets.
Steve said that he would make an exception for 0catch.com next time he turned off ICMP. I am also going to spend some time seeing if I can learn to turn that feature off. Hopefully we can avoid a repeat performance.
On the bright side Steve also hooked me up with some nice MRTG traffic graphs. This should save me from having to set up something similar in the short term.
Both the banner box and www13 decided to give up the ghost today. Actually, I think that the banner box has been dead for a couple of days and I just didn't notice with everything else going crazy. The banner box is back up and running. It had a bad power supply, but www13 probably won't be back up for a while. It runs for 30 minutes or so when I reboot it, so it probably has a dodgy power supply as well, but I am all out of spares.
Heading into the Weekend
Between the outage and the problems with the banner box I did not get Nagios set up today. That's especially unfortunate because it means that I am going to have to spend the weekend worrying about the 0catch.com network.
Catchusers3 All Better Now
Hopefully that will be the end of Catchusers3 getting tired of serving up files.
Catchusers3 on the Mend
The new network settings appear solid. We haven't had a single log entry on the Alteon in several hours. The array on Catchusers3 is rebuilding, but that shouldn't take long, as Catchusers3 doesn't hold much information. Hopefully things will settle down enough so that I can clean up and get into work. The thought of a long night tonight is unappealing, but at least I will be less at the mercy of the bluehost folks.
The configuration of the Cisco really had me worried. Networking gear is not my strong point. It is a huge relief to see that the network seems to be working well.
Catchusers3 Has Problems
If you can see this page then catchusers3 must be still soldiering on. However, it definitely has some issues. I think that I found out why the NFS services keep turning off. Apparently one of the drives is having issues.
I Sure Am Glad I Stayed Up All Night
I ended up staying up until 3:30 am. Partly this was because I was sure that at any moment I would get a call from Matt telling me that everything was broken, but it was partly because I wanted to make sure that the changes I made to the Cisco switch weren't going to have regressions. In fact, until about midnight I was really concerned about getting the switch to work reliably. I actually had catchusers3 and the cgi box connected directly to the Alteon for a while (don't tell Matt). That worked, but I am not going to pretend that I know all of the possible consequences of such a setup.
Still No Call
It's now 1:49 am and I was just getting ready to give up on hearing from the folks at bluehost and all of a sudden I can't ping 0catch.com. That's just brilliant. On the plus side I can't ping bluehost either.
After a bit of experimentation it would appear that I can't ping Yahoo or Slashdot either. Perhaps xmission.com is having issues?
Whatever it was it cleared itself up fairly quickly. It's now 1:54 am and things are back up. I'd better get to bed before we have a real problem.
0catch Will Be Firewalled
It turns out that the configuration on our network hardware is as old and crusty as everything else at 0catch.com. Both the Alteon and at least one of the Cisco switches had the spanning tree protocol on and it caused us (and bluehost) major problems. I turned off stp on the Alteon and I actually reconfigured the Cisco from scratch. Matt will probably deny that these devices had stp enabled (or he'll blame me for turning it on), but I had to actually borrow a cable from the bluehost admins to even log into the Cisco.
I've finally made it home for the night, but I am guessing that at any moment I am going to get a call from the guys at bluehost telling me that they are about to isolate our network from their network. Matt said he would call, but it is 12:40 am and so far no joy. I'll probably get a page from Nagios as soon as I get in bed.
On the plus side all of our machines are absolutely using our new resolver boxes and the load on them is basically nil. It's nice to know that I won't have to airlift in servers to handle the load.
Perfect. More network issues caused another hour of downtime. I am starting to get really really sick of all of this crap. It turns out that this time the problem was that the Spanning Tree Protocol was activated on our switch and somewhere on bluehost's network one of their switches shared the same id. The best part was that whenever I would unhook our switch bluehost's office network would go down. That made the bluehost admins much less excited about making changes to the network as they tried to figure out what the problem was.
It turns out that Google has finally indexed this page. Hooray for Google. I wonder why it took them so long.
Sam wanted to see how this worked. Pretty straightforward, eh?
Today has not been my favorite day. Todd was unable to come to work today because he was sick, and that meant that I spent a good portion of the day on the phone and answering email.
For some reason catch3 decided to stop serving up files today. To make things worse it happened during Brooklyn's baptism so even if I was paying attention I wouldn't have been able to do anything about it. I have got to work something out so that this sort of thing doesn't happen. I was notified of the problem by one of our customers that just happened to have my email address. Thanks Ken. I really do appreciate it.
Catch3 wasn't down, and it came up with a simple restart of the NFS service. Unfortunately some of the web servers weren't so fortunate. I had to drive in and restart everything. It appears that somehow www20 didn't make it back up. I might have forgotten to turn it back on. I am not going to worry about it until Monday though unless something else comes up.
The last two days I have spent a ridiculous amount of time on client phone calls and crap. By crap I really mean trying to work out my stupid insurance issues. I really need to work harder at getting people off of the phone. Unfortunately the only thing I can do about my insurance issues is wait.
Despite the fact that I spent a lot of time on the phone today I still managed to get quite a bit done. Adlandpro.com is well on its way to being set up and I set up Dreamstation.com so that it has the PHP support that it needs. Both of those things were pretty big projects.
I made some headway last night on my NVU project. Actually, I successfully created a site, now I just need to document the process. I think that I am to the point where I at least need to add an NVU link to the User Manager. I know that Todd loves WebDwarf, but I am not impressed.
NVU Quite Possibly Has Some Redeeming Qualities
Perhaps I was a little premature with my disregard of NVU. It would appear that it isn't entirely impossible to do div-based layouts with NVU. It also appears that the built-in CSS editor is somewhat handy. It took me a bit of time to find it, but once I did things seem to work fairly well.
I am still planning on using Emacs for all of my editing needs. Including my HTML editing needs (for one thing Emacs makes sure that I don't spell editing as "editting"), but NVU could make creating CSS-based layouts a lot easier.
One thing is certain. I have to do something. I can't hardly pretend that I am going to be able to teach newbies Emacs.
NVU Is Not My Favorite
It turns out that the Ange FTP stuff that Emacs uses doesn't like my firewall at work with the default settings. Setting "ange-ftp-try-passive-mode" to 't makes it work like a charm. I wonder why that isn't the default?
I hope to finally get around to using this page for something. Today has been a bit of a hair-raiser. The email server filled up and I basically spent the entire day cleaning up that particular mess. After that I received several calls where I was asked to help someone create HTML from random formats (Powerpoint, for example), and I decided that I needed to spend some time learning to use NVU.
What's more interesting is that while I certainly don't understand how I actually design pages in NVU, updating pages in NVU appears to be pretty straightforward. Of course, once you have the template, editing this file wouldn't be difficult in Emacs either.
It is sort of neat using a WYSIWYG editor for this kind of thing. I could possibly get used to this.
Today I've had to deal with two calls where people were trying to use random Microsoft programs to create HTML. It is critical that I learn to use NVU so that I can point people in the direction of a tool that doesn't suck. Sorry, WebDwarf does not qualify as a tool that doesn't suck.