As an engineer working on libraries, and as someone who cares about APIs in general, I always struggle with writing proper error messages. Worse than these are actual warnings: how do you best convey to someone using your library that they might be doing something “suboptimal”? It might even be right, but most probably… not. One way we do this in Ehcache is to warn the user, when we bootstrap the CacheManager, about “inconsistencies” in their configuration. While those won’t stop your application from starting, the cause of those warnings might actually make your application perform poorly. If your application relies on the cache to support the load (e.g. your database won’t be able to support peak traffic), such a “mis-config” would turn out to be lethal at the worst moment imaginable: peak traffic. If you’re lucky though, it will make your app perform poorly immediately. At which point, we hope, you’ll look through the logs!
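To make that idea concrete, here is a minimal sketch of what such a bootstrap-time check could look like. This is not Ehcache's actual code or API: CacheSettings, ConfigCheck and both rules are hypothetical, made up purely for illustration. The only point is that the check returns warnings instead of throwing, so the application still starts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical settings holder for this sketch; not an Ehcache type.
final class CacheSettings {
    final String name;
    final long maxEntries;        // 0 means "unbounded" in this sketch
    final long timeToLiveSeconds; // 0 means "never expires" in this sketch
    final long timeToIdleSeconds; // 0 means "no idle timeout" in this sketch

    CacheSettings(String name, long maxEntries, long timeToLiveSeconds, long timeToIdleSeconds) {
        this.name = name;
        this.maxEntries = maxEntries;
        this.timeToLiveSeconds = timeToLiveSeconds;
        this.timeToIdleSeconds = timeToIdleSeconds;
    }
}

final class ConfigCheck {
    // Collects warnings rather than throwing: a suspicious config still boots.
    static List<String> warnings(CacheSettings c) {
        List<String> out = new ArrayList<>();
        if (c.maxEntries == 0 && c.timeToLiveSeconds == 0) {
            out.add("Cache '" + c.name + "' has no size limit and no expiry, so it can only grow."
                    + " Set maxEntries or timeToLiveSeconds.");
        }
        if (c.timeToLiveSeconds > 0 && c.timeToIdleSeconds > c.timeToLiveSeconds) {
            out.add("Cache '" + c.name + "' has timeToIdleSeconds above timeToLiveSeconds,"
                    + " so the idle timeout can never trigger. Lower it or remove it.");
        }
        return out;
    }

    public static void main(String[] args) {
        CacheSettings users = new CacheSettings("users", 0, 0, 0);
        for (String w : warnings(users)) {
            System.err.println("WARN " + w);
        }
    }
}
```

Each message says what looks wrong and what to change, which is about all a warning can do. Whether anyone reads it is another matter.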
But users don’t read these messages!
Not only don’t they read them, but even when they see them they don’t READ them! So whatever you put in those messages never seems to be good enough! Even when that message is the last thing the user sees before everything stops working!
One day as a user
This morning, I woke up to a mail about a web application I wrote some 7 years ago being down… I log on to the machine and do indeed see httpd isn’t running. Well, that explains it all, easy one! But restarting httpd just doesn’t work… wtf?! So I go look at the logs, and there it is:
[crit] (22)Invalid argument: mod_jk: Could not set permissions on jk_log_lock; check User and Group directives
But how did that happen now, all of a sudden?! This app has been running for years… and restarted many times (but that’s another story)! Never an issue… EVER! Permissions on the file and directory are fine… What happened?! Double-checking the config, all seems fine. It’s using the proper locations and all…
Google to the rescue
I find a bug report with the same error message pretty quickly on ASF’s Bugzilla: Bug 39914. Well that was easy… but wait. Why now? Well, as it turns out, the server uses netboot and inherited a brand new kernel! That must be it! Somehow that must have triggered it. Even though neither that bug nor the one linked should actually affect my version, let’s see… and upgrading never hurts, I guess!
autoconf / automake fun !
Downloading the latest httpd and the latest Tomcat connector. I link mod_jk statically into the httpd process, so I might as well upgrade httpd too, right? And off I go…
- Crap! native connector won’t install… But I’m doing as the doc says!
- Ah! Right… I remember, need to configure httpd first!
- connector installed in httpd’s source dir
- re-configure httpd, make && sudo make install
- No!… error configuring
- Oh right, so macro didn’t get replaced, no problem, I can fix that
- Done… All installed !
- Start it …
- … fails ! SAME F$%^&ING ISSUE!!!
Alright, stay calm… It’s all good…
Stepping back for a second
So what’s busted here again ?
mod_jk: “Could not set permissions on jk_log_lock; check User and Group directives”
Check what… ?! User and Group directives ? Right, maybe that changed (This is where you are thinking: how the heck is that even supposed to change by itself you idiot?!)…
User nobody
Group #-1
Nah… that’s what I expected. Or is it ? Awesome! The guys had the config template I used to configure this well documented, so a couple lines above I read:
# If you wish httpd to run as a different user or group, you must run
# httpd as root initially and it will switch.
#
# User/Group: The name (or #number) of the user/group to run httpd as.
#  . On SCO (ODT 3) use "User nouser" and "Group nogroup".
#  . On HPUX you may not be able to use shared memory as nobody, and the
#    suggested workaround is to create a user www and use that user.
#  NOTE that some kernels refuse to setgid(Group) or semctl(IPC_SET)
#  when the value of (unsigned)Group is above 60000;
#  don't use Group #-1 on these systems!
Some kernels do what ?! Well… Duh! Explicitly setting group then… I guess…
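For the record, the arithmetic behind that NOTE can be sketched in a few lines. This assumes the group id ends up as a 32-bit unsigned value, which the “(unsigned)Group” in the comment hints at but which I haven’t verified against the httpd or kernel sources. GroupIdCheck and its method names are made up for this example; the 60000 threshold comes straight from the comment.

```java
final class GroupIdCheck {
    // Threshold quoted in the httpd.conf comment.
    static final long KERNEL_LIMIT = 60000L;

    // Reinterprets a signed 32-bit group number as unsigned (assumption: 32-bit ids).
    static long asUnsigned(int group) {
        return Integer.toUnsignedLong(group);
    }

    // True when the unsigned value is above the limit some kernels refuse.
    static boolean risky(int group) {
        return asUnsigned(group) > KERNEL_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(asUnsigned(-1)); // prints 4294967295
        System.out.println(risky(-1));      // prints true
        System.out.println(risky(99));      // prints false (99 is an arbitrary example id)
    }
}
```

So “Group #-1” is about as far above 60000 as a 32-bit value can get, which is exactly the case the comment warns about.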
- Starting apache…
- … all up and running!
WTF?! Maybe these guys could have been a little clearer and told me to… Oh, wait. That’s exactly what they did… Stupid users!
Users don’t read these messages … indeed!
Or many don’t, at least! And I’m one of them… I’ve seen it many times while helping users use our software properly… Generally all that is seen, in our case, is the stack trace, while the exception’s message, the one I tried to craft as well as I could so that it’s clear yet still concise, is just ignored… Stupid users! All of them! But we’re all someone else’s user…