Server!/Horror!

I have a magnet! And I don't mind using it!

Everything is a fucking event stream!

I got quite fed up with monitoring systems lately. I do not see a need for such systems any more, and I hate to introduce yet another component that does not actually add any value!

Now go away, turn around and read something else if you think this is utter bullshit, but recently I took some time to think about what a monitoring system actually does.

In my opinion a monitoring system is nothing more than a logging system. You generate some log message (which actually isn’t a log message) and send it off to a central host (possibly with several levels of intermediate hosts) just to fire off a notification about your log message.

So what do we have? We have a log stream. But since I don’t actually really care about log messages, let’s generalize a bit and call everything an event!

Everything is a fucking event stream!

I do mean everything!

Let’s begin with syslog (yup classic syslog, no bells and whistles). Have you ever thought about what the separation marker for a syslog event is? I don’t know for sure what it is. For a long time I thought that the only true answer would be a newline.

Hey, who the hell wants to read Java stack traces in a log file, or Python ones, or any “multi line message”?

I do!

Of course I don’t want to use grep or less or anything like that to work through those files. I want a tool that understands those messages. Better yet, I want a tool that is usable and that lets me define reusable rules to tag events.

Think of Nagios:

You write some plug-in, the plug-in has a log, the plug-in reports to stdout for Nagios messages, the plug-in writes to stderr to tell about errors, there are different levels of verbosity to debug, it has at least 2 different return values

Why on earth is there:

A stream of plugin results (usually exit codes), a stream of messages on stdout (meta results?), a stream of log messages (meta2 results?), a stream of messages for stderr (is that logging, monitoring, meta3 results, ignorable?)

On top of that all those kinds of messages are incompatible. There’s no such thing as structured logging!
Please, please just let me send everything to some remote place where it will be persisted and I have a central view on all my events.

Add as much meta data as possible.

source host, receiving host (or hosts if there were several in between), reception times, timestamps — and please: do add fractions of a second, host names and ip addresses, a possibility to extend the amount of meta data
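A minimal sketch of what such an event could carry, in Python. The class and all field names are assumptions made up for illustration, not an existing format:

```python
import socket
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event plus its meta data. All field names are illustrative."""
    payload: str                               # may span several lines
    source_host: str
    timestamp: float                           # epoch seconds, with fractions
    hops: list = field(default_factory=list)   # (receiving host, reception time)
    meta: dict = field(default_factory=dict)   # open-ended extra meta data

    def received_by(self, host, when=None):
        """Record an intermediate or final host that received this event."""
        self.hops.append((host, time.time() if when is None else when))
        return self


def new_event(payload, **meta):
    """Create an event on the local host, stamped with the current time."""
    return Event(payload, socket.gethostname(), time.time(), meta=meta)


event = new_event("Traceback (most recent call last):\n  ...", load=3.2, cores=2)
event.received_by("relay1.example").received_by("sink.example")
```

The open-ended meta dict is the point here: nothing about the event decides up front which fields matter later.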

Now when I have a place where I can look at all my events, then and only then I want to make decisions about what I’m interested in. I’m nowhere near deciding whether any of this is an alert.

After the fact tagging

Nagios, Icinga, Zabbix all force me to make something up. Some test, a probe or whatnot where I have to write some script (Don’t get me wrong: I like scripting — scripting takes away those repetitive tasks I hate!) and make up arbitrary values that represent a certain level of goodness or badness. OK, CRITICAL, WARNING?

WTF? Who said I need three of them?

Just let me define some criteria that will match events. Please note that I am not restricting the criteria to regular expressions. Something like “Has the meta data field X”, “Does not have meta data field Y”, “After January 1st 1992 but not before May 3rd 1982”, “Only between 13:00h and 13:15h when the load was higher than 3 on systems with only 2 cores” and so on. Those are equally important.

Uh, oh! I just defined some plug-ins, now I’m back at monitoring! No I’m not!

I ran a cron job (in the simplest case) that generated an event which told me the load, and another cron job that told me the number of cores in the system. This event was sent to my event sink for later processing!
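Such a cron job can be tiny. A rough sketch, assuming a Unix host and made-up field names; it folds load and core count into one event for brevity and only builds the line, since where to ship it depends on the event sink:

```python
import json
import os
import socket
import time


def load_event():
    """Describe the current load of this host as a plain event dict."""
    one_minute, _five, _fifteen = os.getloadavg()   # Unix only
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),                   # keeps fractions of a second
        "load": one_minute,
        "cores": os.cpu_count(),
    }


# One JSON document per event; sending it to the sink is left out here.
line = json.dumps(load_event())
```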

I need to be able to save those criteria collections as filters/views or whatever you want to call them, and I need to be able to name those things so that I can find them later on. I simply want to label my events.

I need to be able to attach as many labels to events as I see fit. Also I need the ability to find unlabeled events.
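To make that concrete, here is a small sketch of named filters used as labels. Events are plain dicts and every name in it is hypothetical:

```python
# An event is just a dict of meta data here; the keys are made up.
events = [
    {"id": 1, "host": "web1", "load": 3.5, "cores": 2, "hour": 13},
    {"id": 2, "host": "web2", "load": 0.4, "cores": 8, "hour": 13},
    {"id": 3, "host": "db1", "message": "disk almost full"},
]


# A criterion is any callable that takes an event and returns True or False.
def has_field(name):
    return lambda event: name in event


def lacks_field(name):
    return lambda event: name not in event


def overloaded(event):
    return event.get("load", 0) > 3 and event.get("cores") == 2


# A named filter is a label plus the criteria that must all match.
filters = {
    "overloaded-small-box": [has_field("load"), overloaded],
    "no-load-data": [lacks_field("load")],
}


def labels_for(event):
    """Return the names of all filters whose criteria match the event."""
    return {name for name, criteria in filters.items()
            if all(criterion(event) for criterion in criteria)}


labeled = {event["id"]: labels_for(event) for event in events}
unlabeled = [event_id for event_id, labels in labeled.items() if not labels]
```

Event 2 matches no filter, so it ends up in unlabeled, which is exactly the list I’d want to look at when deciding what to tag next.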

Which brings us to alerts!

So now that I have something that shows me what’s interesting and gives me the ability to make educated decisions about it, I can decide when it’s worth raising something that will wake me up in the middle of the night.

I do want to be able to generate alerts and send them off to some other system…

Hey look! Another event stream!

…and then I’d rather not specify when things go bad. I’d rather specify when things are good. Everything else is just badness enumeration.
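A sketch of that idea, with a made-up definition of “good”: anything that is not known to be good becomes an alert, and the alert is itself just another event.

```python
def is_good(event):
    """Hypothetical definition of 'good': less than one unit of load per core."""
    return "load" in event and event["load"] / event["cores"] < 1.0


def alerts(events):
    """Yield a new alert event for every event that is not known to be good."""
    for event in events:
        if not is_good(event):
            yield {"type": "alert", "cause": event}


stream = [
    {"host": "web1", "load": 0.5, "cores": 2},
    {"host": "web2", "load": 3.5, "cores": 2},
    {"host": "db1", "message": "something nobody anticipated"},
]
raised = list(alerts(stream))
```

The third event has no load at all and still raises an alert, which no list of known-bad conditions would have caught.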

I’d rather triple the amount of time I invest in such a system than to create yet another monitoring system that doesn’t use what’s already there.

But today’s alerts aren’t worth anything tomorrow!

Any system that silently throws away data is useless (I’m looking at you Munin and friends). I’m not saying RRDtool is a bad thing. I love it, the problem is how people use it.

Throw away data? Come on, who wants this? I do want the finest possible resolution, we have Hadoop, GlusterFS, Ceph. Storage is something that shouldn’t be in the way. I’d rather have only 7 days of data than a year of useless junk.

Of course there are trends over long periods of time, but those shouldn’t be the default, those should be something that is added on top of existing data!

How are today’s alerts helpful if I can’t possibly tell what happened yesterday between 15:03h and 15:07h?

Yes this actually is basic stuff

But it’s just a syslog server and some scripts!

Yup you are absolutely right, and that’s the reason why companies like Splunk and Loggly make money right? Because anyone just has stuff like that. It’s a default, no more to-dos, nothing to see just go along!

Ah so you don’t actually have it? Neither do I. But I’d love to! Please someone skilled create such a system and make it open source!

On top of that: make it near real time!

/serverhorror

Written by serverhorror

2011-12-27 at 08:00

Rethinking Deployment

Probably everyone knows something about one of the following environments:

What you usually get from services like these is easy deployment and kind of a namespace for your app. By namespace I mean some kind of container that keeps your app from crashing because someone else did something bad on the server and vice versa. This is naturally what you want because you run in a hosted environment, which means you don’t actually have control over what kinds of apps run on the servers or even how many apps run on the server. Please keep in mind that hosted doesn’t necessarily mean that you pay a third party for hosting your stuff. It might as well be some service within your very own company.

Not a bad thing in and of itself. Just something to get used to.

Of course there are some things to keep in mind with that kind of automated deployment. One of the more problematic things will be how to ensure that only working apps will get deployed.

After all you don’t want some fancy, easy to use deployment mechanism just to take your main site down every 15 minutes. Of course you could then just easily deploy the last version again and be up and running. Still, this is not a desirable solution.

We now know a few requirements:

  • easy deployment (more on that later)
  • versioned deployment

Versioned Deployment

Versioned deployment means that the same app must be deployable multiple times. Somehow the deployment system must be able to differentiate between the same app deployed multiple times. It’s only natural (at least to me) to use a version number. This actually brings in another requirement:

  • semantic version numbers

What does that mean? Don’t just use some random string, there needs to be a notion of comparison between versions. In simple terms it must support the following operations:

  • is larger than
  • is smaller than
  • is equal to
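In Python these three operations come almost for free once the version is parsed into a tuple. A minimal sketch, assuming plain major.minor.patch numbers without any suffixes (parse_version is a made-up helper):

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints, e.g. '1.10.0' -> (1, 10, 0)."""
    major, minor, patch = text.split(".")
    return (int(major), int(minor), int(patch))


# Tuples compare element by element, which gives all three operations.
assert parse_version("1.10.0") > parse_version("1.9.3")    # is larger than
assert parse_version("0.9.9") < parse_version("1.0.0")     # is smaller than
assert parse_version("2.0.1") == parse_version("2.0.1")    # is equal to

# Comparing the raw strings gets the first case wrong.
assert not ("1.10.0" > "1.9.3")

deployed = ["1.9.3", "1.10.0", "1.2.0"]
latest = max(deployed, key=parse_version)
```

This is why a random string is not enough: the deployment system needs an ordering to answer “what is the newest version of this app?”.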

There’s the nice semver.org site that tries to standardize. Personally I agree with the spec, except for the special version number, I just don’t see a need for it. I’m perfectly fine without it.

Another quite simple version number would be a unix timestamp or the date and time of the release. I suggest something like 201105261324 — but to be honest; just staying with the spec from semver.org is perfectly fine. It’s defined, it’s there; no need to reinvent the wheel!

Easy Deployment

Easy deployment actually means that developers have to live with quite a lot of constraints. I can’t imagine a system that allows you to do anything you can think of and still have a notion of easy deployment.

Constraints (or rather restrictions) leave a bad aftertaste. Maybe I can’t write to disk. Maybe I can’t configure my logging the way I want. Maybe I don’t have the possibility of accessing the database I like best.

These kinds of constraints come in varying sizes and tastes. The most basic thing that won’t be available is random access to the file system. Or even the expectation that whatever you write to the file system may be there in half an hour.

To quote myself (yeah, I know…):

You want people to accept the system. Hell, they shouldn’t just accept it: it should be natural to them to use whatever compliant services are provided, because they are easy to use and because they do the job at least well.

If you want to have compliance and you do it in a way that users don’t like, they will find ways around it.

What this actually means is that for every expectation (or at least most of them) people have regarding a system, there needs to be a solution. This solution must be well thought out. I couldn’t just throw some bad API at the users and expect that people will use the deployment system.

The most basic requirements to deploy (web) software are:

  • Application Entry Points
  • Application Configuration
  • Have a caching API
  • Have a persistence API
  • Have a data query API (specifically not restricted to SQL)
  • Have a logging API
  • Have at least very good documentation for all of the above

This is what needs to be available to write some app against a restricted environment. Much of this is actually just what Google AppEngine provides.

Let me construct a system that has the basic properties mentioned above. We’ll restrict to the following:

  • Python Web Applications that are accessed thru an application object that is a WSGI instance
  • You will run on python2.5
  • The application needs to be self contained
    • this implies that if you use web.py you need to make sure that it’s importable from within your application. You cannot expect web.py to be somehow magically available
    • this also means that you can only use pure python modules

I’ll be very unspecific here. I hope nobody expects me to use a magic hat and just pull out a solution. I hope the basic idea will still come through.

Entry Points

I just implicitly defined that above. Just make a WSGI application, a simple one like the web.py cookbook sample is enough:

    import web

    urls = (
        '/.*', 'hello',
    )

    class hello:
        def GET(self):
            return "Hello, world."

    application = web.application(urls, globals()).wsgifunc()
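To show what the platform side of that contract might look like, here is a sketch without web.py. It uses a bare WSGI callable and calls it once through the standard library’s wsgiref helpers. It is written for a current Python 3 rather than the python2.5 target above, and call_app is a made-up helper, not part of any framework:

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """The smallest possible entry point: a plain WSGI callable."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, world."]


def call_app(app, path="/"):
    """Call a WSGI app once, roughly the way a hosting platform's server would."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)     # fills in the remaining required keys
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], body


status, body = call_app(application)
```

All the deployment system has to know is the name of the module and that it exposes an object called application.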

Application Configuration

This is mostly about data that shouldn’t be the default. For web applications this means:

  • cookie/session secret
  • API credentials (which may be different for different APIs)
  • other data that is not under the direct control of a user

Caching API

Note: Stealing from memcached here. You might want to read up on the memcached protocol

Some assumptions that need to be dealt with:

  • Cached items may expire at any time
  • setting something doesn’t mean that it’s still there the next time I want to retrieve it (yes this is actually very, very, very bad – just trying to simplify)

Basic idea:

[bytes]  set   get
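A rough in-memory sketch of that set/get idea. The class, the parameter names and the eviction rule are all assumptions for illustration; this is not the memcached protocol:

```python
import time


class Cache:
    """In-memory stand-in for a memcached-like service (hypothetical API)."""

    def __init__(self, max_items=1000):
        self.max_items = max_items
        self._items = {}                # key -> (expires_at, value)

    def set(self, key, value, ttl=60.0):
        if len(self._items) >= self.max_items and key not in self._items:
            # Evict the oldest inserted item: entries may vanish at any time.
            self._items.pop(next(iter(self._items)))
        self._items[key] = (time.monotonic() + ttl, value)

    def get(self, key, default=None):
        item = self._items.get(key)
        if item is None or item[0] <= time.monotonic():
            self._items.pop(key, None)  # drop expired entries on access
            return default
        return item[1]


cache = Cache(max_items=2)
cache.set("a", b"1")
cache.set("b", b"2")
cache.set("c", b"3")                    # pushes "a" out
value = cache.get("a")                  # a miss: callers must always handle it
```

Application code written against this never gets to assume a hit, which is the whole contract.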

Persistence API

This should be nothing special. Just write(bytes) and read(identifier). I like the notion of content-addressed storage for that part so I’m biased. But the identifier could be anything. A path, a sha1-sum, really anything.

Again stealing from above (and yes, this works in the real world – just look at riak).

There’s something to keep in mind here: I want to make sure nobody writes vast amounts of data, either in a single object or by issuing a multitude of writes that’ll take down the system. I want some kind of quota. Say 100 GiB per application.

Basic idea:

[bytes]  write   read
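As a sketch, here is write and read with a sha1-sum as identifier and a per-application quota. The class and exception names are made up, and a real store would obviously not live in a dict:

```python
import hashlib


class QuotaExceeded(Exception):
    """Raised when an application would exceed its storage quota."""


class Store:
    """Content-addressed blob store with a per-application quota (sketch)."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._blobs = {}

    def write(self, data):
        identifier = hashlib.sha1(data).hexdigest()
        if identifier in self._blobs:
            return identifier           # same content, nothing new to store
        if self.used_bytes + len(data) > self.quota_bytes:
            raise QuotaExceeded("application quota exhausted")
        self._blobs[identifier] = data
        self.used_bytes += len(data)
        return identifier

    def read(self, identifier):
        return self._blobs[identifier]


store = Store(quota_bytes=16)
key = store.write(b"hello")
same_key = store.write(b"hello")        # writing the same bytes costs nothing
```

Because the identifier is derived from the content, the quota check covers both cases from above: one huge object and many small writes.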

Data Query API

This gets a bit more tricky. Usually as soon as I take away SQL from some kind of data query I hear screams. I’m not saying SQL is bad or wrong in any way. But as soon as there aren’t dedicated DBAs involved that will take care of largish databases, most servers run into a problem.

For the sake of simplicity let’s stay with a nice simple (for certain definitions of simple) SQL database. I’d allow people to:

  • CREATE TABLE,
  • DROP TABLE,
  • CREATE VIEW,
  • SELECT,
  • INSERT,
  • DELETE

Maybe even:

  • CREATE INDEX,
  • DROP INDEX.

Things I specifically wouldn’t allow (speaking in MySQL permissions here):

  • CREATE TEMPORARY TABLES
  • SELECT INTO OUTFILE
  • LOAD DATA INFILE (MySQL’s counterpart to COPY FROM FILE)

Basically I’d let everyone talk to his/her own database with basic usage rights. But especially deny everything that would need access to the filesystem or something near that.

Logging API

This is a special case. All of the above were external things. External in the sense that there’s no problem with only allowing communication to run thru some sort of tcp connection. Logging however should be available as a “local” API call. I’d just stay with the standard python logging library.

The important point would be that users must not set up their own Handlers. Basically people are allowed to either use a StreamHandler or a NullHandler.

A StreamHandler would be exactly what any other log system does. Provide a stream of events. An event isn’t necessarily something that ends with a newline. But the point is that writing something to the filesystem through the logging API sets up expectations that these log events will later be available. This assumption is wrong. The only events that will be available will be the ones that have been emitted through the StreamHandlers provided by the system.
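With the standard logging library this could look roughly like the sketch below. The in-memory buffer only stands in for whatever stream the platform would really collect, and the logger name and format string are arbitrary:

```python
import io
import logging

# Stand-in for the stream the platform collects; a real system would wire
# this to its central event sink instead of an in-memory buffer.
platform_stream = io.StringIO()

handler = logging.StreamHandler(platform_stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False                   # keep the example output in one place

try:
    1 / 0
except ZeroDivisionError:
    # One event, several lines: the traceback travels with the message.
    log.exception("division failed")

collected = platform_stream.getvalue()
```

The application never touches a file, it only ever sees the logger.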

What’s the point in providing a NullHandler? To be honest I can’t think of one. But hey! There’s gotta be some choice :)

Documentation

I’m a believer of learning by example. This has 2 advantages:

  • diving in is easy
  • by writing examples I will be a user

The last point is an “eat your own dogfood” argument. But it will inflict the same pain on me that the users of the system have. Thus it will get better “real soon now”™.

Jokes aside. There’s a third point to it. Libraries will pop up for free. By writing code against the carefully designed APIs I will automagically create libraries to make my life easier. I will then either publish these libraries or incorporate them into the existing API, thus making everyone else’s life easier.

The valid point of “If you need to create a library it’s not good enough” has a shortcoming here. If the API is just some RESTful HTTP API, this API could be very good. But why not just create a library that will do it directly in Python (or Ruby, Java, Go)? After all, nearly everyone will be using these libraries, and by providing them I have the potential to take server load away by optimizing a single library and updating it on the server.

Original Post: http://serverhorror.eu/rethinking-deployment New Blog Location: http://serverhorror.eu