devopsdays Minneapolis 2019 – Serena Tiede – Timber! Security Logging at UnitedHealth Group

So, one, I can’t believe everyone in the approval process let me get away with this title.
[Laughter]. But, yeah, I'm super psyched to be up on stage for my first time. I'm Serena. We're going to talk about how UHG does security logging. I want to thank all of the organizers, sponsors, and volunteers that make all of this work. Shout out to the AV people — I had some issues with my slides this morning and had to fix them. So, who is Optum? Because some people aren't super familiar and super plugged into the healthcare space. The short version is, we're the technology and healthcare services side that makes up UnitedHealth Group. You're probably more familiar with UnitedHealthcare. I interned at Optum in 2016 and started full time after I graduated in 2017. First rotation, I came in through the technology development program and was on the tier II help desk. Learned a lot of empathy for the underbelly of IT. Then, on another six-month rotation, I worked with Heather's group. I got super into Prometheus, and at the end one of the people was like, oh, yeah, you did good SRE work. I'm like, that's the word for what I'm doing. Awesome. [Laughter]. Then when it came to final placement, my old boss said, hey, come back to security, we're doing a bunch of Kafka stuff and we need to get it off the ground. He launches into this whole pitch — I was sold five minutes in, but it seemed rude to interrupt. [Laughter]. So, my whole thing is platform developer, SRE, alphabet-soup job titles. I prefer vanilla sysadmin. I keep the lights on, and that brings me joy. So, like all engineering endeavors: what problems are we actually trying to solve? Define them. We had long log ingestion times — it sometimes took 10 minutes, and sometimes jobs would fail. Someone said, can we make it faster? Yes we can. The old process — I'll probably touch on it later — is you get into a queue, write parsers, and there you go, the SmartConnectors do their own thing.
Hand-writing parsers doesn't scale when you're at hundreds of applications, and sometimes a log format might slightly change and break all of it. Vendor lock-in: we'd gotten a commercial logging system, and it was kind of hard to get data out of it. We were like, let's see what we can do about that.
So, the outline for today: we're going to talk about our enterprise-wide log schema, and how to sell that, because, boy, asking people to change their logging is [Laughter]
difficult. [Laughter]. Then we'll dive into realtime — near realtime, depends how you want to define "time." We used to be four people; now we're around eight, thankfully. We'll go over everything we stood up, and then I'll give you some tips so you can learn from my mistakes. So, we've got our schema right up here. We've got eight required fields — we have a bunch more, but these are our bare minimum. Device.vendor, that's your company name; for me, yeah, it'll probably be Optum. Device.product, what's your application called? If I'm collecting logs from my own Kafka, the vendor will be Apache, the product Kafka, Zookeeper, et cetera. Device IPv4 — we used to get weird dotted-decimal parsing issues, so we said, okay, okay, okay: convert it to the long integer format, not a string. We'll convert it back for our analysts later. And a couple of validation rules: don't send 0.0.0.0 — that just tells us you're binding to all interfaces, and that gives us zero context. 127.0.0.1? I don't care. Localhost? Local to whom? Timestamp: we do milliseconds since the Unix epoch. We just said, okay, fine, fine, give us Unix time. Message, a human-readable string. Application ID — we have an internal database we use at UHG to keep tabs on all of our apps; it's mainly used for billing, chargebacks, et cetera. For our use case, since every application has a unique ID, we get to keep tabs on them. Then application.name, the more human-readable name from our CMDB — I'm pretty sure everyone's got one of those. I forgot to cover this at the beginning — we get asked a lot, what is an actual security event? Things we're interested in: logins, logoffs, session expiration. For queries, give us the query, don't give us all the data, because people say, this returns a thousand results — no, no, no, keep the logs nice and small, we're shipping them over the network. API calls. Basically, these are the bare minimum. Obviously, you can have more — response codes, the actual API endpoints, that would all be great stuff to have.
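The two conversions the schema asks for — dotted-decimal IPv4 to a long integer, and milliseconds since the Unix epoch — can be sketched like this. The function names are mine, not from UHG's actual tooling:

```python
import time


def ip_to_long(ip: str) -> int:
    """Convert a dotted-decimal IPv4 string to its long-integer form."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip}")
    return (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]


def now_ms() -> int:
    """Current time as milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


print(ip_to_long("10.1.2.3"))  # 167838211
```

Storing the address as an integer sidesteps the string-parsing issues mentioned above, and it's trivial to convert back to dotted form for the analysts.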
Then, why you want a schema. Like I said earlier, it's hard to parse infinitely — I got real good at it, and yeah, it's not fun. So we're telling people — doing the whole shifting-security-left thing — "please log our way." Please, pretty please. So it's all nice and standardized. And the thing is, now we're trying to get the whole company onto the structured logging train. When I'm writing my automation stuff, I rarely have to think about the log formats — they're there to help me out.
Main thing: it really helps your security folks to have a common enterprise-wide schema, so every event looks pretty similar beyond your typical message fields. We use Apache Avro. When people send us stuff that doesn't conform: we don't need your stuff. Goodbye. Sorry. [Laughter]. We used to take in whatever people sent us, and it's hard to build really good alerting cases off of that. Since everything comes in nice and standardized, you don't have to do a whole lot of cleanup. It doesn't necessarily mean the data coming in is high quality — we haven't quite automated that yet — but everything's coming in, and it's looking good. Now, all right, so your security folks go up and say, hey, I want you to log our way. K. Oh, by the way, if you don't conform to our schema, we're going to drop your messages. You're asking your developers to do a lot of work, and, yeah, it takes time. So, being security, we get to go out and make a security policy change and say, yeah, developers, you must log this way. But on the flip side, with a heavy-handed mandate like this, you do need to acknowledge it takes time and help them out along the way. The next couple of slides will have screenshots of the tools we're using to give feedback to our developer community. But, yeah, again, be really patient — this stuff does not happen overnight. And we've had to make some exceptions and meet people in the middle. For example, I didn't know mainframe applications don't have IP addresses. They also don't have Unix time. And the coolest thing I saw — I kind of want to get access to their source code just to look at it — we had COBOL people say, we'll write all the string handling to emit JSON in COBOL! [Laughter]. I am both amazed and slightly terrified of mainframe developers. That is super creative and awesome. [Laughter].
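For illustration, an Avro record covering the required fields named earlier might look something like this — the talk doesn't show UHG's actual schema, so the field names and types here are assumptions:

```json
{
  "type": "record",
  "name": "SecurityEvent",
  "fields": [
    {"name": "device_vendor",    "type": "string"},
    {"name": "device_product",   "type": "string"},
    {"name": "device_ipv4",      "type": "long"},
    {"name": "timestamp",        "type": "long",
     "doc": "milliseconds since the Unix epoch"},
    {"name": "message",          "type": "string"},
    {"name": "application_id",   "type": "string"},
    {"name": "application_name", "type": "string"}
  ]
}
```

A message that doesn't deserialize against a schema like this is the "doesn't conform, goodbye" case — the drop happens mechanically rather than by someone eyeballing the payload.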
So, one of our devs — I don't know if Kyle's watching at this time; shout out to you, dude — built this validator, and it is amazing. People were saying, hey, how do we actually know if our messages are valid? You're doing your development, you print off a couple of log messages, copy, paste, and we'll tell you, yeah, this is good, send it to us. This screenshot is from my local development with my Kafka logs, and it yells at me saying your host name doesn't resolve off of DNS. 100% expected, but very helpful. Then we've got our customer view. Giving access back to security data is a little hairy, because right now we have everyone coming in to a common topic; we're looking to change that and give everyone their own topics. Our current solution is: yeah, we can give you the counts, but we can't actually show you the text, because we don't currently have that. We're working on it; it's on the road map. This is coming off of our validation streams, and it tells people, hey, if your message passes the validation stream, we got it, you don't have to worry about it after that. The rest of it — that's our problem.
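A minimal version of that validator — check that the required fields are present and enforce the IP and timestamp rules from the schema section — could look like this. The field names match the illustrative schema above, not UHG's real one, and the real validator works against Avro rather than plain dicts:

```python
REQUIRED = {"device_vendor", "device_product", "device_ipv4",
            "timestamp", "message", "application_id", "application_name"}

# 0.0.0.0 and 127.0.0.1 in long-integer form — rejected per the schema rules
REJECTED_IPS = {0, 2130706433}


def validate(event: dict) -> list:
    """Return a list of problems; an empty list means the event is valid."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED - event.keys())]
    ip = event.get("device_ipv4")
    if isinstance(ip, int) and ip in REJECTED_IPS:
        problems.append("device_ipv4 is 0.0.0.0 or 127.0.0.1 — zero context")
    ts = event.get("timestamp")
    if "timestamp" in event and not isinstance(ts, int):
        problems.append("timestamp must be integer milliseconds since the Unix epoch")
    return problems
```

Returning a list of problems instead of a bare pass/fail is what makes the tool useful as feedback: the developer sees every complaint at once, like the "host name doesn't resolve" message above.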
Now, streaming, why should you stream your logs? So, being security, I love thinking of the worst possible things that can happen.
[Laughter]. So, should the worst happen — heaven forbid — say a hacker popped your box and is messing up all your services. Sure, you can't stop them from tampering with logging on the box, but if you're streaming the logs off the compromised endpoint, you can at least get whatever essential log data's on there before they shut that off — or, if they don't know what's running, you get to watch and see, oh, yeah, that's 100% an attacker, let's go shut that down. Another thing about streaming in realtime: if you start noticing something weird going on and your people in the SOC are saying, yeah, there's something up, you can lower that time to detection, get that down, save the day, get a nice "there you go" from your boss, and everything's great. Good guys win and all that jazz.
So, Kafka — super basic. There's no secret sauce; this is just the way we're doing it. It's super common, it's resilient — set your replication factor and it's got some pretty good built-in resiliency, even though, as we'll cover later, yes, it's resilient, but you might have some Kafka gremlins. The other nice thing — why I like Kafka — the docs are amazing. Everyone, and their moms, has written a blog post about Kafka. When you're a beginner, the collective wisdom of the internet says, oh, here's some sane configs, and there's a whole group of people running Kafka — I may or may not have borrowed and forked a couple of Ansible roles. Also, so far we haven't hit anything at scale — I was being sarcastic going, whoa, we have a whopping one meg a second going through right now. We just started out and are getting our stuff stable in production.
Then, finally, "how do people actually send us logs?" We've got a couple of options that you, and the whole family, will enjoy. You can send them straight to a Kafka topic. We're heavy with Chef, so we're like, yeah, we'll install Filebeat — set your log directory, make sure it's formatted, we'll take care of it from there. Also, we're a pretty heavy Java shop, so — since we were pretty small — we had our internal teams, think contracting but internal, commission a nice little Java package, so people can call methods to build our logs and just use a send method. It handles things like the IP conversion and epoch time in milliseconds. We'll have to revisit that in 2038, [Laughter] but that's a long ways away.
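The Filebeat path can be sketched roughly like this — the paths, broker addresses, and topic name are all hypothetical, not UHG's actual config:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/security-events-*.json   # hypothetical log directory
    json.keys_under_root: true    # each line is already a schema-shaped JSON object

output.kafka:
  hosts: ["kafka01.example.com:9092"]   # hypothetical broker list
  topic: "security-events"              # hypothetical topic name
```

Filebeat tails the formatted log files and its Kafka output ships each line to the topic; Chef would lay down a config along these lines on each node.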
And finally, say you don't do Java and you need more parsing than Filebeat can give you. That's fine, we don't care — send it to our producer topic. "I'm supporting a vendor product off the shelf, I can't change this." That's fine. We recommend Fluentd. Due to our constraints and the pain of installing Ruby, I said, okay, I'll use Fluentd, it'll be fun, and whatever heavy lifting I need to do, Google can teach me in a few seconds. [Laughter]. Then, Kafka Streams. I love Kafka Streams — it's super nice and fun to work with. Kafka's just a message queue, super small, super basic, but there's a lot of cool things you can do with it: multiple uses, different consumer groups. And then we start getting our data nice, fast, and snappy. This has lowered our latency from tens of minutes down to — I'm trying to think of the actual empirical numbers — five minutes or less?
And then, again, really hammering down on this point: lower your time to detection. It'll be great. Get those alerting use cases going — analytic streams, awesome, I support that. Most of our stuff runs in Kubernetes, but I manage multiple virtual machines. I wish I didn't have to be on virtual machines, because you can mess something up weird with state, but this is the way of life with stateful workloads. And our streams just kind of fell into a microservices architecture. Once the validation stream reports your stuff passed, as far as the developers are concerned, that's it — after that, it's our responsibility. We run a bunch of analytics on all of our endpoint logs, because your firewalls, switches, network appliances, they're all quite noisy. So we've got our validation, then we've got our Elasticsearch and data lake. One, go put your stuff in Elastic; and people will start messing around with Spark and we'll see what cool things we can do. Now, this is the section of "do as I say, not as I did."
[Laughter]. I have to operate in my organization's DMZ. Fortunately, we're running RHEL. But, unfortunately, you can't just say, oh, yeah, yum install Kafka, it'll be fine. The interesting thing is, we have a vendor product that kind of mimics the S3 API, which has a DMZ and a core endpoint. Awesome. Mess around and — oh, great, I've got my whole yum repo up. Amazing. I've got to package the stuff myself, but hey, I'll take it over a tarball. So, order of operations — going back to the DMZ — this stuff matters. Do your bootstrapping, non-Kafka stuff first: I'm going to set my time zone to UTC, install node_exporter — I'm a Prometheus gal. You've got to add the node to your firewall rules first; traffic won't be going to it yet because Kafka's not installed. Then you can add it to the cluster and go rebalance your partitions, to make sure you spread out that load. I accidentally switched the steps — I added the node to the cluster and rebalanced first, and I was like, aahhh, the firewall can wait, it's not like it will be a leader for partitions or anything. Oh, boy, was I wrong! [Laughter].
I learned that mistake the hard way: clients were complaining, I can't talk to the partitions. I was like, what do you mean? The cluster's up and — oh, firewall. Neato. All right. And then package up everything nicely as an OS package. It's great. Then, infrastructure change management. I'm supporting a health insurance company. Our peak season — you've got your Medicare open enrollment, individual enrollment, everyone loves that end of the year where you're signing up for your health plans and you want to tell your employer, hey, I just want the one I had last year. During peak season, things are maybe a little difficult because we have change blackout dates. Do as much as you can outside of that time, because I was trying to set up all this infrastructure during our peak season and it took a little while. Also, try to build in observability — ahh, perfect, next point: instrument all the things. This was a super pain point. I had to ship out features, got the cluster stood up, and then realized I had zero idea what my Kafka cluster was doing. Okay. So there was a whole sprint where I was like, I'm doing metrics: grab node_exporter, get a Prometheus out in the DMZ, hook up the data sources. And it's like, great, I can see all the health, everything looks good. Now we're getting logs out of our platform stuff — we're running open-source Kafka and Zookeeper to make sure everything is nice and compatible — and getting logs from our attribute manager, because, yeah, we sure would love to know who is changing log directories on people's boxes. I think that's all we're collecting right now.
We're still working on setting up tracing, to see what people see. The cool thing is, we inadvertently rolled our own heartbeat generator, because we want to see: what does the customer see when we write messages? I think it's every five seconds — testing, testing — it sends a message through. We set up a watch in Elasticsearch. Most of the time, I get my messages in a five-minute period, and I'm happy. Sometimes I'm a message short, which normally means it took a while to get through. Being a small team — what a blessing and a curse. It was nice when we all used to be out in Minnesota; now we have people out in Colorado, Maryland, Carolina — we're distributed everywhere. But we can all talk to each other, like, say I broke the Kafka cluster, I can say, hey, I think something's wrong. However, be realistic with your expectations. We used to be, again, four people, and we said we realistically can't do an on-call rotation. Yes, this is super important, but we're just not adequately staffed — because if we're all on call, then, yes, we'll all burn out and none of us will be keeping an eye on the platform, and that's just worse than no support. And then, automate everything. I'm super lazy. I was thinking, okay, I can go hand-patch — then I looked at my inventory list and, oh, dear, that's 40 servers. Yeah, no. Code's going to solve that.
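A heartbeat like that is simple to roll yourself: emit a tagged message on an interval, then alert when the count over a window comes up short. A stdlib-only sketch — the five-second interval and five-minute window are from the talk, everything else (field names, the one-message tolerance) is illustrative:

```python
import json
import time


def make_heartbeat(seq: int) -> bytes:
    """One heartbeat message, tagged with a sequence number and epoch-ms time."""
    return json.dumps({
        "message": "heartbeat",
        "sequence": seq,
        "timestamp": int(time.time() * 1000),
    }).encode("utf-8")


def window_ok(seen: int, window_seconds: int = 300, interval_seconds: int = 5) -> bool:
    """True if we saw roughly the expected number of heartbeats in the window.

    Allowing one missing message matches the talk: being a message short
    usually just means one took a while to get through.
    """
    expected = window_seconds // interval_seconds  # 60 per five-minute window
    return seen >= expected - 1


# The producing side would loop forever, sending make_heartbeat(n) to the
# topic every interval_seconds; the watch on the consuming side counts
# arrivals per window and calls window_ok.
```

Because the heartbeat traverses the exact same pipeline as customer messages, a shortfall measures end-to-end delivery, not just broker health.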
Big thing: "no" is a complete sentence. [Laughter]. You've got some important person saying, hey, I want this feature in, like, right now. And it's like, hey, we're working on some stability stuff, trying to reduce our tech debt, pin that down. It's like, hey, listen, we hear you — I know we have a lot of the dev community saying, hey, we want some limited access back to our logs. Yep, we hear you, we got that feature request in loud and clear. However, you've got to have your manager, and your boss's boss, say, yes, I respect this team's autonomy. So, depending on where you are, it may vary. Then, final thoughts. In college, when I took digital signal processing — once upon a time, I was going to be an electrical engineer and do embedded systems — my professor concluded by saying: when your mother asks you at the dinner table what you learned today, this is what you learned today. So, this is what you go back and tell management and your team. One, enterprise-wide schema — and acknowledge that, yes, it will take time, and there will be pushback. Two, stream everything. You know, we've gotten a lot more network bandwidth and we've gotten accustomed to "I want that web page to load up right now." Similarly, I want that log data right now. And then, be patient — get your devs involved. We actually started doing the innersource thing and got more devs looking at our Java package and saying, hey, that was a good first cut, are you open to a PR? Yes, I love a good PR. You want a feature request? We have all of our repos public. Come talk to us on our Slack channel. We don't bite. If you have questions, please ask, because we would much prefer you ask a question and get answers so you can log better. Then, you know, we all kind of win. The main thing is — it's DevOpsDays, and as someone who's done development, operations, and security — it's nice to realize, yes, we all should be friends. We're all on the same team.
No one wants to be in the news.
[Laughter]. That's the thing — you can't be in opposition. We're all on the same side here. And then, some links. If you're looking for this on GitHub, I put the link out on Twitter last night. I've got a PDF — which, PSA, if you're speaking at a conference, do yourself that favor. I love weird German last names. GitHub, Lady Serena. Thank you, everyone.
[Applause].
