7 May 2010
Steve Huffman on Lessons Learned at Reddit
At Future of Web Apps Miami 2010 Steve Huffman gave a great presentation on the 7 key lessons he learned during his time at Reddit. If you are in the business of developing large scale web apps or just taking your first steps into thinking about building your web app there’s plenty to take note of.
Watch this and other videos from Future of Web Apps Miami 2010 on our Vimeo channel
Transcript
Ryan: Next up we have Steve Huffman. Who is a Reddit user or fan? A lot of developers consume it like you wouldn’t believe. Steve is one of the founders, along with Alexis, and he’s going to share some of the things that they did right and they did wrong when they built Reddit and then sold it. Please give Steve a big hand.
Steve: Hi everyone, I’m Steve Huffman. Thank you all for having me. I’m going to talk today about the lessons we learned building Reddit. We’ve told the Reddit startup story many times. This is going to be a new presentation. This is the lessons I learned from the engineering point of view, the mistakes we made building and growing Reddit, and keeping it online, and occasionally fast.
I’ll start with what Reddit is. I didn’t see the show of hands. How many of you have used Reddit or are familiar with Reddit? Awesome. This should be a review then. It’s a social news site. Users submit links and vote on them, up or down, and our front page is basically the most popular links at any given moment on Reddit. It’s like a Top 40 for news. Stories kind of rise and fall.
We don’t just do links anymore. Now we can do kind of forum posts, ask me anything, and all sorts of interesting things. Reddit is basically a big, vibrant community. I’m very proud to say that I think it’s probably the premier way to waste time at work, including my own time. I think Reddit would have done a lot better if I hadn’t been “testing” it for 6 hours out of every working day.
Reddit was founded in June of 2005. That was immediately after me and my co-founder, Alexis, who is the friendlier presentation-giving half of Reddit – we founded it immediately after graduating from college. Because of that, not to knock on my education that I received at UVA, but I knew nothing. I made so many silly mistakes along the way.
We got fairly big and were acquired by Conde Nast, 17 months into our adventure, October of 2007. I left Reddit. I worked for Conde Nast under contract, and I left just this past fall, the third anniversary of our acquisition.
When I left, we were at about 7.5 million users per month, and 270 million page views per month, which we were very proud of and it was kind of a long road for us to learn how to handle that kind of traffic, and how to build this site.
What I hope you guys can get out of this is if this talk had existed when we were starting Reddit, I think it would have saved me a lot of time. I hope that something in here isn’t obvious to you now, and you can learn from it. If it is obvious, you’re in great shape.
1: Crash Often
What I mean by this is in the early days of Reddit, we didn’t really have any crash protection. We had tons of errors, and Reddit would occasionally lock up, freeze, or get in an infinite loop, any number of ways of bringing down the site. I used to have to sleep with my laptop and I would wake up every couple of hours and see if Reddit was working, and restart it.
It was the worst feeling in the world, totally, mentally draining. I couldn’t go out. I dreaded my phone ringing. Nobody would call me for any other reason than to tell me Reddit was down. Even my mother would call me and say, “Steve, your website’s not working.” Thanks mom.
I’d get interrupted many times at dinner and I’d have to leave dinner. One time, I remember we were at an important dinner meeting and I had to leave, run across the street to the Apple store to find a terminal so I could try to fix Reddit. It was really bringing me down.
Then we had this realization that we were idiots and we started using this piece of software called “supervise”. It’s part of the daemontools package. What supervise does is really simple. When your application crashes, in UNIX, supervise will restart it.
It’s kind of a no brainer, totally obvious now that we should have been using this the entire time, but it works great. If you’re not running your app in supervise, you’re destined to crash.
That changed the way we designed the Reddit. If we had a memory leak, we would just kill the application process. We’d have special monitoring scripts that would watch for a Reddit application process using too much memory, or using too much CPU, or not handling requests anymore.
Instead of worrying too much about it, we would just restart the application server, supervise would catch it and restart it gracefully, and everything keeps going on.
The important thing to remember is to read the logs. If your application server is crashing over and over again, and you’re not reading the logs – ultimately, you have to solve the problem that’s crashing the application but this can at least keep things going while you’re trying to sleep or eat a meal.
2: Separation of Services
This one is again really obvious in hindsight, but it’s the concept of separation of services. When we started, we had one machine. It had a web server, an application server, and a database server. Reddit was kind of slow but it wasn’t clear why. CPU usage wasn’t astronomical.
Memory wasn’t astronomical, and that’s basically the cost of doing too much work on one box. This was a one core machine. At the time, 4 years ago, multiple cores weren’t quite as popular and we couldn’t afford the dual core boxes that existed then anyway.
When we got our second machine, and put our web and app server on one machine and the database on the second machine, things got so much faster, more than twice as fast. I was obviously expecting a performance gain because we doubled our capacity from one to two, but it got so much faster.
The reason was we were spending all of our time context switching between different jobs on our machine and it just killed us. We’ve learned this lesson over and over again; just separating, based on type of jobs or even the type of data.
Now, we have 20 or so database servers. We try to make sure that each database server is handling only 1 specific type of data that is accessed in one say. That means all the indexes for that data are cached in the same way; you don’t have to constantly swap out one chunk of index in the memory and the disc and deal with that madness, just keeping everything as similar as possible together, and anything that is different, keep them separate.
The other lesson we learned is don’t use threads for anything. We’re in Python, so Python threads are just the kiss of death, the recipe for slow. We just run everything in multiple processes. Once you solve that problem of having different jobs and different processes, for us it was things like spam and thumbnails, and query caching sort of stuff; it allows you to put it on different machines.
When you start adding more machines, when each of those individual tasks start doing too much work, you can put it on separate machines easier, without having to think about it. You’ve already solved that hard problem of communicating between processes and all that, which is the hardest part of doing that type of scaling. Once we learned that lesson, it kind of kept our architecture growing smoothly, and everything worked better from then on.
3: Open Schema
An open schema. In the early days, our database was a very traditional kind of relational database. We had a table for links, a table for users, and a table for comments. The columns in the database were what you’d expect. Everything was normalized. This would be an example of the link table. There were IDs and the number of up votes and the number of down votes, and then title, URL, and tons of foreign keys and these complex mini relationships. We spent a lot of time thinking about the database and working on it.
It seemed okay at the time, but there were a lot of problems looking back. One of the things was we spent too much time thinking about the database. We really shouldn’t have to do that. Every time we added a feature, when you can save links, or hide links, or when we added commenting – we didn’t have that at first – we had to update our schema.
Schema updates are really painful, as you grow. The database gets bigger and if your database is under a lot of load, you can’t just add another column or add another table and expect it to be totally fine. Adding a column to a table that has 10 million rows in it takes a long time and will totally kill your database.
Replication – we use database replication for backup and for scaling. Trying to do schema updates and maintain replication is a total pain. We would often have to restart our replication so then we had these periods where we would have a day with no backups and we were really kind of playing with fire there.
Deployments were complex because updating the database took so much time that we would have to walk this line of when do we deploy our code versus when do we update our database. It wasn’t a fun time.
The way we’ve changed is we use an “open schema”. Sometimes it’s called “entity attribute value”. It’s basically a large key value store. We have two types of tables for every data type. There is a “thing” table, and then a “data” table. Everything in Reddit is comprised of what we call things: users, links, comments, sub-Reddit’s, awards.
Everything on Reddit is a thing. The schema for those elements look the same. It looks like this top table here: ups, downs, a type, a creation date, some properties that are fundamental across all of the objects in Reddit.
Then we have what’s called the “data” table, which is basically this huge table with three columns: the thing idea we’re talking about is the left-most column, then a key, and a value. For example, these two links would be represented by two links in a thing table, and then one row in the data table for every value on that link. There would be a key for title, and a value for that title for that link; and a key for URL and a key for the author, and then a key for how many spam votes that are on it.
What this allowed us to do is whenever we add new features; we don’t have to deal with the database anymore. We just store new data; we just add more properties to whatever things we’re storing. If we add a new thing, we don’t need to add new tables anymore.
It kind of frees of from all those headaches of how are we doing to update our database to fit this new feature; how are we going to maintain this replication; how are we going to distribute this? All of those problems basically went away.
One of the things we don’t do anymore is we don’t do any joins in the database anymore, at least not in those databases so it becomes really easy to distribute. We can put different chunks of data on different machines, and it scales really nicely.
We don’t have to worry about foreign keys and doing joins, and how are we going to split this piece of data up. It just all splits up very nicely and it made life a lot simpler for us.
The only downside is you’re not using – we’re using Postgres, and if you’re using Postgres on MySQL, they’re designed to store data relationally and we’re not using it in that way, so we don’t get to take advantage of all the cool relational things that those databases do. We have to maintain consistency ourselves because we’re just storing chunks of data that isn’t related to anything. But in the long run, it’s worked out really well for us, and I’m really happy we made that switch.
If you’re using Google App Engine, or you’re on Amazon and using SimpleDB, this is basically where the trend is going. This is the type of storage that you get with them, this document-based storage. If you’re using CouchDB, it’s up and coming. I don’t know if it’s ready for production yet, but they’re getting there. Hopefully, the worries of using a relational database are kind of a thing of the past.
4: Keep it Stateless
The goal here is for every app server to be able to handle any type of request. In the early days, we had only one application server. It was this list process that ran on one machine and just served all of our requests, and for caching, we would just store data we needed in some hash table in memory. It worked great; it was super fast.
Then eventually we switched to Python and kind of maintained these bad habits. When we added a second machine, all of a sudden, we were totally screwed. We had this cache that we were depending on in one app server and we had a second app server and all of a sudden it doesn’t have access to that cache that was in memory. We had to do all these hacks to kind of keep those caches in sync with each other.
We were basically wasting memory. Every time we added a server, we were duplicating the cache. We had all this extra memory storing the same data over, and over again. It was really bad. We couldn’t just switch the Memcache at the time because we stored so many things in the cache at such a granular level that it was too slow.
When we finally rewrote things, we switched the Memcache and we don’t store any state on any of the app servers. This makes life so much easier. Any app server can fail, or freeze, or restart, and it doesn’t affect anything. That app server wasn’t important. It’s not storing anything specific. It’s just generating HTML. Scaling is very easy; just add more app servers. Everything grows nicely. You don’t have to think about it anymore. Of course, as I was getting at before, the caching layer must be independent from the app server layer if you want this to work, which requires something like Memcache, which is lesson 5.
5: Memcache
We use Memcache for everything. I mean, I assume most of you use Memcache for everything also. It is the Swiss Army Knife of storing things. We use it for traditional things like storing database data; all of our queries on Reddit are generated by the same piece of code. That code caches every single query through Memcache and it works great. It’s pretty obvious stuff.
Any session data we have, we don’t have a whole lot but things like when you go to change your password and we generate “Click on this link to reset your password,” that link, all the state associated with that link is stored in Memcache and it only survives for 20 minutes or so.
It’s the same with CAPTCHA. We use Memcache as kind of a temporary storage for links that we don’t want to work forever: memoizing internal functions, memoizing – if you have a long function, you just wrap it in Memcache so you call it once, it computes whatever it needs to compute and from then on it looks up the answer out of the cache. We use this all over the place in Reddit; it’s kind of built into our framework. We’re doing normalized pages and listings, and everything.
One of the cool things we do with Memcache is rate limiting. When you store something in Memcache, you can give an expiration time of an hour, a minute, or down to one second. We rate limit everything.
This is a really good way to keep your site online. Many times, Reddit would be going fine, everything is going great, we had our couple app servers and then some guy comes along and is like, “Let’s see how much damage we can do to Reddit today,” and would pound and pound us with requests. We had no rate limiting protection. That would take us down and it’s not really acceptable to be at the mercy of somebody like that, who just wants to whack some slow page of yours.
What we do is for any user who comes in, or specifically search engines too, we just keep a note of it in Memcache, that this person came. This person was here, and we set that with an expiration time of one second. If they come back within that second we just say go away. Regular users don’t notice it. Humans don’t click that fast when they’re clicking around.
But, the Google crawler will hit you as hard as you’ll let it. When things get bad, we just crank up the rate limiter and it quiets everything down without ruining the site for everybody else. It ruins it for just one person and that person was probably up to no good in the first place, so that’s great.
Storing pre-computed listings – on Reddit, everything is a listing. You’ve got the front page, the inbox, and comment pages. All of those are pre-generated and dumped in the cache so that when you load a page on Reddit, we just pull it out of the cache and it’s ready to go.
Every link, every comment is stored in probably a hundred different versions. If you have a link with one point and then somebody votes up on it, now it’s rendered with an orange up arrow and it has 2 points, and it’s one minute old instead of 30 seconds old; that’s a new version of that rendered link in the cache and we just dump everything in the cache.
When we show links, we say they’re 1 second old, 2 seconds old, 3 seconds old. Each of those is its own rendered version. Eventually it goes from 1 minute to 2 minutes; we store fewer and fewer as we go, but every little piece of HTML comes from our cache so we’re not spending time wasting CPU rendering things. If things get slow, we just add more cache and it works great.
Global locking – when we’re messing with our really fragile, inconsistent database thing that I was describing before, we use Memcache to dump a key in Memcache and use that as a global lock to keep everything in sync. It’s not super reliable. Google probably wouldn’t do it that way, but we’re not Google. It works for us and it’s worked for us for quite a while. It works great.
The other software that if you’re not using is MemcacheDB, which is like Memcache but it’s persistent. Instead of being totally in memory and occasionally forgetting things, MemcacheDB actually writes things to disc. It’s very fast and super handy. We store far more data in MemcacheDB now then we do in Postgres because we use that as our staging area. This is my next point.
6: Store Redundant Data
The fast track to making a slow website is to have a totally normalized relational database and then on every request, get all this data from a bunch of different points and put it all together, and then render it. That takes forever on every single request.
If your data is displayed in multiple, different formats – for example, we have a link that might be on the front page or it might be in somebody’s inbox or be on their profile page. We store all of those listings separately. When somebody comes to get that piece of data, that listing page, that comment page, that inbox, it’s ready to go.
When we write it to disc, we write it to disc in something like – every listing has about 15 different sorts because you can sort it by hot, new, top, old, and then this week, this month, today, all time. Whenever somebody submits a link, we update all of the listings that link could possibly affect. It’s a little wasteful up front, but it’s better to be wasteful up front then to be wasteful while your users are waiting for you.
Wasting disc in memory is totally fine if you’re not making your users wait. Discs and memory are far cheaper than annoying your customers. That’s worked really well for us, and that was the key to speed for us; pre-compute everything and dump it in MemcacheDB or Memcache, or one of our bazillion slave databases.
7: Work Offline
Do the minimum amount of work on the backend to finish the request and then tell the user you’re done. If you need to do something else, do it while they’re not waiting for you. Dump it in a queue. When a user votes on Reddit, that updates all the listings that link is in, it updates a bunch of users’ karma, it updates his liked page, it updates all sorts of stuff.
When a user votes, we just update the caches, and we update the master database just enough to remember that vote happened, and then we dump a job in a queue. This queue has a bunch of works that pull from it, that do things like “This user votes. Now I need to update these 20 things.” It does it all in the backend and when the user comes back, everything has been pre-cached, and it’s ready to go. That works very nicely for us.
The type of things that we do offline, pre-computing listings that I’ve mentioned a couple of times – when you submit a link we download the thumbnail for that link. We don’t make the user wait on that. When you submit a link, our thumbnail job will get around to it shortly. It usually only takes a few seconds, but it will get there and it will download a thumbnail, store it, and if for some reason that job breaks, no big deal.
Detecting cheating is a big thing we do, all the time. The incentive now for cheating on Reddit is much higher, now that we have so much traffic; getting a link on the front page of Reddit will drive tens of thousands of clicks to whatever link you want. So, we spend a lot of time computing in the background, live, as people are voting, looking for invalid votes, and who is cheating.
We do all that offline, but we try to make sure it’s always cruising in the background, but not slowing down the user experience. Removing spam, computing awards, updating our search index, all that can be done offline. There is no need to do that while the user is waiting on you.
What this architecture looks like – everything in blue, all these blue arrows starting with the requests coming to our app servers, in the upper left there; the blue arrows is what we do when a request comes in. Somebody submits a link or vote. We update the cache, we update the master database, and then we dump it in our job queue. Then we return back to the user and say, “Congratulations on your …,” “Thanks for the spam,” or whatever.
Then all these pink arrows are basically things that we do offline. The spam, the pre-computing, the thumbnail all read from this queue. We update the cache as required; update the worker databases or the master databases as required. That keeps things snappy. We’ve had a really good experience with this. This was kind of the last main architecture change I made before I left Reddit and it’s really helped us grow, and organize our backend architecture a lot better.
The key piece of technology there, the protocol is called AMQP for the queues, and the implementation of it that we use is called RabbitMQ. It’s a good piece of software. Without it, we would be in a bad way, I think.
Those are basically all my lessons. If you have any questions, I would love to hear them.
Q: Is storing Memcache and MemcacheDB wasteful?
Steve: Yes and no, we store lots of duplicate information and we do that on purpose, so that when we need it, it’s there and ready to go quickly. Yes, we store lots of duplicate information but we do that specifically for speed. Nothing we store in Memcache or MemcacheDB is authoritative.
All of the authoritative data is in our master databases. Anything in Memcache or MemcacheDB can be recomputed or lost and it’s no big deal, but we try to keep those caches hot to keep things fast, even if it means storing the same piece of data all over the place, in different formats. We want each one to be ready to go in case somebody wants it.
Q: How do you get around not having to do joins on the database, because you need it sometimes? It’s there for a reason, so how do you get around it?
Steve: We often export the data from our databases. We actually have surprisingly few joins, but in the cases where we do, especially like awards, when we’re trying to figure out who had the top 10 comments of the day. That is a complicated one. We dump all of the relevant data, which would be like the last day of voting on links, for example, into a text file. Then we do the computation on that text file.
We’ve actually been using Hadoop, Amazon’s Hadoop implementation to compute awards. If we need to do a complicated query like that, we store the data, we dump our database, or at the right time we store it in a way that will make those joins possible down the road. That being said; we’ve tried to avoid doing joins as much as possible, and when the data comes in we store it in the way we’re going to need it. That’s worked much better than trying to do it at run time.
Ryan: Cool, that’s all we have time for. Please give him a big hand. Thanks Steve. Are you going to be around all day? Cool, so he’ll be here all day if you want to grab him at lunch or in the breaks, if you have more questions.
We're big fans of 
Ed
# May 8, 2010 - 3:16 pm
Thanks for posting. I always like hearing the tips from the guys who have actually done it.
redsonic
# May 9, 2010 - 1:17 pm
Great experience !
Aswani Kumar
# May 9, 2010 - 4:34 pm
nice inputs. thank you for sharing performance info. i guess i know all this except rabbitmq.
what about stored procedures? does sites like YouTube, LiveJournal, Wikipedia/Wikimedia, Amazon.com, Wikia, SourceForge, Metacafe, Facebook, Twitter, Fotolog, The Pirate Bay and Netlog. etc use them? i guess the answer is “no”.
please update the info on usage of stored procedures in high performance web sites.
Adam
# May 10, 2010 - 9:00 am
I can’t help but think the database structure they’re using is really messy. I can see the advantage when it comes to adding new features and not having to alter database tables but I can’t imagine how complicated it would make writing a query. For me, a normalised, relational database would be the way forward all day.
Coolface
# May 10, 2010 - 7:41 pm
They’re using both relational and non-relational databases.
In actuality what Steve described made perfect sense. You use relational databases for the basics and non-relational databases for everything you plan on fiddling with.
Like all methods of storing data they both have their ups, downs, and (intended) purpose.
Sachin
# May 10, 2010 - 9:30 am
awesome tips ..already bookmarked…thanks
Brade
# May 10, 2010 - 8:35 pm
Thanks for posting this — lots of good insight, especially the ideas about how specfically to use Memcache!
Ryan
# May 11, 2010 - 11:18 pm
As the business founder of my startup company, how realistic is this presentation for implementation in my startup. I do create many lists, but was this just a great thing for Reddit or could it be “copied” for other sites.
Ricky Thakrar
# May 30, 2010 - 10:11 pm
Best article on scalability I’ve read so far. Thanks!