
Tumblr Architecture - 15 Billion Page Views a Month and Harder to Scale than Twitter
With over 15 billion page views a month, Tumblr has become a wildly popular blogging platform. Users may love Tumblr for its simplicity, its beauty, its strong focus on user experience, and its friendliness.

Growing at more than 30% a month has not been without challenges, reliability among them. Tumblr operates at a formidable scale: 500 million page views a day, a peak of roughly 40K requests per second, about 3TB of new data to store every day, all running on more than 1,000 servers. A common pattern among successful startups is that rapid growth leaves dangerous problems in its wake. Hiring, growing the infrastructure, maintaining the old infrastructure, all while handling huge month-over-month traffic increases, and with only 4 engineers, means you have to make hard choices about what to work on. That was Tumblr's situation. Now, with about 20 engineers, there is enough capacity to work on the problems and to develop some very interesting solutions.

Tumblr started as a fairly typical large LAMP (Linux + Apache + MySQL + PHP) application. The direction they are moving in now is a distributed services model built around Scala, HBase, Redis, Kafka, Finagle, and an intriguing cell-based architecture that powers their Dashboard. The theme at Tumblr is transition at massive scale: from LAMP to a fairly bleeding-edge stack, and from a small startup team to a fully armed and ready development team churning out new features and infrastructure. Helping us understand this transition is a startup veteran who is now a Distributed Systems Engineer at Tumblr.

Site: http://www.tumblr.com
500 million page views a day
15+ billion page views per month
~20 engineers
Peak rate of ~40K requests per second
1+ TB of data per day into the Hadoop cluster
Many TB of data per day into MySQL/HBase/Redis/memcache
Growing at 30% a month
~1000 hardware nodes in production
Billions of page visits per month per engineer
Roughly 50GB of new post data a day; follower-list updates run about 2.7TB a day
The Dashboard does a million writes a second and 50K reads a second, and it is still growing
OS X for development, Linux (CentOS, Scientific) for production
PHP, Scala, Ruby
Redis, HBase, MySQL
Varnish, HA-Proxy, nginx
Memcache, Gearman, Kafka, Kestrel, Finagle
Thrift, HTTP
Func - a secure, scriptable remote control framework and API
Git, Capistrano, Puppet, Jenkins
500 web servers
200 database servers (many of these are part of a spare pool we pulled from for failures)
30 memcache servers
22 redis servers
15 varnish servers
25 haproxy nodes
14 job queue servers (kestrel + gearman)
Tumblr has a different usage pattern than other social networks. Its servers are hosted in a single location; geographic distribution is being considered for the future.
More than 50 million posts are published every day, and the average post is delivered to hundreds of users. It is not just one or two users with millions of followers; the typical user in Tumblr's graph has hundreds of followers. This is unlike other social networks, and it makes Tumblr's scaling challenge different.
As the number-two social network in terms of time spent on site, Tumblr's content is engaging: images, videos, and posts with real depth that users spend hours reading.
Once a user follows other users, thousands of readable posts can show up in their Dashboard, far more than on other social networks.
The implication: given the huge user base, the average reach of each user, and the highly active posting behavior, there is an enormous volume of updates to handle.
The two major components of the Tumblr platform are the public Tumblelog and the Dashboard.
The public Tumblelog is what is traditionally thought of as a blog. The content is relatively static and easy to cache.
The Dashboard is similar to the Twitter timeline. Users see real-time updates from the users they follow.
Its scaling characteristics are very different from the blogs. Caching isn't as useful because every request is different, especially for highly active followers.
It also needs to be real-time and consistent; stale data should not be shown. And there is a lot of data to deal with: posts add roughly 50GB a day, follower-list updates run about 2.7TB a day, and media is stored on S3.
Most users use Tumblr as a tool for consuming content. Of the 500+ million page views a day, 70% are for the Dashboard.
Dashboard availability has been quite good. The Tumblelog hasn't been as good, because it runs on legacy infrastructure that is hard to migrate away from.
Old Tumblr
When the company started it was hosted on Rackspace, and every custom-domain blog was given an A record. They quickly outgrew Rackspace; by 2007 they had a large number of users. Custom domains are still on Rackspace, but traffic is routed via HAProxy and Varnish from Rackspace back to servers in their own data center. There are many legacy issues like this.
The architecture evolved in a traditional LAMP way. The Dashboard used a scatter-gather approach: when a user visited the Dashboard, content was pulled from the shards and assembled for display. That scheme had about six months of runway; because the data is time-ordered, a sharded-table design didn't work well.
Originally developed in PHP; nearly every engineer programs in PHP.
Started with one web server, one database server, and a PHP application.
To scale, they added memcache, then front-end caching, then HAProxy in front of the caches, then MySQL sharding. MySQL sharding helped enormously.
They squeezed everything they could out of each box. In the past year they developed a couple of backend services in C, and use Redis to power Dashboard notifications.
New Tumblr
With rapid growth and hiring, they moved to a JVM-centric approach.
The goal is to move everything out of the PHP application into services, leaving the application as a thin layer over services that handles request authentication, presentation, and so on.
Scala and Finagle selection. Internal services are moving from being C/libevent based to being Scala/Finagle based.
Internally many people had experience with PHP and Ruby, so Scala was appealing.
Finagle came along with the choice of Scala almost by default. It is a library from Twitter that handles most of the distributed-systems concerns, such as distributed tracing, service discovery, and service registration. You don't have to implement any of that yourself; you get it for free.
Finagle runs on the JVM and provides the building blocks they need (Thrift, ZooKeeper, etc.).
Foursquare and Twitter use Finagle; Meetup uses Scala.
They like the Thrift application interface, and it performs well.
They liked Netty but didn't want Java, so Scala was a good choice.
Finagle was chosen because it is cool, it works without writing a lot of networking code, and it does everything needed in a distributed system.
Node.js wasn't chosen because it is easier to scale a team on the JVM, and Node.js did not yet have mature practices, conventions, or a large body of well-tested, high-quality code. With Scala you can use all the existing Java code without needing deep expertise in scalable programming, and their targets are 5ms response times, four nines of availability, and 40K requests per second, sometimes spiking toward 400K; the Java ecosystem has a lot they can draw on.
More recently, non-relational stores such as HBase and Redis are being used, but the bulk of their data currently lives in a heavily sharded MySQL architecture. HBase has not replaced MySQL.
HBase backs their URL shortener with billions of rows, plus all their historical data and analytics. It has been rock solid. HBase is used where write demand is very high, like the million writes a second for the Dashboard replacement. HBase wasn't used to replace MySQL because they couldn't bet the business on it with the people they had, so they built up experience with it on smaller, less critical projects first.
For time-ordered data, the problem with MySQL sharding is that one shard is always too hot, and parallel inserts on the master caused read replication lag on the slaves.
They created a common services framework. The front end uses HAProxy, and Varnish serves the public blogs on 40 machines.
They spent a lot of time up front on the operational problem of how to manage distributed systems.
They built a kind of Rails scaffolding, but for services: a template used internally to bootstrap new services.
From an operations point of view, every service looks the same. Checking statistics, monitoring, and starting and stopping every service works the same way.
Tooling is built around the SBT build process (SBT is a Scala build tool), using plugins and helpers to take care of routine tasks like tagging in git and publishing to the repository. Most developers never need to touch the guts of the build system.
500 web servers run Apache and the PHP application.
200 database servers. Many of them exist for high availability. Commodity hardware is used and the MTBF is surprisingly low; much more hardware than expected is lost, so there are many spares on hand for failures.
6 backend services support the PHP application, with a team dedicated to developing them. A new service rolls out every 2-3 weeks, including Dashboard notifications, the Dashboard secondary index, URL shortening, and a transparent memcache proxy to handle sharding.
A great deal of time, effort, and tooling has gone into MySQL sharding. MongoDB is not used, even though it is popular in New York (Tumblr's location); MySQL scales well enough for them.
Gearman, a job queue system, has been in use for a long time and largely just works without needing attention.
Availability is measured in terms of reach: can a user reach their custom domain or their Dashboard? It is also measured as an error rate.
Historically, whatever was highest priority got fixed first. Now failure modes are analyzed and addressed systematically. The intent is to measure success from the perspective of the user and the application: is each part of a request fulfilling its contract?
Initially an Actor model was used on top of Finagle, but it was later dropped. For fire-and-forget work a job queue is used. In addition, Twitter's utility library contains a Futures implementation, and services are implemented in terms of futures. In situations where a thread pool is needed, futures are passed into a future pool. Everything is submitted to the future pool for asynchronous execution.
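As a minimal sketch of that future-pool pattern, assuming Twitter's com.twitter.util library; the blocking call and all names here are illustrative, not Tumblr's code:

```scala
import java.util.concurrent.Executors

import com.twitter.util.{Await, Future, FuturePool}

object FuturePoolSketch {
  // FuturePool wraps an ExecutorService so blocking work can be turned into a
  // composable Future instead of tying up a Finagle worker thread.
  val pool: FuturePool = FuturePool(Executors.newFixedThreadPool(8))

  // Hypothetical blocking call standing in for a database read.
  def blockingFetch(id: Long): String = {
    Thread.sleep(5)
    s"row-$id"
  }

  // Submit the blocking call to the future pool for asynchronous execution.
  def fetch(id: Long): Future[String] = pool { blockingFetch(id) }

  def main(args: Array[String]): Unit = {
    val dashboardRow: Future[String] = fetch(42L).map(_.toUpperCase)
    println(Await.result(dashboardRow))
  }
}
```

The point is that blocking work is isolated on its own executor while callers only ever see and compose plain futures.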
Scala encourages no shared state. Finagle is assumed correct because it's tested by Twitter in production. Mutable state is avoided using constructs in Scala or Finagle. No long-running state machines are used. State is pulled from the database, used, and written back to the database. The advantage is that developers don't need to worry about threads or locks.
22 Redis servers. Each server has 8 - 32 instances so 100s of Redis instances are used in production.
Used for backend storage for dashboard notifications.
A notification is something like a user liked your post. Notifications show up in a user's dashboard to indicate actions other users have taken on their content.
High write ratio made MySQL a poor fit.
Notifications are ephemeral so it wouldn’t be horrible if they were dropped, so Redis was an acceptable choice for this function.
Gave them a chance to learn about Redis and get familiar with how it works.
Redis has been completely problem free and the community is great.
A Scala futures based interface for Redis was created. This functionality is now moving into their Cell Architecture.
URL shortener uses Redis as the first level cache and HBase as permanent storage (a rough sketch of this read path appears a few lines below).
Dashboard’s secondary index is built around Redis.
Redis is used as Gearman’s persistence layer using a memcache proxy built using Finagle.
Slowly moving from memcache to Redis. Would like to eventually settle on just one caching service. Performance is on par with memcache.
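Here is a rough sketch of the URL-shortener read path mentioned above, with Redis as a first-level cache in front of permanent storage, in the spirit of the Scala futures-based Redis interface. It assumes a blocking Jedis client pushed onto a Twitter FuturePool and a hypothetical PermanentStore trait standing in for the HBase table; none of these names are Tumblr's actual code.

```scala
import redis.clients.jedis.Jedis

import com.twitter.util.{Future, FuturePool}

// Hypothetical permanent store standing in for the HBase table that holds
// the authoritative short-code -> URL mappings.
trait PermanentStore {
  def lookup(shortCode: String): Future[Option[String]]
}

// A real implementation would use a connection pool; a single Jedis instance
// is used here only to keep the sketch short.
class ShortUrlResolver(redis: Jedis, store: PermanentStore, pool: FuturePool) {

  // Jedis is a blocking client, so its calls are pushed onto a future pool.
  private def cacheGet(code: String): Future[Option[String]] =
    pool { Option(redis.get(code)) }

  private def cachePut(code: String, url: String): Future[Unit] =
    pool { redis.set(code, url); () }

  // Read-through cache: try Redis first, fall back to permanent storage,
  // and repopulate the cache on a miss.
  def resolve(code: String): Future[Option[String]] =
    cacheGet(code).flatMap {
      case hit @ Some(_) => Future.value(hit)
      case None =>
        store.lookup(code).flatMap {
          case Some(url) => cachePut(code, url).map(_ => Some(url))
          case None      => Future.value(None)
        }
    }
}
```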
Internal Firehose
Internally, applications need access to the activity stream. An activity stream is information about users creating/deleting posts, liking/unliking posts, etc. A challenge is distributing so much data in real time. They wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed.
Previously this information was distributed using Scribe/Hadoop. Services would log into Scribe and begin tailing and then pipe that data into an app. This model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second. Didn’t want people tailing files and piping to grep.
An internal firehose was created as a message bus. Services and applications talk to the firehose via Thrift.
LinkedIn’s Kafka is used to store messages. Internally consumers use an HTTP stream to read from the firehose. MySQL wasn’t used because the sharding implementation is changing frequently so hitting it with a huge data stream is not a good idea.
The firehose model is very flexible, not like Twitter’s firehose in which data is assumed to be lost.
The firehose stream can be rewound in time. It retains a week of data. On connection it’s possible to specify the point in time to start reading.
Multiple clients can connect and each client won’t see duplicate data. Each client has a client ID. Kafka supports a consumer group idea. Each consumer in a consumer group gets its own messages and won’t see duplicates. Multiple clients can be created using the same consumer ID and clients won’t see duplicate data. This allows data to be processed independently and in parallel. Kafka uses ZooKeeper to periodically checkpoint how far a consumer has read.
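To make the consumer-group idea concrete, here is a minimal sketch in Scala using the modern Java Kafka client. Note this is the present-day API used purely for illustration: the Kafka of this era checkpointed consumer offsets in ZooKeeper as described above, and Tumblr's internal consumers actually read over an HTTP stream. The broker address, group ID, and topic name below are assumptions.

```scala
import java.time.Duration
import java.util.Properties

import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object FirehoseConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.put("group.id", "dashboard-cell-7")          // the consumer group / client ID
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("firehose")) // hypothetical topic name

    // Consumers sharing a group.id split the partitions between them and never
    // see each other's messages; a different group.id gets its own full copy of
    // the stream. That is how independent readers process the same firehose in
    // parallel without seeing duplicates.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala)
        println(s"offset=${record.offset} key=${record.key} value=${record.value}")
    }
  }
}
```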
Cell Design for Dashboard Inbox
The current scatter-gather model for providing Dashboard functionality has very limited runway. It won’t last much longer.
The solution is to move to an inbox model implemented using a cell-based architecture.
An inbox is the opposite of scatter-gather. A user's dashboard, made up of posts from followed users and actions taken by other users, is logically stored together in time order.
This solves the scatter-gather problem because it's an inbox: you just ask what is in the inbox, which is far less expensive than going to each of the users a user follows. This will scale for a very long time.
Rewriting the Dashboard is difficult. The data is distributed in nature, but it has a transactional quality: it's not OK for users to get partial updates.
The amount of data is incredible. Messages must be delivered to hundreds of different users on average, which is a very different problem than the one Facebook faces. Large data + high distribution rate + multiple data centers.
Spec'ed at a million writes a second and 50K reads a second. The data set grows by 2.7TB a day with no replication or compression turned on. The million writes a second come from the 24-byte row key that indicates what content is in the inbox.
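A rough back-of-the-envelope check (assuming the 24-byte inbox row key dominates the per-write volume):

    10^6 writes/s × 24 bytes ≈ 24 MB/s ≈ 2 TB/day

which is the same order of magnitude as the 2.7TB/day of growth quoted above.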
Doing this on an already popular application that has to be kept running.
Cells
Request flow: a user publishes a post, the post is written to the firehose, all of the cells consume the post and write the post content to their post database, each cell checks whether any followers of the post's creator are homed in that cell, and if so the followers' inboxes are updated with the post ID.
A cell is a self-contained installation that has all the data for a range of users. All the data necessary to render a user’s Dashboard is in the cell.
Users are mapped into cells. Many cells exist per data center.
Each cell has an HBase cluster, service cluster, and Redis caching cluster.
Users are homed to a cell and all cells consume all posts via firehose updates.
Each cell is Finagle based and populates HBase via the firehose and service requests over Thrift.
A user comes into the Dashboard; since users are homed to a particular cell, a service node in that cell reads their dashboard via HBase and passes the data back.
Background tasks consume from the firehose to populate tables and process requests.
A Redis caching layer is used for posts inside a cell.
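A minimal in-memory sketch of that write path follows. All types and names here are illustrative, not Tumblr's actual HBase schema: each cell keeps a single copy of every post plus a per-user inbox of post IDs, and only fans out to followers homed in that cell.

```scala
import scala.collection.mutable

// Illustrative firehose event: who posted, the post ID, the content, and when.
final case class PostEvent(postId: Long, authorId: Long, body: String, timestamp: Long)

// A cell is self-contained: it keeps one copy of every post plus an inbox of
// post IDs for each user homed to it.
final class Cell(homedUsers: Set[Long], followersOf: Long => Seq[Long]) {

  // One copy of every post seen on the firehose.
  private val posts = mutable.Map.empty[Long, PostEvent]

  // Per-user inbox holding only post IDs, newest first.
  private val inboxes = mutable.Map.empty[Long, mutable.ArrayDeque[Long]]

  // Every cell consumes every firehose event.
  def consume(event: PostEvent): Unit = {
    posts(event.postId) = event
    // Fan out only to followers who are homed in this cell.
    for (follower <- followersOf(event.authorId) if homedUsers(follower))
      inboxes.getOrElseUpdate(follower, mutable.ArrayDeque.empty).prepend(event.postId)
  }
}
```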
Advantages of cell design:
Massive scale requires parallelization and parallelization requires components be isolated from each other so there is no interaction. Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
Cells isolate failures. One cell failure does not impact other cells.
Cells enable nice things like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
The key idea that is easy to miss is: all posts are replicated to all cells.
Each cell stores a single copy of all posts. Each cell can completely satisfy a Dashboard rendering request. Applications don't ask for all the post IDs and then ask for the posts for those IDs; the cell can return the dashboard content for the user. Every cell has all the data needed to fulfill a Dashboard request without any cross-cell communication.
Two HBase tables are used: one stores a copy of each post. That data is small compared to the other table, which stores every post ID for every user within that cell. The second table describes what the user's dashboard looks like, which means they don't have to go fetch all the users a user is following. It also means read state is shared across clients, so viewing a post on a different device won't present content you've already read as new. With the inbox model, state can be kept on what you've read.
Posts are not put directly in the inbox because the size would be too great. So the ID is put in the inbox and the post content is put in the cell just once. This model greatly reduces the storage needed while making it simple to return a time-ordered view of a user's inbox. The downside is that each cell contains a complete copy of all posts. Surprisingly, posts are smaller than the inbox mappings: post growth is 50GB per cell per day, while the inbox grows at 2.7TB a day. Users consume more than they produce.
A user’s dashboard doesn’t contain the text of a post, just post IDs, and the majority of the growth is in the IDs.
As followers change, the design is safe because all posts are already in the cell. If only followed users' posts were stored in a cell, the cell would go out of date as followers changed and some sort of backfill process would be needed.
An alternative design is to use a separate post cluster to store post text. The downside of that design is that if the cluster goes down it impacts the entire site. Using the cell design, with posts replicated to all cells, creates a very robust architecture.
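Continuing the illustrative style of the write-path sketch above, a dashboard read under this model is a purely cell-local join of inbox IDs against the single post copy, and because the inbox is the source of truth, per-user read state can be tracked alongside it. This sketch assumes post IDs are roughly time-ordered, which holds for timestamp-based ID generators; everything here is hypothetical.

```scala
import scala.collection.mutable

// A cell-local dashboard read: join the user's inbox of post IDs against the
// cell's single copy of each post, and track read state per user. `Post` is
// whatever the cell stores for a post body.
final class DashboardReader[Post](
    inboxOf: Long => Seq[Long],    // newest-first post IDs for a user
    postOf: Long => Option[Post]   // lookup into the cell's post table
) {
  // userId -> newest post ID the user has already seen.
  private val lastRead = mutable.Map.empty[Long, Long]

  // Returns (post, alreadyRead) pairs for the top of the dashboard.
  def render(userId: Long, limit: Int): Seq[(Post, Boolean)] = {
    val seen = lastRead.getOrElse(userId, 0L)
    val ids  = inboxOf(userId).take(limit)
    val page = ids.flatMap(id => postOf(id).map(post => (post, id <= seen)))
    // Remember the newest ID so another device won't see the same posts as unread.
    ids.headOption.foreach(newest => lastRead(userId) = math.max(seen, newest))
    page
  }
}
```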
A user having millions of followers who are really active is handled by selectively materializing user feeds according to their access model. Cell size is hard to determine. The size of a cell is the impact zone of a failure; the number of users homed to a cell determines that impact. There's a tradeoff between what they are willing to accept for the user experience and how much it will cost.
Different users have different access models and distribution models that are appropriate. Two different distribution modes: one for popular users and one for everyone else.
Data is handled differently depending on the user type. Posts from very active users wouldn't actually be published into every inbox; they would be selectively materialized.
Users who follow millions of users are treated similarly to users who have millions of followers.
Reading from the firehose is the biggest network issue. Within a cell the network traffic is manageable.
As more cells are added cells can be placed into a cell group that reads from the firehose and then replicates to all cells within the group. A hierarchical replication scheme. This will also aid in moving to multiple datacenters.
On Being a Startup in New York
NY is a different environment. Lots of finance and advertising. Hiring is challenging because there’s not as much startup experience.
In the last few years NY has focused on helping startups. NYU and Columbia have programs for getting students interesting internships at startups instead of just going to Wall Street. Mayor Bloomberg is establishing a local campus focused on technology.
Team Structure
Teams: infrastructure, platform, SRE, product, web ops, services.
Infrastructure: Layer 5 and below. IP address and below, DNS, hardware provisioning.
Platform: core app development, SQL sharding, services, web operations.
SRE: sits between service team and web ops team. Focused on more immediate needs in terms of reliability and scalability.
Service team: focuses on things that are slightly more strategic, that are a month or two months out.
Web ops: responsible for problem detection and response, and tuning.
Software Deployment
Started with a set of rsync scripts that distributed the PHP application everywhere. Once the number of machines reached 200 the system started having problems, deploys took a long time to finish and machines would be in various states of the deploy process.
The next phase built the deploy process (development, staging, production) into their service stack using Capistrano. Worked for services on dozens of machines, but by connecting via SSH it started failing again when deploying to hundreds of machines.
Now a piece of coordination software runs on all machines. Based around Func from RedHat, a lightweight API for issuing commands to hosts. Scaling is built into Func.
Deployment is built on top of Func: say "do X" on a set of hosts, which avoids SSH. For example, to deploy software on group A, the master reaches out to a set of nodes and runs the deploy command.
The deploy command is implemented via Capistrano. It can do a git checkout or pull from the repository. It's easy to scale because they're talking HTTP. They like Capistrano because it supports simple directory-based versioning that works well with their PHP app. They are moving towards versioned updates, where each directory contains a version identifier so it's easy to check whether a version is correct.
The Func API is used to report back status, to say these machines have these software versions.
Safe to restart any of their services because they’ll drain off connections and then restart.
All features run in dark mode before activation.
Development
Started with the philosophy that anyone could use any tool that they wanted, but as the team grew that didn’t work. Onboarding new employees was very difficult, so they’ve standardized on a stack so they can get good with those, grow the team quickly, address production issues more quickly, and build up operations around them.
Process is roughly Scrum-like. Lightweight.
Every developer has a preconfigured development machine. It gets updates via Puppet.
Dev machines can roll changes, test, then roll out to staging, and then roll out to production.
Developers use vim and Textmate.
Testing is via code reviews for the PHP application.
On the service side they've implemented a testing infrastructure with commit hooks, Jenkins, continuous integration, and build notifications.
Hiring Process
Interviews usually avoid math, puzzles, and brain teasers. They try to ask questions focused on the work the candidate will actually do. Are they smart? Will they get stuff done? But "gets things done" is difficult to assess. The goal is to find great people rather than keep people out.
Focused on coding. They’ll ask for sample code. During phone interviews they will use Collabedit to write shared code.
Interviews are not confrontational, they just want to find the best people. Candidates get to use all their tools, like Google, during the interview. The idea is developers are at their best when they have tools so that’s how they run the interviews.
The challenge is finding people who have the scaling experience Tumblr's traffic levels require. Few companies in the world are working on the problems they are.
On the Tumblr Engineering Blog they've posted memorials paying their respects to influential technologists who have passed away. It's a geeky culture.
Example: for a new ID generator they needed a JVM process that generates service responses in less than 1ms, at a rate of 10K requests per second, within a 500MB RAM limit, with high availability. They found the serial collector gave the lowest latency for this particular workload. They spent a lot of time on JVM tuning.
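Tumblr's actual generator isn't shown in the article; purely as an illustration, here is a minimal Snowflake-style sketch of the kind of service this describes. The bit layout, epoch, and worker-ID width are assumptions. The point is that generation is a handful of bit operations with essentially no allocation, so latency ends up dominated by GC behaviour, which is where the serial-collector finding and the small heap come in.

```scala
// A minimal sketch of a low-latency unique-ID generator on the JVM.
// IDs pack a millisecond timestamp, a worker ID, and a per-millisecond
// sequence into one Long. Clock-skew handling is omitted for brevity.
final class IdGenerator(workerId: Long, epochMillis: Long) {
  require(workerId >= 0 && workerId < 1024, "worker id must fit in 10 bits")

  private[this] var lastTimestamp = -1L
  private[this] var sequence = 0L

  def nextId(): Long = synchronized {
    var now = System.currentTimeMillis()
    if (now == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF      // 12-bit sequence within one millisecond
      if (sequence == 0) {                   // sequence exhausted: spin to the next ms
        while (now <= lastTimestamp) now = System.currentTimeMillis()
      }
    } else {
      sequence = 0
    }
    lastTimestamp = now
    ((now - epochMillis) << 22) | (workerId << 12) | sequence
  }
}
```

Running a process like this with flags along the lines of -Xmx500m -XX:+UseSerialGC matches the constraints described above, though the exact flags Tumblr used aren't given.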
Lessons learned
Automation everywhere.
MySQL (plus sharding) scales, apps don't.
Redis is amazing.
Scala apps perform fantastically.
Scrap projects when you aren’t sure if they will work.
Don't hire people based on their survival through a useless technological gauntlet. Hire them because they fit your team and can do the job.
Select a stack that will help you hire the people you need.
Build around the skills of your team.
Read papers and blog posts. Key design ideas like the cell architecture and selective materialization were taken from elsewhere.
Ask your peers. They talked to engineers from Facebook, Twitter, LinkedIn about their experiences and learned from them. You may not have access to this level, but reach out to somebody somewhere.
Wade, don’t jump into technologies. They took pains to learn HBase and Redis before putting them into production by using them in pilot projects or in roles where the damage would be limited.
I'd like to thank Blake very much for the interview. He was very generous with his time and patient with his explanations. Please get in touch if you would like to talk about having your architecture profiled.
