Next-Generation Resiliency
FOCUS | SEPTEMBER 2017
By Andy Lawrence, Executive Director, Uptime Institute & 451 Research, and Todd Traver, Vice President, IT Optimization and Strategy, Uptime Institute

Continued from Part 1: "Must-Read for the Data Center Industry: Next-Generation Resiliency (Part 1)"

None of these challenges is remotely new, and many systems for distributing data and for locking and unlocking databases were developed in the 1980s. (Early papers by engineers at IBM and Tandem, among others, are still available. Influential relational database pioneer Ted Codd published rules for distributed database management systems in the 1980s.) However, cloud providers that hold huge amounts of data in multiple locations, and that offer in- and out-of-region replication and backup, now have to deal with these issues on an altogether new scale.

Professor Eric Brewer of the University of California, Berkeley (now VP of Infrastructure at Google) identified a key issue. His theorem (see Figure 2) states that it is not possible to design a distributed system that guarantees both availability and complete integrity in the face of a network partition or the loss of a node.
The CAP theorem, also called Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously guarantee all three of the following attributes:

· Consistency: Every read receives the most recent write or an error.
· Availability: Every request receives a response, though without a guarantee that it contains the most recent version of the information.
· Partition tolerance: The system continues to operate despite arbitrary partitioning due to network failures.

Figure 2: CAP theorem

This theorem is important when it comes to resiliency planning using more than one active site. Organizations typically place a very high value on database accuracy, but availability is also critical for many applications, especially transactional, customer-facing ones. The theorem shows that, by moving to a distributed environment, a company may have to prioritize one guarantee over the other. Brewer's theorem also points to the critical importance of the network: if the network is highly available at all times, the need for that choice is reduced, if not eliminated. This explains why hyperscale operators such as Google have invested so heavily in intra-data center fiber and other networking equipment to ensure high availability and capacity.
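The trade-off can be illustrated with a toy model. The sketch below is not from the original article: it assumes a hypothetical two-site register, and the names `Replica`, `write`, `read` and the mode labels "CP" and "AP" are invented for illustration. It shows that while the link between sites is down, the second site must either refuse to answer (favoring consistency) or answer with possibly stale data (favoring availability).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    """One copy of a single value, tagged with a version counter."""
    value: str
    version: int = 0


def write(primary: Replica, secondary: Replica, value: str, partitioned: bool) -> None:
    """Apply a write at the primary; replicate only if the link is up."""
    primary.value, primary.version = value, primary.version + 1
    if not partitioned:
        secondary.value, secondary.version = primary.value, primary.version


def read(secondary: Replica, partitioned: bool, mode: str) -> Optional[str]:
    """Read from the secondary site.

    mode="CP": refuse to answer during a partition (None stands in for an
               error), because the local copy may be stale.
    mode="AP": always answer, accepting that the value may be stale.
    """
    if partitioned and mode == "CP":
        return None
    return secondary.value


site_a, site_b = Replica("v1"), Replica("v1")
write(site_a, site_b, "v2", partitioned=True)  # link is down; site_b misses the update

cp_answer = read(site_b, partitioned=True, mode="CP")
ap_answer = read(site_b, partitioned=True, mode="AP")
print(cp_answer)  # None: consistent, but not available
print(ap_answer)  # v1: available, but stale
```

A real system would also have to detect the partition and reconcile the two copies once the link returns; the model leaves both out to keep the choice itself visible.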
BASE, ACID and Databases

Until recently, the organizations that made most use of distributed resiliency were those for which even a small outage could be catastrophic. This group (investment banks, for example) writes all data to two data centers simultaneously (synchronous replication). While one set of data acts as the master, the second is a real-time copy, and if there is a failure, traffic is switched to the second site. There is no danger of an integrity issue, because the software only allows writes to one live master. Suppliers of databases, storage systems and software, such as IBM, HP, Hitachi, Oracle, EMC and others, have long engineered systems for this high-spending category.

Systems that allow no compromise on integrity are sometimes called ACID systems, to denote Atomicity (each transaction is all or nothing), Consistency (transactions complete according to all valid rules), Isolation (each transaction is isolated from others, as if performed sequentially) and Durability (a completed transaction is permanent). ACID favors consistency over all else.
When ACID databases work together, or if a single database is spread across multiple locations, protocols and processes ensure agreement between multiple endpoints before a transaction can go ahead. Recent so-called NewSQL databases, including Google's Spanner, replicate this behavior on a widely distributed scale, with some limited trade-offs.

In recent years, with the aid of lower-cost, homogeneous and virtualized architectures, it has become much easier (and cheaper) to replicate IT environments in several active data centers in different locations. This has led to the development of architectures that temporarily (usually momentarily) sacrifice integrity for availability if there is a contention issue. Processes are put in place to resolve any conflicts, in some cases reversing one of two transactions that may have happened independently of each other.

These database design architectures are known as BASE, to denote the characteristics of Basically Available, Soft State and Eventual Consistency. These architectures, supported by modern open source NoSQL databases such as MongoDB and Apache CouchDB, incorporate mechanisms for allowing and then resolving conflicting transactions.
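One simple way to resolve such conflicts is a "last write wins" rule, sketched below. This is an illustrative assumption, not a description of how MongoDB or CouchDB resolve conflicts, and the names `Versioned` and `resolve` are hypothetical. Each site keeps a value together with the timestamp and site name that wrote it; when two sites reconnect, both apply the same deterministic rule and therefore converge on the same value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value plus the (timestamp, site) pair that wrote it."""
    value: int
    timestamp: float
    site: str


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the later write; break ties by site name."""
    return max(a, b, key=lambda v: (v.timestamp, v.site))


# Two sites accept conflicting updates to the same record while out of contact.
at_site_a = Versioned(value=90, timestamp=100.0, site="A")
at_site_b = Versioned(value=75, timestamp=100.5, site="B")

# When the sites reconnect, both apply the same rule and converge.
winner_seen_by_a = resolve(at_site_a, at_site_b)
winner_seen_by_b = resolve(at_site_b, at_site_a)
print(winner_seen_by_a == winner_seen_by_b)  # True
print(winner_seen_by_a.value)                # 75
```

The write made at site A is discarded, which corresponds to reversing one of two independent transactions. That is acceptable for many workloads but is exactly what a trading or control system cannot tolerate.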
The use of BASE architectures is now very common, especially in cloud environments, and it tolerates failures effectively. But there are classes of application for which it is unsuitable: for example, trading systems or control situations where eventual resolution or reversible transactions are not acceptable. Even so, given that conflicts are often rare and easily resolved, this architecture is now being widely adopted, reducing costs and enabling more use of distributed architectures to improve resiliency.

BASE architectures rely very heavily on fast, reliable networks. The longer the latency, the more likely it is that conflicts between reads and writes from different users will occur. While these will mostly be resolved easily, too many conflicts could cause problems with clients or control systems in real-time networks. Some Internet of Things (IoT) applications will not sit comfortably on cloud platforms that use BASE architectures.

Types of Distributed Architecture

As we have seen, differing business requirements, including legacy investments, will influence the degree to which newer, distributed systems and databases can be used; similarly, the business requirements and the design of the existing systems will, to some extent, point toward certain resiliency architectures.
We see the models in Figure 3 being used for resiliency, with the cloud-based models being markedly different from the earlier ones.

Figure 3: Types of Distributed Architecture

Single-Site Availability

This is the traditional setup, with high levels of redundancy at the infrastructure level, including facilities and basic IT. With sufficient redundancy and planned design, operations can continue in spite of planned (concurrent maintainability), and in some cases unplanned, facilities failure. At the IT level, resilience is further assured by internal replication (e.g., clusters), so that loads may be replicated elsewhere and data, applications and configurations backed up to an offsite disaster recovery (DR) location.

Linked Site Resiliency

This describes two or more lower-tier data centers within a campus, region or zone using a dedicated network to achieve a higher level of availability than is possible at any individual site, typically within synchronous replication distance. (This means that the two data centers are near enough to each other and to customers that they are always synchronized. This distance will depend on the applications, but is usually less than 50 miles.) In order to achieve the same or higher level of facility availability as a high-availability single-site data center, linked sites may double up and share some less-critical infrastructure with nearby in-zone data centers. This assumes resilient and sufficient network capacity with predictable and independent pathways.
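A rough calculation shows why distance limits synchronous replication. The sketch below is a back-of-the-envelope estimate, not a figure from the article. It assumes that signals travel through optical fiber at roughly 200,000 km/s (about two-thirds of the speed of light in a vacuum), that the fiber route is as short as the straight-line distance (real routes are longer), and that switching and storage delays are zero. The function name `round_trip_ms` is hypothetical.

```python
MILES_TO_KM = 1.609344
# Assumption: signal speed in fiber is roughly 200,000 km/s, i.e. 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_miles: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    distance_km = distance_miles * MILES_TO_KM
    return 2 * distance_km / FIBER_KM_PER_MS


for miles in (10, 50, 500):
    print(f"{miles:>4} miles: at least {round_trip_ms(miles):.2f} ms per synchronous write")
```

Because a synchronous write is not acknowledged until the remote site confirms it, each write waits for at least one round trip. Under these assumptions that is about 0.8 ms at 50 miles and about 8 ms at 500 miles, before any equipment delay is added.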
In this configuration, concurrent maintainability (downtime at one site does not disrupt service) is possible as long as there is sufficient capacity, and processes are in place, to support full operations at either site. At the IT level, this setup can be used to support either synchronous replication (fault-tolerant automated failover to the second site) or asynchronous replication (a second copy of applications, data and files is kept at the second site to pick up the load).

Distributed Site Resiliency

This term describes two or more independent sites, in or out of region or globally distributed (cloud or not), using shared internet/VPN networks to provide resiliency through multiple asynchronously connected instances. This can produce very high availability but can result in some (usually minor) loss of integrity between instances if outages occur.

At the IT level, distributed site resiliency is the architecture that underpins most DR services, and especially the modern cloud iteration, DR as a service (DRaaS).
Improved network capacity, software tools, database synchronization protocols and, critically, homogeneous IT infrastructure running virtualized workloads have now made this option far more practical, flexible and economically feasible, both for active/active operations and for backup and recovery. As more distributed management technologies are added, distributed site resiliency can support or blur into cloud-based resiliency.

Cloud-Based Resiliency

This term describes resiliency provided by distributing virtualized applications, instances and/or containers with associated data across multiple data centers, using middleware, orchestration and distributed databases, under the control of a comprehensive and distributed control system. These systems will enable service or design choices to be made between, for example, absolute database integrity and immediate availability. Effectively, cloud-based resiliency moves the resiliency up to the IT level. Any facility resilience achieved through redundancy provides added security, but may not prove essential. It does, however, assume that there is sufficient capacity in place, including the network, which is critical if loads are shifted from place to place. Developers do not need to concern themselves with location or infrastructure; this architecture is primarily suited for stateless or cloud-native applications.
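The claim that facility-level redundancy may not prove essential can be made concrete with a simple probability estimate. The sketch below rests on strong assumptions that the article does not state: sites fail independently of one another, any single surviving site can carry the full load, and failover is instantaneous. The 99% per-site availability is an illustrative number only, and `combined_availability` is a hypothetical helper.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up."""
    return 1.0 - (1.0 - site_availability) ** sites


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, three sites that are each available 99% of the time yield a service that is unavailable only about one part in a million. In practice, shared networks, shared software and correlated failures reduce the benefit, which is why the capacity and network assumptions above matter so much.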
Clearly, each type of resiliency architecture described above fulfills different purposes and has a different profile in terms of objectives, cost, level of availability and technical maturity. Cloud-based resiliency is the newest, and currently the most expensive; it may provide good total cost of ownership, but effectively can only be achieved at scale and with considerable capital. The types are not mutually exclusive, at least at the facilities level.

For CIOs setting out to develop appropriate resiliency strategies, this is a challenging period, because direct engineering control is being eroded and replaced with a more nuanced and strategic approach that depends on good assessments.

With cloud services and architectures now part of the mix, or even the totality, the CIO must determine which type (or types) of resiliency is most appropriate for each type of application and data, based on business needs and technical risk, and then architect the best combination of IT infrastructure.
This will span data center resiliency, applications, databases and networking, and must take into account organizational structure, processes, tools and automation. From all this, the organization must then deliver comprehensive and consistent applications that meet or exceed customer expectations for service availability and resiliency.

(End of article)

Chinese translation: Zhang De (張德), elite member of the DKV (Deep Knowledge Volunteer) program; General Manager, Huawei Smart Data Center Management System
Editor: Liang Hongyan (梁鴻雁), Head of the Secretariat, Zhongnengce (Beijing) Technology Development Co., Ltd.