Changes

Jump to navigation Jump to search
30 bytes removed ,  00:26, 1 June 2018
replace bad characters with simple apostrophe
== Introduction ==
In this paper we will look at the methods and architecture for serving a MediaWiki instance that is performant, scalable, and redundant. We'll also touch on the related operations that may be used to make a complete picture of the organization IT infrastructure. On the implementation side, we will be using the traditional GNU/Linux Free Software components: Linux, Apache, MySQL, PHP (LAMP), Squid/Varnish, LVS, Memcached, Nginx, etc. You won't need this information to run a single small wiki. But you will need this information if you aspire to provide a large-scale, performant, enterprise wiki.<ref>Mark Bergsma presented 'Wikimedia architecture' in 2008. http://www.haute-disponibilite.net/wp-content/uploads/2008/06/wikimedia-architecture.pdf Although the material is now dated, it provides a clear example of running an architecture that supports 3Gbits/s of data traffic and 30,000 HTTP requests/s on 350 commodity servers managed by 6 people.
</ref> One goal of this paper is to update the information at [[mw:Manual:MediaWiki architecture]].
In a standard web application architecture, the incoming user traffic is distributed through load balancers to a number of application servers that run independent instances. These application servers access a shared storage, a shared database and a shared cache. This architecture scales well up to a 6 figure number of users. The application servers are easy to scale because doubling the number of servers doubles the performance.
But scalability limitations can be found in the shared components. These are the load balancers, the database, the storage and the cache.<ref>Nextcloud introduces an architecture called 'Global Scale' that is designed to scale to hundreds of millions of users. https://nextcloud.com/blog/nextcloud-announces-global-scale-architecture-as-part-of-nextcloud-12/
</ref>
This reference architecture is also incomplete because it does not address any of the related aspects of how you must integrate this architecture into your operations. It is obviously important that you must have a means to deploy the software onto the system. It is equally important that you configure, monitor and control the infrastructure to adjust over time. Even if the infrastructure 'automatically' adjusts (failoverfail-over, scale-up, scale-down), you need to be able to monitor and know how these systems are performing. If the software is at all developed or deployed internally, then you must also integrate the Development, Software Quality Assurance / Testing, and Release Management disciplines. We can take this even further to address things like how does the architecture enable you to migrate to various geographic locations ([https://wikitech.wikimedia.org/wiki/Switch_Datacenter switch data center]) or fail over in catastophecatastrophe.
== Wikimedia Foundation ==
Looking at the Wikimedia Foundation's usage and implementation of technology gives us great insight about how to grow and scale to be a top ten Internet site, using commodity hardware plus free and open source infrastructure components. At VarnishCon 2016, Emanuele Rocca presents on running Wikipedia.org and details their operations engineering.<ref>https://upload.wikimedia.org/wikipedia/commons/d/d4/WMF_Traffic_Varnishcon_2016.pdf
</ref> Aside from the architecture above, here is another representation of their Web request flow [https://upload.wikimedia.org/wikipedia/commons/5/51/Wikipedia_webrequest_flow_2015-10.png from October 2015] and [https://upload.wikimedia.org/wikipedia/commons/d/d8/Wikimedia-servers-2010-12-28.svg architecture from 2010]
== Data Persistence ==
It's the job of the persistence layer to store the data of your application. For MediaWiki, the old [https://github.com/wikimedia/cdb Constant Database] (CDB) wrapper around PHP's [https://secure.php.net/manual/en/book.dba.php native PHP DBA functions] (which provides a flat file store like the [https://en.wikipedia.org/wiki/Berkeley_DB Berkeley DB] style databases) is now replaced by simple PHP arrays which are file included. This allows the HHVM opcode cache to precompile and cache these data structures. In MediaWiki, it is used for the [https://www.mediawiki.org/wiki/Interwiki_cache interwiki cache], and the localization cache. This is not to say that you will want or need WMF interwiki list, but having a performant cache for the interwiki links contained in ''your'' wiki farm<ref>e.g. see https://freephile.org/w/api.php?action=query&meta=siteinfo&siprop=interwikimap</ref> is probably important.
Persistent data is stored in the following ways:
Require email for sign-up
Instant Commons is a great feature if you want the millions of photos found at [https://commons.wikimedia.org/wiki/Main_Page Wikimedia Commons] available at your fingertips. If you don't, then you might still have your own collection of files/assets that you want to make available across your wiki farm. In that case, you want to use MediaWiki's federated file system which includes advanced configurations for [https://www.mediawiki.org/wiki/Manual:$wgLBFactoryConf Load Balanced] or local File repos<ref>[https://www.mediawiki.org/wiki/Manual:$wgForeignFileRepos $wgForeignFileRepos]
</ref>.
Wiki best practices for cutting down spam. If your server is busy with spam bots, then real users won't enjoy themselves.
=== Security considerations ===
You can't deploy an enterprise architecture without including security best practices in general; and the specific security practices relevant to your application. For example, make sure that users with the editinterface permission are trusted admins.<ref>https://www.mediawiki.org/wiki/Manual:Security
</ref>
== Containers ==
In order to scale, you not only need to have configuration management (etckeeper, git) and orchestration tools (Chef, Ansible, Puppet), but you need a way to package base or complete systems for easy reproducibility. The Docker or VirtualBox technologies allow you to do just that. So, for example, when you want to deploy a MariaDB Galera clustered solution for your database tier, you can use [https://github.com/instantlinux Rich Braun]'s docker script Instantlinux/mariadb-galera <ref>https://hub.docker.com/r/instantlinux/mariadb-galera/
</ref>
== Other Best Practices ==
Use naming conventions in your infrastructure so that you know what's what.<ref>https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions
</ref>
4,558

edits

Navigation menu