Scalability is ... There are a number of potential limitations regarding the scalability of MTECS. Each will be described in detail. Many of the limitations described here do not affect the typical web site. This information is provided for our more demanding users.
Contrary to popular belief, many CGI scripts are not processor intensive, but merely output fragments of pre-defined text. For example, forms and tables can be created from fragments of HTML stored in the program itself and style can be taken from a small number of files. None of this requires much processing power, but it is dependent on bandwidth from media to processor and from processor to network. The useful information is slotted between the template output. This forms the minority of the data but may require the majority of the processing power. It depends entirely on the task at hand.
MTECS Tier 2 [the shopping basket] requires a catalogue file. This is held as a flat "database" table without indexes. (While this appears perfectly acceptable to the lay person, this should immediately alarm any database expert.) Processing the output of the shopping basket requires that a shopping basket be "joined" with the catalogue, so that the current price for each item can be found. This requires one table to be scanned for each item listed in the other. Obviously, the basket can never have more items than the catalogue and therefore it would be prudent to scan the longest table, the catalogue, only once. The current implementation copies both tables to memory and therefore it does not matter which precedence is taken. The customer may prefer that additional items are appended because the display would be a chronology of items purchased. The customer may prefer that items are always represented in a pre-determined order matching the order of the uploaded catalogue.
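The join described above can be sketched as follows. This is a minimal illustration in Python rather than PERL, and not the MTECS implementation: the pipe-separated record layout (item code, description, price in pence), the function name and the sample data are all assumptions. It scans the catalogue once and lists items in the order they were added to the basket.

```python
def price_basket(catalogue_lines, basket):
    """Join a basket against a flat catalogue, scanning the catalogue only once.

    catalogue_lines: iterable of encoded records, "code|description|pence"
                     (an assumed layout, not the MTECS file format).
    basket: dict mapping item code -> quantity, in the order items were added.
    Returns a list of (code, quantity, pence) in basket order.
    """
    prices = {}
    for line in catalogue_lines:            # single pass over the longer table
        code, _description, pence = line.rstrip("\n").split("|")
        if code in basket:                  # dictionary lookup, not a nested scan
            prices[code] = int(pence)
    # Items missing from the catalogue are silently dropped in this sketch.
    return [(code, qty, prices[code]) for code, qty in basket.items() if code in prices]


catalogue = ["A1|Widget|250", "B2|Gadget|1000", "C3|Sprocket|75"]
basket = {"C3": 4, "A1": 1}                 # C3 was added first
rows = price_basket(catalogue, basket)
```

Emitting rows in catalogue order instead would only require collecting the rows inside the loop, which is the alternative ordering a customer may prefer.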
The shopping basket currently performs the former, although we may change it to the latter in future implementations. This would also reduce processing power slightly, as will be shown below. The shopping basket program is currently implemented in PERL. Using this language, it is standard practice to read all data into memory. To perform this for a large number of catalogue items requires a large amount of memory. Typically, each field would be held in its own variable; all records forming an array of hashes. Typically, a chunk of memory would be allocated for each piece of data, with further chunks allocated for the "hashes" that contain them.
Alternatively, you could hold each record in its encoded form, that is one string per record, and fully decode the fields when required. Rather than have a vast number of variables, only a few temporaries are required. Typical memory allocation schemes (such as GNU malloc) allocate memory in large chunks. Placing a lower limit on the size of an allocation reduces the number of small chunks of unallocated memory to track and therefore speeds operation significantly, but it does trade footprint for speed. Small memory allocations, such as for short strings, often require 1KB or more memory. To make matters worse, when real memory has been consumed and virtual memory is required, virtual memory is usually allocated in larger chunks of 8KB or more.
So, it can be seen that decoding a small, six field record of 50 bytes can require 6KB or more (six allocations of 1KB each). Should this memory be allocated non-consecutively, it may require 48KB or more (six pages of 8KB each) to be paged to and from virtual memory for each record held by each process. Any reduction of memory usage would increase scalability. For this reason, records are held in a partially decoded form. This greatly reduces the memory required to hold the data. For a six field table, such as the catalogue, this reduces the memory requirement to less than one sixth. The disadvantage of this technique is that a regular expression must be run multiple times to split the intermediate data, so processing power is traded for application footprint. With current processing speeds and memory architecture, holding data in this "compressed" form increases the ability to cope with heavy loads.
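One possible reading of this technique is sketched below in Python, for illustration only. The class name, the pipe delimiter and the six sample fields are assumptions, and the sketch keeps each record fully encoded rather than partially decoded. Each record is stored as a single string and is split with a regular expression every time a field is requested.

```python
import re

FIELD_SEPARATOR = re.compile(r"\|")   # assumed delimiter, not the MTECS format


class Catalogue:
    """Holds one encoded string per record and decodes fields only on demand."""

    def __init__(self, lines):
        # One string per record instead of one object per field.
        self.records = [line.rstrip("\n") for line in lines]

    def field(self, index, column):
        # The split is repeated on every access: processing power is
        # traded for a smaller footprint.
        return FIELD_SEPARATOR.split(self.records[index])[column]


catalogue = Catalogue([
    "A1|Widget|250|each|red|in stock\n",
    "B2|Gadget|1000|each|blue|sold out\n",
])
price = int(catalogue.field(1, 2))    # decode only the field that is needed
```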
Actually, the above is justification of an incremental design process, which happens to be quite efficient. The largest inefficiency of the current implementation is the fact that it is a stateless CGI script written in PERL. For each access of the CGI script, it requires the PERL binary to be initialised and the source code to be compiled before processing commences. Persistent web servers are available that keep the compiled code cached. If such a web server is employed, overheads are reduced, otherwise this initialisation and compilation process must be repeated with every access. Additionally, stateful code would cache common data between accesses, reducing overheads further. Users are free to make this adaptation themselves, although if there is sufficient demand, we will implement the changes. Regardless of state, in all the above cases PERL byte code must be interpreted.
This is a lesser overhead than at first imagined because the bytecode is optimised to the language and therefore compact. It requires relatively few instructions to be decoded to perform tasks; the overhead is minimal. The data structures within PERL appear haphazard and slow but are fairly efficient. The largest overhead that springs to mind is that string based hashes (as used for the root of the variable space) require computation before access can occur. This computation can be explicitly cached, although the additional memory required may make it slower than re-computing the hash. This non-obvious property applies to systems with associative memory caches, which have become the widespread majority. Hashing is an overhead of decreasing significance with increasing processing power, although it remains an unnecessary overhead at odds with traditionalists. For very high loads, it may be prudent to modify or re-implement the code.
As with all software, there are a number of causes for failure. Each program has different failure conditions due to its algorithm and its environment. MTECS is far from unique in its environment and language of implementation. This is an asset when planning resources and avoiding potential problems.
MTECS consists of CGI scripts written in PERL. Obviously, MTECS is dependent upon a functioning copy of PERL and a web server with a correctly configured CGI. Scripts must also be readable and executable by the web server. It is a common configuration mistake to overlook the fact that the scripts will run not with the permissions of the user that uploads the scripts, but with the permissions of the web server software. Failure to set permissions correctly will result in an inability to run MTECS. Failure to set permissions for the creation and update of sessions will result in no changes to the state of a user's shopping basket. The greatest causes of failure result in CGI errors or no change in state.
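A quick check for the script permission mistake can be sketched as follows. This is a hedged Python illustration, not part of MTECS. The function name is hypothetical, and it assumes the web server runs as a user that neither owns the script nor belongs to its group, so that only the "other" permission bits apply. Servers configured differently need a different check.

```python
import os
import stat


def web_server_can_run(script_path):
    """Report whether the 'other' permission bits allow read and execute.

    Assumption: the web server user is neither the owner of the file nor a
    member of its group, so only the 'other' bits are relevant.
    """
    mode = os.stat(script_path).st_mode
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed
```

A script uploaded with mode 0700 fails this check even though its owner can run it, which is exactly the mistake described above.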
It is also possible that CGI scripts may run low on memory on a heavily loaded system. This may result in erratic operation, including spurious errors due to failure to read the catalogue file. This can be corrected by repeating the action, which repeats access to the CGI script, although this is disconcerting to customers and should ideally be avoided. Should this situation occur, it is strongly recommended that the maximum number of web server threads be reduced so as to decrease contention for memory.
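A rough way to choose such a limit is to divide the real memory left over for scripts by the footprint of one script. The Python sketch below only illustrates that arithmetic. The function name and the figures in the example are assumptions, not measurements of MTECS.

```python
def max_concurrent_scripts(real_memory_mb, reserved_mb, per_script_mb):
    """Estimate how many CGI scripts fit in real memory at once.

    real_memory_mb: physical memory in the server.
    reserved_mb:    memory set aside for the operating system, cache and web server.
    per_script_mb:  measured footprint of one running script.
    """
    if per_script_mb <= 0:
        raise ValueError("per_script_mb must be positive")
    return max(0, (real_memory_mb - reserved_mb) // per_script_mb)


limit = max_concurrent_scripts(512, 128, 12)   # hypothetical figures
```

Setting the web server's concurrency limit at or below this estimate keeps scripts out of virtual memory under typical load.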
It is possible that CGI scripts may fail to execute entirely. This may result in error messages either stating that the web server is busy or that there existed a problem running the script. ...
Performance may also become decidedly non-linear, which may result in time-outs or, in extreme cases, system failure. The customer or the web browsing software may discontinue waiting for results. This condition may occur at any time, including when correcting an accidental selection during normal use. Such action will close the connection to the web server. For many web servers, this will terminate an associated CGI script. (MTECS is written with this condition in mind.) The web server software may also terminate CGI scripts after a pre-determined time, regardless of their progress. Obviously, this processing time cannot be recouped, and therefore merely creates additional load, reducing performance for other customers. On a sufficiently busy server, all scripts will terminate prematurely, preventing any stable access to the server. To prevent this problem, reduce the limit on the number of CGI scripts that can be accessed concurrently.
Performance may be greatly reduced when virtual memory is employed. A small increase in load may result in a vast decrease in performance, possibly leading to cascade failure. Budget memory use for typical load ...
For the above cases it is necessary to restrict the processes to improve the stability of the system. Obviously, this reduces the capacity of the system. Fortunately, it is possible to run multiple web servers
The software uses a common library to read and write catalogue and basket entries, which are held in ...
MTECS is written to the quality described here. Programs should be written to run with minimal configuration, although configuration should not be "hard coded". Should configuration not be present, programs must default to sensible and secure options. Additionally, configuration should be held in multiple files if possible. This simplifies the live updating of software when different versions have different configuration parameters. This practice also increases resilience to misconfiguration and corruption.
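These rules can be sketched in a few lines. The Python below is an illustration only. The "key=value" file format, the parameter names and their default values are all hypothetical and are not taken from MTECS. Missing files, malformed lines and unknown parameters all leave the built-in defaults in place.

```python
# Hypothetical parameters with conservative defaults.
DEFAULTS = {"session_dir": "sessions", "max_items": "100", "allow_uploads": "no"}


def load_config(paths):
    """Merge 'key=value' files over built-in defaults; absent files are ignored."""
    config = dict(DEFAULTS)
    for path in paths:
        try:
            with open(path) as handle:
                lines = handle.readlines()
        except OSError:
            continue                          # missing or unreadable: keep defaults
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue                      # skip blanks, comments, malformed lines
            key, value = (part.strip() for part in line.split("=", 1))
            if key in DEFAULTS:               # parameters of other versions are ignored
                config[key] = value
    return config
```

Because each file is optional, one version of the software can be replaced by another without its configuration being present first.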
Tiers of MTECS can be implemented as CGI scripts. CGI scripts can be regarded as stateless because no internal state survives between individual accesses. CGI scripts are able to modify files and are therefore able to transfer state in this manner. Additionally, it is the normal operation of web server software to terminate scripts that appear to have stalled or whose connection is determined to have been severed. Any script writing data may be terminated prematurely at any stage.
Fortunately, it is possible to make trivial changes from one source atomic. One temporary file can be created on the same filing system, usually in the same directory. After the file has been closed, it can be re-named as the live file. This allows the file to be updated atomically, in one operation. Should the script terminate, through error or otherwise, the update is not committed. (In general, this atomic operation only applies to one file because the script may terminate before subsequent files are updated.) On Unix servers, it is possible to continue access to an open file that is deleted. It is only truly deleted after it is closed. This allows one program to update a file without affecting other programs. No semaphore or locking mechanisms are required.
It may be desirable for the temporary file to be given the same name every time. Should a script terminate prematurely, it will leave one temporary file, which will be overwritten with subsequent attempts to update the file. In this case, there will be at most one temporary file of any size for every live file. It may be desirable for each temporary file to be given a unique name. This allows updates from multiple independent sources. Although some concurrent changes will not be committed, state will always be from one source.
When changes occur from one source, the former option is desirable because the number of temporary files never exceeds live files.
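The update scheme described above can be sketched as follows. This is a minimal Python illustration, not the MTECS library. The function name, the ".tmp" suffix and the use of the process number to make a name unique are assumptions. The rename is atomic on POSIX systems provided both names are on the same filing system.

```python
import os


def atomic_write(live_path, data, unique=False):
    """Write data to a temporary file beside live_path, then rename it into place.

    unique=False reuses one temporary name, so at most one stray file is left
    per live file. unique=True adds the process number, which suits updates
    from multiple independent sources.
    """
    suffix = ".tmp.%d" % os.getpid() if unique else ".tmp"
    temp_path = live_path + suffix
    with open(temp_path, "w") as handle:
        handle.write(data)                 # a crash here leaves the live file untouched
    # The temporary file is closed at this point; the rename commits the update.
    os.replace(temp_path, live_path)
```

A reader that already has the old file open continues to see the old contents, so no locking is needed for this single-file case.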
The required network bandwidth is smaller than expected. MTECS can be styled to match a web site and this includes sharing the images and style sheets from the web site. Images typically form the majority of data transferred (by volume) when accessing a web site, so any software that reuses such images has conservative network bandwidth usage. The remaining consideration is the content to which style has been added.
ECommerce applications typically generate tabular data, such as items to be purchased. This data is represented in HTML tables, and therefore requires a deceptively large volume of data to describe. This would not be a major consideration if the system were only occasionally used. The importance of the situation is multiplied when numerous people intend to purchase a number of items at the same time. Additional network bandwidth required should take into account the frequency of potentially large purchases, which varies with the nature of the web site.
An order is created incrementally via the shopping basket, which provides feedback to the user. Each item added causes the whole basket to be displayed again. Therefore, during normal operation, the worst case of network bandwidth required is O(n²) for n items ordered.
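The quadratic growth follows from a simple sum, sketched here in Python for illustration. It assumes that the complete basket is displayed again after every addition and that each item occupies one table row. The function name is hypothetical.

```python
def rows_transferred(n):
    """Total basket rows sent while building an order of n items.

    The basket holds 1 row after the first addition, 2 after the second and
    so on, so the total is 1 + 2 + ... + n = n(n + 1) / 2.
    """
    return sum(range(1, n + 1))


ten_items = rows_transferred(10)
hundred_items = rows_transferred(100)
```

A tenfold increase in order size therefore transfers roughly a hundred times as many rows.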
The bandwidth required to access data from storage may place a limitation upon the maximum throughput of the system. It is possible for data to be stored elsewhere on a network, therefore storage bandwidth may also be limited by a network connection. A storage network may be employed to minimise this overhead while increasing security.
A cache may greatly improve the access time to data and this is a typical feature of a server operating system. When reading from media, a sufficiently large cache can eliminate repeated access to commonly used files. In the case of ECommerce, it would be useful to cache the catalogue data and recent sessions. This would increase performance of the software because much of the data would be to hand. Unfortunately, a cache is unable to determine which data will be accessed again, and is therefore undiscriminating. For maximum performance, we would have to revise the size of our cache to account for all miscellaneous access.
A cache is usually a pool of main memory, although it may be dedicated hardware. When a cache is used in conjunction with virtual memory, the remaining memory becomes increasingly over committed. The size of main memory may need to be revised. An inadequate cache may lead to "churn": a sufficiently large amount of data renders the cache useless and may hinder performance. Accessing a sufficiently large file may be enough to churn a cache. A competent system administrator will take action to ensure that caching is efficient.
Virtual memory has a related problem: "thrashing". Thrashing occurs when a system has too many processes attempting to use too much virtual memory. Processes that work fine independently are competing for finite resources with the result that the system "runs like treacle". When a process accesses data that is not cached, it "sleeps" until the data is available. Meanwhile, other processes run. Virtual memory creates a more fundamental problem because information about a process is not always available.
A heavily loaded web server may have many copies of the web server software running to ensure prompt response. Each copy may be running CGI scripts, such as ECommerce software. Each script may be accessing data, such as the catalogue, which may be large. Should thrashing occur in this situation, the web site service would be disrupted and business may be lost.
Oddly, thrashing may be caused by third parties. To achieve economies of scale, many web sites are "virtual hosted", that is one computer serving many web sites. The load on such a web server is the sum of load on each web site plus miscellaneous activity, such as receiving EMail. This load varies throughout the day, week and seasons. Fortunately, EMail is not intensive and few web sites are active. For technical reasons, few web servers virtual host more than 254 web sites and of these sites as few as five may be actively used. Increased demand for one web site may decrease performance for the others.
Operating systems and web server programs typically have limits to restrict such an event. CGI scripts, written by the public, may perform reasonable actions that are not suited to a heavily loaded web server. One such operation would be to read a very large text file into main memory. Extreme cases may lead to reduced performance, fleeting unsuccessful access or system failure.
The biggest overhead in storage is the fact that most filing systems allocate at least one disk sector per file. Sectors are typically 1KB, 2KB or 4KB each, although they may be 8KB or more. There is no reason not to have millions of shopping baskets pending because the shopping basket software can easily cope. The limitations in scalability occur outside of the shopping basket software.
The number of shopping basket sessions pending may grow considerably during normal use. On some systems, inspection of the session directory may cause a false alarm. Many filing systems store file names in a hash, which is very efficient when the name of the file is known. This is the typical case during operation of MTECS Tier 2. The hash process means that file names are not stored in any particular order. Enumeration of the names in a directory generates a list that requires sorting.
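The storage overhead can be estimated with a little arithmetic, sketched here in Python. The function name and the figures in the example are assumptions for illustration. Real filing systems also consume space for directory entries and other metadata, which this sketch ignores.

```python
def session_storage_bytes(sessions, file_bytes, sector_bytes=4096):
    """Estimate disk space used when every session file occupies whole sectors."""
    # Round up to whole sectors, with a minimum of one sector per file.
    sectors_per_file = max(1, -(-file_bytes // sector_bytes))
    return sessions * sectors_per_file * sector_bytes


# One million pending baskets of 200 bytes each on 4KB sectors (hypothetical).
total = session_storage_bytes(1_000_000, 200)
```

In this example the baskets hold 200MB of data but occupy roughly 4GB of disk, which is why the sector size matters more than the basket size.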
Sorting is not a linear process. This is acceptable for relatively small directories, but for large directories it may create a pregnant pause before a directory listing is generated. Do not be alarmed, this is not the typical access time. The directory listing program may have an option to provide unsorted output. Again, this is not typical access time, but the time to enumerate the entire directory.
To minimise the risk of the corruption of the data, the directory may not be stored in one place. This is not significant when accessing a named resource, but when listing a directory it may give the impression of lower performance. The typical case of access is easily revealed by attempting to access files. Attempting to access files that do not exist may also be insightful.
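The difference between listing a directory and accessing a named file can be seen with a short experiment, sketched here in Python. The function names are hypothetical, and the session directory and session identifier are whatever the installation uses. `os.scandir` returns entries in arbitrary order, so no sorting cost is incurred.

```python
import os


def count_sessions(session_dir):
    """Enumerate the whole directory without sorting the names."""
    with os.scandir(session_dir) as entries:
        return sum(1 for entry in entries if entry.is_file())


def session_exists(session_dir, session_id):
    """Named access: a single lookup, with no enumeration of the directory."""
    return os.path.isfile(os.path.join(session_dir, session_id))
```

Timing `session_exists` for names that do and do not exist gives a fairer picture of typical access than timing a sorted directory listing.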
Estimating the additional bandwidth required when adding an ECommerce application will not be very accurate. Very few additional ...
Xirium, Penthouse Suite, 102 Long Gore, Farncombe, Surrey GU7 3TD, England, UK
Telephone: +44 1483 415 485 Mobile: +44 79 7779 1430 EMail: webmaster@xirium.com WWW: http://www.xirium.com/