Performance tuning strategy
Performance optimization is a big topic. In "Talking about Website Performance Technology from 12306.cn", I mentioned some available technologies and their advantages and disadvantages from the business and design perspectives. Today, I want to approach performance optimization from the technical details, mainly code-level techniques and methods. The content of this article is my own experience and knowledge, and is not necessarily all correct; I hope you will correct it and add to it.
Before starting, you may want to read the "Code Optimization Summary" previously published on Cool Shell. That article basically tells you: to optimize, you must first find the performance bottleneck! But before talking about how to locate system performance bottlenecks, let me first talk about how system performance is defined and tested, because without those two things, the locating and optimizing that follow have nowhere to start.
1、 System performance definition
Let's first talk about what system performance is. This definition is critical: if we do not know what system performance is, we will not be able to measure or locate it. In general, system performance is two things:
1. Throughput: the number of requests or tasks the system can process per second.
2. Latency: the delay incurred when the system processes a single request or task.
Generally speaking, system performance is constrained by both of these conditions, and neither can be ignored. For example, if my system can withstand one million concurrent requests but its latency is over two minutes, that one-million load is meaningless. Likewise, a very short latency with very low throughput is equally meaningless. Therefore, a good performance test must measure both values at the same time. Experienced readers will know the relationship between the two:
• The larger the throughput, the worse the latency. When the request volume is too large, the system is too busy, so the response speed naturally drops.
• The better (shorter) the latency, the higher the throughput that can be supported, because a short latency means fast processing, so more requests can be handled.
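A standard queueing-theory result, Little's Law, ties these two quantities to the number of requests in flight:

concurrency = throughput × latency

For example, a system handling 1,000 requests per second at an average latency of 0.2 seconds holds about 1,000 × 0.2 = 200 requests in flight at any moment.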
2、 System performance test
After the above description, we know that to test a system's performance, we need to collect its Throughput and Latency values.
• First, define the target Latency value. For example, a website's response time must be within 5 seconds (for some real-time systems it may need to be much shorter, such as within 5 ms; this is defined by the business).
• Secondly, prepare performance testing tools: one tool to generate high-intensity load, and another to measure latency. For the first tool, you can refer to "Ten Free Web Stress Testing Tools". For measuring latency, you can measure it in code (a sketch follows below), but this affects the program's execution and only measures the latency inside the program. The real latency covers the whole system, including the operating system and the network; you can use Wireshark to capture network packets and measure from there. How to use these two tools together, please think about for yourself.
• Finally, start the performance test. Keep raising the test throughput and observe the system load. While the system can withstand the load, observe the Latency value. In this way, you can find the maximum load of the system, and you will know the system's response delay.
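For the in-code measurement, a minimal sketch in C (assuming a POSIX system; handle_request is a hypothetical stand-in for the code under test):

#include <stdio.h>
#include <time.h>

static void handle_request(void) {
    /* hypothetical stand-in for the request handling being measured */
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  /* monotonic clock: immune to wall-clock jumps */
    handle_request();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L
            + (t1.tv_nsec - t0.tv_nsec) / 1000L;
    printf("latency: %ld us\n", us);
    return 0;
}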
A few more words:
• Regarding Latency: when throughput is low, this value tends to be very stable; as throughput gets higher and higher, latency starts to jitter severely. Therefore, when we measure Latency, we should pay attention to its distribution: what percentage falls within our acceptable range, what percentage exceeds it, and what percentage is totally unacceptable. The average Latency may meet the target while only 50% of requests are actually acceptable, and that is meaningless (a sketch of computing percentiles follows below).
• A performance test also needs to run for a defined period, for example 15 minutes at a given throughput. When the load first arrives the system is unstable; it takes a minute or two to settle. Moreover, your system may perform normally for the first few minutes under a load and then become unstable or even collapse. So it takes a while. This value is called the system's peak limit.
• Performance testing also needs a Soak Test, that is, running the system under a certain throughput for a week or even longer. That value is the load limit at which the system can operate normally.
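To make the distribution point concrete, a minimal sketch in C that computes percentiles from collected latency samples (the sample values are made up):

#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* p in [0,100]; the samples must already be sorted. */
static long percentile(const long *sorted, size_t n, double p) {
    size_t idx = (size_t)(p / 100.0 * (n - 1));
    return sorted[idx];
}

int main(void) {
    long lat_us[] = {12, 7, 430, 9, 15, 11, 8, 2200, 10, 13};
    size_t n = sizeof lat_us / sizeof *lat_us;
    qsort(lat_us, n, sizeof *lat_us, cmp_long);
    printf("p50=%ld us  p90=%ld us  max=%ld us\n",
           percentile(lat_us, n, 50), percentile(lat_us, n, 90), lat_us[n - 1]);
    return 0;
}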
There are many important things in performance testing, such as burst test. I can't go into details here. I just mention some things related to performance tuning. In short, performance testing is a painstaking task.
3、 Locate performance bottlenecks
With the groundwork above, we can now test the system's performance. Before tuning, let's talk about how to find the performance bottlenecks. I have met many friends who think this is easy, but when asked carefully, it turns out they do not have a systematic method.
3.1) Viewing the Operating System Load
First of all, when our system has problems, we should not rush to examine our code; that is meaningless. The first thing to look at is the operating system's reports: CPU utilization, memory utilization, disk I/O, network I/O, the number of network connections, and so on. Perfmon on Windows is a good tool, and Linux has many related commands and tools, such as SystemTap, LatencyTOP, vmstat, sar, iostat, top, tcpdump, and so on. By observing this data, we can roughly know where our software's performance problem lies. For example:
1) Look at CPU utilization first. If CPU utilization is low but the system's Throughput and Latency are still poor, it means our program is not busy computing but busy with something else, such as I/O. (In addition, CPU utilization should be split into kernel mode and user mode: once kernel-mode time rises, the performance of the whole system declines. And for multi-core CPUs, CPU 0 is quite critical: if CPU 0 is heavily loaded, it will affect the performance of the other cores, because inter-core scheduling is done by CPU 0.)
2) Then look at whether I/O is heavy. I/O and CPU generally move in opposite directions: high CPU utilization usually comes with low I/O, and high I/O with low CPU. Regarding I/O, we need to look at three things: disk/file I/O, driver I/O (such as the network card), and the memory paging rate. All three affect system performance.
3) Then check network bandwidth usage. Under Linux, you can use the commands iftop, iptraf, ntop, or tcpdump, or use Wireshark to view it.
4) If CPU is not high, I/O is not high, memory usage is not high, and network bandwidth usage is not high, yet the system's performance is still poor, then something is wrong inside your program. For example, your program may be blocked: waiting on a lock, waiting for a resource, or thrashing on context switches.
From what the operating system reports, we can know roughly where the performance problem lies: insufficient bandwidth, insufficient memory, insufficient TCP buffers, and so on. Many times, you do not need to adjust the program at all; you only need to adjust the hardware or the operating system configuration.
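As an illustration, here are typical invocations of a few of the Linux tools mentioned above (flags may differ slightly across distributions):

vmstat 1         # per-second CPU, memory, swap and context-switch statistics
iostat -x 1      # per-device disk utilization and average wait times
sar -n DEV 1     # per-interface network throughput
top              # per-process CPU and memory usage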
3.2) Testing with Profiler
Next, we need to use a profiler to examine the running performance of our program, for example JProfiler/TPTP/CodePro Profiler for Java, gprof from GNU, PurifyPlus from IBM, VTune from Intel, CodeAnalyst from AMD, and OProfile/perf on Linux; the latter two let you optimize your code down to the CPU micro-instruction level. If you care about CPU L1/L2 cache tuning, you should consider VTune. With these profilers you can obtain all sorts of runtime statistics about the modules, functions, and even instructions in your program, such as execution time, call counts, CPU utilization, and so on. These statistics are very useful to us.
We focus on the functions that run the longest and are called the most. Note: for a function that is called many times but runs briefly each time, a slight optimization may bring a big improvement (for example, if a function is called one million times a second, think how much performance you gain by shaving 0.01 milliseconds off each call).
There is a problem to pay attention to when using a profiler: a profiler reduces the performance of your program. Tools such as PurifyPlus insert a lot of extra code into yours, which lowers its efficiency, so you cannot test the system at high throughput with them. For that, there are generally two methods to locate system bottlenecks:
1) Keep your own statistics in your code. Use microsecond-level timers and function-call counters, and log the statistics to a file every 10 seconds or so (see the sketch after these two methods).
2) Comment out your code section by section: make some functions into no-ops (hard-coded mocks), then test the system's Throughput and Latency again. If they change qualitatively, the commented-out function is a performance bottleneck. Then comment out code inside that function's body until you find the statement that consumes the most performance.
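A minimal sketch of the first approach in C (assuming a POSIX system; the function names are illustrative):

#include <stdio.h>
#include <time.h>

static unsigned long call_count;  /* how many times the hot function ran */
static unsigned long total_us;    /* accumulated time spent inside it */

static void hot_function(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the real work being measured goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    total_us += (unsigned long)((t1.tv_sec - t0.tv_sec) * 1000000L
                              + (t1.tv_nsec - t0.tv_nsec) / 1000L);
    call_count++;
}

/* Call this from a timer or a dedicated thread every 10 seconds. */
static void dump_stats(FILE *log) {
    fprintf(log, "calls=%lu avg_us=%.2f\n", call_count,
            call_count ? (double)total_us / call_count : 0.0);
}

int main(void) {
    for (int i = 0; i < 1000; i++) hot_function();
    dump_stats(stdout);
    return 0;
}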
4、 Common system bottlenecks
The following are some of the problems I have run into. They may not be complete or even correct; feel free to add to them, as I am just throwing out a brick to attract jade. For performance tuning at the system architecture level, see "Talking about Website Performance Technology from 12306.cn". For Web-side performance tuning, see the performance chapter of "What You Need to Know in Web Development". I will not talk about design and architecture here.
Generally speaking, performance optimization comes down to the following strategies:
• Trade space for time. Caches of all kinds, from CPU L1/L2 to RAM to hard disk, trade space for time. This strategy saves or caches intermediate results step by step so that you do not have to recompute them every time, e.g., data buffering, CDNs, and so on. It also includes redundant data, such as data mirroring and load balancing.
• Trade time for space. Sometimes a small amount of space buys better overall performance, for example in network transmission. If there are algorithms to compress the data (such as the "Huffman coding compression algorithm" and the "rsync core algorithm" mentioned a few days ago), these algorithms cost time, but since the bottleneck is network transmission, trading time for space actually saves time overall.
• Simplify code. The most efficient program is the one that executes no code at all, so less code means higher performance. There are many examples of code-level optimization in textbooks: reduce the number of loop iterations, reduce recursion, declare fewer variables inside loops, allocate and release less memory, hoist loop-invariant expressions out of the loop (see the sketch after this list), order the conditions in a compound conditional carefully, prepare as much as possible at program startup, pay attention to the cost of function calls (stack overhead), pay attention to temporary objects in object-oriented languages, use exceptions carefully (do not use exceptions to handle acceptable, negligible, frequently occurring conditions), and so on. This requires a good understanding of your programming language and its common libraries.
• Parallelize. If the CPU has only one core and you play with multi-process and multi-thread, compute-intensive software becomes slower (because the operating system's scheduling and context switching are costly). Only with multiple CPU cores does multi-process/multi-thread truly pay off. Parallel processing requires our programs to have scalability; a program that cannot scale horizontally or vertically cannot be parallelized. From the architecture point of view, the question is: can we improve performance simply by adding machines, without changing the code?
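To illustrate the "simplify code" point, a small contrived C example of hoisting a loop-invariant expression out of a loop:

#include <stdio.h>
#include <string.h>

static int count_spaces_slow(const char *s) {
    int n = 0;
    /* strlen(s) is re-evaluated on every iteration: O(len^2) overall */
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ') n++;
    return n;
}

static int count_spaces_fast(const char *s) {
    int n = 0;
    size_t len = strlen(s);  /* loop-invariant, hoisted out: O(len) overall */
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ') n++;
    return n;
}

int main(void) {
    const char *s = "trade time for space, and space for time";
    printf("%d %d\n", count_spaces_slow(s), count_spaces_fast(s));
    return 0;
}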
In short, according to the 80/20 rule, 20% of your code consumes 80% of your performance. Find that 20% of the code and you can optimize away 80% of the cost. Below are some of my own experiences; I only cite the performance tuning methods I have found most valuable, for your reference, and additions are welcome.
4.1) Algorithm tuning. Algorithms are very important; a good algorithm brings better performance. Some examples from projects I have worked on:
• One is a filtering algorithm. The system needed to filter incoming requests, and the filter rules were configured in a file. The original algorithm traversed the rule list linearly. Later we sorted the rules so that a binary search could be used (see the sketch after this list), and system performance increased by 50%.
• One is a hash algorithm. The hash function in use was inefficient: on one hand the computation was too time-consuming, and on the other the collision rate was too high; with many collisions, a hash table performs like a singly linked list (see the Hash Collision DoS problem). Algorithms depend heavily on the data they process; even the much-ridiculed bubble sort beats other sorting algorithms in some cases (when most of the data is already sorted). The same goes for hash functions: well-known hash functions are benchmarked on English dictionaries, but our business data has its own characteristics, so you should select a hash function suited to your own data. In one of my previous projects, an expert in the company recommended a hash function that increased our system performance by 150%. (For the various hash algorithms, do read the StackExchange article on them.)
• Divide and conquer, and preprocessing. There was a program that needed a very long computation to generate monthly reports, sometimes nearly a whole day. So we found a way to make the computation incremental: each day we compute that day's data and merge it into the previous day's report, which greatly saves calculation time. The daily computation takes only 20 minutes, whereas computing the whole month at once took more than 10 hours (SQL statement performance degrades sharply in the face of large data volumes). This divide-and-conquer idea helps performance enormously with big data, just as it does in merge sort. It is also a strategy for optimizing SQL statements and databases, e.g., using nested selects instead of Cartesian-product selects, using views, and so on.
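A minimal sketch of the sorted-filter idea in C, using the standard qsort/bsearch (the rule set here is hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_str(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void) {
    /* Filter rules: sorted once at startup... */
    const char *rules[] = {"/private", "/admin", "/debug", "/internal"};
    size_t n = sizeof rules / sizeof *rules;
    qsort(rules, n, sizeof *rules, cmp_str);

    /* ...then each incoming request is matched in O(log n) instead of O(n). */
    const char *req = "/debug";
    const char **hit = bsearch(&req, rules, n, sizeof *rules, cmp_str);
    printf("%s -> %s\n", req, hit ? "filtered" : "allowed");
    return 0;
}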
4.2) Code tuning. From my experience, code tuning covers the following points:
• String operations. These are the most expensive things for system performance, whether strcpy, strcat, strlen, or, above all, substring matching. So use integers wherever you can. A few examples. First: N years ago, when I worked on banking systems, colleagues liked to store dates as strings (such as "2013-05-29 08:30:02"); a select ... where ... between over such strings was quite time-consuming. Second: a former colleague of mine used strings for status codes, his reason being that they could be shown on the UI directly. During performance tuning I changed all those status codes to integers and checked them with bit operations, because a function called 150K times per second needed to check the status in three places; after the change, overall system performance rose by about 30% (a sketch appears at the end of this subsection). Third: a coding standard on a product I once worked on required defining the function name in every function, e.g., const char fname[] = "functionName()", for logging purposes. But why not declare it static, so it is initialized only once?
• Multi-thread tuning. Some people say threads are evil, and in some cases they are a problem for system performance, because the bottleneck of multithreading lies in mutual-exclusion and synchronization locks, plus the cost of thread context switches. Using fewer locks, or none at all, is fundamental (for example, optimistic locking as applied in MVCC in distributed systems can solve many performance problems), and read-write locks can address workloads dominated by reads. A further note: in C++ we may use a thread-safe smart pointer such as AutoPtr, or other thread-safe containers; being thread safe, they lock no matter what, and locking is costly, so using AutoPtr can drag system performance down quickly. If you can guarantee there is no thread concurrency problem, you should not use AutoPtr. I remember a colleague of mine removing the thread-safe reference counting from our smart pointers, which improved system performance by more than 50%. As for the reference counting of Java objects, if I guess correctly there are locks everywhere, which is one reason Java performance has always been an issue. Also, more threads is not better: scheduling and context switching between threads is also shockingly expensive. Keep related work in one thread as far as possible and avoid synchronizing between threads; this buys you a lot of performance.
• Memory allocation. Do not underestimate your program's memory allocation. Calls like malloc/realloc/calloc are time-consuming, especially when memory is fragmented. My previous company hit such a problem: at a customer site, one day our program stopped responding. Attaching GDB, we found it hung in a malloc operation that had not returned for 20 seconds; restarting the system made it better. This is the memory fragmentation problem, and it is why many people complain that STL fragments memory badly: too many small allocations and releases. Many people think a memory pool solves this, but in fact they merely reinvent the memory management of the C runtime or the operating system, which helps nothing.
Of course, the memory fragmentation problem really is solved through memory pooling, or more precisely a series of memory pools of different block sizes (I leave this for you to think about). And less dynamic allocation is best of all. Speaking of memory pools, we should mention pooling technology in general, such as thread pools and connection pools. Pooling is quite effective for short jobs (such as HTTP services): it removes the overhead of establishing connections and creating threads, and thus improves performance.
• Asynchronous operation. We know Unix offers blocking and non-blocking file operations, and some system calls are themselves blocking, such as select on sockets and WaitForSingleObject on Windows. If our program operates synchronously, performance suffers greatly; we can change it to asynchronous, but that makes the program more complex. Asynchronous designs usually pass work through queues, and you have to mind the performance of those queues. Also, status notification under asynchrony is usually a problem, whether by message/event notification or by callbacks, and those mechanisms can themselves affect performance. Generally speaking, though, asynchronous operation greatly improves throughput at the cost of response time, and the business must be able to accept that.
• Language and code libraries. We should be familiar with the language we use and the performance of its function or class libraries. For example, many STL containers do not return memory to the system even after elements are deleted, which can look like a memory leak and may cause fragmentation. As another example, mind the difference between size() == 0 and empty() on some STL containers: size() may be O(n) while empty() is O(1), so be careful. For JVM tuning in Java, you need these parameters: -Xms, -Xmx, -Xmn, -XX:SurvivorRatio, -XX:MaxTenuringThreshold. You also need to watch the JVM's GC, whose dominance we all know, especially a full GC (which also compacts memory fragments): when it runs, time for the whole world stops (hence "stop-the-world").
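A minimal sketch of the integer-status-code idea in C (the flag names are made up):

#include <stdio.h>

/* Status flags packed into one integer instead of strings like "ACTIVE|DIRTY". */
enum {
    ST_ACTIVE = 1 << 0,
    ST_DIRTY  = 1 << 1,
    ST_LOCKED = 1 << 2,
};

int main(void) {
    unsigned status = 0;
    status |= ST_ACTIVE | ST_DIRTY;   /* set two flags */
    if (status & ST_ACTIVE)           /* O(1) bit test, no strcmp */
        printf("active\n");
    status &= ~ST_DIRTY;              /* clear a flag */
    printf("dirty? %s\n", (status & ST_DIRTY) ? "yes" : "no");
    return 0;
}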
4.3) Network tuning
There is a lot to say about network tuning, especially TCP tuning (search those two keywords and you will find many articles). Just look at how many TCP/IP parameters Linux exposes (by the way, you may not like Linux, but you cannot deny that it gives us a great deal of power over kernel tuning). I strongly recommend reading "TCP/IP Illustrated, Volume 1: The Protocols". Here I will only discuss the concepts.
A) TCP Tuning
We know that TCP connections carry a lot of overhead: they occupy file descriptors and they allocate buffers. A system can therefore support only a limited number of TCP connections, and we must clearly understand how heavy they are. Precisely because TCP consumes resources, many attacks work by creating a large number of TCP connections on your system and exhausting its resources, such as the famous SYN Flood attack.
Therefore, pay attention to the KeepAlive parameters, which define a period after which, if no data has been transferred on a connection, the system sends a probe packet; if no response comes back, TCP considers the connection broken and closes it, reclaiming the system resources. (Note: the HTTP layer has its own KeepAlive parameter.) For short connections like HTTP, setting a KeepAlive of 1 to 2 minutes is very important, and it can mitigate DoS attacks to some extent. The relevant parameters are as follows (the values are for reference only):
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_fin_timeout = 30
Regarding TCP's TIME_WAIT state: the side that actively closes the connection enters TIME_WAIT, which lasts for two MSLs (Maximum Segment Lifetime), 4 minutes by default, and the resources held in TIME_WAIT cannot be reclaimed during that period. An HTTP server typically accumulates a large number of connections in TIME_WAIT. Two parameters are worth noting:
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
The former means reusing TIME_WAIT, and the latter means reclaiming TIME_WAIT resources.
Another important TCP concept is RWIN (TCP Receive Window Size), the maximum amount of data a TCP connection may have in flight before the receiver must send an ack back to the sender. Why does this matter? Because if the sender does not receive an ack from the receiver, it stops sending and waits; if it times out, it retransmits. This is what makes TCP a reliable transport. Retransmission is not the worst part: once packet loss occurs, TCP's bandwidth usage is immediately affected (it is blindly halved); further loss halves it again, and only in the absence of loss does it gradually recover. The relevant parameters are as follows:
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Generally speaking, the theoretical RWIN should be set to: throughput × round-trip time (RTT), i.e., the bandwidth-delay product. The sender-side buffer should be the same size as the RWIN, because the sender has to wait for the receiver's acknowledgments after sending. If network latency is large and the buffer too small, acknowledgments become very frequent, so neither performance nor network utilization is high. In other words: for high-latency networks we need large buffers, so that there are fewer acks and more data per exchange; for fast-responding networks, smaller buffers suffice. Also, if loss does occur (an ack never arrives), an overly large buffer can force TCP to retransmit all of that data, hurting performance. (And if the network is simply bad, do not chase high performance on it.) So the key to a high-performance network is keeping the packet loss rate very, very low (basically a LAN scenario); if the network is basically trustworthy, a larger buffer gives better transmission performance (too much back-and-forth hurts performance).
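For example, on a hypothetical 100 Mbit/s link with a 100 ms round-trip time:

RWIN = bandwidth × RTT = (100,000,000 / 8 bytes/s) × 0.1 s = 1,250,000 bytes ≈ 1.25 MB

so buffers much smaller than about 1.25 MB would leave such a link underutilized.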
In addition, let's think about it. If the network quality is very good and there is almost no packet loss, and we are not afraid of occasionally losing a few packets in business, then why don't we use faster UDP? Have you thought about this question?
B) UDP Tuning
When it comes to UDP tuning, I want to focus on one thing: MTU, the Maximum Transmission Unit (this actually matters for TCP too, since it is a link-layer property). Imagine the MTU as a bus on a road: suppose a bus carries at most 70 people, and bandwidth is like the number of lanes; if the road can carry at most 100 buses at once, I can transport at most 7,000 people. But if the buses run nearly empty, say 20 people each, I transport only 2,000 people, and my road (bandwidth) resources are wasted. So we should fill a UDP packet up to the MTU before putting it on the network, to maximize bandwidth utilization. The MTU is 1,500 bytes for Ethernet, 4,352 bytes for FDDI fiber, and 7,981 bytes for 802.11 wireless. When we send packets over TCP/UDP, however, the payload must stay below this value, because the IP header adds 20 bytes and the UDP header adds 8 (TCP adds more). Generally speaking, the maximum payload of a UDP packet over Ethernet is 1500 − 8 − 20 = 1472 bytes. Over fiber, of course, this value can be larger. (By the way, some fancy gigabit fiber Ethernet NICs will fragment packets exceeding the MTU in hardware and reassemble them at the receiving end, with no handling needed in your program.)
One more thing: when programming with sockets, you can use setsockopt() to set the sizes of SO_SNDBUF/SO_RCVBUF, the TTL, KeepAlive, and many other key options; see the socket man pages for details.
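A minimal sketch in C (the buffer sizes are illustrative; the kernel may double or clamp the values you request):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int sndbuf = 4 * 1024 * 1024;   /* illustrative 4 MB send buffer */
    int rcvbuf = 4 * 1024 * 1024;   /* illustrative 4 MB receive buffer */
    int keepalive = 1;

    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf);
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof keepalive);

    int got; socklen_t len = sizeof got;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("effective SO_RCVBUF: %d bytes\n", got);
    close(fd);
    return 0;
}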
Finally, one of UDP's biggest advantages is multicast, which is very convenient and efficient when you need to notify multiple nodes on an intranet. Multicast also helps with horizontal scaling of machines: a newly added machine only needs to listen for the multicast messages.
C) Network card tuning
The network card itself can also be tuned, which is quite necessary for gigabit and faster NICs. Under Linux, ifconfig shows per-interface statistics; if you see a non-zero overruns count, you may need to enlarge txqueuelen (usually 1000 by default), for example: ifconfig eth0 txqueuelen 5000. Linux also has the ethtool command, which can set the network card's buffer sizes. Under Windows, you can adjust the relevant parameters in the Advanced tab of the network adapter's properties (such as Receive Buffers, Transmit Buffer, etc.; different adapters expose different parameters). Enlarging these buffers is very effective for transfers that move large amounts of data.
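For example (interface name and values are illustrative):

ethtool -g eth0                   # show current and maximum RX/TX ring buffer sizes
ethtool -G eth0 rx 4096 tx 4096   # enlarge the ring buffers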
D) Other network performance
Regarding multiplexing, i.e., managing many TCP connections with one thread, three system calls deserve attention. The first is select, which is limited to 1024 file descriptors. The second is poll, which breaks the 1024 limit; but select and poll are both essentially polling mechanisms whose cost is O(n) in the number of connections, so they perform poorly with many connections. Hence epoll, which is supported by the operating system kernel: the kernel notifies you only when a connection is active, so it is driven by OS notification rather than scanning, but it is only available from Linux kernel 2.6 on (introduced in 2.5.44, to be exact). Of course, if all connections are almost always active, heavy use of epoll_ctl can cost more than plain polling, though the difference is not large.
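A minimal sketch of an epoll event loop in C (error handling mostly omitted; the port is illustrative):

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);   /* illustrative port */
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, SOMAXCONN);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listener };
    epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);

    struct epoll_event events[64];
    for (;;) {
        /* Blocks until the kernel reports activity; no O(n) scan of all fds. */
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listener) {
                int client = accept(listener, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                if (r <= 0) close(events[i].data.fd);  /* peer closed or error */
                /* else: handle the request... */
            }
        }
    }
}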
In addition, be careful with DNS lookup calls such as gethostbyaddr/gethostbyname. They can be quite time-consuming because they query the network: DNS recursive queries can lead to serious timeouts, and you cannot shorten them by setting any parameter. You can speed things up by configuring the hosts file, or by keeping your own name table in memory, resolved once at program startup rather than on every call. Moreover, under multithreading, gethostbyname has a worse problem: if one thread blocks inside gethostbyname, other threads calling it block as well. This is nasty; be careful. (You can try GNU's gethostbyname_r(), which performs better.) There is much more to dig up on this online; for example, if your Linux uses NIS or NFS, some user- or file-related system calls become very slow, so beware.
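A minimal sketch of the re-entrant glibc variant in C (the hostname is illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void) {
    struct hostent he, *result = NULL;
    char buf[2048];
    int err = 0;

    /* Re-entrant variant: each thread supplies its own buffer, so one
       call does not share static state with another thread's call. */
    if (gethostbyname_r("localhost", &he, buf, sizeof buf, &result, &err) == 0
        && result != NULL) {
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, result->h_addr_list[0], ip, sizeof ip);
        printf("localhost -> %s\n", ip);
    } else {
        fprintf(stderr, "lookup failed (h_errno=%d)\n", err);
    }
    return 0;
}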
4.4) System tuning
A) I/O model
We mentioned the select/poll/epoll system calls above, and we know that under Unix/Linux all devices are treated as files for I/O, so those three calls should be regarded as I/O-related system calls. The I/O model is very important for I/O performance. The classic I/O models of Unix/Linux are as follows (for the I/O models under Linux, you can also read the article "Using asynchronous I/O to greatly improve performance"):
The first is synchronous blocking I/O.
The second is synchronous non-blocking mode, set via fcntl with O_NONBLOCK.
The third is select/poll/epoll: the I/O call itself does not block; instead, blocking happens on events. It can be regarded as asynchronous I/O with synchronous event notification.
The fourth is AIO. In this model, I/O proceeds in parallel with the application: an I/O request returns immediately, indicating it was successfully initiated, and while the I/O completes in the background, the application is notified in one of two ways: a signal, or a thread-based callback function that completes the I/O processing.
The fourth model has no blocking at all, neither on I/O nor on event notification, so it lets you use the CPU fully. Compared with the second (non-blocking synchronous) model, where you must poll again and again, AIO needs no polling. Nginx owes its efficiency to using epoll and AIO for I/O.
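A minimal sketch of the second model in C: switching a descriptor to non-blocking with fcntl (using stdin for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

int main(void) {
    /* Switch stdin to non-blocking mode (model 2 above). */
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    ssize_t r = read(STDIN_FILENO, buf, sizeof buf);
    if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("no data yet; read() returned immediately instead of blocking\n");
    else
        printf("read %zd bytes\n", r);
    return 0;
}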
Now let's look at the I/O models under Windows:
a) The first is the WriteFile system call, which can be synchronous blocking or synchronous non-blocking, depending on whether the file was opened with the Overlapped flag. For synchronous non-blocking use, you pass an OVERLAPPED structure as the last parameter; Microsoft calls this Overlapped I/O, and you need WaitForSingleObject to learn whether the write has completed. You can imagine the performance of this model.
b) Another system call, WriteFileEx, implements asynchronous I/O and lets you pass a callback function that is invoked after the I/O completes. But the callback works like this: Windows places it in an APC (Asynchronous Procedure Call) queue, and it is invoked only when the application's thread becomes alertable, which happens only while the thread is inside WaitForSingleObjectEx, WaitForMultipleObjectsEx, MsgWaitForMultipleObjectsEx, SignalObjectAndWait, or SleepEx. As you can see, this model still involves waiting, so its performance is not high.
c) Then comes IOCP – IO Completion Port. IOCP will put the I/O results in a queue. However, the queue is not listened to by the main thread, but by one or more threads dedicated to this task (the old platform requires you to create your own thread, while the new platform allows you to create a thread pool). IOCP is a thread pool model. This is similar to the AIO model under Linux, but its implementation and usage are completely different.
Of course, the real way to improve I/O performance is to minimize the number of I/O operations against peripherals; the fewer, the better, and none at all is best. For reads, a memory cache generally brings a qualitative improvement, since memory is far faster than peripherals. For writes, caching accumulates the data and writes it out in fewer, larger batches. The problem a cache brings is latency: the data becomes less real-time, so we need to balance the number of writes against responsiveness.
B) Multi core CPU tuning
Regarding CPU multi-core technology, we know CPU 0 is critical: if CPU 0 is worked too hard, the performance of the other cores declines, because CPU 0 performs the coordination. We therefore should not simply leave everything to the operating system's load balancing; we understand our own programs better, so we can manually assign CPU cores, avoiding both overloading CPU 0 and letting our key process be crowded together with a pile of other processes.
• On Windows, use Task Manager, right-click a process, and choose "Set Affinity..." to limit which cores the process may run on.
• On Linux, use the taskset command (installable via the schedutils package: apt-get install schedutils).
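For example (the PID and core numbers are illustrative):

taskset -c 2,3 ./myprogram     # start myprogram pinned to cores 2 and 3
taskset -p -c 2,3 12345        # re-pin the already-running process with PID 12345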
Another multi-core technology is NUMA (Non-Uniform Memory Access). Traditional multi-core computing uses SMP (Symmetric Multi-Processing), where multiple processors share one centralized memory and I/O bus, which raises the problem of consistent memory access, and consistency usually means a performance problem. Under NUMA, the processors are divided into nodes, each with its own local memory. For the technical details of NUMA, see the article "Linux NUMA Technology". Under Linux, the NUMA tuning command is numactl, for example the following command (it runs "myprogram arg1 arg2" on node 0, with its memory allocated on nodes 0 and 1):
numactl --cpubind=0 --membind=0,1 myprogram arg1 arg2
Of course, the command above is not ideal, because the memory spans two nodes, which is bad. The best approach is to let the program access only memory on the same node it runs on, for example:
$ numactl --membind 1 --cpunodebind 1 --localalloc myapplication
C) File system tuning
For the file system: the file system also has a cache, so to get the most out of it, the first thing is to allocate enough memory to cache file data. This is very important. Under Linux, use the free command to look at free/used/buffers/cached; ideally buffers plus cached should be around 40%. Then comes a fast disk controller; SCSI is much better. The fastest is an Intel SSD, which is extremely quick but has limited write cycles.
Next, we can tune the file system configuration. For Linux ext3/4, one parameter that helps in almost all cases is turning off file access-time updates: check whether /etc/fstab has the noatime option for your file system (it generally should). Another is delalloc (delayed allocation), which lets the system decide which blocks to use at the last possible moment when writing a file, which optimizes writes. Also note the three journaling modes: data=journal, data=ordered, and data=writeback. The default, data=ordered, offers the best balance of performance and protection.
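For example, a hypothetical fstab entry with these options (device and mount point are illustrative):

/dev/sda1  /data  ext4  noatime,delalloc,data=ordered  0  2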
Of course, for these purposes, the default settings of ext4 are basically the best optimization.
Here is a Linux command to view I/O - iotop, which allows you to see the disk read/write load of each process.
There is also tuning for NFS and XFS; you can Google related optimization articles. For the various file systems, take a look at the article "Linux Journaling File Systems and Performance Analysis".
4.5) Database tuning
Database tuning is not my strong point. I will just use my very limited knowledge to say something. Note that the following things are not necessarily correct, because in different business scenarios, different database designs may lead to completely opposite conclusions. Therefore, I will only make some general explanations here, and specific analysis is required for specific problems.
A) Database Engine Tuning
I'm not familiar with database engines, but there are several things I think one must understand:
• The database's locking model. This is very, very important. Under concurrency, locks affect performance enormously: the various isolation levels, row locks, table locks, page locks, read-write locks, transaction locks, and the various write-first/read-first mechanisms. The highest performance comes from not locking at all, which is why partitioning databases and tables, keeping redundant data, and reducing transactions that require consistency can effectively improve performance. NoSQL trades away consistency and transactions, and adds redundant data, precisely to achieve distribution and high performance.
• The database's storage mechanism. Understand not only how the various field types are stored, but, more importantly, how the database stores, partitions, and manages data, such as Oracle's data files, tablespaces, and segments. Understanding this mechanism can relieve a great deal of I/O load. For example, in MySQL, show engines; lists the supported storage engines; different engines have different strengths, and different business or database designs will get different performance from them.
• The database's distribution strategy. The simplest is replication or mirroring. You need to understand distributed consistency algorithms, and master-master and master-slave replication. Understanding how this technology works lets you scale horizontally at the database level.
B) SQL statement optimization
For SQL statement optimization, the first step is to use tools, for example MySQL SQL Query Analyzer, Oracle SQL Performance Analyzer, or Microsoft SQL Query Analyzer. Basically, every RDBMS comes with such tools to let you find the SQL performance problems in your application. You can also use EXPLAIN to see what the final execution plan of a SQL statement looks like.
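For example, a minimal illustration with MySQL's EXPLAIN (the table, column, and row counts are hypothetical, and the annotated output is abridged):

EXPLAIN SELECT id, lastname FROM user WHERE lastname = 'smith';
-- key: NULL, rows: 1000000  -> full table scan

CREATE INDEX idx_user_lastname ON user (lastname);

EXPLAIN SELECT id, lastname FROM user WHERE lastname = 'smith';
-- key: idx_user_lastname, rows: 3  -> index lookup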
Another important point: database operations need a large amount of memory, so the server's memory must be sufficient; multi-table queries in particular are quite memory-hungry.
Based on my limited database knowledge, here are some SQL patterns with performance problems:
• Full table scans. For example, select * from user where lastname="xxxx" is basically a full table lookup: linear complexity O(n), so the more records, the worse the performance (e.g., 50 ms for 100 records, 5 minutes for a million). Two ways to improve this: partition the tables to reduce record counts, or build an index (on lastname). An index is like a key-value structure: the key is the field in the where clause, the value the physical row locator. Index search is basically O(log n), implemented with a B-Tree (e.g., 50 ms for 100 records, around 100 ms for a million).
• Indexes. For indexed fields, it is better not to apply calculations, type conversions, functions, null-value tests, or field concatenation to the field itself; these operations destroy the benefit of the index. Indexes generally come into play in Where or Order By clauses, so avoid computing on those fields there, avoid wrapping them in NOT, and avoid functions on them.
• Multi-table queries. The most common operations in a relational database are multi-table queries, which mainly involve three keywords: EXISTS, IN, and JOIN (for the various joins, see the illustrated SQL joins article). Modern database engines optimize SQL statements quite well; JOIN and IN/EXISTS differ, but their performance is basically the same. Some people say EXISTS outperforms IN, and IN outperforms JOIN; I think it depends on your data, your schema, and the complexity of the SQL statement. For simple cases they are about the same, so do not over-nest. Do not make your SQL too complex: a few simple SQL statements are better than one huge SQL nested N levels deep. Others say that if two tables are similar in size, EXISTS may beat IN and IN may beat JOIN, and that if one table is large and one small, EXISTS suits a large outer table while IN suits a small table in the subquery. I have not verified this; let's leave it open for discussion. There is also an article on SQL Server worth consulting: IN vs JOIN vs EXISTS.
• JOIN operations. Some people say the order of joined tables affects performance. As long as the join produces the same result set, performance is independent of the join order, because the database engine optimizes it for us. JOIN has three implementation algorithms: nested loop, sort-merge, and hash join. (MySQL supports only the first.)
• Nested loop is just like our familiar nested loops. Note that, as said earlier, index lookup uses a B-Tree, an O(log n) algorithm, so the whole join's complexity should be O(log n) × O(log m).
• Hash join mainly attacks the O(log n) complexity of the nested loop by marking rows with a temporary hash table.
• Sort-merge means sorting both tables by the query field and then merging them. Of course, indexed fields are generally already sorted.
To repeat: only your data and your SQL statement can tell you which method is best (a sketch of the seek technique follows this list).
• Partial result sets. We know that LIMIT in MySQL, rownum in Oracle, and TOP in SQL Server limit the returned results to the first N rows, which gives the database engine a lot of room to optimize. Generally we use order by to return the top N records, and the order by field should be indexed; with it indexed, the select statement's performance does not degrade with the number of records. Typically the front end displays data page by page: MySQL uses OFFSET, SQL Server uses FETCH NEXT. This fetch style is actually poor, with linear complexity. So if we know the starting value of the next page in the order by field, we can select directly with >= in the where clause. This technique is called seek rather than fetch, and the performance of seek is much higher than that of fetch.
• Strings. As said earlier, string operations are a nightmare for performance, so use numbers where the data allows, for example for times, tag numbers, and so on.
• Full-text search. Do not use Like or similar tricks for full-text search; if you want full-text search, try something like Sphinx.
• Others:
• Do not select *; specify each field. With multiple tables, always prefix fields with the table name; do not make the engine work it out.
• Do not use HAVING casually, since it must traverse all records; the performance could hardly be worse.
• Use UNION ALL instead of UNION wherever possible.
• With too many indexes, inserts and deletes get slower. Updates that touch most indexes are also slow, but an update touching only one index affects only that one index table.
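A sketch of seek-style versus offset-style pagination (MySQL syntax; the table, columns, and values are hypothetical):

-- Fetch/offset style: the engine still walks past and discards the first 10000 rows.
SELECT id, name FROM user ORDER BY id LIMIT 10 OFFSET 10000;

-- Seek style: remember the last id of the previous page and start from there.
-- With an index on id, this stays fast regardless of the page number.
SELECT id, name FROM user WHERE id > 10010 ORDER BY id LIMIT 10;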