MENU service case
 Website construction website design Beijing website construction high-end website production company Shangpin China
We create by embracing change
360 ° brand value__
simplified Chinese character
Simplified Chinese English

Shangpin China Joins Hands with Beisheng Internet to Create a New Chapter in Website Construction

Type: Shangpin Dynamic Learn more

Why can't "UFT8" be used in MySQL for enterprise website building

Source: Shangpin China | Type: website encyclopedia | Time: August 16, 2019

When using MYSQL, you often encounter a problem. Try to save a UTF-8 string in MariaDB encoded with "utf8" through Rails, and then a particularly bizarre error occurs:

Incorrect string value: ‘\xF0\x9F\x98\x83 <…’ for column ‘summary’ at row 1

The UTF-8 encoded client and server are also UTF-8 encoded, as is the database. Even the string "<..." to be saved is legal UTF-8. The crux of the problem is that MySQL's "utf8" is not really UTF-8. "Utf8" only supports up to three bytes per character, while the real UTF-8 is up to four bytes per character. It turns out that MySQL has never fixed this bug. In 2010, they released a character set called "utf8mb4" to bypass this problem. They did not advertise the new character set, so that developers are still suggested to use "utf8" on the Internet, but these suggestions are wrong.

The brief summary is as follows:

(1) MySQL's "utf8mb4" is really "UTF-8".

(2) MySQL's "utf8" is just a "proprietary encoding", which can encode only a few Unicode characters.

Shangpin Editor suggests that MySQL and MariaDB users who use "utf8" should use "utf8mb4" instead and never use "utf8" again.

First What is coding? What is UTF-8?

As we all know, the essence of computer storage is binary, which uses 0 and 1 to store text. For example, if the character "C" is saved as "01000011", the computer needs to go through two steps when displaying this character:

  1. My computer maps "C" to 67 in the Unicode character set.

  2. My computer codes 67 as "01000011" and sends it to the Web server.

Relative:

  1. The computer reads "01000011" and gets the number 67, because 67 is encoded as "01000011".

  2. The computer looked for 67 in the Unicode character set and found "C".

Almost all network applications use the Unicode character set, because there is no reason to use other character sets.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, and each character uses 32 bits. This is the easiest way to do it, because all along, computers treat 32 bits as numbers, and computers are best at processing numbers. But the problem is that it wastes too much space.

UTF-8 can save space. In UTF-8, the character "C" only needs 8 bits, and other characters may use 16 or 24 bits. For an article like this one, if UTF-8 coding is used, the space occupied is only about a quarter of that of UTF-32.

Second    A brief history of MySQL

Why do MySQL developers disable "utf8"? We may find the answer from the submission log.

MySQL 4.1 supports UTF-8, that is, 2003. The UTF-8 standard (RFC 3629) used today is a later version. The old UTF-8 standard (RFC 2279) supports up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview. In September of the same year, they made an adjustment to the MySQL source code: "UTF8 now only supports 3 byte sequences at most". So who submitted the code? Why did he do this? This question is unknown. After migrating to Git (MySQL started with BitKeeper), many names of committers in the MySQL code base were lost. There was no clue to explain this change in the September 2003 mailing list.

But try to guess.

In 2002, MySQL made a decision: if users can ensure that every row of the data table uses the same number of bytes, MySQL can greatly improve its performance. To do this, you need to define the text column as "CHAR". Each "CHAR" column always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill the space behind. If the number of characters inserted exceeds the defined number, the excess part will be truncated.

When MySQL developers first tried UTF-8, they used 6 bytes per character, CHAR (1) used 6 bytes, CHAR (2) used 12 bytes, and so on.

It can be said that their original version is correct, but this version has not been released. However, it is written in the document, and it is widely circulated. All people who know UTF-8 agree with what is written in the document.

However, it is obvious that MySQL developers or vendors are worried that users will do these two things:

  1. Define columns using CHAR.

  2. Set the code of the CHAR column to "utf8".

The editor guessed that MySQL developers wanted to help users who wanted to win in space and speed, but they messed up the "utf8" code.

So the result is failure. Users who want to win in space and speed, when they use the CHAR column of "utf8", actually use more space than expected, and the speed is slower than expected. Users who want to be correct cannot save characters like "" when they use "utf8" encoding.

Third, why does this issue attract people's attention

On this issue, I spent a lot of time to find this bug. But I'm not sure that this is the only one. Almost all articles on the network regard "utf8" as the real UTF-8. "Utf8" is only a proprietary character set. It has brought us new problems, but it has not been solved.

Source Statement: This article is original or edited by Shangpin China's editors. If it needs to be reproduced, please indicate that it is from Shangpin China. The above contents (including pictures and words) are from the Internet. If there is any infringement, please contact us in time (010-60259772).
TAG label:

What if your website can increase the number of conversions and improve customer satisfaction?

Make an appointment with a professional consultant to communicate!

* Shangpin professional consultant will contact you as soon as possible

Disclaimer

Thank you very much for visiting our website. Please read all the terms of this statement carefully before you use this website.

1. Part of the content of this site comes from the network, and the copyright of some articles and pictures involved belongs to the original author. The reprint of this site is for everyone to learn and exchange, and should not be used for any commercial activities.

2. This website does not assume any form of loss or injury caused by users to themselves and others due to the use of these resources.

3. For issues not covered in this statement, please refer to relevant national laws and regulations. In case of conflict between this statement and national laws and regulations, the national laws and regulations shall prevail.

4. If it infringes your legitimate rights and interests, please contact us in time, and we will delete the relevant content at the first time!

Contact: 010-60259772
E-mail: [email protected]

Communicate with professional consultants now!

  • National Service Hotline

    400-700-4979

  • Beijing Service Hotline

    010-60259772

Please be assured to fill in the information protection
Online consultation

Disclaimer

Thank you very much for visiting our website. Please read all the terms of this statement carefully before you use this website.

1. Part of the content of this site comes from the network, and the copyright of some articles and pictures involved belongs to the original author. The reprint of this site is for everyone to learn and exchange, and should not be used for any commercial activities.

2. This website does not assume any form of loss or injury caused by users to themselves and others due to the use of these resources.

3. For issues not covered in this statement, please refer to relevant national laws and regulations. In case of conflict between this statement and national laws and regulations, the national laws and regulations shall prevail.

4. If it infringes your legitimate rights and interests, please contact us in time, and we will delete the relevant content at the first time!

Contact: 010-60259772
E-mail: [email protected]