Convert MYSQL to UTF8 character set on Ubuntu 8.10

Posted on January 2nd, 2009

I used the following methods to convert my server to UTF-8, using standard Ubuntu packages, eg mysql-server from synaptix, nothing had to be recompiled.

They may not work in your environment. Use them at your own risk, and BACKUP everything first. (Preferably test before using it on any production server.) I post them only as a personal memory aid. If you use the information here and your server bursts into flames, you accept responsibility for that.

PuTTY /SSH:
In PuTTY 0.60 you can change the option “Window > Translation > Received data assumed to be in which character set:” to ‘UTF-8’. After this all received data will be interpreted as UTF-8 and displayed correctly. (For anyone ssh’ing to the mysql server and using the mysql> prompt from there!)

Bash:
try adding something like this to your ~/.bash_profile
echo -ne ‘\e%G\e[?47h\e%G\e[?47l’
There’s no point trying to work in UTF-8 if your computer thinks you are speaking ISO 8859-1!! This should even fix vi, so AltGr+4 correctly gives the Euro ‘€’ and AltGr+` [backtick] gives the broken pipe ‘¦’, instead of displaying nothing, or something odd like an accented A¦ pair. (or a dot in bash). Of course, Windows notepad will choke on the extended character, but something like notepad++ will work to display it.

MySQL:
A) Edit /etc/mysql/my.cnf, locate the section that starts [mysqld], add the lines below to configure your server, and then restart mysql. This should switch the server to UTF-8 ‘mode’.

[mysqld]
# utf8
init-connect=’SET NAMES utf8′
character-set-server=utf8
collation-server=utf8_general_ci

B) # /etc/init.d/mysqld restart

afterwards you should find the following all correctly updated to utf-8
> Variable Session value Global value
> character set client utf8 latin1
> character set connection utf8 latin1
> character set database latin1 latin1
> character set results utf8 latin1
> character set server latin1 latin1
> character set system utf8 utf8
> collation connection utf8_general_ci latin1_swedish_ci
> collation database latin1_swedish_ci latin1_swedish_ci
> collation server latin1_swedish_ci latin1_swedish_ci

This pinched from https://lists.ubuntu.com/archives/ubuntu-users/2006-February/067367.html because it was so very useful =)

C) Now follow this guide: http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet which you can also find copied below. Note that I was able to convert my latin1 encoded database (containing a wordpress blog) and all its tables (one-by-one) into utf8 using the below commands whilst the server was ‘hot’ – but take care! (and backups!!). Put db/table names with hyphens into backticks, e.g. `wp-blog`.

——–
To support worldwide languages, you need to use UTF-8 encoding for your web pages, emails and application, rather than ISO 8859-1 or another common western encoding, since these don’t support characters used in languages such as Japanese and Chinese.

Happily, UTF-8 is transparent to the core Latin characterset, so you won’t need to convert all your data to start using UTF-8. But there are a number of other issues to deal with. In particular, because UTF-8 is a multibyte encoding, meaning one character can be represented by more one or more bytes. This causes trouble for PHP, because the language parses and processes strings based on bytes, not characters, and makes mincemeat multibyte strings – for example, by splitting characters ‘in half’, bodging up regular expressions, and rendering email unreadable.

There are a number of great articles online about UTF-8 and how it works – Joel Spolski’s comes to mind – but very few about how to actually get it working with PHP and iron out all the bugs. So, here to save you the time we put in, is a quick cheatsheet and info about a few common issues.

1. Update your database tables to use UTF-8

CREATE DATABASE db_name
CHARACTER SET utf8
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci
DEFAULT COLLATE utf8_general_ci
;

ALTER DATABASE db_name
CHARACTER SET utf8
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci
DEFAULT COLLATE utf8_general_ci
;

ALTER TABLE tbl_name
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci
;

2. Install the mbstring extension for PHP

Windows: download the dll if it’s not in your PHP extensions folder, and uncomment the relevant line in your php.ini file: extension=php_mbstring.dll
Linux: yum install php-mbstring

3. Configure mbstring

Do this in php.ini, httpd.conf or .htaccess. (Remember to prepend these with ‘php_value ‘ in httpd.conf or .htaccess.)

mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set dectection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header

4. Deal with non-multibyte-safe functions in PHP

The fast-and-loose way to do this is with the following php configuration:

mbstring.func_overload = 7 ; All non-multibyte-safe functions are overloaded with the mbstring alternatives

But there are problems with this. php.net has a warning about this potentially affecting the whole server. And even if this isn’t an issue for you, mbstring can make a mess of binary strings.

So, a better route is to search your application code for the following functions, and replace them with mbstring’s ‘slot-in’ alternatives:

mail() -> mb_send_mail()
strlen() -> mb_strlen()
strpos() -> mb_strpos()
strrpos() -> mb_strrpos()
substr() -> mb_substr()
strtolower() -> mb_strtolower()
strtoupper() -> mb_strtoupper()
substr_count() -> mb_substr_count()
ereg() -> mb_ereg()
eregi() -> mb_eregi()
ereg_replace() -> mb_ereg_replace()
eregi_replace() -> mb_eregi_replace()
split() -> mb_split()

5. Sort out HTML entities

The htmlentities() function doesn’t work automatically with multibyte strings. To save time, you’ll want to create a wrapper function and use this instead:

/**
* Encodes HTML safely for UTF-8. Use instead of htmlentities.
*
* @param string $var
* @return string
*/
function html_encode($var)
{
return htmlentities($var, ENT_QUOTES, ‘UTF-8’) ;
}

6. Check content-type headers

Check through your code for any text-based content-type headers, and append the UTF-8 charset, so the browser knows what it’s working with:

header(‘Content-type: text/html; charset=UTF-8’) ;

You should also repeat this at the top of HTML pages:
(replace square brackets with angle ones!!)

[meta http-equiv=”Content-type” value=”text/html; charset=UTF-8″ /]

7. Update email scripts

Email can be tricky. You’ll need to update the content-type for any emails and text-based mime parts to use UTF-8 encoding. You’ll also need to alter the way in which headers are encoded to use UTF-8. mbstring provides a function mb_encode_mimeheader() to handle this for you, but it does make a mess of address lists: you’ll need to encoding the name and address parts seperately, then compile them into an address list.

Be sure to encode the subject and other headers too – Korean speakers will tend to put Korean text for the subject.

9. Check binary files and strings

Finally, double check any binary files and strings handled by PHP, particularly uploads, downloads and encryption. In some cases it may be necessary to revert to ASCII just before a download or processing a binary string.

——–

MySQL clients:
Check the options in your mysql clients – should be switched to UTF8 mode also, or you might get your results in latin1, or something wierd like that.

Perhaps use tips from http://tlug.dnho.net/?q=node/276 if you are trying to mysqldump & restore from latin1 to utf8.

PHP:
Use the above to set your PHP / front end into UTF8 (see mbstrings section).