Fixing a regular expression for lines read with fgets

I am having trouble with a regular expression applied to a file that is read one line at a time.
I want to issue an alert if a line contains only a single dot.

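$file = fopen("test.txt", "r");

if($file){
	// fgets() returns false at end of file
	while(($line = fgets($file)) !== false){

		if(preg_match('/^.$/', $line)){
			echo "please write again.<br>";
		} else {
			echo "OK<br>";
		}

	}
	// close the handle only if the file was opened successfully
	fclose($file);
}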

The regular expression does not work as intended in this code.

Add one more half-width dot to the regular expression:

if($file){
	while($line = fgets($file)){

		if(preg_match('/^..$/', $line)){
			echo "please write again.<br>";
		} else {
			echo "OK<br>";
		}

	}
}

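A half-width dot in a regular expression matches any single character except a newline. Because fgets() returns the line together with its line ending, the extra dot most likely matches a trailing carriage return ("\r") left by Windows-style line endings. To match a literal dot, escape it as \.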
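As a hedged alternative to adding a second dot, the line ending can be stripped before matching, and the dot escaped so that it only matches a literal ".". This is a minimal sketch, not the original code: the helper name isSingleDot is hypothetical, and an in-memory stream stands in for test.txt so the example runs on its own.

```php
<?php
// Hypothetical helper: true when the line is exactly one literal dot.
function isSingleDot(string $line): bool
{
    // fgets() keeps the line ending, so strip "\r" and "\n" first.
    $line = rtrim($line, "\r\n");
    // \. matches a literal dot; an unescaped . would match any character.
    return preg_match('/^\.$/', $line) === 1;
}

// In-memory stand-in for test.txt (assumption), with Windows-style line endings.
$file = fopen('php://memory', 'r+');
fwrite($file, "hello\r\n.\r\na\r\n");
rewind($file);

$results = [];
while (($line = fgets($file)) !== false) {
    $results[] = isSingleDot($line) ? 'please write again.' : 'OK';
}
fclose($file);

echo implode("\n", $results), "\n";
```

With this approach a line such as "a" is reported as OK, while a line holding only "." triggers the alert, regardless of whether the file uses "\n" or "\r\n".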

Count the number of characters, bytes, and width in PHP

How do you count the number of characters, the number of bytes, and the character width (apparent length) of a string in PHP?

1. strlen
function strlen
Generally, to know the number of bytes, use strlen.

mb_internal_encoding('UTF-8');
$char = 'To die: to sleep No more; and by a sleep to say we end. The heart-ache and the thousand natural shocks. That flesh is heir to, tis a consummation Devoutly to be wishd. ';
echo strlen($char);

168

2. mb_strlen
function mb_strlen
Use mb_strlen to count the number of characters. A full-width (multibyte) character counts as 1, the same as a half-width character.

mb_internal_encoding('UTF-8');
$char = '名前(カタカナ)';

echo mb_strlen($char);

3. mb_strwidth
function mb_strwidth
Use mb_strwidth to get the character width (apparent length) of a string. Full-width characters count as 2 and half-width characters count as 1.

mb_internal_encoding('UTF-8');
$char = 'おはようございます。';

echo mb_strwidth($char);
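To make the difference between the three functions concrete, here is a small sketch that applies all of them to the same string. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Ten full-width characters, each 3 bytes long in UTF-8.
$text = 'おはようございます。';

$bytes = strlen($text);               // number of bytes
$chars = mb_strlen($text, 'UTF-8');   // number of characters
$width = mb_strwidth($text, 'UTF-8'); // display width (full-width = 2)

printf("bytes=%d chars=%d width=%d\n", $bytes, $chars, $width);
// prints "bytes=30 chars=10 width=20"
```

Passing the encoding explicitly is a design choice here: it keeps the result independent of whatever mb_internal_encoding happens to be set to.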

Let’s add an emoji and count the bytes.

mb_internal_encoding('UTF-8');
$char = 'こんにちは😀';

echo strlen($char);

-> 19
It is as expected: five 3-byte hiragana plus one 4-byte emoji.

Then, let’s display an alert if it is more than 10 bytes.

mb_internal_encoding('UTF-8');
$char = 'こんにちは😀';

// echo strlen($char);
if(strlen($char) > 10){
	echo "this is more than 10 bytes, please write again";
} else {
	echo "confirmed.";
}

-> this is more than 10 bytes, please write again

Perfect, it works.
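If the limit is meant in characters rather than bytes, the same check could be written with mb_strlen instead. This is only a sketch under that assumption; the variable names and the message text are illustrative, and the source file is assumed to be saved as UTF-8.

```php
<?php
$char = 'こんにちは😀';

$byteCount = strlen($char);             // 19 bytes in UTF-8
$charCount = mb_strlen($char, 'UTF-8'); // 6 characters

// Compare against the character count instead of the byte count.
if ($charCount > 10) {
    $message = 'this is more than 10 characters, please write again';
} else {
    $message = 'confirmed.';
}

echo $message, "\n";
```

The same string that fails the 10-byte check passes a 10-character check, so it is worth deciding up front which unit the limit refers to.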

Checking for 4-byte characters in PHP

Some pictograms (emoji) and Chinese characters take 4 bytes in UTF-8 and, depending on the MySQL version and character set, cannot be saved. MySQL's utf8 (utf8mb3) character set stores at most 3 bytes per character, so utf8mb4 is needed for them. There are therefore cases where I want to check whether a string contains 4-byte characters.

Let’s look at a sample. The cherry blossom (🌸) is a 4-byte character.

mb_internal_encoding('UTF-8');
$char = 'abcdefgあいうえお🌸';

echo preg_replace('/[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/', '', $char);

abcdefgあいうえお
Wow, it’s impressive.

OK then, I want to display an alert if the string contains a 4-byte character.
Let’s look at the regular expression below.

mb_internal_encoding('UTF-8');
$char = 'abcdefgあいうえお';
// $char = 'abcdefgあいうえお🌸';

// echo preg_replace('/[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/', '', $char);

if(preg_match('/[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/', $char)){
	echo "alert, it includes a 4 byte character";
} else {
	echo "there is no 4 byte character";
}

Oh my goodness.
“just trust yourself. surely, the way to live is coming. Goethe”
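The byte-level pattern above can also be expressed in terms of code points. With the /u modifier PCRE treats the subject as UTF-8, and every code point from U+10000 upward needs 4 bytes in UTF-8. The sketch below is an alternative, not the original method; the helper name hasFourByteChar is hypothetical.

```php
<?php
// Hypothetical helper: true when the string contains a character
// that needs 4 bytes in UTF-8 (code points U+10000 to U+10FFFF).
function hasFourByteChar(string $text): bool
{
    // /u makes the pattern operate on code points instead of raw bytes.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$plain = 'abcdefgあいうえお';
$withEmoji = 'abcdefgあいうえお' . "\u{1F338}"; // cherry blossom

echo hasFourByteChar($plain) ? "alert\n" : "no 4 byte character\n";
echo hasFourByteChar($withEmoji) ? "alert\n" : "no 4 byte character\n";
```

One thing to keep in mind: with /u, preg_match returns false rather than 0 when the input is not valid UTF-8, so this helper reports false for malformed input as well.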

Character codes mainly used in Japan

Character Code
A character code is the correspondence between the characters used on a computer and the byte values assigned to each character. As computers came into use across many language regions, the variety of character codes grew. There are said to be more than 100 typical character codes.

Mainly used in Japan
JIS Code
The official name is “ISO-2022-JP”. It is widely used.

SJIS(Shift-JIS) Code
It is ASCII plus Japanese characters, and it is used on domestic mobile phones in Japan.

EUC
It is widely used on UNIX.


Unicode consists of a “coded character set” and a “character encoding method” (encoding).

A “character set” is a collection of characters grouped by some rule, for example “all hiragana” or “all letters of the alphabet”. A rule that assigns a unique number to each character in the set is called a “coded character set”. The assigned number is called a code point and is written in the form “U+xxxx”.

A “character encoding method” is a way of converting the code points of a coded character set into byte sequences so that a computer can handle them. The encoding methods include UTF-8 and UTF-16.
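The distinction between a code point and its encoded bytes can be checked in PHP. The following sketch takes one character and prints its code point along with its UTF-8 and UTF-16 byte sequences. It assumes PHP 7.2 or later with the mbstring extension (for mb_ord); the variable names are illustrative.

```php
<?php
$char = "\u{3042}"; // hiragana "a" (あ)

// The code point is the same no matter which encoding is used.
$codePoint = sprintf('U+%04X', mb_ord($char, 'UTF-8'));

// The byte sequence depends on the encoding method.
$utf8Hex  = bin2hex($char);
$utf16Hex = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8'));

echo "code point: $codePoint\n"; // U+3042
echo "UTF-8:      $utf8Hex\n";   // e38182
echo "UTF-16BE:   $utf16Hex\n";  // 3042
```

The same code point becomes three bytes in UTF-8 and two bytes in UTF-16, which is why byte counts differ between encodings.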

Unicode code point ranges
0x0000 – 0x007f : ASCII
0x0080 – 0x07ff : Alphabets of various countries
0x0800 – 0xffff : Indic scripts, punctuation marks, academic symbols, pictograms, East Asian characters, full-width and half-width forms
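The three ranges listed above line up with the number of bytes UTF-8 uses: 1, 2 and 3 bytes respectively, with code points from U+10000 upward taking 4 bytes. That mapping is an addition to the list, not something it states. A small sketch to confirm it follows; the helper name utf8ByteLength is hypothetical, and mb_chr requires PHP 7.2 or later with mbstring.

```php
<?php
// Hypothetical helper: number of bytes UTF-8 uses for a code point.
function utf8ByteLength(int $codePoint): int
{
    if ($codePoint <= 0x007F) {
        return 1; // ASCII
    }
    if ($codePoint <= 0x07FF) {
        return 2;
    }
    if ($codePoint <= 0xFFFF) {
        return 3;
    }
    return 4; // U+10000 to U+10FFFF
}

// A, e with acute accent, hiragana "a", cherry blossom
foreach ([0x41, 0xE9, 0x3042, 0x1F338] as $cp) {
    $actual = strlen(mb_chr($cp, 'UTF-8'));
    printf("U+%04X: predicted %d, actual %d\n", $cp, utf8ByteLength($cp), $actual);
}
```

The predicted and actual byte counts agree for each sample, and the last one shows why characters outside the ranges in the list need the 4-byte handling discussed earlier.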

OK!