A character in UTF-8 is 1 to 4 bytes long, subject to the following rules:
For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones and bit n+1 is 0, followed by n-1 bytes whose two most significant bits are 10.
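The two rules above can be read directly off the top bits of each byte. A minimal sketch, assuming every value is already in the range 0..255; the helper names expected_length and is_continuation are made up for illustration and are not part of the solution that follows:

```python
def expected_length(first_byte):
    """Return the total length announced by a leading byte, or 0 if it cannot lead."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: 1-byte character
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2-byte character
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte character
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4-byte character
        return 4
    return 0                          # 10xxxxxx (continuation) or 11111xxx


def is_continuation(byte):
    """True if the two most significant bits are 10."""
    return byte >> 6 == 0b10
```

Shifting away the payload bits and comparing the remaining prefix is equivalent to the numeric range checks used in the solution: for example, byte >> 5 == 0b110 holds exactly when 192 <= byte < 224.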
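class Solution(object):
    def validUtf8(self, data):
        """
        :type data: List[int]
        :rtype: bool
        """
        count = 0  # continuation bytes (10xxxxxx) still expected
        for d in data:
            if 128 <= d < 192:
                # Continuation byte: valid only if one is still expected.
                # An extra one (which would push count below 0) is an error.
                if count == 0:
                    return False
                count -= 1
            else:
                # A new character starts while continuation bytes are still missing.
                if count:
                    return False
                if d < 128:
                    continue
                elif 192 <= d < 224:
                    count = 1
                elif 224 <= d < 240:
                    count = 2
                elif 240 <= d < 248:
                    count = 3
                else:
                    return False
        return count == 0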
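To see the counting logic at work, here is a small self-contained sketch. The function valid_utf8 is a hypothetical standalone copy of the same range checks (it is not a separate algorithm), and the sample inputs are chosen only for illustration:

```python
def valid_utf8(data):
    """Standalone copy of the range checks (hypothetical helper name)."""
    count = 0  # continuation bytes still expected
    for d in data:
        if 128 <= d < 192:        # 10xxxxxx
            if count == 0:
                return False
            count -= 1
        elif count:               # new character started too early
            return False
        elif 192 <= d < 224:      # 110xxxxx
            count = 1
        elif 224 <= d < 240:      # 1110xxxx
            count = 2
        elif 240 <= d < 248:      # 11110xxx
            count = 3
        elif d >= 248:            # 11111xxx is never valid
            return False
    return count == 0


samples = [
    [197, 130, 1],           # 2-byte character followed by ASCII: True
    [235, 140, 4],           # 3-byte lead with only one continuation byte: False
    [145],                   # lone continuation byte: False
    [240, 162, 138, 147],    # complete 4-byte character: True
    [230, 136],              # input ends while a byte is still expected: False
]
for s in samples:
    print(s, valid_utf8(s))
```

The last sample is the reason the function ends with count == 0 rather than True: a sequence that stops in the middle of a character must be rejected.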
1 UTF-8: a character encoding that uses 1 to 4 bytes to encode every Unicode character
2 If a character uses 1 byte, the first bit must be 0
3 If a character uses n bytes, the first n bits of the first byte must all be ones and bit n+1 must be 0; the two most significant bits of each of the following n-1 bytes must be 10
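4 Iterate over the numbers in data and check which range each one falls in:
<128: an ASCII character, so skip it;
128<=x<192: starts with 10; if count!=0, set count=count-1; if count=0, return false
192<=x<224: starts with 110, so exactly 1 number starting with 10 must follow; set count=1
224<=x<240: starts with 1110, so exactly 2 numbers starting with 10 must follow; set count=2
240<=x<248: starts with 11110, so exactly 3 numbers starting with 10 must follow; set count=3
Any other value: return false
Finally, check whether count equals 0; if it does, the number of bytes starting with 10 was correct
5 Why check for a leading 10 first? Because the first number needs special care: if it starts with 10, the result must be False. count is initialized to 0, so a first number that starts with 10 finds count=0 and returns False; if it does not start with 10, the other branches run
6 int("11001101", 2) converts a binary string to a decimal integer
7 A first number that starts with 10 is invalid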
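Note 6 is handy for writing test data as bit patterns, so that the leading bits are visible. A short sketch, where the bit strings are illustrative inputs only:

```python
# Each string is one byte written in binary; int(s, 2) parses it as base 2.
bits = ["11000101", "10000010", "00000001"]
data = [int(b, 2) for b in bits]
print(data)                     # [197, 130, 1]

# The example from note 6, and the reverse direction with format().
value = int("11001101", 2)
print(value)                    # 205
print(format(value, "08b"))     # 11001101
```

Here the first string starts with 110 and the second with 10, so together they form one 2-byte character, followed by a single ASCII byte.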
