1

00:00:01,260  -->  00:00:07,830
Let's now learn about character encoding. It's an absolutely fundamental lesson. Before we do any

2

00:00:07,830  -->  00:00:14,070
programming that involves processing of text, it is important to know about character encoding. In case

3

00:00:14,070  -->  00:00:18,000
you're not familiar with this topic, then it is helpful to learn it now.

4

00:00:18,000  -->  00:00:20,740
So let's get started with character encoding.

5

00:00:21,690  -->  00:00:23,760
Let's start with absolute basics.

6

00:00:24,210  -->  00:00:26,580
Most people classify files into two buckets,

7

00:00:26,610  -->  00:00:34,650
that is, binary and text. Binary would include content like images, audio, and video, while text is about files

8

00:00:34,650  -->  00:00:36,690
with characters.

9

00:00:37,780  -->  00:00:44,810
However, fundamentally all files are binary, that is, a sequence of bytes where each byte is a group of eight

10

00:00:44,820  -->  00:00:48,490
bits, and a bit is either 0 or 1.

11

00:00:48,720  -->  00:00:50,880
So text files are also binary files.

12

00:00:50,970  -->  00:00:58,500
It is just that they're stored in a certain way in binary. All files, whether they are binary or text,

13

00:00:58,860  -->  00:01:06,930
would look alike to the computer's hardware. However, software would make a distinction. For instance, bytes

14

00:01:06,930  -->  00:01:13,950
that represent text are only handled by text-processing software like Notepad, WordPad, or editors like

15

00:01:13,950  -->  00:01:15,980
vi and Eclipse.

16

00:01:16,170  -->  00:01:21,920
Similarly, bytes representing images are only handled by image-processing software like Windows Photo

17

00:01:22,370  -->  00:01:32,290
or the Windows Paint application. Computers use hexadecimal numbers to represent bytes. For example, the character

18

00:01:32,290  -->  00:01:41,390
a is represented using the hexadecimal code 61, which is stored in binary as the byte 01100001.

19

00:01:42,270  -->  00:01:44,150
6 corresponds to the first four bits,

20

00:01:44,260  -->  00:01:50,300
while 1 corresponds to the last four bits. Note that the same bit pattern would represent something else

21

00:01:50,340  -->  00:01:52,660
in an image file.
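To make the hex and binary view of the character a concrete, here is a minimal Java sketch. The class name ByteOfA and the helper toBinary are made up for this illustration; only the standard getBytes and Integer methods come from the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, written only for this illustration.
class ByteOfA {
    // Returns the 8-bit binary string for an unsigned byte value (0..255).
    static String toBinary(int unsignedByte) {
        String bits = Integer.toBinaryString(unsignedByte);
        // Left-pad with zeros so that all eight bits are shown.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] bytes = "a".getBytes(StandardCharsets.US_ASCII);
        int value = bytes[0] & 0xFF; // treat the byte as unsigned
        System.out.println(Integer.toHexString(value)); // 61
        System.out.println(toBinary(value));            // 01100001
    }
}
```

The first hex digit, 6, is the bit group 0110, and the second digit, 1, is the bit group 0001.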

22

00:01:53,820  -->  00:01:59,910
Also note that a single byte can take one of 256 bit patterns, that is, 2 to the power of

23

00:01:59,960  -->  00:02:02,650
8, as a byte has 8 bits.

24

00:02:03,030  -->  00:02:12,310
So it means we can represent as many as 256 characters using different variations of a byte, but to represent

25

00:02:12,330  -->  00:02:15,880
more characters we need to use more than one byte.

26

00:02:16,290  -->  00:02:22,370
And that's where the different encoding schemes come into play, and we will see all of that in a bit. Now,

27

00:02:22,380  -->  00:02:24,860
since the title of the lesson is character encoding.

28

00:02:24,990  -->  00:02:33,540
we will only look at text representation. Processing text is very complex due to the number of languages

29

00:02:33,570  -->  00:02:39,120
involved and also due to the number of ways in which characters in those languages can be represented.

30

00:02:40,500  -->  00:02:47,490
As a result, there would be different character encoding schemes, so a given text follows some encoding scheme.

31

00:02:48,720  -->  00:02:55,050
And if the software, like say a browser, does not understand the character encoding used,

32

00:02:55,050  -->  00:02:59,140
then you would end up seeing unfamiliar characters such as this.

33

00:02:59,190  -->  00:03:02,580
This happens usually for international characters.
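Here is a small Java sketch of how such unfamiliar characters can come about. It assumes the text was written as UTF-8 but read back as ISO-8859-1; the class name MojibakeSketch and the method misdecode are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, written only for this illustration.
class MojibakeSketch {
    // Encodes the text as UTF-8, then wrongly decodes the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "r\u00E9sum\u00E9"; // the word resume with accented e (U+00E9)
        String garbled = misdecode(original);
        // Each accented e was stored as two bytes (C3 A9), so the wrong decoder
        // turns each one into two unrelated characters.
        System.out.println(original.length()); // 6
        System.out.println(garbled.length());  // 8
    }
}
```

The bytes on disk never changed. Only the charset used to interpret them did, which is why the fix is usually to declare or detect the right encoding rather than to repair the data.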

34

00:03:03,510  -->  00:03:11,340
And so every file uses some encoding scheme to represent its content, and an encoding scheme simply

35

00:03:11,340  -->  00:03:18,540
maps every character to a unique hexadecimal number, whose binary representation is used. So an encoding

36

00:03:18,540  -->  00:03:25,140
scheme is basically an algorithm which maps characters to hexadecimal numbers, whose binary representations

37

00:03:25,140  -->  00:03:26,520
are used.

38

00:03:27,360  -->  00:03:33,660
And here's an example of encoding and decoding. With regard to encoding, the character a is first encoded

39

00:03:33,900  -->  00:03:36,990
or mapped to the hexadecimal number 61.

40

00:03:37,200  -->  00:03:42,180
And finally the binary equivalent of 61 is used for the actual storage.

41

00:03:42,180  -->  00:03:49,560
The decoding process is simply the reverse of the encoding process. And here are some example encoding schemes.
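The encode and decode steps just described can be sketched in Java roughly as follows. The class RoundTrip and its two methods are hypothetical names, and ASCII is picked here only because the example character is a.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, written only for this illustration.
class RoundTrip {
    // Encoding: map characters to their byte values.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Decoding: map the byte values back to characters.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] stored = encode("a");
        System.out.println(Integer.toHexString(stored[0])); // 61
        System.out.println(decode(stored));                 // a
    }
}
```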

42

00:03:51,200  -->  00:03:56,380
And each encoding scheme is basically an implementation of some character set.

43

00:03:56,910  -->  00:04:02,910
For instance, UTF-8, UTF-16, and UTF-32 are all implementations of the Unicode character set.

44

00:04:02,910  -->  00:04:10,650
The terms encoding scheme and character set are frequently intermixed, but to speak precisely, they are

45

00:04:10,650  -->  00:04:13,690
different. With regard to ASCII,

46

00:04:13,710  -->  00:04:21,480
it can be viewed as both a character set and an encoding scheme. Initially, the only characters that mattered

47

00:04:21,480  -->  00:04:27,670
were unaccented English letters, and for that 7-bit ASCII codes were used.

48

00:04:27,750  -->  00:04:30,700
Here is an example. In ASCII, the hex codes

49

00:04:30,710  -->  00:04:36,620
41 to 5A are used to represent the uppercase alphabet A to Z.

50

00:04:37,710  -->  00:04:43,500
But since we also have many other languages, soon other 8-bit extensions of ASCII came up.

51

00:04:43,710  -->  00:04:47,680
ISO-8859 is a standard ASCII-based extension.

52

00:04:48,180  -->  00:04:54,440
In fact, ISO-8859 had 15 different variations for different regions in the world.

53

00:04:54,810  -->  00:05:01,290
In these extensions, the first 128 codes were more or less identical and would correspond to the 7

54

00:05:01,290  -->  00:05:08,900
bit US-ASCII. The remaining 128 codes are where we see the differences based on the region.

55

00:05:09,540  -->  00:05:15,540
Now, Asian languages have thousands of letters, and 8 bits weren't sufficient for them.

56

00:05:15,540  -->  00:05:21,630
So this gave rise to another encoding scheme called double-byte character set, or DBCS in short.

57

00:05:23,880  -->  00:05:29,530
So with all these variations, when documents were mailed from one country to another, they were not

58

00:05:29,520  -->  00:05:31,740
getting decoded properly.

59

00:05:31,750  -->  00:05:38,230
For instance a hexadecimal code of a particular character in one country would correspond to a completely

60

00:05:38,230  -->  00:05:40,910
different character in some other country.

61

00:05:40,930  -->  00:05:46,240
Here we can see an example of that: the word résumé in one country gets displayed differently

62

00:05:46,240  -->  00:05:49,940
in a different country. In this case the second country is Israel.

63

00:05:50,250  -->  00:05:57,000
Here the hex code 0x82 corresponds to the Hebrew character gimel on Israeli PCs.

64

00:05:57,340  -->  00:06:05,050
So there were decoding issues, and the advent of the internet simply made it worse, and it became a huge problem.

65

00:06:05,750  -->  00:06:13,600
And due to this, the Unicode character set was invented, which aims to cover all languages in the world.

66

00:06:14,650  -->  00:06:17,860
Let's look at some quick facts about Unicode encoding.

67

00:06:18,120  -->  00:06:25,440
It is maintained by the Unicode Consortium, and it is backward compatible with 7-bit US-ASCII.

68

00:06:25,600  -->  00:06:34,350
So the first 128 characters are the same. Now, the initial assumption was that 16 bits, which would represent 65,

69

00:06:34,360  -->  00:06:42,670
536 characters, would suffice to cover all languages. These characters together as a group are referred

70

00:06:42,670  -->  00:06:52,300
to as the Basic Multilingual Plane, or BMP for short. As a result, UCS-2, which was a fixed-length encoding scheme

71

00:06:52,480  -->  00:07:01,270
with 16 bits, was created. When we say fixed-length, it means that it used exactly 16 bits to represent any character

72

00:07:01,260  -->  00:07:03,360
.

73

00:07:03,370  -->  00:07:07,020
But soon it was realized that there were more characters.

74

00:07:07,240  -->  00:07:11,450
These characters were especially important for Asian markets.

75

00:07:11,470  -->  00:07:17,140
These would also include things like smiley graphic symbols, which we see in various social networking

76

00:07:17,130  -->  00:07:24,600
applications. As a result, other encoding schemes like UCS-4 and UTF-16 were created to accommodate

77

00:07:24,670  -->  00:07:32,510
new characters. But UCS-4 was not favored, as it required 4 bytes for every character.

78

00:07:32,800  -->  00:07:36,060
And so it would be a lot of wasted space and memory.

79

00:07:37,650  -->  00:07:45,130
According to Wikipedia, as of today Unicode covers around 120K characters from around 129 scripts

80

00:07:45,120  -->  00:07:45,760
.

81

00:07:46,060  -->  00:07:49,550
So Unicode is really universal.

82

00:07:50,470  -->  00:07:54,040
Now, as we just discussed, the Unicode character set has two parts.

83

00:07:54,150  -->  00:07:57,860
One is BMP, that is, the Basic Multilingual Plane,

84

00:07:58,050  -->  00:08:03,820
while the second is referred to as supplementary characters, and let's just call it non-BMP.

85

00:08:04,600  -->  00:08:05,720
So BMP has

86

00:08:05,760  -->  00:08:12,660
65,536 characters, while non-BMP currently has around 55K characters

87

00:08:12,660  -->  00:08:12,890
.

88

00:08:13,020  -->  00:08:20,470
That is as of today, and we may expect more characters to be added to the non-BMP set. And the hexadecimal

89

00:08:20,470  -->  00:08:27,500
code that represents each character is referred to as a code point. And to represent a single BMP character

90

00:08:27,500  -->  00:08:27,700
.

91

00:08:27,770  -->  00:08:30,170
one 16-bit code point is used.

92

00:08:30,430  -->  00:08:36,840
So that's a hexadecimal code. However, for each supplementary character, two 16-bit code points are

93

00:08:36,850  -->  00:08:38,280
used.

94

00:08:38,370  -->  00:08:46,060
Now this illustration also shows the range of code points for both BMP as well as non-BMP.

95

00:08:46,260  -->  00:08:50,030
And here's an example of a BMP code point for the Euro symbol.

96

00:08:50,640  -->  00:08:52,650
U+ here means Unicode,

97

00:08:53,010  -->  00:08:58,190
and 20AC is the actual code, which is in hexadecimal.

98

00:08:58,370  -->  00:09:02,860
UTF-16 and UTF-8 are popular implementations of Unicode.

99

00:09:03,150  -->  00:09:06,510
So let's discuss each of them briefly.

100

00:09:07,020  -->  00:09:09,240
Let's begin with UTF-16.

101

00:09:09,270  -->  00:09:13,830
It is a variable-length encoding scheme. To represent BMP characters,

102

00:09:13,870  -->  00:09:20,770
it needs two bytes, while for non-BMP characters it is four bytes. For the codes in BMP, it has a one

103

00:09:20,760  -->  00:09:22,320
to one correspondence.

104

00:09:22,480  -->  00:09:26,680
That is, it uses the same hex codes as the code points in BMP.

105

00:09:26,860  -->  00:09:35,980
And this applies to the UTF-32 format also. Now, for non-BMP, UTF-16 uses Unicode values, that is,

106

00:09:35,980  -->  00:09:44,280
Unicode code points. Just recall that non-BMP has two code points for each character.

107

00:09:44,290  -->  00:09:46,620
Each Unicode value has two parts.

108

00:09:46,910  -->  00:09:53,460
Together these two Unicode values are referred to as a surrogate pair. The first Unicode value is called

109

00:09:53,470  -->  00:09:54,680
the high surrogate,

110

00:09:54,930  -->  00:10:01,270
while the second one is the low surrogate. For example, here is an emoji symbol which represents tears

111

00:10:01,270  -->  00:10:07,210
of joy, and you might have seen it in your WhatsApp keyboard or in other kinds of social networking apps.

112

00:10:08,430  -->  00:10:11,130
As you can see, it is a single character.

113

00:10:11,250  -->  00:10:20,350
Its Unicode code point is 1F602, and it is represented by that surrogate pair. Now, Java, C#,

114

00:10:20,350  -->  00:10:28,180
and Python are some languages that internally use UTF-16 characters, that is, internally within the

115

00:10:28,170  -->  00:10:28,770
memory.

116

00:10:28,910  -->  00:10:36,310
A language like Java would represent a single character in UTF-16. That is, even if the character

117

00:10:36,310  -->  00:10:43,060
was decoded from a file that was using a different encoding, when it comes to storing in memory it

118

00:10:43,060  -->  00:10:45,480
gets stored in UTF-16 format.
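As a rough illustration of the surrogate pair for the tears-of-joy emoji, here is a small Java sketch. The class SurrogateSketch and the helper hex are hypothetical names. One note on terminology: strictly speaking, the Unicode standard calls the two 16-bit values code units, and the character itself has the single code point U+1F602.

```java
// Hypothetical class, written only for this illustration.
class SurrogateSketch {
    static final int TEARS_OF_JOY = 0x1F602; // code point mentioned in the lesson

    // Formats one UTF-16 unit as four uppercase hex digits.
    static String hex(char unit) {
        return String.format("%04X", (int) unit);
    }

    public static void main(String[] args) {
        // Character.toChars returns the UTF-16 representation of a code point.
        char[] units = Character.toChars(TEARS_OF_JOY);
        System.out.println(hex(units[0])); // D83D, the high surrogate
        System.out.println(hex(units[1])); // DE02, the low surrogate

        String emoji = new String(units);
        System.out.println(emoji.length());                          // 2 (16-bit units)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (one character)
    }
}
```

This is also why length() on a Java string can be larger than the number of characters a reader would count.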

119

00:10:47,530  -->  00:10:55,240
Now coming to UTF-8: it is also a variable-length encoding scheme like UTF-16, and in it the first 128

120

00:10:55,240  -->  00:11:01,120
characters, that is, the ASCII characters, can be represented using a single byte.

121

00:11:01,310  -->  00:11:05,890
And so it is backward compatible with ASCII. For all other characters,

122

00:11:05,890  -->  00:11:11,260
it would require anywhere between two and four bytes, and for non-BMP characters

123

00:11:11,260  -->  00:11:13,500
it would be 4 bytes for sure.

124

00:11:14,230  -->  00:11:21,100
So as it needs only one byte for plain English characters, it is definitely preferred for them. UTF-16,

125

00:11:21,100  -->  00:11:25,960
on the other hand, needs two bytes even for plain ASCII characters.
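To see these byte counts side by side, here is a minimal Java sketch. The class ByteCounts and its two helpers are hypothetical names. UTF-16BE is used for the UTF-16 count so that no byte order mark gets added to the total.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, written only for this illustration.
class ByteCounts {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16, without a byte order mark.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",            // plain ASCII
            "\u20AC",       // Euro symbol, a BMP character
            "\uD83D\uDE02"  // tears-of-joy emoji, a non-BMP character
        };
        for (String s : samples) {
            // Prints "1 2", then "3 2", then "4 4".
            System.out.println(utf8Length(s) + " " + utf16Length(s));
        }
    }
}
```

So UTF-8 is smaller for ASCII text, UTF-16 is smaller for BMP characters such as the Euro symbol, and both need four bytes for non-BMP characters.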

126

00:11:25,960  -->  00:11:33,010
So there is some wastage of space when it comes to plain English characters. UTF-8 is also the most popular

127

00:11:33,010  -->  00:11:38,590
encoding scheme, and more than half of web pages are encoded in UTF-8.

128

00:11:39,350  -->  00:11:46,450
And it was invented by Ken Thompson, who also designed and implemented the Unix operating system and

129

00:11:46,450  -->  00:11:52,990
has also invented the B programming language, which is a direct predecessor to the C programming language

130

00:11:52,990  -->  00:11:54,280
.

131

00:11:54,280  -->  00:12:01,330
Just note that UTF stands for Unicode Transformation Format, and so it is basically an algorithmic

132

00:12:01,330  -->  00:12:08,230
mapping from every Unicode code point to a unique byte sequence. In a few minutes,

133

00:12:08,230  -->  00:12:14,500
we will also do a demo where we see how the same code point gets encoded differently using different

134

00:12:14,590  -->  00:12:17,500
encoding schemes.

135

00:12:17,500  -->  00:12:24,970
Now, in Unicode schemes like UTF-16, since the basic unit is two bytes, there can be two ways in which

136

00:12:25,150  -->  00:12:33,280
those bytes can be stored in memory or even on disk. One is called big endian, or BE for short, while

137

00:12:33,280  -->  00:12:38,790
the other is little endian, or LE for short. In the big-endian format,

138

00:12:38,890  -->  00:12:43,550
the most significant byte will be stored at the lowest memory address.

139

00:12:43,840  -->  00:12:51,130
And this is also the most commonly used format and the default one too, while in little endian the most significant

140

00:12:51,130  -->  00:12:59,080
byte would be stored at the highest memory address. And there is something called the byte order mark, or BOM

141

00:12:59,190  -->  00:13:00,190
for short,

142

00:13:00,760  -->  00:13:08,800
which is used to identify the endianness, that is, big endian or little endian, and it uses a special Unicode

143

00:13:08,800  -->  00:13:15,790
character. For big endian, the Unicode character is FEFF, while for

144

00:13:15,800  -->  00:13:22,360
little endian it is the opposite, that is, the reverse, FFFE. And these byte order marks are placed at the

145

00:13:22,360  -->  00:13:28,480
beginning of a data stream, and they tell if the data stream uses the big-endian or little-endian

146

00:13:28,480  -->  00:13:29,920
format.

147

00:13:30,850  -->  00:13:32,260
And here is an example.

148

00:13:32,500  -->  00:13:38,880
The first line shows the data stream for the string Hello. The second line shows the same data stream marked as

149

00:13:38,920  -->  00:13:46,570
big endian, while the next one is marked as little endian. As you can see, the order of the MSB is different

150

00:13:46,580  -->  00:13:47,080
.

151

00:13:47,590  -->  00:13:52,220
So when you encode or decode, you can specify the endianness.
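Here is a minimal Java sketch of the byte order mark and the two byte orders. The class BomSketch and the helper hexBytes are hypothetical names. The sketch relies on the Java behavior that encoding with the plain UTF-16 charset writes a big-endian byte order mark first, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class, written only for this illustration.
class BomSketch {
    // Encodes the text with the given charset and formats each byte as two hex digits.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes("a", StandardCharsets.UTF_16));   // FE FF 00 61
        System.out.println(hexBytes("a", StandardCharsets.UTF_16BE)); // 00 61
        System.out.println(hexBytes("a", StandardCharsets.UTF_16LE)); // 61 00
    }
}
```

In the first line the leading FE FF is the byte order mark, and the two bytes after it are in big-endian order.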

152

00:13:52,300  -->  00:13:57,260
For example, you can say UTF-16BE or UTF-16

153

00:13:57,260  -->  00:14:05,410
LE. Now, one final important thing to note is that you should always decode a data stream using the same

154

00:14:05,410  -->  00:14:11,990
encoding scheme that was used to encode it, or it has to be a compatible encoding scheme.

155

00:14:12,010  -->  00:14:18,910
For instance, if a data stream was encoded using ASCII, then you can decode it using UTF-8, as UTF-8

156

00:14:18,910  -->  00:14:22,710
is backward compatible with ASCII.
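A short Java sketch of that compatibility rule follows. The class CompatibleDecode and its methods are hypothetical names. It decodes ASCII bytes as UTF-8, which works, and then tries the opposite direction with the Euro symbol, which does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, written only for this illustration.
class CompatibleDecode {
    // Encode as ASCII, decode as UTF-8: safe, because UTF-8 is a superset of ASCII.
    static String asciiThenUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8, decode as ASCII: only safe for pure ASCII text.
    static String utf8ThenAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(asciiThenUtf8("Hello").equals("Hello"));   // true
        System.out.println(utf8ThenAscii("Hello").equals("Hello"));   // true
        System.out.println(utf8ThenAscii("\u20AC").equals("\u20AC")); // false
    }
}
```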

157

00:14:22,810  -->  00:14:23,990
So that's about it.

158

00:14:24,030  -->  00:14:30,070
But check the resources section for some great pointers on this topic along with some additional notes

159

00:14:30,130  -->  00:14:30,870
.

160

00:14:31,120  -->  00:14:35,950
All right, as mentioned earlier, let's now do a quick demo of encoding.

161

00:14:37,080  -->  00:14:42,250
OK, in this demo we are going to encode three different characters using different encoding schemes,

162

00:14:43,360  -->  00:14:50,830
and the characters are the character a, the Euro character, and finally an emoji symbol, which

163

00:14:50,830  -->  00:14:53,530
is for tears of joy in this case.

164

00:14:53,560  -->  00:14:56,670
So the first two characters, a and Euro, are BMP characters,

165

00:14:56,740  -->  00:15:04,390
while the last one is a non-BMP character. And to have these written in your editor, in our example

166

00:15:04,390  -->  00:15:05,600
in this case it's Eclipse,

167

00:15:05,800  -->  00:15:13,870
for that you need to specify this particular property. You should make sure that it is UTF-8.

168

00:15:13,900  -->  00:15:15,730
Only then will you be able to insert them.

169

00:15:15,730  -->  00:15:20,880
So by default in Eclipse it's going to be Cp1252 on a Windows machine.

170

00:15:21,100  -->  00:15:24,050
So you need to make sure that it is in UTF-8 encoding.

171

00:15:24,500  -->  00:15:30,040
And so in order to apply the encoding, we are using this method called printEncoding, and

172

00:15:30,040  -->  00:15:34,840
all of these symbols are being passed as input to this method.

173

00:15:34,840  -->  00:15:36,600
So let's look at this method.

174

00:15:36,760  -->  00:15:38,890
So the input is a string.

175

00:15:39,120  -->  00:15:47,260
And here we are invoking this method called getBytes, which has a single parameter which accepts the

176

00:15:47,260  -->  00:15:48,850
name of the encoding scheme.

177

00:15:48,880  -->  00:15:55,380
So we are encoding the input symbol with all these different encoding schemes like ASCII, ISO, that is,

178

00:15:55,490  -->  00:15:59,940
ISO-8859-1, and the UTF encoding schemes, and it returns,

179

00:15:59,960  -->  00:16:02,960
of course, the underlying byte sequence.

180

00:16:03,000  -->  00:16:09,460
OK, and Arrays.toString is basically used to concatenate those byte sequences so that we can print

181

00:16:09,460  -->  00:16:10,310
it.

182

00:16:10,360  -->  00:16:16,100
So we are applying all these different encoding schemes, that is, UTF-8, UTF-16, UTF-16BE

183

00:16:16,110  -->  00:16:16,240
.

184

00:16:16,240  -->  00:16:18,500
So it is specific to big endian here.

185

00:16:18,550  -->  00:16:19,840
And this is little endian.

186

00:16:19,990  -->  00:16:21,810
And here there is no endianness.

187

00:16:22,060  -->  00:16:26,400
So it will include the BOM symbol, B-O-M, byte order mark.

188

00:16:26,740  -->  00:16:31,900
So let's just go ahead and run it, and let's look at the output for all three of them.
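The demo code itself is not part of this transcript, so what follows is only a sketch of what such a program might look like. The method name printEncoding and the calls to getBytes and Arrays.toString come from the lesson. The class name EncodingDemo, the helper encode, and the use of the Charset overload of getBytes instead of a charset name string are assumptions made for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical reconstruction; the actual demo source is not shown in the lesson.
class EncodingDemo {
    // Encodes the symbol and renders the signed byte values, e.g. "[0, 97]".
    static String encode(String symbol, Charset charset) {
        return Arrays.toString(symbol.getBytes(charset));
    }

    // Prints the byte sequence of the symbol under several encoding schemes.
    static void printEncoding(String symbol) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,   // adds a byte order mark
            StandardCharsets.UTF_16BE, // big endian, no byte order mark
            StandardCharsets.UTF_16LE  // little endian, no byte order mark
        };
        for (Charset charset : charsets) {
            System.out.println(charset.name() + ": " + encode(symbol, charset));
        }
        System.out.println();
    }

    public static void main(String[] args) {
        printEncoding("a");            // BMP, plain ASCII
        printEncoding("\u20AC");       // BMP, Euro symbol
        printEncoding("\uD83D\uDE02"); // non-BMP, tears-of-joy emoji
    }
}
```

Java bytes are signed, so values above 127 print as negative numbers. For example, the byte order mark FE FF shows up as -2, -1.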

189

00:16:31,900  -->  00:16:40,140
So the first is the character a, and in ASCII it is represented using the numeric value 97, which has a hexadecimal

190

00:16:40,140  -->  00:16:40,770
code.

191

00:16:40,920  -->  00:16:49,230
But here it's displaying the byte as a decimal value, which is 97. ISO also has the same, and UTF-

192

00:16:49,390  -->  00:16:55,010
8 also has 97, because for ASCII, UTF-8 is backward compatible.

193

00:16:55,270  -->  00:16:57,580
And it uses only one byte.

194

00:16:57,610  -->  00:16:59,700
Now UTF-16 uses two bytes.

195

00:16:59,730  -->  00:17:00,330
We know that.

196

00:17:00,580  -->  00:17:07,510
But in this case it is also including the bytes for the byte order mark too, to indicate the endianness.

197

00:17:07,510  -->  00:17:14,200
And in this case it is big endian, because you can also see it here: it is 0, 97, and there it is 97,

198

00:17:14,200  -->  00:17:14,530
0.

199

00:17:14,530  -->  00:17:16,100
Just the reverse.

200

00:17:16,120  -->  00:17:22,800
So UTF-16 needs 16 bits. And for the Euro symbol, ASCII says 63,

201

00:17:23,090  -->  00:17:29,410
and the numeric value 63 corresponds to the character question mark, which means that it is unable

202

00:17:29,410  -->  00:17:31,490
to encode it.

203

00:17:31,690  -->  00:17:33,260
And the same goes with ISO.

204

00:17:33,710  -->  00:17:39,490
Whereas UTF-8 is able to encode it. UTF-8 needs three bytes, whereas UTF-16

205

00:17:39,490  -->  00:17:40,840
needs two bytes.

206

00:17:40,840  -->  00:17:44,590
You can ignore these two bytes here. And UTF-16

207

00:17:44,910  -->  00:17:48,930
BE and LE also need only two bytes.

208

00:17:48,940  -->  00:17:54,130
So as you can see, the BOM is indicated at the start of the data stream.

209

00:17:54,130  -->  00:17:55,340
Now let's look at the last one.

210

00:17:55,360  -->  00:17:58,030
Last one is the emoji symbol.

211

00:17:58,170  -->  00:18:03,450
And since it's a non-BMP character, it needs two code points.

212

00:18:03,460  -->  00:18:03,700
OK.

213

00:18:03,700  -->  00:18:08,050
Even though it's a single character, it needs two code points, and ASCII cannot handle it.

214

00:18:08,050  -->  00:18:12,510
So when it looks at this first individual character, it says I cannot do it.

215

00:18:12,510  -->  00:18:18,000
So it puts a question mark, and for the second one it does what it is supposed to do.

216

00:18:18,040  -->  00:18:24,330
So it prints the corresponding numeric value, which is 50. And the same goes with ISO. For UTF-8,

217

00:18:24,340  -->  00:18:26,890
it is able to encode it properly.

218

00:18:26,920  -->  00:18:30,120
So it needs 4 bytes. UTF-16 needs 4 bytes,

219

00:18:30,130  -->  00:18:32,740
and once again we can ignore the first two.

220

00:18:32,740  -->  00:18:36,080
Now UTF-16BE also needs only 4 bytes,

221

00:18:36,220  -->  00:18:37,620
and this also needs four bytes.

222

00:18:37,660  -->  00:18:39,950
The only difference here is this is 31 5:51.

223

00:18:39,970  -->  00:18:44,640
And here it is 9 to 6:31 0 50 and 50 0.

224

00:18:44,650  -->  00:18:48,860
Now in the browser, when the browser looks at this byte sequence,

225

00:18:49,150  -->  00:18:52,000
it should render a single character.

226

00:18:52,270  -->  00:18:52,610
OK.

227

00:18:52,630  -->  00:18:54,370
The emoji symbol.

228

00:18:54,490  -->  00:19:00,520
Now if the browser does not have the capability to render non-BMP characters, then you will

229

00:19:00,520  -->  00:19:02,120
not be able to see those symbols.

230

00:19:02,150  -->  00:19:03,920
You'll see some other symbol.

231

00:19:04,120  -->  00:19:05,700
So that's the thing.

232

00:19:05,710  -->  00:19:11,500
And many browsers are supporting them and some are not, I guess, so in that case you will just

233

00:19:11,540  -->  00:19:13,390
see some other symbol.

234

00:19:13,750  -->  00:19:15,100
So that's about it.

235

00:19:15,370  -->  00:19:24,370
And sometimes even databases do not support non-BMP characters, so when you try to insert any data

236

00:19:24,630  -->  00:19:27,790
which has non-BMP characters,

237

00:19:27,820  -->  00:19:31,970
you might get some kind of an exception, because some databases don't support them.

238

00:19:32,020  -->  00:19:32,770
So that's about it.

239

00:19:32,770  -->  00:19:38,170
Hope you liked this character encoding lesson. It's a very fundamental lesson, and that's about it

240

00:19:38,170  -->  00:19:38,200
.

241

00:19:38,200  -->  00:19:38,880
Thank you.

242

00:19:38,960  -->  00:19:39,910
And happy coding