What’s going on with those Swift Substrings?

Although string slicing is a common task in software development, many Swift developers find Swift’s substring API complicated and frustrating.

Comparing substring in other languages 

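Let’s see how substring works in other programming languages. Each line takes the six characters at offsets 2 through 7; note that C++ and Objective-C expect a start and a length, while Java and Python expect a start and an end:

C++: str.substr(2, 6)
Java: str.substring(2, 8)
Python: str[2:8]
Objective-C: [str substringWithRange:NSMakeRange(2, 6)]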

Now let’s see how substring works in Swift:

let startIndex = str.index(str.startIndex, offsetBy: 2)
let endIndex = str.index(startIndex, offsetBy: 6)
let substring = str[startIndex..<endIndex]

The root of the problem – UTF-8

To understand how strings work, we need to go back to the basics – Unicode and UTF-8.

When we work with strings, we have the feeling that we are dealing with plain text, just an array of symbols and numbers, but this is a lie. It used to be the case back when computers worked with something called ASCII.

ASCII was a way to represent all the important characters (letters, digits, symbols) as numbers between 0 and 127, with the printable characters starting at 32, so every character fit in 1 byte. And what about 128 to 255? Every vendor could use that range for whatever they wanted, so you can imagine the mess we had when computers spread from the US to non-English-speaking countries.

That’s where Unicode comes in. Unicode is a way of representing every letter and digit you can think of, in almost every language in the world, and not only that: Unicode represents emojis as well.

Unicode code points go up to U+10FFFF, so a fixed-width representation needs 4 bytes per character. Since most of the characters we type are English letters and digits, it is very inefficient to allocate 4 bytes for each character when in most cases 1 byte is enough. That’s the final piece of the puzzle: encoding, and in this case, UTF-8.

UTF-8 is a variable-width encoding: it stores each Unicode character in 1 to 4 bytes and uses the smaller sizes for the most common characters, so text takes less space.

How does UTF-8 work under the hood?

You can find plenty of material about text encoding on the web, so I won’t dive deep into it in this post, but I will at least try to explain the basics.

Unicode code points fit in 4 bytes. The first 128 are good old ASCII, and UTF-8 stores each of them in 1 byte. The next 1,920 (U+0080 to U+07FF) cover accented Latin letters, Greek, Cyrillic, Hebrew, Arabic, etc., and take 2 bytes each. After that come most Chinese, Japanese and Korean characters and so on, which take 3 bytes, and finally everything beyond U+FFFF, such as most emojis, takes 4 bytes.

So, when dealing with English text, for example, in most cases we don’t need all 4 bytes; a single byte per character is enough.

So how does the encoding work under the hood? How can we tell how many bytes were used for each character? The leading bits of the first byte tell the decoder how many bytes the character occupies.
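
To make that rule concrete, here is a minimal sketch of how a decoder could read the length from the first byte. The helper name utf8SequenceLength is made up for this post, and the check is simplified: a real decoder also rejects a few lead bytes (such as C0 and C1) that this sketch accepts.

```swift
// Hypothetical helper (not part of the standard library): number of bytes in a
// UTF-8 sequence, judged only from the bit pattern of its first byte.
// Returns nil for a continuation byte (10xxxxxx) or any other unexpected pattern.
func utf8SequenceLength(leadByte: UInt8) -> Int? {
    switch leadByte {
    case 0b0000_0000...0b0111_1111: return 1 // 0xxxxxxx: ASCII
    case 0b1100_0000...0b1101_1111: return 2 // 110xxxxx
    case 0b1110_0000...0b1110_1111: return 3 // 1110xxxx
    case 0b1111_0000...0b1111_0111: return 4 // 11110xxx
    default: return nil
    }
}

// Compare the prediction with the number of bytes Swift actually stores.
for character in ["A", "\u{E9}", "\u{20A4}", "😀"] {
    let bytes = Array(character.utf8)
    let predicted = utf8SequenceLength(leadByte: bytes[0])
    print(character, bytes.count, predicted ?? 0)
}
```

For each of the four sample characters, the length predicted from the first byte should match the real byte count (1, 2, 3 and 4).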

The best way to understand that is by example – 

Let’s try to represent the character “₤” in UTF-8.

The Unicode value of “₤” is U+20A4. It falls in the range U+0800 to U+FFFF, meaning it will take 3 bytes of memory.

The binary value of 20A4 is 0010 0000 1010 0100.

Now, we want to announce that we are going to use 3 bytes, so we start with three 1 bits followed by a 0:

1110

Then, to finish the first byte, we take the first 4 bits of 20A4:

1110-0010

For the second byte, we start with 10, followed by the next 6 bits of 20A4:

1110-0010 1000-0010

For the third byte we do the same: start with 10, followed by the remaining 6 bits of 20A4:

1110-0010 1000-0010 1010-0100

In hex, that is E2 82 A4. This is how we represent “₤” in UTF-8, and, with sequences of 1 to 4 bytes, almost every symbol in the world.
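
If you want to check the walkthrough yourself, the following sketch repeats the three steps with bit operations and compares the result with the bytes Swift produces for the same character. The variable names are mine; only the utf8 view comes from the standard library.

```swift
// Manual 3-byte UTF-8 encoding of U+20A4, mirroring the steps above.
let codePoint: UInt32 = 0x20A4

// Byte 1: the 1110 prefix followed by the top 4 bits.
let byte1 = UInt8(0b1110_0000 | (codePoint >> 12))
// Byte 2: the 10 prefix followed by the middle 6 bits.
let byte2 = UInt8(0b1000_0000 | ((codePoint >> 6) & 0b0011_1111))
// Byte 3: the 10 prefix followed by the last 6 bits.
let byte3 = UInt8(0b1000_0000 | (codePoint & 0b0011_1111))

let manual = [byte1, byte2, byte3]
let builtIn = Array("\u{20A4}".utf8) // what Swift itself stores for the character

print(manual.map { String($0, radix: 16, uppercase: true) }) // prints ["E2", "82", "A4"]
print(manual == builtIn) // prints true
```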

UTF-8 is better than UTF-16, right?

Yes. I mean, No. Well, it’s complicated 🙂 It depends on what you use it for. 

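For example, if most of the symbols are in the ASCII range (English letters and digits), UTF-8 is very efficient. But if you write mostly in Chinese or Japanese, where UTF-8 needs 3 bytes per character and UTF-16 only 2, UTF-16 is probably more suitable. Emojis don’t tip the balance: most of them take 4 bytes in both encodings.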

Just like in the previous part, I’m going to use an example to explain:

The string “AB” is “41 42” in UTF-8 and “41 00 42 00” in UTF-16 (little-endian byte order).

But the Japanese string “いろは” is “E3 81 84 E3 82 8D E3 81 AF” in UTF-8 and “44 30 8D 30 6F 30” in UTF-16: 6 bytes instead of 9, two-thirds of the space.
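
The same comparison can be reproduced in Swift through the utf8 and utf16 views of a string. This is only a sketch: encodedSizes is a helper name I made up, and the emoji is included as an extra data point.

```swift
// Hypothetical helper: size in bytes of `text` in each encoding.
func encodedSizes(of text: String) -> (utf8: Int, utf16: Int) {
    // A UTF-8 code unit is 1 byte; a UTF-16 code unit is 2 bytes.
    return (utf8: text.utf8.count, utf16: text.utf16.count * 2)
}

for text in ["AB", "いろは", "😀"] {
    let sizes = encodedSizes(of: text)
    print("\(text): UTF-8 \(sizes.utf8) bytes, UTF-16 \(sizes.utf16) bytes")
}
// AB: UTF-8 2 bytes, UTF-16 4 bytes
// いろは: UTF-8 9 bytes, UTF-16 6 bytes
// 😀: UTF-8 4 bytes, UTF-16 4 bytes
```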

What does encoding have to do with substrings in Swift?

Consider the following string:

let str = "I have a friend called 摩西, nice name 😀, ha?"

The above string has 43 characters, but it takes up 50 bytes in UTF-8. The previous explanation of how UTF-8 works already tells you why the number of characters and the number of bytes differ.

For example, the name “摩西” which holds 2 characters, takes 6 bytes of space, and the single emoji “😀” takes 4 bytes of space (!). 

Let’s see what happens when we try to print the length of the string:

print(str.count) // prints 43
print(str.utf8.count) // prints 50

If you could write str[9], you would expect the program to jump straight to the tenth character and return “f”. But to do that, it needs to know the memory address of the tenth character in the string. That would be very easy if each character always took 1 byte, but since the size of each character is unknown, the program has to iterate from the beginning of the string until it reaches the tenth character. That makes it an O(n) operation just to access a single character.

So, when you try to do something like str[10...15], the string has to be walked from the beginning twice, once for each bound, only to get a substring.
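
To see why a character offset cannot simply be turned into a memory offset, here is a small sketch that prints where a few characters of the sample string start in the UTF-8 bytes. The helper utf8ByteOffset is hypothetical; it is built only on index(_:offsetBy:) and the utf8 view.

```swift
let str = "I have a friend called 摩西, nice name 😀, ha?"

// Hypothetical helper: the UTF-8 byte offset at which the character at
// `offset` starts. Finding the index requires walking `offset` characters.
func utf8ByteOffset(ofCharacterAt offset: Int, in text: String) -> Int {
    let index = text.index(text.startIndex, offsetBy: offset)
    return text.utf8.distance(from: text.utf8.startIndex, to: index)
}

for offset in [9, 24, 38] {
    let character = str[str.index(str.startIndex, offsetBy: offset)]
    print(offset, character, utf8ByteOffset(ofCharacterAt: offset, in: str))
}
// 9 f 9
// 24 西 26
// 38 , 45
```

Before the first non-ASCII character the two offsets agree, and after it they drift apart, which is why the string has to be walked.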

Meet String.Index

To do string manipulation efficiently, we have String.Index. A String.Index is a struct that represents an already-calculated position in a string.

Because we don’t want to walk the string from the beginning twice to get a substring, we can use String.Index to cover the distance only once.

We iterate to the first index, and then continue from the first index to the second one.

let str = "I have a friend called 摩西, nice name 😀, ha?"
let startIndex = str.index(str.startIndex, offsetBy: 2)

// startIndex is a calculated index of the character at offset 2 (the third character)

let endIndex = str.index(startIndex, offsetBy: 6)

// endIndex is calculated by advancing 6 places from startIndex, so we don't need to iterate from the beginning of the string again

The advantage of an index is that the position is already calculated. You can save it, pass it along and reuse it instead of making the program walk the string every time you want to access a specific location. Keep in mind that an index is only valid for the string it was computed from.
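
As a sketch of that reuse, the example below finds the first comma once and then uses the same index for two different slices. The function splitAtFirstComma is a made-up helper; firstIndex(of:) and index(after:) are standard library calls.

```swift
// Hypothetical helper: splits `text` around its first comma,
// or returns nil if there is no comma.
func splitAtFirstComma(_ text: String) -> (before: Substring, after: Substring)? {
    // One search; the resulting index is reused for both slices.
    guard let commaIndex = text.firstIndex(of: ",") else { return nil }
    let before = text[..<commaIndex]
    let after = text[text.index(after: commaIndex)...]
    return (before: before, after: after)
}

let str = "I have a friend called 摩西, nice name 😀, ha?"
if let parts = splitAtFirstComma(str) {
    print(parts.before) // prints "I have a friend called 摩西"
    print(parts.after)  // prints " nice name 😀, ha?"
}
```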

More String Efficiency Tricks

Strings in Swift have more efficiency tricks besides String.Index. For example, did you know that when you slice a string, no new storage is allocated for the result? The slice is a Substring, a view that references the relevant part of the original string’s memory, so the string and its substring share storage, which is efficient in terms of both memory and performance. The flip side is that a substring keeps the whole original string’s storage alive, so it is meant for short-term use. If you want to keep working with the slice for longer, convert it to a String.

let string = "This is a regular string"
let startIndex = string.index(string.startIndex, offsetBy : 1)
let endIndex = string.index(startIndex, offsetBy: 3)
let substring = string[startIndex..<endIndex]

// substring is actually a Substring (a slice of the original) and not really a String.

let substringAsString = String(substring) // now we converted it to a real string
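
In practice you often don’t need the conversion right away, because String and Substring both conform to StringProtocol. Here is a minimal sketch, with countVowels as a made-up helper, of a function that accepts either type:

```swift
// Hypothetical helper: works on String and Substring alike through StringProtocol.
func countVowels<S: StringProtocol>(in text: S) -> Int {
    let vowels = Set("aeiouAEIOU")
    return text.filter { vowels.contains($0) }.count
}

let string = "This is a regular string"
let slice = string[string.startIndex..<string.index(string.startIndex, offsetBy: 4)]

print(type(of: slice))          // prints "Substring"
print(countVowels(in: slice))   // prints 1 (the slice is "This")
print(countVowels(in: string))  // prints 7

// Convert only when the slice needs to outlive the original string.
let owned = String(slice)
print(type(of: owned))          // prints "String"
```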

The other neat trick Swift has is COW (Copy-On-Write). COW means that when you copy a value (Strings and the standard collection types), the copy shares the original’s storage and no new memory is allocated until one of them is modified. In other words, the allocation is lazy.
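
COW is easiest to observe with an array, since short strings may be stored inline without a shared buffer at all. The sketch below compares buffer addresses before and after a write. The helper bufferAddress is made up, and the addresses are an implementation detail, so treat this as an illustration of current behavior rather than a guarantee.

```swift
// Hypothetical helper: address of an array's element buffer. It is used for
// comparison only; the pointer must not be dereferenced outside the closure.
func bufferAddress(of array: [Int]) -> UnsafePointer<Int>? {
    return array.withUnsafeBufferPointer { $0.baseAddress }
}

let original = Array(1...1000)
var copy = original // no elements are copied at this point

let sharedBeforeWrite = bufferAddress(of: original) == bufferAddress(of: copy)

copy.append(0) // first write: `copy` has to get a buffer of its own

let sharedAfterWrite = bufferAddress(of: original) == bufferAddress(of: copy)

print(sharedBeforeWrite, sharedAfterWrite)
print(original.count, copy.count) // prints "1000 1001"
```

I would expect the first line to show that the storage is shared before the write and separate after it, while the original array stays untouched.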

String Extension

If you still want to do something like str[5..<10] to get a substring, you can add a string extension to do just that:

extension String {

    // Character at an integer offset, e.g. str[5]. Walks from startIndex: O(n).
    subscript (_ index: Int) -> String {
        return String(self[self.index(startIndex, offsetBy: index)])
    }

    // Half-open range, e.g. str[5..<10]
    subscript (_ range: Range<Int>) -> String {
        let lowerBound = index(startIndex, offsetBy: range.lowerBound)
        // Continue from lowerBound instead of walking from the start again
        let upperBound = index(lowerBound, offsetBy: range.upperBound - range.lowerBound)
        return String(self[lowerBound..<upperBound])
    }

    // Closed range, e.g. str[5...10]
    subscript (_ range: ClosedRange<Int>) -> String {
        let lowerBound = index(startIndex, offsetBy: range.lowerBound)
        let upperBound = index(lowerBound, offsetBy: range.upperBound - range.lowerBound)
        return String(self[lowerBound...upperBound])
    }

    // From an offset to the end, e.g. str[5...]
    subscript (_ range: PartialRangeFrom<Int>) -> String {
        return String(self[index(startIndex, offsetBy: range.lowerBound)...])
    }

    // From the start up to, but not including, an offset, e.g. str[..<10]
    subscript (_ range: PartialRangeUpTo<Int>) -> String {
        return String(self[..<index(startIndex, offsetBy: range.upperBound)])
    }

    // From the start through an offset, e.g. str[...10]
    subscript (_ range: PartialRangeThrough<Int>) -> String {
        return String(self[...index(startIndex, offsetBy: range.upperBound)])
    }

}

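But remember that this is not the most efficient approach: every integer subscript walks the string from the start, so each access is O(n). The recommendation is to work with String.Index directly if you want your program to run faster.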
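
To illustrate the cost, here is a rough sketch that rebuilds a string character by character in two ways. The block repeats a minimal copy of the Int subscript so that it runs by itself; drop that copy if the full extension is already in your project, otherwise the declaration will clash.

```swift
extension String {
    // Minimal copy of the Int subscript from the extension above.
    subscript (_ offset: Int) -> String {
        return String(self[index(startIndex, offsetBy: offset)])
    }
}

let str = "I have a friend called 摩西, nice name 😀, ha?"

// Integer subscripts in a loop: every access walks from the start again,
// so the whole loop does quadratic work.
var viaSubscript = ""
for offset in 0..<str.count {
    let piece: String = str[offset]
    viaSubscript += piece
}

// Plain iteration: a single pass over the characters.
var viaIteration = ""
for character in str {
    viaIteration.append(character)
}

print(viaSubscript == viaIteration) // prints true
```

Both loops produce the same result, but only the second one scales linearly with the length of the string.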

Summary 

A string is not a plain array; under the hood, it’s a complicated and very different beast. The Swift language developers chose performance and memory efficiency over readability. Fortunately, just like in other areas of development, we can choose between readability and efficiency ourselves: adding a string extension, as in many other cases, can make our life as developers much easier.
