There are a number of different String types in V8 which are optimized for various situations. If we look in src/objects/objects.h we can see the object hierarchy:
Object
SMI
HeapObject // superclass for every object instans allocated on the heap.
...
Name
String
SeqString
SeqOneByteString
SeqTwoByteString
SlicedString
ConsString
ThinString
ExternalString
ExternalOneByteString
ExternalTwoByteString
InternalizedString
SeqInternalizedString
SeqOneByteInternalizedString
SeqTwoByteInternalizedString
ConsInternalizedString
ExternalInternalizedString
ExternalOneByteInternalizedString
ExternalTwoByteInternalizedString
Note that v8::String is declared in include/v8.h
.
Name
as can be seen extends HeapObject and anything that can be used as a
property name should extend Name.
Looking at the declaration in include/v8.h we find the following:
int GetIdentityHash();
static Name* Cast(Value* obj)
This section contains some notes about unicode which are good to keep in mind while reading the code and understanding what some of the string related functions do.
An abstract character in unicode has a name like LATIN SMALL LETTER A
.
A code point is a number associated with an abstract character, for example U+0061 (U+ where U stands for Unicode).
A plane is a range of 65536 continuous code points from U+n0000 to U+nFFFF. There are 17 equals groups:
Plane 0: U+0000 -> U+FFFF Basic Multilingual Plane (BMP)
Plane 1: U+10000 -> U+1FFFF Supplementary Multilingual Plane
Plane 2: U+20000 -> U+2FFFF Supplementary Ideographic Plane
Plane 3: U+30000 -> U+3FFFF
...
Plane 16: U+100000 -> U+10FFFF Supplementary Private Use Area B.
BPM contains most of the characters that are used when programming which can have 4 hex digits.
Not to be confused with code points which are just numbers used to look up abstract characters in the format U+. The memory in a computer does not deal with code points or abstract characters but instead it deals with code units which is a bit sequence. You can take an code point and put it through a function that translates it into a code unit. This process is called character encoding. There are different encodings and JavaScript uses UTF-16.
Code points from BPM are transformed/encoded into single code unit of 16-bits Code points from the other planes are transformed/encoded into two code units each of size 16-bits.
Abstract Character Code point Code unit
LATIN SMALL LETTER A U+0061 0x0061
LATIN SMALL LETTER B U+0062 0x0062
...
To represent the Swedish character 'å' we need to code units each of 16-bits:
$ node -p "console.log('\u0061\u030A')"
å
This pair for code units represent a single abstract character. The first is called the high-surrogate code unit, and the second the log-surrogate code unit.
In the case or 'å' the second code unit is called a combinging mark as it modifies the character preceeding it.
A String extends Name
and has a length and content. The content can be made
up of 1 or 2 byte characters.
Looking at the declaration in include/v8.h we find the following:
enum Encoding {
UNKNOWN_ENCODING = 0x1,
TWO_BYTE_ENCODING = 0x0,
ONE_BYTE_ENCODING = 0x8
};
int Length() const;
int Uft8Length const;
bool IsOneByte() const;
Example usages can be found in test/string_test.cc. Looking at the functions I've seen one that returns the actual bytes from the String. You can get at the in utf8 format using:
String::Utf8Value print_value(joined);
std::cout << *print_value << '\n';
So that is the only string class in include/v8.h, but there are a lot more implementations that we've seen above. There are used for various cases, for example for indexing, concatenation, and slicing).
Represents a sequence of characters which (the characters) are either one or two bytes in length.
These are string that are built using:
const str = "one" + "two";
This would be represented as:
+--------------+
| |
[str|one|two] [one|...] [two|...]
| |
+-----------------------+
So we can see that one and two in str are pointers to existing strings.
This function was introduced in b097a8e5de7.
The normal way of creating a Local would be to use:
Local<String> str = String::NewFromUtf8(isolate_, "åäö").ToLocalChecked();
Now, String::NewFromUtf8 looks like this:
MaybeLocal<String> String::NewFromUtf8(Isolate* isolate, const char* data,
NewStringType type, int length) {
NEW_STRING(isolate, String, NewFromUtf8, char, data, type, length);
return result;
}
The macro NEW_STRING
(which can be found in src/api/api.cc)
We can take a look at the expanded macro using:
$ g++ -I./out/x64.release_gcc/gen -I./include -I. -E src/api/api.cc > output
MaybeLocal<String> String::NewFromUtf8(Isolate* isolate, const char* data,
NewStringType type, int length) {
MaybeLocal<String> result;
if (length == 0) {
result = String::Empty(isolate);
} else if (length > i::String::kMaxLength) {
result = MaybeLocal<String>();
} else {
i::Isolate* i_isolate = reinterpret_cast<internal::Isolate*>(isolate);
i::VMState<v8::OTHER> __state__((i_isolate));;
i::RuntimeCallTimerScope _runtime_timer( i_isolate, i::RuntimeCallCounterId::kAPI_String_NewFromUtf8);
do {
auto&& logger = (i_isolate)->logger();
if (logger->is_logging())
logger->ApiEntryCall("v8::" "String" "::" "NewFromUtf8");
} while (false);
if (length < 0)
length = StringLength(data);
i::Handle<i::String> handle_result = NewString(i_isolate->factory(), type, i::Vector<const char>(data, length)) .ToHandleChecked();
result = Utils::ToLocal(handle_result);
};
return result;
}
There are a number of checks that are not required when we have a string literal, and some that can be checked at compile time, like the max length.
template <int N>
static V8_WARN_UNUSED_RESULT Local<String> NewFromUtf8Literal(
Isolate* isolate, const char (&literal)[N],
NewStringType type = NewStringType::kNormal) {
static_assert(N <= kMaxLength, "String is too long");
return NewFromUtf8Literal(isolate, literal, type, N - 1);
}
Notice the static_assert
which is performed at compile time.
Local String::NewFromUtf8Literal(Isolate* isolate, const char* literal,
NewStringType type, int length) {
DCHECK_LE(length, i::String::kMaxLength);
i::Isolate* i_isolate = reinterpret_castinternal::Isolate*(isolate);
ENTER_V8_NO_SCRIPT_NO_EXCEPTION(i_isolate);
LOG_API(i_isolate, String, NewFromUtf8Literal);
i::Handle<i::String> handle_result =
NewString(i_isolate->factory(), type,
i::Vector(literal, length))
.ToHandleChecked();
return Utils::ToLocal(handle_result);
}
We can see that the compiler instantiates this template as:
```console
$ nm -C test/string_test | grep NewFromUtf8Literal
U v8::String::NewFromUtf8Literal(v8::Isolate*, char const*, v8::NewStringType, int)
000000000041551c W v8::Local<v8::String> v8::String::NewFromUtf8Literal<10>(v8::Isolate*, char const (&) [10], v8::NewStringType)
These strings are located on the native heap. The ExternalString structure has a pointer to this external location and the usual length field for all Strings.