Itheima (黑马程序员) Technical Exchange Community

Title: [Primer] Strings and String Length

Author: 哦啊啊    Time: 2016-9-27 22:43
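First, a few concepts need to be pinned down.
String: one of the basic objects studied in formal language theory; a finite sequence of characters.
The following is quoted from the Chinese Wikipedia article "String" (字符串):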

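Let ∑ be a non-empty finite set called the alphabet. The elements of ∑ are called "symbols" or "characters". A string (or word) over ∑ is any finite sequence of elements of ∑. For example, if ∑ = {0, 1}, then 0101 is a string over ∑.
The length of a string is the number of characters in the string (the length of the sequence); it can be any non-negative integer. The "empty string" is the unique string over ∑ of length 0, and is denoted ε or λ.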

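Note that the notion of length here is perfectly clear.

The following is quoted from the Chinese Wikipedia article "String", section "String datatypes":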

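String length
Although formal strings can have arbitrary (but finite) length, the length of strings in real languages is often limited to an artificial maximum. Generally speaking, there are two kinds of string datatypes: "fixed-length strings", which have a fixed maximum length and use the same amount of memory whether or not that maximum is reached; and "variable-length strings", whose length is not arbitrarily fixed and which use a varying amount of memory depending on their actual size. Most strings in modern programming languages are variable-length strings. Despite the name, every variable-length string still has a limit on its length; in general this limit depends only on the amount of available memory.
……
Representation
A common representation is an array of character codes, with each character occupying one byte (as in ASCII) or two bytes (as in Unicode). The length can be determined by a terminator (usually NUL, ASCII code 0; the C programming language uses this method), or by an integer stored in front of the characters that gives the length (the Pascal language uses this method).
[Examples omitted]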

As this shows, the relationship between a string's length and its storage is not unique.
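To make the two layouts from the quoted passage concrete, here is a minimal C++ sketch that stores the same three-character string once with a terminating zero byte and once with a leading length byte. The helper names `terminated_length`, `prefixed_length` and `demo_layouts` are made up for illustration, and the one-byte prefix is only an assumption modeled on classic Pascal short strings; real implementations may use wider prefixes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: length of a zero-terminated (C-style) byte string.
// Every byte is inspected until the terminator is found, so the cost is linear.
std::size_t terminated_length(const unsigned char* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// Hypothetical helper: length of a length-prefixed string.
// ASSUMPTION: byte 0 stores the length (0..255), as in classic Pascal
// short strings; the characters follow it.
std::size_t prefixed_length(const unsigned char* s) {
    return s[0];
}

// The same abstract string "abc" in both layouts.
void demo_layouts() {
    const unsigned char terminated[] = {'a', 'b', 'c', 0};  // 4 bytes stored
    const unsigned char prefixed[]   = {3, 'a', 'b', 'c'};  // 4 bytes stored

    assert(terminated_length(terminated) == 3);
    assert(prefixed_length(prefixed) == 3);
}
```

Both arrays occupy four bytes for a string of length 3, but they arrive at that length differently: the terminated form has to scan for the zero byte and cannot contain one, while the prefixed form reads a single byte and, under the one-byte assumption above, is capped at 255 characters.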

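In C/C++, strings can be represented and stored in several forms. The most common basic representation (the one used by both the C standard library and the C++ standard library) is generally called the C-style string; its formal name in ISO C++ is NTCTS (null terminated character string).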

ISO C++11 17.3.17 [defns.ntcts]
NTCTS
a sequence of values that have character type that precede the terminating null character type value charT()

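Concretely, a typical scenario is this: an array with more than one element of type char/wchar_t/char16_t/char32_t, or of any other extended character type the implementation allows, can hold an NTCTS.
Note the wording "a sequence of values" rather than "characters"; it signals an abstract meaning. Below we will see that "character" (but not "multibyte character") has an explicitly restricted meaning in the C++ standard library.
Incidentally, "multibyte character" is one of the basic terms used throughout C++ as a whole, so it is considered independently of "character":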
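As a rough illustration of an NTCTS over different character types, the sketch below measures the same three-value sequence stored as `char`, `wchar_t`, `char16_t` and `char32_t`. It relies on `std::char_traits<CharT>::length`, which counts the values that precede the terminating `CharT()`; the wrapper `ntcts_length` and the function `demo_ntcts` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>  // std::char_traits

// Hypothetical helper: number of values that precede the terminating
// CharT() in an NTCTS, for any standard character type CharT.
template <class CharT>
std::size_t ntcts_length(const CharT* s) {
    return std::char_traits<CharT>::length(s);
}

void demo_ntcts() {
    // Same abstract length, different element types.
    assert(ntcts_length("abc") == 3);    // char
    assert(ntcts_length(L"abc") == 3);   // wchar_t
    assert(ntcts_length(u"abc") == 3);   // char16_t
    assert(ntcts_length(U"abc") == 3);   // char32_t

    // Each literal is an array of 4 elements: 3 values plus the terminator,
    // so the storage size scales with the element type.
    static_assert(sizeof("abc") == 4 * sizeof(char), "char array size");
    static_assert(sizeof(u"abc") == 4 * sizeof(char16_t), "char16_t array size");
    static_assert(sizeof(U"abc") == 4 * sizeof(char32_t), "char32_t array size");
}
```

The length is 3 in every case, while the number of bytes occupied depends on the element type, which is one way to see why the definition speaks of a sequence of values rather than of bytes.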
ISO C++11 1.3.13 [defns.multibyte]
multibyte character
sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment
[ Note: The extended character set is a superset of the basic character set (2.3).—end note ]
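(Notions such as character and basic execution character set are necessary background, but they are easy to understand, so they are not expanded on here.)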

ISO C++11 17.5.2.1.4 Character sequences [character.seq]
1 The C standard library makes widespread use of characters and character sequences that follow a few uniform conventions:
— A letter is any of the 26 lowercase or 26 uppercase letters in the basic execution character set.166
— The decimal-point character is the (single-byte) character used by functions that convert between a (single-byte) character sequence and a value of one of the floating-point types. It is used in the character sequence to denote the beginning of a fractional part. It is represented in Clauses 18 through 30 and Annex D by a period, ’.’, which is also its value in the "C" locale, but may change during program execution by a call to setlocale(int, const char*),167 or by a change to a locale object, as described in Clauses 22.3 and 27.






Author: 哦啊啊    Time: 2016-9-27 22:47
— A character sequence is an array object (8.3.4) A that can be declared as T A [N], where T is any of the types char, unsigned char, or signed char (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that points to its first element.
166) Note that this definition differs from the definition in ISO C 7.1.1.
167) declared in <clocale> (22.6).

On this basis, the more specific NTBS (null terminated byte string) can be defined:
ISO C++11 17.5.2.1.4.1 Byte strings [byte.strings]
1 A null-terminated byte string, or NTBS, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); no other element in the sequence has the value zero.168
2 The length of an NTBS is the number of elements that precede the terminating null character. An empty ntbs has a length of zero.
3 The value of an NTBS is the sequence of values of the elements up to and including the terminating null character.
4 A static NTBS is an ntbs with static storage duration.169
168) Many of the objects manipulated by function signatures declared in <cstring> (21.7) are character sequences or NTBSs.
The size of some of these character sequences is limited by a length value, maintained separately from the character sequence.
169) A string literal, such as "abc", is a static ntbs.

The elements of an NTBS are usually represented by objects or values of type char.
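The distinction between the length of an NTBS and the size of the array that holds it can be checked with a small C++ sketch. It uses only `std::strlen` and `sizeof`; the function name `demo_ntbs` and the sample arrays are illustrative choices, not anything taken from the standard text.

```cpp
#include <cassert>
#include <cstring>  // std::strlen

void demo_ntbs() {
    // An 8-element array whose initial elements hold the NTBS "abc".
    char buf[8] = "abc";
    assert(sizeof(buf) == 8);        // storage: 8 bytes
    assert(std::strlen(buf) == 3);   // NTBS length: elements before the null

    // The first zero element ends the NTBS, so only "ab" belongs to it here.
    const char with_zero[] = "ab\0cd";
    assert(sizeof(with_zero) == 6);  // 5 written chars + implicit terminator
    assert(std::strlen(with_zero) == 2);

    // The empty NTBS has length zero but still occupies one element.
    const char empty[] = "";
    assert(sizeof(empty) == 1);
    assert(std::strlen(empty) == 0);
}
```

In each case the length is determined by the position of the first zero element, not by how much storage the array provides.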

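NTBS pins down the storage on top of NTCTS and character sequence. In addition, one important reason for defining NTBS separately from NTCTS is to delimit the extension of NTMBS, the multibyte string (which permits variable-length encodings). Note that this is where the various "lengths" start to differ significantly.
First, the definition: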

ISO C++11 17.5.2.1.4.2 Multibyte strings [multibyte.strings]
1 A null-terminated multibyte string, or NTMBS, is an NTBS that constitutes a sequence of valid multibyte characters, beginning and ending in the initial shift state.170
2 A static NTMBS is an NTMBS with static storage duration.
170) An NTBS that contains characters only from the basic execution character set is also an NTMBS. Each multibyte character then consists of a single byte.
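So NTMBSs are a subset of NTBSs, and an NTMBS may contain characters made up of several (consecutive) bytes.
By 17.5.2.1.4.1/2, the length of an NTMBS, being an NTBS, is the number of elements it contains. The notion of "element" here differs from the one for NTBS: the point is that, viewed as an NTMBS, the length is the number of multibyte characters rather than the number of characters (bytes). Clearly, for a general NTMBS, the length and the number of bytes occupied can differ even after excluding the terminating null character.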
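The gap between the number of bytes and the number of multibyte characters can be sketched in C++ as follows. This is only an illustration under a loud assumption: the multibyte encoding is taken to be UTF-8, whereas the standard leaves the encoding to the implementation and locale. The helper `utf8_char_count` and the function `demo_ntmbs` are hypothetical names, and the helper does not validate its input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>  // std::strlen

// Hypothetical helper. ASSUMPTION: the NTMBS is valid UTF-8 (the standard
// does not fix the multibyte encoding; it depends on the locale).
// Counts characters by skipping continuation bytes of the form 10xxxxxx.
std::size_t utf8_char_count(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

void demo_ntmbs() {
    // The two characters of "中文" spelled out as UTF-8 bytes, so the result
    // does not depend on the encoding of the source file.
    const char text[] = "\xE4\xB8\xAD\xE6\x96\x87";
    assert(std::strlen(text) == 6);       // bytes before the null
    assert(utf8_char_count(text) == 2);   // multibyte characters

    // With only basic-character-set bytes, the two counts coincide.
    assert(utf8_char_count("abc") == std::strlen("abc"));
}
```

For the six-byte sample the byte count is 6 and the character count is 2, which is the kind of difference the paragraph above refers to; for a string of single-byte characters both counts agree.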

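However, ISO C differs in some key respects on "length". In short, the string used by the ISO C standard library corresponds to NTBS, and for the notion analogous to NTMBS the length is still counted in bytes: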
ISO C99/C11(N1570)
7.1.1/1 A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.
Author: chenhao597    Time: 2016-9-28 23:39
OP, what exactly are you trying to say? Your train of thought is hard to follow.



