Read data from a text file or string.
The string str or file associated with fid is read from and parsed according to format. The function is an extension of strread and textread. Differences include the ability to read from either a file or a string, additional options, and additional format specifiers.
The input is interpreted as a sequence of words, delimiters (such as whitespace), and literals. The characters that form delimiters and whitespace are determined by the options. The format consists of format specifiers interspersed between literals. In the format, whitespace forms a delimiter between consecutive literals, but is otherwise ignored.
The output C is a cell array where the number of columns is determined by the number of format specifiers.
The first word of the input is matched to the first specifier of the format and placed in the first column of the output; the second is matched to the second specifier and placed in the second column and so forth. If there are more words than specifiers then the process is repeated until all words have been processed or the limit imposed by repeat has been met (see below).
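As a minimal sketch of this cycling behavior (the input string is invented for illustration), six words read with two specifiers are dealt alternately into two columns:

```octave
% Six words, two specifiers: the format is reused three times.
c = textscan ("1 2 3 4 5 6", "%f %f");

% Words 1, 3, 5 go to the first column; words 2, 4, 6 to the second.
disp (c{1}');
disp (c{2}');
```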
The string format describes how the words in str should be parsed. As in fscanf, any (non-whitespace) text in the format that is not one of these specifiers is considered a literal. If there is a literal between two format specifiers then that same literal must appear in the input stream between the matching words.
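The effect of a literal can be sketched with a colon placed between two numeric specifiers. The time-like input here is made up for the example; the colon must appear in the input between each matched pair of words.

```octave
% The ":" in the format is a literal that must appear in the input
% between the two numbers of each pair.
c = textscan ("10:30 11:45", "%f:%f");

hours = c{1};    % numbers before the colon
minutes = c{2};  % numbers after the colon
disp ([hours, minutes]);
```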
The following specifiers are valid:
%f
%f64
%n
The word is parsed as a number and converted to double.
%f32
The word is parsed as a number and converted to single (float).
%d
%d8
%d16
%d32
%d64
The word is parsed as a number and converted to int8, int16, int32, or int64. If no size is specified then int32 is used.
%u
%u8
%u16
%u32
%u64
The word is parsed as a number and converted to uint8, uint16, uint32, or uint64. If no size is specified then uint32 is used.
%s
The word is parsed as a string ending at the last character before whitespace, an end-of-line, or a delimiter specified in the options.
%q
The word is parsed as a "quoted string". If the first character of the string is a double quote (") then the string includes everything until a matching double quote, including whitespace, delimiters, and end-of-line characters. If a pair of consecutive double quotes appears in the input, it is replaced in the output by a single double quote. For example, the input "He said ""Hello""" would return the value ’He said "Hello"’.
%c
The next character of the input is read. This includes delimiters, whitespace, and end-of-line characters.
%[…]
%[^…]
In the first form, the word consists of the longest run consisting of only characters between the brackets. Ranges of characters can be specified by a hyphen; for example, %[0-9a-zA-Z] matches all alphanumeric characters (if the underlying character set is ASCII). Since MATLAB treats hyphens literally, this expansion only applies to alphanumeric characters. To include ’-’ in the set, it should appear first or last in the brackets; to include ’]’, it should be the first character. If the first character is ’^’ then the word consists of characters not listed.
%N…
For %s, %c, %d, %f, %n, and %u, an optional width can be specified as %Ns, etc., where N is an integer > 1. For %c, this causes exactly N characters to be read instead of a single character. For the other specifiers, it is an upper bound on the number of characters read; normal delimiters can cause fewer characters to be read. For complex numbers, this limit applies to the real and imaginary components individually. For %f and %n, format specifiers like %N.Mf are allowed, where M is an upper bound on the number of characters after the decimal point to be considered; subsequent digits are skipped. For example, the specifier %8.2f would read 12.345e6 as 1.234e7.
%*…
The word specified by the remainder of the conversion specifier is skipped.
literals
In addition the format may contain literal character strings; these will be skipped during reading. If the input string does not match this literal, the processing terminates.
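The sketch below combines several of the specifiers listed above on a made-up two-line input: a default-size integer, a string, a skipped word, and a single-precision number. The variable names are illustrative only.

```octave
% Four words per line; the third is skipped via "%*s",
% so only three output columns are produced.
str = "7 apple skip 1.5\n8 pear skip 2.5";
c = textscan (str, "%d %s %*s %f32");

disp (class (c{1}));   % class used for %d with no size given
disp (c{2});           % cell array of strings
disp (class (c{3}));   % class used for %f32
```

Because the skipped word produces no output, the cell array has one column fewer than the number of conversion specifiers.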
Parsed words corresponding to the first specifier are returned in the first output argument and likewise for the rest of the specifiers.
By default, if there is only one input argument, format is "%f".
This means that numbers are read from the input into a single column vector.
If format is explicitly empty ("") then textscan will return data in a number of columns matching the number of fields on the first data line of the input. Either of these is suitable only when the input is exclusively numeric.
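A minimal sketch of the single-argument case, using an invented numeric string: with no format given, every number lands in one column vector regardless of how the input is split across lines.

```octave
% No format argument: the default "%f" is applied repeatedly.
c = textscan ("1 2 3\n4 5 6");

% All six numbers are returned in a single column.
disp (c{1}');
```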
For example, the string
str = "Bunny Bugs 5.5\nDuck Daffy -7.5e-5\nPenguin Tux 6";
can be read using
a = textscan (str, "%s %s %f");
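To show how the result of this call might be inspected, the following sketch repeats the call in self-contained form and pulls one record out of the three output columns. The variable names col1, col2, and vals are chosen here for illustration.

```octave
str = "Bunny Bugs 5.5\nDuck Daffy -7.5e-5\nPenguin Tux 6";
a = textscan (str, "%s %s %f");

col1 = a{1};   % cell array of strings from the first %s
col2 = a{2};   % cell array of strings from the second %s
vals = a{3};   % double column vector from %f

% Print the second record.
printf ("%s %s: %g\n", col1{2}, col2{2}, vals(2));
```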
The optional numeric argument repeat can be used for limiting the number of items read:
-1
Read all of the string or file until the end (default).
N
Read until the first of two conditions occurs: 1) the format has been processed N times, or 2) N lines of the input have been processed. Zero (0) is an acceptable value for repeat. Currently, end-of-line characters inside %q, %c, and %[…] conversions do not contribute to the line count. This is incompatible with MATLAB and may change in the future.
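A short sketch of the repeat argument, with an invented single-line input: the two-specifier format is applied only twice, so the last two words are left unread.

```octave
% Limit processing to two passes over the format.
c = textscan ("1 2 3 4 5 6", "%f %f", 2);

% Only the first four words are consumed.
disp ([c{1}, c{2}]);
```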
The behavior of textscan can be changed via property/value pairs.
The following properties are recognized:
"BufSize"
This specifies the number of bytes to use for the internal buffer. A modest speed improvement may be obtained by setting this to a large value when reading a large file, especially if the input contains long strings. The default is 4096, or a value dependent on n if that is specified.
"CollectOutput"
A value of 1 or true instructs textscan to concatenate consecutive columns of the same class in the output cell array. A value of 0 or false (default) leaves output in distinct columns.
"CommentStyle"
Specify parts of the input which are considered comments and will be skipped. value is the comment style and can be either (1) A string or 1x1 cell string, to skip everything to the right of it; (2) A cell array of two strings, to skip everything between the first and second strings. Comments are only parsed where whitespace is accepted and do not act as delimiters.
"Delimiter"
If value is a string, any character in value will be used to split the input into words. If value is a cell array of strings, any string in the array will be used to split the input into words. (default value = any whitespace.)
"EmptyValue"
Value to return for empty numeric values in non-whitespace delimited data. The default is NaN. When the data type does not support NaN (int32 for example), then the default is zero.
"EndOfLine"
value can be either an empty string or a single character specifying the end-of-line character, or the pair "\r\n" (CRLF). In the latter case, any of "\r", "\n", or "\r\n" is counted as a (single) newline. If no value is given, "\r\n" is used.
"HeaderLines"
The first value number of lines of fid are skipped. Note that this does not refer to the first non-comment lines, but the first lines of any type.
"MultipleDelimsAsOne"
If value is nonzero, treat a series of consecutive delimiters, without whitespace in between, as a single delimiter. Consecutive delimiter series need not be vertically aligned. Without this option, a single delimiter before the end of the line does not cause the line to be considered to end with an empty value, but a single delimiter at the start of a line causes the line to be considered to start with an empty value.
"TreatAsEmpty"
Treat single occurrences (surrounded by delimiters or whitespace) of the string(s) in value as missing values.
"ReturnOnError"
If set to numerical 1 or true, return normally as soon as an error is encountered, such as trying to read a string using %f. If set to 0 or false, return an error and no data.
"Whitespace"
Any character in value will be interpreted as whitespace and trimmed. The default value for whitespace is " \b\r\n\t" (note the space). Unless whitespace is set to "" (empty) AND at least one "%s" format conversion specifier is supplied, a space is always part of whitespace.
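A few of these properties can be sketched on small comma-separated inputs. The strings and the EmptyValue of -1 are invented for illustration; the example covers only "Delimiter", "CollectOutput", and "EmptyValue".

```octave
str = "1,2\n3,4";

% "Delimiter": split words on commas.
c = textscan (str, "%f %f", "Delimiter", ",");

% "CollectOutput": merge the two double columns into one matrix.
m = textscan (str, "%f %f", "Delimiter", ",", "CollectOutput", true);

% "EmptyValue": substitute -1 for the empty field between the two commas.
e = textscan ("1,,3\n4,5,6", "%f %f %f", ...
              "Delimiter", ",", "EmptyValue", -1);

disp (m{1});
disp (e{2}');
```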
When the number of words in str or fid doesn’t match an exact multiple of the number of format conversion specifiers, textscan’s behavior depends on whether the last character of the string or file is an end-of-line as specified by the EndOfLine option:
last character = end-of-line
Data columns are padded with empty fields, NaN or 0 (for integer fields) so that all columns have equal length.
last character is not end-of-line
Data columns are not padded; textscan returns columns of unequal length.
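One way to observe the two cases above is to read the same five-word input with and without a trailing end-of-line and compare the column lengths. This is only a sketch with an invented input; it prints the lengths rather than assuming them.

```octave
% Five words for three specifiers: not an exact multiple.
a = textscan ("1 2 3\n4 5\n", "%f %f %f");  % input ends with end-of-line
b = textscan ("1 2 3\n4 5", "%f %f %f");    % input does not

% Number of elements in each output column, for each case.
disp (cellfun (@numel, a));
disp (cellfun (@numel, b));
```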
The second output position provides the location, in characters from the beginning of the file or string, where processing stopped.
See also: dlmread, fscanf, load, strread, textread.
Package: octave