5.1.2. Breaking into tokens

After the character filters have been applied, the text needs to be split into pieces that can be operated on. Lucene doesn’t act on large strings of data; instead, it acts on what are known as tokens. A tokenizer takes a piece of text and generates any number of tokens from it (even zero!). For English text, a common choice is the standard tokenizer, which splits text on whitespace, such as spaces and newlines, as well as on certain punctuation characters, such as the dash. In figure 5.1 this is represented by breaking the string “share your experience with NoSql and big data technologies” into the tokens share, your, experience, with, NoSql, and, big, data, and technologies.
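
You can try this tokenization yourself with Elasticsearch’s _analyze API, which runs a piece of text through a tokenizer and returns the resulting tokens. The exact request syntax varies between Elasticsearch versions; the following sketch assumes a recent version listening on localhost:9200:

curl -XPOST 'localhost:9200/_analyze?pretty' \
  -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "share your experience with NoSql and big data technologies"
}'

The response lists each token along with its character offsets and position in the stream, so you can see exactly where the standard tokenizer decided to split.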