Hi guys,
I'm not sure if you'll find this useful but a few weeks ago I spun off a number of classes that I used on a regular basis into a simple package. It's called jTokeniser and comes with various methods to split a string into tokens.
jTokeniser comprises four tokenisers that all extend an abstract Tokeniser class:
* WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace, including spaces, newlines, tabs and linefeeds.
* StringTokeniser - this is basically the same as java.util.StringTokenizer with some extra methods (and extends from Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser, but you can specify a set of characters to be used as word delimiters.
* RegexTokeniser - this tokeniser is much more flexible, as you can use regular expressions to define what a token is. So, "\w+" means that whenever it matches one or more word characters (letters, digits or underscores), it will consider that a word. By default, it uses a regular expression equivalent to a whitespace tokeniser.
* BreakIteratorTokeniser - the most sophisticated of the four, although it should only really be used on natural language strings to isolate words. It comes with built-in rules about how to find words, knowing how to disregard punctuation, etc.
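To give a rough feel for the four strategies, here is a minimal sketch that uses only standard JDK classes (String.split, java.util.StringTokenizer, java.util.regex and java.text.BreakIterator). It is not the jTokeniser API: the class name TokeniserSketch and its method names are made up for illustration, and the real classes may well behave differently in the details.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from jTokeniser.
public class TokeniserSketch {

    // Whitespace strategy: any run of whitespace separates two tokens.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Delimiter-set strategy: every character in delims ends a token.
    static List<String> delimiters(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Regex strategy: the pattern describes a token, so each match is a token.
    static List<String> regex(String text, String tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // BreakIterator strategy: walk the locale-aware word boundaries and keep
    // only the segments that contain at least one letter or digit, which
    // drops the punctuation and whitespace segments.
    static List<String> words(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespace("one  two\tthree\nfour"));
        System.out.println(delimiters("a,b;c", ",;"));
        System.out.println(regex("Hello, world!", "\\w+"));
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

The regex and BreakIterator versions both drop the punctuation in "Hello, world!", but only the BreakIterator one relies on locale-specific rules rather than a hand-written pattern, which is why it suits natural language text best.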
I've been using these a lot for my research and they work fine. The Javadoc API is available here. It's all open source and licensed under the LGPL. I hope others find it useful.