Blatant plug: jTokeniser 1.0

arooaroo · 2005-04-23 09:47:09

Hi guys,

I'm not sure if you'll find this useful but a few weeks ago I spun off a number of classes that I used on a regular basis into a simple package. It's called jTokeniser and comes with various methods to split a string into tokens.

jTokeniser comprises four tokenisers that all extend an abstract Tokeniser class:

* WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace, including spaces, newlines, tabs and linefeeds.

* StringTokeniser - this is basically the same as java.util.StringTokenizer with some extra methods (and it extends Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser; however, you can specify a set of characters to be used as word delimiters.

* RegexTokeniser - this tokeniser is much more flexible, as you can use a regular expression to define what a token is. So, "\w+" means that whenever it matches one or more word characters (letters, digits or underscores), it will consider that a word. By default, it uses a regular expression equivalent to a whitespace tokeniser.

* BreakIteratorTokeniser - the most sophisticated of the four, although it should only really be used on natural language strings to isolate words. It comes with built-in rules about how to find words, knowing how to disregard punctuation, etc.
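
To give a feel for the four strategies listed above, here is a rough sketch that uses only standard JDK classes. It is not jTokeniser's API: the class TokeniserSketch and the methods onWhitespace, onDelimiters, matching and words are names made up for illustration, and the real library may well behave differently in the details (for example, in how the word-boundary tokeniser decides what counts as punctuation).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: these helpers use JDK classes, not the jTokeniser API.
final class TokeniserSketch {

    // Whitespace strategy: split on runs of spaces, tabs and line breaks.
    static List<String> onWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Delimiter-set strategy: every character in `delimiters` separates tokens.
    static List<String> onDelimiters(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    // Regex strategy: the pattern describes a token, not a separator.
    static List<String> matching(String text, String tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // Word-boundary strategy: walk BreakIterator boundaries and keep only
    // segments containing a letter or digit (an assumed filtering rule).
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) tokens.add(segment);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(onWhitespace("one  two\tthree\nfour")); // [one, two, three, four]
        System.out.println(onDelimiters("a,b;c", ",;"));           // [a, b, c]
        System.out.println(matching("foo_1, bar!", "\\w+"));       // [foo_1, bar]
        System.out.println(words("Hello, world! It works."));      // [Hello, world, It, works]
    }
}
```

Note the difference between the first two and the regex version: splitting describes what sits between tokens, whereas a pattern like "\w+" describes the tokens themselves. A whitespace-equivalent token pattern in this sketch would be "\S+".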

I've been using these a lot for my research and they work fine. Javadoc API available here. It's all open source and licensed under the LGPL. I hope others find it useful.
