FFP - Flat file parsing library version 1.2
© Copyright 2003,4 dr. Cristiano Sadun - This library is available under LGPL

Download from SourceForge.

1. Purpose

This Java class library eases parsing of "flat files", i.e. text files where lines can be interpreted according to positional patterns. Even though more modern ways of representing data, such as XML and related standards, exist today, flat files are still widespread due to their portability, immediacy and ease of handling.

However, although it is conceptually simple (or perhaps precisely because of that ;-), parsing flat files in a robust and flexible way has some pitfalls - and is generally boring. This library reduces the work required to the bare minimum: declaring the formats to expect (with different degrees of validation and diagnostics depending on the amount of work desired), then running the parser on a file stream and being notified of the results.

The library can handle both multi-line formats and files containing lines that have different known formats ("mixed format" files).

2. Usage

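Parsing always occurs in the same phases: declaring the line format(s) to expect, registering one or more listeners with a parser, and running the parser on the input.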

In the simplest and most common case, where each line has the same format, here's a code example for parsing (the format consists of a length-5 numeric field followed by the constant 'XX' followed by a length-10 alphanumeric field):

 FlatFileParser parser = new FlatFileParser(LineFormat.fromImage("##### XX @@@@@@@@@@"));
 parser.addListener(new EchoListener());
 parser.parse(new File("myfile.txt"));

Note: this code assumes that the line separator sequence in the file is the same as the platform's default. Otherwise, invoking setLineSeparator() on the parser object lets you specify a different sequence (for example \n instead of \r\n for Un*x-like text files).

There are, however, much more complex cases, where different formats may occur depending on known conditions and records span multiple lines.

3. LineFormat objects

LineFormat objects are used to declare a line format. A line format is composed of positional fields. In certain cases, a single logical "line" spans several physical lines: LineFormat always describes a logical line. For example, a format like

           0123456789012345678901234567890123456789
           1######@@TYPE1
           2@@@##################
(where # indicates a digit and @ an alphanumeric character) spans 2 physical lines.

Note: for each of the methods discussed below, two overloads are provided: one having a physical line number as first parameter (int physicalLine) and one not. The latter speeds up coding in the common case where a logical line always consists of a single physical line.

A LineFormat can have a name (declared at construction). This is useful to identify the line type when mixed formats need to be parsed.

Declaring fields

There are several ways to declare fields, and each can go with more or less information. The more information the parser has, the more validation it can perform when parsing. The minimum amount of information is, of course, the physical position of the field.

1 Absolute definition

The first way of declaring fields is the most obvious: declaring them one at a time, together with their position. For this, use the defineField overloads, where the start and end parameters indicate the start and end positions.

Note: the end position is considered excluded from the field (in a way similar to Java's substring method). So the first declaration below, for example, declares a field from position 0 to position 4.

For example,

  LineFormat format = new LineFormat();
  format.defineField(0, 5);  // declares a field from position 0 to 4 (length 5)
  format.defineField(5, 7);  // declares a field at positions 5 and 6 (length 2)
  format.defineField(7, 12); // declares a field from position 7 to 11 (length 5)

Several overloads also allow declaring the name, type and image of the fields (see below). If only the positions need to be declared, an alternative is to use defineFields() with an array of two-integer arrays:

  LineFormat format = new LineFormat();
  format.defineFields(new int[][] { {0, 5}, {5, 7}, {7, 12} });
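
The end-exclusive convention can be illustrated without the library at all. The following is a minimal sketch in plain Java that assumes nothing about FFP's internals: the names PositionSketch and slice are hypothetical, and the code only applies String.substring to the same three {start, end} pairs declared above.

```java
class PositionSketch {
    /** Cuts a line into fields given {start, end} pairs, end excluded. */
    static String[] slice(String line, int[][] positions) {
        String[] fields = new String[positions.length];
        for (int i = 0; i < positions.length; i++) {
            // substring(start, end) excludes the end index as well
            fields[i] = line.substring(positions[i][0], positions[i][1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        int[][] positions = { {0, 5}, {5, 7}, {7, 12} };
        for (String field : slice("12345XXhello", positions)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Because each end position doubles as the next start position, consecutive fields never overlap and never leave a gap.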

2 Relative definition

Every time a field is declared, the LineFormat object keeps track of the current end position (starting at 0). Therefore, it's possible to declare fields without specifying the start position, which is assumed to be the previous field's end position; to do this, use the defineNextField() overloads.

For example,

  LineFormat format = new LineFormat(); // "current" position starts at 0.
  format.defineNextField(5);  // declares a field from 0 to end==5 (excluded). Current position becomes 5.
  format.defineNextField(7);  // declares a field from 5 to end==7 (excluded). Current position becomes 7.
  format.defineNextField(12); // declares a field from 7 to end==12 (excluded). Current position becomes 12.

As above, several overloads also allow declaring the name, type and image of the fields (see below), and an alternative for specifying just the positions uses defineNextFields() with an array of integers:

  LineFormat format = new LineFormat();
  format.defineNextFields(new int[]{ 5, 7, 12 });
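
The bookkeeping behind relative definitions is small enough to sketch. The code below is not taken from the library; RelativeSketch and toAbsolute are hypothetical names, and the sketch only shows how a list of end positions maps onto the {start, end} pairs of the absolute style.

```java
import java.util.Arrays;

class RelativeSketch {
    /** Turns a list of end positions into {start, end} pairs, end excluded. */
    static int[][] toAbsolute(int[] ends) {
        int[][] positions = new int[ends.length][];
        int current = 0; // the "current" position starts at 0
        for (int i = 0; i < ends.length; i++) {
            positions[i] = new int[] { current, ends[i] };
            current = ends[i]; // the next field starts where this one ended
        }
        return positions;
    }

    public static void main(String[] args) {
        // prints [[0, 5], [5, 7], [7, 12]]
        System.out.println(Arrays.deepToString(toAbsolute(new int[] { 5, 7, 12 })));
    }
}
```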

3 Names and types

Each field may have a name and a type. LineFormat provides a number of define.. overloads with a String name parameter, which can be used to define a field's name. By default, a field's name is field_p_n, where p is the physical line number index (starting from 1) and n the field number in the physical line (starting from 1).

A field can also be typed. The Type class exposes the available types, and the default is UNDEFINED. If the type is NUMERIC, the parser will attempt to convert the field's value into a number when parsing (by using Double.parseDouble), and if the conversion fails, the line will fail to parse. ALFA corresponds to any character sequence, while CONSTANT is used to declare that a certain field always has the same value.
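
As an illustration of the default naming rule and of what a Double.parseDouble based check accepts, here is a small standalone sketch. TypeSketch, defaultName and isNumeric are hypothetical names that are not part of FFP; only Double.parseDouble is the real call mentioned above.

```java
class TypeSketch {
    /** Builds the default name field_p_n (both indexes start from 1). */
    static String defaultName(int physicalLine, int fieldNumber) {
        return "field_" + physicalLine + "_" + fieldNumber;
    }

    /** Returns true if Double.parseDouble accepts the value. */
    static boolean isNumeric(String value) {
        try {
            Double.parseDouble(value);
            return true;
        } catch (NumberFormatException e) {
            return false; // this is the case in which a line would fail to parse
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultName(1, 2));
        System.out.println(isNumeric("00042"));
        System.out.println(isNumeric("12a"));
    }
}
```

Note that Double.parseDouble is more lenient than a pure digit check: it also accepts values such as "1e3" and ignores leading and trailing whitespace.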

4 Field images

In the latter case (CONSTANT), it is also possible to declare which constant value is assumed by the field, by using one of the define.. overloads with a String image parameter. For CONSTANT types, a field's image is the value of the constant. For NUMERIC, a field's image, if present, is used as a java.text.DecimalFormat template, and when a field is parsed, it will be validated against that format (and if the value is not parseable, the line will fail to parse).
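
How a DecimalFormat template can act as a validator may not be obvious, so here is a hedged sketch. It is an assumption about one possible way of doing it, not the library's actual code: NumericImageSketch and accepts are hypothetical names, and the check simply requires that java.text.DecimalFormat consumes the whole value.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

class NumericImageSketch {
    /** Returns true if the whole value can be parsed with the given pattern. */
    static boolean accepts(String pattern, String value) {
        // Locale.ROOT keeps '.' as the decimal separator whatever the platform locale is
        DecimalFormat format =
            new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.ROOT));
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(value, position);
        // parse() stops at the first character it cannot use, so require full consumption
        return parsed != null && position.getIndex() == value.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("#0.00", "123.45"));
        System.out.println(accepts("#0.00", "12x"));
    }
}
```

DecimalFormat parsing is lenient about digit counts, so a check of this kind verifies that the value is a number in the template's notation rather than that it has exactly the template's shape.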

5 Line image definition

The third way to define line formats is by using an image - a pictorial denotation of the various fields in the line.

Note: The specific image syntax depends on the parser installed in the line format. Check LineFormat's API class comment for the latest details on the syntax.

The basic image syntax separates the fields by blanks and represents each field by a sequence of symbols with the same length as the field. The symbol '#' denotes Type.NUMERIC and the symbol '@' denotes Type.ALFA; forward-slash escape sequences stand for these characters and for the blank (/# denotes #, /@ denotes @, /b denotes a blank and // denotes the forward slash itself). Any other character sequence denotes itself and assumes type CONSTANT.

For example, "Hello/bWorld ### @@@@@" indicates that the first field is the constant "Hello World", followed by a numeric field of length 3 and an alphanumeric field of length 5. Note that the spaces between the field images are there only as separators and will not affect the parsing (the fields will be assumed consecutive).
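
To make the basic syntax concrete, the following sketch decodes a single-line image into field types and lengths. It is a simplified reading of the rules above, written without access to the library's own image parser; ImageSketch, Field, decode and unescape are all hypothetical names, and types are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

class ImageSketch {
    /** One decoded field: a type name, a length and, for constants only, the value. */
    static final class Field {
        final String type;     // "NUMERIC", "ALFA" or "CONSTANT"
        final int length;
        final String constant; // null unless type is "CONSTANT"

        Field(String type, int length, String constant) {
            this.type = type;
            this.length = length;
            this.constant = constant;
        }
    }

    /** Decodes a single-line image whose fields are separated by blanks. */
    static List<Field> decode(String image) {
        List<Field> fields = new ArrayList<>();
        for (String token : image.trim().split(" +")) {
            if (token.isEmpty()) {
                continue; // an empty image yields no fields
            }
            if (token.matches("#+")) {
                fields.add(new Field("NUMERIC", token.length(), null));
            } else if (token.matches("@+")) {
                fields.add(new Field("ALFA", token.length(), null));
            } else {
                String value = unescape(token);
                fields.add(new Field("CONSTANT", value.length(), value));
            }
        }
        return fields;
    }

    /** Resolves /#, /@, /b and // inside a constant token. */
    static String unescape(String token) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '/' && i + 1 < token.length()) {
                char next = token.charAt(++i);
                out.append(next == 'b' ? ' ' : next); // /b is a blank, the rest stand for themselves
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling ImageSketch.decode("Hello/bWorld ### @@@@@") yields three fields: the constant "Hello World" (length 11), a numeric field of length 3 and an alphanumeric field of length 5.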

LineFormat.fromImage() can be used to create a line format object from an image string (there are two overloads, one with, one without a format name). The fields will have default names, but will have a valid type.

Note: The same net effect can be obtained by instantiating a LineFormat and then invoking declareLineImage().

A multi-line image can be created simply by inserting the proper line separator sequence: for example "#####\r\n##" will match two lines, each with one numeric field, the first of length 5, the second of length 2.

The line separator at the end of the last (or only) line is implicit.
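
The relation between a multi-line image and the physical lines it consumes can be sketched as follows. This is an illustration only, under the assumption that the image does not end with an explicit separator; LogicalLineSketch and its two methods are hypothetical names, not library calls.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;

class LogicalLineSketch {
    /** Counts the physical lines described by an image that uses the given separator. */
    static int physicalLineCount(String image, String separator) {
        // the separator after the last line is implicit, so n separators mean n + 1 lines
        int count = 1;
        int from = 0;
        while ((from = image.indexOf(separator, from)) >= 0) {
            count++;
            from += separator.length();
        }
        return count;
    }

    /** Reads exactly n physical lines, failing if the input ends early. */
    static String[] readLogicalLine(BufferedReader in, int n) throws IOException {
        String[] lines = new String[n];
        for (int i = 0; i < n; i++) {
            lines[i] = in.readLine();
            if (lines[i] == null) {
                throw new EOFException("expected " + n + " physical lines, found " + i);
            }
        }
        return lines;
    }
}
```

For the image "#####\r\n##" and the separator "\r\n", physicalLineCount returns 2, so one logical line is assembled from two physical lines of the input.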

6 Parsing

The parsing process implemented by the parse() method is quite simple. The LineFormat will try to read as many physical lines as defined, from the current position. If there are not enough lines, an exception will occur.

Then, the fields are separated according to their stated position, and type validation is performed if possible (that is, if enough information has been given to the LineFormat). If the line does not have enough characters, or the validation fails, an exception is raised.

Finally, the resulting String array is returned. A version of the method which automatically trim()s the fields is also provided.

Note: LineFormat.parse() will seldom be called directly, but is rather invoked when necessary by FlatFileParser.parse().
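
The steps described in this section can be sketched for a single physical line as below. This is a standalone approximation, not the library's parse() method: ParseSketch is a hypothetical name, fields are described by plain end positions, and a boolean flag per field stands in for the NUMERIC type.

```java
class ParseSketch {
    /** Splits a line at the given end positions (end excluded), validating as it goes. */
    static String[] parse(String line, int[] ends, boolean[] numeric, boolean trim) {
        int required = ends.length == 0 ? 0 : ends[ends.length - 1];
        if (line.length() < required) {
            throw new IllegalArgumentException(
                "line has " + line.length() + " characters, format needs " + required);
        }
        String[] values = new String[ends.length];
        int start = 0;
        for (int i = 0; i < ends.length; i++) {
            String value = line.substring(start, ends[i]);
            if (numeric[i]) {
                Double.parseDouble(value); // throws NumberFormatException on a bad number
            }
            values[i] = trim ? value.trim() : value;
            start = ends[i];
        }
        return values;
    }
}
```

Since NumberFormatException is a subclass of IllegalArgumentException, a caller of this sketch can treat a short line and a failed numeric check uniformly.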

4. FlatFileParser

LineFormat exposes a parse() method which can be used to parse Strings. However, the line format must match the contents of the parsed line, or an FFPParseException is raised and the line fails to parse.

The FlatFileParser class offers services for

A flat file can contain different (logical) line types, recognizable depending on some criteria. Such files can be handled by FlatFileParser, if it is possible to define a condition which uniquely identifies the line format to use, for each line.

In order to do that, FFP defines the FlatFileParser.Condition interface and a number of implementations which cover common situations.

A condition is an object which can be associated with a LineFormat by using one of the declare() overloads in a FlatFileParser instance.

The instance stores the associations, and when its parse() method is invoked, it will attempt to match one and only one condition to the next available (logical) line. If no condition matches, or more than one does, an exception is raised (unless the parser has been instructed to ignore the fact by a previous call to setFailOnLineParsingError(false)). Otherwise, the associated LineFormat is used to parse the line.

A table of available conditions is presented at the end of the document.
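
The "one and only one" rule can be illustrated with standard Java predicates standing in for conditions. The sketch below is an assumption-laden model rather than the parser's code: SelectionSketch and select are hypothetical names, formats are identified by name only, and both failure cases are reported with IllegalStateException.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class SelectionSketch {
    /** Returns the name of the single format whose condition holds for the line. */
    static String select(Map<String, Predicate<String>> conditions, String line) {
        String match = null;
        for (Map.Entry<String, Predicate<String>> entry : conditions.entrySet()) {
            if (entry.getValue().test(line)) {
                if (match != null) {
                    throw new IllegalStateException(
                        "ambiguous line: both " + match + " and " + entry.getKey() + " match");
                }
                match = entry.getKey();
            }
        }
        if (match == null) {
            throw new IllegalStateException("no condition matches: " + line);
        }
        return match;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> conditions = new LinkedHashMap<>();
        conditions.put("header", line -> line.startsWith("HH"));
        conditions.put("detail", line -> line.startsWith("CC"));
        System.out.println(select(conditions, "CC12345")); // prints "detail"
    }
}
```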

Parsing events

During parsing, if one condition holds and the associated LineFormat applies correctly, the FlatFileParser will fire parsing events to every registered FlatFileParser.Listener.

The lineParsed() method is invoked - passing along the LineFormat which has been successfully used for parsing the line, the physical and logical line counts and an array of Strings containing the string denotation of the values in the line.

Note that typed fields (for example, numeric) have already been verified by the parser.

From version 1.1, it is possible to associate listeners directly to specific formats - rather than having them receive all the events and have to discern manually (by using multiple ifs). This is accomplished using addDispatch rather than addListener, passing a format and a Listener object:

 ffp.addDispatch(format1, listenerToFormat1);

In this case, the listener listenerToFormat1 will receive only events related to successful parsing of the specific lineformat format1.
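
The difference between general listeners and per-format dispatch can be modelled in a few lines. The sketch below only borrows the method names addListener and addDispatch from the text; everything else is a hypothetical stand-in (DispatchSketch, fire, formats keyed by name, listeners as Consumer objects) and says nothing about how FlatFileParser implements it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class DispatchSketch {
    private final List<Consumer<String[]>> listeners = new ArrayList<>();
    private final Map<String, List<Consumer<String[]>>> dispatch = new HashMap<>();

    /** Registers a listener that receives the values of every parsed line. */
    void addListener(Consumer<String[]> listener) {
        listeners.add(listener);
    }

    /** Registers a listener that receives only lines parsed with the named format. */
    void addDispatch(String formatName, Consumer<String[]> listener) {
        dispatch.computeIfAbsent(formatName, k -> new ArrayList<>()).add(listener);
    }

    /** Notifies the general listeners, then the ones bound to this format. */
    void fire(String formatName, String[] values) {
        for (Consumer<String[]> listener : listeners) {
            listener.accept(values);
        }
        List<Consumer<String[]>> bound = dispatch.get(formatName);
        if (bound != null) {
            for (Consumer<String[]> listener : bound) {
                listener.accept(values);
            }
        }
    }
}
```

With dispatch in place, a listener no longer needs to inspect the format it is handed in order to decide whether the event concerns it.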

Usage example

Here's a general example of usage:

 // Declare and define the line format(s)
 LineFormat format = new LineFormat("test format"); // The format is named "test format"..

 // ..has one header of 40 characters..
 format.defineNextField("header", 40);
 // ..and a constant TEST0000
 format.defineNextField(
  "transaction code",
  48,
  Type.CONSTANT,
  "TEST0000");
 // ... further definition of format ...

 // Create the parser and associate the format to the condition that a constant is found in the line..
 FlatFileParser ffp = new FlatFileParser();

 // ..by using the constructor which just looks into a line format and finds out the first constant field
 ffp.declare( new ConstantFoundInLineCondition(format), format );

 // Register a listener to parsing events which just prints out values
 // ( Alternatively, a dispatcher which listens only to events related to the specific LineFormat object ("format") 
 //   might be registered by using
 // 
 //    ffp.addDispatch(format, new FlatFileParser.Listener() { ... });
 // )
 // 
 ffp.addListener(new FlatFileParser.Listener() {
  public void lineParsed(LineFormat format,int logicalLinecount,int physicalLineCount,String[] values) {
    for(int i=0;i<values.length;i++) System.out.println(values[i]);
  }
 });



 // Finally, do the parsing (BufferedReader and FileReader come from java.io)
 ffp.parse(new BufferedReader(new FileReader("testfile.txt")));

Building new condition classes

Flat file parse discrimination is usually simple [1], but if necessary new conditions can be easily defined by implementing the FlatFileParser.Condition interface.

The method toString() must return a description of the condition - this is what is used in error reporting or logging. The central method of the interface is holds(logical line count, physical line count, reader). This method must be implemented so that it can check a specific condition on either the physical or logical line number (passed as parameters) or the line contents themselves. During parsing, the parser will invoke it, providing the appropriate parameter values. The invocation is synchronous, so if the method blocks, the parser will block.

Since most conditions depend on checking the contents of the line, the FlatFileParser.LineReader object is passed to the method - and its readLine() method can be invoked to retrieve the character stream. Of course, the work with this character stream may in itself be implemented as a (potentially complex) line-parsing operation (using perhaps LineFormat) but the nature of conditions for discriminating line types in flat files is such that often quicker and more direct methods are suitable.

The built-in implementation of FlatFileParser.LineReader ensures that the actual character stream is read only once from the file, so no Condition has to worry about the performance impact of disk I/O.
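
The shape of a custom condition, and the idea of reading each line only once however many conditions inspect it, can be sketched with stand-in types. None of the types below are the library's: ConditionSketch, its nested Condition interface, CachingLineReader, advance and startsWith are hypothetical, and the real FlatFileParser.Condition and FlatFileParser.LineReader signatures should be taken from the Javadoc.

```java
import java.io.BufferedReader;
import java.io.IOException;

class ConditionSketch {
    /** Stand-in for a line source that reads each line from the stream only once. */
    static final class CachingLineReader {
        private final BufferedReader in;
        private String current;
        private boolean loaded;
        int physicalReads; // how many times the underlying stream was read

        CachingLineReader(BufferedReader in) {
            this.in = in;
        }

        String readLine() throws IOException {
            if (!loaded) {
                current = in.readLine();
                loaded = true;
                physicalReads++;
            }
            return current; // later calls reuse the cached line
        }

        void advance() {
            loaded = false; // the next readLine() moves on to the following line
        }
    }

    /** Stand-in for a condition with the three parameters described above. */
    interface Condition {
        boolean holds(int logicalLineCount, int physicalLineCount, CachingLineReader reader)
            throws IOException;
    }

    /** Example condition: the line starts with a given prefix. */
    static Condition startsWith(final String prefix) {
        return new Condition() {
            public boolean holds(int logical, int physical, CachingLineReader reader)
                    throws IOException {
                String line = reader.readLine();
                return line != null && line.startsWith(prefix);
            }

            @Override
            public String toString() {
                return "line starts with '" + prefix + "'"; // used when reporting errors
            }
        };
    }
}
```

In this model, evaluating several conditions against the same line costs one read of the underlying stream, which is the property the paragraph above attributes to the built-in reader.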

Predefined condition classes

Condition class: Description

org.sadun.text.ffp.ConstantFoundInLineCondition: Checks the line for a constant value at a specific position.

If a LineFormat is available when creating an instance, it is possible to just specify which field in the LineFormat is to be checked by using either the ConstantFoundInLineCondition(LineFormat format, int n) or the ConstantFoundInLineCondition(LineFormat format) constructor (the latter automatically locates the position of the first constant field in the LineFormat).

Otherwise, constructors which require position and constant image can be used.

 // Declare a line format with two fields - the constant 'CC' followed by a 5 digit number field
 LineFormat test00format = LineFormat.fromImage("CC #####");

 // Associate the line format with a condition that uses it only if the first field is 'CC'
 FlatFileParser.Condition condition = new ConstantFoundInLineCondition(test00format);
 ffp.declare(condition, test00format);

org.sadun.text.ffp.ConstantLineCondition: Checks for a constant line.

If the constant contains spaces, or other image-related characters, LineFormat.makeConstantImage() can be used to build its image.

 // Declare the constant that makes up the whole line
 String constant = "END OF FILE";

 // Associate a line format with a condition that uses it only if the line equals the constant
 FlatFileParser.Condition condition = new ConstantLineCondition(constant);
 ffp.declare(condition, LineFormat.fromImage(LineFormat.makeConstantImage(constant)));

org.sadun.text.ffp.CountCondition: Verifies a simple numeric relationship between the current logical line number and another number.


org.sadun.text.ffp.AndCondition, org.sadun.text.ffp.NotCondition, org.sadun.text.ffp.OrCondition: Logical conditions.

These condition classes allow composing other condition objects into logical expressions.

 // Declare two alternative constants, either of which may make up the whole line
 String constant1 = "END OF FILE";
 String constant2 = "EOF";

 // Associate a line format with a condition that holds if the line equals either constant
 // (the format below is built from constant1)
 FlatFileParser.Condition condition1 = new ConstantLineCondition(constant1);
 FlatFileParser.Condition condition2 = new ConstantLineCondition(constant2);
 FlatFileParser.Condition condition = new OrCondition(condition1, condition2);
 ffp.declare(condition, LineFormat.fromImage(LineFormat.makeConstantImage(constant1)));
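
For readers more familiar with the standard library, composing conditions as objects is analogous to combining java.util.function.Predicate instances. The sketch below shows that analogy only; LogicSketch, constantLine and endMarker are hypothetical names and no FFP class is involved.

```java
import java.util.function.Predicate;

class LogicSketch {
    /** A predicate that holds when the whole line equals the constant. */
    static Predicate<String> constantLine(String constant) {
        return line -> line.equals(constant);
    }

    /** Holds for either of two alternative end markers (an "or" composition). */
    static Predicate<String> endMarker() {
        return constantLine("END OF FILE").or(constantLine("EOF"));
    }

    public static void main(String[] args) {
        Predicate<String> end = endMarker();
        Predicate<String> data = end.negate(); // a "not" composition
        System.out.println(end.test("EOF"));      // prints "true"
        System.out.println(data.test("CC12345")); // prints "true"
    }
}
```

As with the condition classes, the expression is built once out of small objects and then evaluated line by line, with no expression language to interpret.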


[1] This simplicity is also the rationale for a design where logical expressions are realized as object combinations rather than as an expression language to be interpreted at runtime.