Friday, September 25, 2009

Convert Fixed width file to Delimited in Unix

Needed to convert a Convert Fixed width file to Delimited recently. There are many solutions, but the simplest & best solutions involved using the tools available in Unix / Linux. Here are a few possible ways to do it on Unix / Linux.



Input File: fixed.txt
The column widths are 2, 4, 6 & 3

AABBBBCCCCCCDDD
EEFFFFGGGGGGHHH
IIJJJJKKKKKKLLL
MMNNNNOOOOOOPPP
QQRRRRSSSSSSTTT
UUVVVVWWWWWWXXX
YYZZZZAAAAAABBB


Output File: delim.csv

AA,BBBB,CCCCCC,DDD
EE,FFFF,GGGGGG,HHH
II,JJJJ,KKKKKK,LLL
MM,NNNN,OOOOOO,PPP
QQ,RRRR,SSSSSS,TTT
UU,VVVV,WWWWWW,XXX
YY,ZZZZ,AAAAAA,BBB


Using GAWK on Linux

gawk < fixed.txt 'BEGIN{ FIELDWIDTHS = "2 4 6 3"; OFS=",";}{print $1,$2,$3,$4}' > delim.csv


This solution defines field widths & field separator and then uses gawk to split each line. The only down side is that GAWK is available only on Linux or GNU based systems.


Using SED on *nix

sed -e 's/./&,/2' -e 's/./&,/7' -e 's/./&,/14' fixed.txt > delim.csv


This solution uses sed - a stream editor to substitute commas inline.
The -e is to tell sed to execute the next command line argument as a sed program and you can have multiple such commands.
The s/./ looks to substitute any character.
The &, denotes the matched string followed by a comma that we need.
The /2 tells sed to do this substitution for only the first 2 such occurrences.
So in effect, a comma is put in after the first 2 characters are matched. Note that since this is substituting inline, the insertion point for the next comma is at position 2 + 1 + 4 = 7!

Here is a link to a very good list of handy sed one-liners on linuxhowtos.org.


Using AWK on Unix

1. Create a file conversion_rules
{
 a[1] = 2;
 a[2] = 4;
 a[3] = 6;
 a[4] = 3;
 }
 {
        o = "";
        r = 1;
        for (i=1;i<length(a);i++)
        { o = o  substr($0,r,a[i]) ","; r = r + a[i]; }
        o = o  substr($0,r,a[length(a)])
        print o;
 }


2. Run the command:
awk -f conversion_rules fixed.txt > delim.csv


This solution uses plain awk to loop through each line and add comma after the defined column length. An awk program file is created to define the column lengths and the looping & comma insertion logic. This is fed to an awk instance along with the input file and the output is redirected to the required file.

No comments:

LinkWithin

Related Posts Plugin for WordPress, Blogger...