sorting and splitting in a DFSORT together?

https://stackoverflow.com/questions/7039731

23-12-2020
|

Question

Input file Layout: 01 to 10 - 10 Digit Acct# 53 to 01 - An indicator with values 'Y' or 'N' 71 to 10 - Time stamp (Rest of the fields are insignificant for this sort)

While sorting the input file by splitting and eliminating duplicates in two ways result in different results. I wanna know why?

Casei: Splitting and Eliminating duplicates in the same step.

SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,                                             
INCLUDE=(53,01,CH,C'Y',AND,71,10,CH,GT,&DATE2(-)),                            
OUTFIL FILES=02,                                             
INCLUDE=(53,01,CH,C'N',AND,71,10,CH,GT,&DATE2(-)),

Caseii: Splitting and eliminating duplicates in two different steps:

STEP:01
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE

STEP:02
SORT FIELDS=COPY
OUTFIL FILES=01,                                             
INCLUDE=(53,01,CH,C'Y',AND,71,10,CH,GT,&DATE2(-)),                            
OUTFIL FILES=02,                                             
INCLUDE=(53,01,CH,C'N',AND,71,10,CH,GT,&DATE2(-)),

These two steps are resulting different output. Do u see any difference between both cases? Please clarify.

Solution

You are asking to sort on an Account Number (10 characters ascending) then on an Indicator (1 character ascending). These two fields alone determine the key of the record - Timestamp is not part of the sort key. Consequently if there are two or more records with the same key they could be placed in any (random) order by the sort. No telling what order the Timestamp values will appear.

Keeping the above in mind, consider what happens when you have two records with the same key but different Timestamp values. One of these Timestamp values meets the given INCLUDE criteria and the other one doesn't. The SUM FIELDS=NONE parameter is asking to remove duplicates based on the key. It does this by grouping all of the records with the same key together and then selecting the last one in the group. Since key does not include the Timestamp the choosen record is essentially a random event. Consequently it is unpredictable as to whether you get the record that meets the subsequent INCLUDE condition.

There are a couple of ways to fix this:

Add Timestamp to the sort key. This might not work because it may leave multiple records for the same Account Number / Inidcator, that is it may break your duplicate removal requirement
Request a stable sort.

A stable sort causes records having the same sort key to maintain their same relative positions after the sort. This will preserve the original order of the Timestamp values in your file given the same key. When the removal of duplicates occurs DFSORT will choose the last record from the set of duplicates. This should bring the predicability to the duplicate elimination process you are looking for. Specify a stable sort by adding an OPTIONS EQUALS control card before the SORT card.

EDIT Comment: ...picks the VERY FIRST record

The book I based my original answer on clearly stated the last record in a group of records with the same key would be selected when SUM=NONE is specified. However, it is always best to consult the vendors own manuals. IBM's DFSORT Application Programming Guide only states that one record with each key will be selected. However, it also has the following note:

The FIRST operand of ICETOOL's SELECT operator can be used to perform the same function as SUM FIELDS=NONE with OPTION EQUALS. Additionally, SELECT's FIRSTDUP, ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, and LAST operands can be used to select records based on other criteria related to duplicate and non-duplicate keys. SELECT's DISCARD(savedd) operand can be used to save the records discarded by FIRST, FIRSTDUP, ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, or LAST. See SELECT Operator for complete details on the SELECT operator.

Based on this information I would suggest using ICETOOL's SELECT operator to select the correct record.

Sorry for the misinformation.

OTHER TIPS

The problem is as NealB identified.

The easiest thing to do is to "get rid of" the records you don't want by date before the SORT. The SORT will take less time. This assumes that SORTOUT is not required. If it is, you have to keep your INCLUDE= on the OUTFILs.

SELECT is a good option. SELECT uses OPTION EQUALS by default. The below Control Cards can be included in an xxxxCNTL dataset, and action from the SELECT with USING(xxxx). SELECT gives you greater flexibility than SUM (you can get the last, amongst other things).

The whole task sounds flawed. If there are records per account with different dates, I'd expect either the first date or the last date, or something else specific, to be required, not just whatever record happens to be hanging around at the end of the SUM.

 OPTION EQUALS

 INCLUDE COND=(71,10,CH,GT,&DATE2(-))

 SORT FIELDS=(01,10,CH,A,53,01,CH,A)

 SUM FIELDS=NONE

 OUTFIL FILES=01,                                             
      INCLUDE=(53,01,CH,EQ,C'Y')

 OUTFIL FILES=02,                                             
      INCLUDE=(53,01,CH,EQ,C'N')

Or, if the Y/N cover all records:

 OUTFIL FILES=02,SAVE

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow