2009-06-17

SAS - How merge (mal)functions

The normal MERGE statement in SAS simply fouls up on MANY to MANY relationships. You are far better off using SQL but even then there are pitfalls. Check the examples below.
Data setsSetA
SETA
Alpha   Beta Desc
1000    1    Ax1 Bx2
2000    2    Ax1 Bx0
4000    3    Ax2 Bx1
4000    4    Ax2 Bx1
5000    5    Ax3 Bx2
5000    6    Ax3 Bx2
5000    7    Ax3 Bx2
SetB
SETB
Alpha   Gamma
1000    11
1000    12
3000    13
4000    14
5000    15
5000    16

SAS Merge
Does not perform combinatorial on 5000.
This is rediculous
DATA SetAB01;
MERGE SetA SetB;
BY Alpha;
RUN;
Alpha Beta Desc    Gamma
1000  1    Ax1 Bx2 11
1000  1    Ax1 Bx2 12
2000  2    Ax1 Bx0 
3000               13
4000  3    Ax2 Bx1 14
4000  4    Ax2 Bx1 14
5000  5    Ax3 Bx2 15
5000  6    Ax3 Bx2 16
5000  7    Ax3 Bx2 16
SAS SQL Simple Join
Does not include unmatched.
Where is Alpha=2000,3000?
PROC SQL;
CREATE TABLE SetAB02 AS
SELECT *
FROM SetA, SetB
WHERE SetA.Alpha = SetB.Alpha;
QUIT;
Alpha Beta Desc    Gamma
1000  1    Ax1 Bx2 11
1000  1    Ax1 Bx2 12
4000  3    Ax2 Bx1 14
4000  4    Ax2 Bx1 14
5000  5    Ax3 Bx2 15
5000  6    Ax3 Bx2 15
5000  7    Ax3 Bx2 15
5000  5    Ax3 Bx2 16
5000  6    Ax3 Bx2 16
5000  7    Ax3 Bx2 16
SAS SQL Full Join
Drops the Alpha value for unmatched left.
Where is Alpha=3000?
PROC SQL;
CREATE TABLE SetAB03 AS
SELECT *
FROM SetA FULL JOIN SetB
ON SetA.Alpha = SetB.Alpha;
QUIT;
Alpha Beta Desc    Gamma
1000  1    Ax1 Bx2 11
1000  1    Ax1 Bx2 12
2000  2    Ax1 Bx0    
13
4000  3    Ax2 Bx1 14
4000  4    Ax2 Bx1 14
5000  5    Ax3 Bx2 15
5000  5    Ax3 Bx2 16
5000  6    Ax3 Bx2 15
5000  6    Ax3 Bx2 16
5000  7    Ax3 Bx2 15
5000  7    Ax3 Bx2 16
SAS SQL Full Join Coalesce
Works!
PROC SQL;
CREATE TABLE SetAB03A AS
SELECT
COALESCE(SetA.Alpha, SetB.Alpha)
AS Alpha, *
FROM SetA FULL JOIN SetB
ON SetA.Alpha = SetB.Alpha;
QUIT;
Alpha Beta Desc    Gamma
1000  1    Ax1 Bx2 11
1000  1    Ax1 Bx2 12
2000  2    Ax1 Bx0 
3000               13
4000  3    Ax2 Bx1 14
4000  4    Ax2 Bx1 14
5000  5    Ax3 Bx2 15
5000  5    Ax3 Bx2 16
5000  6    Ax3 Bx2 15
5000  6    Ax3 Bx2 16
5000  7    Ax3 Bx2 15
5000  7    Ax3 Bx2 16
SAS SQL Left Join
Probably what you want.
No left unmatched
Where is 3000?
PROC SQL;
CREATE TABLE SetAB04 AS
SELECT *
FROM SetA LEFT JOIN SetB
ON SetA.Alpha = SetB.Alpha;
QUIT;
Alpha Beta Desc    Gamma
1000  1    Ax1 Bx2 11
1000  1    Ax1 Bx2 12
2000  2    Ax1 Bx0 
4000  3    Ax2 Bx1 14
4000  4    Ax2 Bx1 14
5000  5    Ax3 Bx2 15
5000  6    Ax3 Bx2 15
5000  7    Ax3 Bx2 15
5000  5    Ax3 Bx2 16
5000  6    Ax3 Bx2 16
5000  7    Ax3 Bx2 16

A nice Venn Diagram display of SQL linking

2 comments:

Firefly said...

Comment 1:
The 'where' clause in "SAS SQL Simple Join" may need to be looked at.

Comment 2:
I think overall your trying to get a Full Outer Join (http://en.wikipedia.org/wiki/Join_%28SQL%29#Full_outer_join)

For what you want does ... using a union of two left joins work?

select SETA.*, SETB.* from SETA LEFT JOIN SETB ON SETA.Alpha = SETB.Alpha UNION select SETA.*, SETB.* from SETB LEFT JOIN SETA ON SETA.Alpha = SETB.Alpha;


+-------+------+---------+-------+-------+
| Alpha | Beta | Desc | Alpha | Gamma |
+-------+------+---------+-------+-------+
| 1000 | 1 | Ax1 Bx2 | 1000 | 11 |
| 1000 | 1 | Ax1 Bx2 | 1000 | 12 |
| 2000 | 2 | Ax1 Bx0 | NULL | NULL |
| 4000 | 3 | Ax2 Bx1 | 4000 | 14 |
| 4000 | 4 | Ax2 Bx1 | 4000 | 14 |
| 5000 | 5 | Ax3 Bx2 | 5000 | 15 |
| 5000 | 5 | Ax3 Bx2 | 5000 | 16 |
| 5000 | 6 | Ax3 Bx2 | 5000 | 15 |
| 5000 | 6 | Ax3 Bx2 | 5000 | 16 |
| 5000 | 7 | Ax3 Bx2 | 5000 | 15 |
| 5000 | 7 | Ax3 Bx2 | 5000 | 16 |
| NULL | NULL | NULL | 3000 | 13 |
+-------+------+---------+-------+-------+

Neoporcupine said...

Thanks Firefly,

1) Yes, fixed. Obviously I was working with test data names.

2) Yes full outer join. However your UNION code fails the same as "SAS SQL Full Join" as SAS doesn't like columns with the same name - which means SetB.Alpha goes missing, hence the use of coalesce. Obviously for the sad versions of SQL that don't implement full outer joins then the UNION method is an almost reasonable work around.