Thursday, February 08, 2007

Creating Cumulative Totals

Here's how to update a SAS data set to include cumulative totals for a numeric variable.

In this example, he goal is to create a new variable ("sumT") that is a cumulative total of the variable tsippT:

data one;
input pulldate MMDDYY10.
tsippT 12-14
tsippC 15-18
count 20;
format pulldate date7.;
cards;
08/04/2006 34 24 1
09/25/2006 343 200 2
10/20/2006 678 398 3
11/03/2006 713 406 4
02/07/2007 857 451 5
;
run;

proc sort data=one;
by pulldate;
run;

data two;
set one;
by pulldate;
if first.count then sumT=0;
sumT+tsippT;
if last.pulldate;
run;

proc print data=two;
run;

Monday, October 30, 2006

The Monotonic Function

There is an undocumented function that enables you to generate observation numbers with a PROC SQL statement and retain them in an output data set.

You can use the NUMBER option in a PROC SQL statement to print observation numbers, but the observation numbers are not retained (just like PROC PRINT w/o the nobs option).

To keep the observation numbers in the output table, use the MONOTONIC function in the SELECT statement.

Example:

Proc sql;
create table step1 as
select monotonic() as ID,
district, district_name
from perm.districts;
quit;

To do the same thing in a data step:
ID = _n_;

Monday, October 23, 2006

Exporting Multiple Data Sets to One Spreadsheet

It's easy to export several data sets to one Excel spreadsheet with each data set in a separate worksheet.This example exports three SAS data sets (work.region, work.county and work.district) to the same spreadsheet (results.xls).

Each data set will be in a separate worksheet:

proc export
data=work.region
outfile="c:\temp\results.xls"
dbms=Excel;
run;
proc export
data=work.county
outfile="c:\temp\results.xls"
dbms=Excel;
run;
proc export
data=work.district
outfile="c:\temp\results.xls"
dbms=Excel;
run;

Results.xls will have three worksheets, each one automatically labeled with the name of the data set it was created from.

Friday, October 20, 2006

Where or If?

When is it most efficient to use a WHERE statement vs. a subsetting IF statement? The answer is (of course) it depends.

When subsetting from a large dataset, WHERE processing is usually more efficient because WHERE statements are evaluated before an observation is read into the program data vector. Subsetting IF statements test the condition after an observation is read into the program data vector.

However, you cannot use WHERE processing on anything other than a variable value that already exists.

Subsetting IF statements must be used in expressions that use:
• Values from raw data
• Values calculated or assigned during the course of the data step
• Automatic variables created in a data step (ex: first.variable , last.variable, _N_)

WHERE expression examples:
proc print data=work.enrollment;
where district='227913';
run;
proc print data=work.enrollment (where=(birthdate = '01SEP2000'D));
run;

proc sql;
select district, sum(students) as Students
from work.enrollment
where district='227901';
quit;

Choosing the Most Efficient Procedure

There’s often several ways to accomplish the same task with SAS. The question is, are there any guidelines to determine the most efficient method?

It’s impossible to say with certainty that one method or procedure is always more efficient than another because it usually depends on several factors unique to each installation.

But there are a few general guidelines to follow: Procedures designed to perform one specific task are usually more efficient than multi-purpose procedures.

For example, you can use PROC PRINT or PROC SQL to print out a dataset:
proc print data=work.enroll;
run;

proc sql;
select *
from work.enroll;

Since PROC PRINT is specifically designed for printing, it is usually more efficient than using PROC SQL for that purpose.

The Important Exception: PROC DATASETS PROC DATASETS is a utility procedure for managing SAS files that can be used to:
• Copy SAS files from one SAS library to another
• Rename, repair, append or delete SAS files
• List the SAS files in a SAS library
• List the attributes of a SAS dataset
• Manipulate passwords
• Modify dataset attributes
• Modify variables in SAS datasets
• Create and delete indexes
• Create and manage audit files
• Create and delete integrity constraints

Even though PROC DATASETS is a multi-purpose procedure, it tends to be more efficient than using a data step to change dataset attributes, because PROC DATASETS does not process the data portion of the dataset - just the data attributes. Since the data isn’t read in, the amount of time saved when changing dataset attributes on large datasets can be significant.

Creating an Index with PROC DATASETS

An index is an optional file you can create for a SAS data set to specify the location of observations based on values of one or more key variables. Indexes can provide direct access to observations and more efficient WHERE processing. Without an index, observations are accessed in the order they are stored in a data set.

The following example uses a permanent dataset (perm.graduates) that has 237,716 observations:

Data step1;
Set perm.graduates;

If perm.graduates is not indexed, SAS reads every observation to find those that satisfy the WHERE statement. This data step took 0.06 seconds of real time and 0.07 seconds of cpu time to execute.

But after perm.graduates is indexed, SAS uses the index to find the observations that meet the WHERE conditions without having to read all the other observations. After creating the index the same data step only took 0.01 seconds of real time and 0.01 seconds of CPU time to execute.

Indexing can significantly increase the efficiency of SAS programs that access large datasets. SAS IntrNet applications are a great example. While shaving off fractions of a second of processing time doesn't seem that important for jobs that run occasionally, it can be critical for those that are executed hundreds (or thousands) of times per day.

There are three methods you can use to index a dataset:
1. Data step processing
2. PROC SQL
3. PROC DATASETS

While the first two methods can be used to index a dataset at the time it is created, PROC DATASETS is the only method that can be used to index an existing dataset without rebuilding it.

Sample code for indexing an existing dataset:

libname perm '/home/myfiles';
proc datasets library=perm;
modify graduates;
index create district;
run;
quit;

The LIBRARY= option default is WORK or USER.

The DATASETS procedure remains active until either another procedure or DATA step is submitted or until a QUIT statement is issued.

Remember:
· Use indexes for queries that retrieve less than 15% of rows.
· Do not create an index based on columns that have a very small number of distinct values.
· Sequential access is faster for small datasets.

See the SAS documentation to learn more about indexing.

Wednesday, October 18, 2006

Appending Files Efficiently

There are a couple of methods you can use to concatenate data sets. In these examples, work.step2 is appended to work.step1:

With a SET statement in a data step:
data work.step1;
set work.step1
work.step2;
run;

With the Datasets procedure:
proc datasets;
append base=work.step1
data=work.step2;
run;

When you use a SET statement, SAS must process all the observations in work.step1 and work.step2.

The APPEND statement bypasses the processing of the data in the original data set (work.step1) and adds the observations in work.step2 directly to the end of the original.

As a result, using the APPEND statement can be more efficient that using a SET statement if the BASE= dataset is large, and all variables in both data sets have the same variables and they are the same length and type.

The APPEND statement is very useful if you frequently add observations to SAS data sets.
Click here to learn more about the APPEND statement.

Modifying Data Sets and Variables Efficiently with PROC DATASETS

Data sets and variables can be modified using data steps or PROC DATASETS. Here are a few examples using each method, with the time they took to execute.

For this example, a dataset (sasuser.grads05) with 239,862 observations was created.

Modifying a Data Set
1. Using a data step:
Data sasuser.grads05 (label=’2004-05 Graduates’);
Set sasuser.grads05;
Run;

Real time: 1.45 seconds
CPU time: 0.18 seconds

2. Using PROC DATASETS:
Proc datasets lib=sasuser nolist;
Modify grads05 (label=’2004-05 Graduates’);
Quit;
Run;

Real time: 0.01 seconds
CPU time: 0.01 seconds

Modifying Variables
1. Using a data step:
Data sasuser.grads05;
Set sasuser.grads05(rename=(sex=gender));
Format dtupdate mmddyy10.;
run;

Real time: 2.91 seconds
CPU time: 0.19 seconds

2. Using PROC DATASETS:
Proc datasets lib=sasuser nolist;
Modify grads05;
Rename sex=gender;
Format dtupdate mmddyy10.;
Quit;
Run;

Real time: 0.02 seconds
CPU time: 0.02 seconds

Click here to learn more about PROC DATASETS.

Using SAS DATA Step Views to Conserve Data Storage Space

A SAS data view is a type of SAS data set that retrieves data values from other files. A SAS data view only contains descriptor information such as the data types and lengths of the variables, plus information that is required for retrieving data values from other SAS data sets or from files that are stored in other software vendors' file formats.

SAS data views are of member type VIEW.

In most cases, you can use a SAS data view as though it were a SAS data file. Example 1 shows how to create a view:

Example 1:

data sasuser.at_risk / view=sasuser.at_risk;
set work.students;
where at_risk = '1';
run;

The VIEW= option tells SAS to compile (but not execute) the source program and to store the compiled code in the input DATA step view.

You can use a DESCRIBE statement in a DATA step to write a copy of the source code for a data set view to the SAS log:

Example 2:
data view=sasuser.at_risk;
describe;
run;

The log from this data step will show the source code used to create the view.

SAS Data View Benefits
You can save disk space by storing a view definition, which stores only the instructions for where to find the data and how it is formatted, not the actual data.

Views can ensure that the input data sets are always current because data is derived from views at execution time.

Since views can select data from many sources, once a view is created, it can provide prepackaged information without the need for additional programming.

Views can reduce the impact of data design changes on users. For example, you can change a query that is stored in a view without changing the characteristics of the view's result.
With SAS/CONNECT software, a view can join SAS data sets that reside on different host computers, presenting you with an integrated view of distributed data.
You can use views as input to other DATA steps or PROC steps and in combination with other data sources using PROC SQL.

When to Use SAS Data Views
Consider the following in order to determine whether a SAS data file or a SAS data view is better for your purposes:
Data files use additional disk space; data views use additional processing time.
Data file variables can be sorted and indexed prior to use; data views must process data in its existing form during execution.

To learn more about creating and using data views, click here.

Emailing SAS Output

Every need to email your SAS output? You can save the contents of the log or output window then use Outlook to email the file, or you can write SAS code that will email the file for you.

A DATA step that sends electronic mail has the following components:
· a FILENAME statement with the EMAIL device-type keyword
· options specified on the FILE statement indicating the e-mail recipients, subject, and any attached files
· PUT statements that contain the body of the message

This example uses PROC OPTIONS, which lists the current settings of SAS system options and displays the results in the SAS log. This code:
1. Executes PROC OPTIONS
2. Redirects the log to a file (options.log)
3. Sends an email to the specified addressees with options.log as an attachment.

proc printto log='options.log';
proc options;
run;
proc printo log=log;
filename outbox email;
data _null_;
file outbox
to=John.Doe@email.com
cc=Jane.Doe@email.com
bcc=David.Smith@email.com
Subject="Test Email"
Attach = "options.log";

put 'Emailing SAS output with a filename statement.';
run;

Click here to learn more about sending email with SAS.

A Quick Debugging Tip

If you are working on a lengthy program you can save some time debugging it by first submitting it with zero observations:

OPTION OBS=0;

Your job will run almost instantaneously and identify syntax errors.

PROC SQL and the DESCRIBE TABLE Statement

You can use the DESCRIBE TABLE statement in PROC SQL to display information about a SAS dataset. The information is displayed in the log. Whether the data set described was created with a DATA step or PROC SQL, the log will display the CREATE TABLE statement required to create it.

If applicable, SAS data set options are included with the table definition. If indexes are defined on columns in the table, then CREATE INDEX statements for those indexes are also written to the SAS log.

Unlike PROC CONTENTS, the CREATE TABLE statement also displays the name of the SAS program used to create the data set.

Here is an example:

libname abc '/home/sas/data';
proc sql;
describe table abc.customers;
quit;

The resulting log shows the CREATE TABLE statement that would create abc.customers with PROC SQL.

The %SKIP Macro

When working on a lengthy SAS program, there are a couple of ways to block out parts of your code so they don’t execute.

For example, you can use a forward slash and an asterisk:
/* SAS code to skip */

SAS has an automatic macro for blocking out lines of code, the SKIP macro. The advantage of this method is that it allows you to collapse the code while you’re working on other parts of the program:

- %macro skip;
SAS code to skip
:
:
:
%mend skip;

Macros are one example of collapsable code sections, which enable you to expand or collapse sections of code. An expanded section is indicated by a minus sign in the margin.

To collapse a section of code, click the minus sign. The results are:

+ %macro skip;

A collapsed section is indicated by a plus sign in the margin. To expand a section, click on the plus sign.

Using Data Sets without Libnames

You can directly reference SAS data sets without defining a libname first. Just enclose the physical file name in quotes:


proc contents data="C:\temp\customers.sas7bdat";
run;

Using PROC DOWNLOAD for Non-SAS Files

You can use proc download for files that are not SAS data sets by using infile and outfile statements. In this example, a file named myfile.pdf is downloaded to c:\temp:
filename dest 'c:\temp\myfile.pdf';


rsubmit;

filename source '/home/sas/myfile.pdf';

proc download infile=source
outfile=dest;
run;

endrsubmit;
signoff;
Click here to learn more about PROC DOWNLOAD.

Reading Text from a Zip File

There is a currently undocumented (and unsupported) filename engine that can be used to read text from compressed ZIP files directly. You use it by specifying the engine "SASZIPAM" on a filename statement that points to the zip file.

When referring to it you specify the file within it that you wish to read. The contents of the compressed file are written to your SAS log.

In this example, the zip file "STT.zip" contains several text files. I want to read "sweep.txt" and therefore specify "zipfile(sweep.txt)", where "zipfile" is the fileref.
filename zipfile saszipam 'C:\temp\STT.zip';

data _null_;
infile zipfile(sweep.txt);
input;
put _infile_;
run;

The DATECOPY Option

The DATECOPY option in PROC DOWNLOAD can be used when migrating SAS datasets to preserve the original create date and time on the SAS dataset.

DATECOPY also works with PROC UPLOAD, PROC COPY and the COPY statement in PROC DATASETS.

proc download data=perm.test
out=out.test datecopy;
run;

proc datasets lib=source;
copy out=dest datecopy;
quit;

Click here to learn more about the DATECOPY option.

Debugging Complex Macros

You can write code generated by macros to an external file. Since you can see the code that is generated, this technique can be useful for debugging complex macros.

The MPRINT system option writes to the SAS log each SAS statement generated by a macro. Using the MPRINT option is recommended when you suspect your bug lies in code that is generated in a manner you did not expect.

The MFILE option specifies whether MPRINT output is directed to an external file.

Using the technique described below, you can redirect the SAS statements generated by a macro to an external file instead of to the log.
First, include MFILE and MPRINT in your options statement:
options mfile mprint;
Next, include a fileref for MPRINT:
filename mprint '/home/mydir/mymacro.sas';

The SAS code generated by your macro(s) will be written to mymacro.sas. You can then submit mymacro.sas to resolve any problems.
Click here to learn more about debugging macros.

Using Bookmarks

You can bookmark a line of code in your Program Editor to make it easier to find later.

To bookmark a line, press Ctrl + F2 on the line you want to bookmark. A vertical rectangle will appear in the margin to indicate that the line is book marked.

To go directly to a book marked line, press F2. To go to a previous bookmark, press Shift + F2.
To remove a bookmark, go to that line and press Ctrl + F2.

Click here to learn more about book marking and other enhanced editor features.

Using The SUM Function to Prevent Missing Values

You can use the SUM function to prevent missing values if one argument has a missing value.

The syntax for using the SUM function to create a new variable is:
newvar=sum(var1, var2);
The SUM function ignores missing values, so if var1 and var2 are both missing values then the value of newvar will also be missing.

You can prevent that by coding in a constant zero:
newvar=sum(var1, var2, 0);

In the following example:
1. A data set is created by reading in three variables: var1, var2 and var3.
Note that the third observation has missing values for var1 and var2.

2. The SUM function is used to create newvar1, which is the sum of var1 and var2.

3. The SUM function is also used to create newvar2 by summing var1 and var2, but it also codes in a constant zero.

data step1;
input @1 var1 1-2
@4 var2 4-5
@7 var3 7-8
;
newvar1=sum(var1, var2);
newvar2=sum(var1, var2,0);
datalines;
10 20 15
10 10 25
35
20 20 45
;
run;

proc print data=step1;
var var1 var2 newvar1 newvar2;
run;
quit;


For the third observation, there is a missing value for newvar1. But by including a zero in the SUM statement for newvar2, the value of the third observation is zero instead of missing.

Using Wildcards in Variable Lists

You can use the colon as a variable modifier (e.g. abc:), which acts as a wildcard. So rather than entering variables abc1, abc3 and abcde you could just specify abc:. It is great for saving typing long variable lists.

The colon must be at the end, it can’t be embedded (e.g. ab:c).

Another Shortcut for Listing Variables

You can keep all of the numeric or character columns when creating a new data set without listing each one in your KEEP statement:
data test1 (keep = _numeric_); run;
data test2 (keep = _character_); run;

Example:
The work.source data set has three character and five numeric variables. This keep statement keeps one character variable (district) and all of the numeric variables:

data work.mix(keep= distname _numeric_);
set work.source;
run;

This method can also be applied to drop, sum, arrays, etc.

Enclosing Code or Text Within a Block Comment

In the PC-SAS enhanced editor it's easy to enclose a block of code or text within a block comment. Just highlight the selected code and press Ctrl + /

To remove a block comment, highlight the selected code and press Ctrl + Shift /

Getting Help for a SAS Procedure

In the PC-SAS enhanced editor it's easy to get help for a procedure. Just press the mouse button within the procedure name and press F1.

Merging Data Sets Faster

When merging two SAS data sets, many of us use one of these two methods to keep only the observations where both input data sets contribute:

data test1;
merge step1(in=a)
step2(in=b);
by district studentid;
if a=1 and b=1;
run;
or
data test1;
merge step1(in=a)
step2(in=b);
by district studentid;
if a and b;
run;

This method executes faster (and saves a few keystrokes):
data test1;
merge step1(in=a)
step2(in=b);
by district studentid;
if a*b;
run;

This shortened version of the IF statement works since if either the variable a or b is false, the expression resolves to zero (false).