
Revision as of 19:04, 10 January 2026

PSPP (in JupyterLab) Tutorial - KENET HPC Cluster

Overview

PSPP is a free and open-source statistical analysis program designed as an alternative to SPSS. When integrated with JupyterLab, PSPP provides statistical analysis capabilities accessible from Python notebooks, combining Jupyter's interactive environment with PSPP's statistical procedures.

Use Cases: PSPP in JupyterLab is particularly well-suited for descriptive statistics and frequency analysis, hypothesis testing including t-tests, ANOVA, and chi-square tests, linear and logistic regression modeling, factor analysis and reliability testing, survey data analysis with crosstabulation and contingency tables, data transformation and recoding operations, and generating statistical reports that combine PSPP output with Python visualizations.

Access: PSPP is available through JupyterLab sessions on the KENET Open OnDemand web portal at https://ondemand.vlab.ac.ke

Code Examples: All code examples for this tutorial are available in our GitHub repository at https://github.com/Materials-Modelling-Group/training-examples


Prerequisites

Before using PSPP in JupyterLab, you should have an active KENET HPC cluster account with access to the Open OnDemand portal. Basic understanding of statistical concepts and hypothesis testing will be helpful. Familiarity with Python and JupyterLab is recommended, though not strictly required. Your data should be in CSV, SPSS (.sav), or tab-delimited format and stored in your home directory or scratch space.


Launching JupyterLab for PSPP

Step 1: Access Interactive Apps

Begin by logging into Open OnDemand at https://ondemand.vlab.ac.ke using your KENET credentials. Click the Interactive Apps menu in the top navigation bar, then select JupyterLab from the dropdown list. PSPP functionality is accessed through JupyterLab rather than as a separate application.

Navigate to Interactive Apps → JupyterLab

Step 2: Configure Job Parameters

For PSPP statistical analysis, moderate resources are typically sufficient unless working with very large datasets.

Parameter | Description | Recommended Value
Partition | Queue for job execution | normal for CPU-based analysis
Walltime | Maximum runtime in hours | 2-4 hours for typical analyses
CPU Cores | Number of processor cores | 2-4 cores for most statistical work
Memory | RAM allocation | 8-16 GB depending on dataset size
Working Directory | Starting directory | /home/username or your data folder

Job configuration form with recommended settings

Step 3: Submit and Connect

Click the Launch button to submit your job. Once the status changes to "Running" and the Connect to JupyterLab button appears, click it to open your session. You will access PSPP through Python notebooks or the terminal within JupyterLab.


Quick Start Guide

Understanding PSPP Integration

PSPP-style statistical analysis can be done in JupyterLab in three main ways. The command-line interface allows you to run PSPP syntax files from a terminal within JupyterLab. Python integration using subprocess enables you to call PSPP commands from Python code cells and capture the output. Users familiar with Python statistics libraries can also use pandas and scipy as alternatives that provide similar functionality with better Jupyter integration.

Creating Your First PSPP Analysis

Create a new Python notebook in JupyterLab by clicking the Python 3 icon in the Launcher. The pspp/examples/01_basic_analysis.py file in our GitHub repository demonstrates how to run PSPP commands from within a Jupyter notebook using Python's subprocess module. This approach allows you to write PSPP syntax, execute it, and capture the results.
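The subprocess approach described above can be sketched as follows. This is a minimal illustration, not the contents of the repository file: the function name run_pspp and the sample data are hypothetical, and it assumes the pspp executable writes its text output to standard output when no output file is given.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_pspp(syntax, executable="pspp"):
    """Run PSPP syntax and return its text output, or None if PSPP is not on PATH."""
    if shutil.which(executable) is None:
        return None
    with tempfile.TemporaryDirectory() as tmpdir:
        # Write the syntax to a temporary .sps file that PSPP can read.
        syntax_file = Path(tmpdir) / "analysis.sps"
        syntax_file.write_text(syntax)
        result = subprocess.run(
            [executable, str(syntax_file)],
            capture_output=True,
            text=True,
        )
    # Return both streams so that error messages are not lost.
    return result.stdout + result.stderr


# Hypothetical inline data set with a single numeric variable.
SYNTAX = """\
DATA LIST LIST /score.
BEGIN DATA.
4
7
9
END DATA.
DESCRIPTIVES /VARIABLES=score.
"""

output = run_pspp(SYNTAX)
print(output if output is not None else "pspp not found on PATH")
```

Returning None when the executable is missing keeps the notebook cell from raising, which makes it easy to fall back to the Python libraries described later in this tutorial.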

PSPP Syntax Structure

PSPP uses a command-based syntax similar to SPSS. Each command performs a specific operation and must end with a period. Commands are case-insensitive, making DESCRIPTIVES and descriptives equivalent. The pspp/examples/02_pspp_syntax_basics.sps file shows the fundamental syntax structure including data loading, variable specification, and procedure execution.

Alternative: Python Statistical Libraries

For users more comfortable with Python, the pandas and scipy libraries provide similar statistical capabilities with better Jupyter integration. The pspp/examples/03_python_alternative.py file demonstrates how to perform common PSPP operations using pure Python, which often provides more flexible integration with Jupyter notebooks and easier result manipulation.
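As a rough illustration of the pure-Python route, the sketch below produces summary statistics and a frequency table with pandas. The data frame is hypothetical and stands in for your own file; it is not taken from the repository example.

```python
import pandas as pd

# Hypothetical survey data; replace with pd.read_csv("your_file.csv").
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "score": [4.0, 6.0, 5.0, 7.0, 9.0],
    }
)

# Roughly what DESCRIPTIVES reports for a numeric variable.
summary = df["score"].agg(["mean", "std", "min", "max"])

# Roughly what FREQUENCIES reports for a categorical variable.
counts = df["group"].value_counts()

print(summary)
print(counts)
```

Because the results are ordinary pandas objects, they can be filtered, plotted, or merged in later cells without parsing any text output.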


Common Tasks

Task 1: Loading and Describing Data

PSPP can read data from CSV files, SPSS .sav files, and tab-delimited text files. The pspp/examples/04_data_loading.sps file demonstrates various data loading methods. After loading data, use DESCRIPTIVES to obtain summary statistics such as mean, standard deviation, minimum, maximum, and range for numeric variables. The FREQUENCIES command provides frequency distributions for categorical variables and can also report the median and other percentiles, which DESCRIPTIVES does not.

The Python subprocess approach allows you to write PSPP syntax in a string, save it to a temporary file, execute it with PSPP, and capture the output for display in your notebook. This workflow integrates well with Jupyter's interactive environment.
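The syntax string for this task can be assembled in Python before it is handed to PSPP. The helper below is a sketch under stated assumptions: the function name, file path, and variable names are hypothetical, and the file is assumed to be comma-delimited with a single header row.

```python
def build_describe_syntax(csv_path, variables, numeric, categorical):
    """Return PSPP syntax that loads a comma-delimited file and summarises it.

    variables: (name, format) pairs in column order, for example ("age", "F3.0").
    numeric / categorical: names to pass to DESCRIPTIVES / FREQUENCIES.
    """
    var_specs = " ".join(f"{name} {fmt}" for name, fmt in variables)
    lines = [
        "GET DATA /TYPE=TXT",
        f"  /FILE='{csv_path}'",
        "  /DELIMITERS=','",
        "  /FIRSTCASE=2",  # skip the header row
        f"  /VARIABLES={var_specs}.",
        f"DESCRIPTIVES /VARIABLES={' '.join(numeric)}",
        "  /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM RANGE.",
        f"FREQUENCIES /VARIABLES={' '.join(categorical)}.",
    ]
    return "\n".join(lines) + "\n"


syntax = build_describe_syntax(
    "/home/username/survey.csv",  # hypothetical path
    [("age", "F3.0"), ("score", "F8.2"), ("grp", "A10")],
    numeric=["age", "score"],
    categorical=["grp"],
)
print(syntax)
```

Each command in the generated text ends with a period, and continuation lines are indented so that PSPP treats them as part of the preceding command.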

Task 2: T-Tests and Group Comparisons

PSPP provides comprehensive support for comparing group means through various t-test procedures. The pspp/examples/05_ttests.sps file demonstrates independent samples t-tests for comparing two groups, paired samples t-tests for before-after comparisons, and one-sample t-tests for comparing a sample mean against a known value.

Results include t-statistics, degrees of freedom, p-values, and confidence intervals. You can parse these values from PSPP's text output, or use Python's scipy.stats module for similar analyses with easier result handling.
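For the scipy.stats route, the three t-test variants might look like the sketch below. The measurements are hypothetical, and the independent-samples call uses scipy's default of assuming equal variances, which may not match the options chosen in a PSPP run.

```python
from scipy import stats

# Hypothetical measurements; replace with your own data.
control = [5.1, 4.9, 5.6, 5.8, 5.0]
treated = [5.9, 6.1, 6.4, 5.7, 6.3]

# Independent-samples t-test (equal variances assumed, the scipy default).
independent = stats.ttest_ind(control, treated)

# Paired-samples t-test: the lists must be matched observation by observation.
paired = stats.ttest_rel(control, treated)

# One-sample t-test against a reference value of 5.0.
one_sample = stats.ttest_1samp(control, popmean=5.0)

for name, res in [
    ("independent", independent),
    ("paired", paired),
    ("one-sample", one_sample),
]:
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

Each result object exposes statistic and pvalue attributes, so the numbers can be stored or tabulated directly.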

Task 3: ANOVA and Post-Hoc Tests

When comparing means across three or more groups, use PSPP's ONEWAY command for one-way analysis of variance. The pspp/examples/06_anova.sps file shows how to perform ANOVA with descriptive statistics, homogeneity of variance tests, and post-hoc comparisons using Tukey's HSD or Bonferroni corrections.

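PSPP's ANOVA table reports sums of squares, degrees of freedom, the F-statistic, and its p-value. If an effect size such as eta-squared does not appear in the output, it can be computed from the sums of squares. Post-hoc tests identify which specific groups differ significantly from each other.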
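To make the quantities in a one-way ANOVA concrete, the sketch below computes the F-statistic and eta-squared (between-groups sum of squares divided by total sum of squares) by hand. The function name and the data are hypothetical, and the code is meant for checking or illustrating results, not as a replacement for the ONEWAY procedure.

```python
def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of numeric observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


f_stat, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.2f}")
# prints: F = 3.00, eta squared = 0.50
```

In this small example both sums of squares equal 6, so half of the total variation lies between the groups.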

Task 4: Crosstabulation and Chi-Square Tests

For analyzing relationships between categorical variables, PSPP's CROSSTABS command creates contingency tables and performs chi-square tests. The pspp/examples/07_crosstabs.sps file demonstrates creating two-way and multi-way tables, calculating row and column percentages, and computing chi-square statistics with effect size measures like Phi and Cramer's V.

This is particularly useful for survey data analysis where you need to understand how responses vary across demographic groups or conditions.
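The chi-square statistic and Cramer's V for a contingency table can also be computed directly, which helps when checking values read from CROSSTABS output. The sketch below uses the Pearson chi-square without continuity correction; the function name and the counts are hypothetical.

```python
import math


def chi_square_cramers_v(table):
    """Return (chi_square, cramers_v) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, cramers_v


chi2, v = chi_square_cramers_v([[10, 20], [20, 10]])
print(f"chi-square = {chi2:.3f}, Cramer's V = {v:.3f}")
# prints: chi-square = 6.667, Cramer's V = 0.333
```

For a 2x2 table Cramer's V equals the absolute value of Phi, so the same function covers both measures in that case.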

Task 5: Regression Analysis

PSPP supports both linear and logistic regression through the REGRESSION and LOGISTIC REGRESSION commands. The pspp/examples/08_regression.sps file shows how to fit regression models, interpret coefficients and standard errors, examine R-squared and adjusted R-squared values, and generate predicted values.

For linear regression, PSPP provides detailed output including parameter estimates, significance tests, and model fit statistics. Diagnostic plots can be created using Python's matplotlib after extracting the data.


Tips & Best Practices

Workflow Optimization

When working with PSPP in JupyterLab, keep your PSPP syntax in separate .sps files and load them programmatically. This allows for better version control and reusability. Use Python to pre-process data with pandas before passing it to PSPP when complex data manipulation is needed. Capture PSPP output and parse it in Python for further analysis or visualization. Consider using pure Python alternatives like scipy.stats and statsmodels when interactive iteration and result manipulation are important.

Store PSPP syntax files alongside your notebooks for reproducibility. Comment your syntax files liberally using asterisks or the COMMENT command to explain your analytical decisions.

Data Preparation

PSPP is particular about data formats and missing values. Clean your data in Python first using pandas before exporting to CSV for PSPP analysis. Code missing values consistently and declare them explicitly in PSPP syntax. Ensure variable names follow PSPP conventions with no spaces or special characters except underscores. Convert date variables to appropriate formats before analysis.

The pspp/examples/09_data_preparation.py file demonstrates preparing data in pandas and exporting it in PSPP-compatible formats.
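One part of this preparation, making column labels safe to use as variable names, can be sketched with the standard library alone. The function name and the rules it applies are assumptions chosen to be conservative (letters, digits, and underscores only, starting with a letter); they are not taken from the repository file.

```python
import re


def pspp_safe_name(name):
    """Turn an arbitrary column label into a conservative PSPP variable name."""
    # Replace every run of characters other than letters, digits, or underscore.
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip()).strip("_")
    # Prefix names that are empty or do not start with a letter.
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "v_" + cleaned
    return cleaned


columns = ["Age (years)", "Q1: satisfaction", "2024 income"]
print([pspp_safe_name(c) for c in columns])
# prints: ['Age_years', 'Q1_satisfaction', 'v_2024_income']
```

With pandas, the same function can be applied through df.rename(columns=pspp_safe_name). The sketch does not detect two labels that collapse to the same name, so check for duplicates after renaming.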

Choosing Between PSPP and Python

Use PSPP when you have existing SPSS syntax that needs to run without modification, when generating standardized statistical reports in familiar formats, when working with colleagues who use SPSS and need compatible output formats, or when you need exact SPSS-compatible procedures. Use Python statistical libraries when you need custom visualizations integrated with analyses, when working with complex data structures or large datasets, when you want interactive exploration in notebooks, or when integrating statistics with machine learning workflows.

Many analyses can be performed with either approach. The pspp/workflows/01_comparison_pspp_python.ipynb notebook demonstrates the same analysis performed both ways.

Output Management

PSPP generates text output that can be verbose. Use Python to parse output and extract relevant statistics for display. Save PSPP output to a file for later reference, for example with the -o option of the pspp command (pspp analysis.sps -o results.txt); SPSS's OUTPUT EXPORT command may not be available in PSPP, so check the PSPP manual before relying on it. Create visualizations in Python using matplotlib or seaborn based on PSPP results. Consider generating automated reports that combine PSPP statistical output with Python-generated tables and figures.
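One way to keep a copy of the output is to let the pspp command write it to a file through its -o option. The sketch below only builds and runs that command; the function name and file names are hypothetical, and the output format is assumed to follow from the file extension.

```python
import shutil
import subprocess


def export_pspp_output(syntax_path, output_path, executable="pspp"):
    """Run a syntax file and write PSPP's output to output_path.

    Returns (command, return_code); return_code is None if PSPP is not on PATH.
    """
    command = [executable, syntax_path, "-o", output_path]
    if shutil.which(executable) is None:
        return command, None
    completed = subprocess.run(command, capture_output=True, text=True)
    return command, completed.returncode


# Hypothetical file names stored alongside the notebook.
command, code = export_pspp_output("analysis.sps", "results.txt")
print(" ".join(command), "->", code)
```

Keeping the saved output next to the notebook and the .sps file makes it possible to re-read only the relevant sections later without re-running the analysis.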


Example Workflows

Example 1: Survey Data Analysis

Objective: Analyze survey responses with descriptive statistics, crosstabulations, and chi-square tests to identify significant patterns.

Launch JupyterLab with 4 cores and 8 GB memory. Create a new Python notebook and follow the workflow in pspp/workflows/02_survey_analysis.ipynb from our GitHub repository.

The workflow demonstrates loading survey data in Python, recoding variables and handling missing values, exporting to CSV for PSPP analysis, writing PSPP syntax for frequency tables and crosstabs, executing PSPP and capturing output, and creating visualizations in Python based on the statistical results.

This integrated approach leverages both PSPP's statistical capabilities and Python's data manipulation and visualization strengths.

Example 2: Experimental Data Analysis

Objective: Compare treatment groups using ANOVA and post-hoc tests, checking assumptions and reporting effect sizes.

The pspp/workflows/03_experimental_analysis.ipynb workflow shows how to check normality assumptions using both PSPP and Python, perform one-way ANOVA with PSPP, conduct post-hoc tests to identify specific group differences, calculate effect sizes, create publication-quality plots of group means with error bars, and generate a combined report with statistical output and visualizations.

If assumptions are violated, the workflow demonstrates using non-parametric alternatives like the Kruskal-Wallis test.

Example 3: Regression Modeling

Objective: Build and evaluate regression models to predict outcomes from multiple predictors.

Follow the steps in pspp/workflows/04_regression_modeling.ipynb which demonstrates exploratory data analysis with correlation matrices, fitting multiple linear regression models in PSPP, checking regression assumptions with diagnostic plots, comparing nested models, interpreting standardized and unstandardized coefficients, and making predictions on new data.

The workflow shows how to extract coefficients from PSPP output and use them in Python for prediction and visualization.
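Once the unstandardized coefficients have been read from the regression output, applying them to new data is a weighted sum. The sketch below assumes the coefficients were typed in or parsed into a dictionary; the variable names, the "(Constant)" key, and the numeric values are hypothetical.

```python
def predict(coefficients, observation):
    """Apply unstandardized linear-regression coefficients to one observation."""
    intercept = coefficients["(Constant)"]
    return intercept + sum(
        b * observation[name]
        for name, b in coefficients.items()
        if name != "(Constant)"
    )


# Hypothetical coefficients copied from a regression coefficients table.
coefficients = {"(Constant)": 10.0, "hours": 0.5, "age": -0.25}

prediction = predict(coefficients, {"hours": 10, "age": 30})
print(prediction)
# prints: 7.5
```

Here the prediction is 10.0 + 0.5 * 10 - 0.25 * 30 = 7.5. Standardized coefficients cannot be used this way unless the predictors are standardized first.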

Example 4: Hybrid Python-PSPP Workflow

Objective: Combine the strengths of both tools for a complete analysis pipeline.

The pspp/workflows/05_hybrid_workflow.ipynb example demonstrates using pandas for initial data exploration and cleaning, using PSPP for standard statistical tests and reports, parsing PSPP output programmatically in Python, creating custom visualizations with seaborn and matplotlib, and generating an integrated report with nbconvert that includes both statistical output and visual analysis.

This approach is particularly effective for complex projects that benefit from both PSPP's standardized procedures and Python's flexibility.


Troubleshooting

Problem: PSPP command not found

If running PSPP from a notebook returns a "command not found" error, the PSPP module may not be loaded in your environment. Open a terminal in JupyterLab and check if PSPP is available by typing pspp --version. If not available, contact KENET support to request PSPP installation or module configuration. As an alternative, use Python's statistical libraries like scipy.stats and statsmodels which are pre-installed in most JupyterLab environments.
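The same availability check can be run from a notebook cell. This is a small sketch; the function name is hypothetical.

```python
import shutil
import subprocess


def pspp_version(executable="pspp"):
    """Return the first line printed by 'pspp --version', or None if unavailable."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(pspp_version() or "pspp not found on PATH")
```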

Problem: Syntax errors in PSPP commands

PSPP syntax errors typically result from missing periods at the end of commands, incorrect variable names, or unsupported syntax. Check that every command ends with a period. Verify variable names match exactly what is in your data file. Ensure you are not using SPSS syntax that differs from PSPP. The PSPP manual at https://www.gnu.org/software/pspp/manual/html_node/ provides complete syntax documentation.
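A quick check for missing periods can be automated before the syntax is sent to PSPP. The sketch below relies on a layout assumption rather than a real parser: every command starts on a non-indented line and its continuation lines are indented. Inline data between BEGIN DATA and END DATA would be flagged incorrectly, so use it only on syntax without inline data.

```python
def commands_missing_period(syntax):
    """Return the first line of each command that does not end with a period."""
    commands = []
    for line in syntax.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace() and commands:
            commands[-1].append(line)  # indented: continuation of the command
        else:
            commands.append([line])  # non-indented: start of a new command
    return [cmd[0] for cmd in commands if not cmd[-1].rstrip().endswith(".")]


syntax = """\
DESCRIPTIVES /VARIABLES=age
  /STATISTICS=MEAN STDDEV.
FREQUENCIES /VARIABLES=grp
CROSSTABS /TABLES=grp BY gender.
"""
print(commands_missing_period(syntax))
# prints: ['FREQUENCIES /VARIABLES=grp']
```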

Problem: Cannot read data file

File reading errors usually indicate incorrect path specifications or incompatible data formats. Use absolute paths to your data files starting with /home/username/ or /scratch/username/. Ensure CSV files use standard delimiters and have consistent formatting. For SPSS .sav files, verify they are not from a very new SPSS version that PSPP might not support. Test with a small sample file first.

Problem: Missing values not handled correctly

PSPP requires explicit declaration of missing values. Use the MISSING VALUES command to specify which values should be treated as missing. In CSV files, use consistent notation for missing data like NA or empty cells. When preparing data in Python, use pandas' to_csv with na_rep parameter to control missing value representation.
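The two halves of this fix, writing a consistent missing-value code from pandas and declaring it in PSPP, can be sketched together. The sentinel -99 and the column names are hypothetical; choose a code that cannot occur as a real value in your data.

```python
import io

import pandas as pd

MISSING_CODE = -99  # hypothetical sentinel value

df = pd.DataFrame({"age": [34, None, 51], "income": [2500.0, 3100.0, None]})

# Write missing cells as the sentinel instead of leaving them empty.
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep=str(MISSING_CODE))
csv_text = buffer.getvalue()

# Matching PSPP declaration, to be placed after the data-loading command.
missing_syntax = f"MISSING VALUES {' '.join(df.columns)} ({MISSING_CODE})."

print(csv_text)
print(missing_syntax)
```

In practice, pass a file path to to_csv instead of the in-memory buffer, which is used here only to keep the example self-contained.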

Problem: Output not displaying in notebook

If PSPP output is not appearing in your notebook cells, check that you are capturing stdout from the subprocess call. Use capture_output=True, text=True parameters in subprocess.run. Print the result.stdout to display output. If output is very long, consider writing it to a file and reading only relevant sections. For better formatting, consider parsing the output and displaying it as formatted text or pandas DataFrames.


Additional Resources

The official PSPP documentation is available at https://www.gnu.org/software/pspp/ and provides comprehensive information about all commands and procedures. The PSPP manual at https://www.gnu.org/software/pspp/manual/html_node/ includes detailed syntax reference and examples. For statistical analysis with Python, consult the scipy documentation at https://docs.scipy.org/doc/scipy/ and the statsmodels documentation at https://www.statsmodels.org/.

KENET HPC usage guidelines are documented at https://training.kenet.or.ke/index.php/HPC_Usage. For learning statistics, the free textbook "OpenIntro Statistics" at https://www.openintro.org/book/os/ provides excellent coverage of fundamental concepts.

Code Examples Repository: All code examples referenced in this tutorial are available at https://github.com/Materials-Modelling-Group/training-examples

For support, contact KENET at support@kenet.or.ke, consult the documentation at https://training.kenet.or.ke, or access the Open OnDemand portal at https://ondemand.vlab.ac.ke.


Version Information

This tutorial documents PSPP version 1.6.2 or later running within JupyterLab on the KENET HPC cluster. Python 3.9 or later is recommended for optimal subprocess integration. This tutorial was last updated on 2026-01-09 and is currently at version 1.0.


Feedback

If you encounter issues or have suggestions for improving this tutorial, please contact KENET support or submit feedback through the Open OnDemand interface using the feedback button.


Back to: Easy HPC access with KENET Open OnDemand