Logistic Regression – Predicting Prescriptions Using Logistic Regression

Introduction

This guide introduces logistic regression through a real-world healthcare analytics example: predicting which prescribers are most likely to issue prescriptions for Ozempic (generic name Semaglutide). You’ll follow a full modeling workflow — from cleaning and preparing the data to interpreting the final model’s coefficients, evaluating predictive accuracy, and making new predictions.

By walking through this case step by step, you’ll learn how logistic regression can reveal the relationships between features like prescriber specialty, region, and cost patterns, and how these factors influence the likelihood of prescribing Semaglutide.

Project Objectives

This project focuses on using logistic regression to uncover the patterns behind Semaglutide prescriptions in U.S. Medicare data. You’ll explore which prescriber features most strongly predict whether a prescription involves Semaglutide.

Learning Goals
  • Explore and prepare real-world healthcare data.

  • Build binary indicators and engineered features.

  • Split and preprocess data for model training and validation.

  • Detect and reduce multicollinearity using VIF.

  • Select variables using AIC-based forward selection.

  • Interpret results with odds ratios and evaluate performance with ROC/AUC.

Dataset Overview

We’ll use a subset of the Centers for Medicare & Medicaid Services (CMS) 2023 Medicare Part D Prescriber Public Use File. Each row represents a prescriber and a specific diabetes drug.

Dataset path
/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv
PrescriberImage

This dataset contains real-world prescribing behavior. Using logistic regression, we’ll explore questions such as: - Which prescribers are most likely to prescribe Semaglutide? - Are there regional or specialty-based trends? - What financial or volume indicators influence prescription likelihood?

Key Variables

Column Description

Prscrbr_State_Abrvtn

Prescriber’s state (two-letter code)

Prscrbr_Type

Detailed specialty (e.g., Family Practice, Endocrinology)

Brnd_Name

Brand name of the drug

Gnrc_Name

Generic name of the drug

Tot_Clms

Total number of claims

Tot_30day_Fills

Approximate number of 30-day fills

Tot_Day_Suply

Total days of medication supplied

Tot_Drug_Cst

Total drug cost across all claims

Total_Patients

Unique patients who received the drug

Prscrbr_Type_Grouped

Grouped specialty category (e.g., Primary Care)

1. Data Cleaning and Feature Engineering

Before building the model, we’ll create our target variable, add informative features, and prepare categorical groupings.

Let’s begin by loading the dataset and exploring its first few rows.

import pandas as pd
diabetes_prescriber_data = pd.read_csv("/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv")
diabetes_prescriber_data.head()
Prscrbr_State_Abrvtn Prscrbr_Type Brnd_Name Tot_Clms Tot_30day_Fills Tot_Day_Suply Tot_Drug_Cst Total_Patients Gnrc_Name Prscrbr_Type_Grouped

KY

Family Practice

Ozempic

18

19.7

568

18681.38

NaN

Semaglutide

Primary Care

CA

Endocrinology

Ozempic

113

230.7

6828

236624.18

25.0

Semaglutide

Endocrinology

MA

Rheumatology

Ozempic

11

11.9

342

10413.53

NaN

Semaglutide

GI/Renal/Rheum

TX

Internal Medicine

Ozempic

107

130.3

3765

128932.39

20.0

Semaglutide

Primary Care

OH

Certified Clinical Nurse Specialist

Ozempic

21

29.0

840

29104.96

NaN

Semaglutide

Advanced Practice

We’ll now define our target variable: Semaglutide_drug, which equals 1 when the generic name is Semaglutide and 0 otherwise.

Semaglutide_drug count

0

40065

1

8421

diabetes_prescriber_data["Semaglutide_drug"] = (diabetes_prescriber_data["Gnrc_Name"] == "Semaglutide").astype(int)
print(diabetes_prescriber_data["Semaglutide_drug"].value_counts())

Next, we’ll engineer new columns that reveal more about prescriber behavior. One useful metric is cost per claim:

diabetes_prescriber_data['Cost_per_claim'] = (
    diabetes_prescriber_data['Tot_Drug_Cst'] / diabetes_prescriber_data['Tot_Clms']
)
diabetes_prescriber_data[['Tot_Drug_Cst','Tot_Clms','Cost_per_claim']].head()
Tot_Drug_Cst Tot_Clms Cost_per_claim

18681.38

18

1037.854444

236624.18

113

2094.019292

10413.53

11

946.684545

128932.39

107

1204.975607

29104.96

21

1385.950476

We can also group prescribers by U.S. region for easier interpretation.

state_region_map = {
    "CT": "Northeast","ME": "Northeast","MA": "Northeast","NH": "Northeast",
    "NJ": "Northeast","NY": "Northeast","PA": "Northeast","RI": "Northeast","VT": "Northeast",
    "IL": "Midwest","IN": "Midwest","IA": "Midwest","KS": "Midwest","MI": "Midwest","MN": "Midwest",
    "MO": "Midwest","NE": "Midwest","ND": "Midwest","OH": "Midwest","SD": "Midwest","WI": "Midwest",
    "AL": "South","AR": "South","DE": "South","DC": "South","FL": "South","GA": "South","KY": "South",
    "LA": "South","MD": "South","MS": "South","NC": "South","OK": "South","SC": "South","TN": "South",
    "TX": "South","VA": "South","WV": "South",
    "AK": "West","AZ": "West","CA": "West","CO": "West","HI": "West","ID": "West","MT": "West",
    "NV": "West","NM": "West","OR": "West","UT": "West","WA": "West","WY": "West",
    "PR": "Territory","VI": "Territory","GU": "Territory","MP": "Territory","AS": "Territory",
    "AA": "Military","AE": "Military","AP": "Military","ZZ": "Unknown"
}
diabetes_prescriber_data["Prscrbr_State_Region"] = (
    diabetes_prescriber_data["Prscrbr_State_Abrvtn"].map(state_region_map)
)
diabetes_prescriber_data["Prscrbr_State_Region"].value_counts(dropna=False)
Prscrbr_State_Region count

South

17785

Midwest

10571

West

10170

Northeast

9559

Territory

391

Military

7

Unknown

3

At this stage, we’ve created a structured dataset suitable for exploratory data analysis.

2. Exploratory Data Analysis

Before modeling, we’ll explore patterns, missing values, and relationships between variables.

Check missing numeric data:

numeric_cols = ['Cost_per_claim','Tot_30day_Fills','Tot_Day_Suply','Total_Patients']
print(diabetes_prescriber_data[numeric_cols].isna().sum())

Next, compare group averages to spot differences between prescribers who do and don’t prescribe Semaglutide.

summary_stats = diabetes_prescriber_data.groupby("Semaglutide_drug")[numeric_cols].mean()
summary_stats
Semaglutide_drug Cost_per_claim Tot_30day_Fills Tot_Day_Suply

Total_Patients

0

151.209056

105.228559

2694.060552

34.224819

1

1293.350409

Let’s visualize correlation between numeric variables using a heatmap.

import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = diabetes_prescriber_data[numeric_cols].corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix of Numeric Features")
plt.tight_layout()
plt.show()
cormatrixpres

Finally, explore geographic patterns:

plt.figure(figsize=(10,5))
sns.countplot(data=diabetes_prescriber_data, x='Prscrbr_State_Region', hue='Semaglutide_drug')
plt.title('Semaglutide Prescriptions by Region')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
barplotpres

3. Splitting Data for Modeling

To evaluate the model fairly, we’ll split data into train, validation, and test sets.

from sklearn.model_selection import train_test_split

model_features = ["Tot_30day_Fills","Tot_Day_Suply","Cost_per_claim","Total_Patients",
                  "Prscrbr_State_Region","Prscrbr_Type_Grouped"]

X = diabetes_prescriber_data[model_features]
y = diabetes_prescriber_data["Semaglutide_drug"]

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42)

Check that class proportions remain balanced across subsets:

for name, arr in [("Train", y_train), ("Validation", y_val), ("Test", y_test)]:
    print(f"\n{name} rows:", len(arr))
    print(arr.value_counts(normalize=True))
Dataset Rows Class Proportion (Semaglutide_drug)

Train

29,091

0: 0.826304

1: 0.173696

Validation

9,697

0: 0.826338

1: 0.173662

Test

9,698

0: 0.826356

1: 0.173644

4. Data Preprocessing

Now we’ll clean and standardize features for modeling.

  1. Fill missing values for categorical and numeric columns.

  2. Apply one-hot encoding to categorical columns.

  3. Standardize numeric variables.

  4. Ensure train/validation/test sets share identical column structures.

categorical_cols = ['Prscrbr_State_Region','Prscrbr_Type_Grouped']

for df in [X_train, X_val, X_test]:
    for col in categorical_cols:
        df[col] = df[col].fillna("Missing")

X_train = pd.get_dummies(data=X_train, columns=categorical_cols, drop_first=True)
X_test = pd.get_dummies(data=X_test, columns=categorical_cols, drop_first=True)
X_val  = pd.get_dummies(data=X_val,  columns=categorical_cols, drop_first=True)

encoded_columns = X_train.columns
X_test = X_test.reindex(columns=encoded_columns, fill_value=0)
X_val  = X_val.reindex(columns=encoded_columns, fill_value=0)

Standardize numeric variables:

from sklearn.preprocessing import StandardScaler

numeric_cols = ['Tot_30day_Fills','Tot_Day_Suply','Cost_per_claim','Total_Patients']
scaler = StandardScaler()

X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols]   = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])

Finally, confirm column consistency:

print(X_train.shape, X_val.shape, X_test.shape)

X_train shape: (29091, 26)

X_val shape: (9697, 26)

X_test shape: (9698, 26)

5. Model Building and Evaluation

We’ll now fit a logistic regression model, interpret coefficients, and assess performance.

Checking Multicollinearity

import numpy as np
from sklearn.linear_model import LinearRegression

def calculate_vif(X):
    vif_dict = {}
    for feature in X.columns:
        y = X[feature]
        X_pred = X.drop(columns=feature)
        r2 = LinearRegression().fit(X_pred, y).score(X_pred, y)
        vif_dict[feature] = np.inf if r2 == 1 else 1/(1 - r2)
    return pd.Series(vif_dict, name="VIF").sort_values(ascending=False)

vif_values = calculate_vif(X_train.select_dtypes(include=[np.number]))
vif_values
Feature VIF

Tot_30day_Fills

948.471582

Total_Patients

832.139064

Tot_Day_Suply

51.847664

Prscrbr_State_Region_South

1.704191

Prscrbr_State_Region_West

1.553571

Prscrbr_State_Region_Northeast

1.532947

Prscrbr_Type_Grouped_Primary Care

1.452644

Prscrbr_Type_Grouped_Dental

1.223323

Prscrbr_Type_Grouped_Missing

1.213235

Prscrbr_Type_Grouped_Dermatology/Ophthalmology

1.138811

Prscrbr_Type_Grouped_Surgery

1.111818

Prscrbr_Type_Grouped_Neuro/Psych

1.094020

Prscrbr_Type_Grouped_Cardiology

1.091790

Prscrbr_Type_Grouped_GI/Renal/Rheum

1.074066

Prscrbr_Type_Grouped_Other

1.071768

Cost_per_claim

1.058779

Prscrbr_Type_Grouped_Endocrinology

1.058377

Prscrbr_Type_Grouped_Women’s Health

1.050068

Prscrbr_Type_Grouped_Oncology/Hematology

1.043538

Prscrbr_State_Region_Territory

1.034220

Prscrbr_Type_Grouped_Pulmonary/Critical Care

1.021193

Prscrbr_Type_Grouped_Rehabilitation

1.019222

Prscrbr_Type_Grouped_Anesthesia/Pain

1.014930

Prscrbr_Type_Grouped_Palliative Care

1.002206

Prscrbr_State_Region_Military

1.001128

Prscrbr_State_Region_Unknown

1.000924

Remove highly collinear variables (VIF > 10), except Tot_Day_Suply which we’ll retain for interpretability.

features_after_vif = [col for col in X_train.columns if col in (
'Tot_Day_Suply','Cost_per_claim','Prscrbr_State_Region_South',
'Prscrbr_Type_Grouped_Primary Care','Prscrbr_Type_Grouped_Endocrinology')]
X_train = X_train[features_after_vif]

Forward Selection with AIC

import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

def forward_selection(X, y, aic_threshold=20):
    included = []
    current_score = np.inf
    while True:
        changed=False
        excluded = list(set(X.columns) - set(included))
        scores = []
        for new_col in excluded:
            try:
                model = sm.Logit(y, sm.add_constant(X[included + [new_col]])).fit(disp=0)
                scores.append((model.aic, new_col))
            except:
                continue
        if not scores: break
        scores.sort()
        best_new_score, best_candidate = scores[0]
        if current_score - best_new_score >= aic_threshold:
            included.append(best_candidate)
            current_score = best_new_score
            changed=True
        if not changed: break
    return included

selected_features = forward_selection(X_train, y_train)
selected_features

Selected features: ['Cost_per_claim', 'Prscrbr_Type_Grouped_Primary Care', 'Prscrbr_Type_Grouped_GI/Renal/Rheum', 'Tot_Day_Suply', 'Prscrbr_Type_Grouped_Endocrinology', 'Prscrbr_Type_Grouped_Neuro/Psych', 'Prscrbr_Type_Grouped_Missing', 'Prscrbr_Type_Grouped_Surgery', 'Prscrbr_Type_Grouped_Other', "Prscrbr_Type_Grouped_Women’s Health", 'Prscrbr_State_Region_South']

Fitting the Final Model

X_train_final = sm.add_constant(X_train[selected_features])
final_model = sm.Logit(y_train, X_train_final).fit()
print(final_model.summary())
print(f"AIC: {final_model.aic}")
Logit Regression Results
Term coef std err z P> z

[0.025, 0.975]

const

-2.3807

0.038

-63.081

0.000

[-2.455, -2.307]

Cost_per_claim

2.4191

0.036

66.890

0.000

[2.348, 2.490]

Prscrbr_Type_Grouped_Primary Care

1.2536

0.045

28.164

0.000

[1.166, 1.341]

Prscrbr_Type_Grouped_GI/Renal/Rheum

-4.8014

0.357

-13.449

0.000

[-5.501, -4.102]

Tot_Day_Suply

-0.7831

0.044

-17.731

0.000

[-0.870, -0.697]

Prscrbr_Type_Grouped_Endocrinology

1.8776

0.148

12.667

0.000

[1.587, 2.168]

Prscrbr_Type_Grouped_Neuro/Psych

-6.7406

0.959

-7.027

0.000

[-8.621, -4.861]

Prscrbr_Type_Grouped_Missing

-1.4901

0.136

-10.951

0.000

[-1.757, -1.223]

Prscrbr_Type_Grouped_Surgery

-2.5203

0.343

-7.348

0.000

[-3.193, -1.848]

Prscrbr_Type_Grouped_Other

-1.6574

0.268

-6.196

0.000

[-2.182, -1.133]

Prscrbr_Type_Grouped_Women’s Health

-2.0414

0.451

-4.524

0.000

[-2.926, -1.157]

Prscrbr_State_Region_South

0.2295

0.043

5.301

Optimization terminated successfully. Model Information: - Dependent Variable: Semaglutide_drug - Observations: 29,091 - Pseudo R²: 0.4312 - Log-Likelihood: -7640.2 - LL-Null: -13431 - AIC: 15304.48 - Converged: True - Covariance Type: nonrobust - LLR p-value: 0.000

Interpreting Coefficients with Odds Ratios

odds_ratios = pd.DataFrame({
    "Odds Ratio": np.exp(final_model.params),
    "P-value": final_model.pvalues
})
odds_ratios["Direction"] = odds_ratios["Odds Ratio"].apply(
    lambda x: "Increases Odds" if x > 1 else ("Decreases Odds" if x < 1 else "No Effect"))
odds_ratios.round(3)
Feature Odds Ratio P-value Direction

Cost_per_claim

11.236

0.0

Increases Odds

Prscrbr_Type_Grouped_Endocrinology

6.538

0.0

Increases Odds

Prscrbr_Type_Grouped_Primary Care

3.503

0.0

Increases Odds

Prscrbr_State_Region_South

1.258

0.0

Increases Odds

Tot_Day_Suply

0.457

0.0

Decreases Odds

Prscrbr_Type_Grouped_Missing

0.225

0.0

Decreases Odds

Prscrbr_Type_Grouped_Other

0.191

0.0

Decreases Odds

Prscrbr_Type_Grouped_Women’s Health

0.130

0.0

Decreases Odds

const

0.092

0.0

Decreases Odds

Prscrbr_Type_Grouped_Surgery

0.080

0.0

Decreases Odds

Prscrbr_Type_Grouped_GI/Renal/Rheum

0.008

0.0

Decreases Odds

Prscrbr_Type_Grouped_Neuro/Psych

0.001

0.0

Decreases Odds

Confusion Matrices

from sklearn.metrics import confusion_matrix

def display_cm(y_true, y_pred, label):
    cm = confusion_matrix(y_true, y_pred)
    df = pd.DataFrame(cm, index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"])
    print(f"\n{label} Confusion Matrix:\n", df)

X_val_final = sm.add_constant(X_val[selected_features])
X_test_final = sm.add_constant(X_test[selected_features])

train_pred = (final_model.predict(X_train_final) >= 0.5).astype(int)
val_pred   = (final_model.predict(X_val_final)   >= 0.5).astype(int)
test_pred  = (final_model.predict(X_test_final)  >= 0.5).astype(int)

display_cm(y_train, train_pred, "Train")
display_cm(y_val, val_pred, "Validation")
display_cm(y_test, test_pred, "Test")
Train Confusion Matrix Predicted 0 Predicted 1

Actual 0

23608

430

Actual 1

1260

3793

Validation Confusion Matrix Predicted 0 Predicted 1

Actual 0

7875

138

Actual 1

421

1263

Test Confusion Matrix Predicted 0 Predicted 1

Actual 0

7865

149

Actual 1

424

1260

ROC Curves and AUC

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

def plot_roc(y_true, y_proba, label):
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    auc = roc_auc_score(y_true, y_proba)
    plt.plot(fpr, tpr, label=f"{label} (AUC={auc:.2f})")

plt.figure(figsize=(8,6))
plot_roc(y_train, final_model.predict(X_train_final), "Train")
plot_roc(y_val, final_model.predict(X_val_final), "Validation")
plot_roc(y_test, final_model.predict(X_test_final), "Test")
plt.plot([0,1],[0,1],'k--')
plt.title("ROC Curves - Train, Validation, and Test")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid(True)
plt.show()

print("Train AUC:", roc_auc_score(y_train, final_model.predict(X_train_final)))
print("Validation AUC:", roc_auc_score(y_val, final_model.predict(X_val_final)))
print("Test AUC:", roc_auc_score(y_test, final_model.predict(X_test_final)))
roccurvepres

6. Making Predictions on New Prescribers

At this point, our model is trained and evaluated. Let’s see how it performs on new, unseen prescribers.

We’ll sample 20 new records from the test set and compute their predicted probability of prescribing Semaglutide.

sample_prescribers = X_test.sample(n=20, random_state=42)
sample_final = sm.add_constant(sample_prescribers[selected_features])
sample_final = sample_final[final_model.params.index]
sample_preds = final_model.predict(sample_final)

scored_sample = sample_prescribers.copy()
scored_sample["Predicted_Probability"] = sample_preds

scored_sample.sort_values("Predicted_Probability", ascending=False).head(5)
Index Tot_30day_Fills Tot_Day_Suply Cost_per_claim Total_Patients Prscrbr_State_Region_Military Prscrbr_State_Region_Northeast Prscrbr_State_Region_South Prscrbr_State_Region_West Prscrbr_Type_Grouped_Primary Care Predicted_Probability

26406

-0.060228

-0.181987

1.466532

-0.018389

0

0

0

1

1

0.928453

1897

-0.050658

-0.099769

1.611421

-0.018389

0

0

1

0

0

0.861192

1442

0.009717

0.097126

0.795578

-0.039022

0

0

1

0

1

0.721310

2520

-0.146181

-0.452130

0.708982

-0.018389

0

0

1

0

0

0.479521

37937

-0.074843

-0.199071

-0.315274

-0.044180

0

0

0

0

0

…​

You can now interpret the highest probabilities. Typically, top prescribers share similar characteristics — such as being in Southern or Western regions, having higher cost per claim, or belonging to Primary Care specialties.

This consistent pattern suggests that both geographic and specialty factors influence Semaglutide prescribing behavior.

Conclusion

You’ve completed a full logistic regression pipeline: - Data cleaning and feature engineering - Exploratory analysis and visualization - Data splitting and preprocessing - Multicollinearity diagnostics and feature selection - Model fitting, evaluation, and interpretation - Predictive scoring on unseen prescribers

This workflow represents a robust and interpretable approach to understanding binary outcomes — a method widely used in both healthcare and business analytics.

By mastering this process, you can now apply logistic regression confidently to new datasets where understanding the probability of an event is key.