Logistic Regression – Predicting Prescriptions Using Logistic Regression
Introduction
This guide introduces logistic regression through a real-world healthcare analytics example: predicting which prescribers are most likely to issue prescriptions for Ozempic (generic name Semaglutide). You’ll follow a full modeling workflow — from cleaning and preparing the data to interpreting the final model’s coefficients, evaluating predictive accuracy, and making new predictions.
By walking through this case step by step, you’ll learn how logistic regression can reveal the relationships between features like prescriber specialty, region, and cost patterns, and how these factors influence the likelihood of prescribing Semaglutide.
Project Objectives
This project focuses on using logistic regression to uncover the patterns behind Semaglutide prescriptions in U.S. Medicare data. You’ll explore which prescriber features most strongly predict whether a prescription involves Semaglutide.
Dataset Overview
We’ll use a subset of the Centers for Medicare & Medicaid Services (CMS) 2023 Medicare Part D Prescriber Public Use File. Each row represents a prescriber and a specific diabetes drug.
/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv

This dataset contains real-world prescribing behavior. Using logistic regression, we’ll explore questions such as: - Which prescribers are most likely to prescribe Semaglutide? - Are there regional or specialty-based trends? - What financial or volume indicators influence prescription likelihood?
Key Variables
Column | Description |
---|---|
|
Prescriber’s state (two-letter code) |
|
Detailed specialty (e.g., Family Practice, Endocrinology) |
|
Brand name of the drug |
|
Generic name of the drug |
|
Total number of claims |
|
Approximate number of 30-day fills |
|
Total days of medication supplied |
|
Total drug cost across all claims |
|
Unique patients who received the drug |
|
Grouped specialty category (e.g., Primary Care) |
1. Data Cleaning and Feature Engineering
Before building the model, we’ll create our target variable, add informative features, and prepare categorical groupings.
Let’s begin by loading the dataset and exploring its first few rows.
import pandas as pd
diabetes_prescriber_data = pd.read_csv("/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv")
diabetes_prescriber_data.head()
Prscrbr_State_Abrvtn | Prscrbr_Type | Brnd_Name | Tot_Clms | Tot_30day_Fills | Tot_Day_Suply | Tot_Drug_Cst | Total_Patients | Gnrc_Name | Prscrbr_Type_Grouped |
---|---|---|---|---|---|---|---|---|---|
KY |
Family Practice |
Ozempic |
18 |
19.7 |
568 |
18681.38 |
NaN |
Semaglutide |
Primary Care |
CA |
Endocrinology |
Ozempic |
113 |
230.7 |
6828 |
236624.18 |
25.0 |
Semaglutide |
Endocrinology |
MA |
Rheumatology |
Ozempic |
11 |
11.9 |
342 |
10413.53 |
NaN |
Semaglutide |
GI/Renal/Rheum |
TX |
Internal Medicine |
Ozempic |
107 |
130.3 |
3765 |
128932.39 |
20.0 |
Semaglutide |
Primary Care |
OH |
Certified Clinical Nurse Specialist |
Ozempic |
21 |
29.0 |
840 |
29104.96 |
NaN |
Semaglutide |
Advanced Practice |
We’ll now define our target variable: Semaglutide_drug
, which equals 1 when the generic name is Semaglutide and 0 otherwise.
Semaglutide_drug | count |
---|---|
0 |
40065 |
1 |
8421 |
diabetes_prescriber_data["Semaglutide_drug"] = (diabetes_prescriber_data["Gnrc_Name"] == "Semaglutide").astype(int)
print(diabetes_prescriber_data["Semaglutide_drug"].value_counts())
Next, we’ll engineer new columns that reveal more about prescriber behavior. One useful metric is cost per claim:
diabetes_prescriber_data['Cost_per_claim'] = (
diabetes_prescriber_data['Tot_Drug_Cst'] / diabetes_prescriber_data['Tot_Clms']
)
diabetes_prescriber_data[['Tot_Drug_Cst','Tot_Clms','Cost_per_claim']].head()
Tot_Drug_Cst | Tot_Clms | Cost_per_claim |
---|---|---|
18681.38 |
18 |
1037.854444 |
236624.18 |
113 |
2094.019292 |
10413.53 |
11 |
946.684545 |
128932.39 |
107 |
1204.975607 |
29104.96 |
21 |
1385.950476 |
We can also group prescribers by U.S. region for easier interpretation.
state_region_map = {
"CT": "Northeast","ME": "Northeast","MA": "Northeast","NH": "Northeast",
"NJ": "Northeast","NY": "Northeast","PA": "Northeast","RI": "Northeast","VT": "Northeast",
"IL": "Midwest","IN": "Midwest","IA": "Midwest","KS": "Midwest","MI": "Midwest","MN": "Midwest",
"MO": "Midwest","NE": "Midwest","ND": "Midwest","OH": "Midwest","SD": "Midwest","WI": "Midwest",
"AL": "South","AR": "South","DE": "South","DC": "South","FL": "South","GA": "South","KY": "South",
"LA": "South","MD": "South","MS": "South","NC": "South","OK": "South","SC": "South","TN": "South",
"TX": "South","VA": "South","WV": "South",
"AK": "West","AZ": "West","CA": "West","CO": "West","HI": "West","ID": "West","MT": "West",
"NV": "West","NM": "West","OR": "West","UT": "West","WA": "West","WY": "West",
"PR": "Territory","VI": "Territory","GU": "Territory","MP": "Territory","AS": "Territory",
"AA": "Military","AE": "Military","AP": "Military","ZZ": "Unknown"
}
diabetes_prescriber_data["Prscrbr_State_Region"] = (
diabetes_prescriber_data["Prscrbr_State_Abrvtn"].map(state_region_map)
)
diabetes_prescriber_data["Prscrbr_State_Region"].value_counts(dropna=False)
Prscrbr_State_Region | count |
---|---|
South |
17785 |
Midwest |
10571 |
West |
10170 |
Northeast |
9559 |
Territory |
391 |
Military |
7 |
Unknown |
3 |
At this stage, we’ve created a structured dataset suitable for exploratory data analysis.
2. Exploratory Data Analysis
Before modeling, we’ll explore patterns, missing values, and relationships between variables.
Check missing numeric data:
numeric_cols = ['Cost_per_claim','Tot_30day_Fills','Tot_Day_Suply','Total_Patients']
print(diabetes_prescriber_data[numeric_cols].isna().sum())
Next, compare group averages to spot differences between prescribers who do and don’t prescribe Semaglutide.
summary_stats = diabetes_prescriber_data.groupby("Semaglutide_drug")[numeric_cols].mean()
summary_stats
Semaglutide_drug | Cost_per_claim | Tot_30day_Fills | Tot_Day_Suply |
---|---|---|---|
Total_Patients |
0 |
151.209056 |
105.228559 |
2694.060552 |
34.224819 |
1 |
1293.350409 |
Let’s visualize correlation between numeric variables using a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = diabetes_prescriber_data[numeric_cols].corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix of Numeric Features")
plt.tight_layout()
plt.show()

Finally, explore geographic patterns:
plt.figure(figsize=(10,5))
sns.countplot(data=diabetes_prescriber_data, x='Prscrbr_State_Region', hue='Semaglutide_drug')
plt.title('Semaglutide Prescriptions by Region')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

3. Splitting Data for Modeling
To evaluate the model fairly, we’ll split data into train, validation, and test sets.
from sklearn.model_selection import train_test_split
model_features = ["Tot_30day_Fills","Tot_Day_Suply","Cost_per_claim","Total_Patients",
"Prscrbr_State_Region","Prscrbr_Type_Grouped"]
X = diabetes_prescriber_data[model_features]
y = diabetes_prescriber_data["Semaglutide_drug"]
X_train_val, X_test, y_train_val, y_test = train_test_split(
X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42)
Check that class proportions remain balanced across subsets:
for name, arr in [("Train", y_train), ("Validation", y_val), ("Test", y_test)]:
print(f"\n{name} rows:", len(arr))
print(arr.value_counts(normalize=True))
Dataset | Rows | Class Proportion (Semaglutide_drug) |
---|---|---|
Train |
29,091 |
0: 0.826304 |
1: 0.173696 |
Validation |
9,697 |
0: 0.826338 |
1: 0.173662 |
Test |
9,698 |
0: 0.826356 |
1: 0.173644 |
4. Data Preprocessing
Now we’ll clean and standardize features for modeling.
-
Fill missing values for categorical and numeric columns.
-
Apply one-hot encoding to categorical columns.
-
Standardize numeric variables.
-
Ensure train/validation/test sets share identical column structures.
categorical_cols = ['Prscrbr_State_Region','Prscrbr_Type_Grouped']
for df in [X_train, X_val, X_test]:
for col in categorical_cols:
df[col] = df[col].fillna("Missing")
X_train = pd.get_dummies(data=X_train, columns=categorical_cols, drop_first=True)
X_test = pd.get_dummies(data=X_test, columns=categorical_cols, drop_first=True)
X_val = pd.get_dummies(data=X_val, columns=categorical_cols, drop_first=True)
encoded_columns = X_train.columns
X_test = X_test.reindex(columns=encoded_columns, fill_value=0)
X_val = X_val.reindex(columns=encoded_columns, fill_value=0)
Standardize numeric variables:
from sklearn.preprocessing import StandardScaler
numeric_cols = ['Tot_30day_Fills','Tot_Day_Suply','Cost_per_claim','Total_Patients']
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
Finally, confirm column consistency:
print(X_train.shape, X_val.shape, X_test.shape)
X_train shape: (29091, 26)
X_val shape: (9697, 26)
X_test shape: (9698, 26)
5. Model Building and Evaluation
We’ll now fit a logistic regression model, interpret coefficients, and assess performance.
Checking Multicollinearity
import numpy as np
from sklearn.linear_model import LinearRegression
def calculate_vif(X):
vif_dict = {}
for feature in X.columns:
y = X[feature]
X_pred = X.drop(columns=feature)
r2 = LinearRegression().fit(X_pred, y).score(X_pred, y)
vif_dict[feature] = np.inf if r2 == 1 else 1/(1 - r2)
return pd.Series(vif_dict, name="VIF").sort_values(ascending=False)
vif_values = calculate_vif(X_train.select_dtypes(include=[np.number]))
vif_values
Feature | VIF |
---|---|
Tot_30day_Fills |
948.471582 |
Total_Patients |
832.139064 |
Tot_Day_Suply |
51.847664 |
Prscrbr_State_Region_South |
1.704191 |
Prscrbr_State_Region_West |
1.553571 |
Prscrbr_State_Region_Northeast |
1.532947 |
Prscrbr_Type_Grouped_Primary Care |
1.452644 |
Prscrbr_Type_Grouped_Dental |
1.223323 |
Prscrbr_Type_Grouped_Missing |
1.213235 |
Prscrbr_Type_Grouped_Dermatology/Ophthalmology |
1.138811 |
Prscrbr_Type_Grouped_Surgery |
1.111818 |
Prscrbr_Type_Grouped_Neuro/Psych |
1.094020 |
Prscrbr_Type_Grouped_Cardiology |
1.091790 |
Prscrbr_Type_Grouped_GI/Renal/Rheum |
1.074066 |
Prscrbr_Type_Grouped_Other |
1.071768 |
Cost_per_claim |
1.058779 |
Prscrbr_Type_Grouped_Endocrinology |
1.058377 |
Prscrbr_Type_Grouped_Women’s Health |
1.050068 |
Prscrbr_Type_Grouped_Oncology/Hematology |
1.043538 |
Prscrbr_State_Region_Territory |
1.034220 |
Prscrbr_Type_Grouped_Pulmonary/Critical Care |
1.021193 |
Prscrbr_Type_Grouped_Rehabilitation |
1.019222 |
Prscrbr_Type_Grouped_Anesthesia/Pain |
1.014930 |
Prscrbr_Type_Grouped_Palliative Care |
1.002206 |
Prscrbr_State_Region_Military |
1.001128 |
Prscrbr_State_Region_Unknown |
1.000924 |
Remove highly collinear variables (VIF > 10), except Tot_Day_Suply
which we’ll retain for interpretability.
features_after_vif = [col for col in X_train.columns if col in (
'Tot_Day_Suply','Cost_per_claim','Prscrbr_State_Region_South',
'Prscrbr_Type_Grouped_Primary Care','Prscrbr_Type_Grouped_Endocrinology')]
X_train = X_train[features_after_vif]
Forward Selection with AIC
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
def forward_selection(X, y, aic_threshold=20):
included = []
current_score = np.inf
while True:
changed=False
excluded = list(set(X.columns) - set(included))
scores = []
for new_col in excluded:
try:
model = sm.Logit(y, sm.add_constant(X[included + [new_col]])).fit(disp=0)
scores.append((model.aic, new_col))
except:
continue
if not scores: break
scores.sort()
best_new_score, best_candidate = scores[0]
if current_score - best_new_score >= aic_threshold:
included.append(best_candidate)
current_score = best_new_score
changed=True
if not changed: break
return included
selected_features = forward_selection(X_train, y_train)
selected_features
Selected features: ['Cost_per_claim', 'Prscrbr_Type_Grouped_Primary Care', 'Prscrbr_Type_Grouped_GI/Renal/Rheum', 'Tot_Day_Suply', 'Prscrbr_Type_Grouped_Endocrinology', 'Prscrbr_Type_Grouped_Neuro/Psych', 'Prscrbr_Type_Grouped_Missing', 'Prscrbr_Type_Grouped_Surgery', 'Prscrbr_Type_Grouped_Other', "Prscrbr_Type_Grouped_Women’s Health", 'Prscrbr_State_Region_South']
Fitting the Final Model
X_train_final = sm.add_constant(X_train[selected_features])
final_model = sm.Logit(y_train, X_train_final).fit()
print(final_model.summary())
print(f"AIC: {final_model.aic}")
Term | coef | std err | z | P> | z |
---|---|---|---|---|---|
[0.025, 0.975] |
const |
-2.3807 |
0.038 |
-63.081 |
|
0.000 |
[-2.455, -2.307] |
Cost_per_claim |
2.4191 |
0.036 |
66.890 |
0.000 |
[2.348, 2.490] |
Prscrbr_Type_Grouped_Primary Care |
1.2536 |
0.045 |
28.164 |
0.000 |
[1.166, 1.341] |
Prscrbr_Type_Grouped_GI/Renal/Rheum |
-4.8014 |
0.357 |
-13.449 |
0.000 |
[-5.501, -4.102] |
Tot_Day_Suply |
-0.7831 |
0.044 |
-17.731 |
0.000 |
[-0.870, -0.697] |
Prscrbr_Type_Grouped_Endocrinology |
1.8776 |
0.148 |
12.667 |
0.000 |
[1.587, 2.168] |
Prscrbr_Type_Grouped_Neuro/Psych |
-6.7406 |
0.959 |
-7.027 |
0.000 |
[-8.621, -4.861] |
Prscrbr_Type_Grouped_Missing |
-1.4901 |
0.136 |
-10.951 |
0.000 |
[-1.757, -1.223] |
Prscrbr_Type_Grouped_Surgery |
-2.5203 |
0.343 |
-7.348 |
0.000 |
[-3.193, -1.848] |
Prscrbr_Type_Grouped_Other |
-1.6574 |
0.268 |
-6.196 |
0.000 |
[-2.182, -1.133] |
Prscrbr_Type_Grouped_Women’s Health |
-2.0414 |
0.451 |
-4.524 |
0.000 |
[-2.926, -1.157] |
Prscrbr_State_Region_South |
0.2295 |
0.043 |
5.301 |
Optimization terminated successfully.
Model Information:
- Dependent Variable: |
Interpreting Coefficients with Odds Ratios
odds_ratios = pd.DataFrame({
"Odds Ratio": np.exp(final_model.params),
"P-value": final_model.pvalues
})
odds_ratios["Direction"] = odds_ratios["Odds Ratio"].apply(
lambda x: "Increases Odds" if x > 1 else ("Decreases Odds" if x < 1 else "No Effect"))
odds_ratios.round(3)
Feature | Odds Ratio | P-value | Direction |
---|---|---|---|
Cost_per_claim |
11.236 |
0.0 |
Increases Odds |
Prscrbr_Type_Grouped_Endocrinology |
6.538 |
0.0 |
Increases Odds |
Prscrbr_Type_Grouped_Primary Care |
3.503 |
0.0 |
Increases Odds |
Prscrbr_State_Region_South |
1.258 |
0.0 |
Increases Odds |
Tot_Day_Suply |
0.457 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_Missing |
0.225 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_Other |
0.191 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_Women’s Health |
0.130 |
0.0 |
Decreases Odds |
const |
0.092 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_Surgery |
0.080 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_GI/Renal/Rheum |
0.008 |
0.0 |
Decreases Odds |
Prscrbr_Type_Grouped_Neuro/Psych |
0.001 |
0.0 |
Decreases Odds |
Confusion Matrices
from sklearn.metrics import confusion_matrix
def display_cm(y_true, y_pred, label):
cm = confusion_matrix(y_true, y_pred)
df = pd.DataFrame(cm, index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"])
print(f"\n{label} Confusion Matrix:\n", df)
X_val_final = sm.add_constant(X_val[selected_features])
X_test_final = sm.add_constant(X_test[selected_features])
train_pred = (final_model.predict(X_train_final) >= 0.5).astype(int)
val_pred = (final_model.predict(X_val_final) >= 0.5).astype(int)
test_pred = (final_model.predict(X_test_final) >= 0.5).astype(int)
display_cm(y_train, train_pred, "Train")
display_cm(y_val, val_pred, "Validation")
display_cm(y_test, test_pred, "Test")
Train Confusion Matrix | Predicted 0 | Predicted 1 |
---|---|---|
Actual 0 |
23608 |
430 |
Actual 1 |
1260 |
3793 |
Validation Confusion Matrix | Predicted 0 | Predicted 1 |
---|---|---|
Actual 0 |
7875 |
138 |
Actual 1 |
421 |
1263 |
Test Confusion Matrix | Predicted 0 | Predicted 1 |
---|---|---|
Actual 0 |
7865 |
149 |
Actual 1 |
424 |
1260 |
ROC Curves and AUC
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
def plot_roc(y_true, y_proba, label):
fpr, tpr, _ = roc_curve(y_true, y_proba)
auc = roc_auc_score(y_true, y_proba)
plt.plot(fpr, tpr, label=f"{label} (AUC={auc:.2f})")
plt.figure(figsize=(8,6))
plot_roc(y_train, final_model.predict(X_train_final), "Train")
plot_roc(y_val, final_model.predict(X_val_final), "Validation")
plot_roc(y_test, final_model.predict(X_test_final), "Test")
plt.plot([0,1],[0,1],'k--')
plt.title("ROC Curves - Train, Validation, and Test")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid(True)
plt.show()
print("Train AUC:", roc_auc_score(y_train, final_model.predict(X_train_final)))
print("Validation AUC:", roc_auc_score(y_val, final_model.predict(X_val_final)))
print("Test AUC:", roc_auc_score(y_test, final_model.predict(X_test_final)))

6. Making Predictions on New Prescribers
At this point, our model is trained and evaluated. Let’s see how it performs on new, unseen prescribers.
We’ll sample 20 new records from the test set and compute their predicted probability of prescribing Semaglutide.
sample_prescribers = X_test.sample(n=20, random_state=42)
sample_final = sm.add_constant(sample_prescribers[selected_features])
sample_final = sample_final[final_model.params.index]
sample_preds = final_model.predict(sample_final)
scored_sample = sample_prescribers.copy()
scored_sample["Predicted_Probability"] = sample_preds
scored_sample.sort_values("Predicted_Probability", ascending=False).head(5)
Index | Tot_30day_Fills | Tot_Day_Suply | Cost_per_claim | Total_Patients | Prscrbr_State_Region_Military | Prscrbr_State_Region_Northeast | Prscrbr_State_Region_South | Prscrbr_State_Region_West | Prscrbr_Type_Grouped_Primary Care | Predicted_Probability |
---|---|---|---|---|---|---|---|---|---|---|
26406 |
-0.060228 |
-0.181987 |
1.466532 |
-0.018389 |
0 |
0 |
0 |
1 |
1 |
0.928453 |
1897 |
-0.050658 |
-0.099769 |
1.611421 |
-0.018389 |
0 |
0 |
1 |
0 |
0 |
0.861192 |
1442 |
0.009717 |
0.097126 |
0.795578 |
-0.039022 |
0 |
0 |
1 |
0 |
1 |
0.721310 |
2520 |
-0.146181 |
-0.452130 |
0.708982 |
-0.018389 |
0 |
0 |
1 |
0 |
0 |
0.479521 |
37937 |
-0.074843 |
-0.199071 |
-0.315274 |
-0.044180 |
0 |
0 |
0 |
0 |
0 |
… |
You can now interpret the highest probabilities. Typically, top prescribers share similar characteristics — such as being in Southern or Western regions, having higher cost per claim, or belonging to Primary Care specialties.
This consistent pattern suggests that both geographic and specialty factors influence Semaglutide prescribing behavior.
Conclusion
You’ve completed a full logistic regression pipeline: - Data cleaning and feature engineering - Exploratory analysis and visualization - Data splitting and preprocessing - Multicollinearity diagnostics and feature selection - Model fitting, evaluation, and interpretation - Predictive scoring on unseen prescribers
This workflow represents a robust and interpretable approach to understanding binary outcomes — a method widely used in both healthcare and business analytics.
By mastering this process, you can now apply logistic regression confidently to new datasets where understanding the probability of an event is key.