Logistic Regression – Predicting Semaglutide Prescriptions

Introduction

This guide introduces logistic regression through a real-world healthcare analytics example: predicting which prescribers are most likely to issue prescriptions for Ozempic (generic name Semaglutide). You’ll follow a full modeling workflow — from cleaning and preparing the data to interpreting the final model’s coefficients, evaluating predictive accuracy, and making new predictions.

By walking through this case step by step, you’ll learn how logistic regression can reveal the relationships between features like prescriber specialty, region, and cost patterns, and how these factors influence the likelihood of prescribing Semaglutide.

This guide was originally created as a seminar project, so you may notice it follows a question-and-answer format. Even so, it should still serve as a helpful resource to guide you through the process of building your own model.

Project Objectives

This project focuses on using logistic regression to uncover the patterns behind Semaglutide prescriptions in U.S. Medicare data. You’ll explore which prescriber features most strongly predict whether a prescription involves Semaglutide.

Learning Goals
  • Explore and prepare real-world healthcare data.

  • Build binary indicators and engineered features.

  • Split and preprocess data for model training and validation.

  • Detect and reduce multicollinearity using VIF.

  • Select variables using AIC-based forward selection.

  • Interpret results with odds ratios and evaluate performance with ROC/AUC.

Dataset Overview

We’ll use a subset of the Centers for Medicare & Medicaid Services (CMS) 2023 Medicare Part D Prescriber Public Use File. Each row represents a prescriber and a specific diabetes drug.

Dataset path
/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv

This dataset contains real-world prescribing behavior. Using logistic regression, we’ll explore questions such as:

  • Which prescribers are most likely to prescribe Semaglutide?

  • Are there regional or specialty-based trends?

  • What financial or volume indicators influence prescription likelihood?

Key Variables

Column                 Description
Prscrbr_State_Abrvtn   Prescriber’s state (two-letter code)
Prscrbr_Type           Detailed specialty (e.g., Family Practice, Endocrinology)
Brnd_Name              Brand name of the drug
Gnrc_Name              Generic name of the drug
Tot_Clms               Total number of claims
Tot_30day_Fills        Approximate number of 30-day fills
Tot_Day_Suply          Total days of medication supplied
Tot_Drug_Cst           Total drug cost across all claims
Total_Patients         Unique patients who received the drug
Prscrbr_Type_Grouped   Grouped specialty category (e.g., Primary Care)

1. Data Cleaning and Feature Engineering

Class Distributions

Before we can analyze or model anything, we need to clean the data and add a few new features (do some feature engineering). One of the first things you will do is create a binary target variable to indicate whether a record involves Semaglutide. This kind of variable only takes on two values: 1 if the drug is Semaglutide, and 0 if it is not. In Python, you can create a binary variable using a Boolean expression and then convert the result to integers using .astype(int).

For example:

df["Semaglutide_drug"] = (df["Gnrc_Name"] == "Semaglutide").astype(int)

This creates a column where any row with Semaglutide is marked as 1, and all others are marked as 0. This will eventually serve as the outcome we try to predict.

Feature Engineering - Creating New Columns

Another important task we will perform is called feature engineering. Feature engineering is the process of transforming raw data into meaningful features that better represent the underlying problem to the predictive model. This includes creating, modifying, or grouping existing variables to enhance model performance or interpretability.

One simple but powerful example is creating a new column that provides more insight than the raw data alone. For instance, to understand how costly each prescription claim is, we can divide Tot_Drug_Cst (total drug cost) by Tot_Clms (total number of claims) to compute a new column called Cost_per_claim:

diabetes_prescriber_data["Cost_per_claim"] = diabetes_prescriber_data["Tot_Drug_Cst"] / diabetes_prescriber_data["Tot_Clms"]

This gives us a per-claim cost that may be more meaningful for comparison across prescribers or regions.

Another useful technique is mapping existing values into broader categories. For example, individual state abbreviations in the Prscrbr_State_Abrvtn column can be mapped to broader U.S. Census regions. This can help us spot regional trends more easily than looking at each state individually:

state_region_map = {
    "CT": "Northeast", "CA": "West", "TX": "South", ...
}

diabetes_prescriber_data["Prscrbr_State_Region"] = diabetes_prescriber_data["Prscrbr_State_Abrvtn"].map(state_region_map)

By grouping states into regions, we reduce granularity while retaining useful geographic patterns, a valuable preprocessing step before modeling.

However, not all states in the dataset may appear in the dictionary. In those cases, .map() returns a missing value (NaN). To handle this, you can fill in those missing values with a default label like "Missing" using .fillna("Missing").
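For example, a minimal sketch of that fill step, chaining .fillna() onto the mapping (assuming the state_region_map dictionary from the snippet above):

# Any state code not found in state_region_map becomes NaN after .map();
# replace those NaN values with an explicit "Missing" label
diabetes_prescriber_data["Prscrbr_State_Region"] = (
    diabetes_prescriber_data["Prscrbr_State_Abrvtn"]
    .map(state_region_map)
    .fillna("Missing")
)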

Value Counts

Finally, it is often useful to explore the distribution of values in a column. For example, how many rows fall into each region or how many are Semaglutide prescriptions. You can do this using .value_counts() for counts and .value_counts(normalize=True) for proportions.
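For instance, a quick sketch (assuming the Prscrbr_State_Region and Semaglutide_drug columns created above already exist):

# Counts of rows per region, and the share of rows that are Semaglutide prescriptions
print(diabetes_prescriber_data["Prscrbr_State_Region"].value_counts())
print(diabetes_prescriber_data["Semaglutide_drug"].value_counts(normalize=True))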

These basic data preparation steps (creating binary indicators, engineering features, mapping categories, and counting values) are critical for setting up a dataset for modeling. Now, you are ready for exploratory analysis.

Before building the model, we’ll create our target variable, add informative features, and prepare categorical groupings.

Let’s begin by loading the dataset and exploring its first few rows.

import pandas as pd
diabetes_prescriber_data = pd.read_csv("/anvil/projects/tdm/data/CMS/diabetes_prescriber_data.csv")
diabetes_prescriber_data.head()
Prscrbr_State_Abrvtn  Prscrbr_Type                         Brnd_Name  Tot_Clms  Tot_30day_Fills  Tot_Day_Suply  Tot_Drug_Cst  Total_Patients  Gnrc_Name    Prscrbr_Type_Grouped
KY                    Family Practice                      Ozempic          18             19.7            568      18681.38             NaN  Semaglutide  Primary Care
CA                    Endocrinology                        Ozempic         113            230.7           6828     236624.18            25.0  Semaglutide  Endocrinology
MA                    Rheumatology                         Ozempic          11             11.9            342      10413.53             NaN  Semaglutide  GI/Renal/Rheum
TX                    Internal Medicine                    Ozempic         107            130.3           3765     128932.39            20.0  Semaglutide  Primary Care
OH                    Certified Clinical Nurse Specialist  Ozempic          21             29.0            840      29104.96             NaN  Semaglutide  Advanced Practice

We’ll now define our target variable: Semaglutide_drug, which equals 1 when the generic name is Semaglutide and 0 otherwise.

diabetes_prescriber_data["Semaglutide_drug"] = (diabetes_prescriber_data["Gnrc_Name"] == "Semaglutide").astype(int)
print(diabetes_prescriber_data["Semaglutide_drug"].value_counts())

Semaglutide_drug  count
0                 40065
1                  8421

Next, we’ll engineer new columns that reveal more about prescriber behavior. One useful metric is cost per claim:

diabetes_prescriber_data['Cost_per_claim'] = (
    diabetes_prescriber_data['Tot_Drug_Cst'] / diabetes_prescriber_data['Tot_Clms']
)
diabetes_prescriber_data[['Tot_Drug_Cst','Tot_Clms','Cost_per_claim']].head()
Tot_Drug_Cst  Tot_Clms  Cost_per_claim
18681.38            18     1037.854444
236624.18          113     2094.019292
10413.53            11      946.684545
128932.39          107     1204.975607
29104.96            21     1385.950476

We can also group prescribers by U.S. region for easier interpretation.

state_region_map = {
    "CT": "Northeast","ME": "Northeast","MA": "Northeast","NH": "Northeast",
    "NJ": "Northeast","NY": "Northeast","PA": "Northeast","RI": "Northeast","VT": "Northeast",
    "IL": "Midwest","IN": "Midwest","IA": "Midwest","KS": "Midwest","MI": "Midwest","MN": "Midwest",
    "MO": "Midwest","NE": "Midwest","ND": "Midwest","OH": "Midwest","SD": "Midwest","WI": "Midwest",
    "AL": "South","AR": "South","DE": "South","DC": "South","FL": "South","GA": "South","KY": "South",
    "LA": "South","MD": "South","MS": "South","NC": "South","OK": "South","SC": "South","TN": "South",
    "TX": "South","VA": "South","WV": "South",
    "AK": "West","AZ": "West","CA": "West","CO": "West","HI": "West","ID": "West","MT": "West",
    "NV": "West","NM": "West","OR": "West","UT": "West","WA": "West","WY": "West",
    "PR": "Territory","VI": "Territory","GU": "Territory","MP": "Territory","AS": "Territory",
    "AA": "Military","AE": "Military","AP": "Military","ZZ": "Unknown"
}
diabetes_prescriber_data["Prscrbr_State_Region"] = (
    diabetes_prescriber_data["Prscrbr_State_Abrvtn"].map(state_region_map)
)
diabetes_prescriber_data["Prscrbr_State_Region"].value_counts(dropna=False)
Prscrbr_State_Region  count
South                 17785
Midwest               10571
West                  10170
Northeast              9559
Territory               391
Military                  7
Unknown                   3

At this stage, we’ve created a structured dataset suitable for exploratory data analysis.

2. Exploratory Data Analysis

Before we build any models, we need to take a step back and get to know our data. This is the purpose of exploratory data analysis (EDA). You look for patterns, inconsistencies, and clues that help you understand how different variables relate to the outcome you’re trying to predict. In our case, that outcome is Semaglutide_drug, a column that tells us whether or not a prescription record involves the diabetes drug Semaglutide.

Each of the questions below helps uncover different aspects of the data structure and variability that may influence our model’s performance or interpretability.

Missing Values

We will start by checking for missing values in the numeric columns. This might seem minor, but it is actually really important. If a feature is missing data for just a few rows, we might be able to fill it in or move on. But if half the values are missing, it could be a sign that the variable is not reliable for modeling.

To calculate the percentage of missing values, we can use the .isna() method to find missing entries, then apply .mean() to calculate the proportion of missing values in each column. Multiplying this by 100 gives us the percentage. You can also use .sum() if you want the raw count.
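A minimal sketch of that check, using the numeric columns we care about (the exact column list is up to you):

numeric_cols = ["Cost_per_claim", "Tot_30day_Fills", "Tot_Day_Suply", "Total_Patients"]

# Proportion of missing values per column, converted to a percentage
missing_pct = diabetes_prescriber_data[numeric_cols].isna().mean() * 100
print(missing_pct.round(2))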

Group Statistics: Mean and Standard Deviation

Next, we will split the data into two groups: one where Semaglutide was prescribed and one where it was not. Then we will calculate the mean (average) and standard deviation (a measure of spread) for key numeric features in each group. This gives us insight into whether, for example, Semaglutide is associated with higher costs or larger patient counts. These statistics can help us start forming hypotheses, like whether prescribers who use Semaglutide tend to prescribe more 30-day fills or serve more patients.

To do this, we can use .groupby("Semaglutide_drug") followed by .agg(["mean", "std"]) on the numeric columns. This will create a side-by-side comparison table showing the average and variability for each group.
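A short sketch of that comparison (reusing the numeric_cols list from the previous snippet):

# Mean and standard deviation of each numeric feature, split by the target
group_summary = (
    diabetes_prescriber_data
    .groupby("Semaglutide_drug")[numeric_cols]
    .agg(["mean", "std"])
)
print(group_summary)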

Correlation Between Numeric Variables

We will also explore how the numeric features relate to one another using correlation. Correlation is a measure that ranges from -1 to 1 and tells us how strongly two variables move together. A value close to 1 means they move together in the same direction, while -1 indicates that one tends to decrease as the other increases. A value near 0 suggests no clear relationship.

Heatmap Visualization

To visualize these relationships, we will use a heatmap. A heatmap is a color-coded grid where darker or brighter colors represent stronger relationships. This allows us to quickly see which variables are closely linked and potentially redundant.

To create the heatmap, we first use .corr() on the numeric columns to compute all pairwise correlations. Then we pass that matrix into sns.heatmap(), a Seaborn function that creates the visualization. By setting annot=True, we can print the correlation values directly on the plot, which makes it easier to interpret.

Geographic Patterns

Lastly, we will examine regional prescribing patterns. We want to know: do prescribers in certain regions prescribe Semaglutide more often? A good way to check this is with a count plot, which shows how many records come from each region and whether Semaglutide was prescribed.

Using sns.countplot(), we can plot the number of prescribers in each Prscrbr_State_Region and split the bars by Semaglutide_drug using the hue parameter. This lets us compare across regions in one chart and spot any geographic trends in prescription behavior.

Learn more about count plots and how to use hue to split categories: Seaborn countplot() Documentation

To explore relationships between numeric variables (like cost and total fills), we can use .corr() to compute pairwise correlations and sns.heatmap() to visualize them as a color-coded matrix.

See how to create heatmaps from correlation matrices: Seaborn heatmap() Documentation

Before modeling, we’ll explore patterns, missing values, and relationships between variables.

Check missing numeric data:

numeric_cols = ['Cost_per_claim','Tot_30day_Fills','Tot_Day_Suply','Total_Patients']
print(diabetes_prescriber_data[numeric_cols].isna().sum())

Next, compare group averages to spot differences between prescribers who do and don’t prescribe Semaglutide.

summary_stats = diabetes_prescriber_data.groupby("Semaglutide_drug")[numeric_cols].mean()
summary_stats
Semaglutide_drug  Cost_per_claim  Tot_30day_Fills  Tot_Day_Suply  Total_Patients
0                     151.209056       105.228559    2694.060552       34.224819
1                    1293.350409                …              …               …

Let’s visualize correlation between numeric variables using a heatmap.

import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = diabetes_prescriber_data[numeric_cols].corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix of Numeric Features")
plt.tight_layout()
plt.show()

Finally, explore geographic patterns:

plt.figure(figsize=(10,5))
sns.countplot(data=diabetes_prescriber_data, x='Prscrbr_State_Region', hue='Semaglutide_drug')
plt.title('Semaglutide Prescriptions by Region')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

3. Splitting Data for Modeling

In predictive modeling, one of the first steps is to distinguish between predictors (also known as features or independent variables) and the response (or target). The predictors are the pieces of information the model will use to make its decisions, while the response is the variable we wish to predict. In this context, we are interested in predicting whether a prescriber issued a prescription for Semaglutide, a binary outcome that will form the basis of our classification model.

Splitting the Data

Models are not trained on entire datasets. Instead, we partition the data into multiple subsets to serve distinct roles in the model development process. The most common partitioning scheme involves three subsets:

  • Training data is what the model actually learns from. It’s used to find patterns and relationships between the features and the target.

  • Validation data helps us make decisions about the model such as choosing which features to keep or which settings (hyperparameters) work best. We use it to check how well it’s doing while we’re still building it.

  • Test data is completely held out until the very end. It gives us a final check to see how well the model is likely to perform on brand-new data it has never seen before.

Understanding the Subsets

In supervised learning, our dataset is split into predictors (X) and a target variable (y). We further divide these into training, validation, and test subsets to properly evaluate model performance and prevent overfitting.

Here is what each of these variables means:

Training
  • X_train: Contains only features such as total patients, cost per claim, and prescriber type. Does not include the Semaglutide outcome.
  • y_train: Binary values (0 or 1) indicating whether the prescription was for Semaglutide.

Validation
  • X_val: Same features as the training data. Used to evaluate the model during development.
  • y_val: Binary outcome labels (0 or 1) used to evaluate model performance on the validation set.

Test
  • X_test: Held-out feature data. Never seen by the model during training or tuning.
  • y_test: Final set of target labels (0 or 1) used for unbiased model evaluation on the test set.

These splits are crucial to simulate how the model will perform in real-world settings and ensure that we’re not simply memorizing the data.

In practice, it’s recommended to use cross-validation, which provides a more reliable estimate of a model’s performance by repeatedly splitting the data into training and validation sets and averaging the results. This helps reduce the variability that can come from a single random split. However, for this project, we will only perform a single random train/validation/test split using a fixed random seed.
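For reference, here is a rough sketch of what stratified 5-fold cross-validation could look like. It is not part of this project’s workflow, and X_encoded is a hypothetical, fully numeric (already one-hot encoded) feature matrix:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the Semaglutide/non-Semaglutide ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_encoded, y, cv=cv, scoring="roc_auc")
print(cv_scores.mean(), cv_scores.std())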

Stratified Sampling

One subtle but essential consideration is that we must maintain the distribution of the response variable, particularly in classification settings with imbalanced classes. To achieve this, we use stratified sampling, which ensures that the proportion of cases (e.g., Semaglutide = 1 vs. 0) remains consistent across the training, validation, and test sets. This prevents the model from performing poorly simply because one subset happens to have a different class balance than the others.

Finally, it is good to inspect each of the resulting subsets. How many observations are in each split? Is the class balance preserved? These simple diagnostics are foundational checks that ensure the integrity of downstream modeling efforts which you will perform in the questions below.

To evaluate the model fairly, we’ll split data into train, validation, and test sets.

from sklearn.model_selection import train_test_split

model_features = ["Tot_30day_Fills","Tot_Day_Suply","Cost_per_claim","Total_Patients",
                  "Prscrbr_State_Region","Prscrbr_Type_Grouped"]

X = diabetes_prescriber_data[model_features]
y = diabetes_prescriber_data["Semaglutide_drug"]

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42)

Check that class proportions remain balanced across subsets:

for name, arr in [("Train", y_train), ("Validation", y_val), ("Test", y_test)]:
    print(f"\n{name} rows:", len(arr))
    print(arr.value_counts(normalize=True))
Dataset     Rows    Class Proportion (Semaglutide_drug)
Train       29,091  0: 0.826304    1: 0.173696
Validation   9,697  0: 0.826338    1: 0.173662
Test         9,698  0: 0.826356    1: 0.173644

4. Data Preprocessing

Before we can fit our logistic regression model, we need to make sure our dataset is clean and formatted correctly. This stage, called preprocessing, ensures that our features are in a numerical format, have no missing values, are properly scaled, and are aligned across all datasets. Logistic regression, like many models, assumes that the data has been prepared in a certain way. If we skip these steps or do them incorrectly, our model may perform poorly or fail to train altogether.

These questions walk you through five key preprocessing steps, some of which have partially completed code to help guide you.

Handling Missing Values in Categorical Variables

Missing values can cause errors during modeling and interfere with scaling or encoding. For categorical columns like Prscrbr_State_Region and Prscrbr_Type_Grouped, we’ll fill in missing values with the string "Missing". This way, even rows with unknown data are retained and can be captured as their own category during encoding.

For numeric columns like Tot_30day_Fills, Tot_Day_Suply, Cost_per_claim, and Total_Patients, we’ll fill missing values using the median from the training set. This is preferred over the mean because the median is less sensitive to outliers. You’ll use .fillna() to perform this replacement.

For one-hot encoded (binary) columns, we’ll fill missing values with 0. These columns represent the presence or absence of a category, so 0 safely indicates that the feature was not activated for that row.

One-Hot Encoding Categorical Variables

Machine learning models can’t interpret text categories directly. We convert them into numeric form using one-hot encoding, which creates a separate binary column for each unique category. You may also hear these columns called dummy variables. For example, the column Prscrbr_State_Region might be transformed into:

  • Prscrbr_State_Region_Midwest

  • Prscrbr_State_Region_South

  • Prscrbr_State_Region_Northeast

  • etc.

We use pd.get_dummies() to apply one-hot encoding. The option drop_first=True tells pandas to omit the first category for each variable; this prevents perfectly collinear (duplicate) dummy columns, which is especially important in models like logistic regression.

Why We Encode Train, Validation, and Test Separately

We always apply one-hot encoding to X_train first. That’s because we want the model to learn from the structure of the training data, including which categories exist. We then apply the same process to X_val and X_test — but here’s the tricky part:

  • These datasets may contain a different set of categories (some categories might be missing, or new ones might appear).

  • If we encoded all three together, we would risk leaking information from validation or test sets into training — which we want to avoid to ensure fair model evaluation.

To resolve this, we:

  1. Encode each dataset separately using pd.get_dummies().

  2. Then use .reindex(columns=encoded_columns, fill_value=0) on X_val and X_test to ensure their columns match the training set exactly — any missing columns will be added with `0`s.

This guarantees that the model sees inputs with the same structure at all stages (training, validation, testing), even if the underlying data varies.

Standardizing Numeric Features

Features that are on very different numeric scales can cause issues for models like logistic regression. For example, Tot_Day_Suply might be in the hundreds while Cost_per_claim could be in the thousands. If we don’t scale them, the model might assign disproportionate importance to the larger features.

To address this, we use StandardScaler() from sklearn.preprocessing. This function subtracts the mean and divides by the standard deviation, resulting in a column with mean 0 and standard deviation 1. We fit the scaler on X_train[numeric_cols], and apply the transformation to X_train, X_val, and X_test.

Converting Boolean Columns

Some features may be stored as True/False. Most models, including logistic regression, expect numeric input. We use .astype(int) to convert all Boolean columns into 1/0 format, which the model can interpret as binary indicators.
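A minimal sketch of that conversion, applied to whatever Boolean columns pd.get_dummies may have produced:

# Convert any True/False columns into 1/0 integers so the model sees numeric input
for df in [X_train, X_val, X_test]:
    bool_cols = df.select_dtypes(include="bool").columns
    df[bool_cols] = df[bool_cols].astype(int)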

Final Structure Check

After all these steps, it’s important to verify that X_train, X_val, and X_test all have the same number and order of columns. This ensures the model receives a consistent input structure during training and evaluation.

Now we’ll clean and standardize features for modeling.

  1. Fill missing values for categorical and numeric columns.

  2. Apply one-hot encoding to categorical columns.

  3. Standardize numeric variables.

  4. Ensure train/validation/test sets share identical column structures.

categorical_cols = ['Prscrbr_State_Region','Prscrbr_Type_Grouped']
numeric_cols = ['Tot_30day_Fills','Tot_Day_Suply','Cost_per_claim','Total_Patients']

# Fill categorical missing values with an explicit "Missing" category and
# numeric missing values with the training-set medians
train_medians = X_train[numeric_cols].median()
for df in [X_train, X_val, X_test]:
    for col in categorical_cols:
        df[col] = df[col].fillna("Missing")
    df[numeric_cols] = df[numeric_cols].fillna(train_medians)

X_train = pd.get_dummies(data=X_train, columns=categorical_cols, drop_first=True)
X_test = pd.get_dummies(data=X_test, columns=categorical_cols, drop_first=True)
X_val  = pd.get_dummies(data=X_val,  columns=categorical_cols, drop_first=True)

encoded_columns = X_train.columns
X_test = X_test.reindex(columns=encoded_columns, fill_value=0)
X_val  = X_val.reindex(columns=encoded_columns, fill_value=0)

Standardize numeric variables:

from sklearn.preprocessing import StandardScaler

numeric_cols = ['Tot_30day_Fills','Tot_Day_Suply','Cost_per_claim','Total_Patients']
scaler = StandardScaler()

X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols]   = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])

Finally, confirm column consistency:

print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)

X_train shape: (29091, 26)

X_val shape: (9697, 26)

X_test shape: (9698, 26)

5. Model Building and Evaluation

Logistic Regression and the Sigmoid Function

In binary classification problems, our goal is to predict the probability of a binary outcome, such as success/failure or 1/0. Unlike linear regression, which can produce any real number, logistic regression bounds the output between 0 and 1 by applying the sigmoid function. This lets us model probabilities directly using the equation:

$p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

where

  • $p$ is the predicted probability of success (e.g., prescribing Semaglutide)

  • $\beta_0$ is the intercept

  • $\beta_1$ is the coefficient for the input variable $X$

  • $e$ is Euler’s number (approximately 2.718)

The result is an S-shaped curve that flattens near 0 and 1, making it ideal for modeling probabilities.


Why can’t this equation give probabilities outside of 0 to 1?

No matter what value $X$ takes, the exponentiated term is always positive.

  • As $\beta_0 + \beta_1 X$ increases, $e^{-(\beta_0 + \beta_1 X)}$ shrinks toward 0, pushing $p$ closer to 1.

  • As $\beta_0 + \beta_1 X$ decreases, $e^{-(\beta_0 + \beta_1 X)}$ grows, pushing $p$ closer to 0.

So the sigmoid function always produces values strictly between 0 and 1.

Log Odds (Logit) Transformation

Modeling probability with a linear equation (like in linear regression) does not work because probabilities must stay between 0 and 1. To make logistic regression behave like linear regression, we apply a transformation to the probability using log-odds, or the logit function:

$\log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 X$

where

  • $\dfrac{p}{1 - p}$ is called the odds: the probability of success divided by the probability of failure.

  • $\log\left(\dfrac{p}{1 - p}\right)$ is the log-odds, which maps probabilities (between 0 and 1) to the entire real number line.

If odds = 4, that means the event is 4 times more likely to happen than not. In other words, the probability of success is 4× greater than the probability of failure.
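As a quick check of the conversion between odds and probability:

$p = \dfrac{\text{odds}}{1 + \text{odds}} = \dfrac{4}{1 + 4} = 0.8, \qquad \dfrac{p}{1 - p} = \dfrac{0.8}{0.2} = 4$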

Three Equivalent Forms of the Logistic Model

Form                   Expression
Log-odds (logit)       $\log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 X$
Odds                   $\dfrac{p}{1 - p} = e^{\beta_0 + \beta_1 X}$
Probability (sigmoid)  $p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

Each form is mathematically equivalent, and which one you use depends on the context:

  • Use log-odds when modeling or interpreting coefficients.

  • Use odds when communicating risk ratios.

  • Use probability when making predictions.

Key Features of the Logistic Curve

  1. It always produces outputs between 0 and 1, making it ideal for probability modeling.

  2. The log-odds transformation allows us to model the predictors in a linear way, just like in linear regression.

How to Interpret Coefficients

In a logistic regression model, each coefficient (β) represents the change in the log-odds of the outcome for a one-unit increase in the predictor, holding all else constant.

Interpretation Type              What It Means
Raw Coefficient (β)              A one-unit increase in X increases the log-odds of the outcome by β.
Exponentiated Coefficient (e^β)  A one-unit increase in X multiplies the odds of the outcome by e^β. This is called the odds ratio.
Odds Ratio > 1                   The predictor increases the likelihood of the outcome.
Odds Ratio < 1                   The predictor decreases the likelihood of the outcome.
Odds Ratio = 1                   The predictor has no effect on the odds of the outcome.

To interpret a coefficient as an odds ratio, you must exponentiate it: Odds Ratio = e^β

This is especially helpful when explaining or interpreting the results in plain language! For example, if β = 0.75, then e^β ≈ 2.12, meaning a one-unit increase in that predictor multiplies the odds of the outcome by about 2.1, or equivalently increases the odds by roughly 112%.

Checking Multicollinearity with VIF

Before fitting our model, we use Variance Inflation Factor (VIF) to check for multicollinearity:

$\text{VIF}(X_i) = \dfrac{1}{1 - R_i^2}$

where $R_i^2$ is the $R^2$ from a regression of $X_i$ on all of the other predictors. When $R_i^2$ is close to one, $X_i$ is nearly a linear combination of the other predictors, so the VIF becomes large.

A VIF above 10 suggests the variable is highly collinear and may need to be removed.

Feature Selection with AIC and Forward Selection

To reduce the number of features, we use forward selection guided by Akaike Information Criterion (AIC):

AIC = 2·k – 2·log(L),

where

  • k is the number of parameters in the model

  • L is the likelihood of the model

AIC rewards goodness of fit (through the likelihood term) while penalizing complexity (through k). Choosing the model with the smallest AIC therefore favors a model that keeps k low (fewer features) while still achieving a high likelihood.

Forward selection begins with no predictors and adds them one at a time, at each step choosing the variable that leads to the greatest reduction in AIC.

AIC is one of several possible criteria for feature selection. While we are using AIC in this project, you could also use:

  • Adjusted R²: Choose features that increase the model’s explained variance.

  • p-values: Add features that are statistically significant.

  • BIC (Bayesian Information Criterion): Similar to AIC but with a stronger penalty for complexity.

Each criterion has trade-offs. AIC is popular because it balances model fit and complexity, making it a solid choice when comparing logistic regression models. For consistency, we’ll use AIC throughout this project.

Interpreting Model Coefficients with Odds Ratios

Once the model is fit, we convert coefficients into odds ratios to interpret them:

Odds Ratio = exp(β)

Odds Ratio Value  Interpretation
Greater than 1    Increases odds of prescribing Semaglutide
Less than 1       Decreases odds of prescribing Semaglutide
Equal to 1        No effect on the odds

Evaluating Model Performance

Confusion Matrix

A confusion matrix compares the model’s predicted classes with the actual outcomes. It is used to calculate accuracy, precision, recall, and more.

                  Predicted: No (0)     Predicted: Yes (1)
Actual: No (0)    True Negative (TN)    False Positive (FP)
Actual: Yes (1)   False Negative (FN)   True Positive (TP)

  • True Positives (TP): correctly predicted Semaglutide prescribers

  • False Positives (FP): predicted a Semaglutide prescriber, but the prescriber was not one

  • True Negatives (TN): correctly predicted non-prescribers

  • False Negatives (FN): predicted a non-prescriber, but the prescriber actually was one

From the confusion matrix, we can compute key metrics:

Metric                Formula
Accuracy              (TP + TN) / Total
Precision             TP / (TP + FP)
Recall (Sensitivity)  TP / (TP + FN)
Specificity           TN / (TN + FP)
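A short sketch of computing these metrics in code; y_true and y_pred are placeholder names for the actual labels and the model’s 0/1 predictions:

from sklearn.metrics import confusion_matrix

# Unpack the 2x2 confusion matrix for binary labels (0/1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)      # sensitivity
specificity = tn / (tn + fp)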

ROC Curve and AUC

A Receiver Operating Characteristic (ROC) curve plots the tradeoff between:

  • True Positive Rate (Recall)

  • False Positive Rate (1 – Specificity)


It shows how the model performs across all classification thresholds.

The Area Under the Curve (AUC) is a summary metric:

AUC Score  Interpretation
0.5        No better than random guessing
0.7–0.8    Fair performance
0.8–0.9    Strong performance
1.0        Perfect classification

AUC is threshold-independent — it evaluates how well the model ranks positive cases above negative ones, regardless of where we place the 0.5 decision boundary.

You should compute and compare AUC scores for:

  • Training set

  • Validation set

  • Test set

This helps check for overfitting, which occurs when a model learns the noise or specific quirks of the training data rather than the underlying patterns. An overfitted model may perform very well on the training set but poorly on new, unseen data (the validation and test sets!). By evaluating performance on validation and test sets, we can ensure the model generalizes well to other data.

Ready to Model

Now that you’ve reviewed the key concepts, proceed with training your logistic regression model and interpreting the results using this knowledge!

We’ll now fit a logistic regression model, interpret coefficients, and assess performance.

Checking Multicollinearity

import numpy as np
from sklearn.linear_model import LinearRegression

def calculate_vif(X):
    vif_dict = {}
    for feature in X.columns:
        y = X[feature]
        X_pred = X.drop(columns=feature)
        r2 = LinearRegression().fit(X_pred, y).score(X_pred, y)
        vif_dict[feature] = np.inf if r2 == 1 else 1/(1 - r2)
    return pd.Series(vif_dict, name="VIF").sort_values(ascending=False)

vif_values = calculate_vif(X_train.select_dtypes(include=[np.number]))
vif_values
Feature                                         VIF
Tot_30day_Fills                                 948.471582
Total_Patients                                  832.139064
Tot_Day_Suply                                    51.847664
Prscrbr_State_Region_South                        1.704191
Prscrbr_State_Region_West                         1.553571
Prscrbr_State_Region_Northeast                    1.532947
Prscrbr_Type_Grouped_Primary Care                 1.452644
Prscrbr_Type_Grouped_Dental                       1.223323
Prscrbr_Type_Grouped_Missing                      1.213235
Prscrbr_Type_Grouped_Dermatology/Ophthalmology    1.138811
Prscrbr_Type_Grouped_Surgery                      1.111818
Prscrbr_Type_Grouped_Neuro/Psych                  1.094020
Prscrbr_Type_Grouped_Cardiology                   1.091790
Prscrbr_Type_Grouped_GI/Renal/Rheum               1.074066
Prscrbr_Type_Grouped_Other                        1.071768
Cost_per_claim                                    1.058779
Prscrbr_Type_Grouped_Endocrinology                1.058377
Prscrbr_Type_Grouped_Women’s Health               1.050068
Prscrbr_Type_Grouped_Oncology/Hematology          1.043538
Prscrbr_State_Region_Territory                    1.034220
Prscrbr_Type_Grouped_Pulmonary/Critical Care      1.021193
Prscrbr_Type_Grouped_Rehabilitation               1.019222
Prscrbr_Type_Grouped_Anesthesia/Pain              1.014930
Prscrbr_Type_Grouped_Palliative Care              1.002206
Prscrbr_State_Region_Military                     1.001128
Prscrbr_State_Region_Unknown                      1.000924

Remove highly collinear variables (VIF > 10), except Tot_Day_Suply which we’ll retain for interpretability.

# Drop the two variables with extreme VIF values; keep Tot_Day_Suply for interpretability
features_after_vif = [col for col in X_train.columns
                      if col not in ('Tot_30day_Fills', 'Total_Patients')]
X_train = X_train[features_after_vif]

Forward Selection with AIC

import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

def forward_selection(X, y, aic_threshold=20):
    included = []
    current_score = np.inf
    while True:
        changed=False
        excluded = list(set(X.columns) - set(included))
        scores = []
        for new_col in excluded:
            try:
                model = sm.Logit(y, sm.add_constant(X[included + [new_col]])).fit(disp=0)
                scores.append((model.aic, new_col))
            except:
                continue
        if not scores: break
        scores.sort()
        best_new_score, best_candidate = scores[0]
        if current_score - best_new_score >= aic_threshold:
            included.append(best_candidate)
            current_score = best_new_score
            changed=True
        if not changed: break
    return included

selected_features = forward_selection(X_train, y_train)
selected_features

Selected features: ['Cost_per_claim', 'Prscrbr_Type_Grouped_Primary Care', 'Prscrbr_Type_Grouped_GI/Renal/Rheum', 'Tot_Day_Suply', 'Prscrbr_Type_Grouped_Endocrinology', 'Prscrbr_Type_Grouped_Neuro/Psych', 'Prscrbr_Type_Grouped_Missing', 'Prscrbr_Type_Grouped_Surgery', 'Prscrbr_Type_Grouped_Other', "Prscrbr_Type_Grouped_Women’s Health", 'Prscrbr_State_Region_South']

Fitting the Final Model

X_train_final = sm.add_constant(X_train[selected_features])
final_model = sm.Logit(y_train, X_train_final).fit()
print(final_model.summary())
print(f"AIC: {final_model.aic}")
Logit Regression Results

Term                                    coef   std err        z    P>|z|    [0.025, 0.975]
const                                -2.3807     0.038  -63.081    0.000    [-2.455, -2.307]
Cost_per_claim                        2.4191     0.036   66.890    0.000    [2.348, 2.490]
Prscrbr_Type_Grouped_Primary Care     1.2536     0.045   28.164    0.000    [1.166, 1.341]
Prscrbr_Type_Grouped_GI/Renal/Rheum  -4.8014     0.357  -13.449    0.000    [-5.501, -4.102]
Tot_Day_Suply                        -0.7831     0.044  -17.731    0.000    [-0.870, -0.697]
Prscrbr_Type_Grouped_Endocrinology    1.8776     0.148   12.667    0.000    [1.587, 2.168]
Prscrbr_Type_Grouped_Neuro/Psych     -6.7406     0.959   -7.027    0.000    [-8.621, -4.861]
Prscrbr_Type_Grouped_Missing         -1.4901     0.136  -10.951    0.000    [-1.757, -1.223]
Prscrbr_Type_Grouped_Surgery         -2.5203     0.343   -7.348    0.000    [-3.193, -1.848]
Prscrbr_Type_Grouped_Other           -1.6574     0.268   -6.196    0.000    [-2.182, -1.133]
Prscrbr_Type_Grouped_Women’s Health  -2.0414     0.451   -4.524    0.000    [-2.926, -1.157]
Prscrbr_State_Region_South            0.2295     0.043    5.301    …        …

Optimization terminated successfully.

Model Information:
  • Dependent Variable: Semaglutide_drug
  • Observations: 29,091
  • Pseudo R²: 0.4312
  • Log-Likelihood: -7640.2
  • LL-Null: -13431
  • AIC: 15304.48
  • Converged: True
  • Covariance Type: nonrobust
  • LLR p-value: 0.000

Interpreting Coefficients with Odds Ratios

odds_ratios = pd.DataFrame({
    "Odds Ratio": np.exp(final_model.params),
    "P-value": final_model.pvalues
})
odds_ratios["Direction"] = odds_ratios["Odds Ratio"].apply(
    lambda x: "Increases Odds" if x > 1 else ("Decreases Odds" if x < 1 else "No Effect"))
odds_ratios.round(3)
Feature                               Odds Ratio  P-value  Direction
Cost_per_claim                            11.236      0.0  Increases Odds
Prscrbr_Type_Grouped_Endocrinology         6.538      0.0  Increases Odds
Prscrbr_Type_Grouped_Primary Care          3.503      0.0  Increases Odds
Prscrbr_State_Region_South                 1.258      0.0  Increases Odds
Tot_Day_Suply                              0.457      0.0  Decreases Odds
Prscrbr_Type_Grouped_Missing               0.225      0.0  Decreases Odds
Prscrbr_Type_Grouped_Other                 0.191      0.0  Decreases Odds
Prscrbr_Type_Grouped_Women’s Health        0.130      0.0  Decreases Odds
const                                      0.092      0.0  Decreases Odds
Prscrbr_Type_Grouped_Surgery               0.080      0.0  Decreases Odds
Prscrbr_Type_Grouped_GI/Renal/Rheum        0.008      0.0  Decreases Odds
Prscrbr_Type_Grouped_Neuro/Psych           0.001      0.0  Decreases Odds

Confusion Matrices

from sklearn.metrics import confusion_matrix

def display_cm(y_true, y_pred, label):
    cm = confusion_matrix(y_true, y_pred)
    df = pd.DataFrame(cm, index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"])
    print(f"\n{label} Confusion Matrix:\n", df)

X_val_final = sm.add_constant(X_val[selected_features])
X_test_final = sm.add_constant(X_test[selected_features])

train_pred = (final_model.predict(X_train_final) >= 0.5).astype(int)
val_pred   = (final_model.predict(X_val_final)   >= 0.5).astype(int)
test_pred  = (final_model.predict(X_test_final)  >= 0.5).astype(int)

display_cm(y_train, train_pred, "Train")
display_cm(y_val, val_pred, "Validation")
display_cm(y_test, test_pred, "Test")
Train Confusion Matrix
            Predicted 0   Predicted 1
Actual 0          23608           430
Actual 1           1260          3793

Validation Confusion Matrix
            Predicted 0   Predicted 1
Actual 0           7875           138
Actual 1            421          1263

Test Confusion Matrix
            Predicted 0   Predicted 1
Actual 0           7865           149
Actual 1            424          1260

ROC Curves and AUC

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

def plot_roc(y_true, y_proba, label):
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    auc = roc_auc_score(y_true, y_proba)
    plt.plot(fpr, tpr, label=f"{label} (AUC={auc:.2f})")

plt.figure(figsize=(8,6))
plot_roc(y_train, final_model.predict(X_train_final), "Train")
plot_roc(y_val, final_model.predict(X_val_final), "Validation")
plot_roc(y_test, final_model.predict(X_test_final), "Test")
plt.plot([0,1],[0,1],'k--')
plt.title("ROC Curves - Train, Validation, and Test")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid(True)
plt.show()

print("Train AUC:", roc_auc_score(y_train, final_model.predict(X_train_final)))
print("Validation AUC:", roc_auc_score(y_val, final_model.predict(X_val_final)))
print("Test AUC:", roc_auc_score(y_test, final_model.predict(X_test_final)))

6. Making Predictions on New Prescribers

Now that you’ve trained your final model, let’s use it to predict how likely new prescribers are to prescribe Semaglutide.

The Sigmoid Function and Likelihood in Logistic Regression (Semaglutide Example)

Let’s say we have a prescriber with the following characteristics:

  • 22 total 30-day fills

  • $450 per claim

  • Practices in the South

  • Internal Medicine specialist

Using the sigmoid function, the model calculates p = 0.83, meaning there’s an 83% chance this provider prescribes Semaglutide.

So:

  • If this provider did prescribe Semaglutide, their likelihood = 0.83

  • If this provider did not prescribe Semaglutide, their likelihood = 1 – 0.83 = 0.17
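A tiny sketch of that calculation; the linear predictor value z below is made up purely for illustration (the real value comes from the fitted coefficients and this prescriber’s preprocessed features):

import numpy as np

def sigmoid(z):
    # Maps any real number z to a probability strictly between 0 and 1
    return 1 / (1 + np.exp(-z))

z = 1.586          # hypothetical value of beta_0 + beta_1*x_1 + ... for this prescriber
p = sigmoid(z)
print(round(p, 2)) # about 0.83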

Maximum Likelihood Estimation (MLE)

To train the model, we use maximum likelihood estimation.

This means we:

  • Calculate the likelihood for each prescriber in the data (based on whether or not they prescribed Semaglutide).

  • Multiply all the individual likelihoods together to get a total likelihood.

  • Adjust the coefficients (β₀, β₁, etc.) to maximize that total likelihood.

Example:

If the model predicts:

  • Prescriber A: p = 0.75, and they did prescribe → likelihood = 0.75

  • Prescriber B: p = 0.20, and they did not prescribe → likelihood = 0.80

  • Prescriber C: p = 0.55, and they did prescribe → likelihood = 0.55

Then the total likelihood is:

Likelihood = 0.75 × 0.80 × 0.55

We want to find the coefficients that maximize this product across all rows in the dataset.
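A tiny numeric sketch of that product for the three prescribers above (in practice the log of the likelihood is maximized, since multiplying thousands of probabilities quickly underflows):

import numpy as np

p_hat = np.array([0.75, 0.20, 0.55])   # model's predicted probabilities
y_obs = np.array([1, 0, 1])            # 1 = prescribed, 0 = did not prescribe

per_row = np.where(y_obs == 1, p_hat, 1 - p_hat)  # p if y=1, (1-p) if y=0
likelihood = per_row.prod()                        # 0.75 * 0.80 * 0.55 = 0.33
log_likelihood = np.log(per_row).sum()
print(likelihood, log_likelihood)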

Logistic regression finds the coefficients that maximize the likelihood of the observed outcomes.

That’s how we “fit” the model and it’s also how we estimate the best values for the βs in the sigmoid equation.

At this point, our model is trained and evaluated. Let’s see how it performs on new, unseen prescribers.

We’ll sample 20 new records from the test set and compute their predicted probability of prescribing Semaglutide.

sample_prescribers = X_test.sample(n=20, random_state=42)
sample_final = sm.add_constant(sample_prescribers[selected_features])
sample_final = sample_final[final_model.params.index]
sample_preds = final_model.predict(sample_final)

scored_sample = sample_prescribers.copy()
scored_sample["Predicted_Probability"] = sample_preds

scored_sample.sort_values("Predicted_Probability", ascending=False).head(5)
Index  Tot_30day_Fills  Tot_Day_Suply  Cost_per_claim  Total_Patients  Prscrbr_State_Region_Military  Prscrbr_State_Region_Northeast  Prscrbr_State_Region_South  Prscrbr_State_Region_West  Prscrbr_Type_Grouped_Primary Care  Predicted_Probability
26406        -0.060228      -0.181987        1.466532       -0.018389                              0                               0                           0                          1                                  1               0.928453
1897         -0.050658      -0.099769        1.611421       -0.018389                              0                               0                           1                          0                                  0               0.861192
1442          0.009717       0.097126        0.795578       -0.039022                              0                               0                           1                          0                                  1               0.721310
2520         -0.146181      -0.452130        0.708982       -0.018389                              0                               0                           1                          0                                  0               0.479521
37937        -0.074843      -0.199071       -0.315274       -0.044180                              0                               0                           0                          0                                  0                      …

You can now interpret the highest probabilities. Typically, top prescribers share similar characteristics — such as being in Southern or Western regions, having higher cost per claim, or belonging to Primary Care specialties.

This consistent pattern suggests that both geographic and specialty factors influence Semaglutide prescribing behavior.

Conclusion

You’ve completed a full logistic regression pipeline:

  • Data cleaning and feature engineering

  • Exploratory analysis and visualization

  • Data splitting and preprocessing

  • Multicollinearity diagnostics and feature selection

  • Model fitting, evaluation, and interpretation

  • Predictive scoring on unseen prescribers

This workflow represents a robust and interpretable approach to understanding binary outcomes — a method widely used in both healthcare and business analytics.

By mastering this process, you can now apply logistic regression confidently to new datasets where understanding the probability of an event is key.