Diabetes: logistic regression and coef_
Binary classification: the patient has diabetes (Outcome = 1) or does not (0). A standard exercise in ML courses, placed after Insurance and before Titanic.
Dependencies: pip install pandas scikit-learn tensorflow (plot_tree in section 4 additionally needs matplotlib)
1. Loading the data
import pandas as pd
# The CSV has no header row, so column names are supplied explicitly.
cols = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
df = pd.read_csv(url, names=cols)
# Class balance: number of rows with Outcome = 0 and Outcome = 1.
print(df["Outcome"].value_counts())
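The class counts printed above set the bar for every accuracy figure that follows: a model that always predicts the majority class already scores the majority share. Below is a minimal sketch of that baseline. The counts 500 and 268 are the figures commonly reported for this dataset and are hard-coded here as an assumption instead of being read from df; majority_baseline is a hypothetical helper name.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

# Assumed class counts (commonly reported: 500 negative, 268 positive).
# With the real data use: df["Outcome"].value_counts().to_dict()
outcome_counts = {0: 500, 1: 268}
baseline = majority_baseline(outcome_counts)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any classifier in the later sections should be judged against this number, not against 50%.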
2. Logistic regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Features and target.
X = df.drop(columns="Outcome")
y = df["Outcome"]
# A stratified 80/20 split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scaling lives inside the pipeline, so it is fit on the training data only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# Coefficients are on the standardized scale, so their magnitudes are comparable.
clf = pipe.named_steps["clf"]
for name, coef in zip(X.columns, clf.coef_[0]):
    print(f"{name:28s} {coef:+.4f}")
Self-check: which feature has the largest absolute coef_, that is, the strongest association with diabetes in this model? Glucose and BMI often come out on top.
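To answer the self-check without reading the list by eye, the coefficients can be sorted by absolute value. Because the features pass through StandardScaler, each coefficient is the change in log-odds for a one standard deviation increase in that feature, and exp(coef) is the matching odds ratio. The sketch below uses made-up coefficient values purely for illustration, not real model output; rank_by_abs_coef is a hypothetical helper name.

```python
import math

def rank_by_abs_coef(names, coefs):
    """Return (name, coef, odds_ratio) tuples sorted by |coef|, largest first."""
    rows = [(n, c, math.exp(c)) for n, c in zip(names, coefs)]
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)

# Hypothetical coefficients for illustration only, not real model output.
# With the fitted pipeline use: rank_by_abs_coef(X.columns, clf.coef_[0])
example_names = ["BloodPressure", "Glucose", "BMI"]
example_coefs = [-0.25, 1.10, 0.70]
ranked = rank_by_abs_coef(example_names, example_coefs)
for name, coef, odds in ranked:
    print(f"{name:28s} {coef:+.4f}  odds ratio per +1 SD: {odds:.2f}")
```

A negative coefficient gives an odds ratio below 1, meaning higher values of that feature go with lower predicted odds of Outcome = 1.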
3. Neural network (Keras) for comparison
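import os
# Must be set before importing TensorFlow to silence its info and warning logs.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
import tensorflow as tf
# Fix the seeds so weight initialization and shuffling repeat between runs (TF 2.7+).
tf.keras.utils.set_random_seed(42)
# Keras has no pipeline here, so scale manually: fit on train, apply to test.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train).astype("float32")
X_te = scaler.transform(X_test).astype("float32")
# Small fully connected network with a sigmoid output for the binary target.
nn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_tr.shape[1],)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
nn.fit(X_tr, y_train, epochs=50, batch_size=32, verbose=0)
# Threshold the predicted probabilities at 0.5 to get class labels.
y_nn = (nn.predict(X_te, verbose=0).ravel() >= 0.5).astype(int)
print("Logistic accuracy:", accuracy_score(y_test, y_pred))
print("Keras NN accuracy:", accuracy_score(y_test, y_nn))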
On small tabular datasets such as this one (768 rows), a neural network does not always beat logistic regression; that is a normal result.
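One way to see why a small gap between the two accuracies should not be over-read: the test split holds only about 154 rows, so each accuracy estimate carries noticeable sampling noise. The sketch below uses the normal approximation for a binomial proportion. The accuracy of 0.75 is an illustrative placeholder, not a measured result, and accuracy_std_error is a hypothetical helper name.

```python
import math

def accuracy_std_error(acc, n_test):
    """Standard error of an accuracy estimate (binomial approximation)."""
    return math.sqrt(acc * (1.0 - acc) / n_test)

n_test = 154          # assumed: 20% of 768 rows, rounded up by train_test_split
example_acc = 0.75    # illustrative value, not a measured result
se = accuracy_std_error(example_acc, n_test)
half_width = 1.96 * se  # approximate 95% interval half-width
print(f"accuracy = {example_acc:.2f} +/- {half_width:.3f}")
```

Differences of a few percentage points between the two models fall inside this interval. For a steadier comparison, sklearn.model_selection.cross_val_score averages the score over several splits.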
4. Decision tree (optional)
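from sklearn.tree import DecisionTreeClassifier
# Trees are insensitive to feature scale, so the unscaled features are used directly.
# max_depth=4 keeps the tree small enough to read and limits overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))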
Visualization: sklearn.tree.plot_tree (requires matplotlib).
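A possible way to render the tree, shown as a sketch only. It trains on synthetic stand-in data from make_classification so that it runs on its own; the feature names f0 to f3, the output file name tree.png, and the demo_tree variable are all assumptions made for the example. With the real data, pass the fitted tree and list(X.columns) instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# Synthetic stand-in data; with the real data use X_train, y_train instead.
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
demo_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)

# Text rendering of the split rules, handy in a terminal.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)

# Graphical rendering, saved to a file instead of shown on screen.
fig, ax = plt.subplots(figsize=(14, 7))
annotations = plot_tree(
    demo_tree, feature_names=feature_names, class_names=["0", "1"],
    filled=True, ax=ax,
)
fig.savefig("tree.png", dpi=100)
plt.close(fig)
```

With filled=True each node is colored by its majority class, which makes it easier to trace the thresholds that lead to a positive prediction.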