# FactCheck API Documentation

Complete API documentation for the FactCheck fake news detection system.

## Preprocessing

Location: `src/preprocessing.py`

### TextPreprocessor

Text preprocessing pipeline for news articles.
```python
from src.preprocessing import TextPreprocessor

preprocessor = TextPreprocessor(
    remove_stopwords=True,
    lemmatize=True,
    min_word_length=2
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `remove_stopwords` | bool | `True` | Remove English stopwords |
| `lemmatize` | bool | `True` | Apply lemmatization |
| `min_word_length` | int | `2` | Minimum word length to keep |
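Under the hood, these options amount to fairly standard text-cleaning steps. A standalone sketch of the kind of cleaning involved (the `STOPWORDS` set here is a toy stand-in for the full English list the project presumably uses, and the real implementation may differ):

```python
import re

# Toy stopword list; a real pipeline would use NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "were"}

def clean_text(text, min_word_length=2):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip punctuation and digits
    tokens = [t for t in text.split()
              if t not in STOPWORDS and len(t) >= min_word_length]
    return " ".join(tokens)

print(clean_text("Check out http://example.com!"))  # check out
```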
#### clean_text

Apply text cleaning operations.

```python
clean = preprocessor.clean_text("Check out http://example.com!")
# Returns: "check out"
```

#### preprocess

Apply the full preprocessing pipeline.

```python
processed = preprocessor.preprocess("The quick brown foxes were running.")
# Returns: "quick brown fox running"
```

#### preprocess_dataframe

Preprocess an entire DataFrame column.
```python
df = preprocessor.preprocess_dataframe(df, 'content', 'clean_content')
```

### FeatureExtractor

TF-IDF feature extraction.
```python
from src.preprocessing import FeatureExtractor

extractor = FeatureExtractor(
    method='tfidf',
    max_features=10000,
    ngram_range=(1, 2)
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `method` | str | `'tfidf'` | `'tfidf'` or `'count'` |
| `max_features` | int | `10000` | Maximum vocabulary size |
| `ngram_range` | tuple | `(1, 2)` | N-gram range |
| `min_df` | int | `3` | Minimum document frequency |
| `max_df` | float | `0.95` | Maximum document frequency |
Key operations:

- Fit and transform texts to a feature matrix.
- Transform texts using the fitted vectorizer.
- Get feature names from the vectorizer.
### Data loading

Load and prepare the fake news dataset:

```python
df = load_and_prepare_data('dataset/Fake.csv', 'dataset/True.csv')
```

Create train/validation/test splits:

```python
splits = create_data_splits(df, test_size=0.2, val_size=0.1)
X_train = splits['X_train']
y_train = splits['y_train']
```
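A plausible implementation of such a splitter is two passes of scikit-learn's `train_test_split`; this sketch shows how `test_size=0.2` and `val_size=0.1` yield a 70/10/20 split (assumed behaviour, not the project's exact code):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in range(100)]

# First carve off the test set, then take the validation set
# from the remainder (0.1 of the whole = 0.1/0.8 of the rest).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1 / 0.8, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```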
## Models

Location: `src/models.py`

### FakeNewsClassifier

Main classifier wrapper.

```python
from src.models import FakeNewsClassifier

classifier = FakeNewsClassifier(model_type='logistic_regression')
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_type` | str | `'logistic_regression'` | Model type to use |
| `custom_params` | dict | `None` | Custom model parameters |

Supported model types:

- `logistic_regression`
- `random_forest`
- `linear_svm`
- `naive_bayes`
- `gradient_boosting`
- `adaboost`
- `mlp`
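Internally, a `model_type` string presumably maps to a scikit-learn estimator. A hypothetical sketch of such a registry, including how `custom_params` might be forwarded (`MODEL_REGISTRY` and `build_model` are illustrative names, not the project's API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mapping from model_type strings to estimator classes.
MODEL_REGISTRY = {
    "logistic_regression": LogisticRegression,
    "naive_bayes": MultinomialNB,
    "random_forest": RandomForestClassifier,
}

def build_model(model_type, custom_params=None):
    cls = MODEL_REGISTRY[model_type]
    return cls(**(custom_params or {}))  # forward custom parameters

clf = build_model("random_forest", {"n_estimators": 50})
print(type(clf).__name__)  # RandomForestClassifier
```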
Train the classifier, make predictions, and get prediction probabilities with `fit`, `predict`, and `predict_proba` (shown above).

Evaluate the model on test data:

```python
metrics = classifier.evaluate(X_test, y_test)
print(metrics['accuracy'])  # 0.98
```

The classifier can also be saved to disk and loaded back later.
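Saving and loading most likely pickle the fitted estimator. A minimal round trip using only the standard library (the dict here stands in for a real fitted model):

```python
import os
import pickle
import tempfile

model = {"weights": [0.1, 0.2]}  # stand-in for a fitted classifier

path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Save: serialize the object to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load: restore an equivalent object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```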
### EnsembleModel

Voting ensemble combining multiple classifiers.

```python
from src.models import EnsembleModel

ensemble = EnsembleModel(voting='soft')
ensemble.fit(X_train, y_train)
predictions = ensemble.predict(X_test)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | list | `None` | List of (name, model) tuples |
| `voting` | str | `'soft'` | `'soft'` or `'hard'` voting |
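The two voting modes differ in how base-model outputs are combined: soft voting averages predicted class probabilities, while hard voting takes a majority of predicted labels. A small NumPy illustration of why they can disagree (the numbers are made up):

```python
import numpy as np

# Per-class probabilities for one sample from three base models.
probas = np.array([
    [0.90, 0.10],   # model 1 is very confident in class 0
    [0.40, 0.60],   # models 2 and 3 lean weakly toward class 1
    [0.45, 0.55],
])

# Soft voting: average the probabilities, then pick the argmax.
soft = probas.mean(axis=0).argmax()                 # mean = [0.583, 0.417] -> 0

# Hard voting: each model casts one label vote; majority wins.
hard = np.bincount(probas.argmax(axis=1)).argmax()  # votes [0, 1, 1] -> 1

print(soft, hard)  # 0 1
```

One confident model can outweigh two weak ones under soft voting, which is why soft voting often performs better when the base models produce calibrated probabilities.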
### ModelFactory

Factory for creating model instances.

```python
from src.models import ModelFactory

model = ModelFactory.get_model('random_forest')
all_models = ModelFactory.get_all_models()
```

### ModelEvaluator

Evaluation utilities.
```python
from src.models import ModelEvaluator

metrics = ModelEvaluator.calculate_metrics(y_true, y_pred, y_proba)
report = ModelEvaluator.get_classification_report(y_true, y_pred)
cm = ModelEvaluator.get_confusion_matrix(y_true, y_pred)
```
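The metrics dict plausibly comes from scikit-learn's metric functions; an equivalent standalone sketch on toy labels (the key names are assumptions about the returned dict):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Assemble the same kind of metrics dict the evaluator likely returns.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

print(round(metrics["accuracy"], 3))  # 4 of 6 correct -> 0.667
```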
### train_all_models

Train and evaluate all available models.

```python
results = train_all_models(X_train, y_train, X_test, y_test)
for name, data in results.items():
    print(f"{name}: {data['metrics']['accuracy']:.2%}")
```

### get_feature_importance

Extract feature importance from a trained model.
```python
fake_features, real_features = get_feature_importance(model, features)
```
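For linear models, feature importance is typically read off the coefficient vector: strongly positive weights push toward one class and strongly negative weights toward the other. A toy sketch of that idea (feature names and coefficients are illustrative, not real model output):

```python
import numpy as np

features = np.array(["shocking", "official", "viral", "report"])
coefs = np.array([1.8, -2.1, 0.9, -1.2])  # toy fitted coefficients

# Sort features by coefficient: most negative first, most positive last.
order = np.argsort(coefs)

real_features = features[order[:2]]        # most negative -> "real" indicators
fake_features = features[order[::-1][:2]]  # most positive -> "fake" indicators

print(list(fake_features), list(real_features))
# ['shocking', 'viral'] ['official', 'report']
```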
## Visualization

Location: `src/visualization.py`

Plot a class distribution bar chart:

```python
from src.visualization import plot_class_distribution

plot_class_distribution(y, save_path='class_dist.png')
```

Plot a confusion matrix heatmap:

```python
plot_confusion_matrix(y_test, predictions, save_path='cm.png')
```

Compare model performance:

```python
plot_model_comparison(results, metric='f1', save_path='comparison.png')
```

Plot feature importance scores:

```python
plot_feature_importance(fake_features, 'Fake News Indicators')
```

Generate a word cloud visualization:

```python
plot_wordcloud(fake_texts, 'Fake News Word Cloud')
```

Create a comprehensive results visualization:

```python
create_results_summary(results, save_path='summary.png')
```
## Utilities

Location: `src/utils.py`

### Config

Project configuration.

```python
from src.utils import Config

Config.MAX_FEATURES = 15000  # Modify a setting
config_dict = Config.to_dict()
```

### Helper functions

- Save a model to disk using pickle / load it from disk.
- Save a dictionary to JSON / load JSON into a dictionary.
- Create the necessary project directories.
- Print a formatted banner.
- Print metrics in a formatted table.
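The JSON helpers are presumably thin wrappers over the standard library; a round-trip sketch (the config keys other than `MAX_FEATURES` are illustrative):

```python
import json
import os
import tempfile

config = {"MAX_FEATURES": 15000, "NGRAM_RANGE": [1, 2]}

path = os.path.join(tempfile.gettempdir(), "config.json")

# Save the dictionary to JSON.
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Load it back into a dictionary.
with open(path) as f:
    restored = json.load(f)

print(restored == config)  # True
```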
## Command-line scripts

### train.py

Main training script.

```bash
# Train all models
python train.py

# Train a specific model
python train.py --model gradient_boosting

# Custom output directory
python train.py --output-dir custom_models/
```

| Argument | Description |
|---|---|
| `--model, -m` | Specific model to train |
| `--all, -a` | Train all models |
| `--save-plots` | Save visualizations |
| `--output-dir, -o` | Output directory |
### predict.py

Prediction script.

```bash
# Direct prediction
python predict.py "News article text..."

# From a file
python predict.py --file article.txt

# Interactive mode
python predict.py --interactive
```

| Argument | Description |
|---|---|
| `text` | Article text to classify |
| `--file, -f` | Path to article file |
| `--interactive, -i` | Interactive mode |
| `--model, -m` | Path to model file |
## Error handling

```python
# Model not trained
try:
    predictions = classifier.predict(X)
except ValueError:
    print("Model must be fitted before prediction")

# Model file not found
try:
    predictor = FakeNewsPredictor()
except FileNotFoundError:
    print("Please run train.py first")
```

All functions include type hints for better IDE support:
```python
def predict(self, text: str) -> dict:
    ...

def calculate_metrics(y_true: np.ndarray,
                      y_pred: np.ndarray,
                      y_proba: Optional[np.ndarray] = None) -> Dict[str, float]:
    ...
```