
API Reference

Complete API documentation for the FactCheck fake news detection system.

Table of Contents

  1. Preprocessing Module
  2. Models Module
  3. Visualization Module
  4. Utilities Module
  5. Scripts

Preprocessing Module

Location: src/preprocessing.py

TextPreprocessor

Text preprocessing pipeline for news articles.

from src.preprocessing import TextPreprocessor

preprocessor = TextPreprocessor(
    remove_stopwords=True,
    lemmatize=True,
    min_word_length=2
)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| remove_stopwords | bool | True | Remove English stopwords |
| lemmatize | bool | True | Apply lemmatization |
| min_word_length | int | 2 | Minimum word length to keep |

Methods

clean_text(text: str) -> str

Apply text cleaning operations.

clean = preprocessor.clean_text("Check out http://example.com!")
# Returns: "check out"

preprocess(text: str, full_pipeline: bool = True) -> str

Apply the full preprocessing pipeline.

processed = preprocessor.preprocess("The quick brown foxes were running.")
# Returns: "quick brown fox running"

preprocess_dataframe(df, text_column, output_column) -> DataFrame

Preprocess an entire DataFrame column.

df = preprocessor.preprocess_dataframe(df, 'content', 'clean_content')

FeatureExtractor

TF-IDF feature extraction.

from src.preprocessing import FeatureExtractor

extractor = FeatureExtractor(
    method='tfidf',
    max_features=10000,
    ngram_range=(1, 2)
)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| method | str | 'tfidf' | 'tfidf' or 'count' |
| max_features | int | 10000 | Maximum vocabulary size |
| ngram_range | tuple | (1, 2) | N-gram range |
| min_df | int | 3 | Minimum document frequency (absolute count) |
| max_df | float | 0.95 | Maximum document frequency (proportion of documents) |

Methods

fit_transform(texts) -> sparse matrix

Fit and transform texts to feature matrix.

transform(texts) -> sparse matrix

Transform texts using fitted vectorizer.

get_feature_names() -> array

Get feature names from vectorizer.
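
FeatureExtractor follows the usual fit/transform split, so the distinction between the two methods can be sketched against scikit-learn's TfidfVectorizer, which the class presumably wraps (min_df is lowered here so the toy corpus isn't filtered out):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["breaking news today", "markets rally on news", "weather update for today"]
test_texts = ["news on markets today"]

# fit_transform learns the vocabulary from the training texts only
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2), min_df=1, max_df=0.95)
X_train = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary, so train and test matrices align column-for-column
X_test = vectorizer.transform(test_texts)

assert X_train.shape[1] == X_test.shape[1]
```

Calling fit_transform on test data would leak test vocabulary into the features, which is why only transform should be used after fitting.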


Helper Functions

load_and_prepare_data(fake_path, true_path, combine_title_text=True)

Load and prepare the fake news dataset.

df = load_and_prepare_data('dataset/Fake.csv', 'dataset/True.csv')

create_data_splits(df, text_column, label_column, test_size, val_size)

Create train/validation/test splits.

splits = create_data_splits(df, test_size=0.2, val_size=0.1)
X_train = splits['X_train']
y_train = splits['y_train']
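
A plausible reading of how the helper produces three splits is a two-stage call to scikit-learn's train_test_split (the stratification and the rescaling of val_size are assumptions, not verified project behavior):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]

# first carve off the test set, then split validation out of what remains;
# val_size is given as a fraction of the full dataset, so rescale it
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.1 / (1 - 0.2), stratify=y_tmp, random_state=42
)
```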

Models Module

Location: src/models.py

FakeNewsClassifier

Main classifier wrapper.

from src.models import FakeNewsClassifier

classifier = FakeNewsClassifier(model_type='logistic_regression')
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_type | str | 'logistic_regression' | Model type to use |
| custom_params | dict | None | Custom model parameters |

Available Model Types

  • logistic_regression
  • random_forest
  • linear_svm
  • naive_bayes
  • gradient_boosting
  • adaboost
  • mlp
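
These names presumably map onto scikit-learn estimators; the mapping below is an illustrative guess (the exact estimator classes and parameters are assumptions, not the project's verified choices):

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# hypothetical mapping from model_type strings to estimator classes
MODEL_REGISTRY = {
    'logistic_regression': LogisticRegression,
    'random_forest': RandomForestClassifier,
    'linear_svm': LinearSVC,
    'naive_bayes': MultinomialNB,
    'gradient_boosting': GradientBoostingClassifier,
    'adaboost': AdaBoostClassifier,
    'mlp': MLPClassifier,
}

model = MODEL_REGISTRY['random_forest'](n_estimators=100)
```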

Methods

fit(X_train, y_train, X_val=None, y_val=None)

Train the classifier.

predict(X) -> array

Make predictions.

predict_proba(X) -> array

Get prediction probabilities.

evaluate(X, y) -> dict

Evaluate model on test data.

metrics = classifier.evaluate(X_test, y_test)
print(metrics['accuracy'])  # 0.98

save(filepath)

Save model to disk.

load(filepath) -> FakeNewsClassifier

Load model from disk.
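
Since persistence is pickle-based (see save_model in the Utilities Module), a save/load roundtrip likely behaves like the stdlib sketch below, where a dict stands in for the fitted model object:

```python
import os
import pickle
import tempfile

# any picklable object roundtrips the same way a fitted classifier would
model = {'model_type': 'logistic_regression', 'coef': [0.4, -1.2]}

with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
    pickle.dump(model, f)
    path = f.name

with open(path, 'rb') as f:
    restored = pickle.load(f)
os.remove(path)
```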


EnsembleModel

Voting ensemble combining multiple classifiers.

from src.models import EnsembleModel

ensemble = EnsembleModel(voting='soft')
ensemble.fit(X_train, y_train)
predictions = ensemble.predict(X_test)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| models | list | None | List of (name, model) tuples |
| voting | str | 'soft' | 'soft' or 'hard' voting |
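
The difference between the two voting modes can be sketched in plain Python with toy probabilities for one sample and two classes:

```python
# per-model class probabilities for one sample, classes = [real, fake]
probas = [[0.9, 0.1], [0.45, 0.55], [0.45, 0.55]]

# hard voting: each model casts one vote for its most probable class
votes = [max(range(2), key=lambda c: p[c]) for p in probas]
hard = max(set(votes), key=votes.count)   # two of three models vote "fake"

# soft voting: average the probabilities first, then take the argmax
avg = [sum(p[c] for p in probas) / len(probas) for c in range(2)]
soft = max(range(2), key=lambda c: avg[c])  # one confident model outweighs two lukewarm ones
```

Soft voting lets a confident model outweigh several uncertain ones, which is usually why it is the default when all base models expose predict_proba.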

ModelFactory

Factory for creating model instances.

from src.models import ModelFactory

model = ModelFactory.get_model('random_forest')
all_models = ModelFactory.get_all_models()

ModelEvaluator

Evaluation utilities.

from src.models import ModelEvaluator

metrics = ModelEvaluator.calculate_metrics(y_true, y_pred, y_proba)
report = ModelEvaluator.get_classification_report(y_true, y_pred)
cm = ModelEvaluator.get_confusion_matrix(y_true, y_pred)

Helper Functions

train_all_models(X_train, y_train, X_test, y_test, verbose=True)

Train and evaluate all available models.

results = train_all_models(X_train, y_train, X_test, y_test)
for name, data in results.items():
    print(f"{name}: {data['metrics']['accuracy']:.2%}")

get_feature_importance(model, feature_names, top_n=20)

Extract feature importance from trained model.

fake_features, real_features = get_feature_importance(model, features)
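
For linear models, importance presumably comes from the sign and magnitude of the coefficients: large positive weights push toward one class, large negative weights toward the other. A sketch with made-up weights:

```python
# hypothetical word -> coefficient weights from a fitted linear model
coefs = {'hoax': 2.1, 'reuters': -1.8, 'shocking': 1.6, 'said': -1.2, 'unbelievable': 0.9}

ranked = sorted(coefs.items(), key=lambda kv: kv[1], reverse=True)
fake_features = [w for w, c in ranked if c > 0][:20]        # push toward the fake class
real_features = [w for w, c in ranked[::-1] if c < 0][:20]  # push toward the real class
```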

Visualization Module

Location: src/visualization.py

Plotting Functions

plot_class_distribution(labels, class_names=None, save_path=None)

Plot class distribution bar chart.

from src.visualization import plot_class_distribution

plot_class_distribution(y, save_path='class_dist.png')

plot_confusion_matrix(y_true, y_pred, class_names=None, save_path=None)

Plot confusion matrix heatmap.

plot_confusion_matrix(y_test, predictions, save_path='cm.png')

plot_model_comparison(results, metric='accuracy', save_path=None)

Compare model performance.

plot_model_comparison(results, metric='f1', save_path='comparison.png')

plot_feature_importance(features, title, top_n=15, save_path=None)

Plot feature importance scores.

plot_feature_importance(fake_features, 'Fake News Indicators')

plot_wordcloud(texts, title, max_words=100, save_path=None)

Generate word cloud visualization.

plot_wordcloud(fake_texts, 'Fake News Word Cloud')

create_results_summary(results, save_path=None)

Create comprehensive results visualization.

create_results_summary(results, save_path='summary.png')

Utilities Module

Location: src/utils.py

Config Class

Project configuration.

from src.utils import Config

Config.MAX_FEATURES = 15000  # Modify setting
config_dict = Config.to_dict()
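
Config appears to be a plain class-attribute settings holder; a minimal sketch of that pattern (attribute names other than MAX_FEATURES are illustrative, not the project's actual settings):

```python
class Config:
    # illustrative defaults; the real values live in src/utils.py
    MAX_FEATURES = 10000
    NGRAM_RANGE = (1, 2)
    RANDOM_STATE = 42

    @classmethod
    def to_dict(cls):
        # collect only the uppercase settings into a plain dict
        return {k: v for k, v in vars(cls).items() if k.isupper()}

Config.MAX_FEATURES = 15000  # modules that read Config afterwards see the change
```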

Functions

save_model(model, filepath) -> str

Save model to disk using pickle.

load_model(filepath) -> Any

Load model from disk.

save_json(data, filepath) -> str

Save dictionary to JSON.

load_json(filepath) -> dict

Load JSON to dictionary.

ensure_directories()

Create necessary project directories.

print_banner(text, char='=', width=60)

Print formatted banner.

print_metrics(metrics, title='Metrics')

Print metrics in formatted table.
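
save_json and load_json are presumably thin wrappers over the stdlib json module, so a roundtrip behaves like this sketch:

```python
import json
import os
import tempfile

metrics = {'accuracy': 0.98, 'f1': 0.979}

with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    json.dump(metrics, f, indent=2)  # indent is a guess at the wrapper's formatting
    path = f.name

with open(path) as f:
    restored = json.load(f)
os.remove(path)
```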


Scripts

train.py

Main training script.

# Train all models
python train.py

# Train specific model
python train.py --model gradient_boosting

# Custom output directory
python train.py --output-dir custom_models/

Arguments

| Argument | Description |
|----------|-------------|
| --model, -m | Specific model to train |
| --all, -a | Train all models |
| --save-plots | Save visualizations |
| --output-dir, -o | Output directory |
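
The CLI presumably comes from argparse; a sketch matching the arguments listed above (the models/ default value is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(description='Train fake news detection models')
parser.add_argument('--model', '-m', help='Specific model to train')
parser.add_argument('--all', '-a', action='store_true', help='Train all models')
parser.add_argument('--save-plots', action='store_true', help='Save visualizations')
parser.add_argument('--output-dir', '-o', default='models/', help='Output directory')

args = parser.parse_args(['--model', 'gradient_boosting', '--save-plots'])
```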

predict.py

Prediction script.

# Direct prediction
python predict.py "News article text..."

# From file
python predict.py --file article.txt

# Interactive mode
python predict.py --interactive

Arguments

| Argument | Description |
|----------|-------------|
| text | Article text to classify |
| --file, -f | Path to article file |
| --interactive, -i | Interactive mode |
| --model, -m | Path to model file |

Error Handling

Common Exceptions

# Model not trained
try:
    predictions = classifier.predict(X)
except ValueError as e:
    print("Model must be fitted before prediction")

# File not found
try:
    predictor = FakeNewsPredictor()
except FileNotFoundError:
    print("Please run train.py first")

Type Hints

All functions include type hints for better IDE support:

def predict(self, text: str) -> dict:
    ...

def calculate_metrics(y_true: np.ndarray,
                      y_pred: np.ndarray,
                      y_proba: Optional[np.ndarray] = None) -> Dict[str, float]:
    ...