# FactCheck API Documentation

Complete API documentation for the FactCheck fake news detection system.

## Preprocessing

Location: `src/preprocessing.py`

### TextPreprocessor

Text preprocessing pipeline for news articles.
```python
from src.preprocessing import TextPreprocessor

preprocessor = TextPreprocessor(
    remove_stopwords=True,
    lemmatize=True,
    min_word_length=2
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `remove_stopwords` | bool | `True` | Remove English stopwords |
| `lemmatize` | bool | `True` | Apply lemmatization |
| `min_word_length` | int | `2` | Minimum word length to keep |
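Under the hood, these options amount to fairly standard text-cleaning steps. A standalone sketch of the kind of cleaning involved (the `STOPWORDS` set here is a toy stand-in for the full English list the project presumably uses, and the real implementation may differ):

```python
import re

# Toy stopword list; a real pipeline would use NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "were"}

def clean_text(text, min_word_length=2):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip punctuation and digits
    tokens = [t for t in text.split()
              if t not in STOPWORDS and len(t) >= min_word_length]
    return " ".join(tokens)

print(clean_text("Check out http://example.com!"))  # check out
```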
#### clean_text

Apply text cleaning operations.

```python
clean = preprocessor.clean_text("Check out http://example.com!")
# Returns: "check out"
```

#### preprocess

Apply the full preprocessing pipeline.

```python
processed = preprocessor.preprocess("The quick brown foxes were running.")
# Returns: "quick brown fox running"
```

#### preprocess_dataframe

Preprocess an entire DataFrame column.
```python
df = preprocessor.preprocess_dataframe(df, 'content', 'clean_content')
```

### FeatureExtractor

TF-IDF feature extraction.
```python
from src.preprocessing import FeatureExtractor

extractor = FeatureExtractor(
    method='tfidf',
    max_features=10000,
    ngram_range=(1, 2)
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `method` | str | `'tfidf'` | `'tfidf'` or `'count'` |
| `max_features` | int | `10000` | Maximum vocabulary size |
| `ngram_range` | tuple | `(1, 2)` | N-gram range |
| `min_df` | int | `3` | Minimum document frequency |
| `max_df` | float | `0.95` | Maximum document frequency |
Key operations:

- Fit and transform texts to a feature matrix.
- Transform texts using the fitted vectorizer.
- Get feature names from the vectorizer.
### Data loading

Load and prepare the fake news dataset:

```python
df = load_and_prepare_data('dataset/Fake.csv', 'dataset/True.csv')
```

Create train/validation/test splits:

```python
splits = create_data_splits(df, test_size=0.2, val_size=0.1)
X_train = splits['X_train']
y_train = splits['y_train']
```
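A plausible implementation of such a splitter is two passes of scikit-learn's `train_test_split`; this sketch shows how `test_size=0.2` and `val_size=0.1` yield a 70/10/20 split (assumed behaviour, not the project's exact code):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in range(100)]

# First carve off the test set, then take the validation set
# from the remainder (0.1 of the whole = 0.1/0.8 of the rest).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1 / 0.8, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```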
## Models

Location: `src/models.py`

### FakeNewsClassifier

Main classifier wrapper.

```python
from src.models import FakeNewsClassifier

classifier = FakeNewsClassifier(model_type='logistic_regression')
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_type` | str | `'logistic_regression'` | Model type to use |
| `custom_params` | dict | `None` | Custom model parameters |

Supported model types:

- `logistic_regression`
- `random_forest`
- `linear_svm`
- `naive_bayes`
- `gradient_boosting`
- `adaboost`
- `mlp`
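Internally, a `model_type` string presumably maps to a scikit-learn estimator. A hypothetical sketch of such a registry, including how `custom_params` might be forwarded (`MODEL_REGISTRY` and `build_model` are illustrative names, not the project's API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mapping from model_type strings to estimator classes.
MODEL_REGISTRY = {
    "logistic_regression": LogisticRegression,
    "naive_bayes": MultinomialNB,
    "random_forest": RandomForestClassifier,
}

def build_model(model_type, custom_params=None):
    cls = MODEL_REGISTRY[model_type]
    return cls(**(custom_params or {}))  # forward custom parameters

clf = build_model("random_forest", {"n_estimators": 50})
print(type(clf).__name__)  # RandomForestClassifier
```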
Train the classifier, make predictions, and get prediction probabilities with `fit`, `predict`, and `predict_proba` (shown above).

Evaluate the model on test data:

```python
metrics = classifier.evaluate(X_test, y_test)
print(metrics['accuracy'])  # 0.98
```

The classifier can also be saved to disk and loaded back later.
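Saving and loading most likely pickle the fitted estimator. A minimal round trip using only the standard library (the dict here stands in for a real fitted model):

```python
import os
import pickle
import tempfile

model = {"weights": [0.1, 0.2]}  # stand-in for a fitted classifier

path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Save: serialize the object to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load: restore an equivalent object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```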
### EnsembleModel

Voting ensemble combining multiple classifiers.

```python
from src.models import EnsembleModel

ensemble = EnsembleModel(voting='soft')
ensemble.fit(X_train, y_train)
predictions = ensemble.predict(X_test)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | list | `None` | List of (name, model) tuples |
| `voting` | str | `'soft'` | `'soft'` or `'hard'` voting |
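The two voting modes differ in how base-model outputs are combined: soft voting averages predicted class probabilities, while hard voting takes a majority of predicted labels. A small NumPy illustration of why they can disagree (the numbers are made up):

```python
import numpy as np

# Per-class probabilities for one sample from three base models.
probas = np.array([
    [0.90, 0.10],   # model 1 is very confident in class 0
    [0.40, 0.60],   # models 2 and 3 lean weakly toward class 1
    [0.45, 0.55],
])

# Soft voting: average the probabilities, then pick the argmax.
soft = probas.mean(axis=0).argmax()                 # mean = [0.583, 0.417] -> 0

# Hard voting: each model casts one label vote; majority wins.
hard = np.bincount(probas.argmax(axis=1)).argmax()  # votes [0, 1, 1] -> 1

print(soft, hard)  # 0 1
```

One confident model can outweigh two weak ones under soft voting, which is why soft voting often performs better when the base models produce calibrated probabilities.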
### ModelFactory

Factory for creating model instances.

```python
from src.models import ModelFactory

model = ModelFactory.get_model('random_forest')
all_models = ModelFactory.get_all_models()
```

### ModelEvaluator

Evaluation utilities.
```python
from src.models import ModelEvaluator

metrics = ModelEvaluator.calculate_metrics(y_true, y_pred, y_proba)
report = ModelEvaluator.get_classification_report(y_true, y_pred)
cm = ModelEvaluator.get_confusion_matrix(y_true, y_pred)
```
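The metrics dict plausibly comes from scikit-learn's metric functions; an equivalent standalone sketch on toy labels (the key names are assumptions about the returned dict):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Assemble the same kind of metrics dict the evaluator likely returns.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

print(round(metrics["accuracy"], 3))  # 4 of 6 correct -> 0.667
```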
### train_all_models

Train and evaluate all available models.

```python
results = train_all_models(X_train, y_train, X_test, y_test)
for name, data in results.items():
    print(f"{name}: {data['metrics']['accuracy']:.2%}")
```

### get_feature_importance

Extract feature importance from a trained model.
```python
fake_features, real_features = get_feature_importance(model, features)
```
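For linear models, feature importance is typically read off the coefficient vector: strongly positive weights push toward one class and strongly negative weights toward the other. A toy sketch of that idea (feature names and coefficients are illustrative, not real model output):

```python
import numpy as np

features = np.array(["shocking", "official", "viral", "report"])
coefs = np.array([1.8, -2.1, 0.9, -1.2])  # toy fitted coefficients

# Sort features by coefficient: most negative first, most positive last.
order = np.argsort(coefs)

real_features = features[order[:2]]        # most negative -> "real" indicators
fake_features = features[order[::-1][:2]]  # most positive -> "fake" indicators

print(list(fake_features), list(real_features))
# ['shocking', 'viral'] ['official', 'report']
```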
## Visualization

Location: `src/visualization.py`

Plot a class distribution bar chart:

```python
from src.visualization import plot_class_distribution

plot_class_distribution(y, save_path='class_dist.png')
```

Plot a confusion matrix heatmap:

```python
plot_confusion_matrix(y_test, predictions, save_path='cm.png')
```

Compare model performance:

```python
plot_model_comparison(results, metric='f1', save_path='comparison.png')
```

Plot feature importance scores:

```python
plot_feature_importance(fake_features, 'Fake News Indicators')
```

Generate a word cloud visualization:

```python
plot_wordcloud(fake_texts, 'Fake News Word Cloud')
```

Create a comprehensive results visualization:

```python
create_results_summary(results, save_path='summary.png')
```
## Utilities

Location: `src/utils.py`

### Config

Project configuration.

```python
from src.utils import Config

Config.MAX_FEATURES = 15000  # Modify a setting
config_dict = Config.to_dict()
```

### Helper functions

- Save a model to disk using pickle / load it from disk.
- Save a dictionary to JSON / load JSON into a dictionary.
- Create the necessary project directories.
- Print a formatted banner.
- Print metrics in a formatted table.
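The JSON helpers are presumably thin wrappers over the standard library; a round-trip sketch (the config keys other than `MAX_FEATURES` are illustrative):

```python
import json
import os
import tempfile

config = {"MAX_FEATURES": 15000, "NGRAM_RANGE": [1, 2]}

path = os.path.join(tempfile.gettempdir(), "config.json")

# Save the dictionary to JSON.
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Load it back into a dictionary.
with open(path) as f:
    restored = json.load(f)

print(restored == config)  # True
```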
## Command-line scripts

### train.py

Main training script.

```bash
# Train all models
python train.py

# Train a specific model
python train.py --model gradient_boosting

# Custom output directory
python train.py --output-dir custom_models/
```

| Argument | Description |
|---|---|
| `--model, -m` | Specific model to train |
| `--all, -a` | Train all models |
| `--save-plots` | Save visualizations |
| `--output-dir, -o` | Output directory |
### predict.py

Prediction script.

```bash
# Direct prediction
python predict.py "News article text..."

# From a file
python predict.py --file article.txt

# Interactive mode
python predict.py --interactive
```

| Argument | Description |
|---|---|
| `text` | Article text to classify |
| `--file, -f` | Path to article file |
| `--interactive, -i` | Interactive mode |
| `--model, -m` | Path to model file |
## Error handling

```python
# Model not trained
try:
    predictions = classifier.predict(X)
except ValueError:
    print("Model must be fitted before prediction")

# Model file not found
try:
    predictor = FakeNewsPredictor()
except FileNotFoundError:
    print("Please run train.py first")
```

All functions include type hints for better IDE support:
```python
def predict(self, text: str) -> dict:
    ...

def calculate_metrics(y_true: np.ndarray,
                      y_pred: np.ndarray,
                      y_proba: Optional[np.ndarray] = None) -> Dict[str, float]:
    ...
```