
# FakeSentences

Generates plausible-sounding nonsense sentences by training a Markov chain on any plain-text corpus.

Words are stored as nodes in a weighted directed graph. Each node's outgoing edges count how often each next word followed it, and each node records whether its word ever ended a sentence in the training data (`IsLeaf`, `IsNotLeaf`, or `IsMaybeLeaf`). Sentences are generated by walking the graph from a random starting word, choosing each next word weighted by frequency, and stopping at leaf nodes or probabilistically at maybe-leaf nodes.
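The node structure can be sketched as follows. This is a minimal Python illustration, not the project's actual C# code; the field names (`edges`, `ended`, `continued`) are hypothetical stand-ins for the real implementation:

```python
from collections import defaultdict

class Node:
    """One word in the graph (illustrative; field names are hypothetical)."""
    def __init__(self, word):
        self.word = word
        self.edges = defaultdict(int)  # next word -> times it followed this word
        self.ended = False             # word ever ended a sentence in training
        self.continued = False         # word ever appeared mid-sentence

    @property
    def leaf_state(self):
        # Corresponds to IsLeaf / IsNotLeaf / IsMaybeLeaf
        if self.ended and self.continued:
            return "maybe-leaf"
        return "leaf" if self.ended else "not-leaf"
```

A word becomes a maybe-leaf only after it has been observed in both positions, which is why the state can change as more training data arrives.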

## Requirements

The .NET SDK (the `dotnet` CLI is used to build, run, and test the project).

## Building

```shell
dotnet build FakeSentences.sln
```

## Running

```shell
dotnet run --project FakeSentences/FakeSentences.csproj
```

The app prompts for one or more plain-text files to train on, then generates sentences from the combined graph. Two Project Gutenberg texts are included:

| File | Contents |
| --- | --- |
| `FakeSentences/pg11.txt` | *Alice's Adventures in Wonderland*, Lewis Carroll |
| `FakeSentences/pg2591.txt` | *Grimms' Fairy Tales*, Brothers Grimm |

## How it works

Training on three sentences — "The cat sat.", "The cat ran.", "The dog sat." — builds this graph:

```mermaid
graph LR
    ROOT(["ROOT\n(sentence start)"])
    ROOT -->|"3"| the
    the -->|"2"| cat
    the -->|"1"| dog
    cat -->|"1"| sat_a["sat (leaf)"]
    cat -->|"1"| ran["ran (leaf)"]
    dog -->|"1"| sat_b["sat (leaf)"]

    style ROOT fill:#ddd,stroke:#999
    style sat_a fill:#ffe0b2,stroke:#e65100
    style ran   fill:#ffe0b2,stroke:#e65100
    style sat_b fill:#ffe0b2,stroke:#e65100
```

Edge weights are counts. *(leaf)* marks leaf nodes: words that ended a sentence in training. A word seen both mid-sentence and at sentence end becomes a maybe-leaf, and generation stops there with 50% probability, producing shorter, more varied output.

Multiple files train into the same graph — edges from later files simply increment counts on existing nodes, so word-pair frequencies blend across all corpora.
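Blending corpora by accumulating counts can be sketched like this. It is an illustrative Python version (the project itself is C#), and it omits the leaf bookkeeping described above for brevity:

```python
import re
from collections import defaultdict

ROOT = "<ROOT>"  # sentinel node for sentence starts

def train(graph, text):
    """Fold one corpus into an existing graph of word-pair counts.
    Counts accumulate across calls, so later files blend in."""
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = sentence.split()
        if not words:
            continue
        prev = ROOT
        for word in words:
            graph[prev][word] += 1
            prev = word

graph = defaultdict(lambda: defaultdict(int))
train(graph, "The cat sat. The cat ran.")
train(graph, "The dog sat.")   # a second "file" increments existing counts
# graph[ROOT]["the"] is now 3; graph["the"]["cat"] is 2
```

Because the second call increments counts on nodes the first call created, word-pair frequencies reflect all corpora at once, matching the example graph above.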

To generate a sentence the program:

1. Picks a starting word from ROOT's children, weighted by count
2. Follows an edge to the next word, weighted by count
3. Stops at a leaf, or with 50% probability at a maybe-leaf; otherwise repeats step 2

## Sample run

Training on both included texts (*Alice's Adventures in Wonderland* and *Grimms' Fairy Tales*):

```text
Enter training files one per line, then press Enter to start:
FakeSentences/pg11.txt
  -> 'FakeSentences/pg11.txt'
Whole file was read!
Done processing training data
FakeSentences/pg2591.txt
  -> 'FakeSentences/pg2591.txt'
Whole file was read!
Done processing training data

Top 5 most common sentence starting words:
  1: the             (3608)
  2: and             (1525)
  3: a               (992)
  4: he              (941)
  5: you             (745)
The the owner of when the quicker she would that and when the the king the forest. You the father. The workmanship there you as the the dwarf and on the the king the wand and at the the sun. As what you but give the the other. Many a copy a dormouse was sitting between them fast asleep and a piece.
```

## Running tests

```shell
dotnet test FakeSentences.Tests/FakeSentences.Tests.csproj
```
