diff --git a/lessons/lesson_07.ipynb b/lessons/lesson_07.ipynb index c2c5863..8c9c01b 100644 --- a/lessons/lesson_07.ipynb +++ b/lessons/lesson_07.ipynb @@ -18,7 +18,6 @@ " - gzip\n", " - argparse\n", " - math\n", - " - re\n", " - numpy\n", " - pandas\n", "- tidy data" @@ -252,134 +251,6 @@ "print(\"Gamma:\", math.gamma(3)) # Gamma function at x" ] }, - { - "cell_type": "markdown", - "id": "e44e0357", - "metadata": {}, - "source": [ - "# Library re\n", - "- `re` stands for `regular expression`, aka `regex`\n", - "- concept for text pattern matching\n", - "- Python converts the search pattern into a bytestring to search very efficiently in a memory object" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1616d23", - "metadata": {}, - "outputs": [], - "source": [ - "import re\n", - "\n", - "# This text comes from the book \"20,000 Leagues Under the Sea\" by Jules Verne, published in 1870.\n", - "\n", - "story = \"\"\"On the 6th of November, 1867, the frigate Abraham Lincoln departed at 3:00 PM from Brooklyn pier.\n", - "The crew numbered 307 men and officers.\n", - "Captain Farragut had placed a reward of $2,000 for whoever first sighted the creature.\n", - "Professor Aronnax, a marine biologist from Paris, stood at the bow scanning the horizon.\n", - "The animal, if it exists, must be of considerable size — perhaps 200 feet in length.\n", - "The sea was calm; visibility extended roughly 15 nautical miles.\n", - "At latitude 31° 15' N, longitude 136° 42' E, they found nothing.\n", - "After 3 weeks with no sightings, the crew grew restless.\n", - "Then, on November 28th at 11:17 PM, the lookout cried: Object sighted — bearing 315 degrees!\n", - "The creature emitted a pale phosphorescent light and moved at approximately 40 knots.\n", - "Aronnax estimated its mass at no less than 1,500 tons.\n", - "Impossible, said Conseil quietly, and yet — there it is.\"\"\"\n", - "\n", - "# create pattern\n", - "pattern = \".*\\d{1,2}:\\d{2} PM.*\"\n", - "regex = re.compile(pattern)\n", - "\n", - "# search for pattern and print each line with the pattern in it\n", - "print(regex.findall(story))" - ] - }, - { - "cell_type": "markdown", - "id": "93a3e5a6", - "metadata": {}, - "source": [ - "- certain strings have specific meanings:\n", - " - `.*` = any number of any character before/after our pattern except `\\n`, including 0 observations\n", - " - `*` = any number of any character within our pattern, including 0 observations\n", - " - `\\d{1,2}` = one or two digits\n", - "- when compiling the search pattern, you can include certain flags\n", - " - `re.IGNORECASE` to have case insensitive matching\n", - " - `re.DOTALL` to have the `.` match all characters incl. the line end character `\\n`\n", - " - `re.MULTILINE` to handle multiple lines in a string separately, relevant for:\n", - " - `^` = beginning of a string / line\n", - " - `$` = end of a string/line\n", - "- to combine multiple flags, use the vertical line `|`, i.e. `re.DOTALL | re.MULTILINE`" - ] - }, - { - "cell_type": "markdown", - "id": "5e10a102", - "metadata": {}, - "source": [ - "- some special characters in search pattern:\n", - "\n", - "| Character | Meaning |\n", - "| :---: | :--- |\n", - "| . | any character except new line '\\n' |\n", - "| ^ | at the beginning of a string |\n", - "| $ | at the end of a string |\n", - "| * | multiplier >= 0 |\n", - "| + | multiplier >=1 |\n", - "| ? | multiplier 0-1 |\n", - "| {m} | specific multiplier, i.e. {3} |\n", - "| {m,n} | multiplier range, i.e. {2,4}, also {,4} or {4,} for half-open ranges |\n", - "| [ ] | character set to choose from, i.e. [ACGT], special characters become normal characters, i.e. [ab*] |\n", - "| [a-z] | a single lower case letter |\n", - "| [0-9] | a single digit |\n", - "| \\ | escape character, i.e. \\* is an asterisk and not a multiplier |\n", - "| \\| | logical or when combining |\n", - " " - ] - }, - { - "cell_type": "markdown", - "id": "8964ea64", - "metadata": {}, - "source": [ - "- several subfunctions are available for a pattern object\n", - "- below is an overview of the search functions and their result\n", - "- all expect a compiled pattern via `re.compile()` and the string to search in, flags can always be added after the string\n", - "\n", - "\n", - "| Subfunction | Description |\n", - "| :--- | :--- |\n", - "| `pattern.search(string)` | first match object |\n", - "| `pattern.match(string)` | matching object, but tests only the beginning of the string |\n", - "| `pattern.fullmatch(string)` | matching object only if whole string matches, otherwise returns RE |\n", - "| `pattern.findall(string)` | list of match |\n", - "| `pattern.finditer(string)` | iterator over match objects, similar to list of `.findall()` |\n", - "| `pattern.split(string,maxsplit=0)` | splits string based on occurance of the pattern, limited by maxsplit |\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "423d4b35", - "metadata": {}, - "outputs": [], - "source": [ - "sequence = \"\"\"ATGGATAAGAAATACTCAATAGGCTTAGATATCGGCACAAATAGCGTCGGATGGGCGGTGATCACTGATG\n", - "AATATAAGGTTCCGTCTAAAAAGTTCAAGGTTCTGGGAAATACAGACCGCCACAGTATCAAAAAAAATCT\n", - "TATAGGGGCTCTTTTATTTGACAGTGGAGAGACAGCGGAAGCGACTCGTCTCAAACGGACAGCTCGTAGA\"\"\"\n", - "pattern = re.compile(\n", - " \"AT[ACT]GG[ACGT]\"\n", - ") # represents AA sequence 'IG' = Isoleucine + Glycine\n", - "\n", - "print(\"first match\", pattern.search(sequence))\n", - "print(\"match beginning\", pattern.match(sequence))\n", - "print(\"whole string match\", pattern.fullmatch(sequence))\n", - "print(\"list of matches\", pattern.findall(sequence))\n", - "print(\"iterator for matches\", pattern.finditer(sequence))\n", - "print(\"split at matches\", pattern.split(sequence))" - ] - }, { "cell_type": "markdown", "id": "fad61f15", @@ -604,7 +475,7 @@ "metadata": {}, "source": [ "## Why to use numpy?\n", - "- numpy (and scipy) are fast, really fast\n", + "- numpy (also scipy) are fast, really fast\n", "- for demonstration purposes, we will create 10K random numbers and add them together. We will repeat the step for the addition several times and test the performance with a (Jupyter) built-in function `%timeit`\n", "- we will compare numpy with a for loop" ] @@ -1127,24 +998,6 @@ "outputs": [], "source": [] }, - { - "cell_type": "markdown", - "id": "1aa2b2d8", - "metadata": {}, - "source": [ - "- look for any stop codons in your sequence using regular expressions\n", - "- stop codons: UAA, UAG, UGA\n", - "- print to screen the positions for each stop codon" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c56a8cf0", - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "id": "8c660920", diff --git a/lessons/lesson_08.ipynb b/lessons/lesson_08.ipynb index 691802b..e4bdccc 100644 --- a/lessons/lesson_08.ipynb +++ b/lessons/lesson_08.ipynb @@ -19,7 +19,6 @@ " - gzip\n", " - argparse\n", " - math\n", - " - re\n", " - numpy\n", " - pandas\n", "- tidy data" diff --git a/solutions/solutions_07.ipynb b/solutions/solutions_07.ipynb index db3d1e5..ab49d2f 100644 --- a/solutions/solutions_07.ipynb +++ b/solutions/solutions_07.ipynb @@ -35,35 +35,6 @@ "random_sequence = \"\".join(random.choices(list(\"ACGU\"), k=length))" ] }, - { - "cell_type": "markdown", - "id": "1aa2b2d8", - "metadata": {}, - "source": [ - "- look for any stop codons in your sequence using regular expressions\n", - "- stop codons: UAA, UAG, UGA\n", - "- print to screen the positions for each stop codon" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c56a8cf0", - "metadata": {}, - "outputs": [], - "source": [ - "import re\n", - "\n", - "pattern = re.compile(\"UAA|UAG|UGA\")\n", - "\n", - "positions = []\n", - "\n", - "for p in re.finditer(pattern, random_sequence):\n", - " positions.append(p.span())\n", - "\n", - "print(positions)" - ] - }, { "cell_type": "markdown", "id": "8c660920",