trying to create script to extract and check URLs

Buck Fankers
Posts: 744
Joined: Sat Mar 10, 2018 9:06 pm

Re: trying to create script to extract and check URLs

#11 Post by Buck Fankers »

I know the script below is not even close, and you asked for a bash script, but since I'm taking my first baby steps in Python I thought I'd give it a try, with only partial success. So, just for fun, here is a script.
Your original text is stored in the file before.txt and the result goes to after.txt.

Code: Select all

before = 'before.txt'
after = 'after.txt'

new_file = []

with open(before, 'r') as rf:
    with open(after, 'w') as wf:
        for line in rf:
            elements = line.split(' ')
            for each in elements:
                if 'http' in each:
                    print(each)
                    new_file.append(each)
                    wf.write(each)
I mostly get URLs, but there are parentheses on each side, and the word "platforms." is stuck to the front of some of them, and I have no clue how to get rid of either... Well, this is the result, not useful, but what the heck, I had to give it a try ;-)

Code: Select all

(https://en.wikipedia.org/wiki/32-bit)
(https://antixlinux.com/)
(http://en.wikipedia.org/wiki/Physical_Address_Extension)
(https://en.wikipedia.org/wiki/64-bit_computing)
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
platforms.http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed1.1http://scripts.sil.org/OFL
(https://www.techsupportalert.com/content/32-bit-and-64-bit-explained.htm)
(https://support.apple.com/en-us/HT201948)
(https://support.microsoft.com/en-us/kb/827218)
(https://mxlinux.org/download-links)
(https://mxlinux.org/create-mx-live-usb-windows-using-mx-monthly-snapshot-iso)
(http://www.mxlinux.org/download-links)
(http://en.wikipedia.org/wiki/ISO_9660)
(https://mxlinux.org/videos/torrent)
(https://mxlinux.org/download-links)
(http://en.wikipedia.org/wiki/BitTorrent)
(https://mxlinux.org/wiki/system/iso-download-mirrors)
(https://mxlinux.org/download-links)
(https://en.wikipedia.org/wiki/SHA-2)
(http://www.winmd5.com/)
(https://mxlinux.org/wiki/system/dd-command)
(https://rufus.akeo.ie/)
(https://mxlinux.org/wiki/system/signed-iso-files)
(http://forums.debian.net/)
(https://wiki.debian.org/InstallingDebianOn/Apple)
(http://www.nirsoft.net/utils/product_cd_key_viewer.html)
(http://mxlinux.org/wiki/help-files/help-disk-manager)
(https://www.youtube.com/watch?v=khg6_sdrOBQ)
(https://www.youtube.com/watch?v=lf8eXhCKghg)
(http://gparted.org/display-doc.php?name=help-manual)
(https://en.wikipedia.org/wiki/Universally_unique_identifier)
(http://www.plop.at/)
(https://mxlinux.org/wiki/system/boot-parameters)
(https://mxlinux.org/wiki/system/uefi)
(https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
(https://mxlinux.org/uefi-boot-issues-and-some-settings-check)
(https://www.mxlinux.org/user_manual_mx15/MX_bootloader.html)
(https://www.mxlinux.org/wiki/system/uefi)
(http://en.wikipedia.org/wiki/Linux_startup_process)
(https://mxlinux.org/wiki/system/boot-parameters)
(http://docs.xfce.org/xfce/getting-started)
(https://mxlinux.org/node/177)
(http://gottcode.org/xfce4-whiskermenu-plugin)
(https://wiki.xfce.org/faq)
(http://www.xfce.org/about)
(https://mxlinux.org/my-home-folder-setup-and-disk-manager)
(https://mxlinux.org/mx-linux-17-mx-17-installation-overview-oracle-virtualbox-2017)
(https://mxlinux.org/wiki/system/hibernate)
(https://en.wikipedia.org/wiki/S.M.A.R.T.)
(https://mxlinux.org/wiki/help-files/help-disk-manager)
(https://mxlinux.org/wiki/system/boot-parameters)
(https://mxlinux.org/wiki/system/gnome-keyring)
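
For the leftover parentheses and the "platforms." prefix, one possible approach is to cut each word at the point where "http" starts and then strip the wrapping characters. This is only a sketch, not tested against the real before.txt, and the function name clean_token is made up for this example.

```python
def clean_token(token):
    """Return the part of token starting at 'http', without wrapping parentheses.

    Hypothetical helper for illustration; returns None if token has no 'http'.
    """
    start = token.find('http')
    if start == -1:
        return None
    # Drop anything glued on before the URL (e.g. "platforms." or "("),
    # then strip whitespace/newline and parentheses from both ends.
    return token[start:].strip().strip('()')


print(clean_token('(https://antixlinux.com/)\n'))
print(clean_token('platforms.http://scripts.sil.org/OFL'))
```

URLs that are glued together without any space between them would still come out as one long string with this approach.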

Jerry3904
Administrator
Posts: 21939
Joined: Wed Jul 19, 2006 6:13 am

Re: trying to create script to extract and check URLs

#12 Post by Jerry3904 »

not useful
Quite the opposite, thanks. I only specified bash b/c I have an inkling how it goes, but I suspect from a web search that python is the better way to do it.
Production: 5.10, MX-23 Xfce, AMD FX-4130 Quad-Core, GeForce GT 630/PCIe/SSE2, 16 GB, SSD 120 GB, Data 1TB
Personal: Lenovo X1 Carbon with MX-23 Fluxbox and Windows 10
Other: Raspberry Pi 5 with MX-23 Xfce Raspberry Pi Respin

Buck Fankers
Posts: 744
Joined: Sat Mar 10, 2018 9:06 pm

Re: trying to create script to extract and check URLs

#13 Post by Buck Fankers »

Jerry3904 wrote: Sun Feb 24, 2019 6:08 pm
not useful
Quite the opposite, thanks. I only specified bash b/c I have an inkling how it goes, but I suspect from a web search that python is the better way to do it.
Well, in that case I'm replacing my clumsy code with a regular expression line I found on the internet. It looks like the code now works as it should (and it is also shorter) ;-) Here is the script:

Code: Select all

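#!/usr/bin/env python3
import re

before = 'before.txt'
after = 'after.txt'

new_file = []

with open(before, 'r') as rf:
    with open(after, 'w') as wf:
        for line in rf:
            # Take the first URL on the line; skip lines that have none,
            # because re.search() returns None when nothing matches.
            match = re.search(r"(?P<url>https?://[^\s]+)", line)
            if match is None:
                continue
            url = match.group("url") + '\n'
            new_file.append(url)
            wf.write(url)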
The result now looks like this (as in the previous example, the results are stored in the file after.txt):

Code: Select all

https://en.wikipedia.org/wiki/32-bit)
https://antixlinux.com/)
http://en.wikipedia.org/wiki/Physical_Address_Extension)
https://en.wikipedia.org/wiki/64-bit_computing)
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
https://www.techsupportalert.com/content/32-bit-and-64-bit-explained.htm)
https://support.apple.com/en-us/HT201948)
https://support.microsoft.com/en-us/kb/827218)
https://mxlinux.org/download-links)
https://mxlinux.org/create-mx-live-usb-windows-using-mx-monthly-snapshot-iso)
http://www.mxlinux.org/download-links)
http://en.wikipedia.org/wiki/ISO_9660)
https://mxlinux.org/videos/torrent)
https://mxlinux.org/download-links)
http://en.wikipedia.org/wiki/BitTorrent)
https://mxlinux.org/wiki/system/iso-download-mirrors)
https://mxlinux.org/download-links)
https://en.wikipedia.org/wiki/SHA-2)
http://www.winmd5.com/)
https://mxlinux.org/wiki/system/dd-command)
https://rufus.akeo.ie/)
https://mxlinux.org/wiki/system/signed-iso-files)
http://forums.debian.net/)
https://wiki.debian.org/InstallingDebianOn/Apple)
http://www.nirsoft.net/utils/product_cd_key_viewer.html)
http://mxlinux.org/wiki/help-files/help-disk-manager)
https://www.youtube.com/watch?v=khg6_sdrOBQ)
https://www.youtube.com/watch?v=lf8eXhCKghg)
http://gparted.org/display-doc.php?name=help-manual)
https://en.wikipedia.org/wiki/Universally_unique_identifier)
http://www.plop.at/)
https://mxlinux.org/wiki/system/boot-parameters)
https://mxlinux.org/wiki/system/uefi)
https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
https://mxlinux.org/uefi-boot-issues-and-some-settings-check)
https://www.mxlinux.org/user_manual_mx15/MX_bootloader.html)
https://www.mxlinux.org/wiki/system/uefi)
http://en.wikipedia.org/wiki/Linux_startup_process)
https://mxlinux.org/wiki/system/boot-parameters)
http://docs.xfce.org/xfce/getting-started)
https://mxlinux.org/node/177)
http://gottcode.org/xfce4-whiskermenu-plugin)
https://wiki.xfce.org/faq)
http://www.xfce.org/about)
https://mxlinux.org/my-home-folder-setup-and-disk-manager)
https://mxlinux.org/mx-linux-17-mx-17-installation-overview-oracle-virtualbox-2017)
https://mxlinux.org/wiki/system/hibernate)
https://en.wikipedia.org/wiki/S.M.A.R.T.)
https://mxlinux.org/wiki/help-files/help-disk-manager)
https://mxlinux.org/wiki/system/boot-parameters)
https://mxlinux.org/wiki/system/gnome-keyring)
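
For anyone puzzled by the regular expression in the script, here is a small sketch of what it seems to do, run on made-up sample lines rather than the real before.txt. The name URL_PATTERN is just for this example.

```python
import re

# Same pattern as in the script, written as a raw string so that \s
# reaches the regex engine unchanged.
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")

# "https?" matches http or https; "[^\s]+" then takes every following
# character up to the first whitespace, which is why ")" stays attached.
match = URL_PATTERN.search("see (https://antixlinux.com/) for details")
print(match.group("url"))

# A line without a URL gives None instead of a match object.
print(URL_PATTERN.search("no link on this line"))
```

Note that re.search only reports the first match on a line, and that its result has to be checked for None before calling .group() on it.
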
I'm sure you know how to handle a Python script, but for anyone who may not and would like to use it, here are directions:

1 - save the script above as a text file ending in .py. Use any name, but for the sake of this example, let's use the name: urls.py
2 - in the same folder as urls.py, put the txt file that you would like to extract URLs from and name it: before.txt
3 - open a terminal in the folder where you saved urls.py and type: python3 urls.py

That's it, the results will be stored in the file: after.txt
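
If renaming the input file to before.txt each time gets tedious, the steps above could probably be adapted so that the file names are given on the command line. This is a rough sketch under that assumption; the function name extract_urls and the two-argument usage are my own invention, not part of the script above.

```python
import re
import sys


def extract_urls(in_path, out_path):
    """Write the first URL of each line of in_path to out_path; return the count."""
    pattern = re.compile(r"https?://\S+")
    count = 0
    with open(in_path, 'r') as rf, open(out_path, 'w') as wf:
        for line in rf:
            match = pattern.search(line)
            if match:
                wf.write(match.group(0) + '\n')
                count += 1
    return count


if __name__ == '__main__':
    if len(sys.argv) == 3:
        # Hypothetical usage: python3 urls.py before.txt after.txt
        print(extract_urls(sys.argv[1], sys.argv[2]), 'URLs written')
    else:
        print('usage: python3 urls.py INPUT.txt OUTPUT.txt')
```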

Jerry3904
Administrator
Posts: 21939
Joined: Wed Jul 19, 2006 6:13 am

Re: trying to create script to extract and check URLs

#14 Post by Jerry3904 »

Yup, that works, thanks.

1) For anybody else, remember that after you create the *.py file you have to right-click > Permissions, and check the box to allow it to run
2) In MX we have a python launcher that works with a single click, so the user just has to click once on the script file.
3) The output has a variety of errors (e.g., missing letters, close parentheses, etc.) and inclusions (e.g., extra http://www. at the end of some lines), so I'm not quite there yet.

But this is progress!

Buck Fankers
Posts: 744
Joined: Sat Mar 10, 2018 9:06 pm

Re: trying to create script to extract and check URLs

#15 Post by Buck Fankers »

Jerry3904 wrote: Sun Feb 24, 2019 7:57 pm 2) In MX we have a python launcher that works with a single click, so the user just has to click once on the script file.
Ahh, didn't know this, thanks. That's probably what "Py-Loader" is for, I was wondering. ;-) It is not working for me, but that is my own fault; installing Anaconda probably did it. Well, I don't need it, I'm running short scripts from /bin/ in my home folder so all is OK, but it is good to know.

Buck Fankers
Posts: 744
Joined: Sat Mar 10, 2018 9:06 pm

Re: trying to create script to extract and check URLs

#16 Post by Buck Fankers »

Jerry3904 wrote: Sun Feb 24, 2019 7:57 pm 3) The output has a variety of errors (e.g., missing letters, close parentheses, etc.) and inclusions (e.g., extra http://www. at the end of some lines), so I'm not quite there yet.
The only thing I could change/improve (after 35+ tries, lol, I'm such a beginner) is that I finally got rid of those ")" at the end of most URLs. (It took me a while to realize that the "end of line" also counts as a character; no wonder I couldn't isolate URLs that had ")" at the end...)

Well, now there are no more ")" at the end of most of the URLs: (phew!!!) lol

Code: Select all

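#!/usr/bin/env python3
import re

before = 'before.txt'
after = 'after.txt'

new_file = []

with open(before, 'r') as rf:
    with open(after, 'w') as wf:
        for line in rf:
            # Take the first URL on the line; skip lines that have none.
            match = re.search(r"(?P<url>https?://[^\s]+)", line)
            if match is None:
                continue
            url = match.group("url") + '\n'
            # The newline is the last character, so strip it before checking.
            if url.rstrip().endswith(")"):
                # Drop only the final ")", not every ")" inside the URL.
                url = url.rstrip()[:-1] + '\n'
            new_file.append(url)
            wf.write(url)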
The result is now:

Code: Select all

https://en.wikipedia.org/wiki/32-bit
https://antixlinux.com/
http://en.wikipedia.org/wiki/Physical_Address_Extension
https://en.wikipedia.org/wiki/64-bit_computing
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed
https://www.techsupportalert.com/content/32-bit-and-64-bit-explained.htm
https://support.apple.com/en-us/HT201948
https://support.microsoft.com/en-us/kb/827218
https://mxlinux.org/download-links
https://mxlinux.org/create-mx-live-usb-windows-using-mx-monthly-snapshot-iso
http://www.mxlinux.org/download-links
http://en.wikipedia.org/wiki/ISO_9660
https://mxlinux.org/videos/torrent
https://mxlinux.org/download-links
http://en.wikipedia.org/wiki/BitTorrent
https://mxlinux.org/wiki/system/iso-download-mirrors
https://mxlinux.org/download-links
https://en.wikipedia.org/wiki/SHA-2
http://www.winmd5.com/
https://mxlinux.org/wiki/system/dd-command
https://rufus.akeo.ie/
https://mxlinux.org/wiki/system/signed-iso-files
http://forums.debian.net/
https://wiki.debian.org/InstallingDebianOn/Apple
http://www.nirsoft.net/utils/product_cd_key_viewer.html
http://mxlinux.org/wiki/help-files/help-disk-manager
https://www.youtube.com/watch?v=khg6_sdrOBQ
https://www.youtube.com/watch?v=lf8eXhCKghg
http://gparted.org/display-doc.php?name=help-manual
https://en.wikipedia.org/wiki/Universally_unique_identifier
http://www.plop.at/
https://mxlinux.org/wiki/system/boot-parameters
https://mxlinux.org/wiki/system/uefi
https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface
https://mxlinux.org/uefi-boot-issues-and-some-settings-check
https://www.mxlinux.org/user_manual_mx15/MX_bootloader.html
https://www.mxlinux.org/wiki/system/uefi
http://en.wikipedia.org/wiki/Linux_startup_process
https://mxlinux.org/wiki/system/boot-parameters
http://docs.xfce.org/xfce/getting-started
https://mxlinux.org/node/177
http://gottcode.org/xfce4-whiskermenu-plugin
https://wiki.xfce.org/faq
http://www.xfce.org/about
https://mxlinux.org/my-home-folder-setup-and-disk-manager
https://mxlinux.org/mx-linux-17-mx-17-installation-overview-oracle-virtualbox-2017
https://mxlinux.org/wiki/system/hibernate
https://en.wikipedia.org/wiki/S.M.A.R.T.
https://mxlinux.org/wiki/help-files/help-disk-manager
https://mxlinux.org/wiki/system/boot-parameters
https://mxlinux.org/wiki/system/gnome-keyring
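
On the "end of line is also a character" point: a tiny experiment like the one below shows why checking for ")" at the end fails until the newline is stripped. The Wikipedia address in the second half is just an example I picked to show a URL with parentheses inside it; it is not from the original text.

```python
line = "(https://antixlinux.com/)\n"

# The newline is the last character, so this check is False ...
print(line.endswith(")"))
# ... and becomes True once trailing whitespace is stripped.
print(line.rstrip().endswith(")"))

# repr() makes the otherwise invisible "\n" visible when debugging.
print(repr(line))

# Remove only the trailing ")" and keep any parenthesis inside the URL.
url = "https://en.wikipedia.org/wiki/Bash_(Unix_shell))\n"
cleaned = url.rstrip()
if cleaned.endswith(")"):
    cleaned = cleaned[:-1]
print(cleaned)
```

Slicing off the last character leaves the inner "(Unix_shell)" alone, whereas replace(")", "") would remove every closing parenthesis in the URL.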

Buck Fankers
Posts: 744
Joined: Sat Mar 10, 2018 9:06 pm

Re: trying to create script to extract and check URLs

#17 Post by Buck Fankers »

One more, khm, "improvement".
It may not be an improvement, though.
This version simply skips all those doubled URLs.
Unfortunately you lose all of them (they are all the same); I don't know an easy way to keep a single one and add it to the rest:

These links are now removed:
I understand that I should split them, keep one, and add it to the others, but at the moment I don't know how and I'm unfortunately out of time.

Code: Select all

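#!/usr/bin/env python3
import re

before = 'before.txt'
after = 'after.txt'

new_file = []

with open(before, 'r') as rf:
    with open(after, 'w') as wf:
        for line in rf:
            # Take the first URL on the line; skip lines that have none.
            match = re.search(r"(?P<url>https?://[^\s]+)", line)
            if match is None:
                continue
            url = match.group("url") + '\n'
            # The newline is the last character, so strip it before checking.
            if url.rstrip().endswith(")"):
                # Drop only the final ")", not every ")" inside the URL.
                url = url.rstrip()[:-1] + '\n'
            # Skip entries where two URLs are glued together.
            if 'com/http:' in url:
                continue
            new_file.append(url)
            wf.write(url)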
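
As for splitting the doubled URLs instead of dropping them, something along these lines might work. It is a sketch only, tried on the one glued sample from the output above, and the names split_urls and unique are invented for this example.

```python
import re

# A URL runs from "http(s)://" up to whitespace or the start of the next URL.
URL_PATTERN = re.compile(r"https?://\S+?(?=https?://|\s|$)")


def split_urls(text):
    """Return every URL in text, splitting URLs glued together without spaces."""
    return URL_PATTERN.findall(text)


def unique(items):
    """Drop repeats while keeping the first-seen order."""
    return list(dict.fromkeys(items))


glued = "http://www.ascendercorp.com/http://www.ascendercorp.com/typedesigners.htmlLicensed"
# Feeding the same glued line twice stands in for the repeated lines in the output.
print(unique(split_urls(glued) + split_urls(glued)))
```

The trailing word "Licensed" still sticks to the second URL, because nothing in the text marks where that URL really ends, so that part would need manual cleanup or a rule specific to the source file.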

Jerry3904
Administrator
Posts: 21939
Joined: Wed Jul 19, 2006 6:13 am

Re: trying to create script to extract and check URLs

#18 Post by Jerry3904 »

Thanks again. I'm done with this for now and appreciate your working on it.
