lightnovelworld and cloudflare

lightnovelworld is a rather simple scraper, as it doesn't get anything fancy like covers, synopses, ratings or comments.

Pestilence

Recently I read a manhwa that was unfinished, so I decided to read the novel instead. It's quite a short one, around 200 chapters, and it's complete.

I searched my data from novellive, which has 1768649 files and weighs 15GB. After finding the novel I noticed that some chapters don't have all their paragraphs saved, which isn't surprising, as I didn't test the script's output much, being satisfied with a couple of novels.

Instead of fixing an old script I decided to make a new one for some other site. I found webnovel.com, but its bad interface with a multitude of api calls, combined with ridiculous "protection" against users opening the dev tools, was enough to deter me.

I just wanted to make a quick script that would be pleasant to write. Later I found lightnovelworld.com.

Access to it on firefox was blocked for me by cloudflare with an endless verification loop. The issue went away in private mode. The site's interface had a lot of information and didn't use any api calls.

Knowing that such aggressive cloudflare would mean copying cookies from the browser every time I wanted to use the script, I searched github for existing scrapers.

I found Novel-Grabber, and even though I wanted to download just one novel, I couldn't. The cloudflare disease blocked every request; I tried copying the cookies from the browser through its terrible interface, but it didn't help.

Forced to drastic measures, I quickly made the lightnovelworld_old script in bash. It had a lot of problems with being blocked, so I added a waiting time of around 7 seconds for each request. Still, 10% of requests were getting randomly blocked, most of them also got stretched by long connection times, and after about 60 of them I needed to get new cookies.

Chapter files have to be named reasonably to be useful, but to get the title you have to download the page. So if some chapters failed in between, you had to slowly redownload the whole thing anew. This was fixed by using the hash of the url as the file name and, after downloading, renaming each file to its first line, which is the title.
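
That scheme is simple enough to sketch in a few lines of python (illustrative only; the function names are made up and this isn't the actual script):

import hashlib
from pathlib import Path

def save_chapter(novel_dir: Path, url: str, text: str) -> Path:
    # file name is the sha256 hex digest of the chapter url
    path = novel_dir / hashlib.sha256(url.encode()).hexdigest()
    path.write_text(text, encoding="utf-8")
    return path

def rename_to_title(path: Path) -> Path:
    # the first line of a saved chapter is its title
    with path.open(encoding="utf-8") as f:
        title = f.readline().strip()
    return path.rename(path.with_name(title))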

After downloading 80% of the chapters, cloudflare got even more aggressive and only worked for 4 requests at a time.

Unable to progress further, I rewrote it in python. curl executed as a command has no way of keeping the connection alive, so it has to reconnect for every request, and that was triggering the protection.

As I was already having problems with detection, I initially used curl_cffi instead of requests to avoid tls fingerprinting, but it didn't keep the connection alive either, so I switched back to requests.
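
What the python version boils down to is roughly this (a sketch, not the actual code; the user agent, delays and helper name are just placeholders):

import random
import time

import requests

# a single Session reuses the underlying connection instead of
# reconnecting for every request like a curl subprocess does
session = requests.Session()
# cloudflare cares that this matches the browser the cookies come from
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

def fetch(url, wait=7.0, wait_random=1.5, timeout=60):
    time.sleep(wait + random.uniform(0, wait_random))  # conservative pacing
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text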

To automatically get cookies from the browser I used the browser_cookie3 lib. Unfortunately it cannot get cookies from private mode, so I had to fix the endless verification loop on firefox. After resetting the settings and turning off the add-ons, I found that the problem was only with Universal Bypass, since cloudflare for some reason really cares that my browser has the same user agent as the one it serves. It's weird that something like Privacy Badger or NoScript wasn't the cause.
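
Getting the cookies out of the browser is then just a couple of lines (again a sketch, assuming firefox):

import browser_cookie3
import requests

session = requests.Session()
# pull the cloudflare cookies for the site straight out of firefox
session.cookies.update(browser_cookie3.firefox(domain_name="lightnovelworld.com"))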

I don't know why they protect those novels so much; each of those sites stole them from another one anyway, and now they care if some actual browser makes 2 requests too quickly.

Now the script downloads novels without any interruptions, although the waiting time is quite conservative.

If you want your chapters to be named by their titles, run this after downloading:

find . -type f -regextype egrep -regex '.*/[0-9a-z]{64}' | xargs -I {} /bin/sh -c 'mv "{}" "$(dirname "{}")/$(head -n1 "{}")"'

usage

Download a chapter, a novel, or pages of novels from URLs to DIR

lightnovelworld.py --directory DIR URL1 URL2 URL3

Explicitly treat URLs as being of certain types; URL5 and URL6 will be guessed

lightnovelworld.py --chapter URL1 --novel URL2 --chapters URL3 --pages URL4 URL5 URL6

Download a novel. This will create the directory Shadow Slave, and inside it the chapters will be written, named by the sha256 of their urls. Running this command in a directory where the novel has already been downloaded will skip the already downloaded parts.

lightnovelworld.py 'https://www.lightnovelworld.com/novel/shadow-slave-05122222'

Download from URL, guessing its type, waiting 2.5 seconds plus a random wait of up to 1500 milliseconds for each request

lightnovelworld.py --wait 2.5 --wait-random 1500 URL

Download from URL using 4 retries and waiting 60 seconds between them

lightnovelworld.py --retries 4 --retry-wait 60 URL

Download URL with the timeout set to 60 seconds and a custom user-agent

lightnovelworld.py --timeout 60 --user-agent 'I AM NOT A BOT' URL

Choose the browser from which cookies will be extracted (same names as in the browser_cookie3 lib)

lightnovelworld.py --browser firefox URL

Get some help

lightnovelworld.py --help