Sunday, May 12, 2024

Data Scraping

Six months from now, Potions for Muggles will be ten years old. When I started blogging, I didn’t think I would keep it up as long as I have. There were multiple occasions when I felt like throwing in the towel, but then after a week or two I would find something interesting I wanted to write about so that I could search for it should the need arise. I guess I wrote my blog so I could retrieve my own thoughts.

 

Several weeks ago I was talking to students in my P-Chem class about A.I., machine learning and data scraping. We were learning some Python for P-Chem and I had told the students that ChatGPT is a good place to get snippets of code for whatever you are trying to do. With a little understanding of scripting you can modify the code for the specific task you want to carry out. That led to me pontificating about data scrapers and why I thought that large language models are starting to plateau. The amount of data needed for a substantial improvement grows exponentially. Much of the free data has been scraped. I’m sure that deep-pocketed tech companies will be willing to spend money for paywalled data and there’s likely to be an arms race. I also mused on the potential problems of having an A.I. generate data to train another A.I.

 

All this made me think that my blog has likely been scraped several times over. I suppose I could scrape my own blog to train an A.I. that will spout aphorisms or make proclamations in my (written) voice. After I’m dead and gone, someone could still consult the oracle of me that has survived as an interlocutor bot. Personally, I’m not sure I’m all that interesting to talk to. That being said, I do think that I’ve shared some interesting ideas on my blog that are not my intellectual property – that’s what one should expect with a free, public-facing blog. I have considered stopping this blog. Why give away my good ideas for free to data scrapers? And maybe I will at some point. If this turns out to be my last blog post, then I guess that’s what I decided to do. For now.

 

Humans are fickle. I’m no exception. Also, our memories fade and reorganize over time. That’s not a bad thing. Our brains repackage our thoughts and ideas every time we access them. A large language model generating text is a re-packager of sorts. It’s an intelligence of sorts. My ideas are a molecular drop in a mole of data, likely insignificant to a data-scraping operation. I suppose I still get more out of my blog than a tech company would, and if someone wanted my ideas, they’d actually have to read through and understand my writing. Perhaps that was the whole point of writing my thoughts in the first place.
