Search Results
156 total results found
Large Scale Multi-Label Learning
The Keras website has a tutorial on how to do multi-label learning with a large number of labels: https://keras.io/examples/nlp/multi_label_classification/ (mirror)
Online Reading and Feeds
RSS I use FreshRSS to manage my feeds for me and the associated Android client for on the go. On the desktop I use NewsFlash reader which can also subscribe to FreshRSS Read it Later App I use Wallabag to store articles I want to read later and the accompanyin...
Data Wrangling
DuckDB: DuckDB is a lightweight OLAP-type database system written in C++ and designed to be used for EDA-style activities. From their website: advice on when to use and not to use DuckDB. Polars: Polars is a Rust-based data frames library with Python bindings H...
ML Best Practices
Machine learning is a complex and multifaceted activity that requires the combination of a number of success factors in order to work. In order to execute machine learning well, it is important to have a good understanding of the processes and variables that f...
Hugo Static Site Generation
I use Hugo to maintain most of my websites. Extended Edition: Hugo has an extended version which includes hooks for building SASS. Hugo recommends using snaps to manage and install versions of their tool rather than relying on Debian packages since these can oft...
Core Scientific Concepts (CoreSC)
Core Scientific Concepts (CoreSC) is an annotation scheme used to delineate different parts of scientific discourse in a scientific paper. There are 11 categories: Background, Conclusion, Experiment, Goal, Hypothesis, Method, Model, Motivation, Object, Observation, Res...
Gaming
One of my hobbies is video gaming. I recently got my hands on a Steam Deck which I would describe as a "Nintendo Switch Pro". I've been very impressed with the capabilities of the system. Watch List The Store is Closed: Infinite Furniture Store Survival Game...
Times and Dates in Python
The built-in datetime library in Python can be a bit rubbish/difficult to use. Pendulum provides an API fairly similar to moment.js, although its parsing of text dates is not quite as flexible/powerful.
Webmentions
Webmentions are a way for IndieWeb folks to notify each other that something has happened; they use microformats internally. WebMention.App provides an API for sending webmentions automatically but you have to know which page you want to send them from. I wil...
Learning In Public
Learning Exhaust This blog post by swyx highlights the benefits of learning in public: You already know that you will never be done learning. But most people “learn in private”, and lurk. They consume content without creating any themselves. Again, that’s fin...
Batch Iterating in Pandas
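import numpy as np
import pandas as pd  # df below is assumed to be an existing pandas DataFrame

BATCH_SIZE = 32
for k, grp in df.groupby(np.arange(len(df)) // BATCH_SIZE):
    # grp is a tiny dataframe of up to BATCH_SIZE rows (the last batch may be shorter)
    print(k, grp)

References: python - How to iterate over consecutive chunks of Pandas dataframe efficiently - Stack Overflow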
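The batching pattern above can be checked on a small frame. This is a minimal sketch: the 100-row `df` and the `batch_sizes` list are made up for illustration and are not part of the original note.

```python
import numpy as np
import pandas as pd

BATCH_SIZE = 32

# Hypothetical example frame: 100 rows, so the final batch is shorter than the rest.
df = pd.DataFrame({"value": np.arange(100)})

# Integer-dividing each row position by BATCH_SIZE yields one group key per batch.
batch_keys = np.arange(len(df)) // BATCH_SIZE

# Collect the number of rows in each batch, in order.
batch_sizes = [len(grp) for _, grp in df.groupby(batch_keys)]
print(batch_sizes)  # [32, 32, 32, 4]
```

Grouping by position rather than by the index values means this works even when the index is not a clean 0..n-1 range.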
Logging and Winston
Winston is a fancy logging library for Node. Using Common Loggers Between Packages: As per this Stack Overflow post (mirror): Declare and export your winston logger object and import it from different locations within your app.
Stratified Sampling in Pandas
Use groupby on the label column to create sub-frames for each label and then use the sample() function. Passing an integer gives an exact sample (e.g. sample(5) gives 5 rows). Passing frac=0.1 gives a percentage (i.e. 10%). Remember to set random_state for rep...
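A minimal sketch of the approach described above, assuming pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available). The frame, the `label` column name and the seed value are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 50 rows of "a", 30 of "b", 20 of "c".
df = pd.DataFrame({
    "label": ["a"] * 50 + ["b"] * 30 + ["c"] * 20,
    "value": range(100),
})

# Exact count: 5 rows drawn from every label group.
fixed = df.groupby("label").sample(n=5, random_state=42)

# Proportional: 10% of each label group, preserving the class balance.
proportional = df.groupby("label").sample(frac=0.1, random_state=42)

fixed_counts = fixed["label"].value_counts().to_dict()
proportional_counts = proportional["label"].value_counts().to_dict()
print(fixed_counts)
print(proportional_counts)
```

Setting `random_state` makes both samples repeatable between runs.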
From Crowd Ratings to Predictive Models of Newsworthiness to Support Science Journalism
Paper Link Authors: Sachita Nishal Nicholas Diakopoulos Notes Their work comes at the problem from the scientific paper - essentially they are trying to predict whether or not a scientific article might make an interesting news article (as opposed to ...
Steam Deck
Proton: Use ProtonUp to install custom versions of Proton on the Deck. You can find this in the Software store on the Deck's KDE desktop (open up Discover and search protonup-qt). Heroic Game Launcher: Heroic is a GUI that wraps both Epic and GOG allowing install...
CRON No MTA installed discarding output
Answer from here Linux uses mail for sending notifications to the user. Most Linux distributions have a mail service including an MTA (Mail Transfer Agent) installed. Ubuntu doesn't though. You can install a mail service, postfix for example, to solve this pr...
IndieWeb
I've been interested in IndieWeb and the concept of owning your own data for a long time. My own site Brainsteam uses micropub and microsub and can receive webmentions. I use my own hand-rolled micropub endpoint in combination with the Hugo st...
Hypothes.is
Hypothes.is is a web annotation tool: you can annotate any page and your comments are then public for others to see (or you can privately annotate stuff). Data Ownership: Hypothes.is is an open source project run by a non-profit. They consider all annotations ma...
DBT
DBT is a data transformation tool with a SaaS platform and an open-core command line tool. The tool is widely used to put the T in ELT. Robin Moffatt has written a walkthrough/guide on how he used DBT with DuckDB.
Model Quantization
Deploying models that are performant (obviously statistically but in this context I primarily mean computationally) is challenging when you are working with large models such as BERT etc. Quantization involves compressing model weights into smaller, more effi...
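As a rough illustration of the idea, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. It is an assumption about what a simple scheme looks like, not the method the full note goes on to describe; the random `weights` matrix stands in for real model weights.

```python
import numpy as np

# Hypothetical stand-in for a layer's float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

# One scale for the whole tensor maps the largest magnitude onto the int8 limit 127.
scale = np.abs(weights).max() / 127.0

# Quantize: divide by the scale, round to the nearest integer, store as int8.
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to approximate the original values at inference time.
restored = quantized.astype(np.float32) * scale

# Rounding error per weight is bounded by half a quantization step.
max_error = float(np.abs(weights - restored).max())
print(weights.nbytes, quantized.nbytes, max_error)
```

The int8 copy uses a quarter of the memory of the float32 original, at the cost of a small rounding error per weight.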