About: Data-scientist who loves to use #datascienceforgood, especially in ecology, energy and the environment. Bonsai, gardening, bikes and music when I'm not at a keyboard.
Location:
Cardiff, Wales
Joined:
Mar 29, 2019
Copilot for your GitHub stars
Publish Date: Nov 19 '23
How do you use your GitHub stars?
I'd guess if you've been programming for a few years you've probably hit the star button at the top of a few of your favourite repos. I know some people I follow have done it thousands of times. Do you go back to them though? Do you review them for inspiration for your next project or go to them when you're stuck on a particular problem?
Inspiration
I've always assumed I would use them but I never have. I found myself doing some research recently into how to build software that uses LLMs, with the deliberate goal of building an as yet undefined side-project. I wanted to build something I hadn't built before, something that was hopefully a little original, and maybe even useful! So yet again I was starring repos like LangChain and Chroma, swearing this time would be different.
As I was running through blog posts and diligently smashing the star buttons I realised that I had just hit on exactly what I wanted to try. I wanted to bring my GitHub stars right into my editor. I wanted to be able to have them next to me as I was working and get a sensible set of suggestions on what might be useful for my needs at that moment, and I had just been starring the exact repos that could make this happen!
The original idea
Use a dataset of your personal stars to inform retrieval augmented generation for a question and answer large language model deployed in a command line interface.
I thought this would be useful for a few reasons:
By having it in the CLI it's available right in my editor, and to every project.
By having a set of your personal stars the suggestions are already curated by your interests and preferences. Mine are all Python and R libraries, weird databases and charting libraries. Yours might mostly be Ruby gems, or web frameworks, or tools for embedded systems.
By using a large language model the tool might be more capable of understanding the intention behind your goals. For instance, the query "Suggest how to build a web app" might be able to infer that you'd likely want a front end component, a backend component and a data storage component, and might even deal with servers and deployment.
By using large language models the tool might be more capable of semantic search rather than keyword matching, which suits this problem as there is no strong standard on how a library describes itself through its topics, description and documentation.
Semantic vs keyword
Keyword
A keyword search looks for the exact letters in a string, or potentially a partial match. As an example, the query "Data Science" would find things that exactly matched the characters in the string "Data Science" and maybe also ["Data", "Science", "DS"].
Semantic
A semantic search looks for the conceptual similarity between things, so in this context "Data Science" would find things that matched the vector embedding of "Data Science" and perhaps also the vector embeddings associated with ["Machine Learning", "Artificial Intelligence"].
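To make the difference concrete, here is a minimal sketch in plain Python. The three-number "embeddings" are hand-written and purely hypothetical (a real embedding model produces vectors with hundreds of dimensions), and the function names are made up for this example.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
TOY_EMBEDDINGS = {
    "Data Science": [0.9, 0.8, 0.1],
    "Machine Learning": [0.8, 0.9, 0.2],
    "Web Framework": [0.1, 0.2, 0.9],
}


def keyword_search(query: str, texts: list[str]) -> list[str]:
    """Return texts that contain the query as a case-insensitive substring."""
    return [text for text in texts if query.lower() in text.lower()]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query: str, texts: list[str], top_k: int = 2) -> list[str]:
    """Rank texts by how close their toy embedding is to the query's."""
    query_vector = TOY_EMBEDDINGS[query]
    ranked = sorted(
        texts,
        key=lambda text: cosine_similarity(query_vector, TOY_EMBEDDINGS[text]),
        reverse=True,
    )
    return ranked[:top_k]


texts = list(TOY_EMBEDDINGS)
print(keyword_search("Data Science", texts))
print(semantic_search("Data Science", texts))
```

The keyword search only finds the literal string, while the similarity ranking also surfaces "Machine Learning" because its toy vector points in nearly the same direction.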
Retrieval Augmented Generation (RAG) is a technique used by large language models to cope with some of the limitations inherent in what are also sometimes referred to as 'Foundational' models.
When a model like GPT3 is trained, it is fed large amounts of textual data written by humans. These get translated into 'weights' in a neural net. To oversimplify, these weights tell the model what the next most likely text is that follows the text it has already been shown.
However, these models don't know much about what has happened recently, what other programming resources really exist rather than what just sounds like it should exist, or where to exactly get a specific repo or webpage.
Retrieval augmented generation solves this by allowing you to feed the large language model with known real, up to date and relevant information.
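As a rough sketch of that idea, the snippet below retrieves the most relevant document and places it in front of the question before anything is sent to a model. The word-overlap retriever is only a stand-in for a real vectorstore query, the prompt wording is an assumption, and the repo descriptions are made up for this example.

```python
def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Stand-in retriever: rank documents by words shared with the question.

    A real system would query a vectorstore here instead.
    """
    question_words = set(question.lower().split())

    def overlap(document: str) -> int:
        return len(question_words & set(document.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved documents in front of the question."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )


# Hypothetical repo descriptions, made up for this example.
documents = [
    "typer: build command line tools with python type hints",
    "chroma: an embedding database for llm apps",
    "faiss: similarity search for dense vectors",
]
question = "which library helps build command line tools"
prompt = build_prompt(question, retrieve(question, documents, top_k=1))
print(prompt)
```

The model never needs to "remember" the repo; it only has to read the context it has been handed.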
Vectorstores
A type of data base called a vectorstore is commonly used for this because they are deliberately optimised towards a similarity search use case. They achieve this in a few ways:
Vectorstores store what you pass them as a 'vector embedding'. A vector embedding takes data (like text or images) and converts it to a list-like representation of numbers.
Vectorstores keep similar vector embeddings close together in memory. This means that they are as fast as possible at returning lots of documents that have similar semantic meaning, because they are all clustered together.
Vectorstores have APIs that are specifically designed for these use cases, with querying methods that lean towards semantic searches more than SQL queries, and loading techniques that integrate tightly into other systems that generate these vector embeddings from large language models.
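The three points above can be sketched as a toy in-memory store. This is not how chroma or any real vectorstore is implemented: it ranks by brute force rather than using a similarity-optimised index, and the word-counting `count_embedding` function is a hypothetical stand-in for a real embedding model.

```python
import math
from typing import Callable

Vector = list[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Embed documents when they are added, rank by similarity when queried."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._items: list[tuple[str, Vector]] = []

    def add(self, document: str) -> None:
        # Store the text alongside its vector embedding.
        self._items.append((document, self._embed(document)))

    def query(self, text: str, n_results: int = 1) -> list[str]:
        # Embed the query the same way, then return the closest documents.
        query_vector = self._embed(text)
        ranked = sorted(
            self._items,
            key=lambda item: _cosine(query_vector, item[1]),
            reverse=True,
        )
        return [document for document, _ in ranked[:n_results]]


VOCABULARY = ["chart", "plot", "database", "storage"]


def count_embedding(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: counts of a few known words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


store = ToyVectorStore(count_embedding)
store.add("a chart and plot library")
store.add("a database for storage of vectors")
print(store.query("plot a chart"))
```

Swapping `count_embedding` for a real embedding model is what would turn this from word counting into semantic search.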
Designing a system
With this set of goals and new knowledge I set about working out which puzzle pieces I needed and how to fit them together. This time I did go through my stars (and a few other things), though maybe this is for the last time!
typer is a pretty trendy framework for building CLI tools in Python right now. It embraces typing, uses function decorators to magically turn your functions into CLI commands, and has relatively clear documentation.
I chose typer specifically because:
I wanted to see what the hype was about
I think typing helps write better code
I found the documentation really helpful to get started easily
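The "decorators plus type hints" idea is easier to see in a toy version. The sketch below is not typer and not how typer is implemented. It is a standard-library illustration of how a decorator can register a function as a command and how type hints can drive argument conversion. The names `command`, `run` and `repeat` are made up for this example.

```python
import inspect
from typing import Callable

# Registry mapping command names to the functions that implement them.
COMMANDS: dict[str, Callable] = {}


def command(func: Callable) -> Callable:
    """Register a function as a CLI command under its own name."""
    COMMANDS[func.__name__] = func
    return func


def run(argv: list[str]):
    """Look up the command, then convert each raw string using the type hints."""
    name, *raw_args = argv
    func = COMMANDS[name]
    parameters = inspect.signature(func).parameters.values()
    converted = [
        parameter.annotation(raw) for parameter, raw in zip(parameters, raw_args)
    ]
    return func(*converted)


@command
def repeat(word: str, times: int) -> str:
    """Repeat a word a number of times, separated by spaces."""
    return " ".join([word] * times)


print(run(["repeat", "star", "3"]))
```

The string "3" arrives from the command line as text, and the `int` type hint is what turns it into a number before `repeat` is called.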
⚡ Building applications with LLMs through composability ⚡
🦜️🔗 LangChain
Quick Install
With pip:
pip install langchain
With conda:
conda install langchain -c conda-forge
🤔 What is LangChain?
LangChain is a framework for developing applications powered by language models. It enables applications that:
Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
langchain is the most mature and well embraced large language model orchestration framework. langchain itself doesn't supply you with any specific LLM, vectorstore or embedding approach. Instead it is deliberately 'vendor agnostic'. It provides a common set of APIs and abstractions across a staggering number of vector databases, large language models and embedding engines.
I chose langchain because:
It is the most established tool in a brand new space
I wasn't really sure which suppliers of vectorstores and large language models made the most sense for my use case
I found the documentation really helpful to get started
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source":
GPT4All: An ecosystem of open-source on-edge large language models.
Important
GPT4All v2.5.0 and newer only supports models in GGUF format (.gguf). Models used with a previous version of GPT4All (.bin extension) will no longer work.
GPT4All is an ecosystem to run powerful and customized large language models that work locally on consumer grade CPUs and any GPU. Note that your CPU needs to support AVX or AVX2 instructions.
A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to…
PyGitHub is a Python library to access the GitHub REST API
This library enables you to manage GitHub resources such as repositories, user profiles, and organizations in your Python applications.
Install
pip install PyGithub
Simple Demo
from github import Github

# Authentication is defined via github.Auth
from github import Auth

# using an access token
auth = Auth.Token("access_token")

# First create a Github instance:

# Public Web Github
g = Github(auth=auth)

# Github Enterprise with custom hostname
g = Github(base_url="https://{hostname}/api/v3", auth=auth)

# Then play with your Github objects:
for repo in g.get_user().get_repos():
    print(repo.name)

# To close connections after use
g.close()
Soon after this I realised that pygithub would be an easy way to go to GitHub to get the information I needed and bring it back into starpilot to load into the vectorstore. I had initially thought I might be able to use the GitHub Document Loader built into langchain, though once I sat down to really work it out I realised that this doesn't give access to a user's stars, so I needed an alternative.
The other way to build
There were alternatives in all these choices. I think these are all totally viable parts to build effectively the same system:
Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary. It's the "Command Line Interface Creation Kit". It's highly configurable but comes with sensible defaults out of the box.
It aims to make the process of writing command line tools quick and fun while also preventing any frustration caused by the inability to implement an intended CLI API.
import click

@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.option("--name", prompt="Your name", help="The person to greet.")
def hello
I actually am using click, sort of. typer is built on top of click, but to be honest I didn't really know that before I'd mostly decided. click looks like a really great project, but it wasn't as clear how to get started.
llama_index is probably a great project, but I only found it late in my thinking on this project. If I start a different project it's suitable for any time soon I'm definitely going to try it out as a comparison.
A library for efficient similarity search and clustering of dense vectors.
Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Meta's Fundamental AI Research group.
Introduction
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also…
I'd used faiss in a tutorial on vectorstores before. It didn't strike me as hugely intuitive to use or as simple to set up (its recommended installation path is via conda). I also don't particularly like Facebook so I'm happy to use an alternative.
The OpenAI Python library provides convenient access to the OpenAI REST API from any Python 3.7+ application. The library includes type definitions for all request params and response fields and offers both synchronous and asynchronous clients powered by httpx.
The SDK was rewritten in v1, which was released November 6th 2023. See the v1 migration guide, which includes scripts to automatically update your code.
pip install openai
Usage
The full API of this library can be found in api.md.
import os
from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
I'd used openai for a handful of tutorials and notebook experiments already and been very happy with it. However, for a project like this I wasn't really sure what the operational costs would be, and whether they would be worth it for the benefit the tool provides. That, combined with the requirement to have network connectivity while using the tool, pushed me towards experimenting with alternatives. Luckily, with langchain I should be able to provide it as an optional backend in the future.
What state is starpilot now?
"actively developed", "v0.1.0", "untested" and "it runs on my machine" are good descriptions of the project right now.
I've spent a few evenings this month on it, and see myself at least spending a few more on it next month. The API is getting breaking changes almost every time I open the project. It's got 0 real tests, though it should get some soon. It requires a few manual installation steps that are documented in README.md but haven't yet been attempted on any machine other than the one I'm on right now.
It also doesn't yet achieve exactly what I want it to, but I see no reason yet that it can't with some more development time.
Current features
starpilot read MyCoolUserName
This will connect to GitHub and read the starred repos of the user MyCoolUserName. Then it will go to each of those repos and get the topics and descriptions (and optionally the READMEs) and load these into chroma, which is persisted on the local hard drive.
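As a rough illustration of the "get the topics and descriptions and load these" step, the sketch below turns one repo's metadata into a single text to embed plus some metadata to keep alongside it. This is not starpilot's actual code: the `StarredRepo` class, the `to_document` function, the output layout and the example repo are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StarredRepo:
    """Hypothetical container for the fields read from GitHub."""

    full_name: str
    description: Optional[str]
    topics: list[str] = field(default_factory=list)


def to_document(repo: StarredRepo) -> dict:
    """Combine description and topics into one text, keeping the name as metadata."""
    parts = [repo.description or ""]
    if repo.topics:
        parts.append("Topics: " + ", ".join(repo.topics))
    text = " ".join(part for part in parts if part)
    return {
        "id": repo.full_name,
        "text": text,
        "metadata": {"source": repo.full_name},
    }


# A made-up repo, not a real one.
repo = StarredRepo(
    full_name="example-user/example-repo",
    description="A made-up description",
    topics=["demo", "example"],
)
document = to_document(repo)
print(document["text"])
```

Repos with no description or topics still produce a document, just with empty text, so a real loader would probably want to skip or flag those.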
starpilot shoot "insert topic here"
This will spin up the chroma database and perform a semantic similarity search on the string given in the command, then return the documents that seem to be the most relevant.
starpilot fortuneteller "Insert a question here"
This will perform the exact same search as the shoot command, but then spin up a large language model and pass the results into it for processing. It then returns the documents it found as well as the response from the LLM.
So....
That's where this project is at. I've learnt a tonne about the available tools and relevant techniques in this space already, which was really the main goal of starting to begin with!
That said, the progress I've made so far only makes me more curious about what else can be done with this and what else can be solved towards the vision of "Making your GitHub stars more valuable in your daily coding". Here are some ideas that I've found exciting while getting my hands dirty that might show up in the future. These are along with the obvious things like any testing at all, a simpler way to set up the project on your machine, better error handling, a more sensible way to update the vectorstore than drop everything and rebuild each time, etc.
Inspecting the current project's description (both its loose goals as well as more specific things like what packages it already uses) so that things that are already used aren't suggested and are instead used to inform the response.
Dynamically creating a GitHub list of similar starred repos for your user (though that would probably rely on this suggestion to extend the GitHub API) so that you naturally have some ways of saving and sharing your starred repos that solve a specific problem between sessions in your terminal.
Building starpilot into a research agent that can perform actions such as installing the selected suggestion into the current project or be sent to GitHub to find new projects that solve the current goals that you haven't starred yet.
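One of the housekeeping items mentioned above, updating the vectorstore without dropping everything and rebuilding, could start from a simple set difference between what is stored and what is currently starred. This is only a sketch of one possible approach, not something starpilot does today. The `plan_sync` function and the repo names are hypothetical.

```python
def plan_sync(stored_ids: set[str], starred_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (ids to add, ids to remove) so the store matches the current stars."""
    to_add = starred_ids - stored_ids
    to_remove = stored_ids - starred_ids
    return to_add, to_remove


# Made-up identifiers for illustration.
stored = {"user/old-repo", "user/kept-repo"}
starred = {"user/kept-repo", "user/new-repo"}
to_add, to_remove = plan_sync(stored, starred)
print(sorted(to_add), sorted(to_remove))
```

This only handles newly starred and unstarred repos. Picking up changed descriptions or topics on repos that stay starred would need something extra, such as comparing a stored last-updated value.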
What do you think?
Does this sound like something interesting to you, maybe even something useful? Did this just spark inspiration in you for a new project? Does this actually already exist somewhere and I'm just being an idiot? Let me know :)
Thanks, let's take advantage of those github 🌟️ ;-)
I was thinking about something along the same lines, but, to be honest, due to my lack of time and background in machine learning, this was probably never going to happen.
Yes you are quite close, basically, read my stars, put the found repos in an index, answer questions using the indexed content, or recommend related repos
This was actually originally going to be a way to automate the creation of GitHub lists, until I discovered there wasn't a user facing API for those objects yet :(